With the recent rise of big data adoption, Mapador is frequently asked the question, “How do you track big data lineage?” To answer this question, we need to establish common ground.
Although the term "data lineage" is used in various contexts, it can refer to categorically different things, as shown below:
- System lineage, which represents lineage at a very high level, depicting how systems connect to each other. We refer to this as the whiteboard drawings that your enterprise architect draws at every meeting.
- Operational lineage, which is what the Systems group always talks about: how system A connects to system B. When they get a little more granular, they may refer to the databases or even the tables, and how data moves from one table to another. This lineage is very helpful in a data warehouse scenario, where data is collected from multiple sources, lands in a landing zone, is staged in a staging zone and, finally, is used to create reports.
- Data lineage, which is what Mapador refers to as the tracking of transformations at the field level inside a data store. For example, a global organization often reports on various amounts in different currencies. Before the data is presented on the final report, all the different currency amounts need to be converted to the base currency. Proper data lineage means that every tracked transformation of those amounts needs to explicitly show the source(s), target(s) and type of transformation.
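To make the field-level idea concrete, here is a minimal Python sketch of what one lineage entry for the currency example might hold. It is an illustration only, not Mapador's actual data model: the `FieldLineage` class, the field names (`sales.amount_local`, `fx_rates.rate_to_usd`, `report.amount_usd`) and the rate of 1.25 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldLineage:
    """One field-level lineage entry: source(s), target(s), transformation type."""
    sources: Tuple[str, ...]
    targets: Tuple[str, ...]
    transformation: str


def convert_to_base(amount: float, rate: float) -> float:
    """Convert an amount to the base currency using the given rate."""
    return round(amount * rate, 2)


# Hypothetical field names and rate, for illustration only.
lineage = FieldLineage(
    sources=("sales.amount_local", "fx_rates.rate_to_usd"),
    targets=("report.amount_usd",),
    transformation="currency_conversion: amount_local * rate_to_usd",
)

amount_usd = convert_to_base(100.0, 1.25)
print(lineage)
print(amount_usd)
```

The point of the record is that it names both inputs to the calculation, not just the table or file the amount came from.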
Now that we have established our common ground, let’s go back to the big data lineage question.
In a big data world, files are ingested into the Hadoop file system via one mechanism or another. These files then go through some transformation(s), and the result may be an analytics or report file (isn’t that why we do big data, after all? 🙂). In most cases, if the data transformation was done via SQL-like languages, the Hadoop ecosystem can keep track of the transformation. BUT it gets really tricky when the transformation is done via Java, Python, Scala, etc., where data is read, manipulated and then stored to a new file. While big data systems would track the metadata of this process, the resulting lineage is more of an operational lineage, the second type described above.
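The gap is easier to see in a small sketch. The Python below is a hypothetical hand-written job, with made-up file names, columns and exchange rates. From the outside, all a platform can observe is that one file was read and another was written. The field-level view has to be derived from the code itself, and here it is simply written out by hand.

```python
import csv
import io

# Hypothetical contents of an ingested file (orders.csv) and assumed rates.
RAW = "order_id,amount,currency\n1,100.00,EUR\n2,50.00,USD\n"
RATES = {"EUR": 1.10, "USD": 1.00}


def transform(src_text: str) -> str:
    """Read rows, convert each amount to USD, and return the new file's text."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["order_id", "amount_usd"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        amount_usd = float(row["amount"]) * RATES[row["currency"]]
        writer.writerow(
            {"order_id": row["order_id"], "amount_usd": f"{amount_usd:.2f}"}
        )
    return out.getvalue()


# What file-level metadata captures: one file in, one file out.
operational_lineage = {"source": "orders.csv", "target": "report.csv"}

# What field-level lineage needs: which inputs feed each output field, and how.
field_lineage = [
    {
        "target": "report.csv:order_id",
        "sources": ["orders.csv:order_id"],
        "transformation": "copy",
    },
    {
        "target": "report.csv:amount_usd",
        "sources": ["orders.csv:amount", "orders.csv:currency"],
        "transformation": "amount * rate(currency)",
    },
]

report = transform(RAW)
print(report)
```

Nothing in `operational_lineage` reveals that `amount_usd` depends on both `amount` and `currency`; that knowledge lives only in the body of `transform`.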
When it comes to data governance and regulatory reporting, the operational level of data lineage is not sufficient in the majority of cases. Regulators and CDOs are more interested in how the data is calculated and manipulated than in how it moves from one system to another.
Just because you are leveraging big data technology does not mean that the process of constructing the data lineage is any bigger than it is today. All that is needed is the proper architecture and solutions in place to create single, cohesive data lineage maps that answer the questions of CDOs, regulators and even your application developers.
Do we agree now that (big data) lineage is not the same as big (data lineage)? We would love to hear back from you.