(Big Data) Lineage is NOT Big (Data Lineage)

With the recent rise of big data adoption, Mapador is always asked the question, “How do you track big data lineage?”  To answer this question, we need to establish common ground.

Although data lineage is referenced in many contexts, the term is used to describe categorically different things, as outlined below:

  1. System lineage, which represents lineage at a very high level, depicting how systems connect to each other.  We refer to this as the whiteboard drawing that your enterprise architect will sketch at every meeting.
  2. Operational lineage, which is what the systems group always talks about – how system A connects to system B – and, when they get a little more granular, they can refer to the databases or even the tables and how data moves from one table to another.  This lineage is very helpful in a data warehouse scenario, where data is collected from multiple sources, lands in a landing zone, is staged in a staging zone and, finally, reports are created.
  3. Data lineage, which is what Mapador refers to as the tracking of transformations at the field level inside a data store.  For example, a global organization often reports on various amounts in different currencies.  Before the data is presented on the final report, all the different currency amounts need to be converted to the base currency.  Proper data lineage means that every transformation of a tracked amount must explicitly show the source(s), target(s) and type of transformation, as illustrated in the sketch below.

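To make the field-level definition concrete, here is a minimal Python sketch of what an explicit lineage record for a currency-conversion transformation could look like. The field names, the exchange rate and the `LineageRecord` structure are illustrative assumptions, not part of any specific Mapador format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LineageRecord:
    """One field-level lineage entry: sources, target, and the transformation applied."""
    sources: List[str]       # fully qualified source fields
    target: str              # fully qualified target field
    transformation: str      # description of the logic applied

# Hypothetical example: converting a EUR amount to the base (USD) reporting currency.
EUR_TO_USD = 1.08  # illustrative rate only

def convert_to_base(amount_eur: float) -> float:
    return round(amount_eur * EUR_TO_USD, 2)

lineage = LineageRecord(
    sources=["SALES.INVOICE.AMOUNT_EUR", "REFERENCE.FX_RATES.EUR_USD"],
    target="REPORTING.REVENUE.AMOUNT_BASE",
    transformation="AMOUNT_EUR * EUR_USD rate, rounded to 2 decimals",
)

print(convert_to_base(100.0))  # 108.0
print(lineage)
```

The point is not the conversion itself but the record: for every tracked amount there is an explicit statement of where the value came from, where it went and what was done to it.
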
Now that we have established our common ground, let’s go back to the big data lineage question.

In a big data world, files are ingested into the Hadoop file system via one mechanism or another.  These files then go through some transformation(s), and there could be a resulting analytics/report file – isn’t that why we do big data in the first place? 🙂  In most cases, if the data transformation was done via SQL-like languages, the Hadoop ecosystem keeps track of the transformation.  BUT it gets really tricky when the transformation is done via Java, Python, Scala, etc., where data is read and manipulated and then stored to a new file.  While big data systems will track the metadata of this process, the resulting lineage is more of an operational lineage – item 2 above.
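
As a minimal illustration of why hand-written transformation code hides field-level lineage, consider the hypothetical Python job below. The platform’s metadata layer can record that `trades.csv` produced `trades_base.csv` – operational lineage – but the per-field conversion logic exists only inside the code. File names and rates are assumptions for the sketch.

```python
import csv

# Illustrative rates and file names; a real job might run under Spark, but the
# lineage problem is the same: only the input/output files are visible to the platform.
FX_TO_USD = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}

with open("trades.csv", newline="") as src, open("trades_base.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["trade_id", "amount_usd"])
    writer.writeheader()
    for row in reader:
        # The field-level transformation (currency conversion) happens here,
        # invisible to file-level metadata tracking.
        amount_usd = float(row["amount"]) * FX_TO_USD[row["currency"]]
        writer.writerow({"trade_id": row["trade_id"], "amount_usd": round(amount_usd, 2)})
```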

When it comes to data governance and regulatory reporting, the operational level of data lineage is not sufficient in the majority of cases.  Regulators and CDOs are more interested in how the data is calculated and manipulated than in how it moves from one system to another.

The fact that you are leveraging big data technology does not mean that the process of constructing data lineage is any bigger than it is today.  All that is needed is the proper architecture and solutions in place to create the single, cohesive data lineage maps that answer the questions of CDOs, regulators and even your application developers.

Do we agree now that (big data) lineage is not the same as big (data lineage)?  We would love to hear back from you.

Your (Data) Family tree

As data volumes grow every day, organizations strive to build confidence in their data. This is becoming one of the primary tasks of CDOs (Chief Data Officers), who work closely with both the business users and the technology providers to ensure the accuracy, completeness and security of the enterprise’s newly discovered wealth (i.e., the data collected or being collected).
Sometimes, though, the activity of collecting and validating data sources is like sitting at the holiday dinner table and asking the elders of the family about the family tree – who the ancestors are, where they came from and where they settled.  Since the family elders do not always have the most “accurate” data about the family tree, we end up with either conflicting messages or an inaccurate picture of the family tree.
I am sure you have seen the commercials about the online software that “collects” some records, coupled with user input, to create an online family tree for a low fee of $9.99 per month, with a DNA test available at additional cost.  A common question is: how accurate is this information?  Would you bet your life on it?
Back to our data governance world.  The previous couple of paragraphs tell the story of every CDO who has purchased an expensive enterprise data management tool and asked his/her team to start collecting information about where the data is coming from and where it is going.
Luckily, there are cases where the “DNA test” can help, such as metadata stored in databases, some documentation, shared business vocabularies, taxonomies, ontologies and technology-to-business cross-reference documentation.  However, these do not give the complete picture, especially in cases where data is transformed but not explained or tracked while moving from point A to point B.
CDOs need to invest in the data lineage requirement when building their enterprise data management framework and focus on automating the data collection and tracking of the transformations.  Automation not only provides the speed required for the “governance” part of the CDO position, it also ensures repeatability and the continuous update of the data lineage for your organization’s critical data elements.
Read more about automated data lineage

Successful Data Conversion Projects Using Automated Application Mapping

The Challenges

During application migration, a key success factor is always the effective and correct migration of existing data from the old platform to the new. The two obvious extremes for accomplishing this work are a fully manual versus a fully automated approach. While an automated solution is naturally preferred, typically no ‘out-of-the-box’ conversion tools exist that are able to handle the complex environments found in practice. Consequently, more often than not, companies fall back to the manual approach: the costs of designing, implementing and using a throw-away automated procedure typically outweigh the costs of manual conversion.

The resulting data conversion challenges can be described as:

  • Confirming the scope for the data migration
  • Understanding the ‘core’ data that must be migrated and its inter-relationships
  • Capturing verifiable data consistency rules
  • Establishing a consistent, repeatable and auditable conversion process
  • Automating the conversion process, wherever it is cost-effective
  • Creating test plans to ensure the accuracy and completeness of conversion
  • Reducing business impact and project risk

Benefits of Automated Application Mapping

Application Mapping addresses the challenges described above in three key areas:

First, it brings a robust and time-tested process to the table. All project activities are identified in the plan and can be estimated based on the accurate results of the mapping process, eliminating any guesswork.

Secondly, the necessary understanding of the data and its relationships can be developed more quickly and effectively with Application Mapping, by viewing the ‘live’ system information. This ensures that the information on which critical decisions are based is complete, current and authoritative (unlike manually reviewing portions of the data and outdated documentation).

Thirdly, instead of having to develop automated conversion routines from scratch, Application Mapping can quickly identify the common patterns inside the application, enabling a higher degree of conversion routine reusability.

Conversion Framework

The following diagram illustrates a sample conversion solution using Automated Application Mapping:

[Diagram: sample conversion solution using Automated Application Mapping]
The shaded area represents Application Mapping’s “black box” conversion solution. The process receives as its input an extract file containing data to be converted (prepared by the client) and produces as its output a load file containing the transformed data ready to be loaded into the target database.

The conversion engine also uses as input a series of static (i.e. prepared only once) parameter files that describe mapping specifications for each table, file, column or field that is within the scope of the automated solution. The specifications reference data definition language (DDL) statements that describe the source and target data stores.
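
A minimal sketch of such a parameter-driven engine is shown below. The mapping-specification format, column names and transformation functions are assumptions made for illustration; a production engine would be driven by the actual DDL and mapping specifications described above.

```python
import csv

# Static, prepare-once mapping specification for one target table (illustrative only):
# each entry maps a target column to a source column and an optional transformation.
CUSTOMER_SPEC = {
    "CUST_ID":   {"source": "cust-no", "transform": lambda v: v.zfill(10)},
    "CUST_NAME": {"source": "name",    "transform": str.strip},
    "BALANCE":   {"source": "balance", "transform": lambda v: f"{float(v):.2f}"},
}

def convert(extract_path: str, load_path: str, spec: dict) -> None:
    """Read the client-prepared extract file and produce a load file for the target table."""
    with open(extract_path, newline="") as src, open(load_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(spec))
        writer.writeheader()
        for row in reader:
            writer.writerow({col: rule["transform"](row[rule["source"]]) for col, rule in spec.items()})

# Usage (hypothetical file names):
# convert("customer_extract.csv", "customer_load.csv", CUSTOMER_SPEC)
```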

Repeated Conversion Execution

Actual conversion of data is typically planned to take place in segments over a period of time. This helps to mitigate project risks (avoiding the ‘big bang’ approach to data conversion).

Automated data conversion can be re-run as often as required with minimal effort. It is still possible that some data will need to be converted manually and entered into the target database via direct database edit. This manual conversion may take place before or after each automated conversion run.

Leave us a comment and tell us about your data conversion experience.

3 VSAM to DB2 Conversion Strategy Options

Many organizations that run legacy applications still have a large percentage of their critical data stored in VSAM-based files. While VSAM as a technology has been around for many years and has stood the test of time, the advantages of a relational database management system lie in a data structure that is easy to understand. Especially these days, with the advances in the data analytics field (big data), the need for data structures that are easily accessible is becoming a real issue.

Over the years, the Mapador platform has helped many of our clients transfer their key data – such as CIF files – to a relational database, especially DB2.

Here is our approach and some strategies for VSAM to DB2 conversion:

  1. Determine data and process requirements
  • Determine the existing physical model.
  • Using the existing VSAM files, determine the relationships between the files (see the sketch after this list).
  • The output would be a physical data model.
  • Determine how the physical model corresponds to the current business and logical model (if there is one).
  • If entity relationships are defined in the current logical model and do not correspond to the VSAM physical model, then the impact on existing processes using VSAM data needs to be analyzed.
  • If entity relationships are not defined in the current logical model:
    • Determine how the missing entity relationships would be integrated into the current business and logical model
    • Analyze the impact on processes using VSAM data
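As a rough illustration of the relationship-discovery step, the hypothetical sketch below proposes candidate relationships between VSAM files by matching key-like field names across record layouts. Real analysis would work from copybooks and actual key definitions; the file and field names here are invented.

```python
# Hypothetical record layouts extracted from VSAM copybooks (field names only).
layouts = {
    "CUSTOMER.MASTER": ["CUST-NO", "CUST-NAME", "BRANCH-ID"],
    "ACCOUNT.MASTER":  ["ACCT-NO", "CUST-NO", "BALANCE"],
    "BRANCH.FILE":     ["BRANCH-ID", "BRANCH-NAME"],
}

# Propose a candidate relationship wherever a key-looking field appears in two files.
key_suffixes = ("-NO", "-ID")
candidates = []
for file_a, fields_a in layouts.items():
    for file_b, fields_b in layouts.items():
        if file_a < file_b:
            shared = {f for f in fields_a if f.endswith(key_suffixes)} & set(fields_b)
            for key in shared:
                candidates.append((file_a, file_b, key))

for rel in candidates:
    print("candidate relationship:", rel)
# e.g. ('ACCOUNT.MASTER', 'CUSTOMER.MASTER', 'CUST-NO')
```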

  2. Determine Conversion Strategy
  • If there are significant discrepancies in the models, or other factors that would justify a redesign/rewrite of the application, then an intermediate solution would be to:
    • Define DB2 tables to correspond with the VSAM records exactly.
    • Convert existing code and data to SQL/DB2 (see conversion strategy 1).
  • If there are no discrepancies in the models and no other factors that would justify a redesign/rewrite of the application, then the solution would be to:
    • Define DB2 tables to correspond with the VSAM physical model.
    • Convert existing code and data to SQL/DB2 (see conversion strategy 2).
  • If there are minimal discrepancies and no other factors that would justify a redesign/rewrite of the application, then the solution would be to:
    • Integrate the VSAM physical model into the current model(s), using DB2 for the physical model.
    • Convert existing code and data to SQL/DB2 (see conversion strategy 3).

Conversion strategy 1

Since this strategy is intermediate term, the main focus is to convert the existing applications from VSAM to DB2 with the least amount of program logic change. Almost all VSAM statements could be automatically or manually converted without any logic changes (except return-code checking). Consolidating multi-file reads into SQL joins would not be done, since this would require more effort.

Conversion strategy 2

Since this strategy is longer term, the main focus is to convert the existing applications from VSAM to DB2 while making maximum use of SQL capabilities. Some VSAM statements (read, write, delete) can be directly (automatically) converted to SQL without any logic changes at all (except return-code checking). However, any programs that navigate through multiple files should be restructured either to be platform independent or to take advantage of the full capabilities of SQL, e.g., selecting only the required fields rather than the entire record.

Conversion strategy 3

This strategy would be similar to conversion strategy 2. In addition, program changes would be required to reflect the changes in the model.

VSAM Statement    DB2 SQL Command                          Comments
Write             Insert                                   Automated
Read              Select                                   Automated
ReWrite           Update                                   Automated + ?
Delete            Delete                                   Automated + ?
Open / Close      Not applicable                           Automated
Start Browse      Declare & Open Cursor (single table)     Automated + ?
Read Next         Fetch                                    Automated + ?
Read Prev         Fetch                                    Automated + ?
End Browse        Close Cursor                             Automated
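
As a rough sketch of how a mapping like the table above could drive an automated rewrite, the snippet below pairs each VSAM verb with an illustrative SQL template. The table, column and cursor names are placeholders invented for the example, and the “Automated + ?” cases would still need manual review of return-code handling and browse logic.

```python
# Illustrative VSAM-verb to DB2 SQL templates, following the mapping table above.
# {table}, {cols}, {key}, etc. are placeholders filled in from the mapping specifications.
VSAM_TO_SQL = {
    "WRITE":      "INSERT INTO {table} ({cols}) VALUES ({values})",
    "READ":       "SELECT {cols} FROM {table} WHERE {key} = ?",
    "REWRITE":    "UPDATE {table} SET {assignments} WHERE {key} = ?",
    "DELETE":     "DELETE FROM {table} WHERE {key} = ?",
    "START":      "DECLARE C1 CURSOR FOR SELECT {cols} FROM {table} WHERE {key} >= ? ORDER BY {key}",
    "READ NEXT":  "FETCH C1 INTO {host_vars}",
    "READ PREV":  "FETCH C1 INTO {host_vars}",
    "END BROWSE": "CLOSE C1",
}

def sql_for(verb: str, **params: str) -> str:
    """Return the SQL template for a VSAM verb with placeholders substituted."""
    return VSAM_TO_SQL[verb.upper()].format(**params)

print(sql_for("READ", cols="CUST_NAME, BALANCE", table="CUSTOMER", key="CUST_ID"))
# SELECT CUST_NAME, BALANCE FROM CUSTOMER WHERE CUST_ID = ?
```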

Read more about how Mapador can automate application mass change.

What is the difference between code scanning and code parsing

At Mapador, we quite often get asked the question, “What is the difference between code scanning and code parsing?”

Code scanning primarily builds a database that contains instances of code verbs and variables along with their physical locations, such as filenames and folders. Code scanning is a very useful tool for understanding how many instances of a certain variable exist in the application code.
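
A toy illustration of what a scanner’s database boils down to: an index of where each identifier occurs. The directory layout, file extension and identifier pattern below are assumptions for the sketch, not Mapador’s actual implementation.

```python
import re
from collections import defaultdict
from pathlib import Path

def scan(source_root: str, identifier_pattern: str = r"[A-Za-z][A-Za-z0-9_-]*") -> dict:
    """Build a simple scan 'database': identifier -> list of (file, line-number) occurrences."""
    index = defaultdict(list)
    for path in Path(source_root).rglob("*.cbl"):          # assumed COBOL sources
        for line_no, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            for token in re.findall(identifier_pattern, line):
                index[token.upper()].append((str(path), line_no))
    return index

# Usage (hypothetical source tree):
# index = scan("src")
# print(len(index.get("CUST-NO", [])))   # how many instances of CUST-NO exist in the code
```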

Code parsing, on the other hand, captures the implied hierarchy of the source components and transforms it into an Abstract Syntax Tree (AST) that is suitable for further processing. The AST is serialized as XML; at Mapador, we use the Mapador Syntax Description Language (MSDL) schema.

Each language is dissected according to its syntax, with the results output to a common database in a common format, making it possible to connect different technology environments and follow components through different technologies.
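
For a feel of what “parsing into an AST and serializing it” means, the sketch below uses Python’s own `ast` module on a one-line snippet and writes the parent-child structure out as simple nested XML. This is an illustration only; Mapador’s parsers do the equivalent for each supported language and serialize the tree using the MSDL schema, which the XML here merely stands in for.

```python
import ast

# Parse a small snippet into an Abstract Syntax Tree.
source = "total_usd = amount_eur * eur_usd_rate"
tree = ast.parse(source)

def to_xml(node: ast.AST, indent: int = 0) -> str:
    """Serialize the parent-child structure of the AST as simple nested XML."""
    pad = "  " * indent
    name = type(node).__name__
    children = "".join(to_xml(child, indent + 1) for child in ast.iter_child_nodes(node))
    return f"{pad}<{name}>\n{children}{pad}</{name}>\n"

print(to_xml(tree))   # e.g. <Module><Assign><Name>...</Name><BinOp>...</BinOp></Assign></Module>
```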

Parsing understands redefine structures as well as parent-child and ancillary relationships. As such, the impact of field changes can be truly followed through.

One common use of code parsing is mass application change, such as expanding a field. While this can be easy to do within the context of a single technology/language, parsing can lend a hand when an application crosses multiple technology domains. For more info, read about Mapador’s Mass Change solution.