On January 15, 2015, Target Canada shocked the business world by filing for bankruptcy protection. The retail giant’s Canadian stores had been open for less than two years, their operation marred by high-profile failures: delayed openings, empty shelves, and annual losses that added up to US$2.1 billion. Analysts, reporters, and Target’s own employees unanimously identified the culprit – bad data in SAP.[ref]Exclusive: Target Canada’s supply chain gridlock: how Barbie SUVs snarled traffic[/ref]
This in and of itself was not a new story; the two largest food retailers in Canada – Loblaws and Sobeys – had both had significant issues with their SAP implementations. Target believed that flawed data conversions had caused these implementations to stumble.[ref]Target believed the problems other retailers faced were due to errors in data conversion.[/ref] To avoid this issue, Target decided to fill the Canada SAP systems with entirely new data rather than converting data from their US databases. Unfortunately for Target, this strategy did not pay off and led to many of the well-publicized failings that shuttered its Canadian stores for good.
Why was Target Canada so afraid of data conversion?
Done correctly, conversion allows for a smoother cutover between old and new systems – no telling customers they must remake their online order accounts, for example – and saves countless labor hours that would otherwise be spent populating data by hand. For many companies, especially in retail, conversion is the only sensible choice. But despite best-laid plans, many implementations face uncertainty in the data conversion process. Late changes to functional requirements, large data volumes, and complicated data dependencies all present opportunities for data headaches leading up to and beyond go-live. And with the critical role data plays in modern business processes, a small conversion flaw can have a cascading effect that will bring an entire system to its knees. This is why the best ERP implementations take a strong stance on ensuring the completeness and accuracy of data conversion. However, not all such data validation strategies are created equal.
When “Good” isn’t Good Enough
Currently, many ERP implementations perform data validation with statistical sampling, where a subset of the total data volume is randomly selected and checked for accuracy. The primary advantage of this method is assumed to be time – it is faster to compare 5,000 records than 500,000. For companies with exacting data quality requirements, this method falls flat. Questions remain on how to group data in order to sample it and on how many samples need to be taken in order to detect problems. Furthermore, if statistical sampling indicates that there are issues with data conversion, a team of analysts still has to search through the entire database to find flawed records and fix them. While this seems laborious, it is often preferable to running the entire conversion process over again.
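To make the limitation concrete, here is a minimal sketch of sampling-based validation in pandas; the column name and record counts are invented for illustration:

```python
import pandas as pd

# Stand-in for a full legacy extract; real conversions involve
# hundreds of thousands of records with dozens of fields each.
legacy = pd.DataFrame({"article_id": range(500_000)})

# Draw a 5,000-record random sample to spot-check against the target
# system. A clean sample suggests -- but never proves -- that the
# other 495,000 records converted correctly.
sample = legacy.sample(n=5000, random_state=42)
print(len(sample))
```

Even a perfect sample says nothing definitive about the records that were never inspected, which is exactly the gap that full-population validation closes.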
Checking the sample itself can also be a lengthy process. Typically, the sample data from the legacy system and the target system are loaded into Excel and then compared using VLOOKUPs. This ties up project resources, is prone to errors, and must be repeated as many times as there are conversions (both test runs and “for real”). The size of data that Excel can validate is also limited – using VLOOKUPs to validate data much larger than a reasonable sample is impossible due to limitations on the amount of memory Excel can use and hard limits on the number of rows a worksheet can contain. This is all not to mention the challenging logistics of organizing the hundreds of necessary spreadsheets.
Some companies choose to go with commercial data conversion solutions that include data validation as part of their functionality. Products like WinShuttle and BusinessObjects Data Services certainly make the data conversion process much smoother. However, these programs can be costly – both licensing and specialized resources are needed to support their implementation and use. Such an option may prove to be well worth the investment, especially for projects with complex, high-volume data conversion requirements. But for many projects needing a data validation solution, these products are overkill.
A Better Way
On a recent project, Excel validation had been the strategy of choice. Employees resented having to manipulate Excel workbooks for days at a time in order to get results of questionable value. The client needed something fast and accurate, but without the complexity and expense of a proprietary solution.
I reasoned that in this era of in-memory computing, statistical sampling was no longer needed – if I took a programmatic approach and leveraged big data processing techniques, I could check every record quickly and provide far more granular results than available previously. This strategy would eliminate any uncertainty about the data quality and would provide the conversion team with detailed diagnostic information. Altogether, programmatic data validation would eliminate hundreds of hours spent doing Excel validation, reduce the time the conversion team spent hunting down errors, and provide absolute certainty about what data the client would be going live with.
Funny Name, Powerful Tool
To accomplish this task, I selected the Python language and Wes McKinney’s excellent Pandas library. Both are free, open-source software. The combination of the two allows for simple, easy-to-understand code that leverages high-performance, in-memory computing strategies to process large volumes of data.
The first step in the script was to import the data. This project used .csv flat files to move data between legacy and SAP systems. A single line of code later I managed to get the client’s entire legacy article master – over half a million records with close to 100 fields each – into memory. The same was repeated for the client’s article master in SAP. Total runtime: less than a minute.
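In pandas, that single line is a `read_csv` call. The sketch below uses a hypothetical file name and a tiny stand-in extract so it is self-contained; on the project, the same call pulled in the full half-million-record article master:

```python
import pandas as pd

# Write a tiny stand-in extract so the example runs on its own;
# the real files were full .csv dumps from the legacy system and SAP.
with open("legacy_article_master.csv", "w") as f:
    f.write("article_id,name,created_on\n"
            "1001,Widget,01/15/2015\n"
            "1002,Gadget,03/02/2014\n")

# dtype=str keeps article numbers and codes verbatim (no lost leading zeros).
legacy = pd.read_csv("legacy_article_master.csv", dtype=str)
print(legacy.shape)  # (2, 3): two records, three fields
```

Reading everything as strings is a deliberate choice for validation work: it compares the data exactly as it appears in the files, rather than as pandas guesses it should be typed.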
Comparing the two datasets at this point would have been an exercise in futility – the legacy data format was different from the data format used by SAP. In order to compare them, I had to transform the data. Thankfully, in-memory data transformations are extremely fast. Pandas makes them even easier by removing a lot of the dirty work involved – simply point it to what you want changed and it does the rest.
For example: The client’s legacy systems had dates in the format MM/DD/YYYY whereas SAP stored dates in the format YYYYMMDD. I told Pandas to split the legacy date string by the “/” character, giving me three groups of numbers. I then told it to reassemble these groups of numbers into a new string in the required order. One line of code applied this entire change to hundreds of thousands of records simultaneously. Total runtime: two minutes.
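In pandas terms, that transformation looks something like this (the sample dates are illustrative):

```python
import pandas as pd

# Legacy dates in MM/DD/YYYY.
legacy_dates = pd.Series(["01/15/2015", "12/03/2013"])

# Split on "/" into three columns (month, day, year), then reassemble
# as YYYYMMDD -- applied to every row at once, no explicit loop.
parts = legacy_dates.str.split("/", expand=True)
sap_dates = parts[2] + parts[0] + parts[1]
print(sap_dates.tolist())  # ['20150115', '20131203']
```

Because the split and the string concatenation are both vectorized, the same two lines scale from two rows to hundreds of thousands without modification.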
I coded the dataset comparison with an eye to the results that would be of the most use to the business and to the conversion team. I decided on a three-tiered matching system based on the primary key of each record and its non-primary fields[ref]For example, an article’s primary key is its article number; its non-primary fields include information like the article name, the article description, its dimensions, etc.[/ref]:
- Failed Records – records whose primary keys were not found in both systems. These records were considered to have not been converted at all.
- Partial Matches – records whose primary keys were found in both systems but one or more of their non-primary fields had discrepancies.
- Complete Matches – records whose primary keys were found in both systems and their non-primary fields had no discrepancies.
Judicious use of Pandas produced three different datasets: a list of failed records, a list of complete matches, and a list showing all partial matches side-by-side with extra columns indicating what fields matched and what fields didn’t. Total runtime: 5 minutes.
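One way to derive the three tiers in pandas is an outer merge with `indicator=True`; this is a sketch of the technique under that assumption, with invented article data and a single non-primary field:

```python
import pandas as pd

legacy = pd.DataFrame({"article_id": ["1001", "1002", "1003"],
                       "name": ["Widget", "Gadget", "Sprocket"]})
sap = pd.DataFrame({"article_id": ["1001", "1002", "1004"],
                    "name": ["Widget", "Gizmo", "Cog"]})

# Outer merge on the primary key; _merge records which side each key came from.
merged = legacy.merge(sap, on="article_id", how="outer",
                      suffixes=("_legacy", "_sap"), indicator=True)

# Tier 1 -- failed records: primary key missing from one of the systems.
failed = merged[merged["_merge"] != "both"]

# Tiers 2 and 3 -- compare every non-primary field side by side.
both = merged[merged["_merge"] == "both"].copy()
both["name_match"] = both["name_legacy"] == both["name_sap"]
partial = both[~both["name_match"]]   # key found, fields differ
complete = both[both["name_match"]]   # key found, all fields agree
print(len(failed), len(partial), len(complete))  # 2 1 1
```

The side-by-side `_legacy`/`_sap` columns plus the per-field match flags are exactly what makes the partial-match report actionable: the conversion team can see at a glance which field disagrees and what each system holds.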
The final stage in the script needed to turn this information into actionable intelligence. For this I used the xlsxwriter library, which enables the programmatic creation of formatted Excel spreadsheets from Python. I put together a quality report spreadsheet that showed overall data quality metrics as well as a field-by-field breakdown of the partial matches. The code also exported the full datasets to .csv files so that they could be reviewed by the business or by the conversion team. Total runtime: 6 minutes.
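A minimal xlsxwriter sketch of such a summary sheet; the tier counts below are invented, and the real report also carried the field-by-field breakdown:

```python
import xlsxwriter

# Build a small quality-report workbook programmatically.
workbook = xlsxwriter.Workbook("quality_report.xlsx")
sheet = workbook.add_worksheet("Summary")

# Formats are defined once on the workbook and reused per cell or row.
bold = workbook.add_format({"bold": True})
sheet.write_row(0, 0, ["Match tier", "Records"], bold)
for row, (tier, count) in enumerate(
        [("Complete", 480_000), ("Partial", 15_000), ("Failed", 5_000)], 1):
    sheet.write(row, 0, tier)
    sheet.write(row, 1, count)

workbook.close()
```

Handing the business a formatted workbook, rather than raw .csv exports, is what turned the validation output into something non-technical reviewers could act on directly.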
At this point the conversion team would review the Excel report and the dataset exports. The conversion team would then fix any errors and another run of the Python script would determine if this had solved the problem. Using this iterative process, data errors were found and fixed at a dizzying pace. Once the conversion team was happy with the data quality, the business leads would have an opportunity to review the report and spot-check the data. Because the conversion team had had the opportunity to fix any errors, the reports typically showed impeccable data quality.
The programmatic approach described here made a major difference. Performance exceeded all expectations: a data object comprising 12 million records was validated in less than 45 minutes on a standard business laptop. Project resources were freed up to do tasks more suited to their areas of expertise. The conversion team no longer had to hunt for the source of data errors. It felt great to know that we could go live with trustworthy data. But the best feeling of all was having employees thank me for removing the burden of data validation from their shoulders.
In my next blog on this topic, I will discuss why Python is becoming a tool of choice, but here are some hints: short learning curve, low cost, speed, repeatability, and community support. It doesn’t hurt that the name is a reference to Monty Python!
For an example of statistical sampling for data validation, see: SUGI 16
For a look into the statistical caveats frequently encountered in statistical sampling, see: Approximate Is Better than “Exact” for Interval Estimation of Binomial Proportions and How to calculate a sample size for validating correct/incorrectness of records in a data table?