Finally I get to write the promised blog on X88. I've had another couple of conversations with them and I think I now understand why they are making waves amongst the Data Migration and Data Quality cognoscenti.

First of all let's look at what they are not. Although they produce transformed data that could be loaded into a target system, they recognise that they are not a fully fledged ETL vendor. Given that a number of their key players were the powers behind Avellino, they certainly know more than enough about ETL to understand that a migration engine needs functionality for scheduling, error handling, rollback management, monitoring, reporting of delivery and so on. This they do not provide.

What is their key differentiator then? This is perhaps best explained by stepping through an imaginary profiling iteration. First they would point the tool at the legacy data stores and tokenise what they find. Thus a field containing, say, "Johny Morris" would be stored in such a way that subsequent queries against this metadata would bring out all the fields that contain "Johny", "Morris", "Johny Morris" or even close matches. They could even perform the same queries on linked tables. On small data sets this is not much of an advantage, but on the sorts of terabyte data sets we increasingly work with, existing tools could perform this sort of task but might take weeks to complete it.
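X88 don't publish how their tokenisation works under the hood, so the following is only a minimal sketch of the idea – the table names, field names and functions are all my invention – showing how indexing values by token turns later searches into metadata lookups rather than full scans of the legacy stores.

```python
# Hypothetical sketch of token-based profiling; X88's actual implementation
# is proprietary. The point is that queries hit an index, not the raw data.
from collections import defaultdict

def tokenise(value):
    """Split a field value into lower-cased tokens."""
    return [t for t in value.lower().split() if t]

def build_index(rows):
    """rows: iterable of (table, row_id, field, value) tuples from the legacy stores."""
    index = defaultdict(set)
    for table, row_id, field, value in rows:
        for token in tokenise(value):
            index[token].add((table, row_id, field))
    return index

def find(index, phrase):
    """Return every (table, row, field) containing any token of the phrase."""
    hits = set()
    for token in tokenise(phrase):
        hits |= index.get(token, set())
    return hits

legacy = [
    ("CUSTOMER", 1, "NAME", "Johny Morris"),
    ("CONTACT",  7, "SURNAME", "Morris"),
    ("ORDERS",  42, "RAISED_BY", "J Morris"),
]
idx = build_index(legacy)
print(find(idx, "Johny Morris"))   # all three rows come back
```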

Of course at the same time they also perform all the standard profiling activities – field patterns, possible foreign key matches, max and min values and so on. All of this metadata is held in the Pandora database and so is rapidly available.
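To make that concrete – and this is purely illustrative, with invented values rather than anything X88 actually store – here is roughly the kind of per-field metadata a profiler gathers:

```python
import re

def profile_field(values):
    """Collect the sort of per-field metadata a profiler would persist:
    value patterns (digits -> 9, letters -> A), min/max and null counts."""
    non_null = [v for v in values if v not in (None, "")]
    patterns = set()
    for v in non_null:
        patterns.add(re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))))
    return {
        "patterns": patterns,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "nulls": len(values) - len(non_null),
    }

print(profile_field(["AB12 3CD", "ZZ99 9ZZ", None]))
# {'patterns': {'AA99 9AA'}, 'min': 'AB12 3CD', 'max': 'ZZ99 9ZZ', 'nulls': 1}
```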

What this means is that the subsequent "What if" analysis can be performed at real time speeds.

Their approach would then see you creating a series of "Views" (logical tables) that examine each field and apply the appropriate transformations to get the desired results. It is possible to view the results as they are applied and to profile them further to check you are heading in the right direction.
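Again, the sketch below is mine rather than X88's (the product is view-based, not code-based), but it shows the shape of the idea: a view is just the legacy rows with field-level transformations applied, whose output you can immediately inspect and re-profile.

```python
def title_case_name(raw):
    """Example field transformation: tidy a scruffy legacy name field."""
    return " ".join(part.capitalize() for part in raw.strip().split())

def apply_view(rows, transforms):
    """A 'view' here is simply the legacy rows with per-field transforms applied."""
    return [{f: transforms.get(f, lambda v: v)(v) for f, v in row.items()}
            for row in rows]

legacy_rows = [{"NAME": "  johny MORRIS ", "POSTCODE": "ab12 3cd"}]
view = apply_view(legacy_rows, {"NAME": title_case_name,
                                "POSTCODE": str.upper})
print(view)  # [{'NAME': 'Johny Morris', 'POSTCODE': 'AB12 3CD'}]
```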

All of this is performed on real, live data, so you end up with a view that corresponds to your target, populated with a full data set as it will be at load time. X88 also produces a full audit log of all the transformations that have been applied and an output set of mappings which (to return to our beginnings) you then feed into the load tool of your choice. The idea is that the View should be designed hand in hand with your Business Engagement team, and its values represent the values you should find in the target when the real ETL has been run – so the whole process becomes auto-testing.
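I don't have the precise format of the exported mappings to hand, but conceptually each target field carries its lineage and transformation history back to legacy – something along these lines, with every name below invented for illustration:

```python
# Purely notional shape for an exported mapping/audit entry; the real X88
# export format may differ entirely.
mapping_entry = {
    "target": "CRM.CUSTOMER.FULL_NAME",
    "sources": ["LEGACY.CUST.FORENAME", "LEGACY.CUST.SURNAME"],
    "transformation": "trim, title-case, concatenate with a single space",
    "owner": "migration analyst",
}
print(mapping_entry["target"], "<-", ", ".join(mapping_entry["sources"]))
```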

Along the way, one of the many aspects I particularly liked was the ability to create Notes against views. This means that where, for instance, your profiling reveals that there are, say, 4000 Customer Records with no contact details, these could be held in a View. The attached Note would become the basis of your DQR. A unique identifier for the note is available; this could either be the DQR ID itself or an easy cross-reference to it. The note can be assigned to someone and, by periodically re-running the View, you can track completion of the DQR. The notes are exportable. We therefore have the basis of a data quality workflow.
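Hedging once more – the mechanics below are my invention, not X88's – but the workflow the Notes enable is roughly this: a view isolates the failing records, the attached note carries the DQR reference and owner, and re-running the view tracks progress towards zero.

```python
# Hypothetical sketch of the data quality workflow described above.
def customers_without_contact_details(customers):
    """The 'view': customer records with neither a phone number nor an email."""
    return [c for c in customers if not c.get("PHONE") and not c.get("EMAIL")]

note = {"id": "DQR-017", "owner": "A. Analyst",
        "text": "Customers with no contact details - source and correct"}

customers = [
    {"ID": 1, "PHONE": "01234 567890", "EMAIL": ""},
    {"ID": 2, "PHONE": "",             "EMAIL": ""},
]
outstanding = customers_without_contact_details(customers)
print(note["id"], "outstanding records:", len(outstanding))
# DQR-017 outstanding records: 1  -> re-run the view later to track completion
```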

So a cool piece of software with a clear view as to how to maximise its utility. But where does it sit in our world of Data Migration? Here I have to issue a couple of warnings. First, I was reviewing a beta version of their next release. Second, it seemed to me from our conversation (wrongly, as it turned out) that their thinking is embedded in the world of Data Integration rather than Data Migration. In Data Integration the target is known. In Data Migration the target is often not defined until the very last moment in the programme timeline. Waiting for the target before starting data migration activities kills many large programmes; you need to move as much activity as possible up the timeline. However, as subsequent conversations showed, there is nothing intrinsic in their technology to prevent us from using it earlier in the lifecycle – we would just have to work against a notional target.

And this is where X88 has a big plus. X88, as we have shown, does not purport to be an ETL tool – it will create load data from legacy, but it is best used in conjunction with a best-of-class migration controller. What it produces is the extract and transformation specifications, and these are optimised. So it is possible to iterate down the timeline, clearing out the major data preparation issues and zeroing in on the target as it becomes known. I can see how that would give you a major advantage in a data migration project: you can be productive early but preserve your earlier work for re-use in the final mappings.

The key question for me, then, is how automated you could make the interface between the X88 output and the migration controller. In theory this is possible, but as yet (with the exception of Expressor) it is not readily available for a wide range of existing ETL tools.

Interestingly though, in the conversations I've had with X88/Expressor/Emunio, there was the suggestion of a wholly different development methodology. As more companies adopt an Agile approach, it was suggested that X88 could be used as part of an Agile-style, co-located development factory. The business users, the Business Analysts and the data guys would huddle together around the X88 screens, working out the data design for the target at the same time as they were designing the solution in the target.

Although I find this interesting, I have a number of issues with it. First, and most fundamentally, it goes against my preferred approach of maintaining two distinct work streams – one designing the new system and one designing the migration. I would always encourage the new system designers to have a free hand to optimise the new system for the benefit of the business and not to look back at the legacy data; leave that to the migration teams. Of course, if we in migration simply can't get the right data for the new designs, or if we see something that has been overlooked in the new design but is clear from the legacy, then we flag it up. But don't compromise a design that your company may have to live with for the next twenty years to make life easier for a migration that will be over in a matter of months.

It is easier than might be thought to fall into these traps. It’s one reason why I dislike normalisation as an analysis (rather than design) technique. If the data store you are analysing was badly analysed in the first place then all normalisation does is reproduce those mistakes – except in a better structure. There are enough tools within Practical Data Migration (PDM) – like the System Retirement Policies – to support the business in their use of historical data. So don’t compromise your Target design.

I also have doubts about the scalability of Agile in very large migrations. Take a case I came across recently where a wholesaler realised, on migration, that a bug in existing software meant that many of its customers benefited from misapplied discounts. From a technical viewpoint the solution is easy: check historical sales volumes against the discount codes and apply the correct one on migration. From a business perspective it is far more complicated. There are legal, marketing, customer relations, financial and PR aspects to all this. I anticipate that it will take them weeks and weeks to come to a decision – as indeed it should. The reputation of their company and future orders are at stake. Coming across an issue like that and expecting it to be resolved within one Agile cycle is naïve. You have to find these things out early to give yourself the elapsed time you need to fix them if you are not to impact delivery dates.

Finally, I'm not quite sure how the process design and data design would work in this fast-track approach, although I have seen at least one successful methodology that does work the two in tandem like this.

However, the thought of that co-located working has set me thinking about how useful it would be to be able to rapidly prototype data design solutions in a workshop or hot-house environment. It certainly bears dwelling on and could be a real boost to productivity.

Check out X88 at www.x88.com.

Johny Morris
jmorris@pdmigration.com