When embarking on a Data Migration project, just how good do you have to make your data?

It's a funny thing but, years ago, when I first started out in my data migration career, I was struck by the unrealistic aspirations of my clients. Most saw the implementation of their new system as the chance to start afresh. Here was a clean sheet upon which they could record their business data with perfect clarity, free of the imperfections that bedevilled their old systems. This was so much the case that my standard presentation included a warning against this "Blue sky" thinking. Indeed, those of you who have taken the trouble to read my book on the subject will know that I devote many pages to explaining why, within the constraints of a data migration project, it is not realistic to expect to deliver perfect quality data. "No company needs, wants or will pay for perfect quality data," I warn.

Now it seems the pendulum has swung the other way. On the last few Data Migration exercises I have been involved in, there has been an explicit policy statement that we must not improve the data beyond the state in which we found it, unless it threatened the Data Migration itself. And by "threaten" they meant the technical, software, aspect. As long as we could get it over the wall then, if it was bad before and it is bad now, well, at least we haven't made things worse.

Perhaps it is the dawning truth that Data Migration projects are high risk (backed up by the Bloor Research findings that 80% of Data Migration projects overrun either their financial or time budgets) that leads so many to this overly pessimistic view.
Now I find myself pushing the other way. Whatever the cause, an over-tight control of data quality enhancement is just as threatening to a Data Migration project as an over-optimistic approach.

Why is that? Well, what one must remember is that although data is never perfect in the old system, surrounding it are a series of workarounds and process fixes that manage known data quality issues - to the extent that they become a natural part of the job. Stuffed into a new system, these old reassurances fail.

Secondly, it must be remembered that hidden in the data is a legacy tail of half-worked fixes, software upgrades and changes to validation, all of which mask data quality issues. I was reminded of this recently when I was transferring data between an existing software package and a new instantiation that was simply the first package re-badged. There was no difference in the software: a regulatory requirement specified physical separation, and creating a new instance of the same software was the obvious answer.

But still there would have been load problems between the two instances if we hadn't been diligent up front! How could that be?

The answer, of course, was that data could sit quite happily in the old database but would fail validation when passed through the API into the new instantiation, because the validation checks had been tightened since that data was first loaded.
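
To make that concrete, here is a minimal sketch - in Python, with entirely hypothetical field names and validation rules - of how a record can sit quite legitimately at rest in the legacy database yet be rejected by the stricter checks applied at the load API:

```python
# Hypothetical illustration: the same record is valid "at rest" in the legacy
# database but is rejected by the tighter validation applied at the load API.

from datetime import datetime

# A row as extracted from the legacy database. It was loaded years ago,
# before the rules below were introduced, and has never been re-validated.
legacy_row = {
    "account_id": "A-10293",
    "customer_name": "MR & MRS MORRIS",   # free-text holder name, pre-dates the rule
    "postcode": "",                        # blank was once allowed
    "meter_installed": "31/02/1998",       # an impossible date, never re-checked
}

def validate_for_load(row):
    """Validation as applied by the (hypothetical) load API of the new instance."""
    errors = []
    if not row.get("postcode"):
        errors.append("postcode is mandatory")
    try:
        datetime.strptime(row["meter_installed"], "%d/%m/%Y")
    except ValueError:
        errors.append(f"meter_installed '{row['meter_installed']}' is not a valid date")
    if "&" in row.get("customer_name", ""):
        errors.append("customer_name must hold a single named account holder")
    return errors

if __name__ == "__main__":
    problems = validate_for_load(legacy_row)
    if problems:
        # The record has sat in the old database for years without complaint,
        # yet it cannot be pushed through the new instance's API as it stands.
        print("Load rejected:")
        for problem in problems:
            print(" -", problem)
```

The details are invented, but the shape of the problem is real: the data itself has not changed, only the gate it now has to pass through.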

Finally there are the more insidious and slippery semantic issues. Workarounds built up over years may well mask the poor data in one data set from the eventual recipient. The obvious example here is name data. When I received my electricity bill recently it had mysteriously reverted to Mr and Mrs Morris. This may or may not have amused the ex-Mrs Morris, and it only made me laugh, but think of the possible offence if the ex-Mrs Morris were not living happily elsewhere but recently deceased. We are back to the issue of Plato's cave (see a previous blog if you are confused). Someone, somewhere, on the upgrade of the database has assumed that what is in the database represents external reality, or at least a publicly acceptable deviation from reality. Following the mantra of no enhancement - if it was wrong before it is wrong now - they have exposed their company to public ridicule or worse. This account, of course, does not even consider the impact on downstream processes of these reversions of data from corrected to uncorrected. The migration may work, but existing business processes fail.

It is the case that the mere act of taking data out of one repository and putting it into another will transform that data. So there is no alternative. Yes, we do have to rigorously prioritise our data enhancement activity. But no, we cannot dispense with it.