Data migration - start early, stay ahead

John Morris

When we come to implement a new system we know we should start with clean, appropriate data, otherwise we risk not realizing the benefits the new system promised and our return on investment may be compromised. John Morris navigates around the maze that is data migration and comes out the other side unscathed.

On a day-to-day basis we also know that our existing data has weaknesses - good in some areas and poor in others. There are missing records. There is the backwash of ancient, long-fixed problems whose influence is still present in the data today. There are the daily corruptions that occur because of broken processes or differing data needs.

There are databases that are getting more and more out of step with one another and reports from different sources that can never be reconciled because of structural issues.

But somehow when we come to move data from one system to another we tend to forget all of this. We operate as if in a vacuum, looking only at the technical aspects of the task and assuming that, because the data ran the legacy systems and the legacy systems ran the business, the data must be broadly all right and it's just a question of moving it.

Figure 1 shows a typical industry standard approach to data migration:

  • A set of rules is applied to the source applications.
  • These extract, transform and load (ETL) rules are represented in load software, which will be either locally crafted or one of the newer ETL tools built specifically for the purpose.
  • Validation is applied to the data.
  • Any items that fail to load are rejected and generally passed to our user population to resolve.

So what is wrong with this model?

Blue sky

This approach typically aims to deliver perfect data to the new system. Yet, as we all know, there is no such thing as perfect quality data. No enterprise wants, needs or will pay for perfect quality data. There is never time to source, prepare and clean our current data to a level of perfection.

There is no business case for it either. We obviously wish to improve data quality within our new system but to improve data quality beyond the level that is necessary for successful operation runs into the problem of diminishing marginal returns. Each successive gain in quality comes at greater and greater expense.

What is needed is a process of discussion with key stakeholders that will produce an acceptable level of quality that will meet our technical and business objectives. It is not clear in the industry standard model where this discussion might occur.
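One way to make that agreed level of quality concrete is a set of measurable thresholds negotiated with the stakeholders, checked against the data rather than an aspiration to perfection. A minimal sketch, with field names and threshold values invented for illustration:

```python
# Sketch: testing extracted data against stakeholder-agreed quality
# thresholds rather than aiming for perfection. Metrics and thresholds
# are illustrative assumptions.

AGREED_THRESHOLDS = {
    "postcode_populated": 0.95,  # 95% of records must carry a postcode
    "email_populated": 0.80,     # email is less critical to operations
}

def quality_report(records):
    """Measure the proportion of records with each field populated."""
    total = len(records)
    return {
        "postcode_populated":
            sum(1 for r in records if r.get("postcode")) / total,
        "email_populated":
            sum(1 for r in records if r.get("email")) / total,
    }

def meets_agreed_quality(records):
    """True only if every agreed metric reaches its negotiated threshold."""
    report = quality_report(records)
    return all(report[m] >= t for m, t in AGREED_THRESHOLDS.items())

sample = [
    {"postcode": "AB1 2CD", "email": "a@example.com"},
    {"postcode": "EF3 4GH", "email": None},
    {"postcode": None, "email": "c@example.com"},
]
```

The point is not the code but the conversation it forces: each threshold is a business decision that has to be argued for and signed off, which is exactly the discussion the industry standard model leaves nowhere to happen.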

Technical possibility over business need

With the introduction of ETL tools it is becoming clearer that what is being offered here is a technical solution to a business requirement. The solution is framed in technical terms. We talk about translation tables and error rates. We speak of validated data sets and fully normalized data. We talk of referential integrity.

All of which is necessary but none of which is sufficient. Because we are following technical drivers there is a tendency to prioritize data that meets technical criteria over data that is best suited to business need - the commonest example being a preference for data from legacy corporate data stores over data held in local spreadsheets.

The former fulfils our referential integrity criteria and may be technically easier to access, but the latter may be more accurate in business terms.

Uncontrolled recursion

These two features - blue sky thinking and technical drivers - lead to the problem of uncontrolled recursion. Data is sucked up from the designated legacy systems; that which passes some arbitrary technical tests is loaded, that which does not is rejected and cascaded back onto a largely unprepared user population to fix.

Some data is corrected, this is then fed back into the hopper, a further attempt at loading is made and a reduced set of errors emerges. And so it goes on until we run out of time, customer patience and possibly money.
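The cycle can be caricatured in a few lines. The notion of a per-record error "severity" that the user population whittles down by one fix per pass is purely illustrative; the shape of the output, a reject count that shrinks pass by pass until time runs out, is the point:

```python
# Sketch of the uncontrolled recursion described above: data is loaded,
# failures are rejected back to the users, some are fixed, and the cycle
# repeats until the data is clean or we run out of passes (i.e. time).
# The severity model is an invented illustration, not any tool's behaviour.

def migration_cycles(severities, max_passes=5):
    """Each record is an error-severity count; 0 means it loads cleanly.
    Each pass, users manage one fix per rejected record, then we re-feed
    the hopper. Returns the reject count observed on each pass."""
    history = []
    for _ in range(max_passes):
        rejected = [s for s in severities if s > 0]
        history.append(len(rejected))
        if not rejected:
            break
        severities = [max(0, s - 1) for s in severities]  # one fix per pass
    return history

# Five records, some needing several rounds of correction:
print(migration_cycles([0, 1, 1, 2, 3]))  # -> [4, 2, 1, 0]
```

Nothing in the loop bounds the number of passes except `max_passes`, which in a real project is the immovable go-live date.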

Then a mad scramble commences as we load what we can as opposed to what we should. Timelines slip, desirable features of the new software never get used, business cases are undermined, and so on.

Lack of project controls

Our blue sky thinking has not prepared us for the need to prioritize one data quality issue over another, and we do not have the products to manage that prioritization in a controlled manner. There are no project products designed to support the user population in resolving the data issues that may cascade back on them in their tens of thousands.

Because the first we know about the scale of data errors is when we take our first cut of live data, usually for testing but sometimes for real, we have not built into our plan any contingency for exposing data quality issues early enough in the timeline to be effective. But migration projects do not have to be this way.

A better way to cook

First of all let us get rid of the notion that the migration aspect of a project should start only in the final third of the programme timeframe. This is the intuitive view, I agree - I mean how could we possibly start the data migration activity before we know what the target data set will look like? But think again.

We are building a bridge between two business-as-usual realities - the pre-migration world and the new post-migration world. If we were building a real bridge then we would be advised to survey the ground on both sides, not just on the destination end. The pre-migration world is available to us from the moment the project is initiated. We can start performing our initial data survey activities pretty much as soon as the project funding arrives.

Now it can be argued that until we know what the destination is in detail, we cannot know for certain the precise data items we need or what their transformation will be. In an absolute sense this is true. But in practice we do know that if, say, we are building a new human resource system, then we will be lifting substantial parts of the HR department's data. If the bigger programme scope includes timesheeting then we will need to include some of the work management data as well.

So we can start with a data review of our legacy before we know in detail the target system. And in the project world of strictly limited end dates, any work that can be brought forward is to be welcomed because it buys you contingency at the sharp end!

Other benefits of starting early

As we saw with the industry standard approach, there are other aspects of a data migration exercise that are better started earlier:

  • designing end-user support mechanisms for fixing data problems;
  • agreeing the prioritization mechanisms;
  • deciding what constitutes acceptable quality data;
  • identifying the key stakeholders and getting their sponsorship of the project.

These are tasks critical to the success of the project, but they always have long elapsed times because they are dependent on a succession of meetings with business people who are often hard to pin down.

These are also 'soft' tasks, dealing with setting up mechanisms and project controls, not the 'hard' data-munching tasks that we, as technologists, feel more at home with. But they have to be done and can be done independently of access to the new system's data model.

Figure 2 illustrates what a model project should look like. I've superimposed the industry standard model on top. As you can see, in terms of timeline and effort the technology aspect, far from being the most significant, is dependent upon a hinterland of project and business activity.


In a nutshell, for a successful data migration project:

  • Start early.
  • Start with the legacy and especially the legacy data quality issues.
  • Be business- not technology-focused.
  • Design (or borrow) a project product set that addresses the whole problem not just the final third.

John Morris, author of the recently published book Practical Data Migration (ISBN 9781902503718), has over 20 years' experience in the industry. He has worked for some of the biggest systems integrators (PWC, Logica CMG, CSC) on some of the biggest data migration projects over the last 10 years. For further information please contact John on email: