Data migration and Pandora’s box of tricks

Introducing another of our sponsors for Data Migration Matters 6 - my old friends at X88 and their Pandora software.

I was checking back through my blogs and see that I first wrote about these guys back in 2009. So what have they been up to since then and why are they still seen as one of the leaders and innovators in the data profiling/data quality space?

First though to re-cap for those of you whose memories of blogs from over two years ago are somewhat dim or who are new to the whole data migration scene.

X88 (the company) was founded by a bunch of guys who, having been the power behind Avellino, really know their data onions. And it shows. Their ambition was to produce the next generation of data profiling/quality tool from the ground up.

One of the ways they were determined to do this was to get to grips with the latency between kicking off a query and getting the result back. As we all know raw SQL can be notoriously slow if you run any kind of grouping or filtering query against large data sets. This is especially true if that data set has been optimised for speedy data entry (like most of our OLTP source databases). There are a number of tacks to resolving this. You can be clever with the processing (like Pervasive and their off the shelf parallelisation) or you can be clever about your sort, filter and merge algorithms (like Syncsort with their decades of large dataset experience). Then there is the whole big data, Hadoop area but this is something we are not so interested in here.

Pandora (the software) takes a different approach. They bring the data into their own mysterious database where it is pre-profiled and tokenised in such a way that subsequent queries are run against a dataset optimised for DQ type analysis (e.g. ‘Find me all the occurrences of matching names even in different columns in different tables’). This allows the tool to be used whilst sitting side by side with business domain experts without the wait time that kills any collaborative working.

Whilst performing the pre-profiling all the standard profiling indices are populated - field patterns, possible foreign key matches, max and min values etc. All of this metadata is in held in the Pandora database and so is rapidly available.

Their approach would then see you creating a series of ‘views’ (logical tables) that examine and apply appropriate transformations to each field to get desired results. It is possible to view the results as they are applied and further profile them to check you are heading in the right direction.

When all of this is to be performed on real, live, data you end up with a view that corresponds to your target, populated with a full data set as it will be at load time. Pandora also produces a full audit log of all the transformations that have been applied and crucially an output set of mappings which you can then pass to your ETL developers.

The point and click nature of Pandora means that metadata discovered in profiling can be reused in data quality monitoring. So, for instance, data pattern values captured in profiling can be used as the basis for the correct pattern matching in subsequent data quality passes.

Pandora is not a full power migration controller, nor does it pretend to be. It lacks key essentials, like automated rollback, re-start, throttling, real time reporting etc. However what it does allow you to do is to prepare real data from the legacy data stores, enriched and reformatted, ready to be handed to the software picked to perform the final mile lift and shift.

This is important. In major migrations the standard paradigm is that there will be an in-situ supplier or systems integrator (SI) contracted to deliver the working solution. They will perform the data migration of data prepared to their specification from a staging database. It is the client’s job to populate that database with data that conforms to the validation rules provided by the SI. Here the Pandora tool can perform the profiling, validation, enrichment and preparation to load the staging tables. This is a task that does not require the sophistication of a fully fledged migration controller. On the other hand, where the data load is smaller and everything is being done in-house, then you could maybe get by with the ETL capabilities of Pandora.

The in-situ SI example (or even the mandated use of specific load tools like MAPI for SAP) is a classic example of what, in practical data migration terms, we know of as the demilitarised zone. The DMZ concept allows us to separate responsibility for different parts of the ETL journey. Working outside of the DMZ is nearly always the responsibility of the client. The use of Pandora up to the DMZ does not conflict with the use of whatever tools are prescribed within the DMZ provided the structure of the staging table is understood.

Sticking with the PDMv2 theme, one of the many aspects that I particularly like is the ability to create notes against views. This means that where, for instance, your profiling reveals that there are, say, 4,000 customer records that have no contact details, these could be held in a view. The attached note would become the basis of your DQR. A unique identifier for the note is available. This could be either the DQR ID itself or would be an easy cross reference. The note can be assigned to someone and periodically re-running the view would track completion of the DQR. The notes are exportable. Therefore we have the basis of a data quality workflow.

This is not the only way of tracking your data readiness. The latest version of Pandora comes with built in data quality dashboards although I prefer the DIY method of creating a view with a column of pass/fail ticks then running a grouping query against the pass/fail ticks to get an overall score that can be passed into your graphics tool of choice. The built-in graphics are maybe not to everybody’s taste.

Finally the versions available. There is a studio edition that is free to download from the X88 website (see below). I would recommend this for anyone interested in getting to grips with what the heck we are talking about when we pontificate about profiling tools, data quality tools and the like. Obviously it lacks all the sophistication of the paid for variant - like the built in dashboards but also more significantly the collaborative features of Pandora that are needed for bigger projects.

There are some easy to follow tutorials on the X88 website led by the dulcet Scottish tones of my friend (and Pandora expert) Derek Munro. These will more than get you started.

The paid for bundle includes the data discovery, prototyping and DQ management modules giving you a comprehensive data management tool with business glossary with physical-logical link.

Check out X88 or better still come to Data Migration Matters 6 where you will meet the aforementioned Derek and get hands on instruction in the dedicated workshop area.