Data Migration - Expressor Studio and CDC from Talend

Yet more interesting developments from Expressor and Talend.

For those of you not familiar with the Expressor product, I guess an introduction is in order. I think I last discussed them about 12 months ago, and at that time I definitely saw them as a disruptive technology. So, to paraphrase what I said last time out:

Expressor is a metadata-model-based data migration tool. This means that in place of the spaghetti tangle of individual field-to-field connections of the traditional ETL tool, it models the source and target at a higher level of abstraction using its ‘Smart Semantics™’ framework, which includes drawing tools and a semantic dictionary. This makes it easier to maintain, easier to re-use and easier to share with the business.

Now they are releasing a beta version of their next release. I have a copy and I’m hoping to find time to put it through its paces. I’ll let you know how I get on.

More importantly, they have released a free-to-download ‘Studio’ version of their software. Offering a free-to-use version of their product is yet another instance of the growing trend towards embracing some of the social network aspects of open source in building communities around products.

I have commented previously on how Informatica have been embracing this change recently in the development of their Informatica Marketplace. But, of course, it is Talend, out there on their own as true Open Source vendors, who are leading this change in attitude.

And talking of Talend, I attended their CDC webinar recently. For the uninitiated, CDC stands for Change Data Capture: the ability to spot changes to data items in a database. Once spotted, a change can be the event trigger for other software to act on it.

The webinar was co-hosted with Jaspersoft, the open source Business Intelligence company. Interestingly, Jaspersoft have embedded the Open Studio version of Talend into their offering to provide it with the ETL functionality it requires, thus demonstrating the dynamic nature of open source development and its capacity to feed on itself. However, this meant that the demonstrations were directed at Data Warehousing and Business Intelligence, not Data Migration or integration.

But the ease of creating log-based or trigger-based CDC from within Talend was impressive, and CDC is the first and necessary step in building a forward synchronising Data Migration tool.
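To make the trigger-based flavour concrete, here is a minimal sketch in plain Python and SQLite. It is not how Talend does it; the `customer` and `change_log` tables, the trigger names and the two helper functions are all my own assumptions for illustration. The idea is simply that the database itself writes a row to a log table every time a record is created, amended or deleted.

```python
import sqlite3


def make_source_with_cdc():
    """Build an in-memory source database with trigger-based change capture.

    Table and trigger names are hypothetical, chosen for this sketch only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

        -- Every captured change gets a monotonically increasing sequence number.
        CREATE TABLE change_log (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            op TEXT NOT NULL,
            customer_id INTEGER NOT NULL
        );

        CREATE TRIGGER customer_ins AFTER INSERT ON customer
        BEGIN
            INSERT INTO change_log (op, customer_id) VALUES ('create', NEW.id);
        END;

        CREATE TRIGGER customer_upd AFTER UPDATE ON customer
        BEGIN
            INSERT INTO change_log (op, customer_id) VALUES ('amend', NEW.id);
        END;

        CREATE TRIGGER customer_del AFTER DELETE ON customer
        BEGIN
            INSERT INTO change_log (op, customer_id) VALUES ('delete', OLD.id);
        END;
        """
    )
    return conn


def read_changes(conn, after_seq=0):
    """Return captured changes newer than after_seq, oldest first."""
    cursor = conn.execute(
        "SELECT seq, op, customer_id FROM change_log WHERE seq > ? ORDER BY seq",
        (after_seq,),
    )
    return cursor.fetchall()
```

A downstream process would remember the last `seq` it handled and call `read_changes(conn, after_seq=last_seq)` to pick up only what is new. Log-based CDC reaches the same end by reading the database's own transaction log instead of adding triggers.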

Forward synchronisation is essential for migrations in 24/7 or ‘always up’ environments (like telcos, for instance). It allows you to run the migration without bringing down legacy systems, updating the target as changes are made in the source to data items that have already been migrated.

However, although CDC is an essential first step, it is by no means all you need. There are a number of other issues that need to be addressed.

Firstly, there is working out where to apply the changes. Some changes will result in minor changes to single records (say, a change in delivery time for an order). Others will cause cascading changes throughout data sets (for instance, the deletion of a customer record). Working out how to apply these updates requires a separate set of mappings for each create, amend and delete event. So loads of work. This is where the metadata type of applications have an advantage. Using data models gives the software the intelligence to work out the cascading impact of changes.
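As a rough illustration of what "using the data model to work out the cascade" might look like, here is a small Python sketch. Everything in it is assumed for the example: the `CHILDREN` dictionary stands in for a real metadata model, and the target is just nested dictionaries rather than a database. It is not how Expressor or Talend implement this.

```python
# Hypothetical data model: parent table -> list of (child table, foreign key field).
CHILDREN = {
    "customer": [("order", "customer_id")],
    "order": [("order_line", "order_id")],
}


def apply_change(target, op, table, key, values=None):
    """Apply one create, amend or delete event to an in-memory target.

    target maps table name -> {primary key: row dict}.
    """
    if op == "create":
        target[table][key] = dict(values)
    elif op == "amend":
        # A minor change touches a single record only.
        target[table][key].update(values)
    elif op == "delete":
        # Consult the model to find dependent records and remove them first.
        for child_table, fk_field in CHILDREN.get(table, []):
            dependants = [
                child_key
                for child_key, row in target[child_table].items()
                if row[fk_field] == key
            ]
            for child_key in dependants:
                apply_change(target, "delete", child_table, child_key)
        del target[table][key]
    else:
        raise ValueError(f"unknown operation: {op}")
```

Amending an order's delivery time touches one row, whereas deleting a customer walks down through that customer's orders and order lines. The point is that the cascade comes from the model, not from hand-written mappings for every table.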

Secondly, there is the migration decomposition issue. Assuming that you are not performing a ‘Big Bang’ migration where all the data is loaded in one go, you will have broken the migration down into smaller, discrete steps. Maybe you will be doing your migration region by region or function by function, but whatever it is, you will not be interested in all the changes that occur in the source, only in those that affect data items that have already been moved. Although this could be coded around, it will add an additional burden of configuration management, change control and, of course, coding and testing.
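In its simplest form, "coding around" this means filtering the captured changes against a record of what has been migrated so far. The sketch below assumes a region-by-region migration and a change event carrying a `region` field; both are my own assumptions, and a real project would need this scope list kept under change control as each step goes live.

```python
def changes_in_scope(changes, migrated_regions):
    """Keep only the changes that touch data already moved to the target.

    changes is a list of dicts with at least a "region" field (hypothetical
    event shape); migrated_regions is the set of regions migrated so far.
    """
    return [change for change in changes if change["region"] in migrated_regions]


# Example: only the North region has been migrated so far.
captured = [
    {"op": "amend", "customer_id": 1, "region": "North"},
    {"op": "delete", "customer_id": 7, "region": "South"},
    {"op": "create", "customer_id": 9, "region": "North"},
]
to_apply = changes_in_scope(captured, {"North"})
```

Changes for regions still to be migrated can simply be dropped, because those records will be picked up in their current state when their own step runs.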

Thirdly, there is the one-way street problem. It is possible to write migration code that leaves no obvious record of where the migrated data went. Updates or deletes that follow may then struggle to find the right record in the target to update. More sophisticated software keeps track of individual Units of Migration, so it knows the matching unique identifiers at either end.
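The usual way out of the one-way street is to keep a cross-reference of source and target identifiers as each record is migrated. Here is a minimal Python sketch of the idea; the class and function names are hypothetical, and in practice the cross-reference would live in a persistent table, not in memory.

```python
class KeyCrossReference:
    """Remembers which target record each migrated source record became."""

    def __init__(self):
        self._target_by_source = {}

    def record(self, source_id, target_id):
        # Called once per record at migration time.
        self._target_by_source[source_id] = target_id

    def target_for(self, source_id):
        # Returns None when the source record has not been migrated.
        return self._target_by_source.get(source_id)


def apply_amend(xref, target, source_id, values):
    """Apply a later source-side amendment to the matching target record.

    target maps target id -> row dict. Returns True if a record was updated.
    """
    target_id = xref.target_for(source_id)
    if target_id is None:
        return False
    target[target_id].update(values)
    return True
```

With this in place a later update or delete carries only the source identifier, and the cross-reference supplies the target one, even when the target system has issued entirely new keys.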

All these issues could be fixed in coding and design but fixed they would have to be. Of course with Talend’s open software ethos it may be someone in the community has already solved these issues. If so, and if you are reading this blog, please let me know.

I’m due to meet with Talend in a couple of weeks’ time. I’ll take these issues with me and raise them then. And I will of course let you know how I get on.

Finally, thank you to everyone who has requested one of our PDMv2 posters. I apologise to everyone who has requested one but not received it yet. A combination of high demand and staff illness has limited our ability to cope. But fear not, if you have asked for one it will be despatched. If you have yet to request one, do so now before they all go. I have checked the email below and it does work (this time).

Johny Morris
jmorris@pdmigration.com