What are conceptual entity models and why do we need them? My attention has been drawn this week to an old article by Malcolm Chisholm entitled ‘Big Data and the Coming Conceptual Model Revolution.’

Malcolm’s epistle has been around for at least 12 months, but for whatever reason it has been re-published recently and this is the first I’ve seen of it. So if this is all old news to you I apologise, but it brings clarity to a few issues that I’ve been meaning to write about for some time.

First off the bat I suppose I should define what I mean by ‘Conceptual Entity Model’. Now in my book Practical Data Migration I define a conceptual entity model as ‘A form of data model where atomic entities are grouped together to form higher-level entities that are meaningful to the enterprise’. 

This may be accurate from a semantic point of view but it doesn’t really say why I need it or how I use it. For me a conceptual entity model holds the highest-level collection of things we are interested in (in this case for our migration). So if I am working on a CRM migration then it is the customer and maybe also product, sales etc. I know that each of these is likely to be normalised into multiple lower-level entities to make them useful - so customer may need addresses split out, especially if there have to be different delivery and billing addresses. I will also, when I get down that far, probably need to construct some kind of product matrix. But for my immediate purposes I keep my entities crude, generic and high level. I have yet to work on a project where the whole thing cannot be mapped onto fewer than a dozen conceptual entities.
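To make that concrete, here is a minimal sketch in Python of how such a crude CRM model might be jotted down as data rather than on paper. It is purely illustrative - the entity names and cardinalities are a made-up example, not anything from a real project:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    """A crude, high-level conceptual entity - no attributes or keys yet."""
    name: str

@dataclass(frozen=True)
class Relationship:
    """A link between two conceptual entities with rough cardinality."""
    from_entity: str
    to_entity: str
    cardinality: str  # e.g. "1:M" or "M:N" - left unresolved at this level

# The whole CRM migration scope, as it might appear on one piece of A4:
entities = [Entity("Customer"), Entity("Product"), Entity("Sale")]
relationships = [
    Relationship("Customer", "Sale", "1:M"),  # a customer places many sales
    Relationship("Product", "Sale", "M:N"),   # deliberately left many-to-many
]
```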

But if it is so crude how can it be so useful? Well it accomplishes a number of things. First of all it is created, often in a single pass, in the first meetings with senior sponsors. My diagramming technique of the named box, the connecting line and the crow’s foot for entity, relationship and optionality/cardinality indicator is easy to understand. I can start having a meaningful discussion about scope using one piece of A4, a pen and less than half an hour’s precious time. On the side I am introducing some powerful but maybe not so technical folks to an appreciation of metadata structures without ever mentioning the dreaded ‘D’ word.

I don’t bother resolving the many-to-many relationships at the conceptual entity model level - I know as a data analyst that I will have to at a lower level, but it doesn’t help my conversation at that point. I am also aware that I may have to alter the model later (although this does not happen as often as you might imagine).
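For anyone who has not done this before, that later resolution typically means introducing an associative (link) entity. A hedged sketch, continuing the made-up CRM example above - "SaleLine" is invented for illustration:

```python
# Continuing the example above: at the logical level the unresolved
# Product-to-Sale many-to-many is typically replaced by an associative
# (link) entity, here invented as "SaleLine".
relationships.remove(Relationship("Product", "Sale", "M:N"))
entities.append(Entity("SaleLine"))
relationships.extend([
    Relationship("Sale", "SaleLine", "1:M"),     # one sale has many lines
    Relationship("Product", "SaleLine", "1:M"),  # one product appears on many lines
])
```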

In addition to using the model as part of my project charter or scope, I also start organising my project around it. Every data source and data issue is tagged with the conceptual entities it touches. This is immensely important when you are in the midst of a project with hundreds (thousands?) of data sources generating hundreds of data issues. Finding that one spreadsheet that forms the link between two apparently unconnected data stores, whose linking is vital to the programme, is a lot easier if you have tagged all your stores by entity and all your issues by entity.
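As a sketch of what that tagging buys you (again illustrative Python, with invented store and entity names), finding the stores that bridge two entities becomes a one-line query rather than a trawl:

```python
# Each data store (and, in practice, each data issue) is tagged with the
# conceptual entities it touches. The store names here are invented.
store_tags = {
    "crm_extract.csv":        {"Customer", "Sale"},
    "warehouse_products.xls": {"Product"},
    "pricing_matrix.xls":     {"Product", "Sale"},  # the linking spreadsheet
}

def stores_bridging(entity_a: str, entity_b: str) -> list[str]:
    """Stores that touch both entities - candidate links between them."""
    return [name for name, tags in store_tags.items()
            if entity_a in tags and entity_b in tags]

print(stores_bridging("Product", "Sale"))  # ['pricing_matrix.xls']
```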

So what does our friend Malcolm have to say? Well he also has a go at the definitional problem and cites an article by Tom Haughey for a definition closest to the one I employ (there are of course others).

Malcolm disagrees with this definition though: for him a conceptual model has more granularity and is closer to what I would call an Entity Relationship Diagram or ERD (I use that term partly because it allows me to make poor puns like ‘We need a nERD’), more commonly known as a logical data model. This is a model at the level of individual entities. But heck, as the immortal bard put it, ‘What’s in a name? That which we call a rose by any other name would smell as sweet.’

So what Malcolm and the rest of this blog are talking about here is a low-level, maybe even lowest-level, ERD. And what is great about the use of these tools is that they allow us to compare the data structures of completely different technologies, be they star schema data warehouses, classic relational databases or even the new kids on the block - the MapReduce structures of Big Data installations.

Malcolm sees this as a renaissance for classic data modelling.

Where I differ again from Malcolm is that I don’t think this ever went away. I learned my data modelling back in the early 80s when I was working in my first IT role for a company that ran Honeywell mainframes and their IDS II CODASYL database. This was a non-relational, network database similar to the more common IDMS. Without foreign keys, navigation for programmers was only really possible with the use of a handy diagram, the Bachman notation being the preferred approach. So I grew up, as it were, in the IT profession using data models until they became second nature. I remember my surprise, on landing my first job at a relational database site, to find there was no matching documentation.

I have consistently, since those early days, used data models to get to the bottom of similarities and differences between different data stores in an unambiguous way that completely disregards their technology. So they may be spreadsheets or even, as I had to deal with recently, structures as singular as the customer-centric Microsoft Dynamics platform. It doesn’t matter. We can still analyse them using the same modelling tools and techniques.

So I think Malcolm, albeit in a different domain, has hit upon something significant. As we move away from an almost exclusively relational model (if we ever had one) towards a multitude of new technologies, we are going to have to rely more on views that do not replicate the physical data structures. We are all going to need an ERD.

What is less surprising is that we in the data migration ghetto have been doing this for years. CODASYL databases have never gone away as far as we are concerned - well, at least not until we make them disappear on a project-by-project basis. Indeed on the PDMv2 training course we cover just such an example in our mapping activity (so be warned).

The version of Malcolm’s work I came across is available here and contains some interesting observations about the items that most of us find hard to model. As you have probably gathered by now, I disagree that it is impossible to model non-key dependencies, but agree that code tables and levels of abstraction are real challenges. I may devote another blog to overcoming the problem of code tables. (To give an example: two companies in the same industry merge. They are ostensibly using the same COTS platform and yet their data is differently structured. How do we model the way this is managed by the use of code tables within the applications?)
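To sketch what I mean (with entirely made-up codes - real cases are messier), the two merged companies might each have a ‘customer status’ code table on the same COTS platform, but with codes that neither line up one-to-one nor share a level of granularity:

```python
# Company A's customer-status code table.
status_a = {"A": "Active", "D": "Dormant", "X": "Closed"}

# Company B's table for the "same" concept, at a different granularity.
status_b = {
    "01": "Active",
    "02": "Active - credit hold",
    "03": "Lapsed",
    "04": "Closed",
    "05": "Written off",
}

# Any mapping between them forces modelling decisions: is "Dormant" the same
# thing as "Lapsed"? Where do "credit hold" and "written off" belong? The
# code tables, not the entity structures, carry the semantic differences.
candidate_map = {"01": "A", "02": "A", "03": "D", "04": "X", "05": "X"}
```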

Johny Morris

jmorris@iergo.com