Like any disruptive new technology, the arrival of Big Data has upset the apple cart and, as we try to get to grips with it both technically and in business terms, we also need to get a handle on it conceptually. To do this we reach for the modes of description that are to hand. Within the bounds of Data Management, the obvious tool for describing a collection of data items is a data model. However, when we are confronted with a Map Reduce data set, it is nothing like a normalised, relational one (not that we ever get to see a fully normalised relational database in the real and grubby world of enterprise applications). So can we re-use these found items (emptied signifiers, as we semiologists would call them) for this new job, or is this brave new world so strange and different that we need new forms of representation?
And how can the discipline of Data Migration help us?
Firstly, we have the industry-standard definitional muddle that we have come to expect with any term in IT. What are Data Models? Are they an abstract, transcendent representation of data that can be used to describe data as it is found in its many diverse forms, from spreadsheets to hard copy to relational databases to Map Reduce arrays, or are they part of the documentation that should accompany any well-run database, very much tied to the technology they map?
My answer to this starts with the premise that a model, and data models are no different, is any representation that, for a specific purpose, reduces the complexity of the problem domain and allows us to answer questions that would otherwise be difficult.
For me, a perfect example of a model is the London Underground map devised by Harry Beck in 1931. Prior to this brilliant piece of simplification, to use the Underground you would have had to consult individual line plans and street maps with the stations indicated. With Beck's design we can see on one diagram how to get from, say, Liverpool Street Railway Station to Piccadilly Circus, where to change, whether to stand on the up-line or down-line platform, and so on. Like all great design it is so easy and intuitive that, once done, it seems obvious. The reality of the Underground (as I know from personal experience) is far more complex - a three-dimensional monster of ticket halls, escalators, tunnels, viaducts, sewers, cabling ducts and platforms. What the genius of Beck did was to remove even the absolute relationship between the geographical placement of the stations and their mental representation. You can't navigate the streets of London using the London Underground map, but that is not what it is for.
So a model should be judged not by how closely it replicates the reality it models, but by how well it removes the detail extraneous to the purpose for which it is designed whilst retaining sufficient accuracy in the details that serve that purpose.
Let us now see how this applies to Data Models. In the Chrisholm piece there is a bit of a hang-up on the technical specificity of Data Models. As an old greybeard of IT, I was introduced to Data Modelling techniques back in the early 80s. At the place where I was working, the target was a network database (Honeywell's IDSII, not unlike the IDMS that ran on IBM mainframes). For these databases, relationships between records in tables were held as hidden pointers within the database management system (DBMS), not as foreign keys on the records, so to get from, say, an order line to the order header and then to the customer you had to know how to "walk" the Orderline-Order set and then the Order-Customer set. The consequence was that it was impossible to use these databases without the aid of documentation that explained the set and table relationships. In larger databases with thousands of tables this became even more imperative. So with every release of the database a new logical data model was produced using the famous Bachman diagramming convention.
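To make that navigational style concrete, here is a minimal sketch - in Python rather than the data manipulation language we actually used, and with entirely hypothetical record and set names - of what "walking" the sets from an order line back to its customer amounted to:

```python
# A hedged illustration of navigational (network database) access.
# The DBMS held set membership as hidden pointers; nothing on the record
# itself told you who its owner was.

class Record:
    def __init__(self, **fields):
        self.fields = fields
        self.owner = {}      # hidden pointer to the owning record of each set
        self.members = {}    # hidden pointers to the member records of each set

def connect(owner, member, set_name):
    """Wire a member record into a set, much as the DBMS did behind the scenes."""
    member.owner[set_name] = owner
    owner.members.setdefault(set_name, []).append(member)

customer = Record(name="Acme Ltd")
order = Record(order_no="O-123")
line = Record(product="Widget", qty=10)

connect(customer, order, "Customer-Order")
connect(order, line, "Order-Orderline")

# Navigational access: from an order line you must know which sets to walk;
# there is no foreign key on the record to point the way.
found = line.owner["Order-Orderline"].owner["Customer-Order"]
print(found.fields["name"])  # -> Acme Ltd
```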
As an aside, when I finally came into contact with relational databases I was astonished at the undocumented anarchy that reigned in these development shops. Documentation was scant and never up to date. You found your links by looking for potential foreign keys in table definitions and tried out the connections with a few lines of SQL if you weren't sure. After a while you became familiar with your neck of the woods. This is still the case with many applications today.
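For what it is worth, the sort of "trying out a connection" I mean looks something like the sketch below. The table and column names are made up, and sqlite3 merely stands in for whichever RDBMS you happen to be poking at:

```python
# A hedged sketch: given a column that looks like a foreign key, check
# whether it actually joins cleanly to its supposed parent table.
import sqlite3

conn = sqlite3.connect("legacy_app.db")  # hypothetical legacy database

# Does ORDER_LINE.ORDER_ID really point at ORDERS.ORDER_ID?
# Count the order lines whose supposed parent order does not exist.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM ORDER_LINE ol
    LEFT JOIN ORDERS o ON o.ORDER_ID = ol.ORDER_ID
    WHERE o.ORDER_ID IS NULL
""").fetchone()[0]

print(f"{orphans} order lines have no matching order")
# Zero, or close to it, suggests you have found a real if undocumented link.
```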
The relational model won out in the end, partly for crude commercial reasons - IBM (or Big Blue, as it was known at the time) dominated the marketplace and went down the relational route - but also for speed of development. The anarchy of the relational world meant development could be far more rapid than the slow but meticulous crawl of the network DB.
However, the lessons I learned about data modelling then have stuck with me. We could model real-world relationships and then reproduce them in our IDSII databases. The Bachman diagramming conventions were not unlike the "lines and crow's feet" notation that is more common these days. There were some differences - arrowheads at the many end of a 1:M relationship, for instance, and extensions that only made sense for the technology - but in essence they were the same.
A key difference, of course, is that they did not rely on foreign keys to say that entity A was related to entity B. So, as my career took me further and further into data management, I retained the habit of drawing Entity Relationship Diagrams (ERDs) without necessarily troubling to show the access path by which two entities were related, because at certain points in analysis you can be satisfied that they just are.
This has been a real help in Data Migration, where we are confronted with data coming from multiple sources, many of which are non-relational. I often recount my experience, whilst working on a software implementation at a London-based utility, of getting some data from hand-drawn (originally Victorian) line diagrams.
What we are interested in is removing the complexity of technology and getting back to the basic nature of the underlying entities and their relations. Even with two applications running on the same DBMS and addressing the same business problem, the way the databases are constructed can be quite different. When you have one application instantiated on Oracle and another held in multiple spreadsheets, comparing physical structures really is not helpful. This is where our logical models do their job. Structural differences are far easier (and cheaper) to see in logical models than by investigating the problem using code.
These sorts of comparison are all part and parcel of day-to-day life on a Data Migration project.
So, to come back to the original question about the use of Conceptual Models when addressing the specific problem domain of Big Data, I don't see where the problem lies. Of course the Map Reduce tables are not relational, but the metadata relationships between a Customer, an Order and an Order Line can be represented in the same form whatever the physical implementation. We can then judge whether representation A is the same as representation B - our model is optimised to answer this question, even though it may not be the representation we would use to write the code.
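By way of illustration only - the data shapes below are invented - the same two logical relationships can be recognised in a set of relational-style rows and in a nested, Map Reduce style record alike:

```python
# The logical relationships, stated once, independent of any physical form.
LOGICAL_RELATIONS = {("Customer", "Order"), ("Order", "OrderLine")}

# Physical form A: relational-style rows joined by foreign keys.
orders = [{"order_id": 10, "customer_id": 1}]
order_lines = [{"line_id": 100, "order_id": 10, "product": "Widget"}]

relational = set()
if any("customer_id" in row for row in orders):
    relational.add(("Customer", "Order"))
if any("order_id" in row for row in order_lines):
    relational.add(("Order", "OrderLine"))

# Physical form B: a nested, denormalised record such as a Map Reduce job
# or a document store might produce.
doc = {"customer": "Acme Ltd",
       "orders": [{"order_no": "O-10",
                   "lines": [{"product": "Widget", "qty": 10}]}]}

nested = set()
if doc.get("orders"):
    nested.add(("Customer", "Order"))
    if doc["orders"][0].get("lines"):
        nested.add(("Order", "OrderLine"))

# Both representations carry the same logical relationships, even though
# their physical structures have nothing in common.
assert relational == nested == LOGICAL_RELATIONS
```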
I know I promised to produce a blog last week on the mini debate on testing taking place in the Data Migration Professionals group on LinkedIn, but I got so snowed under that I didn't produce a blog at all. It is a situation I shall endeavour to rectify very soon (the blog, that is - I do not think I will ever solve the peaks and troughs of work issue).