The database software market is huge (worth over $50 billion a year), giving vendors the incentive to deliver many new products – as well as new features in old products – in order to claim a share of this revenue.
Relational database products have been around for decades and are associated with some heavyweight methodologies, which use up a lot of resources in planning and designing, long before any working code is delivered. Consultants commonly promote these methodologies, because detailed planning reduces their risk and earns them revenue before any testable application is delivered.
The established database vendors have added many features and supporting tools to suggest that their databases are easier to use. In at least one case, the vendor says that, as a result, the work can be done ‘without expertise’!
What is a relational database?
The relational database was invented to tackle many problems found in earlier applications that were based on files and on other kinds of databases.
The building blocks of a relational database are tables of data and the links between these tables. This structure is ideal for the types of applications that are driven mostly by numbers, dates, times, and short text identifiers.
Relational databases record not only the data, but also the definitions of the data. The definitions include data types, links between tables and other rules. In earlier systems these definitions were in the applications. Now the shared database definitions ensure that the data is treated consistently across applications.
Relational databases support the safe use of data by many concurrent users, via permission rules that define who is allowed to do what and tuning options that enable the database performance to be optimised, without changing the applications using it. They also support auditability and recovery from failure.
For the database coder, relational databases provide a language called SQL that lets them tell the database what to do. The database works out how to do what has been asked of it, using sophisticated artificial intelligence in its ‘query optimizer’.
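To make the declarative nature of SQL concrete, here is a minimal sketch using SQLite, the lightweight relational database bundled with Python. The table and data are invented for the example; the point is that we state *what* we want and the database decides *how* to compute it.

```python
import sqlite3

# A hypothetical 'orders' table, created in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL, total REAL NOT NULL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("Alice", 120.0), ("Bob", 80.0), ("Alice", 40.0)])

# We describe the result we want -- totals per customer -- and the
# query optimizer chooses the access path and grouping strategy.
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders"
    " GROUP BY customer ORDER BY customer").fetchall()
print(rows)  # [('Alice', 160.0), ('Bob', 80.0)]
```

The same query text runs unchanged whether the table holds three rows or three billion; only the optimizer's chosen plan differs.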
Relational databases are mature and robust technology that is now low-cost and is used in the back-office of almost every business, government and not-for-profit organisation on the planet. Relational databases are reliable, efficient, scalable and fast.
The problem of ‘big data’
The rise of social media companies has seen some vast fortunes being made from systems that contain large amounts of text and multimedia. Other businesses see this and ask themselves whether they can also make money from their ‘big data’. Very often, they are tempted to buy the products that have been associated with the social media companies. This is a bit like trying to become successful by buying designer suits, expensive limos and corporate jets, because this is what successful bosses do. However, success is a bit more complicated than that.
The biggest banks, mobile phone companies and other retail giants have a lot of data, typically tens of terabytes. The social media companies typically have ten thousand times as much, hundreds of petabytes. That’s a big difference. The difference is not only in scale, it is also in content. The data managed by most organisations is largely numbers, dates and short text identifiers. This needs to be processed in completely different ways from text, which is what the social media companies deal with most.
New and improved?
The new products – ‘NoSQL databases’ and ‘big data’ systems – suggest that traditional design tasks, as well as tasks to improve data quality, can be abandoned. They offer nothing in place of these tasks. Consultants, riding on the vendors’ marketing messages, are less likely to propose heavyweight methodologies with these products.
Those responsible for building new applications in medium to large organisations are faced with a choice between a front-heavy traditional database methodology and a ‘big data’ methodology that is unsuited to the type of data they are trying to manage. Neither option seems an ideal solution. What should they do?
Failings of the traditional approach
A Yale professor has written one of the most popular textbooks on relational databases (now in its seventh edition). It has well over a thousand pages and costs nearly $200. Pretty early on, it talks about conceptual data models, logical data models and physical data models. This implies the heavyweight waterfall-style methodology that has been the main factor in giving relational databases a bad name.
This layered conceptual-logical-physical approach encourages a working model which puts data-modellers in an ivory tower, far from developers and far from actual users. In their intellectual isolation, they use expensive modelling tools, which require expensive A0 printers to produce ugly wallpaper.
These models are invariably over-complicated, with misleading names, inappropriate data-types, and a rat’s nest of links. When the developers finally get onto the project (a year or two too late) they ignore the model and build what they believe they need. The developers are now in a mad rush, because the data modellers have spent all the money and used all the time.
It is crazy to suggest that the whole database must be designed three times before it can start to be used. The rules about which columns belong in which tables will hardly ever change as we investigate other tables; there is no need to design the whole database before using any of it. And relational databases are easy to change; there is no need to fear it. Those who do spend months or years designing databases in isolation always get them wrong anyway; the real world changes and the design does not keep up with reality.
A better approach
Rather than chasing after promises of easy solutions, a better approach is to rethink the way we use the best technology available. Relational database development can be agile, with professional, reliable engineering. This does require discipline and skill, but it can be done and it is very rewarding to deliver a system that works well for its users at a reasonable cost.
Some people believe that agile development just means cutting corners, or using different ‘artifacts’. It is worth going back to the Agile Manifesto (see references) to make sure that our team, or our suppliers, are actually following the principles defined there and not just using the term ‘agile’ as an excuse for shoddy work. Agile development, as described here, is highly disciplined and delivers high-quality results.
A more productive approach was described by David Heinemeier Hansson in Agile Web Development with Rails (see references). The key idea is that you design the database as you need it. On each iteration, the developers may add to the database and change what they have already designed. Gradually, the database grows into exactly what the users need – and no more. This agile approach horrifies the heavyweight methodology enthusiasts, but it is far more effective. It needs to be done properly, of course.
The next section reviews the key practices used by agile developers; these are explained in Relational Databases for Agile Developers (see References).
Practical application of agile database development
Building a database one step at a time makes perfect sense. Why would you build an incorrect model and then normalise it to make a correct model? It is much better to build a normalised model at every iteration.
The changes from one iteration to the next are usually additions to what has already been built, and therefore have no adverse impact on work done in earlier iterations. Some iterations cause no database changes at all. Very occasionally, an iteration makes it necessary to split a table into two or to change the nature of a link.
SQL is the ideal tool to migrate the existing tables into the new structure with no loss or duplication of data, because it gives a non-procedural, text-based approach that can be tested, version-controlled and reliably repeated.
SQL migrations can be applied in organisations that use any development language: Java, Python, C#, whatever you choose. It is essential to write tests for every migration, of course, and the migration must be run in development and tested before it is moved to live. When we do go live, we know that our migration is correct.
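A minimal sketch of such a tested migration, using Python's built-in SQLite module (the table names and the particular change – splitting addresses out of a customers table – are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Existing structure: the city was embedded in the customers table.
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Ben', 'Leeds');
""")

# The migration is plain SQL: text that can be version-controlled,
# tested in development and repeated identically when we go live.
MIGRATION = """
    CREATE TABLE addresses (
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        city        TEXT NOT NULL
    );
    INSERT INTO addresses (customer_id, city) SELECT id, city FROM customers;
    CREATE TABLE customers_new (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO customers_new (id, name) SELECT id, name FROM customers;
    DROP TABLE customers;
    ALTER TABLE customers_new RENAME TO customers;
"""
conn.executescript(MIGRATION)

# A test for the migration: no loss or duplication of data.
assert conn.execute("SELECT COUNT(*) FROM addresses").fetchone()[0] == 2
assert conn.execute(
    "SELECT name FROM customers WHERE id = 1").fetchone()[0] == 'Ada'
```

The create-copy-drop-rename pattern shown here works on any relational database; the assertions at the end are the kind of migration test that should run in development before the change goes anywhere near live.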
Here is a simple checklist, which can be used for every database change. It ensures that the database is well-structured and that it follows some other valuable, but less intellectually challenging, principles, such as proper naming:
- Is every column atomic? (No compound fields)
- What is the primary key?
- Does every other column depend on the primary key?
- Does any column depend on another column that is not the primary key?
- Are there any repeating groups?
- Does every column have the most restrictive data-type possible for the data it holds?
- Does the name of the column accurately describe the data that will be stored in it?
- Are foreign keys defined in all cases where there is a relationship?
- Are nullable columns genuinely OK to be null? This applies to data columns and to foreign keys.
Like most checklists, this one works if the person using it understands the concepts that each item condenses into just a few words. In this case, the first five items cover what is described in the textbooks as ‘normalisation’. The last four items are some basic methods for ensuring better data quality.
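As an illustration, here is a hypothetical pair of tables that passes the checklist, sketched with Python's built-in SQLite module (all table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL                   -- accurately named, never null
    );
    CREATE TABLE order_lines (
        id         INTEGER PRIMARY KEY,      -- the primary key
        product_id INTEGER NOT NULL
                   REFERENCES products(id),  -- foreign key defined
        quantity   INTEGER NOT NULL,         -- most restrictive type possible
        unit_price NUMERIC NOT NULL          -- atomic: no compound fields
    );
    INSERT INTO products VALUES (1, 'Widget');
    INSERT INTO order_lines VALUES (1, 1, 2, 9.99);
""")
```

Every column is atomic and depends on the primary key alone, and there are no repeating groups: multiple products on an order become multiple rows, never columns named `product1`, `product2` and so on.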
With simplicity comes strength
The data model must always represent the actual objects used in the application, using names that are in plain English and can be understood by users and developers alike. Every table in the database represents a concrete, specific thing in the real world. For example, it is a mistake to collapse disparate ‘events’ into a single events table.
If customers can make a purchase, or a complaint, then these must be two separate tables. Trying to shoehorn different concepts into the same table leads to generic names that can be misinterpreted and columns that are mandatory in some records and optional, or even forbidden, in others. The data model would get more complex and would require more complex processing.
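A sketch of the two-table approach, again using Python's built-in SQLite module (the schema is invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two real-world concepts, two tables: each column is mandatory
    -- where it applies, and no 'event_type' discriminator is needed.
    CREATE TABLE purchases (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      NUMERIC NOT NULL     -- every purchase has an amount
    );
    CREATE TABLE complaints (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        description TEXT NOT NULL        -- every complaint has a description
    );
""")
```

In a single generic `events` table, `amount` would have to be nullable (it is forbidden for complaints) and `description` likewise (it is forbidden for purchases), so the database could no longer enforce either rule.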
Working within defined constraints
Every table has a primary key. You should always define primary and foreign key constraints in the database, because then the database ensures that there are no broken links.
Always use strong data-types. If a column value is supposed to be an integer, use the database to define it as an integer, so that the application can safely assume that it is an integer and not keep checking every time it tries to do arithmetic with a value from this column. The same applies to floating-point numbers, exact decimal numbers, dates, timestamps and booleans.
When a column cannot be null, tell the database when creating the table and the database then ensures that this column is never null.
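The payoff of declaring these constraints is that the database, not the application, rejects bad data. A minimal sketch with Python's built-in SQLite module (which, unusually, enforces foreign keys only when asked via a pragma; the tables are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checks
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
""")

# A broken link: order 1 points at a customer that does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (1, 999)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # FOREIGN KEY constraint failed

# A null in a NOT NULL column is refused just as firmly.
try:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, NULL)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # NOT NULL constraint failed
```

Because the database refuses both inserts, every application reading these tables can safely assume that the links are intact and the names are present.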
Never use database ‘views’
Views are sold as a way of fighting off change by making the database look like an earlier version, to avoid changing an outdated application. There should be one current version of the database and every application that uses it must keep up. That may sound harsh, but the alternative leads to complexity, confusion, errors and poor performance. Nobody wants any of those. A bit of work now is always cheaper than managing a decaying system.
Use a small number of tools and use each one only for what it is good at. Use the database for managing data and use the application language to write applications. Do not use the database ‘stored procedure’ language for writing applications, not even parts of applications.
Implications for the database products we choose
The working practices described above mean that only real relational databases will do the job: PostgreSQL, MySQL, Oracle, Microsoft SQL Server, DB2, SAP/Sybase SQL Server and just a few others.
The NoSQL (Not Only SQL) databases will not do the job because they have limited functionality and reliability. Different products have different shortcomings: some have weak querying, several have been known to lose data, most use data structures that lack rigour and performance is usually patchy.
Also the Hadoop1 ‘ecosystem’ will not do the job. Apache makes very clear in its documentation that Hadoop is not suitable for transactional systems. This rules out Hadoop for most applications, since most of them are transactional systems. Hadoop is used more often for data warehousing systems, where large batches of data are loaded in bulk, and most of the interaction with the system is through read-only queries. For these cases Hadoop is actually more complex, more expensive, less reliable and slower than conventional relational databases.
The experience which drove this assertion came from a project with a team of about 30 skilled engineers, on a large and powerful cluster, and with support from the vendor for two years. There was a comparable system, with the same data, running on a single Microsoft SQL Server database and managed by just one person. The SQL Server system was a fraction of the cost of the Hadoop system and it was faster.
This experience is supported by David DeWitt’s benchmark (see references). It is surprising that the cost and complexity are so high, since the claims of the companies promoting the Hadoop ‘ecosystem’ are the opposite.
What about blockchain?
Blockchain is another tool that will not do the job. Recent articles in this publication2 have described the blockchain architecture and highlighted the hype surrounding blockchain. A significant reason to avoid blockchain is its wilful squandering of natural resources. You can find many more reasons in ‘Blockchain – Examining the Technical Architecture’ (see references).
Is there a better way?
One day, there may be a better way of handling large amounts of transactional data in an accurate and reliable way. If such an innovation comes along, it will be exciting and should be welcomed, but this has not happened yet. Currently, relational databases are the best tools for most data management jobs and none of the trendy alternatives are anywhere near as good.
We can waste resources by using heavyweight waterfall methods to build our systems. We can also waste resources by using more complex and unsuitable tools. Or, we can use disciplined agile development practices with high-performance relational databases, to build flexible and reliable applications quickly. As things stand at present, this should be our preferred approach.
1 The Apache Hadoop software is open-source and free, but, like most corporates, this example used a specialist company to provide packaged releases of the ‘Hadoop ecosystem’ and to provide technical support, training and consultancy. These services, of course, are not free.
2 ITNOW – Winter 2018 focused on Blockchain. See also Patrick McConnell in the Spring 2019 issue Blockchain – Examining the Technical Architecture.