How to design a database

Database design should be agile, iterative and highly disciplined, writes Ron Ballard FBCS.

Why agile? So that users get what they want and so that they get it quickly. Why iterative? So that we can check frequently that we are going in the right direction. Why highly disciplined? Because, for almost all applications, the data matters. It must be accurate, complete, up-to-date and accessible.

The process

Understand what your users need. You cannot design a database in an ivory tower. Your design must reflect what the organisation is actually doing and what it wants to do in the near future. You have to understand this and you have to go with it as it changes.

Step by step. Design what is needed for the current iteration (or ‘sprint’). That’s it - you don’t need to design the whole database up front. As with other software engineering, you should minimise the number of connections and dependencies between different parts of your database. This helps you to build useful sections of the model before the whole thing is complete. Build a bit, test it, build another bit, test everything you have so far, repeat.

Test your design. Talk through each of your users’ requirements and make sure your design enables each of these to be met in a straightforward and efficient way. Check again as you build each part of the database.

Work with the developers. See how they are using your design. Is anything giving them trouble? Can you fix it? Should you fix it? Can you advise them how to use your design in a better way?

When things change, be prepared to change your design. If you think of a better way of doing something, change your model. Support the changes with scripted, tested migrations.

Repeat. This process will work very well if you use the right tools and apply the right disciplines and skills.

The right tools

You need to use a proper relational database. That means PostgreSQL, MariaDB, MySQL, or (if you like spending money) Oracle Database, Microsoft SQL Server, SAP/Sybase SQL Server, IBM DB2, Teradata, Snowflake. A few others are available and a few new database products appear from time to time. Don’t believe the hype; check what they do before using them.

Most organisations have less than a terabyte of structured data. There is no point in trying to follow the big social media companies who have exabytes of unstructured gossip. An exabyte is a million terabytes. Your organisation is, almost certainly, a tiny fraction of the size of Google or Facebook and your data is, almost certainly, highly structured, with dates, timestamps, decimal numbers (for money), integers (for counts of things), and short character strings (for names of people, places and products).

The social media companies have very poor data quality - their data is gossip from people who make mistakes when they type their posts, sometimes they post lies. The social media companies mitigate the poor data quality to some extent with massive computing power scanning massive amounts of data. They find enough grains of truth in there to sell advertising. Normal organisations cannot afford such processing power and cannot afford to guess. I want my bank account records, my medical records, and most other records, to be 100% accurate.

The NoSQL databases won’t help you, because they are not as robust, functional, or reliable as the relational databases; they are harder to use and they don’t perform as well.

A data lake is even worse than the NoSQL databases and when they tell you that you don’t need a schema because they use ‘schema-on-read’, remember that this means that your data is undefined - you have to pick it apart every time you access it. The schema is the detailed definition of your data structure, which we begin to describe below.

The right disciplines and skills

I said that you shouldn’t try to do all the design up-front, but you still need to apply the disciplines and skills, at every iteration.

The disciplines and skills are a set of principles that are necessary to make a relational database work well.

A database consists of data organised into rectangular tables, with columns and rows; each table is a bit like a spreadsheet. But in a database, every row has the same columns - you can’t have some rows with five columns and some with seven, for example. Also, the column headings are stored in special tables that you don’t normally see - they are managed by the database and called ‘metadata’ (data that describes data). The metadata records not only the column names, but also the type of data that each column can hold. If you describe a column as an integer, then it can only hold whole numbers.

You can also ask the database to make sure that none of the cells in this column are empty; if you do that, you can be sure that every cell in that column has a value. Other columns may hold dates, timestamps, Boolean values (true/false), character strings up to the length you specify, real numbers (for latitude and longitude, for example), decimal numbers (for money), and some other defined ‘data-types’.

One value in each cell

Each cell (the intersection of a row and a column) can hold only one value at one time. Of course you could use character strings that hold something like:

Diagram of a data table displaying vehicles by ID and description

How do you pick out the manufacturer, or the registration number? Do you split the string at the spaces? Will that work?

Diagram of a data table displaying vehicles by ID, make, model and registration

If you split the data into its component fields when the user enters it, then you never have to ask those questions again - you can find the make or model or registration number without having to parse the description or match it against lists of makers or models or registration numbers, as in the vehicle table:

Diagram of a data table displaying Make, Model and Registration of vehicles

Every row has a unique key

In the vehicle table above, we have called the first column ‘ID’. This is a unique identifier (or ‘primary key’) for each row. Every table we create should have a primary key and we should label it as a primary key, so that the database ensures that it always has a value and is always unique. You don’t have to call it ‘ID’ and you don’t have to make it an integer, but these are good choices.

Every column depends on the primary key, and nothing else, not like this:

Diagram of a data table displaying Name, Birth Year and Engine Size

Unlike in the person table, every column depends on the primary key, and nothing else. In a table of people, each person does have a name and a year of birth, those things definitely depend on the person identified by the ID, but the Engine size does not depend on the person. The person may have more than one car, or none, or may share a car with someone else.

If you do want to record information about people and their cars, you do it as shown below:

Diagram of data tables displaying information about people and vehicles

Here, the person table has information only about the person; the vehicle table has information about the vehicle. We can see that Jane has two cars and Sarah does not have a car. The Person_ID in the vehicle table shows us which person owns the car.

But what if we want to show that a car belonged to different people over time (or even at the same time)? For that we need a different structure, as shown below:

Diagram of data tables displaying information about people, registered keepers and vehicles

This shows one table for the people, one table for the vehicles and a table of links, the registered keepers of the cars over time. Here we can see that Dumas sold the Jeep to Jane on 18-Sep-2018.

These are the only structures you need in a relational database. We see many other structures invented in live databases, but they are mistakes - they are unnecessary and only make queries more complex.

Learning

Whenever I meet people who show that they know how to use relational databases, I ask them where they learned this. The answer is never in school, never in university, (unless they went to university a long time ago) sometimes on courses and mostly by working with other people who know this and by reading books and very carefully selected web-postings.

I find it shocking that we are letting people loose on our important data without giving them the tools to do it properly or the knowledge of how much harm they can cause by doing it badly.

We need to improve education and training for people who build database systems. There is an opportunity for BCS to have a big, positive impact here.

In this short article, it is not possible to give enough examples to cover the questions that every database designer will ask themselves and their colleagues as they try to make the data model match what the users need. Neither is it possible to make the case for my assertion that these simple constructs are enough to describe very nearly all the systems used in businesses, governments and charities, however my book, Relational Databases for Agile Developers, attempts to do this in plain English. For more information, visit: http://thedatastudio.net/