It is unlikely that anyone would want to use machinery from the 1970s in any kind of production situation, but the same is not true of data - nor of all the software. David Holdsworth, recently retired from Leeds University Information Systems Services, explains.

My own university values its historical cosmic ray data, some of it originally processed on KDF9 in the 1960s. The DWP, previously DHSS, seems to know about my graduated pension contributions from the 1960s, about which I'm rather glad.

IT professionals are waking up to the fact that they have lost digital information from the early history of computing that would now be of real historic interest. This is sad, and historians would be justified in complaining that IT professionals were too busy looking towards the new, with the result that the old was prematurely discarded.

Many of today's records are born digital and can be of vital long-term importance. Not only is much important information born digital, its whole life is digital. For example, the design data of a new nuclear power station will originate in a CAD system - a CAD system likely to have been obsolete for many years by the time the station is decommissioned.

So, when we look to the future, we need to look beyond the immediate Vista.

Digital preservation is about keeping data in such a way that its significant content can still be extracted and understood long after the systems that gave it birth have been superseded. The trick is to do this cost-effectively.

It seems to me that obsolete data can be categorised:

  1. Things that must be kept by law;
  2. Things that must be destroyed by law;
  3. Things that we choose to keep;
  4. Things that we are certain can be thrown away;
  5. Things that we would like to keep if we have room;
  6. Things that we would like to throw away, but are not sure about;
  7. Things that we think we have kept but cannot find;
  8. Things that we have kept but now cannot decipher;
  9. Things that we have not kept, but now wish that we had.

Storage technology's apparently unstoppable advances really do deal with categories 1 and 3. Categories 2 and 4 present no long-term issues. Categories 5 to 9 all represent regrets and failures of one kind or another. Our aim is to minimise the amount of data that ends up in categories 5 to 9.

Categories 5 and 6 can be minimised by making storage very cheap, because doubts can be eliminated by putting stuff in category 3. Categories 7, 8 and 9 are products of past mistakes, and illustrate the dangers faced by today's data.

The key to long-term survival of our data is to keep it in such a way that when the time comes to access it, we can find it, read the medium upon which it is written, and (if necessary) convert it to a format that we can process with current software.

Moving with the times - hardware

All media eventually become unreadable, or at least difficult to read. Often this is caused by the unavailability of appropriate reader hardware, long before the medium itself has physically deteriorated.

A finite stream of bytes is conveniently represented as a file, which can be copied with ease and can be accommodated within all of the technology of today, and of the foreseeable future.

We can preserve a file indefinitely merely by copying it onto new technology media as its current storage medium becomes obsolete. Some digital objects are not obviously in the form of a file (e.g. an audio CD), and need to be converted into a file in such a way as to preserve their significant properties.

Once we have our digital object in the form of a stream of bytes, we can copy it as many times as we like, and the copies are all indistinguishable from the original byte-stream.
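That indistinguishability is easy to check in practice: compare cryptographic digests of the byte-streams. A short sketch (the file names and data here are illustrative, not from any real archive):

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path: str) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Create a small "original" byte-stream, copy it twice (a copy of a
# copy), and confirm all three are byte-for-byte identical.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "original.dat")
with open(original, "wb") as f:
    f.write(b"historic cosmic ray data, 1960s\n" * 1000)

copy1 = os.path.join(workdir, "copy1.dat")
copy2 = os.path.join(workdir, "copy2.dat")
shutil.copyfile(original, copy1)
shutil.copyfile(copy1, copy2)

assert sha256_of(original) == sha256_of(copy1) == sha256_of(copy2)
```

Storing such a digest alongside each object also lets a future curator verify that decades of media migration have not corrupted a single byte.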

Moving with the times - software

We also need reasonable expectation of being able to do something useful with digital information that we have stored over the decades. There is widespread agreement that the perils of media obsolescence and decay are avoided by copying stored data onto newer technology media.

There is quite widespread disagreement about the best way to avoid the perils of software obsolescence. All digital data needs some form of software to give it meaning to human beings, and without that accessibility of meaning there is no clear value in retaining digital data.

Whatever indexing system we use in our long-term data store, we need to record meta-data describing the nature of each digital object that we store, technical data about its format, and information about its purpose, origins, etc.
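As a concrete illustration, a minimal meta-data record might look like the following. The field names are my own invention for the sketch, not drawn from any particular schema, though an OAIS-style archive records much the same kinds of information:

```python
import json

# A hypothetical meta-data record for one stored digital object,
# covering format, provenance and fixity. Field names are illustrative.
record = {
    "identifier": "archive/1968/cosmic-ray-run-042",  # hypothetical id
    "title": "Cosmic ray counter readings, run 42",
    "format": {
        "name": "KDF9 paper-tape image",   # technical format description
        "registry_ref": None,              # e.g. a format-registry entry
    },
    "provenance": {
        "origin": "Leeds University, KDF9 installation",
        "created": "1968-03-14",
    },
    "fixity": {
        "algorithm": "SHA-256",
        "digest": "0" * 64,  # placeholder; computed at ingest time
    },
}

print(json.dumps(record, indent=2))
```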

The widespread disagreement concerns the choice between (a) repeatedly converting the data so that it is easily processed by current software systems, and (b) storing the data as a faithful byte-for-byte copy of the original, and having meta-data that tells us which software to use to convert from the obsolete format as and when we wish to gain access to the data.

My strong personal preference is for option (b).
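The essence of option (b) can be sketched very simply: the original bytes are never touched, and a registry maps each stored format to a conversion routine that is invoked only at access time. The format name and converter below are hypothetical placeholders:

```python
# Option (b) in miniature: convert from the obsolete format on demand,
# leaving the stored byte-stream untouched. The converter here is a
# stand-in; a real one would decode the genuinely obsolete format.

def kdf9_tape_to_text(data: bytes) -> str:
    return data.decode("ascii", errors="replace")

CONVERTERS = {
    "kdf9-paper-tape": kdf9_tape_to_text,
}

def access(original_bytes: bytes, stored_format: str) -> str:
    """Convert from the obsolete format only when access is requested.

    Because the byte-for-byte original is never modified, its checksum
    or digital signature remains verifiable indefinitely.
    """
    try:
        converter = CONVERTERS[stored_format]
    except KeyError:
        raise ValueError(f"no converter registered for {stored_format!r}")
    return converter(original_bytes)

print(access(b"GRADUATED PENSION CONTRIBUTIONS 1965", "kdf9-paper-tape"))
```

The maintenance burden then falls on the small set of converters, not on periodic wholesale migration of the archive's contents.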

I have also seen a recommendation for a third way, in which everything is stored in some wonder format capable of correctly representing every type of data that we wish to retain. Such an approach would surely be costly: designing such a format, revising the design to take on board new types of digital data, and maintaining its software tools all the while.

Opposition to option (b) tends to centre around the difficulty of keeping the conversion utilities working through the decades. I believe that this difficulty is grossly overestimated, and that option (a) carries with it a long-term risk of loss of data quality.

This comes about because of the inevitable compromises that are made in format conversion when the obsolete format has features that do not have clear parallels in the new format. Option (b) allows authentication using digital signatures, but for option (a) authentication is an unsolved problem.

Moving with the times - standards

Standards are the recommended antidote to quixotic suppliers always looking to differentiate their products from those of their competitors. The pre-eminent standard in the world of long-term digital preservation is the Open Archival Information System, known to its friends as OAIS, and to officialdom as ISO 14721.

It is a generic standard, specifying the recommended style of implementation of digital preservation rather than laying down detailed requirements. Read §2 for a good overview of the issues, and §4.2 for a structural model (in UML) for preservation meta-data.

Of the various file formats in use today, some are well-established standards (e.g. GIF, JPEG, MARC); others are merely the property of commercial software suppliers, who are free to discard them when they judge them to have outlived their usefulness from their own point of view - or who might simply go bust.

Sometimes such a format can attain such pre-eminence that others implement the ability to process it. Microsoft is unlikely to go bust very soon, but I remember that once upon a time there were claims that Digital Equipment Corporation was bigger than IBM. Standards are a good thing. I wish I could remember who first pointed out that we are fortunate to have so many of them.

The recent development of format registries, giving details of many different file types, is an invaluable initiative that enables the IT community to retain knowledge vital to the future processing of data in obsolete formats.

It is open to an organisation to mandate that only data in a certain set of formats will be taken into long-term storage, but some organisations have to take data in whatever format the publishers choose to supply it.

Any attempt to define a set of acceptable formats runs the risk of excluding important material. Such a policy also runs the risk of stifling innovation.

Moving with the times - no nasty surprises

Our business continuity strategy should aim to exploit the advantages of the new without losing the assets of the old - an approach to business continuity that does not stifle innovation.

Think how you would retain access to and ownership of your vital data if the supplier of your data management software went bust. Look to systems where the integration is via durable interfaces. This will enable your systems to evolve without periodic uncomfortable revolutions.

Be wary of committing data to software with secret internal data formats. Fortunately, really widely used secrets get cracked. Look back a decade or two and think how to deal with data from that era today. Twenty years ago we had little idea how today’s IT would look. Today we have little idea of how IT will look a few decades hence.

Further reading