Spurious quantification and data quality

It has been observed that 88.2 per cent of all statistics are invented on the spur of the moment. Putting a number on something is often preferred to making a qualitative assertion, with good reason. It’s been a while since I’ve written about that old friend Garbage In, Garbage Out (GIGO). In the last few weeks I’ve seen a number of presentations and reports that have drawn this to my attention.

I commend a recent report by Chris Yiu, of the think tank Policy Exchange, on Big Data in Government. One of its recommendations is the need to create a new class of ‘data scientists’ to support policy development and evidence of impact. The idea seems to be important in addressing some of my concerns over the hyping of the big data opportunity.

The IT industry is full of wonderful examples of meaningless statistics.

One of my favourites is the claim that automatic web translation is over 80 per cent accurate. My gripe with this claim is the lack of a description of what is meant by 100 per cent. For example, translate the following phrase into German: ‘Now is the discount of our winter tents’. Or try this into French: ‘Into every reign, a little life must fall’.

The figure of 100 per cent, as far as I can see, excludes any humour, irony, rhyme or poetry, or any cultural content. Take that out of the Shakespeare canon and there’s not much left. It doesn’t feel like 80 per cent to me.

If I ever endowed an award, the Yapp prize would be for the most inappropriate invocation of Moore’s law. The current winner by a mile was a claim I saw for a prototype software agent. It was now 50 per cent accurate and, applying Moore’s law, it would be 100 per cent in two years! Presumably it would be 200 per cent accurate in four.

As an aside, if you ever want to rile me, the claim that X is 600 per cent thinner is a good start. The most recent variation was a claim for green credentials by cutting packaging by 400 per cent.
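For anyone who wants the arithmetic spelled out, here is a minimal sketch in Python. The function names and the figure of 100 grams are my own illustrative assumptions, not anything taken from the claims themselves. It shows that a cut of more than 100 per cent leaves you with a negative amount of packaging, and that the charitable reading (the new pack is a quarter of the old one) is a 75 per cent cut.

```python
def after_reduction(before: float, percent: float) -> float:
    """Amount left after cutting `before` by `percent` per cent."""
    return before * (1 - percent / 100.0)


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Hypothetical starting point: 100 grams of packaging.
original_grams = 100.0

# Taking "cut by 400 per cent" literally gives a negative weight.
literal_result = after_reduction(original_grams, 400.0)

# Charitable guess (an assumption): the new pack weighs a quarter of the old.
charitable_cut = percent_reduction(original_grams, original_grams / 4)

print(f"Literal 400 per cent cut leaves {literal_result} g")
print(f"A pack one quarter the weight is a {charitable_cut} per cent cut")
```

Anything above a 100 per cent cut has no physical meaning for a quantity that cannot go below zero, which is the whole of my objection.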

In much of what I read and hear about big data, there is a big focus on open standards as a prerequisite. That’s not what I object to, far from it. My concern is the assumption that the ‘content’, the data itself, is accurate.

Let me wade into the unemployment statistics debate. If you are self-employed, then as far as the unemployment statistics are concerned, you are employed. That is not true, however, for benefits or pensions. Recently, 196,000 jobs in further education were reclassified as private sector. Yet FE colleges are treated as public bodies for the purpose of EU procurement rules. By that logic, outsourcers providing public services are public sector. So, if you want to look at the rate at which the private sector is growing, shrinking or changing, then you need more than open standards for data encoding.
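To see how much a change of definition can move a headline number, here is a rough sketch in Python. Only the 196,000 reclassified jobs come from the figures above. The starting size of the private sector and the number of genuinely new jobs are hypothetical values I have made up purely for illustration, and the names are mine.

```python
RECLASSIFIED_JOBS = 196_000  # from the text: FE jobs moved to 'private sector'

# Hypothetical figures, invented for illustration only.
private_jobs_start = 20_000_000
genuine_new_jobs = 100_000


def growth_rate(start: int, end: int) -> float:
    """Percentage change from `start` to `end`."""
    return (end - start) / start * 100.0


# Headline view: the reclassified jobs are counted as private sector growth.
headline_end = private_jobs_start + genuine_new_jobs + RECLASSIFIED_JOBS
headline_growth = growth_rate(private_jobs_start, headline_end)

# Like-for-like view: the same definition at the start and the end.
like_for_like_end = private_jobs_start + genuine_new_jobs
like_for_like_growth = growth_rate(private_jobs_start, like_for_like_end)

print(f"Headline growth:      {headline_growth:.2f} per cent")
print(f"Like-for-like growth: {like_for_like_growth:.2f} per cent")
```

With these made-up inputs the headline growth comes out at roughly three times the like-for-like figure, and no data encoding standard would tell you which of the two you were looking at.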

I may be missing something, but it isn’t intuitively obvious to me that open garbage is inherently superior to proprietary garbage. Indeed, the claim that by using open standards you can mix open garbage from multiple sources worries me. That feels more likely to create something toxic than to create insight. That’s why the suggestion that we need data scientists to create that insight is one I support.

In a world of constant change, it is useful to find a few fixed points, the eternal verities.

One that has never let me down is the following observation: Statistics to a politician are what a lamp-post is to a drunk, more for support than illumination.

Some years ago I was called in to assist in a review of an NHS project. The DoH was concerned about rising costs of the internal market and a consultant’s report was circulated on suggested actions to cut bureaucracy. There was concern that the project costs and timescales might be impacted by this action. Perhaps we would have to re-engineer the processes, redesign workflow or the like. Mid-way through a project this could be a big problem.

There was nothing to fear! The targets were met by re-designating nurse managers as nurses and a few other tweaks. Sure enough, the bureaucracy was reduced.

A few months later I was part of the team when a health minister came to visit. In his speech he congratulated us all for the success of the project. He announced that he was pleased at the contribution it had made to reducing bureaucratic costs. Oh frabjous joy!

The strength of Chris Yiu’s paper is that it understands the real world and is ambitious in its intent, without viewing that world through rose-tinted glasses.

As Bill Clinton should have said: ‘It’s the data, stupid’.