The commercial value of big data

When it comes to science and engineering, there is no doubt in my mind that ‘big data’ presents both challenges and opportunities.

The initial difficulties over the IPO at Facebook have led many commentators to question the underlying assumption that the possession of large amounts of personal data can automatically be monetised. Much of the focus has been on the shift from the PC to the mobile platform and the effectiveness of advertising on a mobile platform.

There is considerable hype around the field of social media analytics.

I’d like to suggest a thought experiment to test whether the ‘story’ makes sense and what issues may arise in attempting to realise the commercial value of big data.

Before doing so, let me offer some caveats.

First, in intelligence work, it is knowing what you are looking for in a huge amount of data that makes a difference. In retrospect, after 9/11/2001, the fact that a group of individuals were interested in flying lessons but didn’t want to take off or land turned out to be crucial. That information was known to the security services but its significance was not understood. Hindsight is a wonderful thing.

Similarly, the authors of SuperFreakonomics point out that a marker for a suicide bomber in a group of suspects is the absence of life insurance.

The problem is that, now that these two relationships are understood, it is unclear how behaviour will change to avoid detection.

So I think it is important that the value of the information, or insight, is there in advance and is not changed by its becoming widely known.

So, to my thought experiment:

Imagine that I am happy to make all my digital data open and available to a single data mining platform. The claim I want to test is that using this data, it is possible to create commercial value out of influencing my decisions. I am not suggesting that every decision is manipulated or influenced, but over a year enough decisions are impacted to justify targeted ads, coupon offers and other incentives to make the use of the digital data profitable to the platform owner.

Now imagine that you are the platform owner. A supermarket wants you to influence people’s choice of Sunday lunch so that they buy its offerings.

First, out of the millions of users, what data would you use? What model would you use to select the best 100,000 targets? How would you know that model would be effective in advance?

For me, here is the first unproven part of the story. For the commercial advocates of ‘big data’, there must be algorithms that can trawl the data and produce better outcomes, that is to say more cost-effective ones, than traditional advertising. Where is the evidence that such algorithms exist? If they do exist, how will they be created, evaluated and improved upon?
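One way to make ‘more cost effective’ testable is to compare the cost of each extra sale against a control group that never saw the campaign. The sketch below is only an illustration under my own assumptions: the function name, the idea of a randomly held-back control group, and every number in the usage note are hypothetical and do not come from any real campaign.

```python
def cost_per_incremental_sale(treated_n, treated_buyers,
                              control_n, control_buyers, campaign_cost):
    """Cost of each *extra* sale attributable to a campaign.

    Assumption: the control group is a random sample of similar users who
    were not shown the campaign, so its purchase rate estimates what the
    treated group would have bought anyway.
    """
    # Sales we would have expected from the treated group with no campaign.
    expected_without_campaign = control_buyers * treated_n / control_n
    incremental_sales = treated_buyers - expected_without_campaign
    if incremental_sales <= 0:
        # The campaign bought no extra sales at all.
        return float("inf")
    return campaign_cost / incremental_sales
```

With invented figures, `cost_per_incremental_sale(100_000, 5_500, 100_000, 5_000, 20_000)` works out at 40 per extra sale. The same calculation run on a traditional campaign gives the number the targeted one has to beat, and the claim only holds if it does so consistently.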

One problem is that in a huge data set there may be many spurious correlations, and the difference between causation and correlation is hard to prove.
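A small simulation shows how easily this happens. In the sketch below every ‘feature’ and the outcome are pure random noise, so any correlation it finds is spurious by construction. The function names, the sample sizes and the 0.2 threshold are my own illustrative choices, not anything taken from a real platform.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_people=100, n_features=1000, threshold=0.2, seed=0):
    """Count random features whose correlation with a random outcome
    exceeds the threshold in absolute value."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    hits = 0
    for _ in range(n_features):
        # Each feature is independent noise, unrelated to the outcome.
        feature = [rng.gauss(0, 1) for _ in range(n_people)]
        if abs(pearson(feature, outcome)) > threshold:
            hits += 1
    return hits
```

With 100 people and 1,000 meaningless features, one would expect a few per cent of them to clear the threshold by chance alone. The more columns the platform holds, the more of these false leads it will turn up.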

Imagine that the supermarket was trying to push white meats rather than red. One ground might be healthier lifestyles and a reduced risk of obesity. Listening to a medical statistician some time ago, I learned that all sorts of possible factors may appear relevant. Let me illustrate with one: my shoe size. The statistician pointed out that large shoe sizes were a marker for Type 2 diabetes. He pointed out that if you looked at the figures across the whole population, no marker was apparent, but if you split them out by ethnicity, some link might be discerned. However, how useful would my shoe size (adjusted for my being Caucasian) be in your attempt to influence me to buy chicken rather than beef?
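The arithmetic behind ‘invisible in the whole population, visible in a subgroup’ is easy to sketch. The counts below are invented purely to show the mechanism. They are not the statistician’s figures or any real medical data, and the group labels ‘A’ and ‘B’ and the function names are hypothetical.

```python
# (group, marker_present) -> (people, cases). Invented counts, for
# illustration only; not real medical data.
EXAMPLE_COUNTS = {
    ("A", True): (100, 20),
    ("A", False): (100, 10),
    ("B", True): (100, 10),
    ("B", False): (100, 20),
}


def rate(rows):
    """Overall case rate across a list of (people, cases) pairs."""
    people = sum(p for p, _ in rows)
    cases = sum(c for _, c in rows)
    return cases / people


def rate_difference(counts, group=None):
    """Case rate with the marker minus case rate without it.

    With group=None the whole population is pooled; otherwise only the
    named group is used.
    """
    with_marker = [v for (g, m), v in counts.items()
                   if m and group in (None, g)]
    without_marker = [v for (g, m), v in counts.items()
                      if not m and group in (None, g)]
    return rate(with_marker) - rate(without_marker)
```

Here `rate_difference(EXAMPLE_COUNTS)` is zero for the pooled population, while `rate_difference(EXAMPLE_COUNTS, "A")` is positive and the figure for group B is negative. A model looking only at the pooled figure sees nothing, and one looking at subgroups has to know in advance which split to make.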

One approach of the big data advocates is the use of genetic algorithms. Suppose that an algorithm exists that shows that shoe size works for 1 per cent of the population to influence their choice of Sunday lunch. How will that help in creating a targeted campaign for, say, swimming to encourage healthier lifestyles? If a campaign using the algorithm generated a profitable return, how would you learn from it to improve success on similar campaigns?

Running through this example, I come to the conclusion that the value lies in the model, the algorithms, more than the data. I am sceptical that such models can be created in the first place.

Now, the above example was based on the idea of all my information being available to a single platform. Once the information is fragmented across multiple platforms, with different individuals using differing settings, finding useful models to influence behaviour seems much harder.

All of this leads me to conclude that the claim that more data is better is difficult to sustain, and that the commercial value is unproven.

Now, once we have 10-20 years of information, we may well find some useful tools and insights into the data. Advertising does work, but far from perfectly. The idea that, with the current state of knowledge and tools and with current data, commercial value in excess of traditional advertising can be created seems more like wish fulfilment than a provable proposition.

Now that doesn’t mean that people shouldn’t be experimenting with the new platforms, nor that such value will not be created in time. If I am right, I think we may not see sustainable value and proven working models for a decade or more. Sustaining hype for that long seems like a tall order. Waiting until the models are proven is very high risk too.

In ‘The Black Swan’, Taleb refers to an experiment where one group was given just the share price and another group was given wider information. Their trading performance was compared. The group with less information got the better performance.

It feels to me that today 80 per cent plus of the investment is in the tools and technology and 10 per cent plus is in the meaning of the data. That may need to be reversed if the dream is to be realised.

So what we need is the right model and the right data. Importantly we have to know what these are in advance to sell the commercial value of big data.

Look at the vast literature on automated stock market systems and comparison studies on group versus automated versus hybrid systems on performance. All we are trying to do is guess whether a share might go down or up and by how much. We have 100 years of data and the best we’ve achieved is that these models work till they don’t. Momentum-based models that outperformed the market consistently 1997-2007 now flop as there is no momentum.

Now with five years of big data, the claims are being made that we can influence my purchase of food, books, cars, holidays and all manner of things using big data.

That leads me also to the conclusion that the commercial value of big data sits not in generic social media sites but in focused sites such as LinkedIn. Because, in that case, the data is about work and jobs, the information is more focused and more likely to be useful in targeting individuals with ads.

So, for me, the likely long-term vision of social media is a set of interoperable specialist platforms. That feels right, as the balance between privacy and openness seems more manageable in the long term. I will share work data with a different group from friends’ data or family data.

It makes me think about where Google+ Circles will go next!

Sceptic, moi?