Through The Panama Papers, we’ve been learning more about the offshore tax haven activity of the global élite.
The organisation behind this epic data scoop - all 2.6 terabytes and 11.5 million documents of it, a dataset far larger than anything WikiLeaks or Snowden worked with - is the International Consortium of Investigative Journalists (ICIJ), a network of reporters committed to breaking stories of global public interest.
While the world continues to discuss this financial scandal and work out its implications for regulation, let’s consider the technology that helped the group land this story - and how it can be relevant to any enterprise working with data.
That technology is graph database software. Graph databases recommend themselves for huge projects of this sort - finding patterns in vast amounts of unstructured, ‘flat’ PDF data - because they are very adept at managing highly connected data and helping users pose complex queries.
That’s because instead of working with data the way a traditional relational business database does, graph databases use a simple network representation to define and store data: nodes (the entities), edges (the relationships between them) and properties (the attributes attached to either). Simple but powerful, this architecture makes them highly efficient at analysing interconnections between data, allowing journalists to ‘follow the money’ and spot a story, or a set of connections not previously visible, in ways they have never been able to before.
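To make the node, property and edge model concrete, here is a minimal sketch in plain Python. It is not how the ICIJ or any graph database product stores data; every name, relationship type and record in it is invented for illustration. It simply shows why ‘following the money’ becomes a matter of walking edges once data is held as a graph.

```python
from collections import deque

# Hypothetical toy data, invented for illustration only.
# Nodes: an id mapped to a dictionary of properties.
nodes = {
    "p1": {"label": "Person", "name": "Person A"},
    "c1": {"label": "Company", "name": "Shell Co 1"},
    "c2": {"label": "Company", "name": "Shell Co 2"},
    "a1": {"label": "Address", "name": "Address X"},
    "p2": {"label": "Person", "name": "Person B"},
}

# Edges: (source node, relationship type, target node).
edges = [
    ("p1", "OFFICER_OF", "c1"),
    ("c1", "SHAREHOLDER_OF", "c2"),
    ("c2", "REGISTERED_AT", "a1"),
    ("p2", "REGISTERED_AT", "a1"),
]


def neighbours(node_id):
    """Return (relationship, other node) pairs, ignoring edge direction."""
    result = []
    for src, rel, dst in edges:
        if src == node_id:
            result.append((rel, dst))
        elif dst == node_id:
            result.append((rel, src))
    return result


def find_path(start, goal):
    """Breadth-first search for a shortest chain of connected nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel, other in neighbours(path[-1]):
            if other not in seen:
                seen.add(other)
                queue.append(path + [other])
    return None  # no connection found


path = find_path("p1", "p2")
print(" -> ".join(nodes[n]["name"] for n in path))
# prints: Person A -> Shell Co 1 -> Shell Co 2 -> Address X -> Person B
```

In a relational design, the same question would typically need one join per hop, with the number of hops known in advance. Here the search just keeps following edges until the two people meet.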
Understanding relationships at huge scale
In fact, graphs echo the way humans intuitively think about and work with information. And once that data model is coded into a scalable architecture, a graph database is unsurpassed at surfacing the connections in huge and complex datasets. For example, as Mar Cabra, the ICIJ’s Data and Research Unit Editor, has stated, graph databases can process a large volume of highly connected data easily and efficiently.
Cabra also needed an intuitive solution that didn’t require any major intervention from data scientists or developers to operate, and the data discovery and analysis process had to be accessible to journalists around the globe, regardless of their technical competence.
Cabra found the answer in a graph database tool, Neo4j, which she says is a ‘revolutionary discovery tool’ that’s ‘transformed our investigative journalism process.’ Graph database tools excel at spotting relationships in data. As she says: ‘Understanding relationships on a huge scale is what graph techniques are so great at.’
‘That doesn’t mean journalists haven’t struggled to grasp a world where stories are mined out of databases and established through isolating connections in electronic audit trails’, she told graph database developers in London. But because spotting relationships inside data at scale cannot be done manually, graph databases have proved to be an essential tool. ‘It’s really worked’, she says. ‘Just by expanding dots, we found a lot of information that had not been found previously and connections we’d missed when we had just looked at the documents individually.’
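One plausible reading of ‘expanding dots’ is neighbourhood expansion: start from a single node and repeatedly reveal whatever is directly connected to the nodes already on screen. The sketch below assumes that reading. It is not a description of the ICIJ’s actual tooling, and the graph and names in it are hypothetical.

```python
# Hypothetical toy graph as an adjacency map; all names are invented.
graph = {
    "Person A": {"Shell Co 1"},
    "Shell Co 1": {"Person A", "Intermediary Y"},
    "Intermediary Y": {"Shell Co 1", "Shell Co 2"},
    "Shell Co 2": {"Intermediary Y", "Person B"},
    "Person B": {"Shell Co 2"},
}


def expand(visible):
    """Return the nodes revealed by expanding every currently visible node."""
    revealed = set()
    for node in visible:
        revealed |= graph.get(node, set())
    return revealed - visible


visible = {"Person A"}
steps = []
while True:
    new = expand(visible)
    if not new:
        break  # nothing left to reveal
    steps.append(sorted(new))
    visible |= new

for i, new in enumerate(steps, start=1):
    print(f"expansion {i}: {', '.join(new)}")
```

Read one document at a time, Person A and Person B never appear together. After four expansions they sit on the same picture, which is the kind of connection Cabra says was missed when the documents were looked at individually.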
At the start of a major data journey
Whatever else we can be sure of as the Panama Papers story continues to unfold, it’s only with world-class tools like graph databases that vast and complex datasets like this can be investigated. The great news is that graph databases benefit far more people than investigative journalists like Mar Cabra and her team.
For years, for example, big web firms like Google and Facebook have used graph technology to derive insight and value from massive amounts of data. Data is their differentiator, and their business models depend on increasingly sophisticated ways of working with information. But while graph technology is central to what they do, the same kind of analysis of large volumes of unstructured data is now available to all of us - from a startup trying to disrupt a market to brands trying to use data to provide a better service.
The Panama Papers were important, but we haven’t seen anything yet in terms of solving data and relationship problems at huge scale. And I suspect graphs are going to be right at the heart of it.