Open source intelligence

Alex Healing and Simon Thompson, BT

Web 2.0 lets users contribute content as well as consume it. Content is being generated and distributed by a wider and deeper sector of society, from bloggers in Basingstoke to Wikipedia contributors in Wutan. The conversations on the modern internet are meant to be two-way - as well as talking, it is important to listen - and new technologies are being developed to enable organisations, developed on Taylorist principles of hierarchy, to identify, filter and navigate through the sea of views, discussions and reviews that have suddenly started to slosh around the global information infrastructure.

We call the ability to tap into and exploit user-generated content on the internet 'open source intelligence'. Alex Healing and Simon Thompson report.

On the live web we identify three types of open information publishing systems: centralised, exemplified by Wikipedia; decentralised, as observed in the so-called blogosphere; and distributed, as seen in the support, feedback and interest communities that have evolved, for example, around online games.

Wikipedia is a centralised mechanism: articles are initiated, edited and community-reviewed within the bounds of the Wikipedia domain. This provides the degree of auditing and quality assurance appropriate for an encyclopedia. A study by Nature found that the quality of its scientific articles was only slightly lower than that of Encyclopedia Britannica[1]: a panel of experts identified on average around four inaccuracies per Wikipedia entry, compared with around three per Britannica entry. But Wikipedia is delivered worldwide and for free, and edits and updates to entries are distributed immediately.

Figure 1. Three types of open information publishing systems as observed on the web in 2008.

In contrast to the centralised nature of Wikipedia, the blogosphere is a completely decentralised, loosely coupled and unstructured information publishing system that has seen explosive growth. There are over 100 million blogs worldwide and around 175,000 new blogs published every day[2]. Loose associations of blogs can be identified from the co-referencing of one author to another, but no formal relationship exists between blogs; they are separately authored and completely independent. The blogosphere is the Wild West of the internet; it is anarchic, unregulated and unreliable, but open, fresh and unrestrained.

The final class of mechanism is the distributed information ecosystem. These ecosystems grow up around, and are dependent on, a centralised source of information and events: a common topic of interest that the group identifies itself with and is seeking information about. The topic can be a product, hobby, scientific subject or celebrity. A good exemplar of this type of system is the ecosystem that has evolved around the online game World of Warcraft (WoW).

Key to the structure of the WoW ecosystem is the game itself and the databases that it uses to define the activities of players. These information sources are published and updated in a traditional product release cycle. Subsequently, updates and changes are analysed in blogs and on forums, and then re-edited and republished to wider communities via news sites. The community maintains reference sources in the form of guides, information articles and databases that provide commentary on items, quests and activities in the game. Other ecosystems have emerged around Harry Potter, DC Comics, the EU Constitution and various presidential candidates.

Grasping the open source intelligence opportunity

An organisation can acquire no intelligence if no one produces relevant user-generated content, so it is in an organisation's interest to facilitate content creation. Hubbub is a tool that we built to enable a community to create a knowledge base for customer support, while managing the brand risk and keeping the community focused on a particular topic. In contrast to most existing online forum systems, Hubbub encourages users to access other users' posts via a natural language query. This way we learn more about what the customer's issue is and can better link the query to both support agents and community members who can help.
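As an illustration only, the idea of reaching other users' posts through a natural language query can be sketched as a simple word-overlap ranking. This is not a description of how Hubbub works internally, which is not covered in this article; the example posts, the stop-word list and the function names are all hypothetical, and cosine similarity over word counts is an assumed stand-in for whatever matching a production system would use.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "my", "to", "i", "it", "and",
              "of", "on", "in", "how", "do", "does", "not"}

# Hypothetical community posts used only for this illustration.
POSTS = [
    "Router keeps dropping the wireless connection every evening",
    "How to set up voicemail on a new phone line",
    "Broadband speed is slow after the router firmware update",
]


def tokenize(text):
    """Lower-case the text and return its words, minus stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_posts(query, posts):
    """Return (score, post) pairs with a non-zero score, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in posts]
    matches = [s for s in scored if s[0] > 0]
    return sorted(matches, key=lambda s: s[0], reverse=True)
```

Calling rank_posts("my wireless connection keeps dropping", POSTS) returns only the first post, because it is the only one that shares any content words with the query. The same query text could also be routed to support agents, which is the linking step described above.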

Comprehending user-generated content with a view to adopting the knowledge that it contains is the next part of the open source intelligence opportunity. Work in the emerging field of visual data mining has proved a promising approach to integrating and categorising unstructured information. Cyclone is a tool that visually clusters the information sources fed into it based on their similarity, and allows this clustering behaviour to be refined through successive user interactions. The tool aids the transition from a corpus of information loosely structured in a so-called folksonomy to a formalised taxonomy, which is necessary both for the analysis of the information to scale and for combining it with any existing organisational information structure.
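The clustering-with-refinement idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Cyclone's actual algorithm: it assumes each source is described by a set of folksonomy tags, uses Jaccard similarity with a hypothetical threshold, and models a user interaction as an override of the similarity between two named sources. All names and data are invented for illustration.

```python
from itertools import combinations

# Hypothetical sources, each described by a set of user-supplied tags.
SOURCES = {
    "blog_a": {"broadband", "router", "wifi"},
    "blog_b": {"router", "wifi", "setup"},
    "forum_c": {"billing", "invoice"},
    "forum_d": {"invoice", "payment", "billing"},
    "blog_e": {"wifi", "payment"},
}


def jaccard(a, b):
    """Jaccard similarity of two tag sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(items, threshold=0.3, overrides=None):
    """Group items whose similarity reaches the threshold (single-link).

    items: dict mapping a source name to its set of tags.
    overrides: dict mapping frozenset({name1, name2}) to a similarity
        supplied by a user; it replaces the computed value for that pair.
    """
    overrides = overrides or {}
    parent = {name: name for name in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        sim = overrides.get(frozenset((a, b)), jaccard(items[a], items[b]))
        if sim >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in items:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())
```

With the sample data, cluster(SOURCES) yields three groups: blog_a with blog_b, forum_c with forum_d, and blog_e on its own. If a user indicates that blog_e belongs with forum_d by passing overrides={frozenset({"blog_e", "forum_d"}): 1.0}, blog_e joins the billing group. Because the grouping is single-link, an override that lowers one pair's similarity does not guarantee separation if the two sources remain connected through a third.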

The final part of the open source intelligence opportunity is noticing interesting activity on the internet and acting on it. DebateScape is a tool that we have developed that uses RSS and content scraping to monitor sources of user-generated content on the web, such as forums, and to convert activity on them into events that are sent as notifications to subscribing applications. In the context of BT's business needs we have constructed a customer service infrastructure that allows the management of support groups that can reach out to and help customers who are posting problems on third party forums or on blogs.
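The pattern of converting feed activity into events for subscribers can be sketched as follows. This is an assumption-laden illustration rather than DebateScape's design: the class name, the event fields and the sample feeds are hypothetical, only RSS 2.0 items are handled, and the feed text is passed in as a string so that fetching and content scraping are left out.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same forum feed, taken at different times.
FEED_V1 = """<rss version="2.0"><channel><title>Example forum</title>
<item><title>Broadband down again</title>
<link>http://forum.example/t/1</link><guid>t-1</guid></item>
</channel></rss>"""

FEED_V2 = """<rss version="2.0"><channel><title>Example forum</title>
<item><title>Router will not sync</title>
<link>http://forum.example/t/2</link><guid>t-2</guid></item>
<item><title>Broadband down again</title>
<link>http://forum.example/t/1</link><guid>t-1</guid></item>
</channel></rss>"""


class FeedMonitor:
    """Turns new items in an RSS 2.0 document into events for subscribers."""

    def __init__(self):
        self._seen = set()        # identifiers of items already reported
        self._subscribers = []    # callables that receive event dicts

    def subscribe(self, callback):
        """Register a callable to be invoked once per new item."""
        self._subscribers.append(callback)

    def process(self, rss_text):
        """Parse one snapshot of a feed and notify subscribers of new items."""
        root = ET.fromstring(rss_text)
        new_events = []
        for item in root.iter("item"):
            # Use <guid> when present, otherwise fall back to <link>.
            key = item.findtext("guid") or item.findtext("link")
            if not key or key in self._seen:
                continue
            self._seen.add(key)
            event = {
                "id": key,
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            }
            new_events.append(event)
            for callback in self._subscribers:
                callback(event)
        return new_events
```

A subscribing application registers a callback with subscribe() and receives one event per previously unseen item each time process() is given a fresh snapshot. Processing FEED_V1 and then FEED_V2 produces one event for each thread, since the item already reported is skipped on the second pass.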

Barriers

There are a number of barriers to open source intelligence.

  • Legal - In some spheres the use of other people's content for your own purposes is illegal; for example a DJ sourcing the freshest new music to remix into a track would wind up in court in short order.
  • Privacy - Closely related to legal issues are those of privacy. Inference from open information stores may reveal personal information about individuals that was not intended to be published.
  • Economic and social - Motivations for all participants in a production system are changed once one of them stands to gain economically; if organisations are perceived as exploiting information publication systems, the content creators that drive them may withdraw.
  • Technical - Some of the problems of open source intelligence are grand challenges of computer science research. For example, natural language understanding is an open problem and the semantic web effort is widely agreed to be far from realisation.

Open source intelligence: a reality

User-driven information publishing systems have enabled consumers to generate and distribute their ideas and views globally and persistently. Systems of quality control, indexing and advertising have emerged to allow users to navigate through the information produced by these collaborative systems. New tools are being developed that enable, in a pragmatic way, traditional organisations to interact with these resources and the communities that drive them. We call this activity 'open source intelligence'. However, we are well aware of the wide range of initiatives and labels that others apply to this movement: social networking, user-generated content and crowd-sourcing are just a few examples. Organising and explaining these as a coherent whole is beyond the scope of this article and beyond the intellectual capacity of the authors at this time.

What is needed is rigorous scientific work that gives us a proper understanding of the fundamentals that drive the development and use of these tools. The dynamics of social networks and online communities of creators are poorly understood. The interaction of people and the technology of communication has only been studied since the technology came into existence, and this kind of study has generally been sidelined in the rush to grab the benefits that these technologies offer.

We need to develop a foundational understanding of the principles of wide area communication and publishing systems and of the people who use them to interact in both time and space. For this reason, BT has become a founding sponsor of the Web Science Initiative[3]. This group will work to understand the reciprocal relationship between the web and society, and will enable us to measure and analyse the potential and usefulness of ideas like open source intelligence with the degree of rigour that is required to justify substantial future investment.

References
1. Giles J (2005) Internet encyclopaedias go head to head. Nature 438, 900-901.
2. Technorati. About Technorati (http://technorati.com/about).
3. O'Hara K, Hall W (2008) Web Science (http://eprints.ecs.soton.ac.uk/15682/1/OHara-Hall-ALT-N-Web-Science.pdf).

Alex Healing is a senior researcher in the Centre for Information and Security Systems Research at BT where he works on the application of AI techniques to enable autonomic behaviour in future ICT systems.

Simon Thompson is a chief researcher in the Intelligent Systems Research Centre at BT. He has worked on a variety of research projects sponsored by BT and by national and international funding agencies.