From data storage to information retrieval

Tony Rose, vice-chair of the BCS Information Retrieval Specialist Group, looks at the management problems associated with the huge volumes of data that society produces.

It has often been said that we are currently living through an information revolution. The internet age has undoubtedly brought with it a massive increase in the volume of data being produced and stored worldwide, driven by an ever-increasing demand for communication networks and online information access.

For example, Computing magazine reports that global information storage grew by about 30 per cent between 1999 and 2003, and that during 2002 about five exabytes of new, unique data was stored on print, film, magnetic and optical storage (a volume roughly equivalent to 37,000 times the size of the book collection in the Library of Congress).

Moreover, when this volume is spread across the world's population, it works out at almost 800MB of recorded information produced for each person on earth in a single year.
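
The per-person figure can be checked with a rough calculation. The sketch below assumes decimal units (one exabyte taken as 10^18 bytes) and a world population of about 6.3 billion; the population figure is an assumption for illustration and does not come from the report cited above.

```python
# Rough check of the per-person figure quoted above.
EXABYTE = 10**18   # bytes, decimal definition (assumption)
MEGABYTE = 10**6   # bytes, decimal definition (assumption)

new_data_bytes = 5 * EXABYTE   # "about five exabytes" stored during 2002
world_population = 6.3e9       # assumed population, not taken from the article

per_person_mb = new_data_bytes / world_population / MEGABYTE
print(f"Roughly {per_person_mb:.0f} MB per person")
```

Under these assumptions the result falls just short of 800 megabytes, which is consistent with the figure quoted.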

So given the massive growth of all this data, the question is: how can it be effectively managed? In particular, what kind of retrieval strategies should we adopt to ensure that we can find the right piece of information at the right time?

Structured vs unstructured data

It is claimed that as much as 80 per cent of corporate information is stored in unstructured form, i.e. as text documents, emails, web pages, images, audio files etc.

This contrasts with structured information - data that is stored according to some predefined template or schema. Evidently in the latter case the schema is a core element of the retrieval strategy, facilitating precise query formulation and hence effective retrieval.

Conversely in the former case its absence has spawned a whole industry devoted to the categorisation, search and navigation of large quantities of text and other unstructured information, and a thriving academic research community dedicated to information retrieval (IR).
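
The difference a schema makes to query formulation can be shown with a small sketch. The table, column names, flight numbers and prices below are all invented for illustration; the example uses Python's built-in sqlite3 module.

```python
import sqlite3

# Structured data: a hypothetical schema lets the query state exactly what is wanted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_no TEXT, destination TEXT, price REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("XX100", "Rome", 89.0), ("XX200", "Rome", 140.0), ("XX300", "Oslo", 75.0)],
)
cheap_rome = conn.execute(
    "SELECT flight_no FROM flights "
    "WHERE destination = ? AND price < ? ORDER BY flight_no",
    ("Rome", 100.0),
).fetchall()

# Unstructured data: the same facts as free text, searched by keyword only.
notes = [
    "Flight XX100 to Rome costs 89 pounds",
    "Flight XX200 to Rome costs 140 pounds",
    "Flight XX300 to Oslo costs 75 pounds",
]
keyword_hits = [note for note in notes if "rome" in note.lower()]

print(cheap_rome)         # only the flight that satisfies both conditions
print(len(keyword_hits))  # every note mentioning Rome, whatever the price
```

The structured query can express the price condition directly, whereas the keyword match can only say whether the word 'Rome' appears, leaving the user to read the results and apply the price test by hand.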

Current retrieval practice

Most readers will no doubt be familiar with Google and other internet search engines: you type in a few key words to describe your information need, hit return and within a second or two are presented with a list of links to documents that you hope will be relevant to your query.

Evidently a proportion of them will indeed be relevant (the proportion of returned documents that are relevant is known as the precision of the search) and, if you are lucky, you may also find that all the known relevant documents are in the list somewhere (the proportion of all relevant documents that are returned is known as recall).

Of course, on the web we can never really calculate a true recall figure as there is simply no way of ever knowing how many relevant documents there are out there. But for a fixed collection such as a library or corporate database the recall figure can be a very important measure of a retrieval system's effectiveness.
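
For a fixed collection where the relevant documents are known, both measures are simple ratios. The following is a minimal sketch; the document identifiers are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query over a fixed collection."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # relevant documents that were returned
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical result list and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four returned documents are relevant (precision of one half) and two of the three relevant documents were found (recall of two thirds). On the web the second denominator is unknown, which is why a true recall figure cannot be calculated there.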

So how do search engines actually work? The objective of most commercial search engines is to measure the conceptual distance between your query and each document in the database, and then return those documents that provide the best match.

To do this they must employ some kind of model or representation for the documents and queries. However most current text retrieval technology is built around relatively primitive models that represent documents simply as unordered sets of terms (i.e. character strings) with numeric weights that determine their relative importance.

Moreover the matching process is often equally primitive, being based around a few basic statistical formulae that return a measure of how well one set of terms matches another.
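
One common instance of this kind of model is term weighting by TF-IDF combined with cosine similarity. The sketch below is illustrative only: the weighting formula, the whitespace tokeniser and the sample documents are assumptions, and it is not a description of any particular commercial engine.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokeniser (assumption): lower-case and split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Represent each document as an unordered set of weighted terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Rarer terms receive higher weights (one of several possible IDF variants).
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return document indices ordered from best to worst match."""
    vectors, idf = build_index(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)),
                    reverse=True)
    return [i for _, i in scored]

docs = [
    "the cat sat on the mat",
    "stock markets fell sharply",
    "the dog chased the cat",
]
ranking = rank("cat mat", docs)
print(ranking)
```

Note that nothing in this representation records word order or meaning: a document is matched purely on which strings it contains and how heavily they are weighted.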

Incidentally one reason why Google has been so successful compared to other search engines (apart from its adherence to a minimalist approach, concentrating on effective search when many other players were trying to become universal portals) is that it makes very effective use of what little structure there is within web pages: the hyperlinks.

By analysing the link structure of documents, and identifying which documents are linked to which other documents, it can develop a notion of the value of each document, independently of its relevance to any particular query.

Therefore by combining this information with the traditional term-based relevance, it can maximise the probability that only the best, most relevant documents will be returned to the user.
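
The idea of a query-independent document value derived from links can be sketched with a simplified PageRank-style iteration. Since the actual ranking algorithms are not public, everything here is an assumption for illustration: the damping factor, the handling of pages without outgoing links, the tiny link graph, and in particular the linear blend used to combine link value with term-based relevance.

```python
def link_value(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scores for a dict of page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its current value equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its value over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

def combined_score(relevance, value, weight=0.5):
    # Hypothetical blend of query relevance and link value.
    return weight * relevance + (1.0 - weight) * value

# Hypothetical link graph: three pages link to "c", which links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
values = link_value(links)
most_valued = max(values, key=values.get)
print(most_valued)
```

In this graph the page with the most incoming links ends up with the highest value, regardless of what any user happens to be searching for.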

Not surprisingly, Google and other web search engine companies keep the precise details of their ranking algorithms confidential, as part of a continual arms race with the search engine optimisation companies - companies that try to get their clients listed high up in the search rankings for particular key words.

Enter natural language processing

However despite the innovations of Google and others it is clear that a document is much more than simply a collection of terms: words can be combined into phrases with specific meanings dependent on their order (e.g. a 'Venetian blind' is not the same as a 'blind Venetian'); and phrases may then combine to form structural or discourse dependencies, or make co-references to each other and so on.

But as long as the fundamental unit of representation remains the 'bag of words', then much of this conceptual content will be lost.
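
The loss of word order is easy to demonstrate. In the sketch below, the bag-of-words representation cannot tell the two phrases apart, whereas a representation that keeps adjacent word pairs can; the helper functions are illustrative only.

```python
from collections import Counter

def bag_of_words(text):
    # Unordered term counts: word order is discarded.
    return Counter(text.lower().split())

def bigrams(text):
    # Adjacent word pairs: word order is preserved.
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

same_bag = bag_of_words("Venetian blind") == bag_of_words("blind Venetian")
same_bigrams = bigrams("Venetian blind") == bigrams("blind Venetian")
print(same_bag, same_bigrams)
```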

Inevitably the experience for the search engine user is that they are often presented with a list of irrelevant documents, and they must then endure the chore of inspecting each one until they find the one that addresses their information need.

Whilst this may be tolerable for the casual web user, it can often prove unacceptable for corporate clients or professional information researchers, particularly those in the legal or financial sectors, where the costs of erroneous or out-of-date information are particularly high.

Consequently much IR research effort in recent years has been directed toward developing more sophisticated representation models and matching algorithms, often based around natural language processing (NLP) techniques.

NLP technology can provide many of the basic building blocks for advanced search such as:

summarisation: the ability to produce a coherent summary or abstract of a document;
named entity recognition: the ability to identify key conceptual units within a document, such as the names of people, places, companies etc;
topic detection and tracking: the ability to follow different themes in a changing news feed;
word sense disambiguation: the ability to differentiate the particular senses a word may have (e.g. 'bank' as in 'the edge of a river' and 'bank' as in 'financial institution');
information extraction: a combination of the above and other techniques to enable specific patterns or facts to be extracted from text or other unstructured data (sometimes referred to as text mining);
machine translation: the ability to translate from one natural language to another.
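
To give a flavour of one of these building blocks, word sense disambiguation can be approximated by comparing the words around an ambiguous term with a short definition of each candidate sense, in the spirit of the simplified Lesk approach. The sense labels, definitions and stopword list below are invented for illustration, and real systems are considerably more sophisticated.

```python
# Hypothetical sense inventory for the word 'bank'.
SENSES = {
    "river_bank": "sloping land at the edge of a river or stream",
    "financial_bank": "financial institution that accepts deposits and lends money",
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "at", "to", "on", "in"}

def content_words(text):
    # Lower-case, strip simple punctuation and drop very common words.
    return {w.strip(".,;:'\"").lower() for w in text.split()} - STOPWORDS

def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose definition shares most words with the sentence."""
    context = content_words(sentence)
    overlaps = {name: len(context & content_words(gloss))
                for name, gloss in senses.items()}
    return max(overlaps, key=overlaps.get)

sense_1 = disambiguate("He sat on the bank at the edge of the river")
sense_2 = disambiguate("The bank lends money to small firms")
print(sense_1, sense_2)
```

The approach succeeds only when the surrounding text happens to share vocabulary with the definitions, which hints at why robust disambiguation remains difficult.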

Yet despite many recent successes in NLP research (and the subsequent over-inflated claims of many search technology providers), we are still a long way from the Holy Grail of understanding the conceptual content of a document.

Consequently the many information professionals who rely on such tools will have to wait a little longer for an answer to their prayers, and the numerous AI researchers around the world need not fear for their jobs just yet.

Future directions

Despite all the efforts and investment put into search technology there is one aspect of IR that remains largely unchallenged: the notion that the objective of most commercial search engines is to return documents that provide the best match for a keyword query. But why should this be so?

After all, a great many information needs would be better expressed in the form of a specific question rather than a general statement of intent expressed as a set of key words. For example, if my information need is to answer the question, 'Who are the major search technology providers in the UK today?', I would rather ask precisely this and be given a concise list of company names in return than issue a keyword query and receive a set of documents through which I must wade to find the specific pieces of information I need.

But to support this functionality the search engine must employ much more sophisticated content models and matching algorithms from the most advanced NLP research in question answering.

Indeed some search engine companies have already established a brand identity or value proposition in precisely this area, AskJeeves (www.ask.com) being the best-known example (although its service relies on the work of significant numbers of human editors rather than exclusively on technological solutions).

Perhaps a more modest intermediate goal would be to focus on passage retrieval, i.e. the ability to return specific sections or paragraphs rather than whole documents so that the user can focus more immediately on the key sentences.
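
A very simple form of passage retrieval can be sketched by splitting a document into paragraphs and scoring each one against the query. The scoring used here (a count of distinct query terms present) and the sample document are assumptions for illustration; a practical system would use a weighted model.

```python
import re

def tokens(text):
    # Lower-case word tokens, ignoring punctuation.
    return re.findall(r"\w+", text.lower())

def rank_passages(document, query):
    """Return (score, passage) pairs, best first; passages are blank-line separated."""
    query_terms = set(tokens(query))
    passages = [p.strip() for p in document.split("\n\n") if p.strip()]
    scored = [(len(query_terms & set(tokens(p))), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

document = (
    "Search engines index billions of web pages.\n\n"
    "Passage retrieval returns the paragraph that best matches the query.\n\n"
    "Many corporate databases are hidden from web crawlers."
)
ranked = rank_passages(document, "passage retrieval query")
best_score, best_text = ranked[0]
print(best_score, best_text)
```

The user is then shown the highest-scoring paragraph first rather than being left to scan the whole document.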

Another potential challenge for search engine providers is multi-linguality. English may currently be the most popular language on the internet but its dominance is becoming less pronounced, and other languages (particularly Chinese) are growing rapidly.

Information searchers from these communities will inevitably want to be able to access English content using queries expressed in their own native languages and, to accommodate this, a search engine needs to support cross-language retrieval.

And of course there is the problem of how to access data that isn't visible to web search engines at all, such as the content of product catalogues, library catalogues, patent filings, flight schedules, biomedical data and so on that reside in corporate databases around the world.

Some have estimated that the major search engines index as little as one per cent of the known web, and the remaining content, with all its rich structure, remains inaccessible behind a wall of registration gateways and dynamically generated links.

Evidently the challenge for the next generation of web search engines is to find ways to mine this 'deep web' and take advantage of its vast quantities of structured data to provide meaningful, interactive views onto the search results.