Web intelligence

Professor Nigel Shadbolt of the University of Southampton gave the BCS RSI lecture and covered the extent to which intelligent web services are evolving to cope with diverse sources of information on a global scale.

With only a tiny percentage of users 'trained' in any way to use it, the World Wide Web - a world-spanning human construct and a truly disruptive technology - still offers new opportunities waiting to be gained.

Artificial Intelligence (AI) is an often used and often misunderstood term. 21st century AI is not like the movies - murderous robots and paranoid computers - but embedded, adaptive and smart software that lifts the burden of tedious and difficult, often repetitive jobs.

Moreover, AI is now gradually being woven into the web, ranging from software agents that bid in electronic auctions to smart search engines.

There has been progress in the more popularized areas of AI: AI principles are used in video controls; Microsoft embeds AI in a number of its applications; Sony builds on AI work in its Aibo robot dogs; and games software exploits AI in many of the behaviours generated by synthetic characters.

Science too has benefited: progress has included large-scale drug property modelling for pharmaceutical firms, advances in combinatorial chemistry, virtual reality applications to train physicians and enhanced scanning applications to map tumours. Yet most AI applications go unnoticed.

In common with much science and technology, one of the principal drivers for AI has been military applications.

Alan Turing worked on decoding wartime Enigma intercepts using the Bombe; John von Neumann used his game theory methods to analyse the cold war stand-off and worked on the Manhattan Project; Norbert Wiener, the founder of cybernetics, was also very keen on cross-functional research through his interest in biology and psychology.

Now AI is used in intelligent signal processing for satellite eavesdropping and machine translation; in data and information fusion to produce situational awareness; in automatic scheduling and planning for military back-office logistics; and in unmanned autonomous vehicles such as the Predator aircraft and Rodney Brooks' Packbot.

The development of true web intelligence is enabled by the power of the web, itself driven by the pace of technological change. For example, the growth of computational speed, power and ubiquity is staggering.

An accepted measure is the number of million instructions per second that US$1,000 can buy, a number that has increased a million-fold since 1978.

Moore's Law, which has largely held true for 40 years, has delivered orders-of-magnitude increases in processing speed and forced constant migration and system obsolescence.
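
A rough back-of-envelope check (an illustration added here, not a figure from the lecture) ties those two observations together: a million-fold increase is about twenty doublings, and twenty doublings over the 27 years from 1978 to 2005 implies

  2^{20} \approx 10^{6}, \qquad \frac{27 \times 12\ \text{months}}{20} \approx 16\ \text{months per doubling},

which is broadly the doubling cadence that Moore's Law describes.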

When the computer Deep Blue beat world champion Garry Kasparov at chess it demonstrated that, in addition to elegant solutions, brute force is a valuable computing asset. The victory required the evaluation of 100 million positions per second.

The World Wide Web itself demonstrates this growth's impact: 1990 saw the first web server; by 1992 there were 26, rising to 200 a year later. By 1998 there were 329 million pages available, by 2002 there were 665 million users, and in 2005 Google estimated that it had 8 billion pages indexed.

This network reaches every country on earth, yet only a tiny percentage of users are in any way trained. The web is ubiquitous, driven by largely open protocols and languages, and it employs a strong open-standards approach.

The way that humans and computers assess web pages differs hugely. When we see a page we know what the title is, what part constitutes a URL, what a picture of a speaker is; we can infer this information without being told.

The aim of the semantic web is to enable machines to draw these same deductions through the use of metadata and metacontent.

This meta-information can be compared to the 'angle-bracketed' information used in HTML tags, except that it will describe the concept of an object and its relation to other concepts rather than just how it is presented - its font, size, colour and so on.

The metadata will be able to show the underlying concepts that link a 'conference' to being an 'event', which may be linked to a 'publication'.
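
As an illustration only (not drawn from the lecture - the rdflib library and the 'ex:' vocabulary below are assumptions), a few lines of Python can record exactly that kind of relationship as machine-readable triples:

  # A minimal sketch of concept-level metadata using the rdflib library.
  # The 'ex' vocabulary and resource names are invented for illustration.
  from rdflib import Graph, Namespace, Literal, RDF, RDFS

  EX = Namespace("http://example.org/vocab#")
  g = Graph()
  g.bind("ex", EX)

  # Concepts and how they relate: a conference is a kind of event.
  g.add((EX.Conference, RDFS.subClassOf, EX.Event))

  # A particular resource described in terms of those concepts,
  # not in terms of fonts, sizes or colours.
  g.add((EX.conf2005, RDF.type, EX.Conference))
  g.add((EX.conf2005, RDFS.label, Literal("An example conference")))
  g.add((EX.conf2005, EX.hasPublication, EX.proceedings2005))
  g.add((EX.proceedings2005, RDF.type, EX.Publication))

  print(g.serialize(format="turtle"))

Printed as Turtle, the same triples become the kind of metadata a crawler or software agent could read alongside the human-readable page.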

So the semantic web is about an abundance of metadata to more richly describe content; it's about sharing meaning, not just bytes and bits.

Like human memory, which works in a very rich environment, indexing things in many different ways, the semantic web will enable the annotating of anything - web pages, publications, databases, academic papers - from multiple perspectives.

As an example, scientists now routinely annotate the origin of results from all sorts of fields: physics, chemistry, psychology, astronomy and so on.

In chemistry, using a rich annotation approach, a machine analysing metadata could actually propose chemical structures, which humans would then check for analogies and applications.

Indeed this kind of approach is becoming vital as there are now not enough humans to analyse the sheer quantity of information being produced.

This approach needs substantial cooperation, of course, and shows that the semantic web is about sharing meaning between communities, individuals and machines themselves.

This linking of information requires an ontological approach - the sharing of a common concept of a domain.

A real life example is the Human Genome Project, which developed shared languages to describe information, producing a schema to exchange information on the Web between databases.

Languages for describing agreed vocabularies will thus become the basis for the next generation web. So the W3C-approved Web Ontology Language (OWL) would sit above such things as RDF and XOL, which themselves evolved from HTML and XML.

The Advanced Knowledge Technologies (AKT) project (an EPSRC-funded collaboration between the Universities of Southampton, Aberdeen, Edinburgh and Sheffield and the Open University) aims to build the support tools and methods for the next generation web.

The background to this work is the knowledge life cycle, defined as: acquire, model, reuse, retrieve, publish and maintain, sometimes concluding with discarding or purging some content.

The aim of this research is the production of intelligent web services: for example, a service that could take a publication and then classify and index it. The system would be trained on sample papers and would then classify new documents against its defined taxonomy.
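
A toy sketch of how such a classification service might work (purely illustrative - the scikit-learn library, the sample papers and the three-term taxonomy below are all assumptions, not the AKT tools themselves):

  # Train a simple text classifier on example papers, then classify a new one
  # against a small, invented taxonomy. Uses the scikit-learn library.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  # Training documents (abstracts, say) and their taxonomy labels.
  papers = [
      "ontology languages for describing shared vocabularies on the web",
      "annotating gene sequences in human genome project databases",
      "scheduling and planning for autonomous military logistics",
  ]
  labels = ["semantic-web", "bioinformatics", "planning"]

  # Turn text into term weights, then fit a simple probabilistic classifier.
  classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
  classifier.fit(papers, labels)

  # Classify a new, unseen publication against the defined taxonomy.
  print(classifier.predict(["metadata and ontologies for web annotation"]))

In practice the training set, the taxonomy and the features would be far richer, but the shape - train on examples, then classify against an agreed set of categories - is the same.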

Another service could be web annotation, which would need a way of associating rich annotation descriptors with existing web pages. A semantic web search service would be another possible new application.

Such a service would return results based not on keyword rankings but on relational detection - what the page is actually about, rather than the number of occurrences of specific words or word groups.
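
To make the contrast with keyword counting concrete, here is a small, self-contained sketch of a 'relational' query using rdflib and SPARQL (again with an invented vocabulary and data, not the AKT services themselves):

  # Find resources that *are* conferences with an associated publication,
  # regardless of which words happen to appear on the page.
  from rdflib import Graph

  g = Graph()
  g.parse(data="""
      @prefix ex:   <http://example.org/vocab#> .
      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

      ex:conf2005 a ex:Conference ;
          rdfs:label "An example conference" ;
          ex:hasPublication ex:proceedings2005 .
  """, format="turtle")

  results = g.query("""
      PREFIX ex: <http://example.org/vocab#>
      SELECT ?thing ?pub WHERE {
          ?thing a ex:Conference ;
                 ex:hasPublication ?pub .
      }
  """)
  for row in results:
      print(row.thing, row.pub)

A full semantic search service would of course gather and index such metadata at web scale, but the query asks about what a page is, not which words it contains.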

For these services to work an integrated view is required. Take mammography as a medical example: at present the results from MRI scans, biopsies, X-rays and so on are described in different languages and terms.

These disciplines would need to interact via an agreed ontology which would allow medical practitioners to communicate effectively and produce aligned image analyses, classifications and natural language reports, all integrated by semi-intelligent web services via the internet. The benefits to patients would be huge.

With the high speed of change and superabundance of content, web intelligence urgently needs to be established through web services here in the UK, to the benefit of the military establishment, the scientific community, our industrial and business base, and society at large.

For more information on the AKT Project see www.aktors.org.

In a nutshell

  • AI is embedded, adaptive, smart software that lifts the burden of tedious and difficult, often repetitive jobs.
  • The development of web intelligence is enabled by the power of the web and the pace of technological change.
  • The aim of the semantic web is to enable machines to assess web pages like humans do, through the use of metadata and metacontent.
  • This approach is becoming vital as there are not enough humans to analyse the quantity of information being produced.

Three enduring and commonly held IT truths arising from Moore's Law:

  • Any computer you can afford to own is obsolete
  • Anything state of the art, you can't afford
  • State of the art to obsolete takes a microsecond.

This article first appeared in the May 2005 issue of ITNOW.