Multimedia Editor Justin Richards recently interviewed this year’s Lovelace Medal winner, Professor Georg Gottlob, who is currently leading an international group on web data extraction and wrangling.

Can you tell me a little bit about your current role and what your position entails?

In Oxford, my current role is Professor of Informatics, and I’m leading a group there on web data extraction and on data wrangling. Data wrangling is about bringing data together from different sources, unifying it, reasoning about it, and preparing it for further processing such as analytics and machine learning.

The project on data wrangling is called VADA. VADA stands for value added data, or value added data systems. What we are doing is reasoning over data: it’s less about data extraction and more about what you do with the data, and how you can reason on top of it.

You get data from different sources, you put it together and you mine new knowledge from it. We are using machine learning and rule-based reasoning methods to derive new knowledge on top of this. It’s a huge project, about £6 million, spread over Edinburgh, Manchester and Oxford, and I’m the Principal Investigator. So that’s my current role in Oxford.

Companies like Google, Amazon and Facebook have suddenly noticed that it’s not only pure data, but also knowledge, that is important. Therefore, they have created big knowledge graphs, into which they try to incorporate knowledge from Wikipedia, knowledge expressed as rules: ontological knowledge. For example, the fact that the President is a person is not just data, it’s a kind of rule.

If X is the President, then X must be a person; it cannot be, say, a dog. Wikipedia implicitly contains a lot of rules and relationships like this, because it also provides characterisations and classifications. So, for a person you see the different classes to which that person belongs, along with subclasses, parent classes and so on. And there are many other sources of knowledge that you can use.
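
To make the idea concrete, here is a minimal Python sketch of forward-chaining over such ontological rules. This is purely illustrative, not VADA or any production knowledge graph system; the entity name and the rule set are made up for the example.

```python
# Illustrative forward-chaining over simple "is-a" rules (toy example only).
facts = {("president", "alice")}     # a single extracted fact: alice is a president

# Rules of the form "every <body> is a <head>", e.g. every president is a person.
rules = [
    ("president", "person"),
    ("person", "agent"),
]

# Apply the rules repeatedly until no new facts can be derived.
changed = True
while changed:
    changed = False
    for body, head in rules:
        for cls, entity in list(facts):
            if cls == body and (head, entity) not in facts:
                facts.add((head, entity))
                changed = True

print(sorted(facts))
# [('agent', 'alice'), ('person', 'alice'), ('president', 'alice')]
# The "person" and "agent" facts were never stated explicitly; they are derived.
```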

What will you be talking about during your upcoming Lovelace Lecture?

I will be talking about one of my major projects, which is web data extraction, and also about reasoning on top of data.

Web data extraction is often required commercially, when a company needs data for decision-making. If you are an electronics retailer, for instance, the data could be the prices at which your competitors sell the same items; or if you are a hotel chain, you would like to know the prices of competing hotels in the same areas.

If their prices are going down, you need to consider whether you also have to drop your prices; otherwise people will go to the other hotel. The same applies to supermarkets and so on.

So, where can you find this information? There are different ways of doing it. One is to write a screenscraper for particular websites: in this case, the competing supermarkets. You write a screenscraper, and all the data that is relevant to you comes automatically into your database, where you can query it and see whether you need to react. Maybe there is some product whose price they are lowering, and so you have to respond to this.
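
As a rough illustration of what such a hand-written screenscraper can look like, here is a generic Python sketch. It is not the Lixto or Wrapidity software mentioned later; the URL and the CSS selectors are hypothetical placeholders that would have to match the actual structure of the page being monitored.

```python
# A hand-written price scraper for one hypothetical competitor page.
# The URL and the CSS selectors are placeholders and must be adapted
# to the real HTML structure of the target site.
import requests
from bs4 import BeautifulSoup

def scrape_prices(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".product"):            # one element per product listing
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        if name and price:
            rows.append({"name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})
    return rows

if __name__ == "__main__":
    for row in scrape_prices("https://competitor.example/groceries"):
        print(row)   # in practice these rows would be loaded into your own database
```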

When I was working in Vienna, one of my main research goals was to lay the foundations of web data extraction. What is web data extraction, what does it mean, how can one do it? What is the logic of it, and how can one build an efficient and user-friendly system for web data extraction? Such a system would have an embedded browser, so you can navigate to websites of interest and click on the things that you would like to extract.

Then the system would generalise from these examples and continue to work on the webpage: every time there is new information on the page, the system would recognise it and send it to you. Such data extraction systems are called interactive wrapper generators for web data extraction.
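
A toy sketch of the idea behind this kind of generalisation (a simplified illustration only; real wrapper generators are far more sophisticated): take the path to the element the user clicked on and drop its positional indices, so the resulting pattern matches all analogous elements on the page. The page content below is invented for the example.

```python
# Toy wrapper generalisation: from one clicked example element to a pattern
# that matches all similar elements on the page.
import re
from lxml import etree

page = etree.fromstring("""
<ul id="offers">
  <li><span class="price">99</span></li>
  <li><span class="price">79</span></li>
</ul>""")

clicked = page.xpath("//li[1]/span")[0]          # the element the user "clicked" on
path = page.getroottree().getpath(clicked)       # e.g. '/ul/li[1]/span'
pattern = re.sub(r"\[\d+\]", "", path)           # generalise: drop positional indices
print([el.text for el in page.xpath(pattern)])   # ['99', '79'] - all matching elements
```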

In Vienna, back in 2001, I founded a company, Lixto, jointly with my post-docs, which developed a system that would allow a manager or a computer engineer to very quickly build screenscrapers for particular websites. Lixto was acquired in 2013 by McKinsey, who now use this software heavily for many of their customers.

Later, at Oxford, again with bright post-docs, we investigated how wrappers for websites, in certain application domains, such as real estate, used cars, restaurants, and so on, can be built fully automatically based on formalised knowledge about these domains. Based on our research, we created the Oxford spin-out, Wrapidity, which was recently acquired by the Meltwater media intelligence corporation.

What are the biggest challenges that your area of research is facing at the moment?

The biggest challenge in our current research is to create a knowledge graph management system that is, at the same time, very fast, very capable of handling big data, and able to integrate two important parts: machine learning on the one side and reasoning on the other. So, combining machine learning and reasoning is, I would say, the biggest challenge. It’s like the left part of the brain and the right part of the brain.

Machine intelligence has the same paradigm as human intelligence. In human intelligence, you have two types of reasoning. There is intuitive reasoning, called sub-symbolic reasoning, which is done just by your neural network without knowledge input. For instance, a baby learns to crawl with its arms. Nobody tells the baby ‘you should do this’; it just learns it. The baby’s neural network adapts.

And that, basically and a bit oversimplified, is the right-part-of-the-brain type of thinking. It’s automatic. A very good example is person recognition: a baby can recognise different people, but nobody has explained to the baby how to do this. It learns it by itself, and that would be pure machine learning.

At some point, parents start to talk to us, and we acquire knowledge through language: symbolic knowledge. We essentially get a piece of precompiled knowledge, because no person can rediscover everything by themselves via autonomous learning. For example, we cannot learn mathematics from scratch just by ourselves, so mathematics is taught to us, because others have developed and elaborated it, and it is now in an appropriate symbolic form.

And we process symbolic knowledge a lot. We use one part of our brain (some say the left part) for making plans and for reasoning, and another part (some say the right part) for sub-symbolic rapid decision-making and for everything that is more intuitive.

And that’s exactly what the computer needs to do. To achieve a true level of artificial intelligence, the biggest current challenge is how to combine sub-symbolic AI with symbolic artificial intelligence. Sub-symbolic AI means things like person recognition, machine learning and neural network technology. Symbolic artificial intelligence is reasoning with rules, using a language, establishing theories, making assumptions and so on. These two branches of AI have co-existed for many years.

How do you see artificial intelligence progressing over the next five to 10 years?

I think machine learning has made huge progress. And if we could combine machine learning with symbolic AI, with its reasoning and rules, I think that would be exactly the next quantum leap forward. I think this challenge will be overcome.

The same goes for autonomously driving cars: you cannot just use machine learning. At some point, you need rules that can, for example, overrule an action the car has already started. At the moment this is done in a pretty ad hoc way, so we really need a new theory of how to combine rules, how to combine symbolic reasoning and decision-making, with sub-symbolic AI.
