Karen Spärck Jones is winner of the 2007 BCS Lovelace Medal. BCS managing editor Brian Runciman interviewed her. This interview also appears in the ebook Leaders in Computing.
By way of introduction, can you tell us something about your work?
In some respects I'm not a central computing person; on the other hand, the area I've worked in has become more central and important to computing.
I've always worked in what I like to call natural language information processing. That is to say dealing with information in natural language and information that is conveyed by natural language, because that's what we use.
I think that what has been happening is that those kinds of things that were initially thought of as external applications, rather like accounting packages, are becoming more central, and not just because more people are using browsers and search engines, but because the information they are working with is itself becoming much more central to what people do.
You could argue that this natural language stuff is the material of an information layer, part of a computing system not just on the periphery.
I can see systems, even operating systems and security, making use of the information that's in that layer. It may be informal information and not nicely coded up, but it may be usable all the same. Natural language isn't coded up for us, but it's there and we use it.
What's surprising looking back on the fifty years of BCS is how old some of the ideas are.
What recent developments by others have impressed you most?
I'm not an IT professional, but a researcher. I don't use a lot of things that people swear by now because they're not particularly pertinent for my work.
But I do think that the web has made a difference, and in my own area progress has been made. In AI we may not be able to do some of the things that were originally hoped for, like the ideas from the Dartmouth conference about simulating humans, but it has done other valuable things with simulation.
Progress is sometimes made in ways that people didn't predict. Basic ideas can develop slowly but sometimes things come along that are effective yet were unpredictable. Many people say what's exciting now is images and video but I think that's very overrated. It's nice to see them, but if you want to talk about them, what are you talking with? Words.
Nevertheless, what is important is that research and professionals are connected; computer science produces the stuff that professionals use. One thing that gets me steamed up about teaching in schools is that they don't realise what work goes into producing the stuff they use. Take spreadsheets - it's hard work to produce a good spreadsheet package, but if people only learn how to use them, and not what's behind them, we're missing a trick.
BCS is pursuing professionalism in IT - what are your thoughts on this?
I certainly think that professionalism is very important. I took part in one of the Thought Leadership debates, about security and privacy, and I was having an argument with a young fellow there. He was slightly surprised that I said that to be a proper professional you need to think about the context and motivation and justifications of what you're doing.
A true professional will think like that. With ID cards, for example, I was concerned that people would treat it just as an opportunity to do a good software job, if the government's got a sufficiently good idea of what it wants. But things like that have a fundamental effect on people's lives and being a true professional means that you must contextualise your work.
Is there an ethical dimension there?
I think there is. This chap, who in many ways was thoughtful, said that his organisation only thinks about what the spec is and whether they can do a good job of it - what I call the first layer of being professional. But the second layer is the rationale for what you're doing.
You see I could probably write a very good program for choosing people to be killed for some reason, selecting people from a population by a particular criterion. But you might argue that a true professional would say, 'I don't think I should be writing programs about this at all.'
The point is that there is an interaction between the context and the programming task itself. And as we know with the privacy debate, getting the system architecture right is extremely difficult. You need a deep understanding of what the whole thing is about to get that right and to appreciate that it still won’t be perfect.
You don't need a fundamental philosophical discussion every time you put finger to keyboard, but as computing is spreading so far into people's lives you need to think about these things.
The UK has a problem attracting students into computer science courses often due to a geeky image, what should we be doing about that?
There is more than one reason why people aren't attracted. One of them is that teachers say they are fed up with the emphasis on what you might call shallow IT skills in schools. Focusing on whether you can use a word processor or spreadsheet completely conceals what the actual things you use are like.
A very good example of that happened 10 years ago, but still applies. We were trying to get at girls in schools and we knew we had to get to the teachers first. We found that the spread of computing in the administrative and secretarial world has completely devalued it. When one of the teachers suggested to the parents of one girl that perhaps she should go into computing the parents said: 'Oh we don't want Samantha just to be a secretary.'
That's nothing to do with nerdiness, but with the fact that it's such a routine thing.
Then there is the nerdy thing and also people don't see the challenge of designing, building, implementing, testing and evaluating programs. There are plenty of things that are very, very hard to do. Think about someone who wants to model climate change - you've got to do more in the program than just take a few equations and churn them.
Nerds often don't do proper computing either - it's more geeky one-upmanship. Then there's this endless dreary games playing. They talk about the wonders of modern graphics but if you look at screens with games on them they're not really very realistic.
People, because computing is so routine, don't think about the whole social context. Think about the NHS stuff - if that worked it would affect how the entire health service ran, from the nurses to the consultants and all points in between. People have got to understand that these systems are embedded in our lives.
It's getting across the challenging, fascinating, technical things to do. How do you capture a problem so you can write a programme about it?
This year BCS is trying to improve the public understanding of IT. What do you think we should do to achieve that?
It's interesting - the challenge is to convey why things are worth doing, and why they are hard, in a simple way. Like tracking a patient through their entire medical life: what's important, how do we relate items to each other?
It's the technical challenge of understanding the task and its social context.
How do you feel about winning the Lovelace Medal?
I was stunned, I looked up previous winners and thought 'what am I doing in this bunch of people?' But I was especially pleased to see that I was the first woman to get it. Very nice. I really appreciate it.
Looking back on your long career, is there anything you would do differently given the chance?
It's hard to say. You can't predict in the beginning what's going to happen anyway. There's unpredictability for a variety of reasons.
One is that people find out that they can or can't do things that they thought they were going to be able to do and so that tends to cause people to change course. Alternatively, something can come along from the side and it can blow away what you were doing, or blow you away in another way, and you realise that's the really interesting stuff.
As a researcher you don't usually jump from one thing to another. But there are adventitious factors. When you're older you can choose to an extent what you do, but when you're younger if there's no money then you have to go where the money is. Many computing research areas suffer from a dearth of money.
For example, the funding agencies in the US cut out everything to do with translation for rather bad reasons in the mid 60s. Researchers don't throw away what they've done but cut their cloth accordingly and make a shift. It's complicated. Research is amorphous and has overlapping threads and sub-areas and people move for a variety of reasons.
What impact would you like to see your research having on everyday life?
One thing I did when I was working on document retrieval in the 60s was to work on automatic classifications. You found classes of words to make matches, but some of the experiments we did didn't work out as we thought and we were trying to understand why that was happening.
This caused me to develop the idea of weighting words statistically - looking at word frequency and in how many documents a word occurs. Because in general if a word appears in a lot of documents most of them are not going to be of interest, so you think of inverse document frequency weighting and I published a paper about that in 1972.
This was put together with another type of weighting, which was how often a word appears in a document. At that time there was no immediate application in operational systems. There were bibliographic services, where it could have made a difference, but they were hooked in things like Boolean searching and thesauri and so on.
The library world is very conservative and they only slowly picked up the idea of natural language searching. Researchers were convinced by the 1980s that statistical searching was a good thing to do and by 1992, 20 years later, two things happened.
Firstly, a large research program was started up called TREC (Text Retrieval Conferences) and that attracted attention because the collections were so large that people thought statistical techniques would work.
More importantly, what came along a little bit later was the web. It accumulated a lot of information, and the point of the web was that it was the computing community, not the library community, running it.
Mike Burrows, who originated Alta Vista, had a large usenet file that he wanted to retrieve from, and a colleague and I had written a little paper called 'Simple proven approaches to text retrieval' in 1994 which contained all the basic ideas of this research and the theories underpinning it. Mike asked Roger Needham about retrieving from his usenet stuff and Roger gave him the paper.
Mike read it and started from there. So he didn't read a lot of library stuff; he took the paper and started the first of the modern search engines. Pretty much every web engine uses those principles.
There was no way these things could be implemented in a useful way in 1972, but by 1997 it was working with the full text it needed. So these statistical ideas that I've contributed to in one way or another are spreading around in this modern computing world.
Who in the IT industry inspired you or was a role model for you?
In a way my first employer - I worked initially for the Cambridge Language Research Unit. It was run by a lady called Margaret Masterman, who was extremely eccentric and was the person who started CLRU with some rather original ideas about how to do machine translation. She got a grant and employed people. Roger Needham worked there too, during his PhD.
She'd been a student in Cambridge and had suffered from all the chauvinism of the Oxbridge model of academic life and she was a firm believer in making sure that women got an opportunity. She had no prejudices about these things, but de facto she encouraged me because she hired me.
She was not a role model in the way she worked and I disagreed with some of her ideas, but she was a role model in that she showed me there is nothing to stop women working in this area. At that stage there were no opportunities for women. You have no conception of how narrow the career options were.
I think to some extent my husband Roger (Needham) has been a role model too, but in a very different way. What he did was encourage me. When you're on your own in a subject and living on soft money, as I did till I was over fifty, that's very valuable. He was always encouraging me and I could always talk to him about my work.
What are the biggest challenges facing your discipline?
The main challenge in text retrieval is that it is a very large area, on the one hand represented by people like Google, on the other by all the skilled professionals who still use specialised classification languages and the like to do very specialised searches.
The main problem with web engines is that you don't get much in from the user. Typically, with a collection of billions of documents, you don't get far on a two-word query.
People have tried all sorts of carrots to get searchers to put in better queries or interact a bit more, like using feedback to bootstrap a better query. All it requires is for individuals to mark what is useful so that the next batch of documents is new.
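The feedback idea mentioned here, rebuilding the query from documents the user marks as useful, is usually formulated as Rocchio-style relevance feedback. A minimal sketch, assuming queries and documents are represented as dicts of term weights; the parameter values are conventional defaults, not from the interview:

```python
def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style feedback: move the query weights toward the
    documents the user marked useful and away from the rest."""
    def centroid(docs):
        # average term weights over a set of document vectors
        out = {}
        for d in docs:
            for t, w in d.items():
                out[t] = out.get(t, 0.0) + w / len(docs)
        return out

    new = {t: alpha * w for t, w in query.items()}
    for t, w in centroid(relevant).items():
        new[t] = new.get(t, 0.0) + beta * w
    for t, w in centroid(nonrelevant).items():
        new[t] = new.get(t, 0.0) - gamma * w
    # keep only positively weighted terms for the next search
    return {t: w for t, w in new.items() if w > 0}

query = {"blood": 1.0}
marked_useful = [{"blood": 1.0, "haemoglobin": 0.8}]
updated = rocchio_update(query, marked_useful, [])
# the expanded query now also carries weight for "haemoglobin",
# so the next batch of documents can differ from the first
```

The point matches hers: a small amount of user marking is enough to make the next retrieval round noticeably better than the first.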
The other challenge is how to integrate image retrieval - speech retrieval is not such a challenge because it can be transcribed well enough to do retrieval. But image is a different ball game. How can you find images 'like this one' if you haven't got one to start with? Many people are trying to tag images with text, but it's very difficult to evaluate the efficiency of different methods.
People build systems and throw them at the user and say: 'Isn't this fun?' But that's not the same as demonstrating that a system is better. Controlled experiments are difficult to do with real, live users and they are expensive.
Evaluation of ideas in any field is important. For example we can all have an opinion about a translation and spot a good one or bad one. But you could have two equally good ones - does that make a difference? Maybe one is better than another. How can we find out?
What are your thoughts on the semantic web?
I think the semantic web has modified its meaning, but the all-singing, all-dancing version, which is a model of the entire world of everything, I'll stick my neck out and say is fundamentally misconceived.
It's something that philosophers thought they could do in the seventeenth century, and Leibniz was no slouch, but they couldn't do it. For good reasons, too: you can't code up the world, it's not tidy like that.
What you can do is code particular worlds for particular people and purposes. That's what biological taxonomies are about, but that's for experts.
In some respects there's an analogy with expert systems: people say, if only we could get the expert's knowledge out of his head and coded up; and you can do that for a closed world. But you're likely to find yourself walking over the boundary of a closed world without realising it.
Say you want a specialist database on blood, a fluid. You don't include in that a lot of stuff, like if you drop fluid from a height it breaks up into drops. There's a lot of general information about fluids you wouldn't want to put in a haematological database.
But at some point people are assuming you know it. What's happening is that the semantic web stuff is going into knowledge representation, but there is a limit to what you can code up about the meaning of an ordinary language word.
We can't model everything. Many of the semantic web people are now thinking more about an upper layer and they are rediscovering some of the stuff that the AI people have already done. The model now is that you will have your specialised world models for specific domains and then a relatively solid bridging layer; a top layer that provides enough resource to get from one domain to another, but even that's hard to do.
It's hard to plug specialised knowledge bases into a general overarching layer. My model is to say you can have your specialised areas, but bridge in a lightweight way through words.
Take blood again. Let's just follow the word, don't go via blood is a fluid, blood is red, and has a temperature of x - the upper layer should be a much lighter connective structure, essentially a natural language approach.
Different domains do share vocabularies and words can mean different things, but they are similar enough that we can communicate.
Speech technology seems to have useful applications for those who are disabled.
These things are fine. Everyone thought things would be revolutionised in IT applications when we had speech recognition. But they didn't appreciate how slow speech is.
You can use transcribed speech faster than speech is uttered. I can scan a document very quickly, but if I read it to you it would take far longer. So the idea that we can throw away boring old text is completely unrealistic.
But that doesn't imply that speech interfaces aren't very useful for the disabled and people also want to transcribe speech - like the intelligence agencies.
If I wanted to rent a car abroad it would be jolly nice to pick up the phone and have my speech translated. All that's cool, but that's not the same as throwing away text for speech as the answer to everything.
What's your view on women in computing?
I think it's very important to get more women into computing. My slogan is: Computing is too important to be left to men.
I think women bring a different perspective to computing, they are more thoughtful and less inclined to go straight for technical fixes. My belief is that, intellectually, computer science is fascinating - you're trying to make things that don't exist.
In that respect it's like engineering, trying to build new things. Take skyscrapers - they had never existed and provided fundamental engineering challenges - weight, windforce and so on. We need women to see the intellectual challenges and social importance of computing, all of the things that computer systems are used for now and why it matters to society.
It seems to be a problem, perhaps more with girls than boys, that you've got to get them hooked young enough, then keep them hooked. If they are not interested by the time they are 13 you've lost it - I'm talking about girls who may go into the subject in some depth.
ID cards are a very good example of this. It's a fundamental notion - it will cause a person as a legal entity to have a particular definition.
Think about the implications of CCTV, another example, or health and education. Should we do all of our teaching via IT? What's the function of education - can this be achieved with IT?
What about climate change and sustainability? Think about the fact that most women drive, and traffic modelling is a growing area. How should driving be factored into charging for where you go, the convenience of routes and so on? All these things are part of the fabric of one's life.
So I've always felt that once you see how important computing is for life you can't just leave it as a blank box and assume that somebody reasonably competent and relatively benign will do something right with it.