Fighting cancer with multidimensional data analysis

Hao Zhang is CEO of Oxford Cancer Biomarkers, which has been developing a machine learning system to provide clinical decision support for clinicians treating colorectal cancer. He talks to Brian Runciman MBCS about the programme, big data, AI ethics, privacy issues and more.

BR: Tell us about your background.

HZ: I was essentially trained as an immunologist, although I did my undergraduate degree in physics - in fact, microelectronics. It turns out that the way you use super-resolution microscopy to study the architecture of computer chips or transistors is similar to how it is used in immunology, and laser manipulation technology can also be used in the life sciences.

Therefore, I was attracted by life sciences. The turning point was when I was in Cambridge in the Cavendish Laboratory, the university's physics department. The Nobel laureate Steven Chu, who was then head of the physics department at Stanford University and later became the US Secretary of Energy, came to Cambridge and gave a series of lectures on how to use physical techniques to study biological samples at the molecular level. I was really fascinated by how things can be done using physics techniques, so I basically spent two years learning biology, almost self-taught.

I got into Oxford to study molecular immunology, and I have stayed in the field since my PhD. As a postdoc I worked on cancer immunotherapy. Then I joined the pharmaceutical industry and worked at Roche and AstraZeneca, two of the largest pharmaceutical companies in the world.

BR: Tell us about your current role.

HZ: Now I'm working at Oxford Cancer Biomarkers. It was founded in 2011 by two professors at Oxford University and initially sponsored by Oxford University through its venture fund. The focus is on what we call precision medicine - tailoring the treatment to the individual patient’s specific circumstances. We work in the specific fields of digital pathology and pharmacogenomics.

In healthcare these days, we increasingly have more and more information, which has multiple layers - hence it is also called high-dimensional data. For example, your electronic medical record, family disease history, personal genomic data, proteomic data and other omics data. Omics aims at the collective characterisation and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms.

It’s increasingly common to include all of the proteins in the body - how they behave, whether there are any defects. Ultimately it is about the proteins, which are basically the manifestation of the genes, including defective genes. Therefore, understanding how the proteins behave is critically important. A typical example is neurological disease. At least one hypothesis is that a set of proteins misfolds and is deposited in the neural network in the brain, causing dementia.

Another set of data is the medical imaging and scanning used in oncology to get a definitive diagnosis. A common practice is to take a slice of tissue, a tiny amount of flesh from the tumour of origin, and then look at it under the microscope, sometimes staining it with different chemicals or molecules. With that spatial resolution, we know precisely what's going on. If you think about the gene, there's no spatial resolution; it is in general encoded in a linear fashion in the DNA sequence. But on that tissue, you would know how the tumour or cancer cells are interacting with the cells in the surrounding environment.

That information provides a lot of clinical insights, so our company right now is focusing on that last bit, the medical image, especially the pathology image - the bit of tissue taken from a patient and then scanned into a micrograph.

What we then do is produce an algorithm to automate the procedure of analysing how tumour cells interact with the cells in their surrounding environment and with the immune system, so that we can predict whether a patient’s cancer, after surgical removal, will come back. This can help clinicians to make critical decisions on treatment strategies and sequences, hence it is also called clinical decision support.

Algorithms

BR: Can you give us an example?

HZ: A very concrete example is a patient consultation on a stage-two cancer, which is relatively early stage. Common treatment is to surgically remove the tumour, but then the danger is in whether the cancer will come back. So far there are very few tools that would predict this.

So, what we're doing is looking at the features on the digital pathology image to be able to provide information on the likelihood of the cancer coming back after the surgery, so the clinician can tailor the post-surgical treatment based on this information. That's the core value of this product.

But of course, that's not the only way to predict whether the cancer is going to come back. For example, you can use molecular markers to stain the tumour cells, do the surgery, and precisely identify which are the tumour cells in order to remove all of them.

We use artificial intelligence to automatically analyse the tumour site, rather than relying on an individual pathologist. We're on course to reach the point where we don't need any human involvement: sample in, results out.

Then it can be decided whether some intervention needs to be taken, for example chemotherapy or targeted therapy. If the chance is very small of the cancer coming back, then it's better to wait and watch rather than giving toxic agents like chemotherapy, which kills both cancer and normal cells.

BR: Do you have a measurable impact on patient wellbeing, because it must cut down the amount of invasive surgery or chemotherapy? Do you have figures on the benefits for the patient?

HZ: Yes, we do, from the field we call health economics analysis. Essentially that's an indispensable component if you want to convince the NHS to adopt the technology or products. And of course, we need to convince NICE (the National Institute for Health and Care Excellence) in the UK to endorse that product. So, we're in the middle of this process.

We have a formal health economics study to address the exact question, not only in terms of the cost to the NHS or the cost to the society, but also the quality-of-life improvement for the patient. However, that's a very formal study at scale as we need statistical analysis to demonstrate its value.

BR: It's not until you get that final formal approval that the product can be rolled out into the NHS?

HZ: In general, when NICE recommends a product, there is a good chance that it can be widely adopted in the UK. Right now, we're selling to private hospital networks and to other European countries where we have our early access customers.

There are challenges. Chemotherapy is in general relatively inexpensive, and clinicians usually tend to treat patients based on past experience and see if it works, rather than thinking of avoiding the potential toxicity. And if the intervention, in this case chemotherapy, is relatively inexpensive, then a common challenge or problem is overtreatment. So, some clinicians may tend to think we should give patients the chemotherapy anyway to prevent the cancer coming back.

It is true, from evidence-based medicine experiments, that chemotherapy in general would improve outcomes or prevent cancer coming back. But it also has associated toxicity, and sometimes toxicity may kill a patient. So, for some patients, that benefit is very small, but the toxicity may harm the patient. We know statistically that about 0.1% of chemotherapy instances will cause very serious harm to, or kill, patients, so how do we avoid that?
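The trade-off described here can be put as simple arithmetic. The sketch below is illustrative only: the 0.1% serious-harm rate is the figure quoted in the interview, while the recurrence risks, the relative risk reduction and the function name `net_benefit` are hypothetical placeholders, and the equal weighting of a recurrence against a serious harm is a deliberate simplification, not a clinical rule.

```python
# Illustrative arithmetic only. Every number except the 0.1% figure quoted
# in the interview is a made-up placeholder, not clinical data.

SERIOUS_HARM_RATE = 0.001  # "about 0.1%" of chemotherapy instances, per the interview


def net_benefit(recurrence_risk: float,
                relative_risk_reduction: float,
                harm_rate: float = SERIOUS_HARM_RATE) -> float:
    """Absolute reduction in recurrence risk minus the serious-harm rate.

    Treats one avoided recurrence and one serious harm as equally weighted,
    which is a simplifying assumption for the sake of the example.
    """
    absolute_benefit = recurrence_risk * relative_risk_reduction
    return absolute_benefit - harm_rate


# Hypothetical patients: one with a very low predicted recurrence risk,
# one with a high predicted recurrence risk.
low = net_benefit(recurrence_risk=0.002, relative_risk_reduction=0.3)
high = net_benefit(recurrence_risk=0.30, relative_risk_reduction=0.3)
print(f"low-risk patient:  {low:+.4f}")
print(f"high-risk patient: {high:+.4f}")
```

Under these placeholder numbers the result is negative for the low-risk patient and positive for the high-risk one, which is the situation in which a reliable recurrence prediction would change the decision.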

Ethical data sourcing

BR: How did you gather your data?

HZ: Right, so there are several mainstream ways in general to get the data, but first I just want to point out that this is always the most challenging bit for any data analytics. I think it's not just for healthcare but for everything: the first step is to have high-quality data before any analysis.

Even though AI has seen so much hype, in reality the very concrete examples of AI solving problems are very limited, apart from some of the very fancy games, or purchase pattern prediction - think of Netflix and Amazon for recommendation. It’s mostly been entertainment. So, the value added for other critical aspects of human life is very limited. In finance, say, people are trying to put AI into the system. I know from my own experience that HSBC is implementing some rudimentary AI tools in its online and mobile banking to predict your expenses. But most of the time it is not accurate.

In healthcare we have been talking about this a lot, but so far there are very few tools that actually change medical practice, let alone revolutionise it. I think it's a bit like the AR/VR (augmented reality and virtual reality) thing - also known as the metaverse - where everyone feels that this is the next big movement, but nobody knows when and how.

This is the most challenging aspect, I think, for most AI-oriented healthcare companies because health data, like financial data, is probably the most precious and most personal data for most people. Therefore, how do we get the data?

Our approach is to work with research institutes and their clinical trial teams. This is how we got data access to some of the largest datasets in colorectal cancer. A specific commercial product was never the agenda in the early days. It was always how to work with research institutes to provide clinical insights, to increase the efficiency of the data analytics and increase the efficiency of clinical trials. That's how we started.

We were able to pool data from multiple clinical trials through multiple research institutes, including Oxford University. There are a number of other leading institutes in this endeavour, such as Oslo University in Norway. The drawbacks are very obvious: by definition it is not a product-driven approach from the beginning, so it is inefficient.

Another path is to buy data from commercial data companies, and this is probably also a new phenomenon in health. The drawback there is that it is hugely expensive. You can easily spend millions of pounds on data that may or may not be useful.

BR: So, you have an algorithm, I guess it must build up a sort of flexible model of the sort of image it's looking to identify as a cancer cell, and that's how you then train it? A narrow AI application, which AI is very good at doing. Is that right?

HZ: Yeah. The critical aspect of any AI application is the use case. To a large extent the reason why AI has not been very prominent in clinical medicine and clinical care is that we haven’t found the best clinical use cases. How do we identify pain points or unmet medical needs that can be solved by AI? I think that's a critical question. We're doing some exploration. We think we're on one of the promising paths. But we can’t be certain.

Another related question is: having identified a use case, what would be the best way to use AI technology? Imaging is one of the very early subdomains where people applied AI in medicine, pattern recognition being one of AI's key advantages.

BR: How does the algorithm you produced interact with the underlying information about the protein folding and all those other elements in the background?

HZ: This is something that I think people in the field probably feel proud of, while people in the IT industry, especially those in hardcore AI algorithm development, may feel it is like a trick. The truth is, most of the algorithms we are using have been around for a long time.

Google’s DeepMind team made some really big headlines around the globe on protein folding, which is a huge achievement. But the algorithm being used is a set of tools that have been around for a while. Some people in the field joked that it's a bit like alchemy rather than science.

How do we explain why some companies are relatively more successful than others if the approaches are so well established? It is probably down to two things. Number one, it is about identifying the use case, so that you can select the most suitable algorithms. The second thing is basically the ability to fine-tune the model to fit that specific scenario. Therefore, I think in a sense it is like art, if not magic, because in many cases there is no rigorous science behind fine-tuning the parameters. A lot of the time it's really based on experience and intuition.

BR: So, you've got the sample you're going to test, you've got the background information on the genome and the background information on the proteins and then the model in the middle. Is that essentially how it works?

HZ: Yes. We need the data, the biological samples. The proteins or genes are the underlying connections among the different components in the data set.

This is something we in the field call the knowledge graph. In the knowledge graph the different pathways and the different interactivity, or the lack of it, play a huge role. I think that's probably one of the keys to really applying AI to practical medicine at this stage.

Modern molecular biology has been around pretty much since 1953, when the DNA structure was discovered, so molecular medicine has been around for just about 70 years or so. There's so much unknown.

And even though we're saying big data, we're not really getting big data - we're getting one set of data with multiple dimensions. The actual data set is relatively small, and we have one of the largest datasets for colorectal cancer in the world. It’s about 2,000 or 3,000 patients - that's a tiny amount of data by big data criteria, but for each data point we have very high-dimensional features.

So how can you have a big data approach when you only have relatively little data? I think one of the critical tricks is to know how the data points are connected, so you don't just get big data and try to find the pattern. That is where the knowledge graph comes in. We have been trying to build the knowledge graph, including genomics, with how the genes interact, and proteomics, with how the different proteins interact with each other, but also other systems with regard to cancer. We know the cancer has a different microenvironment, so we study that tumour microenvironment.

We also study the immune system, because the immune system in the normal case would identify cancer cells and kill them. The very reason cancer became a prominent disease was because the immune system lost the capability to identify them or kill them. We study the immune system and then we try to link them together to form this interactive map, so that's how we are able to use a relatively small number of data points to address some of the challenges.
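To make the idea of a knowledge graph concrete, here is a minimal sketch of one as a set of typed links between entities, plus a query for everything connected to a given entity within a few steps. It is not the company's implementation: the entity names, the relation labels and the helper functions `build_graph` and `connected_entities` are all hypothetical placeholders chosen for illustration, and the links do not describe real biology.

```python
from collections import defaultdict, deque

# Hypothetical (source, relation, target) triples. The names are placeholders,
# not real genes, proteins or cell types.
TRIPLES = [
    ("gene_A", "encodes", "protein_A"),
    ("gene_B", "encodes", "protein_B"),
    ("protein_A", "interacts_with", "protein_B"),
    ("protein_B", "signals_to", "immune_cell_X"),
    ("gene_C", "encodes", "protein_C"),
]


def build_graph(triples):
    """Build an undirected adjacency map: entity -> set of linked entities."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def connected_entities(graph, start, max_hops):
    """Breadth-first search for entities within `max_hops` links of `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen - {start}


graph = build_graph(TRIPLES)
print(sorted(connected_entities(graph, "gene_A", max_hops=3)))
```

In this toy graph, `gene_C` and `protein_C` are never reached from `gene_A`, so a model could treat them as unrelated. That is the sense in which known connections can narrow the search when there are few patients but many features.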

Privacy

BR: What about privacy issues?

HZ: This will become a very prominent challenge in healthcare because with genomics alone you would be able to identify an individual patient. On top of that, we have the electronic health records and all the other omics data.

How does society decide what we should allow with that information? There is interaction with other organisations like insurance companies to consider. Could that cause a crisis for health insurance and discrimination in employment?

BR: Because healthcare is actually the place where we are likely to see many more benefits from the use of big data than perhaps any other. The benefit cases are quite clear, aren't they?

HZ: Yeah. I actually believe that you don't have to trade off privacy, safety or security against convenience. I think there must be technological solutions that allow the user to have both. Apple has used some technologies to inject “digital noise” into the dataset, so that individuals won’t be identified while the collective insights can still be derived anonymously from the data analyses.
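The noise-injection idea is usually discussed under the name differential privacy. As a rough sketch of the principle, and not a description of Apple's system, the example below uses the textbook Laplace mechanism on a simple count: each released count carries random noise, yet the average over many releases stays close to the true value. The function names, the `epsilon` value and the synthetic flags are assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(flags, epsilon: float, rng: random.Random) -> float:
    """Return the number of True flags plus Laplace noise.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    return sum(flags) + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic data: 600 of 1,000 hypothetical records carry some marker.
flags = [True] * 600 + [False] * 400

single_release = noisy_count(flags, epsilon=1.0, rng=rng)
many_releases = [noisy_count(flags, epsilon=1.0, rng=rng) for _ in range(2000)]
average = sum(many_releases) / len(many_releases)
print(f"one noisy release: {single_release:.2f}")
print(f"average of 2,000 releases: {average:.2f}")
```

The design choice is the value of `epsilon`: it sets how much any one individual's presence can shift the published number, at the cost of less precise statistics.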

Future development

BR: At the moment you are looking at colorectal cancer. Is the next step - having proved the validity of this way of modelling things - to then apply it to other types of cancer? Is that nontrivial or is that easier once you have sort of broken the back of it with one particular example?

HZ: Well, we hope it will be easier, but I think it's still going to be very challenging. One consideration is whether this colorectal cancer derived algorithm that has been clinically validated can be expanded to other cancer types - for example, breast cancer, ovarian cancer.

As I mentioned, we're looking at a relatively early stage for colorectal cancer, but could it work for an even earlier stage - screening before the cancer actually forms? And also at a later stage - the metastatic stage? It is possible that the product could provide medical insights beyond chemotherapy, like targeted immunotherapy, or even predict which new therapies can be used.

Finding a one-size-fits-all solution is challenging. We probably have to tailor the algorithm to each specific indication, but on the other hand maybe we're not on the right path. That's why we have to adjust for each cancer specifically. Is it possible to allow the algorithm to become self-sustaining, to adapt without intervention?

If the machine learning could grasp the dimensions within the knowledge graph as they change from cancer to cancer, and if it could self-adapt to that and then self-evolve into a new algorithm, I think that would be really cool.

BR: Is that something you feel is possible in the short term?

HZ: Oh well, in life sciences, when we say short term we're talking five years.

BR: Yeah, OK five years then.

HZ: I think it is possible. We're working on one particular technique to allow the computer to detect cancer cells automatically.

In the past the pathologist had to manually circle all the cancer areas and then go on to the diagnosis. Right now, our preliminary data, or a prototype of the algorithm, can automatically detect the tumour regions.

The next step is this: if the cancer changes, which also means the stage and the morphology change, the program should adapt to that change, so that you don't need to tell it what the cancer cells look like. In its current stage, the program has limited success in adapting. And the cells could be vastly different between, for example, lung cancer and prostate cancer.

The program can recognise them to a certain level - we got around 65 to 75% of the cases correct - but there's still a lot of room to improve. That is something we hope we can enable the program to do.

BR: It seems to me that the knowledge graph you're talking about can apply to the range of benefits too, because obviously if you can identify these different cancer stages, it directly helps people's health, it helps clinicians with their workload and it saves money for the NHS. That's a big graph.

HZ: Yeah, I absolutely agree, but I think we're still far away from that stage. That would be the goal we are aiming for.