What is a knowledge graph and how are they changing data analytics?

Knowledge graphs are a big trend in the advanced data science world, but aren’t as widely known as they could be. Graph database expert Maya Natarajan says that needs to change.

What is a knowledge graph? Put simply, a knowledge graph is an interconnected dataset that's been enriched with meaning. The Turing Institute frames knowledge graphs as the best way to ‘encode knowledge to use at scale in open, evolving, decentralised systems.’

Using a knowledge graph, we can start to reason about the underlying data and use it for complex decision-making. While not every knowledge graph is built the same way, we’re finding that every graph data science project starts with a knowledge graph. (That rings true, since Gartner predicts that by 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% this year.)

When you take data and place it into a property graph store, relationships between data points become part of the data itself, with no JOIN tables needed, as is the case with a relational database. And that’s useful for building a knowledge graph. It provides the first level of context that such a structure will depend on.
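To make the idea concrete, here is a minimal Python sketch of a property graph held in memory. It is only an illustration: the node ids, relationship types and the `neighbours` helper are invented for this example and do not reflect any particular graph database's API.

```python
from collections import defaultdict

# Nodes carry properties; the ids and properties here are illustrative only.
nodes = {
    "alice": {"label": "Person", "name": "Alice"},
    "acme": {"label": "Company", "name": "Acme Ltd"},
    "widget": {"label": "Product", "name": "Widget"},
}

# Relationships are stored as data in their own right:
# (source, type, target, properties).
relationships = [
    ("alice", "WORKS_FOR", "acme", {"since": 2019}),
    ("acme", "MAKES", "widget", {}),
]

# Index outgoing relationships by source node so that traversal is a direct
# lookup rather than a join across separate tables.
outgoing = defaultdict(list)
for source, rel_type, target, props in relationships:
    outgoing[source].append((rel_type, target, props))


def neighbours(node_id, rel_type=None):
    """Return ids of nodes one hop away, optionally filtered by relationship type."""
    return [
        target
        for r_type, target, _ in outgoing[node_id]
        if rel_type is None or r_type == rel_type
    ]


# Two-hop traversal: which products are made by Alice's employer?
products = [
    product
    for company in neighbours("alice", "WORKS_FOR")
    for product in neighbours(company, "MAKES")
]
print(products)  # ['widget']
```

In a relational model the same question would typically need two joins; here each hop simply follows a stored relationship.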

We can term this ‘dynamic context’, because when you put data into a dynamic structure like a graph database, you get a structure that's immediately contextually connected to all of its neighbours. And if neighbours are connected to all their neighbours, the knowledge graph grows and becomes richer as new information is added.

Why are we going to all this knowledge graph trouble?

Context is always being dynamically added with graphs. By contrast, if you put information into a knowledge base, as opposed to a knowledge graph, you only get out what you put in. Knowledge bases have a static, shallow context, not a rich, dynamic one. A knowledge graph lends itself to advanced analytics because it connects everything you have on a topic, which ultimately fuels new types of discoveries. Because you capture relationships, you get out more information than you put in.

To flesh out a knowledge graph, you need semantics, which supplies the second layer of context. Adding this semantic layer is the step where you make the data smart. It builds intelligence into the data so that we can infer meaning from it. Semantics is about meaning, and meaning is what confers context.

What do semantics do for knowledge graph builders?

There are various categories of semantics. The first is controlled vocabularies, such as synonym rings, taxonomies and ontologies. Synonym rings and taxonomies are typically ‘lightweight’, whereas ontologies are ‘heavyweight’: a heavyweight ontology has more inference power, but it takes a long time to build and maintain.
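As a rough illustration of the lightweight end of that spectrum, the Python sketch below combines a synonym ring with a small taxonomy. The vocabulary and the helper names (`canonical`, `ancestors`, `is_a`) are made up for this example, and each ring is keyed by a preferred term purely for convenience.

```python
# Synonym rings: every term in a ring is treated as equivalent.
# Keying each ring by a preferred term is a simplification for this sketch.
synonym_rings = {
    "laptop": {"laptop", "notebook", "portable computer"},
    "phone": {"phone", "mobile", "handset"},
}

# A taxonomy: each term points to its broader parent term.
broader = {
    "laptop": "computer",
    "computer": "electronics",
    "phone": "electronics",
}


def canonical(term):
    """Map a term to the preferred term of its synonym ring, if it has one."""
    for preferred, ring in synonym_rings.items():
        if term in ring:
            return preferred
    return term


def ancestors(term):
    """Walk up the taxonomy and collect every broader term."""
    found = []
    while term in broader:
        term = broader[term]
        found.append(term)
    return found


def is_a(term, category):
    """Infer category membership through synonyms and the taxonomy."""
    return category in ancestors(canonical(term))


print(is_a("notebook", "electronics"))  # True
print(is_a("handset", "computer"))      # False
```

Nobody stated that a notebook is a kind of electronics; the answer is inferred from the two structures. An ontology would add richer rules on top of this, at the cost of more modelling effort.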

The second category of semantics is entity resolution and analysis: determining whether records from different data sources represent the same entity. Entity resolution problems are familiar and widespread. If a set of products is listed for sale on Amazon and the same products are listed on another ecommerce site, the listings will have slightly different names and descriptions, because each vendor writes its own. They may have similar prices, but they'll have different unique identifiers and so on.

A human can look at a few such records from two or more sources and determine whether they refer to the same product or not, but cannot do so for a million products. To work at that scale, you need automated entity resolution. This is an important problem wherever data consistency is key, especially when detecting fraud or preventing money laundering.
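A minimal sketch of automated entity resolution, assuming two hypothetical product feeds, might look like the following. The records, thresholds and function names are illustrative assumptions; production systems use far more robust matching than this token comparison from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Two hypothetical listings feeds; ids, names and prices are invented.
site_a = [
    {"id": "A-100", "name": "Acme Wireless Mouse M200", "price": 19.99},
    {"id": "A-101", "name": "Acme USB-C Charger 65W", "price": 34.50},
]
site_b = [
    {"id": "B-7", "name": "ACME M200 wireless mouse", "price": 20.49},
    {"id": "B-8", "name": "Bolt Mechanical Keyboard", "price": 79.00},
]


def name_similarity(a, b):
    """Compare names as sorted lowercase tokens so word order and case are ignored."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def same_entity(rec_a, rec_b, name_threshold=0.85, price_tolerance=0.10):
    """Treat two records as one product if names are close and prices are within 10%."""
    close_price = abs(rec_a["price"] - rec_b["price"]) <= price_tolerance * rec_a["price"]
    close_name = name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
    return close_price and close_name


# Compare every pair across the two feeds and keep the likely matches.
matches = [(a["id"], b["id"]) for a in site_a for b in site_b if same_entity(a, b)]
print(matches)  # [('A-100', 'B-7')]
```

The identifiers differ and the names are worded differently, yet the two mouse listings are linked. Comparing every pair does not scale to millions of records, so real pipelines usually narrow the candidates first.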

Finally, tagging, categorisation, classification and AI are all important in the context of knowledge graphs. AI is particularly important for ‘knowledge graph completion’, the process of adding missing entities and links to a knowledge graph. That covers both links that were simply never recorded and links that are present in the underlying domain but as yet undiscovered (think about genetic links in drug discovery).
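One simple way to suggest missing links is the common-neighbours heuristic: two unconnected nodes that share many neighbours are candidates for a link. The Python sketch below shows it on a toy graph. The edges and the `candidate_links` function are assumptions for illustration, and this heuristic stands in for the learned models a real completion pipeline would more likely use.

```python
from collections import defaultdict
from itertools import combinations

# A toy undirected graph; node names are placeholders.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


def candidate_links(adjacency):
    """Score every unconnected pair by the number of neighbours the two nodes share."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = len(adjacency[u] & adjacency[v])
        if shared:
            scores[(u, v)] = shared
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


suggestions = candidate_links(adjacency)
print(suggestions[:2])  # [(('a', 'd'), 2), (('b', 'c'), 2)]
```

Each suggestion is only a hypothesis to be checked, not a fact to be written straight into the graph.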

So far, so abstract. Why would we go to all this trouble? It’s because knowledge graphs enable you to integrate all your data, right where it sits. They are not another system to fit into your landscape, but a structure that relates all of the data you have.

Knowledge graphs function as a non-disruptive insight layer on top of whatever existing data landscape or infrastructure you may have in place.

And because knowledge graphs connect all the data, a single knowledge graph can serve multiple purposes. A knowledge graph you initially create for Product 360 can also support bill of materials (BOM) management, which really entails looking at the same kinds of data from different angles.

You might wonder about the difference between a graph and a knowledge graph. In practice, it’s a short journey from graph to knowledge graph, achieved by simply adding semantics. When we asked our customers about knowledge graphs recently, the majority (67%) said they had already implemented them. These are responses from large organisations in a wide range of vertical sectors across multiple use cases, so there is undoubted momentum. But why are knowledge graphs becoming such a hot topic for enterprises?

From siloed data to advanced analytics on connected data

Knowledge graphs lend themselves to multiple use cases. There are at least 40 that we know of and they tend to break down into two use case groups: data management and data analytics.

The data management knowledge graph’s aim is to drive action by either providing data assurance or data insight. Data assurance knowledge graphs focus on data aggregation, validation and governance. Examples of data assurance knowledge graphs include data lineage, data provenance, data governance, risk management and compliance.

The data insight knowledge graph’s function goes beyond visibility of information and focuses on exploration, deduction and inference of new knowledge. Examples of data insight knowledge graphs include Customer 360, Patient 360, Product 360, Agent 360 and so on. This category also includes identity and access management, anti-money laundering (AML), root cause analysis, impact analysis, recommendations and others.

To bring all this back down to earth, let’s consider a common problem: fraud. Take the case of synthetic identity fraud, in which fraudsters combine elements of real identities into a plausible composite. One red flag is two people who share information that should be unique, like a driving licence number. A graph query can look for shared identifiers that bear further investigation.
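The shared-identifier check can be sketched in a few lines of Python. The customer records and the `shared_identifiers` function are hypothetical. In a graph database this would normally be written as a pattern-matching query, so the code only mimics the logic.

```python
from collections import defaultdict

# Hypothetical customer records with obviously fake identifier values.
customers = [
    {"id": "c1", "name": "Ann Lee", "licence": "LIC-0001", "phone": "555-0101"},
    {"id": "c2", "name": "Bob Ray", "licence": "LIC-0002", "phone": "555-0102"},
    {"id": "c3", "name": "Cal Moss", "licence": "LIC-0001", "phone": "555-0103"},
]


def shared_identifiers(records, fields):
    """Return identifier values that are attached to more than one customer."""
    holders = defaultdict(set)
    for record in records:
        for field in fields:
            value = record.get(field)
            if value:
                # Think of (field, value) as an identifier node that customers link to.
                holders[(field, value)].add(record["id"])
    return {key: ids for key, ids in holders.items() if len(ids) > 1}


flagged = shared_identifiers(customers, ["licence", "phone"])
print(flagged)
```

Here customers `c1` and `c3` share a licence number, so that pair would be passed on for further investigation.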

There’s also the decision-making family of knowledge graphs. These knowledge graphs are used for analytics, machine learning and other data science applications. Decision-making knowledge graphs aim to improve decisions, forecasts and predictions and prescribe optimal actions. Examples include churn analysis and what-if analysis, journey analytics and so forth.

Graph algorithms learn about the overall structure of your knowledge graph. For example, a community detection algorithm can reveal groups whose structure and transaction patterns are anomalous. When you see activity patterns that deviate from normal behaviour, they may indicate a fraud ring at work. Graph algorithms enable you to find such patterns across your knowledge graph.
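As a simplified illustration, the Python sketch below groups accounts into connected components and flags groups that are unusually tightly connected. Connected components are only a basic stand-in for a full community detection algorithm such as Louvain. The transfer data, the size and density thresholds and the function names are all assumptions made for this example.

```python
from collections import defaultdict, deque

# Hypothetical transfers between accounts (undirected for simplicity).
transfers = [
    ("acc1", "acc2"), ("acc2", "acc3"), ("acc3", "acc1"),
    ("acc1", "acc4"), ("acc2", "acc4"), ("acc3", "acc4"),
    ("acc5", "acc6"),
    ("acc7", "acc8"), ("acc8", "acc9"),
]


def connected_components(edges):
    """Group nodes that can reach each other, using breadth-first search."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        queue, members = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            members.add(node)
            for nxt in adjacency[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(members)
    return components


def density(members, edges):
    """Fraction of possible pairs inside the group that are actually connected."""
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(1 for u, v in edges if u in members and v in members)
    return internal / (n * (n - 1) / 2)


components = connected_components(transfers)
# Flag groups of three or more accounts where almost everyone transacts with everyone.
suspicious = [
    sorted(group)
    for group in components
    if len(group) >= 3 and density(group, transfers) >= 0.9
]
print(suspicious)  # [['acc1', 'acc2', 'acc3', 'acc4']]
```

Whether a dense group is really anomalous depends on what normal looks like in your data, so the thresholds here are placeholders rather than recommendations.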

Then there is the graph-native machine learning category, whose use cases include link prediction, node classification and feature engineering. Consider that last item. To implement machine learning, you need what are termed ‘features’. Features capture what is predictive in your data.

Too often, data scientists use trial and error to find predictive features, but graph data science offers tools like graph algorithms or, better still, graph embeddings to learn what’s predictive in your graph and feed that information into machine learning models to fuel better predictions.

With knowledge graphs, you move from queries that provide data insights to graph algorithms that identify patterns in the larger graph. For example, suppose that when using graph algorithms, you uncover characteristics that are predictive of fraudulent activity. You can then add those features to your existing machine learning models, giving them additional predictive power.
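To illustrate the feature engineering step, the Python sketch below derives three simple structural features per node (degree, triangle count and local clustering coefficient) and merges them into an existing feature table. The graph, the `account_age_days` column and the `graph_features` function are hypothetical, and no claim is made that these particular features predict fraud.

```python
from collections import defaultdict
from itertools import combinations

# A toy undirected graph of accounts; names are placeholders.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


def graph_features(node):
    """Derive simple structural features for one node."""
    nbrs = adjacency[node]
    degree = len(nbrs)
    # A triangle exists for each pair of neighbours that are themselves linked.
    triangles = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adjacency[u])
    possible = degree * (degree - 1) / 2
    clustering = triangles / possible if possible else 0.0
    return {"degree": degree, "triangles": triangles, "clustering": clustering}


# Existing tabular features for the same accounts (hypothetical column).
tabular = {
    "a": {"account_age_days": 400},
    "c": {"account_age_days": 35},
    "d": {"account_age_days": 12},
}

# Append the graph-derived features to each row before model training.
enriched = {node: {**row, **graph_features(node)} for node, row in tabular.items()}
print(enriched["a"])
# {'account_age_days': 400, 'degree': 2, 'triangles': 1, 'clustering': 1.0}
```

The enriched rows can then be passed to whatever model you already train, which is the sense in which graph features add predictive power to existing pipelines.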

Finally, graph embeddings learn from the specific data in your graph and find net new information: connections in your data that you have not yet discovered. Fraudulent ‘customer journeys’ often share certain characteristics, like opening a new account, steadily making smaller purchases (and payments) and then after establishing normal behaviour, maxing out the account and disappearing.

By encoding customer behaviour in a knowledge graph, you can create an embedding that looks for signals of fraudulent patterns before they lead to big losses. And remember, even though we mentioned fraud to keep our discussion concrete, there are numerous use cases, from identifying customer churn to optimising global supply networks.

The move toward knowledge graphs is happening across every industry and across data science teams. So for graphs and particularly knowledge graphs, it looks like The Turing Institute is onto something here. Make sure you and your data science team don’t get left behind.