Roger Needham lecture 2008

A revival of data dependencies for improving data quality

Speaker: Prof Wenfei Fan, University of Edinburgh’s School of Informatics.

Recent statistics reveal that between 1% and 5% of real-world data in enterprises is ‘dirty’: inconsistent, inaccurate, incomplete and/or stale. Widespread use of the internet has increased the risk of creating and propagating dirty data on an unprecedented scale.

Dirty data is estimated to cost industry in the USA alone billions of dollars per year. There is no reason to believe that the scale of the problem is any different in the UK, or in any other society that depends on information technology. This highlights the need for principled approaches to improving data quality.

This talk presents a recent approach for detecting and repairing real-life dirty data. It is based on conditional dependencies, a revision of traditional database dependencies that enforces bindings of semantically related data values.

Unlike traditional database dependencies, which were developed for improving the quality of schemas, conditional dependencies provide a theory for improving the quality of the data itself.
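To make the idea of binding semantically related values concrete, here is a minimal sketch of how a conditional functional dependency might be checked against a table. It is an illustration under stated assumptions, not the techniques presented in the talk: the customer records, the attribute names (cc, ac, zip, street, city), the "_" wildcard convention and the helper find_violations are all hypothetical. The first rule says that zip determines street, but only for records with country code 44; the second says that country code 44 with area code 131 must have the city Edinburgh.

```python
from collections import defaultdict

WILDCARD = "_"  # hypothetical convention: "_" in a pattern means "any value"


def matches(row, attrs, pattern):
    """True if the row agrees with every constant the pattern fixes on attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def find_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the rule (lhs -> rhs, pattern)."""
    violations = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if not matches(row, lhs, pattern):
            continue  # the rule only applies where the condition holds
        if pattern[rhs] != WILDCARD and row[rhs] != pattern[rhs]:
            violations.add(i)  # a single row can violate a constant binding
        groups[tuple(row[a] for a in lhs)].append(i)
    if pattern[rhs] == WILDCARD:
        # Rows that agree on lhs must also agree on rhs.
        for indices in groups.values():
            if len({rows[i][rhs] for i in indices}) > 1:
                violations.update(indices)
    return sorted(violations)


# Made-up customer records, for illustration only.
rows = [
    {"cc": "44", "ac": "131", "zip": "EH8 9AB", "street": "Crichton St", "city": "Edinburgh"},
    {"cc": "44", "ac": "131", "zip": "EH8 9AB", "street": "Mayfield Rd", "city": "Edinburgh"},
    {"cc": "44", "ac": "131", "zip": "EH9 3JZ", "street": "Mayfield Rd", "city": "London"},
    {"cc": "01", "ac": "908", "zip": "07974", "street": "Mountain Ave", "city": "New Providence"},
    {"cc": "01", "ac": "908", "zip": "07974", "street": "Main St", "city": "New Providence"},
]

# Rule 1: for cc = 44 only, zip determines street.
street_violations = find_violations(
    rows, ["cc", "zip"], "street", {"cc": "44", "zip": "_", "street": "_"}
)

# Rule 2: cc = 44 and ac = 131 together bind city to Edinburgh.
city_violations = find_violations(
    rows, ["cc", "ac"], "city", {"cc": "44", "ac": "131", "city": "Edinburgh"}
)

print(street_violations)  # [0, 1]
print(city_violations)    # [2]
```

In this sketch the two records with country code 01 share a zip but differ on street, and they are not flagged, because the first rule is conditioned on country code 44. A traditional functional dependency from zip to street, applied to the whole table, would flag them as well.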

Based on this theory, practical techniques have been developed for cleaning dirty data that effectively reduce human effort and improve data quality. The techniques have drawn attention from industry in the UK and beyond.