Concerns about service quality extend beyond IT systems to the data itself, writes Naveen Madhavan MBCS, Senior Product Specialist (Pathology) at Digital Health and Care Wales. Yet the value of this data primarily resides in the insight it promises within a relevant context.

The complexity of how data is captured, mapped and transmitted also means it has the propensity to diminish in accuracy over time. Additionally, IT systems designed to ensure data integrity have little control over the quality of data that has been compromised at source or corrupted during transmission.

The bullwhip effect

Systems scientist Jay Wright Forrester theorised in 1961 that interpretations of consumer requirements can become distorted as information travels upstream between distributors, wholesalers and producers. Referred to as the Forrester or whiplash effect, the concept initially gained prominence as a way to trace the movement of products through supply chains.

In time, these amplification effects were observed to hold true for IT and healthcare. In simple terms, the principle states that an unexpected fluctuation in user activity can inadvertently cause service providers to overreact by exaggerating user requirements, resulting in the overproduction of systems and resultant data that are ultimately wasted through lack of use.

In the rapidly expanding digital world, the quality of data is largely determined and managed through system design. Users of clinical systems tend to overstate IT requirements at the outset, then ask for them to be scaled down during the testing phase as they realise that not everything initially requested would be of practical benefit.

It is also difficult to assess the relevance or usefulness of data generated from demand projections of patient activity, whose scope also brings data storage and system performance challenges to the forefront.

Big data, shape and size

Aspects of the world that can’t be personally experienced can be understood through data. Ever more data is being generated through digitisation, devices, artificial intelligence and machine learning. Data has been described as the ‘new oil’, ‘the currency of our time’ and the ‘reduction of uncertainty’. Yet big data is not just about the size of the data but also the many correlations and relational linkages that add to its complexity.

As data grows exponentially, data discussions will move from gigabytes (10⁹ bytes) and terabytes (10¹²) to petabytes (1PB = 2¹⁰TB, or 10¹⁵), exabytes (1EB = 2¹⁰PB, or 10¹⁸) and zettabytes (1ZB = 2¹⁰EB, or 10²¹) that will eventually consider capacity in yottabytes (1YB = 2¹⁰ZB, or 10²⁴).
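
A note on the arithmetic: the binary and decimal conventions quoted above (2¹⁰ multiples versus powers of ten) diverge slightly as the units grow. The short Python sketch below is purely illustrative and not tied to any particular system; it simply prints the two ladders side by side:

    # Illustrative only: compare the decimal (powers of 1000) and binary
    # (powers of 1024) ladders for the storage units mentioned above.
    UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

    for i, unit in enumerate(UNITS, start=1):
        decimal_bytes = 10 ** (3 * i)   # decimal convention, e.g. 1PB = 10^15 bytes
        binary_bytes = 2 ** (10 * i)    # binary convention, e.g. 1PB = 2^10 TB
        print(f"1 {unit}: {decimal_bytes:.2e} (decimal) vs {binary_bytes:.2e} (binary) bytes")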

While big data is exciting, it also presents myriad handling problems. The unprecedented growth of data draws attention to structural attributes such as velocity (capture), volume (increment), valence (complexity), veracity (accuracy), variety (variability) and value (importance).

Considerations for data handling decisions include retrievability, reliance, performance and cost, while privacy and security continue to act as inhibiting factors to data expansion strategies.

The volume of personal data stored in the cloud has been increasing significantly, with organisations closely following suit by adopting hybrid approaches - a mix of local and cloud solutions - dictated by the practicalities of retrievability, security and performance.

Emerging storage technologies

After the industrial revolution, the volume of data doubled every ten years. After 1970, it doubled every three years and today, it doubles every two years. The global data that was created and copied reached 1.8ZB by 2011 and was estimated to reach 40ZB by 2020.
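
As a rough sanity check on those figures (a back-of-the-envelope sketch rather than a formal projection), doubling every two years from the 1.8ZB recorded in 2011 lands very close to the 40ZB estimate for 2020:

    # Back-of-the-envelope check of the growth figures quoted above:
    # start at 1.8 ZB in 2011 and double every two years until 2020.
    start_zb, start_year, end_year = 1.8, 2011, 2020
    doublings = (end_year - start_year) / 2        # 4.5 doublings over nine years
    projected_zb = start_zb * 2 ** doublings
    print(f"Projected {end_year} volume: ~{projected_zb:.1f} ZB")   # ~40.7 ZB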

Big data storage does not necessarily mean thinking bigger but could mean just the reverse - ‘thinking smaller’. Next generation storage technologies are examining the structure of DNA itself, which promises vastly more capacity to hold the world’s data.

So, where one Dalton is 1.67 × 10⁻²⁴ grams and the human genome weighs 3.59 × 10⁻¹² grams (around 3.59 picograms), the culmination of this work could mean that all the world’s estimated data in 2020 could fit into roughly 90 grams of DNA-based storage.
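
The arithmetic behind that ~90 gram figure can be sketched roughly as follows. The nucleotide mass and coding density used here are round assumptions chosen for illustration (about 325 Daltons per nucleotide and two bits encoded per nucleotide), not figures taken from the research itself:

    # Rough sketch of the ~90g estimate, using assumed round figures:
    #   1 Dalton is about 1.67e-24 g (as quoted above)
    #   an average nucleotide is about 325 Daltons (assumption)
    #   ~2 bits encoded per nucleotide (assumption)
    DALTON_G = 1.67e-24
    NUCLEOTIDE_DALTONS = 325
    BITS_PER_NUCLEOTIDE = 2

    grams_per_bit = NUCLEOTIDE_DALTONS * DALTON_G / BITS_PER_NUCLEOTIDE
    world_data_bits = 40e21 * 8                  # 40 ZB expressed in bits
    print(f"Mass of DNA needed: ~{world_data_bits * grams_per_bit:.0f} g")   # ~87 g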

Conversely, where the reliance has been on magnetic storage on hard disks, advances in multi-dimensional nanophotonics are stretching the boundaries of optical storage solutions by altering the frequency and polarisation of the writing beam - the very parameters that determined conventional compact disc capacity limits.

Until then, magnetic tape data storage will remain in use as an affordable option, at least until emerging technologies overcome the capacity, performance and cost hurdles needed to replace it.

Our emotional connection to data

Data is an inert commodity that becomes dynamic through a person’s association with it. In other words, it derives its strength and validity from one’s emotional affinity with it. Data also carries its own nature based on its construction and affiliations.

No one really cares about analytics - whether advanced, impressive or state-of-the-art - until they are affected by its implications. Emotions play an active part in raising the strategic importance of specific data segments within information processing units.

Users are drawn to specific data segments associated with their work. In healthcare, real time data is invaluable in providing a clinical diagnosis, while trends from historic data may be critical for the treatment of chronic illnesses. Similarly, studies of human DNA can open up predictions of potential illnesses or healthcare needs in the future.

Emotions can induce feelings of personal responsibility for the loss or corruption of data that plays a critical role in health predictions, just as concerns about reputational damage can limit the sharing of details of data irregularities with wider stakeholders.

Data quality - a matter of opinion

Understanding data quality draws interest to attributes such as accuracy, consistency, integrity, relevance, speed, security and timeliness. It’s also worth understanding that, although organisations provide procedural boundaries, perceptions of what is deemed as acceptable data are subjectively formed.

Consequently, the understanding of quality can differ between staff in the same team, following the same processes and undertaking similar tasks. The limitations of documenting every intimate action and keystroke compound this variation. In addition, individual experiences, emotions and tolerances act as mediating effects during data quality assessments.

Although optimal quality data is desired, for practical reasons this tends to veer more towards what is acceptable rather than what is perfect. The balance of ensuring data quality can be described as a seesaw between risk and preventative action. In other words, even when all the routine checks have been done to ensure the quality of a dataset, further scrutiny will inevitably reveal additional anomalies that may require correction.

It is pertinent to realise that all data has errors. The time and resource required to ensure ideal quality conflicts with the urgency to present data within critical systems where it can be interpreted and acted upon in real time. It is for this reason that isolated data deliberation or analysis without a business use-case is of little significance.

Provenance adds value to data by explaining how it was obtained. However, systems designed and deployed into operation prematurely to meet an urgent project requirement can pose a multitude of data quality problems. Further, data validation techniques can only go as far as checking data processing; they offer no guarantee of the data’s integrity at source.
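
As a trivial, hypothetical illustration of that last point (the field names and ranges below are invented, not drawn from any clinical system), a format-level validator will happily accept a value that is structurally valid yet was already wrong before it reached the system:

    # Hypothetical example: format-level validation accepts structurally valid
    # data even when the value was corrupted at source.
    def validate_result(record: dict) -> bool:
        """Checks structure and a plausible range only - not correctness at source."""
        return (
            isinstance(record.get("patient_id"), str)
            and isinstance(record.get("potassium_mmol_l"), (int, float))
            and 0.0 < record["potassium_mmol_l"] < 10.0   # broadly plausible range
        )

    # A transcription error at source: 3.8 keyed in as 8.3. The record still
    # passes every downstream check, because validation cannot see the source.
    record = {"patient_id": "P001", "potassium_mmol_l": 8.3}
    print(validate_result(record))   # True - valid in form, wrong in fact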

When data goes wrong

Datasets lack the ability to determine what is worthwhile and what is arguably junk, which means good data often arrives conglomerated with inconsequential information. Most healthcare settings have systems and processes in place to maintain data standards. But although these play an active role in quality outcomes, it’s undeniable that human cognition, coordination, attention and personal integrity are critical to ensuring data quality.

When data goes wrong, no one wants to be associated with the stigma of having caused the deviation. Unlike data, which is impartial, people arrive with an ingrained self-serving bias that causes them to automatically attribute success to themselves and failure to others. Although aware of their own flaws or faults, people believe that their intentions are always inherently good and that it just isn’t possible for them to be wrong, which instinctively causes them to castigate others for unintended mistakes.

Systems that offer critical services are subject to more scrutiny during failure. Teams that work together to jointly deliver solutions can suddenly regress to their own silos when a data fault has been identified. The question of ownership of a problem becomes a logistical hot potato that eventually rests with the processing unit where the anomaly occurred.

Attitudes towards data quality can sway between an explicit expectation among clinical staff that data should always be perfect at every stage, and a tacit understanding among technical staff that data is fluid by nature and imperfect, and can be iteratively corrected against validated datasets to ensure its usability.

Whether service quality is maintained optimally or not, an upcoming audit provides the necessary impetus to tighten up data and related change documentation - at least within the area that is scoped for the audit.

Ensuring healthcare data quality is a long game. With resource constraints and the exponential increase of digital data, the pragmatic view would be to steer efforts towards data that is critical for immediate application, at the expense of cleaning volumes of poor quality data that has run past its use-by date and therefore lost its value.

When a decision is made based on data, regardless of its quality, the responsibility for that decision lies with the decision maker. In other words, data is only an enabler for decision making and is rarely conclusive. In the meantime, data will continue to be captured routinely until the point of saturation, before discussions move to alternative areas of interest.

About the author 

Naveen Madhavan PhD MBA MBCS is a Senior Product Specialist (Pathology) at Digital Health and Care Wales. His doctorate explores the value of clinical information systems and he is a Visiting Fellow at the University of South Wales, UK.

Further reading 

Forrester, J. W., 1961, Industrial Dynamics.
Pearson, S., 2013, Big Data Is “Emotional” Data, DM News, Vol. 35, Issue 2.