It’s an obvious but increasingly important fact: we live in a data-driven world, writes Keren Pakes, General Manager of The Bright Initiative. It’s now estimated that, globally, we create around 2.5 quintillion bytes of data every 24 hours - that’s 25 followed by a staggering 17 zeros.

Given its value and importance in driving the global economy, data has often been referred to as the new oil. It has also been likened to gold. And in 2020, the International Data Corporation (IDC) and Qlik came up with a new comparison: data as the new water. ‘Like water, data needs to be accessible, it needs to be clean, and it is needed to survive,’ they proclaimed.

What is unstructured data?

Much of the data we produce isn’t neat or easy to work with; it’s unstructured and disparate. The IDC estimates that 80% of data worldwide will be unstructured by 2025. It cascades from social media, from audio and video streaming as well as from machines and sensors.

A large chunk of the data we produce is residue from our daily digital lives, instant and quickly forgotten. This includes public social media posts and the rich metadata attached, such as time, date, and location. But far from being useless waste, this data actually has great potential value in helping to make the world a fairer and more inclusive place.

The opportunity is recognised by the UK Government in its National Data Strategy, where it acknowledges that data can drive efforts to create a more inclusive, less biased society and to ‘level-up’. Data can be ‘used to harness the potential of regions right across the country, ensuring that people and organisations from the whole of the UK can benefit from the full value of the digital revolution.’

The trouble with data

In recent years, as data’s role in society has grown, concern has been raised about how data can exclude groups of people through bias in the way it is collected - or, in some cases, because it is not collected at all.

Caroline Criado Perez’s best-selling book Invisible Women, first published in 2019, made the point that data used on a global scale to inform medical, scientific and technological development, urban planning, and economic and social policy is biased against women.

In the book, she gives numerous examples of the gender data gap, emphasising how data collection is often based on male models, while the female population is overlooked. This exclusion can have alarming and serious impacts. For example, there is a tendency to exclude female humans, animals, or cells from medical trials. As a result, women can receive less effective treatment and experience more side effects.

Women are often ‘invisible’, as Criado Perez puts it, because designers, planners, and project teams are mainly male and overlook women’s needs. These teams don’t question whether the data they are using is representative of women, and so they don’t look any further for new sources.

But, with so much information now available, mumbled excuses about representative datasets being hard to find, or not being available, are difficult to maintain. In disciplines like city planning and transportation - where we know a third more women than men use buses in the UK - unstructured public web data offers a rich, untapped stream for collecting and including the needs of women and, indeed, other underrepresented groups.

So, here’s one idea: we could find and analyse web-based reviews of local bus companies and public social media posts about bus services, to drive the design of more inclusive transport services - and better everyday lives for a bigger chunk of the population.
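As a rough illustration of what analysing such reviews and posts might involve, here is a minimal sketch in Python. It assumes the public text has already been collected, and it simply counts how many posts touch on a handful of themes. The theme names, keyword lists, and sample posts are all invented for illustration; a real project would need far more careful, validated categories and a responsibly sourced dataset.

```python
from collections import Counter
import re

# Hypothetical themes and keywords, chosen purely for illustration.
THEMES = {
    "safety": {"unsafe", "lighting", "dark", "harassment"},
    "accessibility": {"pushchair", "pram", "wheelchair", "step", "ramp"},
    "reliability": {"late", "cancelled", "delayed"},
}


def tag_themes(text):
    """Return the set of themes whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {theme for theme, keywords in THEMES.items() if words & keywords}


def count_themes(posts):
    """Count how many posts mention each theme at least once."""
    counts = Counter()
    for post in posts:
        counts.update(tag_themes(post))
    return counts


# Invented example posts, not real data.
SAMPLE_POSTS = [
    "The stop has no lighting and feels unsafe after dark.",
    "No room for a pushchair again, and the bus was late.",
    "Driver lowered the ramp straight away, great service.",
]

if __name__ == "__main__":
    for theme, n in count_themes(SAMPLE_POSTS).most_common():
        print(theme, n)
```

Simple keyword matching like this misses context, sarcasm, and anyone who is not posting online at all, so it could only ever be a first pass that points planners towards issues worth investigating properly.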

Getting ‘radical’

Inclusivity in data is also being given serious attention at a national level. In 2020, the UK Statistics Authority (UKSA) launched a new strategy, with ‘inclusive’ as one of five principles. The authority pledged to ensure that ‘our statistics and our workforce reflect the experiences of everyone in our society so that everyone counts, and is counted, and no one is forgotten.’

At the same time, an independent taskforce was established to recommend how best to make a step-change in the inclusivity of UK data and evidence. In its final report, the taskforce highlights barriers to participation in traditional data collection exercises, such as the burden of repeated requests to certain population groups, lack of time, and perceptions of lack of benefit in taking part.

The report also suggests that both qualitative and quantitative data could be used where appropriate to provide a more comprehensive understanding of lived experiences - which can then be used to make more inclusive policies.

Collecting alternative or external web data, and drawing insight from it carefully and responsibly, isn’t a quick fix for these challenges. But it does offer those responsible for making data more inclusive something ‘radical’ - another notable, core principle in the UKSA’s future strategy.

Data collection methods

Unstructured web data won’t replace traditional and established data gathering methods. But harnessing it could provide a good pathway to more inclusivity, complementing existing approaches - similar to efforts by the Office for National Statistics, which is trialling the use of scraped web data to produce more representative consumer price statistics.

However, we should also acknowledge that technology can be a barrier as well as an opportunity in terms of realising a more inclusive world. The taskforce notes from its consultations that some populations may have, unintentionally, become more excluded due to a lack of digital skills. This may have been exacerbated by surveys moving to online platforms during the pandemic, which has in turn affected levels of participation - and possibly excluded people already under-represented or disadvantaged in some way.

Inclusive data for ESG

Diverse data from alternative sources is increasingly used every day by investors, insurers, and businesses to assess the environmental, social and governance (ESG) impacts of their decisions and actions.

UK-based respondents to a 2021 Vanson Bourne survey for Bright Data said that, on average, two-thirds (67%) of their organisation’s investment decisions are impacted by ESG factors. And 73% of survey respondents from organisations with 250-1000 employees stated they would definitely change business practices harmful to society, even if it did not make commercial sense.

To inform big decisions on investments and future strategies, which must consider wider societal impact, decision makers are now demanding better data. This is a call that alternative datasets, from the aggregation of large amounts of unstructured public web information, are starting to answer.

The argument runs that the more complete and representative of society that data is, the more responsible the decisions will be, and the better the outcomes will be for a more equitable world. Such data can help avoid bad decisions that could expose some communities disproportionately to climate change, negatively impact a population’s health, or make a section of society poorer.

Golden opportunity to collect better data

As the volume of unstructured data continues to grow at an unprecedented rate, an abundance of public web information gives us a chance to harness available information in a smart, new, and imaginative way.

We have a golden opportunity to embrace alternative data as part of an ambitious National Data Strategy, to use it responsibly and carefully (see the ‘Ethical dimensions’ box below), to plug our knowledge gaps, increase depth of understanding, hear different voices - and make our world fundamentally more inclusive.

Case study: data-driven diversity in the workforce

Bright Data’s public data collection tools are being used in the United States to gather and source information from talent networks - and help companies hire diverse candidates.

Mathison works with organisations and leading brands to help them create a workforce that is equally representative and inclusive of people of all races and ethnicities, LGBTQIA+ people (lesbian, gay, bisexual, transgender, questioning, intersex and asexual), people with disabilities, veterans, and immigrants.

It pulls together hundreds of inclusive talent networks and uses AI to help employers find candidates for their most important roles. By embracing unstructured public data, Mathison has used Bright Data’s tools to more efficiently and effectively gather and source information from diversity talent networks, replacing work that was previously done manually.

‘Without the technology, we’d be forced to build and maintain datasets manually every time we partner with a new organisation, which would take time and resources away from our team’s ultimate goal of matching underrepresented talent with their dream jobs,’ said Dave Walsh, co-founder and CEO of Mathison.

Ethical dimensions

The amount of unstructured data we create as a digital society will only continue to grow - but we must collect and harness it in a way that is open, compliant, and responsible.

Research from the Open Data Institute suggests that people may be happy with their data being used to benefit society, but they may not want that data being used to assist the investment decisions of hedge funds, for example.

Transparency and openness are key if we are to work with citizens and businesses to build a level of public understanding and trust in data.

Greater familiarity with data will be important, moving it away from being abstract and technical to something the lay person is comfortable with and better understands. It’s here that the work of organisations like The Data Literacy Project will be increasingly important if we are to see data used effectively and as a force for good.

To find out more visit the Bright Initiative website.