A knotted pipeline

A knotted pipeline

Ada Lovelace Institute
Data-driven systems and inequalities in health and social care
Reading time: 107 minutes
Footnotes
Executive summary
Every day, the health and social system across the UK seeks to provide medical support, care and treatment to tens of millions of UK residents. This system comprises a complex network of devolved public health and national health bodies, such as NHS England and Public Health Scotland, and includes partnerships with private providers of health technologies, hardware and digital services.
Throughout its history, a major challenge the UK health and social care system has faced is addressing health inequalities: avoidable, unfair and systematic differences in health outcomes that exist between different demographics. 1
It remains a reality that health and wellbeing in the UK are experienced differently across different socioeconomic, ethnic, geographic and age. 2 This was evidenced repeatedly during the COVID-19 pandemic, which showed how data-driven systems promise 3 – and at times demonstrably provide – better health and wellbeing outcomes. But they also risk exacerbating existing inequalities in health, which can lead to some members of society experiencing harm and losing out on the benefits of health technologies.
Some of the ways these inequalities arise are well known, such as the lack of representativeness of – and the bias in – data used in analytics and AI, and through differential access to health technology.  But other ways are less obvious.
As data-driven systems become the new norm in the delivery of health and social care, it is vital to fully understand why and how complex data-driven systems can sometimes lead to inequitable outcomes.
Through the period of the COVID-19 pandemic, the adoption of data-driven systems has accelerated, influencing resource allocation and public health measures. New technologies range from real-time public health updates, like the NHS-developed COVID-19 dashboard that provided dynamic data about infections and vaccine coverage in the UK, to novel forms of healthcare delivery, like the use of Dr Doctor to manage appointments for vaccine distribution. 4
Data-driven systems can both alleviate and entrench inequalities – but the latter may be more likely when people’s needs are sidelined in service of an overreliance on data. Despite the positive contribution that some data-driven systems have made to the COVID-19 response, there remains very little evidence about the ways in which they may be interacting with, and impacting on, longer term socioeconomic inequalities in health.
This report is not intended as a critique of the many concerted efforts to address impacts from the COVID-19 pandemic, or of measures that were deployed in response to a rapidly changing public health emergency. Rather, the evidence in this report is intended to inform understanding of how to use data-driven technologies to achieve equity in future outcomes.
Scope
The goal of this report is to describe the complex interplay between data and inequalities in the across the UK. Acknowledging the potential for data-driven technologies to improve health and social care outcomes, this report scrutinises the ‘pipelines’ of data that power health technologies, identifying trends, approaches or limitations in data and data use that might undermine the beneficial outcomes sought.
While the evidence in this report comes from examples of data-driven systems in the context of the COVID-19 pandemic, its findings are relevant to the ongoing efforts to ensure that data-driven technologies deliver equal benefits for everyone. Indeed, careful consideration of the issues raised in this report will be crucial for ensuring that the increased significance of data-driven systems in the health and social care system across the UK do not compound existing inequalities.
In the research that underpins this report, we examined the mechanisms that impact on health and social inequalities as being sociotechnical in nature, meaning that they involved both people and their use of technology and infrastructures, rather than just the technology or infrastructure alone.
The term ‘data-driven systems’ is used to refer to technologies like data dashboards and algorithms, and the infrastructure they rely on to store, process and use data. But data-driven systems are more than just the technology: they also include the people who develop, design and use these technologies – engineers, product teams, clinicians, hospital administration staff and others – and the insights they produce.
At the top level, this report finds that addressing health inequalities in data-driven systems will require those involved in designing, developing, commissioning and using data and data-driven systems across the health and social care system to take a people-centred approach to data. This means they must always consider the context in which data is collected, used and understood, and consider who is involved in these decisions.
The pipeline refers to the sourcing, flow, curation and sharing of data in a health and social care system between government departments, local authorities, clinicians, public health officials and other actors, including private organisations. Data-driven systems rely on this pipeline but are often hampered by what can be seen as ‘knots,’ which disrupt how well data-driven systems work in reality.
‘Really it feels very like data plumbing: there’s lots of pipelines and repositories […] I mean it feels like we built more spaghetti, a little bit […] You know it’s quite challenging, when you want to validate the data quality stuff, I think it took somebody two weeks to understand the flows of the day to [sign] them off.’
Product owner
Understanding data-driven inequalities as knots in a pipeline
The UK Government’s Data Saves Lives paper emphasises the importance of data-driven technologies in transforming health and social care in England, including a more seamless flow of information between different health and social care organisations. 5
To be ‘successful,’ a data-driven system would rely on a flow of information from source to design and deployment that is free from unnecessary obstacles and is able to improve health and social care for all.
This report contends that the ‘smooth’ data pipeline is a hypothetical ideal that does not reflect the complexity and imperfection of reality. If data is to continue to be useful for healthcare decision-making, now is the time to recognise that data-driven technologies in health rely on a complex ‘knotted pipeline’ of information. 6
Technological solutions that do not consider the significance of individual decision-making – from data collection, data quality, curation and standardisation methods, to design and deployment of interventions – will increase local, national and global inequalities. 7  
The role of private and public partnerships will need to be considered, so that they are conducted in ways that engage diverse communities and enable open communication between local and national bodies, even if this is contrary to what the technology providers assume is the right direction.
In this report, we show that when conversations are not aligned, data-driven systems limit the ability of practitioners to use data to address inequalities.
People working within data-driven systems must recognise and respond to the knots in the pipeline in order to use data effectively, see its limitations, and enact the infrastructural and cultural changes needed to address inequalities.
This report sets out to make visible the complex data pipelines and decisions that interact to create downstream impacts on people’s lives.
The knots in the pipeline – a summary
Inequalities are poorly understood, described and measured
This knot describes how people involved in building data-driven systems may have a poor understanding of health inequalities, and that there is a general lack of clarity around whose responsibility it is to consider bias and inequality in system design.
Tension between local and national priorities for design and delivery of data-driven systems
This knot looks at how differing approaches to the design and delivery of data-driven systems are rooted in political priorities and power dynamics. This may compromise the quality of insights by overlooking local needs and knowledge in service of national goals, where those local insights may be vital for flagging and mitigating inequalities.
Data quality and diverging inferences
This knot addresses the fundamental issue of data quality – that having poor quality or incomplete data can cause unequal outcomes in downstream uses. It covers the challenge with time-delayed or out-of-date data, and the fact that not everyone is well represented in datasets used to make decisions about health and social care.
Data curation and lack of ability to recognise inequalities
This knot considers the pitfalls of data curation – when it is done in a top-down way that flattens out important nuance, and when coding and categorisation is done poorly. There is a tension between easily disseminated and digestible data visualisations, and the need for disaggregated data for teams to extract meaningful insights around inequalities or notice where there are gaps.
Data as a proxy for truths and reality
This knot looks at what happens when people trust too much in aggregated datasets and visualisations to provide an accurate picture of reality. It suggests that while proxy data and combining different datasets can provide some context, it is still important to interrogate the ways in which lived experiences might differ from what’s being presented, and how interventions should be adjusted to achieve equitable outcomes.
Lack of historical or situational knowledge
This knot highlights the importance of multidisciplinary knowledge in proper data analysis. Data that is devoid of local and historical context can lead to misinterpretation and poorly targeted interventions, with some demographics being overlooked or incorrectly categorised.
Creation of systemic unequal impacts
This section takes an overarching view of the knotted pipeline, and is crucial to understanding how the work done across different parts of the pipeline come together to have observable impacts for health inequalities.
It describes some of the ways this has happened over the pandemic, from marginalised communities being excluded from services targeted through algorithmic assessments, to the use of genomics data to explain health outcomes while obscuring important historical and social context. This section reminds us that, although certain knots may be more relevant for some than others, it is important to consider the pipeline as a whole to see how addressing these knots can help mitigate inequalities experiences across health and social care in the UK.
How to read this report
This report explores and describes a system-based approach to the complex question of how inequalities can be produced by data-driven healthcare systems. It does not make direct recommendations but instead, invites readers to reflect on their own capacities and roles in any healthcare data-driven system and the decisions they make daily that relate to that system or its use of data.
The report identifies barriers, or ‘knots’ in the system, which combine theoretical analysis and abstract concepts with real practices and challenges.
The knots have multiple relationships to each other, so there is no single start or end point, and no definitive order of severity or importance. We invite you to move through the pipeline to see how the individual knots feed into the overall impact on inequalities, and work your way through the knots to the end point, which describes the impact of each of the knots on the entire health and social care system.
Readers with a role in understanding and improving the health and social care system could start with the following:
If you are a commissioner of products or systems, working at the local or national level, inequalities are poorly understood.
This knot looks at the overall issue of definitions and approaches to deploying data-driven systems and provides a way to think about how data-driven systems could be impacting, and be impacted by, the range and scale of inequalities
If you are a policymaker, making decisions about health priorities for people, places or populations, 8 start with the knot describing the misalignment of local and national approaches to the delivery of systems. You can also start from this knot as a commissioner.
This knot addresses concerns about conflicting priorities between local and national bodies – enabling flexibility for local needs while inputting into national goals.
If you are responsible for data collection practices, coding, software and standards-setting, and analysis, as a system engineer or product designer (including as a private contractor) begin with the knot on data curation.
In this knot, you will read about how cleaning up data can enhance or impinge on the ability of data-driven system insights to both produce better outcomes and also reduce inequalities.
Similarly, irrespective of being publicly or privately commissioned, if you work as a product or team lead collecting or linking data, or as a researcher or data analyst, then the knot that describes data quality will be of interest.
This knot describes specific data quality issues related to how data is collected, the sources of data and how the interpretation of variables can lead to different inferences, thereby creating downstream inequalities if the assumptions in those data and inferences are not addressed.
Where your role requires you to create or analyse data systems, commission services or recommend interventions, the knots on what happens when we take ‘data as proxies for truths and realities’ and ‘the lack of historical or situational knowledge’ will be of interest. These knots apply to everyone who interacts with data-driven systems, and can be seen as a way of thinking about the problem and drawing links between seemingly abstract ideas to quantifiable or visible outcomes.
These knots include the concerns around mis-profiling or missing individuals in the data, due to the ways data collection has so far been organised, and how such oversights might leave some communities experiencing worse outcomes than others.
Finally, the section that describes the creation of systemic unequal impacts will be of interest to everyone, as this highlights all the ways in which individual knots align and interact to create and sustain inequalities in health.
As you start from a particular knot, we recommend moving through the remaining knots, reflecting on how one knot feeds into the next, and how your own daily work can culminate into much larger effects on communities.
Introduction and context
Data-driven systems are becoming ever more central to health and social care in the UK, and have the power to reduce or compound health inequalities in new ways.
This trend has accelerated during COVID-19, the first pandemic of the ‘algorithmic age’, as solutions to the global health crisis incorporated data-driven approaches at scale, bringing new opportunities and risks.
A focus on data-driven systems in health has continued beyond the pandemic response, with ‘digitally enabled healthcare’ a stated priority, and ‘better use of data’ touted as central to improving population health and tackling unequal outcomes. 5
Lessons from COVID-19
The data-driven technologies adopted and accelerated over the course of the pandemic  included predictive analytics to model the spread of the virus, apps to track and trace contacts or monitor movement, and algorithmic approaches to staging vaccine distribution. Data sharing also changed in response to the pandemic, as seen in prioritisation schemes 10 for those at increased risk of infection or illness, 11 and the increase in the use of AI 12 to manage patient triage and diagnoses.
All of these activities relied on data and the ability to bring sometimes disparate data sources into an analysable whole. This involved complex systems of people – clinicians, public health officials, researchers and developers of healthcare technologies, among others – who collected, cleaned, analysed and drew insights from this data. It also relied on data infrastructure, such as Trusted Research Environments (TREs) 13 or data dashboards that collated data for sharing. 
Social and health inequalities have been a prominent theme throughout the pandemic. Ethnic and geographic differences have been pronounced in COVID-19 mortality rates: the Health Foundation’s 2021 COVID-19 impact inquiry established that those younger than 65 in the poorest 10% of areas in England were almost four times more likely to die from COVID-19 than those in the richest. 14 In the first year of the pandemic, Black and South Asian people were twice as likely to die of COVID-19 than white people, taking into account differences in age and sex. 15
Did the very public health measures that were used to manage the pandemic result in worse health outcomes for some members of society? Inequalities in health are not new, but the COVID-19 pandemic and some measures taken to mitigate it have led to creation of new vulnerabilities in society, leading to the exacerbation of unequal outcomes in health and social care, and setting the stage for future challenges.
Data-driven systems and negative outcomes in health and social care
There is a high chance that the historic and existing inequalities that arise through data-driven systems are going to worsen with the proliferation of new technologies, 16 and there is a need to further understand the intersectional ways in which inequalities are experienced in different contexts.
It is at the nexus of factors such as digital connectivity as a new determinant of health, 17 the increasing turn to digital health activities through video and other remote consultations, 18 and data-driven strategies in health and social care settings 5 that concerns around inequalities can be addressed. To make sense of this, it is important to clarify how top-down health and care system structures interact with bottom-up societal mechanisms.
The entanglement of data and data-driven systems and inequalities is underpinned, in part, by how we collect data. As the UK Statistics Authority reported, there is important work to be done to collect better data in ways that represent everyone in society. 20 Beyond the pandemic, interventions stemming from this collection of data will continue.
This research is designed to be useful to commissioners, policymakers, technology designers and developers, researchers and data analysts working with data-driven systems in health and social care. Each of the knots highlights sections of practice and decision-making that should prompt those in positions of power and influence to consider how best to commission and design systems that produce more equitable outcomes.
A partnership to investigate data-driven inequalities
The Ada Lovelace Institute and the Health Foundation have been working in partnership 21 to explore how the accelerated adoption of data-driven technologies and systems during the pandemic may have affected inequalities, and the ramifications of this for the future of healthcare – beyond the pandemic. The aims of the partnership have been:
To develop the evidence base around the interaction between data-driven systems and inequalities in health.
To build a shared understanding of what action is needed to reduce inequalities in health, including how to design data-driven systems in ways that can reduce inequalities and improve health outcomes.
According to early pandemic research from the Health Foundation, factors such as being in poverty influenced one’s health status to the extent that ‘people in the bottom 40% of the income distribution are almost twice as likely to report poor health than those in the top 20%.’ 22 The research concluded that some interventions deployed to manage the pandemic would mean a greater burden on health for those in those lower income groups. 
Evidence from Public Health England showed that people from minority communities were also disproportionately being impacted by COVID-19, 23 while those who were at the frontline of health service provision also faced greater risk of adverse health outcomes. 24  
With the many ways that inequalities can be explored, the challenge is to show how contextual factors give rise to intersectional inequalities that lead to particular outcomes.
The Ada Lovelace Institute’s digital attitudes survey – the first activity of the partnership, conducted between January and February 2021 – highlighted further causes of inequalities that come not only from health datasets, but from the ability to participate in new health technologies. 25
Results showed that, as not everyone was sufficiently digitally connected, comfortable with or able to use digital technologies, the deployment of these technologies was creating a digital divide observable through several factors. For instance, those on lower incomes (defined as under £20,000/year) were less likely to participate in new health technologies, were less likely to participate in new health technologies, supporting the conclusions made by the Health Foundation on the impact of poverty and adverse health outcomes.
In Ada’s Data divide report, it emerged that those who were on lower incomes were also very likely to be at higher risk of contracting COVID-19, while those who self-identified as being from Black, Asian and other minority ethnic backgrounds were likely to use contact tracing, symptom tracking or mental wellbeing apps at a lower frequency than those self-identifying as white.
The partnership has undertaken research across three broad workstreams (a landscape review, public engagement through lived experience research and a case study exploring a particular data-driven technology), all looking at data-driven systems and potential impacts on inequalities in multifaceted ways.
The objectives and findings of this landscape review form only one piece of evidence, to be read with upcoming outputs from the other workstreams.
The landscape review aimed to map the mechanisms by which data-driven systems impact and interact with social and health inequalities and was informed by the following objectives:
To describe the mechanisms through which pandemic data-driven systems influenced existing health and social inequalities.
To help develop future recommendations that could influence the design and deployment of more equitable systems.
Evidence synthesis
The evidence of increasing inequality within patient and population groups during the pandemic – despite the benefits of data-driven insights – produced a real challenge to routine uses of data.
As systems are designed and built, it can be difficult to recognise how existing biases and assumptions that may negatively impact communities are embedded into routine decisions about what data to use and how to use it. In fact, data-driven systems that were deployed during the pandemic were not always built or deployed with societal difference (e.g. income and deprivation, access to family or community support or digital connection) in mind.
This is not to say that developers of data-driven technologies have not given thought to addressing societal differences, but the way that information has been used has sometimes created more disadvantage and inequality, if not between individuals, then between geographical areas. To describe inequalities and (specifically) any negative impacts emerging from data-driven systems, we needed to ask whether and how:
inequalities were defined and then measured in the use of data-driven systems
rapid response efforts in the pandemic inadvertently led to new or increased inequalities
using certain data (e.g. health and/or social care data), and data practices (such as standardisation of categories at local or national levels) have affected inequalities.
Responding to these questions helped us to understand how inequalities were mitigated or perpetuated through data and technologies.
A complex health system
Based on desk research and stakeholder interviews, we describe data-driven systems in health and social care as relying on a complex ‘knotted pipeline,’ whereby the knots are pinch points that must be addressed in order for data-driven systems to work without perpetuating inequalities.
We highlight three key issues which show the complexity of the health and social care system across the UK, how the knots in the pipeline come to be, and why it can be difficult to draw clear or linear links between the use of a data-driven technology and an impact on inequalities.
There are numerous data flows between different players in the health and social care system These impact on:
categorisations and coding of data, where there is often a lack of universal meaning or approaches to those codification
data owners, who often find themselves working across multiple systems with unclear responsibilities relating to the ownership of their data and roles within the broader data-driven ecosystem
who designates and takes responsibility for the function of the health and social care system’s numerous tasks. This makes it difficult to see how different decisions impact the overall function of a data-driven system, and can result in people duplicating decisions, or assuming someone else is responsible for those decisions.
The ability of a data-driven system to inform health interventions without exacerbating inequalities, and ideally even to address those inequalities, depends on decisions about how data is used. This can be seen through:
Data transfers and transformations – where decisions about whether data analysed at the local level needs to be ‘cleaned up’ for it to be shared and usable on a national scale. In transforming data, or standardising it, some local context vital to understanding inequalities may be lost.
The type of analytic platforms used – in responding to the pandemic at pace, it was observed that analyses done on already existing but inflexible software and applications meant that it was difficult or impossible to adapt the data to reflect emerging concerns. Insights can only be derived from what is available. On the other hand, if analytic capabilities are flexible and build from the ground up (with local data, for example) then they may be more likely to adapt to emerging needs.
Data pipelines have numerous pinch points, which we conceive as ‘knots’, where data does not, in fact, flow how it was intended or envisioned to. These appear in many of the infrastructures within the health and social care system.
When a data-driven system relies on data access, the interoperability of data systems becomes crucial. This means that, while keeping to data protection and information governance good practice, those analysing data must be able to do so across multiple systems. Interoperability can reduce the need to standardise in a top-down way (which risks losing nuance in the data – important for recognising inequalities). But that still depends on a higher-level universal understanding of what different data categories and variables mean, so that detailed datasets can be compared, even when their manner of collection differs.
Counter to narratives around the objectivity, efficiency and interoperability of data, these knots lead to inefficiencies, biases and inconsistencies in data practices, with potential downstream impacts on health and wellbeing inequalities. The result is a situation where the promises of data-driven systems are at odds with reality.
Failures to address the points above hamper the ability to prioritise and deploy data-driven technologies without harming communities or individuals.
We acknowledge that working with the metaphor of data as ‘flows’ through a system has limitations, 26 when the reality is that the flows occur in a constant feedback looping mechanism.
Case-studies
Following are descriptions of six data-driven systems that formed an important part of the evidence base for this research. The critique of these systems underpins the discussions of the knots, including of diversity of data, missing data quality, incomplete data sets, selective deployment or uptake of services, lack of contextual information or use of proxy data to build the systems or generate insights.
As the pandemic evolved, available information about the specific technologies changed, so here we present only overviews of these technologies, starting each with how we categorised them at the start of the research in the summer of 2021.
OpenSAFELY
A novel approach to establishing a Trusted Research Environment (TRE) to enable analysis of NHS patient records. It was built in direct response to COVID-19 as a technology platform that enabled millions of data points to be rapidly analysed to understand the evolution of the pandemic and health-related outcomes. It has since been used for near-real-time rapid analysis of health records, including analysis of poor outcomes from COVID-19 by ethnicity or by existing health conditions. These analyses have been used to influence the response to the coronavirus pandemic, for example supporting decisions about vaccine prioritisation. It enables research to be translated into policy and practice, and as of the start of July 2022, claimed a coverage of 58 million NHS patient records. 27
Genomics England and GenOMICC
Genomics England was an existing platform set up to analyse 100,000 genomes as part of the UK’s genomic pathway into the NHS. As part of the pandemic response, Genomics England accelerated its research by linking existing genomic and health data from UK Biobank and their related NHS patient records to conduct genomic analyses for COVID-specific research. This research also included recruiting around 35,000 people variously affected by COVID-19 and related analyses with those who did not have the virus. 28 The results from the research were used to inform treatment and interventions, and to predict which patients were likely to become seriously ill. 29 It is an example of another use of a TRE that was accelerated for research and clinical purposes. 
NHS Digital and Public Health England – COVID-19 Dashboards and Early Warning Systems
These dashboards were focused on the production of near real-time data flows, collating and mapping across different data points to produce views of vaccine coverage and COVID-19 infections by different regions and demographics. This influenced insights and interventions, notably the R-number, vaccine rollouts and national and local lockdowns; but was also used to predict and monitor potential spread. 30 Both dashboards predominantly showed data across England, as other devolved nations had their own coverage.
Q-Covid Algorithm and the Shielded Patient List (SPL)
This was a use of large-scale data analysis, bringing in insights from research on patient records, created using specific methodologies and algorithms to identify patients at high risk of being ill from COVID-19. The SPL ran from March 2020 to September 2021 and was created by NHS Digital (in England). 31 It influenced initial guidance for certain people to stay at home. Information was shared with various organisations such as local authorities, DHSC and those providing mental health services. In time, this data was shared with the University of Oxford, who created and updated the clinical tool to create a risk-prediction tool called QCovid. This became a clinical decision-making tool that was deployed in England via NHS Digital and influenced national policy on shielding. Both these interventions were built using various types of data. 32
Dr Doctor
The company’s innovative data-driven technology on scheduled appointments was one of the winners of Round 1 of AI in Health and Care Awards in September 2020. 33 It currently acts as private industry AI-based tool selectively used by different NHS Trusts for patient management including bed and surgery waiting lists and uses, in part, app-based patient-level data that allows direct communication between the hospital and the patient. Dr Doctor’s technologies have been used to manage local vaccinations and claims to predict who is likely to attend appointments in the future, 34 providing the ability to intervene for specific demographic populations.
Babylon Health – COVID-19 Care Assistant, Ask A&E  and GP at Hand
Babylon Health 35 was a pre-pandemic, digital-first private and NHS healthcare provider, whose remit accelerated and expanded in response to COVID-19. It provided digital triage services (Ask A&E) to support triage into some NHS Trusts, 36 COVID-specific guidance 37 and increasingly used AI tools in the provision of its systems. Babylon Health’s technologies have been assessed and evaluated by different authorities and – as with Dr Doctor – there have been concerns raised around selective data coverage and uptake, which may influence who benefits from those technologies.
Exploring the knots
The metaphor of a knotted pipeline is intended to show the difference between the imagined ideal of a data-driven system and the reality on the ground.
The knotted pipeline makes visible the messy inner workings of the data pipeline. This includes decisions on which data, practitioners, priorities and metrics of success also ‘come in’ at various points in the pipeline. These decisions matter because inequalities in health and care arise when those interactions do not start with an approach that embeds or builds equity by design.
This paper describes these knots as pinch points that prevent data-driven systems from working as well as they should, which helps us explore their interaction with health and social outcomes.
It does not provide any explicit in-depth critique of issues within data collection, data curation or data availability itself, but highlights how decisions around those issues interact while a data-driven system is being designed or deployed, to create potential negative outcomes. Organisations such as the UK Statistics Authority, 38 the Royal Society, 39 and the National Data Guardian’s office 40 have covered many aspects of good data use, including on data collection, analysis and transparency. This report adds to those analyses.
Figure 1: The knotted pipeline
This review is a desk-based scoping based on publicly available documents about data-driven systems in health and social care, with additional information shared by technology organisations,  supplemented by stakeholder interviews. The research was conducted in two concurrent steps – firstly, a document analysis and then discussion with stakeholders (see: Methodology).
Throughout the review, we use direct and paraphrased quotations from our stakeholder interviews, which spanned roles across public health, data analytics, business intelligence, research, digital product development, NHS commissioning, local councils and parliament.
Inequalities are poorly understood, described and measured
People involved in building data-driven systems may have a poor understanding of health inequalities, which is compounded by a lack of clarity on whose responsibility it is to consider bias and inequalities when designing these systems.  
Although most were aware of inequalities in data-driven systems, they were not usually able to articulate either what they are, or their approaches to ensuring people do not experience further inequalities. This was apparent in the stakeholder interviews: many acknowledged the existence of inequalities in data-driven systems, but noted that in responding to the pandemic, these were not usually the driving reason or focus for deploying or adopting data-driven interventions.
Various limitations arose relating to the complexity of embedding inequalities in the design of data-driven systems. One issue raised by stakeholders was that decisions about whether inequalities had to be regarded as clinical safety concerns – and then how those concerns could feed into interventions – were not yet routinely embedded.
Stakeholders that saw the need to have inequalities as part of the safety design of technologies encountered barriers such as technology manufacturers and deployers pushing instead for risk assessments that were not informed by an equity outcome.
These kinds of factors can make it difficult to measure how data-driven systems may mitigate or exacerbate inequalities, as relevant metrics aren’t built in.
For NHS Digital’s COVID-19 Dashboard, one stakeholder highlighted the multiple dimensions to inequalities beyond ethnicity and suggested that some ways of ensuring data-driven systems did not engender bias was by bringing data analysts and clinical standards together in the design of such tools. An additional measure to mitigate biases or inequalities would be to design systems as use cases, so that insights could be embedded in the process. Having the team reflect the populations the data comes from was seen as another potential approach to minimising adverse consequences of data-driven systems..
A similar sentiment was echoed by the stakeholder working in a private digital health company, who suggested that data-driven technologies should have diversity and equity considerations included as part of the safety standards of the products themselves. They cited updates to medical device guidance, 41 where those designing and deploying tools were reminded to ensure their tools worked for the populations they were looking after. There also appeared to be a general diffusion of responsibilities for thinking about impacts on inequalities. For instance, on one digital product team this responsibility fell in practice to the commissioners, but was not codified in processes. This meant that unless individual
commissioners happened to ask certain questions, concerns around bias and inequality may not be considered and designed for.
It was clear that despite this, the team did recognise potential tensions around having those conversations across different bodies including the NHS, for example when considering how to preserve the privacy of individuals whose data was being analysed.
‘The expert on UK specific health equalities issues is the NHS and I think it would be unhealthy for [us] to sort of be dictating in a very assertive way exactly what the NHS was doing with the most sensitive components of their own data, which is not to say that I mean one of the things we do is we make my team and other data ethics specialists and lawyers and philosophers here available for consultation…to engage in conversations like you know, if you were to do this, how would we go about it?
‘What’s the most ‘low lift’ way we can do this incredibly important analysis? We’ve made the underlying functionality readily available in the software. But, as you’ve correctly outlined, there is a real tension in this. Well, we think you should do it that way, but it’s your software. I mean, it’s your account, it’s your data, you’re managing it, you’re responsible for the compliance. You’re responsible for the transparency in the privacy announcements and all of that, and it becomes difficult for us to say, you have to process your most sensitive data in this way.’
Digital product team
From the perspective of NHS developers, as well as commercial organisations working in this space, an understanding of inequalities was not explicit. The need to respond to the pandemic through rapid, in-the-moment interventions meant several respondents acknowledged that they did not consider inequalities in the various decisions about how to gather or deploy data-driven insights.
When inequalities were considered, they were broad-brush and included ‘regular’ issues such as language, ethnicity and sometimes accessibility. Usually this was in relation to making a system better, or to address a particular need (e.g. vaccine uptake, coronavirus in an area, usability of information).
‘How is that being embedded in those systems and how are you making sure you’re not building new biases into the system as a whole? That’s a super complicated question, and I think that’s where you need experts and research, and I think probably the worst world to be in is where you got the wrong people…(but you need) data ethics, strong information governance, good policy, or the rest of it, which enables you to take those questions you know to take those steps, even if we maybe don’t really answer, you can take some controlled experiments.’
‘But for me I mean, I think it will come to you know, by processing things like ethnicity, are you contributing to improving health outcomes? Hopefully the answer is yes. Are there any other biases you’re baking into the system? How can you address those? Best practice I guess, is evolving in that space.’
Product owner
As one of the teams interviewed suggested, having data ethicists and philosophers in these discussions might have value. This supports a call for a humanities-based agenda in health, 42 which will help interrogate ways in which inequalities are observed and mitigated in data-driven systems.
Implications
This research shows examples where inequalities are poorly understood by people commissioning or designing data-driven systems, and where it’s not clear whose responsibility this is. This may lead to the development and adoption of data-driven systems that, for example, overlook people without access to digital technologies, miss ways in which data collection practices are not accurate, and/or categorise people in ways that obscure nuance. All of which can perpetuate health inequalities.
Systems must be set up and designed with inequalities in mind, from understanding what is meant by inequalities, to carefully mapping out available data to consider whether it’s sufficient, and embedding success measures that aim to reduce inequalities or detect exclusions. Commissioners and designers should be thinking about this from the outset – use cases that address inequalities can be a starting point for building systems, and working closely with marginalised communities throughout the development process may also ensure key decisions are holistically thought through.
Clarity around whose responsibility this is, creating multidisciplinary teams and agreeing on where it sits in the commissioning/design process, will help identify gaps.
Tension between local and national priorities for design and delivery of data-driven systems
Differing approaches to design and delivery of data-driven systems at the national and local level are linked to political priorities and power dynamics, which can compromise the quality of insights.
There was evidence that more thought went into addressing health and social inequalities at the local (rather than national) level. As a result, there were sometimes tensions about the best way to achieve equity in the deployment of data-driven interventions.
‘First of all is that think things that we tend to get from a national level […] The assumption is always that we want aggregated data and we don’t. Yeah, this is our local population. The more granular, the better. It is better to have disaggregated data that we then aggregate than it is having aggregated data in the first place. Because I think that doesn’t allow us to kind of slice things in the way that we want to.
‘A really good example of that is that the vast majority of stuff that we get and have got starts out [as an] LSO 43 area […] then someone will come along to us and say can you tell us about vaccines by ethnic group or by gender, or you know whatever? No, I can’t because I’ve got it at LSO level. 
‘All I can tell you is that there is something coincident between high levels of vaccine uptake in that LSO with its ethnic makeup […] So immediately by taking that choice to aggregate things and to do it in the way that we always do it that immediately inhibits our ability to analyse that data very well at all.’
Analyst in a local authority
Top-down decisions about who could do what with the data platforms may therefore impact and influence inequalities. Not addressing these interactions can also lead to siloing of work, which leads to duplication of systems, and potentially results data-driven systems that are not fit for purpose.
‘But from a local perspective, from an analyst perspective, we just want to be part of a wider landscape that deals with data as a system asset and an asset that we can be collectively using. You know I don’t need, as a local analyst, I don’t need somebody to develop at all for me to be able to visualise some data. I need the data.
‘I think a ground-up approach to a lot of this stuff would have been a lot more successful than a top-down one because you know, we do naturally, you know as places you know, it’s local authority, networks, health networks and so on. We do naturally do things and share them. You know that that’s what we do.’
Analyst in a local authority
This ground-up approach was not always agreed by those working at national levels, as those in positions to decide how health data could be used preferred data to be standardised first, so that national priorities on how to use data could be set:
‘Again, there are many dashboards, so there are many data sources […] but on the whole they should be using the same national datasets that often flow via NHS Digital. Yeah, OK, so the same issues that you would have had with data quality are the same issues that they might have as well ’cause the pipelines are long and also go across generations.’
Product owner
These power dynamics and ways of working often bypassed or obscured local intelligence and conversations which can flag demographic information that is important for mitigating inequalities.
Some stakeholders suggested it was possible to find a workaround to get more useful data from local sources – but that it would still be subject to political drivers, and still needed a nuanced understanding of inequalities or priorities:
‘You can select, you can setup functionality within [software] that allows those regions to directly submit their own original data to begin correcting that inequality instead of going through the whole like, ok, this came from here, this came from here. So we need to rectify this source system, but that’s difficult for the local political reasons that made it an incomplete data source in the first place, so you can integrate these data from different sources to begin to rectify specific missing pieces of data.
‘You can bring in a much more diverse set of stakeholders to begin to interrogate, like how a model is actually situated within a specific sociotechnical component. Locally interrogate the model to see how it’s performing against a certain mathematical function. You want to ask a much broader range of questions about is the data that the model was trained on suitable for the ultimate societal changes that that model is inflicting?’
Privately commissioned data team
Ultimately, deciding what was commissioned became a political issue, and there is no evidence that anybody took responsibility for adapting the commissions to what emerged as the neediest areas. Some stakeholders stated that they were only there to bring the data together, or felt disempowered to bring in local priorities:
‘I suppose there’s a “cause there’s a government definition”. That’s what you do. You put the data out and someone else decides what the policy is.’
Dashboard team leader
‘We did have people coming from London on a regular basis to kind of give us advice and say, tell us what we should be doing, because actually I think public health, when it’s within, at a local level, they know where the problems are, and we were really trying to make sure that we were communicating really effectively with our communities.’
Local public health director
Tensions between local and national decisions on how data was centralised and its uses meant that the priority areas in the pandemic, for some, became subject to knee-jerk reactions rather than managed as well-thought out focus areas.
Such ways of working led to a lack of cohesive communication and feedback between local and national bodies, impacting the system’s overall ability to address inequalities in health.
This was notable in one participant’s suggestion that decisions were not always shared widely and at times they found out about specific areas of national interest only when they had been mentioned in the news, rather than being directly informed:
‘So this is how, early in the pandemic, that this happened. They got called into an office and somebody asked me to produce them a map of everybody who’s over 60 years, of where people who are over 60 years old were in our borough, and I said, why do you want this? And they said because it’s been on the news that people who are over 60 are more susceptible to COVID. I spent the next hour telling them why that wasn’t a very, you know, very effective way of targeting any kind of support.’
Local public health analyst
Implications
Tensions between local and national priorities (and national most often taking precedence) can adversely impact data-driven systems’ ability to address health inequalities, as local nuance and knowledge can fall between the gaps. The use of data standards to process and clean data to fit the needs of a national public health body can cause the loss of contextual information that is useful for local public health officials.
Commissioners and policymakers need to think about how to create partnerships that allow flexibility to address local needs while also feeding into national goals and enabling comparative data. Coordination and communication between the different levels, and improved resource allocation will be important going forward.
Data quality and diverging inferences
Having high-quality data that accurately reflects the reality of the people it represents is essential for producing meaningful insights. As the phrase used frequently in data processing ‘garbage in, garbage out’ suggests, having poor quality or incomplete data can cause unequal outcomes in downstream uses.
UK Statistics Authority reporting shows that there is a data gap around what information is available, which impacts who can benefit from health interventions. 44 When there are groups not represented adequately in data 45 or in genomics studies 46 – including ethnic minorities, migrant communities, and asylum seekers – any analyses will result in further inequalities, as shown by a report by the Nuffield Trust in their review of NHS data. 47
One major data quality issue we identified was missing data. For example, when ethnicity data is missing for patients, coverage of certain demographics becomes patchy. 48
When emerging insights are based on this incomplete data, it can lead to worse outcomes for those demographics. While local data tended to be better for responding to the pandemic, the usefulness of data-driven insights is still limited by these gaps. Data quality depends on who enters the data, what they choose to record, and even whether or not people have ever interacted with the health and social care system, or the local council.
People missing from datasets might include migrants who may not have GPs, or people with mental health issues, who may never have contacted their local councils for support. If someone does not exist in a database, there was no way a data-driven system can make them visible. Indeed, one of OpenSAFELY’s reports noted that:
‘Our population, though extremely large, may not be fully representative of the full eligible population: it does not include individuals not registered with a general practice; or the 4% of patients registered at practices not using TPP and EMIS. 49 Primary care records, whilst detailed and longitudinal, can be incomplete on certain patient characteristics. Occupation is generally not available in the health record, so we were unable to assess the eligibility in priority groups 1 or 2 where this is based on occupation. This means that we could not determine the appropriateness of vaccination for those in the ‘other’ group. For patients aged 80 and over, 36.2% and 0.8% had missing ethnicity or IMD 50 information respectively.’ 51
Time-delayed or out-of-date data was also a problem, particularly in the context of a quickly evolving pandemic situation.  With little on-the ground capacity, data was sometimes submitted late, or – due to pandemic pressures – wasn’t always processed or deliveredor time across all data collectors. 52 For example, the COVID-19 situation report was often affected by data quality issues driven by some NHS Trusts not submitting data in time to be included in data releases.
In another instance, vaccination data in one centre was described by a product owner as coming in once a week on a Friday, meaning the insights needed to procure new vaccinations might have assumed little activity in those areas, which would have impacted on vaccine deployment, so that potentially it was unmatched to actual vaccine need. 53   54
These gaps or delays in data can lead to divergent interpretations of that data, and create systems that do not comprehensively respond to inequalities on the ground. This was the case especially when looking at data gaps that were driven by resource issues on the analysis side, rather than by lack of or ability to respond to local needs.
As the data pipeline involves many transformations, deploying a diverse and multi-disciplinary team to data collection activities gives higher chances of picking up gaps in data and ensuring the pipeline does not lose vital contextual information.
Two participants from different organisations even suggested that as one builds their team, they should ensure the team members are representative of the community. This approach strives for more inclusivity in data insights, should recognise gaps and inconsistencies quicker, and generate more equitable outcomes.
Implications
Gaps in data and time-lagged data (especially in a health emergency) can make it difficult to respond to inequalities on the ground. Any interventions made based on poor quality data are at risk of being inaccurately targeted, misaligned with the situation on the ground, and disadvantaging people who aren’t represented in the data.
For product leads, researchers and data analysts, having diverse and multidisciplinary teams can help pick up gaps in data which would lead to inaccurate insights. Finding ways to involve community representation in gathering data, and being clear about the ways in which it will be used (so analysis and insights are specific to the situations being addressed), could help create more inclusive and accurate datasets.
Data curation and lack of ability to address inequalities
Data curation risks losing nuance in the data that may be important for flagging and tackling inequalities, especially if it’s done in a top-down way. Aggregated data and visualisations can obscure important information about raw datasets and their limitations or original purpose.
During the COVID-19 pandemic, decisions had to be made about how to use health data beyond the purposes of directly supporting patient care – for which it was initially gathered.
While data-driven systems had mainly been set up to address regular, ongoing healthcare, in a public health emergency it was crucial that these systems could respond at pace. Data was curated and cleaned in a way that new and multiple players could make use of it quickly, but this also meant that nuances in the data were flattened in favour of simplified aggregations or visualisations.
Alongside this were issues of poor coding and categorisation, and the fact that data curation was largely done in a top-down way, which could obscure important information about raw datasets and their limitations or original purpose.
In a time when new private companies were playing a prominent role in data curation and onward uses, scrutiny fell on the role of institutions such as Palantir and the widespread use of their software, Foundry. Their influence led to new demands for standardisation and cleaning of data in ways that not everyone in the pipeline found useful.
When it came to processing or decentralising data, local partners preferred to have and manage their own raw data instead of having data insights imposed on them by various other platforms through centralised curation and storage. This data curation was expressed by one of the as further compromising the standard of the data that already came from badly designed systems prior to the pandemic:
‘The quality of data that’s being supplied to us by all of the organisations that collected data in the COVID response has been appalling, not just data quality for the same reasons that we all know that we have with data quality in terms of completeness and coverage and things that people you know won’t give us. Can’t give us, whatever. We all accept that. That’s what happens in life, and we’ve all worked in data. What I can’t accept is data quality problems that just come from poor system design. And that’s what we’ve had. We’ve had more issues that are down to poor system design, then down to the classical things.’
Business intelligence lead
During the pandemic, health data was often repurposed and used in ways that differed from the original intent behind its collection. While data had previously being aggregated for cancer, ethnicity monitoring and other public health surveillance, the pressure to combine different types of data from different sources for pandemic responses provided a new challenge.
When data needed to be analysed for pandemic-related purposes it was revealed that, at scale,. For example, the Office for National Statistics (ONS) audited ethnicity data sources and found that ‘this initial stocktake of the ethnicity data has highlighted potential issues with the comparability and coherence of the data sources.’ 55
One participant mentioned having to choose between building platforms that gave them flexibility to input important information about inequalities, or selecting a ready-made product that limited the insights that could be generated for different populations.
Sometimes data technologies fuelled a digital-first approach to data collection, with the data then being funnelled to address specific needs like vaccine management, surgery prioritisation and follow-up care. For example, when Dr Doctor rolled out its elective recovery management support, it produced information to enable patients to participate, including how patients could provide additional information to support decision-making. 56 This involved patients scoring their pain or wellbeing against set criteria. 57   58
A question one could ask about data curation is whether it’s better to use ‘routinely collected’ health data (that is: the data generated, collected and stored by the NHS as part of the delivery of services) for pandemic responses, or specifically collecting data to respond to an ongoing need.
Collecting data in response to a specific need may sometimes have advantages for nearer real-time, specific, patient-supported use of data, but its success in mitigating potential inequalities requires platform developers to ensure their tools are accessible (e.g. on language and accessibility), and be aware of biases in data that occur when patient-reported measures are used, 59 to decide who gets what care and when.
If a system relies too much on patient-reported measures, it can lead to unintended outcomes and inequalities – all related to missing information because of the way data has been curated, and selective participation because of access/use differences with digital technologies. 60
Questions around the role of large institutions (including private companies) in data curation and analysis were also raised. When tools such as Dr Doctor and the Foundry platform by Palantir are focused on fast data availability and seamless data flows, the lived experience of potential patients gets flattened out. The way that data is curated does not always address those social factors, which are important in determining health and wellbeing – for example data about people’s housing, employment and digital exclusion as social determinants of health. 61   62
Decisions around data curation and platform development also affected what it was possible to do with the data, sometimes limiting flexibility to discover or accurately represent inequalities. In one case, a participant suggested that there were trade-offs that had to happen in choosing which tools to use:
‘On one hand we avoided analysis paralysis, but I think I would have probably made some slightly different technology choices… I don’t want to use [proprietary name] products for all of our open data dashboards. So I think there’s some value in doing some [analysis] built using [another product]. It’s much more scalable at volume… but [proprietary name product] is like a white label product.
‘So instead of building it yourself you take it off the shelf, plug in some data, make something really nice and push it out. But I’m buying the product. What I’m not doing is handwriting code to run my visualisations straight up on the data. So I think like technology decisions for data there is, I always think the first fork in the road you have to say is like where am I going to use [proprietary name products] but if you’re using them, you know you’re probably going to have slightly less functionality, slightly less flexibility, but you move a lot faster because you just take it off the shelf right now, and it’s more supportable, because when writing your own code, how do you support it? You know? How do you know, there’s a whole load of other complexities that come with that right? And it’s in that decision and that balance. Somebody give me six months, I might have gone a different [route].’
Product owner
Some participants appeared frustrated at the use of standardisation across the NHS’s data systems and the many pipelines where data was transformed, which made it difficult to see how the data had changed over time.
‘So immediately by taking that choice to aggregate things […] immediately inhibits our ability to analyse that data very well at all.’
Asked what change they could have made given lessons from using data-driven systems in a pandemic, one team responded:
‘Don’t panic, because I think a lot of the services that, you know, the faults with the services that we’ve described, have come from people panicking and thinking we have to put something out there really quickly and, yes, sometimes we have to do things quickly. But what we should do is scale down our aspiration of what we’re going to do in order to do that.
‘You know if we have to do stuff quickly, just publish raw data and then work out what we need to do next, that’s the second thing I would certainly want to reflect in a set of recommendations from anybody, particularly when I don’t know what the target of this is.
‘Have an agency within government that is responsible for data infrastructure and not just in this kind of situation. Because we don’t do that at the moment. You know, if the DHSC has an initiative to launch, they hire some people who build data infrastructure… we haven’t done that with digital services, they designed all of the services that I have picked particular flaws with so maybe they’re not ready to do it either, but you know, it’s not like we’re short of people, it’s one of the biggest assets of this country. To be honest with you in terms of competitive advantage over other countries, we’re not exploiting that.’
Local public health team
Implications
It is important to consider questions not only of who holds responsibility for data curation (e.g. national bodies like NHS England, or private companies like Palantir), but also how cleaning and curating data for quick and/or widespread use can impact on the quality of insights and outcomes.
System engineers and product designers can think about what might be lost in standardisation and visualisation of data, and ways to embed flexibility in platforms, for example allowing users to see raw data or include context-specific information.
Again, coordination between different players in the data ecosystem will be important going forward: discussion between data analysts and commissioners or designers could help raise issues of limited functionality for flagging or addressing health inequalities.
Data as a proxy for truths and reality
There is always a limit to how accurately data can represent lived experience. Using proxy data can help to infer or estimate something not covered in a dataset, but this isn’t a silver bullet, and it’s still importan

Images Powered by Shutterstock