A Guardian investigation has confirmed that confidential health and genetic data from UK Biobank (one of the world's largest medical research repositories, holding records on 500,000 British volunteers) has been exposed online on dozens of separate occasions. The exposures were not the result of an external attack. They were caused by approved researchers inadvertently publishing sensitive datasets to public platforms while complying with journal and funder requirements to share analysis code. UK Biobank holds genome sequences, hospital diagnoses, brain scans, blood samples, and lifestyle data on half a million people. Some of that data has been sitting in public repositories, accessible to anyone, for an undetermined period.
What Happened
UK Biobank grants access to its dataset to scientists at universities and private companies worldwide for approved research purposes. Until late 2024, researchers were permitted to download data directly onto their own systems. Academic journals and research funders increasingly require researchers to publish the code used to analyse large datasets alongside their papers, a transparency standard intended to make science reproducible.
In practice, researchers sometimes uploaded not just their analysis code but the underlying data files, or outputs derived from them, to public platforms such as GitHub. The Guardian investigation found these files had been posted online dozens of times. In at least one case, a single dataset contained millions of hospital diagnoses and associated dates for more than 400,000 participants.
To demonstrate the re-identification risk, the Guardian, with the consent of a UK Biobank volunteer, was able to locate what appeared to be that volunteer's extensive hospital diagnosis records using only their month and year of birth and details of a major surgery they had undergone. No names or addresses were required. The combination of quasi-identifiers in the exposed data was sufficient.
UK Biobank has contested the severity of the problem, stating that no identifying data such as names or addresses are provided to researchers, and that it has "never seen any evidence of any UK Biobank participant being re-identified." The institution tightened its data access model in late 2024, moving from direct download to a controlled access environment. However, the Guardian's investigation suggests exposures continued after this change, and the full scope of data that remains publicly accessible has not been confirmed.
What Was Taken
No single actor exfiltrated data in a targeted attack. The exposure occurred through inadvertent public posting by researchers. The data categories confirmed as exposed include:
- Hospital diagnosis records: ICD codes, diagnosis dates, and associated clinical events for more than 400,000 participants in a single dataset
- Genome sequences: UK Biobank holds full genome data on volunteers; the extent of genomic data exposure is not fully confirmed
- Medical history and lifestyle data: Biobank's dataset includes blood markers, physical measurements, mental health indicators, and lifestyle questionnaires
- Brain scan and imaging data: part of Biobank's data holdings; exposure status not confirmed for all categories
- GP records: the UK government extended Biobank's access to GP records last month, expanding the dataset's sensitivity going forward
The data does not include names or addresses as provided by Biobank, but the Guardian's re-identification demonstration confirms that quasi-identifiers within the dataset (birth month/year, surgical history, diagnosis patterns) are sufficient to identify individuals when cross-referenced against other available information.
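The way quasi-identifiers combine to single out individuals can be illustrated with a k-anonymity check. The following is a minimal sketch on entirely synthetic records: the field names (`birth_year`, `birth_month`, `procedure`, `procedure_year`) are hypothetical and do not reflect UK Biobank's schema, and the Guardian's actual method is not described in enough detail to reproduce. It counts how many records share each combination of quasi-identifier values; a group of size 1 means a record is unique on those fields.

```python
from collections import Counter

# Entirely synthetic records. Field names are hypothetical and are
# NOT taken from UK Biobank's actual schema.
RECORDS = [
    {"birth_year": 1956, "birth_month": 3, "procedure": "hip_replacement", "procedure_year": 2015},
    {"birth_year": 1956, "birth_month": 3, "procedure": "hip_replacement", "procedure_year": 2015},
    {"birth_year": 1956, "birth_month": 3, "procedure": "heart_bypass", "procedure_year": 2012},
    {"birth_year": 1961, "birth_month": 7, "procedure": "hip_replacement", "procedure_year": 2015},
    {"birth_year": 1961, "birth_month": 7, "procedure": "hip_replacement", "procedure_year": 2018},
]


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size; k == 1 means at least one record is unique."""
    return min(group_sizes(records, quasi_identifiers).values())


def unique_records(records, quasi_identifiers):
    """Return the records that no other record matches on the given fields."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] == 1]


if __name__ == "__main__":
    birth_only = ["birth_year", "birth_month"]
    birth_and_surgery = birth_only + ["procedure", "procedure_year"]
    print("k with birth month/year only:", k_anonymity(RECORDS, birth_only))
    print("k with surgery details added:", k_anonymity(RECORDS, birth_and_surgery))
    print("unique records:", len(unique_records(RECORDS, birth_and_surgery)))
```

In this toy dataset, birth month and year alone leave every record in a group of at least two, but adding a single surgical event makes three of the five records unique. Real datasets are far larger, yet each additional diagnosis or date narrows the groups in the same way.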
Why It Matters
This incident reframes the threat model for sensitive research data. The adversary is not a ransomware group or a nation-state. It is the research process itself, specifically the intersection of open science mandates and inadequate researcher training on data handling.
Biobank data is categorically different from standard PII. Genetic information is permanent and familial; a breach does not just affect the individual volunteer, it exposes relatives who never consented to participate. Medical diagnosis records carry insurance, employment, and social stigma implications that persist for a lifetime. The UK government extended GP record access to Biobank last month, expanding the sensitivity of the dataset at precisely the moment when its governance failures are becoming visible.
The re-identification risk is not theoretical. The Guardian demonstrated it with a consenting volunteer in minutes, using only birth month/year and a surgical event. As AI tools make cross-referencing large datasets faster and more accessible, the threshold for re-identification drops continuously. Data that was considered adequately anonymised under 2010 standards is not adequately anonymised in 2026.
The systemic issue is that open science mandates, which are legitimate and valuable, were implemented without corresponding controls on what researchers could and could not include in their public code repositories. Biobank approved researchers for data access but did not effectively govern what happened to that data once it left its systems. The dozens of separate exposures identified by the Guardian suggest this is not an isolated error but a structural failure in the researcher data governance model.
The Attack Technique
There was no attack. The exposure vector is researcher negligence compounded by institutional governance failure.
The mechanism: researchers complying with open science requirements uploaded analysis scripts to public repositories (GitHub and similar platforms) without adequately separating code from data. In some cases, derived data outputs, which retain enough statistical signal to enable re-identification, were included alongside the code. In others, raw data subsets appear to have been uploaded directly.
Contributing factors:
- Inadequate researcher training on the boundary between publishable code and sensitive data outputs
- No automated scanning of researcher code repositories before publication for data leakage
- Direct download model (in use until late 2024) gave researchers local copies of sensitive data with no technical controls on subsequent handling
- No systematic monitoring of public repositories for Biobank-derived data exposure
- Institutional denial: Biobank's public response focuses on the absence of names and addresses rather than the demonstrated re-identification risk, suggesting the governance review has not fully engaged with the actual threat model
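The missing pre-publication scan could take many forms. Below is a minimal sketch of one, intended to run locally before a repository is made public. It is an illustration under stated assumptions, not a tool that UK Biobank or any journal is known to use: the extension list, the size threshold, and the participant ID column names in `ID_COLUMNS` are all hypothetical and would need tuning for a real dataset.

```python
from pathlib import Path

# All three settings below are illustrative assumptions, not an official list.
DATA_SUFFIXES = {".csv", ".tsv", ".parquet", ".rds", ".rdata", ".dta", ".sav", ".xlsx"}
ID_COLUMNS = {"participant_id", "eid"}  # hypothetical participant ID column names
MAX_BYTES = 5 * 1024 * 1024  # flag anything over 5 MiB as unlikely to be code


def first_line_fields(path):
    """Return the lowercased fields of a text file's first line (comma or tab separated)."""
    try:
        with path.open("r", encoding="utf-8", errors="replace") as fh:
            header = fh.readline()
    except OSError:
        return set()
    for sep in (",", "\t"):
        header = header.replace(sep, " ")
    return {field.strip().strip('"').lower() for field in header.split()}


def scan_repo(root):
    """Return (relative_path, reasons) for every file that looks like data rather than code."""
    root = Path(root)
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        suffix = path.suffix.lower()
        reasons = []
        if suffix in DATA_SUFFIXES:
            reasons.append("data-like file extension")
        if path.stat().st_size > MAX_BYTES:
            reasons.append("large file")
        if suffix in {".csv", ".tsv", ".txt"} and first_line_fields(path) & ID_COLUMNS:
            reasons.append("header contains a participant ID column")
        if reasons:
            findings.append((str(path.relative_to(root)), reasons))
    return findings


if __name__ == "__main__":
    for name, reasons in scan_repo("."):
        print(f"{name}: {', '.join(reasons)}")
```

A check like this is deliberately crude. It catches the obvious case of a data file committed next to a script, but it would not detect derived outputs embedded in notebooks or in figures, so it complements rather than replaces human review.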
What Organizations Should Do
- Implement automated data leakage detection for research outputs. Any institution that grants access to sensitive datasets (medical, genomic, financial, or otherwise) should deploy automated scanning of public repositories for data derived from those datasets. Tools exist to detect statistical fingerprints of datasets in public code repos. This is not a solved problem but it is an addressable one.
- Separate code from data in open science workflows. Open science mandates require publication of analysis code; they do not require publication of the data used in that analysis. Enforce a hard technical separation: researchers publish code that references data via secure API or controlled access environment; raw data and derived outputs never touch public repositories. This requires both policy and tooling.
- Re-evaluate anonymisation standards against current re-identification capabilities. Data considered adequately anonymised five years ago may not meet the bar today. AI-assisted cross-referencing, the proliferation of public datasets, and social media footprints have collectively lowered the re-identification threshold. Conduct a re-identification risk assessment against your current anonymisation approach using contemporary tooling.
- Audit all historical researcher code publications for data leakage. If your institution has been granting data access to researchers who publish open science outputs, conduct a retrospective audit of public repositories associated with approved research projects. The Guardian found dozens of exposures; a targeted search of GitHub, Zenodo, OSF, and similar platforms for institution-identified data signatures is achievable and necessary.
- Move from download-based to query-based data access models. UK Biobank moved in this direction in late 2024; the controlled access environment prevents researchers from holding local copies of raw data. This architectural shift is the single most effective control against inadvertent researcher exposure. Institutions still operating download-based access models for sensitive research data should treat this as a priority remediation.
- Notify affected participants and conduct a formal re-identification assessment. UK Biobank's current posture, denying that re-identification has occurred while declining to conduct a systematic assessment, is not a defensible position under UK GDPR and the Data Protection Act 2018. A formal re-identification risk assessment conducted by an independent third party, with results disclosed to participants, is the appropriate response to a confirmed exposure of this scale.
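One small way to support the separation of code from data recommended above is for published analysis scripts to locate their data through configuration rather than through a path inside the repository. The sketch below assumes a hypothetical environment variable, `RESEARCH_DATA_DIR`, and a hypothetical helper, `resolve_data_dir`. It refuses to run when the configured data directory sits inside the repository that will be published.

```python
import os
from pathlib import Path


def resolve_data_dir(repo_root, env_var="RESEARCH_DATA_DIR"):
    """Return the data directory named by env_var, rejecting any path inside repo_root.

    Both the variable name and this helper are illustrative assumptions.
    """
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"Set {env_var} to a data directory outside the repository")
    data_dir = Path(raw).resolve()
    repo = Path(repo_root).resolve()
    # A data directory equal to, or nested under, the repository root could be
    # committed by accident, so treat it as a configuration error.
    if data_dir == repo or repo in data_dir.parents:
        raise ValueError(f"{data_dir} is inside the repository {repo}")
    return data_dir


if __name__ == "__main__":
    repo_root = Path(__file__).resolve().parent
    try:
        print("Reading data from", resolve_data_dir(repo_root))
    except (RuntimeError, ValueError) as err:
        print("Refusing to run:", err)
```

This does not stop a determined or careless user from copying files into the repository by hand. It only makes the safe layout the default one, which is why the policy and tooling measures listed above still matter.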