UK Biobank health data keeps ending up on GitHub
I'm a researcher studying privacy, and I started tracking the DMCA notices that UK Biobank sends to GitHub. So far I have tracked 110 notices, targeting 197 code repositories by 170 developers across the world.

The exposure of Biobank data on GitHub is only the latest in a long series of governance challenges for UK Biobank. (My colleague and I have an editorial in the BMJ about this: http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXe...). Today came the news that information on all half a million participants is listed for sale on Alibaba.

Looking at the takedown notices, we often see specific files being targeted rather than entire repositories (possibly to establish the copyright infringement required for a takedown notice; I'm not a copyright expert). It is clear, though, that they use DMCA notices only as a last resort, for GitHub users they cannot identify and who were likely never given access in the first place. A quarter of the targeted files are genetic/genomics data. Tabular data accounts for another large share and could contain phenotype or health records.
146 points by Cynddl - 36 comments
All 500,000 participants for sale on Alibaba...
And official response: https://www.ukbiobank.ac.uk/news/a-message-to-our-participan...
To me it seems rather naive to have done that.
After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.
If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.
Too late to do anything about it now though :(
Then there's the question of trust. You probably have friends you know not to tell certain secrets to, because they believe they get to pass your secrets on to people they trust. The further someone is from you, the less care they will take. Researchers have been lending the dataset in good faith to people they trust, but who probably didn't take the whole secrecy thing as seriously.
With 20k researchers this was inevitable. Factors like these need to be considered when deciding on what terms such a dataset is released.
Reckless harm prevention is the root of many evils.
I'm also amused (in a good way) by the fact that SAS isn't supported as an analysis platform...
And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...
> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.
> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.
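The "fail silently if disk space runs out" failure mode described above is easy to guard against before launching the unpack step. A minimal sketch in Python; the `ukbunpack` invocation and the 3x size margin are assumptions based on the quoted steps, not verified syntax:

```python
import os
import shutil
import subprocess

def ensure_free_space(path: str, required_bytes: int) -> None:
    """Fail loudly up front, instead of letting the unpack tools fail silently."""
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise RuntimeError(
            f"Only {free / 1e9:.1f} GB free at {path}, "
            f"need at least {required_bytes / 1e9:.1f} GB"
        )

def unpack(enc_file: str, key_file: str, workdir: str = ".") -> None:
    # Decrypted output can be much larger than the .enc file;
    # a 3x margin is a guess, tune it for your dataset.
    required = 3 * os.path.getsize(enc_file)
    ensure_free_space(workdir, required)
    # Hypothetical invocation; check the UK Biobank docs for the real arguments.
    subprocess.run(["ukbunpack", enc_file, key_file], check=True, cwd=workdir)
```

With `check=True`, a non-zero exit from `ukbunpack` raises immediately rather than being ignored by the surrounding script.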
https://biobank.ctsu.ox.ac.uk/crystal/download.cgi
I am aware of ~30 repositories that UK Biobank has asked GitHub to delete that can still be found elsewhere online. They know the site, they have managed to get data deleted from that site before, and yet the files are still there.
Enforcing Jupytext is a good adaptation: it gives you all the (arguably really nice) comfort of a notebook, plus the proper code practices of software engineering.
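For anyone unfamiliar with the practice: Jupytext can pair every notebook with a plain-text script that is diffable and reviewable in normal code review. A sketch of the repo-level config, assuming the standard `jupytext.toml` pairing option (check the Jupytext docs for your version):

```toml
# jupytext.toml at the repo root
# Pair each .ipynb with a .py "percent" script so changes show up in diffs
formats = "ipynb,py:percent"
```

The `.py` copy is what gets committed and reviewed; the notebook stays in sync automatically.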
It looks like they've identified the institutions, at least... but aren't naming them publicly for now. Will there be consequences? Will the responsible parties be identified and sanctioned beyond "having their access suspended"?
In the US, HHS wouldn't hesitate to name, shame, and impose sanctions with corrective action plans. I don't know much about how things work across the pond, but I'm fairly sure CMS PII is used in research far more often without leaks like these happening left and right.
If an 'anonymised' medical record says the person was born 6th September 1969, received treatment for a broken arm on 1 April 2004, and received a course of treatment in 2009 after catching the clap on holiday in Thailand - that's enough bits of information to uniquely identify me.
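Back-of-the-envelope, the "bits of information" framing checks out: singling out one person among ~67 million UK residents takes about 26 bits. A sketch with assumed, roughly-uniform cardinalities (all the specific numbers here are illustrative guesses):

```python
import math

def bits(cardinality: float) -> float:
    """Bits of identifying information from an attribute with this many
    roughly-equally-likely values (a simplifying assumption)."""
    return math.log2(cardinality)

# Rough, assumed cardinalities for the three facts in the example:
date_of_birth = bits(365 * 80)    # any birthdate in an ~80-year span, ~14.8 bits
treatment_date = bits(365 * 20)   # a treatment day in a 20-year window, ~12.8 bits
rare_detail = bits(1000)          # an unusual diagnosis-plus-circumstance, ~10 bits

total = date_of_birth + treatment_date + rare_detail
needed = bits(67_000_000)         # ~26 bits to single out one UK resident

print(f"{total:.1f} bits available vs {needed:.1f} bits needed")
```

Even with generous rounding down, three such facts comfortably exceed the ~26-bit threshold, which is why date-level medical histories resist anonymisation.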
And medical researchers are usually very big on "fully informed consent", so they can't gloss over that reality, hide it in fine print, or obfuscate it with flowery language. They usually have to make sure participants really understand what they're agreeing to.
It might still work out fine, of course - 95% of people's medical histories don't contain anything particularly embarrassing, so you might be able to get plenty of participants anyway.
Yeah, sorry about that
Unfortunately the sequence of treatments and locations are usually enough to identify someone, especially if it's a rarer condition.
Most researchers likely would want to summarize these data in a similar way anyway, so this works out nicely.
If you want such a project, you need a new project with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants it, though.
In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.
But what this illustrates to me is that researchers are just really careless, despite everything we make them agree to in data transfer agreements. It seems absurd to have little cubicles like this https://safepodnetwork.ac.uk/ (think Mission Impossible 1) but I do despair.