The challenge of de-identification in the health data world

Safeguarding sensitive health information whilst making it accessible for analysis is a delicate balancing act that researchers and institutions face. 

De-identification of health data is a potential solution: information is anonymised so it cannot be traced back to individuals, allowing data to be shared, stored and managed responsibly. 

However, de-identification is not a cure-all: health data remains a goldmine, and imperfect de-identification carries risks of its own. 

In the fight for health data privacy, legal frameworks - like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) - play a crucial role. These regulations require organisations to implement adequate security measures and de-identification processes - but these are not infallible. 


The risks of health data exposure

Personal health information (PHI) is a treasure trove of sensitive data, including medical history, treatments, and genetic information. Healthcare breaches can compromise individual privacy, cause financial losses, damage an institution's reputation, and significantly erode public trust in healthcare data security.

According to the Department of Health and Human Services, the US healthcare sector suffered roughly 295 breaches in the first half of 2023 alone, impacting over 39 million individuals.

Several high-profile cases of health data breaches have raised serious concerns, underscoring the urgency of robust privacy protection.

To balance the pursuit of data-driven healthcare insights with privacy protection, de-identification emerges as a viable solution.


De-identification methods and challenges

Anonymisation is the process of removing direct identifiers, such as names and addresses, from health data. Pseudonymisation replaces direct identifiers with unique codes or pseudonyms, adding a layer of protection while preserving the ability to link records belonging to the same individual. 
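As a rough illustration of pseudonymisation (not any specific vendor's implementation), direct identifiers can be replaced with stable pseudonyms via keyed hashing. The key name and record fields below are invented for the example; in practice the key would live in a separate key vault so data holders cannot reverse the mapping:

```python
import hashlib
import hmac

# Hypothetical secret key; in a real system this is stored separately
# from the data (e.g. in a key vault), never alongside the records.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable pseudonym via keyed hashing."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Invented example record.
record = {"name": "Jane Doe", "nhs_number": "943 476 5919", "diagnosis": "asthma"}

pseudonymised = {
    "patient_id": pseudonymise(record["nhs_number"]),  # same input -> same pseudonym
    "diagnosis": record["diagnosis"],                  # clinical fields kept for analysis
}
```

Because the same input always yields the same pseudonym, records for one patient can still be linked across datasets without exposing the underlying identifier.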

Neither is a sure solution: determined attackers may still find ways to associate pseudonyms with real identities, and even anonymised data can be re-identified through linkage attacks against other datasets. There have been real-life examples of data being de-anonymised by cross-referencing publicly available information. 

In 2019, Google and the University of Chicago were sued for misusing patient data. The suit claimed that, even though the data was de-identified, Google's expertise in AI and data mining made it possible to re-identify patients.

Studies have shown that 87% of Americans can be uniquely identified by their ZIP code, gender and date of birth. In some cases, researchers have found they could correctly re-identify 99.98% of individuals in anonymised datasets using only 15 demographic attributes, even when the dataset was incomplete. The same paper also successfully re-identified nearly all individuals de-identified with the HIPAA Safe Harbor method, with additional safeguards like sampling, randomisation and generalisation proving not to be foolproof.


The answer? Strengthen data privacy regulations for health data

Health data privacy regulations must address de-identification processes directly, while enabling data utility and data-sharing practices in a more secure way.

Data utility ensures valuable insights can be extracted from datasets - with more specific regulation, health data can be used and shared securely for research and analysis, without de-prioritising individuals’ personal privacy. Collaboration between healthcare organisations and research institutions is essential to develop industry-wide standards and best practices for de-identification, consistency and accountability.

As the volume of health data increases, so does the risk of re-identification through sophisticated techniques, such as machine learning algorithms. Privacy-by-design principles mean organisations can proactively address potential privacy risks and ensure compliance with regulations. 

Redaction is a crucial de-identification tool that encourages prioritising privacy throughout the data lifecycle. A more comprehensive, secure approach combines de-identification methods such as synthetic data generation, contextual de-identification, consistent hashing, and strong data access controls. Together, these further reduce the risk of re-identification, supporting a balanced approach to preserving individual privacy in health data analysis.
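Two of these building blocks can be sketched briefly. The snippet below shows generalisation of quasi-identifiers (loosely in the spirit of HIPAA Safe Harbor, which coarsens dates and ZIP codes) and a simple free-text redaction pass; the field names, the phone-number pattern, and the sample note are all invented for the example:

```python
import re

def generalise(record: dict) -> dict:
    """Coarsen quasi-identifiers: dates reduced to year, ZIP codes to a 3-digit prefix."""
    out = dict(record)
    out["dob"] = record["dob"][:4]          # "1945-07-21" -> "1945"
    out["zip"] = record["zip"][:3] + "00"   # "02138" -> "02100"
    return out

def redact_text(note: str) -> str:
    """Illustrative redaction: mask phone-number-like patterns in free text."""
    return re.sub(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "[REDACTED]", note)

# Invented sample inputs.
safe = generalise({"dob": "1945-07-21", "zip": "02138", "diagnosis": "hypertension"})
note = redact_text("Patient called from 617-555-0123 about a refill.")
```

Real redaction pipelines rely on far more robust entity recognition than a single regular expression; the point here is only that generalisation and redaction operate on different layers (structured fields versus free text) and are strongest in combination with access controls.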


In the quest for data-driven healthcare advancements, protecting individuals' privacy is non-negotiable. De-identification needs to improve through legal frameworks, industry collaboration, public awareness, and embracing privacy-by-design for a comprehensive approach that achieves a harmonious balance between data utility and privacy preservation in the health data world.


Need expert guidance in de-identifying your visual data?
