Anonymisation
Sharing data that contains personal information is often problematic from a legal point of view, especially since the GDPR came into force in May 2018. The best way of dealing with these legal problems is to de-personalise the data: replacing or removing information that may identify an individual, so that the data is no longer ‘personal data’.
The following methods help to de-personalise or anonymise a dataset:
- Removal of direct identifiers. These include, but are not restricted to, names, dates, geographic information, telephone numbers and email addresses.
- Reduction in precision. For instance, the day and month could be removed from dates of birth, which are highly identifying, leaving only the year of birth, which better preserves anonymity. Postcode information could be reduced to the postcode district (e.g. L69) or, for even less precision, only the postcode area (e.g. L) could be retained.
- Aggregation. Rather than including the raw data itself, it may be more advisable to group it. Instead of exact ages, for example, age bands could be used: 16-25, 26-35, 36-45, etc. Care should be taken at the upper and lower ends of a variable's range to ensure anonymity is preserved. In the age example, there may be very few people in a dataset over the age of 90, and the top band may have to be modified to take this into account.
- Textual data should be thoroughly searched for identifying information such as the direct identifiers listed above. When found, these identifiers should be replaced with a consistent pseudonym. Where search and replace techniques are used, take care that misspelled identifiers are not missed. In many cases, given the time and effort required to check textual data, it may be worth considering how much data is really necessary and how much can be discarded before sharing takes place.
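The methods listed above could be sketched in Python roughly as follows. This is an illustration only, not a prescribed procedure: the function names, the band width, the top band starting at 86 and the sample pseudonym map are all assumptions made for the example, and the postcode helpers assume complete, validly formatted UK postcodes.

```python
import re
from datetime import date


def year_only(date_of_birth: date) -> int:
    """Reduce a date of birth to the year alone."""
    return date_of_birth.year


def postcode_district(postcode: str) -> str:
    """Keep only the district, e.g. 'L69 3BX' becomes 'L69'.

    Assumes a full postcode whose last three characters are the inward code.
    """
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]


def postcode_area(postcode: str) -> str:
    """Keep only the leading letters (the area), e.g. 'L69 3BX' becomes 'L'."""
    match = re.match(r"[A-Z]+", postcode.strip().upper())
    return match.group(0) if match else ""


def age_band(age: int, lowest: int = 16, width: int = 10, top: int = 86) -> str:
    """Group an age into bands such as 16-25, with open-ended outer bands.

    The open-ended bands ('under 16', '86+') avoid isolating the few
    people at the extremes. All three thresholds are illustrative.
    """
    if age < lowest:
        return f"under {lowest}"
    if age >= top:
        return f"{top}+"
    start = lowest + ((age - lowest) // width) * width
    return f"{start}-{start + width - 1}"


def pseudonymise(text: str, pseudonyms: dict[str, str]) -> str:
    """Replace each listed identifier with its pseudonym, ignoring case.

    The map should list known misspellings as separate keys pointing at the
    same pseudonym; spellings that are not listed will NOT be caught, so a
    manual read-through is still needed.
    """
    # Longest keys first, so 'Sarah' is handled before 'Sara'.
    for identifier in sorted(pseudonyms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(identifier) + r"\b", re.IGNORECASE)
        replacement = pseudonyms[identifier]
        text = pattern.sub(lambda _match, r=replacement: r, text)
    return text
```

For example, pseudonymise("Sarah met Sara in Liverpool.", {"Sarah": "Person A", "Sara": "Person A", "Liverpool": "City 1"}) maps both spellings of the name to the same pseudonym, which keeps the text readable while removing the identifier.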
Anonymisation of data is not an exact science, and throughout the process you should be aware of the potential for re-identification. If you consider there may be a high risk of your research subjects being re-identified (for instance by combining the data with other easily obtainable datasets), it may be appropriate to control distribution by using data sharing agreements.
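One rough way to gauge re-identification risk is to count how many records share each combination of indirect identifiers, since a combination held by only one or two people is easier to match against another dataset. The sketch below assumes records held as dictionaries. The function name, the column names and the threshold of 5 are illustrative assumptions rather than a recommended standard, and a clean result does not prove the data is safe to share.

```python
from collections import Counter


def small_groups(records, indirect_identifiers, threshold=5):
    """Return identifier combinations shared by fewer than `threshold` records.

    `records` is a list of dicts; `indirect_identifiers` names the keys
    (e.g. age band, postcode area) to combine. The threshold is illustrative.
    """
    counts = Counter(
        tuple(record[key] for key in indirect_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < threshold}


# Hypothetical sample data for illustration only.
sample = [
    {"age_band": "26-35", "area": "L"},
    {"age_band": "26-35", "area": "L"},
    {"age_band": "86+", "area": "L"},
]
risky = small_groups(sample, ["age_band", "area"], threshold=2)
```

Here `risky` holds only the ("86+", "L") combination, which belongs to a single record and might need further aggregation or removal before sharing.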
The UK Data Archive has guidance about anonymising qualitative or quantitative data in a research setting.