General Data Management Plan
Version 0.2.
Last amended: 18-12-2023
Introduction
This document, the General Data Management Plan (GDMP), sets out the general policies regarding data held by the Computational Biology Facility (CBF) in pursuit of its research and business goals.
Each project has its own project data plan (PDP) that holds details on the owner, data type, sensitivity, and the proposed location of the processed, interim and final data. It will also provide details on where the data is to be stored upon manuscript publication.
This document is to be read in conjunction with the PDP.
Ethics approval
The CBF must confirm with the collaborator or client that the experiment, data collection, and proposed analysis have undergone appropriate ethical scrutiny and that any permissions required to use the data have been acquired.
Data Sensitivity Classes
The provenance of the data determines how it must be stored and protected. For the purposes of this document, data is classified into the five classes outlined below.
Class 1: Public Data.
Data that is freely available to the public. Examples include government statistics and social media posts, but more often the data comes from biological data repositories.
Class 2: Non-Human Data.
Experimental data from plant/animal models.
Class 3: Anonymised Data.
Human data that is rendered anonymous in such a manner that the data subject is not identifiable either directly or indirectly.
Class 4: Pseudonymised Data.
Pseudonymisation is defined within the UK-GDPR as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual”. There is a residual risk of re-identification.
Class 5: Clinical Data.
Clinical or personal data that allows an individual to be recognised in a straightforward manner. This is the highest sensitivity class of data; it would normally only be available within a trusted research environment (TRE).
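Where a project records its sensitivity class in machine-readable form, the five classes above could be encoded as in the sketch below. This is an illustration only: the names SensitivityClass and normally_requires_tre are hypothetical and do not refer to any existing CBF system.

```python
from enum import IntEnum


class SensitivityClass(IntEnum):
    """The five data sensitivity classes, numbered as in this document."""
    PUBLIC = 1         # freely available to the public
    NON_HUMAN = 2      # experimental data from plant/animal models
    ANONYMISED = 3     # human data, subject not identifiable
    PSEUDONYMISED = 4  # re-identification needs separately held information
    CLINICAL = 5       # individual recognisable in a straightforward manner


def normally_requires_tre(cls: SensitivityClass) -> bool:
    """Hypothetical helper: only Class 5 is normally confined to a TRE."""
    return cls is SensitivityClass.CLINICAL
```

Using an integer-valued enumeration keeps the class numbers from this document visible, so SensitivityClass(4) resolves to the pseudonymised class.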
Acceptance of Data from Collaborator or Client.
All data must be accompanied by metadata. It is the client’s responsibility to check that the CBF has the metadata required for the agreed analysis. Delays to analysis caused by incomplete metadata are costly and must be avoided. Only the client can fix metadata efficiently. Colour or any other visual formatting is not to be used to differentiate samples or conditions.
Data Storage.
The CBF will use its best endeavours to secure the information. The level of security will depend on the sensitivity class of the information.
Class 1: Public Data.
If analysis requires a particular version or depends on a certain layout of the data, then it should be secured and backed up in case the data becomes unavailable later.
Class 2: Non-Human Data.
Reasonable precautions will be taken to avoid inadvertent disclosure. The data will by default be considered proprietary and compiled for the exclusive use of the client.
Class 3: Anonymised Data.
Anonymisation gives the greatest protection to individuals, and fully anonymised data falls outside the scope of UK-GDPR regulations. However, truly anonymised data can be less useful for some research purposes.
Class 4: Pseudonymised Data.
This is the most common type of data we receive. Whilst difficult, it may be possible to identify subjects by combining various sources of data. However, if the process of subject identification is difficult, costly, or requires data from multiple locations, then the data can be stored and processed for research purposes.
Data is to be held either on a secure data server, within an encrypted SQL/NoSQL database or, during project implementation, on encrypted drives of CBF desktops and laptops.
Class 5: Clinical Data.
Clinical data, where subjects can be identified by either direct or indirect means, will normally be held in a trusted research environment (TRE) compliant with the current ISO 27001 standard. Clinical data does not leave the TRE and all analysis takes place within the environment. Often clinical data is already held within a TRE and our work would be carried out there. The CBF does not currently have such a TRE and, if necessary, would outsource this infrastructure requirement. Any costs would be included in the agreement before work commences.
Backing up the Data.
The client or collaborator is responsible for the safety of the raw data. During the project, the CBF will be responsible for backing up the processed, interim, and results data. The data may be backed up to the CBF high-capacity data server, Dropbox Enterprise, or UoL’s SharePoint or OneDrive. Code will be stored and kept updated on the CBF git repository.
How long do we store the data.
Unless the data falls under UK-GDPR regulations or it is agreed otherwise with the client, we will retain the processed data for five years from receipt of the original source data. After this, the data may be removed from our active server and transferred to other archive media (tape, DVD, removable SSD), to be secured under lock and key for a further five years.
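As a worked illustration of the default periods above, the sketch below computes the earliest date on which data may move to archive media and the date on which the archive period ends. It assumes the default five-plus-five-year schedule applies (no UK-GDPR constraint and no separate client agreement); the function and constant names are hypothetical.

```python
from datetime import date

ACTIVE_YEARS = 5   # retention on the active server, from receipt of source data
ARCHIVE_YEARS = 5  # further period on archive media under lock and key


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February
        return d.replace(year=d.year + years, day=28)


def retention_schedule(received: date) -> tuple[date, date]:
    """Return (earliest archive date, end of archive period)."""
    archive_from = add_years(received, ACTIVE_YEARS)
    archive_until = add_years(archive_from, ARCHIVE_YEARS)
    return archive_from, archive_until
```

For source data received on 18 December 2023, this gives 18 December 2028 as the earliest archive date and 18 December 2033 as the end of the archive period.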
Part of our agreement with the client may specify that we upload data to an online repository. Once upload has been verified, we reserve the right to transfer that data immediately onto archive media. On project completion, the locations of all data will be logged.
Sustainable Data Storage
The CBF is focused on sustainable computing.
Whilst the capacity of the CBF infrastructure is considerable, it is finite. Holding the data on our servers is more energy efficient than cloud storage, but there is still an energy cost. The PDP will specify the data management for the project and be the first part of the project data log. Here we will note the location of the processed data, the interim data, and the destination of the final data. It will also mark the approximate date for movement to a suitable archive. Close management of the data cycle is essential to monitoring our energy footprint.
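The items a PDP records could be captured in a simple structured record, sketched below. This is not the CBF's actual PDP format: the class name, field names, and example values are all hypothetical, chosen only to mirror the items listed in this document.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProjectDataPlan:
    """Hypothetical record of the items a PDP is described as holding."""
    owner: str
    data_type: str
    sensitivity_class: int                # 1 (public) to 5 (clinical)
    processed_location: str
    interim_location: str
    final_destination: str
    archive_date: Optional[date] = None   # approximate date for move to archive

    def __post_init__(self) -> None:
        # Reject class numbers outside the five classes defined in this document
        if self.sensitivity_class not in range(1, 6):
            raise ValueError("sensitivity_class must be between 1 and 5")


# Example values are invented for illustration only
plan = ProjectDataPlan(
    owner="Example client",
    data_type="RNA-seq counts",
    sensitivity_class=2,
    processed_location="secure data server",
    interim_location="encrypted laptop drive",
    final_destination="online repository",
    archive_date=date(2028, 12, 18),
)
```

Keeping the archive date alongside the storage locations means one record answers both where the data is and when it is due to move.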