Table 1
Commonly used processes to access biomedical microdata (summarised from Burton et al. 2015).
| Access method | Description | Examples |
|---|---|---|
| Repository release | Data are stored in a repository and released to users with or without governance controls. | NCDS (Power and Elliot, 2005), UK Biobank (Sudlow et al. 2015), UK Data Archive,¹ European Genome-phenome Archive (Lappalainen et al. 2015). |
| Repository release mitigating disclosure | Repository releases data to users in a modified format to prevent disclosure. | Methods include: aggregation based on the microdata, data redaction/suppression, addition of noise, simulated data with an equivalent structure (Karr and Reiter, 2014; Shlomo et al. 2015). Anonymisation/pseudonymisation of the data (e.g. Sweeney, 2002; Elliot et al. 2016). |
| Repository direct access-analysis | Users can analyse data stored in a repository. Restrictions on data extraction or analytic functionality may apply. | UK Data Service Secure Lab,² UK SERP (Jones et al. 2016). Open source solutions include: DataSHIELD (Gaye et al. 2014; Wolfson et al. 2010), ViPAR (Carter et al. 2016). |

Figure 1
Data partitioning most commonly utilised in health sciences (from Gaye et al. 2014). If there is no data partitioning (a), the data can be analysed all together; if the data are partitioned horizontally (b) or vertically (c), computational or statistical methods must be employed to co-analyse them.
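
As a minimal sketch of co-analysing horizontally partitioned data (case b), assume each provider holds the same variable for different individuals and shares only a count and a sum. The site names, values, and the helper functions `local_summary` and `pooled_mean` are hypothetical and are not part of DataSHIELD; the sketch only illustrates why individual-level rows need not leave a site to obtain a pooled mean.

```python
# Hypothetical data: three providers hold the same variable for
# different individuals (horizontal partitioning).
site_a = [5.1, 4.8, 6.0]
site_b = [5.5, 5.9]
site_c = [4.2, 5.0, 5.3, 6.1]


def local_summary(values):
    """Run at each site: return only the count and the sum, never the rows."""
    return len(values), sum(values)


def pooled_mean(summaries):
    """Run by the analyst: combine per-site summaries into one overall mean."""
    total_n = sum(n for n, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    return total_sum / total_n


summaries = [local_summary(site) for site in (site_a, site_b, site_c)]
print(pooled_mean(summaries))
```

The pooled result equals the mean that would be computed if all rows were stacked in one place. Vertically partitioned data (case c) would additionally require records to be matched across providers, which this sketch does not attempt.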

Figure 2
An example infrastructure for single site DataSHIELD.

Figure 3
The DataSHIELD infrastructure for co-analysis of horizontally partitioned data from three separate data providers.
Table 2
Reproducing the analysis of McGready et al. (2015) within DataSHIELD.
| Original description in paper | DataSHIELD command | DataSHIELD output |
|---|---|---|
| seroprevalence for HIV 0.47 % (0.30 – 0.76 95 % CI) | ds.glm() | 0.004723534 (Odds Ratio); 0.002938389 (lower 95 % CI); 0.007584949 (upper 95 % CI) |
| seroprevalence for HIV (17/3599) | ds.table1D() | negative (0): 3582; positive (1): 17; Total: 3599 |
| syphilis was lower in refugees (1/1469) | ds.table2D() | Refugee – migrant status by syphilis (yes / no / total): refugee n/a, n/a, 1469; migrant n/a, n/a, 2123 |
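
The arithmetic behind the first two rows of Table 2 can be checked with a short sketch. It assumes, without confirmation from the source, that the interval is a Wald interval computed on the logit scale of an intercept-only logistic model and then back-transformed, with a multiplier of 1.96. Under that assumption the point estimate is simply 17/3599. The sketch does not call DataSHIELD; it is plain Python using the counts quoted in the table.

```python
import math

positives, total = 17, 3599        # counts quoted in Table 2
negatives = total - positives      # 3582


def expit(x):
    """Inverse logit: map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


# Intercept of an intercept-only logistic model and its standard error.
logit = math.log(positives / negatives)
se = math.sqrt(1 / positives + 1 / negatives)
z = 1.96  # assumed multiplier for a 95 % interval

prevalence = expit(logit)          # equals positives / total
lower = expit(logit - z * se)
upper = expit(logit + z * se)

print(f"{prevalence:.4%} ({lower:.4%} to {upper:.4%})")
```

Under these assumptions the three values agree closely with the ds.glm() row, which suggests the reported figures are proportions on the back-transformed scale.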

Figure 4
An example of a line of digitised text from a typical structured data file from the British Library (data from Bertha’s Earl. A novel, Lady Lindsay, 1891).

Figure 5
Digitised text data dictionary as stored in Opal.

Figure 6
An example of a word length analysis performed on the complete text, producing a non-disclosive output.
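
A word length analysis of this kind could look roughly like the sketch below, which returns only counts per word length and none of the words themselves. The function name `word_length_counts`, the tokenisation rule, and the sample sentence are assumptions made for illustration; they are not taken from the British Library data or from DataSHIELD.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Count how many words of each length occur; the words are not returned."""
    words = re.findall(r"[A-Za-z]+", text)  # assumed rule: runs of ASCII letters
    return Counter(len(word) for word in words)


# Hypothetical sample sentence, not taken from the digitised novel.
sample = "The carriage stopped at the gate and nobody spoke"
for length, count in sorted(word_length_counts(sample).items()):
    print(length, count)
```

Because the output is an aggregate over the whole text, it carries no names or phrases from the source.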

Figure 7
An example of a disclosive analytic output in which identifiable information (first names and surnames) is present.

Figure 8
Scatterplot of variables x and y. The circles indicate the coordinates of the original variables x and y, and the solid trend line shows their positive correlation. The crosses are the centroids of the three nearest neighbours of each original data point; these centroids form the non-disclosive scatter plot created by the DataSHIELD method. The dashed trend line shows the correlation of those centroids.
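
The three-nearest-neighbour centroid idea in the Figure 8 caption can be sketched as follows. This is not DataSHIELD's implementation: the function name `knn_centroids` and the sample points are hypothetical, Euclidean distance on unscaled values is assumed, and each point is assumed to count as one of its own three nearest neighbours.

```python
import math


def knn_centroids(points, k=3):
    """Replace each (x, y) point by the centroid of its k nearest points.

    Assumption: the point itself is included among the k nearest.
    """
    if len(points) < k:
        raise ValueError("need at least k points")
    centroids = []
    for p in points:
        # Sort all points by Euclidean distance to p and keep the closest k.
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]
        cx = sum(q[0] for q in nearest) / k
        cy = sum(q[1] for q in nearest) / k
        centroids.append((cx, cy))
    return centroids


# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
for original, masked in zip(points, knn_centroids(points)):
    print(original, "->", masked)
```

Plotting the centroids instead of the original coordinates keeps the overall trend visible while no plotted cross corresponds to a single individual's exact values.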

Figure 9
Two-dimensional representation of the prototype data exploration interface (vARC) in virtual reality applied to simulated ALSPAC data (courtesy Masters of Pie, Lumacode).
