DataSHIELD – New Directions and Dimensions

Figures & Tables

Table 1

Commonly used processes to access biomedical microdata (summarised from Burton et al. 2015).

| Access method | Description | Examples |
| --- | --- | --- |
| Repository release | Data are stored in a repository and released to users with or without governance controls. | NCDS (Power and Elliot, 2005), UK Biobank (Sudlow et al. 2015), UK Data Archive,¹ European Genome-phenome Archive (Lappalainen et al. 2015). |
| Repository release mitigating disclosure | Repository releases data to users in a modified format to prevent disclosure. | Methods include: aggregation based on the microdata, data redaction/suppression, addition of noise, simulation data with the equivalent structure (Karr and Reiter, 2014; Shlomo et al. 2015); anonymisation/pseudonymisation of the data (e.g. Sweeney, 2002; Elliot et al. 2016). |
| Repository direct access-analysis | Users can analyse data stored in a repository. Restrictions on data extraction or analytic functionality may apply. | UK Data Service Secure Lab,² UK SERP (Jones et al. 2016). Open source solutions include: DataSHIELD (Gaye et al. 2014; Wolfson et al. 2010), ViPAR (Carter et al. 2016). |
Figure 1

Data partitioning most commonly utilised in health sciences (from Gaye et al. 2014). If there is no data partitioning (a), the data can be analysed all together. If the data are partitioned horizontally (b) or vertically (c), computational or statistical methods must be employed to co-analyse them.
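
For horizontally partitioned data (b), one common approach is for each data provider to return only non-identifying summaries, which are then combined centrally. The sketch below illustrates the idea for a pooled mean in plain Python. The site values and the function name `local_summary` are hypothetical, and this is not DataSHIELD's own API.

```python
# Hypothetical example: three sites hold different individuals (rows)
# measured on the same variable. No site shares its raw values.
site_a = [5.1, 4.9, 6.2]
site_b = [5.8, 6.0]
site_c = [4.7, 5.5, 5.9, 6.1]


def local_summary(values):
    """Return only the sum and count of a site's values (no individual rows)."""
    return sum(values), len(values)


# Each site computes its own summary locally.
summaries = [local_summary(site) for site in (site_a, site_b, site_c)]

# The analyst combines the summaries to obtain the pooled mean.
total_sum = sum(s for s, _ in summaries)
total_n = sum(n for _, n in summaries)
pooled_mean = total_sum / total_n

print(f"pooled mean over {total_n} individuals: {pooled_mean:.4f}")
```

The pooled mean equals the mean that would be obtained if all rows were stacked in one place, yet only two numbers per site leave each provider.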

Figure 2

An example infrastructure for single site DataSHIELD.

Figure 3

The DataSHIELD infrastructure for co-analysis of horizontally partitioned data from three separate data providers.

Table 2

Reproducing the analysis of McGready et al. (2015) within DataSHIELD.

| Original description in paper | DataSHIELD command | DataSHIELD output |
| --- | --- | --- |
| seroprevalence for HIV 0.47 % (0.30 – 0.76 95 % CI) | ds.glm() | 0.004723534 (Odds Ratio); 0.002938389 (lower 95 % CI); 0.007584949 (upper 95 % CI) |
| seroprevalence for HIV (17/3599) | ds.table1D() | negative (0): 3582; positive (1): 17; Total: 3599 |
| syphilis was lower in refugees (1/1469) | ds.table2D() | refugee – migrant status by syphilis (yes / no / total): refugee n/a / n/a / 1469; migrant n/a / n/a / 2123 |
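
The HIV figures in Table 2 can be cross-checked by hand from the counts 17 and 3599. The sketch below is plain Python, not DataSHIELD. It assumes the interval is a Wald interval computed on the logit scale and then back-transformed, as an intercept-only logistic regression would give; the source does not state how ds.glm() was configured, so this is an assumption.

```python
import math

positive = 17   # HIV-positive count from the ds.table1D() row
total = 3599    # total tested
negative = total - positive

# Point estimate of seroprevalence.
prevalence = positive / total

# Assumed method: Wald interval on the logit scale, back-transformed.
logit = math.log(positive / negative)
std_err = math.sqrt(1 / positive + 1 / negative)
z = 1.96  # approximate 97.5th percentile of the standard normal


def inv_logit(x):
    """Map a log-odds value back to a proportion."""
    return 1 / (1 + math.exp(-x))


lower = inv_logit(logit - z * std_err)
upper = inv_logit(logit + z * std_err)

print(f"prevalence = {prevalence:.6f}")
print(f"95% CI     = ({lower:.6f}, {upper:.6f})")
```

Under that assumption the three values agree with the DataSHIELD output column to about four significant figures, and the point estimate rounds to the 0.47 % reported in the paper.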
Figure 4

An example of a line of digitised text from a typical structured data file from the British Library (data from Bertha’s Earl. A novel, Lady Lindsay, 1891).

Figure 5

Digitised text data dictionary as stored in Opal.

Figure 6

An example of word length analysis that can be performed on the complete text that produces a non-disclosive output.
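
A word length analysis of this kind can be sketched in a few lines of Python. The sample sentence and the function name `word_length_counts` are hypothetical and are not taken from the British Library data. The output contains only counts per word length, never the words themselves, which is why a summary of this type is treated as non-disclosive.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Count how many words of each length occur in the text."""
    # Treat runs of letters (and apostrophes) as words; an apostrophe
    # therefore counts towards the word's length.
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(len(word) for word in words)


sample = "The quick brown fox jumps over the lazy dog"
counts = word_length_counts(sample)

# Print the distribution in order of word length.
for length in sorted(counts):
    print(length, counts[length])
```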

Figure 7

An example of a disclosive analytic output in which identifiable information (first names and surnames) is present.

Figure 8

Scatterplot of variables x and y. The circles indicate the coordinates of the original variables x and y, and the solid trend line shows their positive correlation. The crosses are the centroids of the three nearest neighbours of each original data point; plotting these centroids instead of the original points gives the non-disclosive scatter plot produced by the DataSHIELD method. The dashed trend line shows the correlation of those centroids.
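
The centroid idea in this caption can be illustrated with a minimal Python sketch. It assumes that each point counts as one of its own three nearest neighbours and that distances are plain Euclidean distances on unscaled data; the caption does not specify either detail. The data points and the function name `knn_centroids` are hypothetical, and this is not DataSHIELD's implementation.

```python
import math


def knn_centroids(points, k=3):
    """Replace each point by the centroid of its k nearest points.

    Assumption: the point itself (distance 0) is one of the k.
    """
    centroids = []
    for p in points:
        # Sort all points by Euclidean distance to p and keep the k closest.
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]
        cx = sum(q[0] for q in nearest) / k
        cy = sum(q[1] for q in nearest) / k
        centroids.append((cx, cy))
    return centroids


# Hypothetical, positively correlated data.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0),
          (3.0, 3.0), (4.0, 5.0), (5.0, 4.0)]
centroids = knn_centroids(points)

for original, centroid in zip(points, centroids):
    print(original, "->", centroid)
```

Plotting `centroids` rather than `points` preserves the overall trend while no plotted coordinate has to coincide with an individual's true values.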

Figure 9

Two-dimensional representation of the prototype data exploration interface (vARC) in virtual reality applied to simulated ALSPAC data (courtesy Masters of Pie, Lumacode).

Language: English
Submitted on: Oct 31, 2016 | Accepted on: Apr 5, 2017 | Published on: Apr 19, 2017
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2017 Rebecca C. Wilson, Oliver W. Butters, Demetris Avraam, James Baker, Jonathan A. Tedds, Andrew Turner, Madeleine Murtagh, Paul R. Burton, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.