DataSHIELD – New Directions and Dimensions

Rebecca C. Wilson; Oliver W. Butters; Demetris Avraam; James Baker; Jonathan A. Tedds; Andrew Turner; Madeleine Murtagh; Paul R. Burton

doi:10.5334/dsj-2017-021

Volume 16 (2017): Issue 0

Open Access

Apr 2017

Figures & Tables

Table 1

Commonly used processes to access biomedical microdata (summarised from Burton et al. 2015).

Access method	Description	Examples
Repository release	Data are stored in a repository and released to users with or without governance controls.	NCDS (Power and Elliot, 2005), UK Biobank (Sudlow et al. 2015), UK Data Archive,¹ European Genome-phenome Archive (Lappalainen et al. 2015).
Repository release mitigating disclosure	Repository releases data to users in a modified format to prevent disclosure.	Methods include: aggregation based on the microdata, data redaction/suppression, addition of noise, simulated data with an equivalent structure (Karr and Reiter, 2014; Shlomo et al. 2015). Anonymisation/pseudonymisation of the data (e.g. Sweeney, 2002; Elliot et al. 2016).
Repository direct access-analysis	Users can analyse data stored in a repository. Restrictions on data extraction or analytic functionality may apply.	UK Data Service Secure Lab,² UK SERP (Jones et al. 2016). Open source solutions include: DataSHIELD (Gaye et al. 2014; Wolfson et al. 2010), ViPAR (Carter et al. 2016).

Figure 1

Data partitioning most commonly utilised in health sciences (from Gaye et al. 2014). If there is no data partitioning **(a)**, the data can be analysed all together; if the data are partitioned horizontally **(b)** or vertically **(c)**, computational or statistical methods must be employed to co-analyse them.
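
The difference between horizontal and vertical partitioning can be illustrated with a small sketch. The records, column names, and helper functions below are hypothetical and are not taken from the paper. The pooled mean shows one simple way in which horizontally partitioned data might be co-analysed by exchanging only per-site sums and counts rather than individual rows; it is not a description of how DataSHIELD itself is implemented.

```python
# Hypothetical toy records, invented purely for illustration.
RECORDS = [
    {"id": 1, "age": 30, "bmi": 21.0},
    {"id": 2, "age": 40, "bmi": 24.5},
    {"id": 3, "age": 50, "bmi": 27.0},
    {"id": 4, "age": 60, "bmi": 23.5},
]


def split_horizontally(rows, n_sites):
    """Each site holds the same variables for a different subset of individuals."""
    return [rows[i::n_sites] for i in range(n_sites)]


def split_vertically(rows, key, column_groups):
    """Each site holds different variables for the same individuals, linked by `key`."""
    return [
        [{key: row[key], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]


def site_summary(rows, column):
    """Summary a site could share instead of its rows: (sum, count)."""
    values = [row[column] for row in rows]
    return sum(values), len(values)


def pooled_mean(sites, column):
    """Combine per-site summaries into an overall mean without pooling rows."""
    summaries = [site_summary(rows, column) for rows in sites]
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


horizontal = split_horizontally(RECORDS, n_sites=2)
vertical = split_vertically(RECORDS, key="id", column_groups=[["age"], ["bmi"]])
print(pooled_mean(horizontal, "age"))
```

Vertically partitioned data generally cannot be combined this simply, because an analysis relating variables held at different sites needs the records to be linked through the shared key.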

Figure 2

An example infrastructure for single site DataSHIELD.

Figure 3

The DataSHIELD infrastructure for co-analysis of horizontally partitioned data from three separate data providers.

Table 2

Reproducing the analysis of McGready et al. (2015) within DataSHIELD.

Original description in paper	DataSHIELD command	DataSHIELD output
seroprevalence for HIV 0.47 % (0.30 – 0.76 95 % CI)	ds.glm()	0.004723534 (Odds Ratio) 0.002938389 (lower 95 % CI) 0.007584949 (upper 95 % CI)
seroprevalence for HIV (17/3599)	ds.table1D()	negative (0): 3582; positive (1): 17; Total: 3599
syphilis was lower in refugees (1/1469)	ds.table2D()	refugee – migrant status by syphilis (yes, no, total): refugee n/a, n/a, 1469; migrant n/a, n/a, 2123
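
The figures in the first row of Table 2 can be checked with a short calculation. The sketch below assumes that the interval comes from an intercept-only logistic regression, that is, a 95 % interval computed on the log-odds scale and then back-transformed to a proportion. The source does not state how the interval was derived, so this is an assumption, and the function name is hypothetical. Under that assumption, the counts 17 and 3599 give values that appear consistent with the DataSHIELD output shown in the table.

```python
import math


def logit_ci_proportion(events, total, z=1.96):
    """Proportion with a 95 % CI computed on the logit scale and back-transformed.

    Assumption: this mirrors what an intercept-only logistic regression
    would give; the source does not say how its interval was obtained.
    """
    non_events = total - events
    log_odds = math.log(events / non_events)
    # Standard error of the log-odds for a single binomial sample.
    se = math.sqrt(1 / events + 1 / non_events)

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return (
        events / total,
        inv_logit(log_odds - z * se),
        inv_logit(log_odds + z * se),
    )


estimate, lower, upper = logit_ci_proportion(17, 3599)
print(f"{100 * estimate:.2f} % ({100 * lower:.2f} to {100 * upper:.2f})")
```

Note that the point estimate equals the simple proportion 17/3599, even though the table labels it as an odds ratio.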

Figure 4

An example of a line of digitised text from a typical structured data file from the British Library (data from *Bertha’s Earl. A novel*, Lady Lindsay, 1891).

Figure 5

Digitised text data dictionary as stored in Opal.

Figure 6

An example of a word length analysis that can be performed on the complete text and that produces a non-disclosive output.
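
A word length analysis of this kind can be sketched as follows. The function name, the tokenisation rule, and the sample sentence are assumptions made for illustration and do not come from the paper or from the digitised text. The point of the sketch is that the output contains only counts per word length, so no words from the text are returned.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Return how many words of each length occur in `text`.

    Assumption: a word is a run of letters and apostrophes; the
    source does not specify its tokenisation rule.
    """
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(len(word) for word in words)


# Placeholder sentence, not taken from the digitised novel.
counts = word_length_counts("the cat sat on a mat")
for length in sorted(counts):
    print(length, counts[length])
```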

Figure 7

An example of a disclosive analytic output in which identifiable information (first names and surnames) is present.

Figure 8

Scatterplot of variables x and y. The circles mark the original (x, y) data points, and the solid trend line shows their positive correlation. The crosses are the centroids of each point's three nearest neighbours, which the DataSHIELD method plots in place of the original points to produce a non-disclosive scatter plot. The dashed trend line shows the correlation of those centroids.
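
The nearest-neighbour centroid idea described in this caption can be sketched in a few lines. This is a minimal illustration and not the DataSHIELD implementation: the function name and the toy coordinates are hypothetical, and the sketch assumes that each point counts as one of its own three nearest neighbours, which the caption does not specify.

```python
import math


def knn_centroids(points, k=3):
    """Replace each point by the centroid of its k nearest neighbours.

    Assumption: the point itself (distance 0) is included among its
    k nearest neighbours; the caption does not say whether it is.
    """
    if len(points) < k:
        raise ValueError("need at least k points")
    centroids = []
    for px, py in points:
        # Sort all points by Euclidean distance to the current point.
        nearest = sorted(
            points, key=lambda q: math.hypot(q[0] - px, q[1] - py)
        )[:k]
        cx = sum(q[0] for q in nearest) / k
        cy = sum(q[1] for q in nearest) / k
        centroids.append((cx, cy))
    return centroids


# Hypothetical toy coordinates, invented for illustration.
original = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
masked = knn_centroids(original, k=3)
for point, centroid in zip(original, masked):
    print(point, "->", centroid)
```

Plotting the centroids instead of the original coordinates keeps the overall trend visible while no plotted cross corresponds exactly to an individual record in this toy example.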

Figure 9

Two-dimensional representation of the prototype data exploration interface (vARC) in virtual reality applied to simulated ALSPAC data (courtesy Masters of Pie, Lumacode).

DOI: https://doi.org/10.5334/dsj-2017-021 | Journal eISSN: 1683-1470

Language: English

Page range: 21 - 21

Submitted on: Oct 31, 2016

Accepted on: Apr 5, 2017

Published on: Apr 19, 2017

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: data privacy, sensitive data, distributed data

© 2017 Rebecca C. Wilson, Oliver W. Butters, Demetris Avraam, James Baker, Jonathan A. Tedds, Andrew Turner, Madeleine Murtagh, Paul R. Burton, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
