Table 1.
Study design and factor description of study sample
| Status | N | Subgroups | Mean age (years) | Gender, N (%) | Ethnicity, N (%) |
|---|---|---|---|---|---|
| UHR subjects | 56 | APS: 43 (76.8%)<br>BLIPS: 3 (5.4%)<br>Vulnerable: 15 (26.8%) | 22.1 | Male: 21 (75.0%)<br>Female: 7 (25.0%) | Chinese: 21 (75.0%)<br>Malay: 7 (25.0%) |
| Healthy controls | 28 | None | 22.5 | Male: 21 (75.0%)<br>Female: 7 (25.0%) | Chinese: 21 (75.0%)<br>Malay: 7 (25.0%) |

Figure 1.
Preliminary variance-based analysis. A) PCA scatterplots demonstrating that data normalization can improve the signal-to-noise ratio, enhancing discrimination between sample classes. Note that no feature selection is performed here. We compare None, Quantile, GFS, and SVA; GFS and SVA appear to boost the class-discrimination signal the most. B) Distribution of variance at each PC level shown as a series of bar plots, where the first bar corresponds to PC1, the second to PC2, and so on. In “None,” note that without any form of normalization, most of the variance is concentrated in PC1. A high concentration of variance in the first PC is usually indicative of a large amount of technical artifact. All normalization methods appear to balance the distribution of variance across the subsequent PCs, but note also that the relative scale of the remaining variance after GFS and SVA processing is much smaller than for log-converted data.
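For readers who want to reproduce a panel-B-style variance profile, the minimal Python sketch below (not the authors' pipeline) quantile-normalizes a hypothetical samples-by-genes log-expression matrix and reports the fraction of variance captured by each PC. GFS and SVA are available as dedicated R packages and are not re-implemented here; `log_expr`, `quantile_normalize`, and `pc_variance_profile` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

def quantile_normalize(X):
    """Quantile-normalize a samples-by-genes matrix so that every
    sample shares the same empirical intensity distribution."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)        # per-sample ranks
    mean_quantiles = np.sort(X, axis=1).mean(axis=0)         # reference distribution
    return mean_quantiles[ranks]

def pc_variance_profile(X, n_pcs=10):
    """Return PCA scores and the fraction of variance captured by each PC."""
    pca = PCA(n_components=n_pcs)
    scores = pca.fit_transform(X)
    return scores, pca.explained_variance_ratio_

# Hypothetical input: log2-transformed expression, samples in rows
# (56 UHR subjects + 28 controls, gene count illustrative).
rng = np.random.default_rng(0)
log_expr = rng.normal(8, 2, size=(84, 5000))

for label, mat in [("None (log only)", log_expr),
                   ("Quantile", quantile_normalize(log_expr))]:
    scores, var_ratio = pc_variance_profile(mat)
    print(label, "variance by PC:", np.round(var_ratio, 3))
    # scores[:, :2] would feed the PC1-vs-PC2 scatterplots in panel A.
```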
Table 2.
Significance of association between data factors (class, gender, and ethnicity) and each principal component
| PC | Class | | | | Gender | | | | Ethnicity | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | None | Quantile | GFS | SVA | None | Quantile | GFS | SVA | None | Quantile | GFS | SVA |
| 1 | **0.00** | 0.38 | **0.00** | 0.38 | 0.36 | **0.02** | 0.56 | 0.34 | 0.35 | **0.01** | 0.91 | 0.97 |
| 2 | 0.16 | 0.14 | **0.00** | **0.00** | **0.01** | 0.81 | 0.85 | 0.20 | 0.12 | 0.69 | 0.69 | 0.85 |
| 3 | 0.23 | 0.05 | 0.10 | 0.22 | 0.25 | 0.09 | **0.01** | 0.85 | 0.05 | 0.95 | **0.01** | 0.37 |
| 4 | 0.24 | **0.00** | 0.21 | 0.73 | 0.24 | 0.19 | 0.36 | 0.20 | 0.76 | 0.86 | 0.66 | 0.82 |
| 5 | **0.00** | **0.00** | 0.13 | 0.21 | 0.09 | 0.78 | 0.16 | **0.03** | 0.92 | 0.91 | 0.95 | 0.23 |
| 6 | 0.70 | **0.02** | 0.14 | **0.03** | 0.25 | 0.72 | 1.00 | **0.00** | 0.95 | 0.30 | 0.37 | 0.64 |
| 7 | **0.03** | 0.27 | 0.33 | 0.06 | 0.87 | 0.59 | 0.07 | 0.30 | 0.07 | 0.79 | 0.11 | 0.20 |
| 8 | 0.12 | **0.02** | 0.98 | 0.14 | 0.34 | 0.19 | 0.94 | 0.36 | 0.33 | 0.62 | 0.31 | **0.01** |
| 9 | 0.30 | 0.22 | 0.08 | 0.87 | 0.23 | **0.00** | 0.22 | **0.01** | 0.77 | 0.88 | 0.90 | 0.13 |
| 10 | 0.59 | 0.23 | 0.86 | 0.79 | **0.01** | 0.42 | 0.74 | 0.31 | 0.59 | 0.65 | 0.05 | 0.50 |
Note: Boldface indicates significance below 0.05.
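Table 2 reports, for each normalization method, the p value of association between the scores of each PC and a categorical factor. The exact test is not spelled out in this excerpt, so the sketch below assumes a one-way ANOVA (F test) of PC scores against the factor levels; `pc_factor_association` and `class_labels` are illustrative names, and `scores` refers to the PCA sketch above.

```python
import numpy as np
from scipy.stats import f_oneway

def pc_factor_association(pc_scores, factor, n_pcs=10):
    """One-way ANOVA p value of each PC's scores against a
    categorical factor (e.g. class, gender, or ethnicity)."""
    factor = np.asarray(factor)
    pvals = []
    for k in range(n_pcs):
        groups = [pc_scores[factor == level, k] for level in np.unique(factor)]
        pvals.append(f_oneway(*groups).pvalue)
    return np.array(pvals)

# Hypothetical usage with the scores from the PCA sketch above:
# class_labels = np.array(["UHR"] * 56 + ["Control"] * 28)
# print(np.round(pc_factor_association(scores, class_labels), 2))
```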

Figure 2.
How normalization affects statistical feature selection and prediction modeling. A) Histograms showing the p value distributions (x axis) following feature selection (based on the F test) and correction for multiple testing via the Benjamini-Hochberg (BH) procedure. Data are processed in four ways (None, Quantile, GFS, and SVA). The importance of normalization is obvious here: with simple log-conversion alone, most gene features are reported as significant, and many of these are likely to be false positives. The p value distributions for Quantile and SVA are closer to expectation, while GFS tends to be highly conservative here. B) Significant-feature overlap based on a cutoff of 0.01. None, Quantile, GFS, and SVA report a total of 5,877, 256, 5, and 556 significant genes, respectively. Among these, only one gene (MAGEB16) is common to all four methods. The features selected by GFS overlap more substantially with those selected by Quantile and SVA. C) Distributions of p values (based on SVA’s set of p values following the F test and BH correction) showing that intersecting genes (common to Quantile, GFS, and SVA) are more significant than those not shared among them. We disregarded the 5,482 significant genes in None, as they are quite likely to be false positives anyway. D) Cross-validation tests demonstrating that GFS, followed by SVA, tends to pick more relevant genes and build better models using the shrunken-centroid classifier. Data are evenly split into training and validation sets, and all features are used to train the classifier. Cross-validation accuracy is the proportion of correctly predicted class labels (control and subject) in the validation dataset (where 0 means no class labels were correctly predicted and 1 means all were). This procedure is repeated 1,000 times to generate the violin plots shown.
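As a rough companion to panels A and D, the sketch below illustrates per-gene F-test selection with BH correction and repeated 50/50 train/validation splits. It uses scikit-learn's `NearestCentroid` with `shrink_threshold` as a stand-in for the shrunken-centroid (PAM-style) classifier, and `f_classif`/`multipletests` in place of the authors' exact code; `X_norm` and `class_labels` are assumed, already-normalized inputs.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedShuffleSplit
from statsmodels.stats.multitest import multipletests

def significant_features(X, y, alpha=0.01):
    """Per-gene F test with Benjamini-Hochberg correction; returns a
    boolean mask of genes whose adjusted p value falls below alpha."""
    _, pvals = f_classif(X, y)
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return reject

def cv_accuracy(X, y, n_splits=1000, seed=0):
    """Repeated 50/50 split cross-validation with a shrunken-centroid
    classifier; each split's score is the fraction of correctly
    predicted labels in the validation half."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.5,
                                      random_state=seed)
    accs = []
    for train_idx, test_idx in splitter.split(X, y):
        clf = NearestCentroid(shrink_threshold=0.5)
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return np.array(accs)   # feeds one violin in panel D

# Hypothetical usage on a normalized samples-by-genes matrix `X_norm`:
# mask = significant_features(X_norm, class_labels)   # panels A-C
# accs = cv_accuracy(X_norm, class_labels)             # panel D (all features)
```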

