Table 1.
Study design and factor description of study sample
| Status | N | Subgroups | Mean age (years) | Gender, N (%) | Ethnicity, N (%) |
|---|---|---|---|---|---|
| UHR subjects | 56 | APS: 43 (76.8%)<br>BLIPS: 3 (5.4%)<br>Vulnerable: 15 (26.8%) | 22.1 | Male: 21 (75.0%)<br>Female: 7 (25.0%) | Chinese: 21 (75.0%)<br>Malay: 7 (25.0%) |
| Healthy controls | 28 | None | 22.5 | Male: 21 (75.0%)<br>Female: 7 (25.0%) | Chinese: 21 (75.0%)<br>Malay: 7 (25.0%) |

Figure 1.
Preliminary variance-based analysis. A) PCA scatterplots demonstrating that data normalization can improve the signal-to-noise ratio, enhancing discrimination between sample classes. Note that no feature selection is performed here. We compare None, Quantile, GFS, and SVA; GFS and SVA appear to boost the class-discrimination signal the most. B) Distribution of variance at each PC level shown as a series of bar plots, where the first bar corresponds to PC1, the second to PC2, and so on. In “None,” note that without any form of normalization, most of the variance is concentrated in PC1. A high concentration of variance in the first PC is usually indicative of a large amount of technical artifact. All normalization methods appear to balance the distribution of variance across the subsequent PCs, but note also that the relative scale of the remaining variance after GFS and SVA processing is much smaller than for log-converted data.
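For readers who want to reproduce a panel-B-style variance profile, the minimal Python sketch below (not the authors' pipeline) quantile-normalizes a hypothetical samples-by-genes log-expression matrix and reports the fraction of variance captured by each PC. GFS and SVA are available as dedicated R packages and are not re-implemented here; `log_expr`, `quantile_normalize`, and `pc_variance_profile` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

def quantile_normalize(X):
    """Quantile-normalize a samples-by-genes matrix so that every
    sample shares the same empirical intensity distribution."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)        # per-sample ranks
    mean_quantiles = np.sort(X, axis=1).mean(axis=0)         # reference distribution
    return mean_quantiles[ranks]

def pc_variance_profile(X, n_pcs=10):
    """Return PCA scores and the fraction of variance captured by each PC."""
    pca = PCA(n_components=n_pcs)
    scores = pca.fit_transform(X)
    return scores, pca.explained_variance_ratio_

# Hypothetical input: log2-transformed expression, samples in rows
# (56 UHR subjects + 28 controls, gene count illustrative).
rng = np.random.default_rng(0)
log_expr = rng.normal(8, 2, size=(84, 5000))

for label, mat in [("None (log only)", log_expr),
                   ("Quantile", quantile_normalize(log_expr))]:
    scores, var_ratio = pc_variance_profile(mat)
    print(label, "variance by PC:", np.round(var_ratio, 3))
    # scores[:, :2] would feed the PC1-vs-PC2 scatterplots in panel A.
```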
Table 2.
Significance of association between data factors (class, gender, and ethnicity) and each principal component
| PC | Class | | | | Gender | | | | Ethnicity | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | None | Quantile | GFS | SVA | None | Quantile | GFS | SVA | None | Quantile | GFS | SVA |
| 1 | **0.00** | 0.38 | **0.00** | 0.38 | 0.36 | **0.02** | 0.56 | 0.34 | 0.35 | **0.01** | 0.91 | 0.97 |
| 2 | 0.16 | 0.14 | **0.00** | **0.00** | **0.01** | 0.81 | 0.85 | 0.20 | 0.12 | 0.69 | 0.69 | 0.85 |
| 3 | 0.23 | 0.05 | 0.10 | 0.22 | 0.25 | 0.09 | **0.01** | 0.85 | 0.05 | 0.95 | **0.01** | 0.37 |
| 4 | 0.24 | **0.00** | 0.21 | 0.73 | 0.24 | 0.19 | 0.36 | 0.20 | 0.76 | 0.86 | 0.66 | 0.82 |
| 5 | **0.00** | **0.00** | 0.13 | 0.21 | 0.09 | 0.78 | 0.16 | **0.03** | 0.92 | 0.91 | 0.95 | 0.23 |
| 6 | 0.70 | **0.02** | 0.14 | **0.03** | 0.25 | 0.72 | 1.00 | **0.00** | 0.95 | 0.30 | 0.37 | 0.64 |
| 7 | **0.03** | 0.27 | 0.33 | 0.06 | 0.87 | 0.59 | 0.07 | 0.30 | 0.07 | 0.79 | 0.11 | 0.20 |
| 8 | 0.12 | **0.02** | 0.98 | 0.14 | 0.34 | 0.19 | 0.94 | 0.36 | 0.33 | 0.62 | 0.31 | **0.01** |
| 9 | 0.30 | 0.22 | 0.08 | 0.87 | 0.23 | **0.00** | 0.22 | **0.01** | 0.77 | 0.88 | 0.90 | 0.13 |
| 10 | 0.59 | 0.23 | 0.86 | 0.79 | **0.01** | 0.42 | 0.74 | 0.31 | 0.59 | 0.65 | 0.05 | 0.50 |
Note: Boldface indicates significance below 0.05.
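Table 2 reports, for each normalization method, the p value of association between the scores of each PC and a categorical factor. The exact test is not spelled out in this excerpt, so the sketch below assumes a one-way ANOVA (F test) of PC scores against the factor levels; `pc_factor_association` and `class_labels` are illustrative names, and `scores` refers to the PCA sketch above.

```python
import numpy as np
from scipy.stats import f_oneway

def pc_factor_association(pc_scores, factor, n_pcs=10):
    """One-way ANOVA p value of each PC's scores against a
    categorical factor (e.g. class, gender, or ethnicity)."""
    factor = np.asarray(factor)
    pvals = []
    for k in range(n_pcs):
        groups = [pc_scores[factor == level, k] for level in np.unique(factor)]
        pvals.append(f_oneway(*groups).pvalue)
    return np.array(pvals)

# Hypothetical usage with the scores from the PCA sketch above:
# class_labels = np.array(["UHR"] * 56 + ["Control"] * 28)
# print(np.round(pc_factor_association(scores, class_labels), 2))
```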

Figure 2.
How normalization affects statistical feature selection and prediction modeling. A) Histograms showing the p value distributions (x axis) following feature selection (based on the F test) and correction for multiple testing via the Benjamini-Hochberg (BH) procedure. Data are processed in four ways (None, Quantile, GFS, and SVA). The importance of normalization is obvious here: with simple log-conversion alone, most gene features are reported as significant, and many of these are likely to be false positives. The p value distributions for Quantile and SVA are closer to expectation, while GFS tends to be highly conservative here. B) Significant-feature overlap based on a cutoff of 0.01. None, Quantile, GFS, and SVA report a total of 5,877, 256, 5, and 556 significant genes, respectively. Among these, only one gene (MAGEB16) is common to all four methods. The features selected by GFS overlap more substantially with those selected by Quantile and SVA. C) Distributions of p values (based on SVA’s set of p values following the F test and BH correction) showing that intersecting genes (common to Quantile, GFS, and SVA) are more significant than those not shared among them. We disregarded the 5,482 significant genes in None, as they are quite likely to be false positives anyway. D) Cross-validation tests demonstrating that GFS, followed by SVA, tends to pick more relevant genes and build better models using the shrunken-centroid classifier. Data are evenly split into training and validation sets, and all features are used to train the classifier. Cross-validation accuracy is the proportion of correctly predicted class labels (control and subject) in the validation dataset (where 0 means no class labels were correctly predicted and 1 means all were). This procedure is repeated 1,000 times to generate the violin plots shown.
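As a rough companion to panels A and D, the sketch below illustrates per-gene F-test selection with BH correction and repeated 50/50 train/validation splits. It uses scikit-learn's `NearestCentroid` with `shrink_threshold` as a stand-in for the shrunken-centroid (PAM-style) classifier, and `f_classif`/`multipletests` in place of the authors' exact code; `X_norm` and `class_labels` are assumed, already-normalized inputs.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedShuffleSplit
from statsmodels.stats.multitest import multipletests

def significant_features(X, y, alpha=0.01):
    """Per-gene F test with Benjamini-Hochberg correction; returns a
    boolean mask of genes whose adjusted p value falls below alpha."""
    _, pvals = f_classif(X, y)
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return reject

def cv_accuracy(X, y, n_splits=1000, seed=0):
    """Repeated 50/50 split cross-validation with a shrunken-centroid
    classifier; each split's score is the fraction of correctly
    predicted labels in the validation half."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.5,
                                      random_state=seed)
    accs = []
    for train_idx, test_idx in splitter.split(X, y):
        clf = NearestCentroid(shrink_threshold=0.5)
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return np.array(accs)   # feeds one violin in panel D

# Hypothetical usage on a normalized samples-by-genes matrix `X_norm`:
# mask = significant_features(X_norm, class_labels)   # panels A-C
# accs = cv_accuracy(X_norm, class_labels)             # panel D (all features)
```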

