Simulation study

Rationale

The purpose of the simulation study was to 1) study how the covariance between datasets (controlled through the DIABLO design matrix) relates to the error rate and to the types of variables selected, 2) explore whether the classification error of a DIABLO model with the full design can be improved by incorporating additional independent components, and 3) compare classification error rates with those of existing integrative classification methods (Concatenation and Ensemble).

Simulation details

Briefly, three omic datasets, each consisting of 200 samples (split equally over two groups) and 260 variables, were generated by varying the degree of covariance and discrimination. This resulted in four types of variables: 30 correlated-discriminatory (corDis) variables, 30 uncorrelated-discriminatory (unCorDis) variables, 100 correlated-nondiscriminatory (corNonDis) variables, and 100 uncorrelated-nondiscriminatory (unCorNonDis) variables.
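The layout of the four variable types can be illustrated with a minimal sketch. The generative model is not given here, so everything below is an assumption: a shared Gaussian latent factor induces the between-dataset covariance, a symmetric group shift induces the fold-change, and `simulate_block` and its parameter names are hypothetical.

```python
import numpy as np

def simulate_block(latent, labels, fold_change, noise_sd, rng,
                   n_cor_dis=30, n_uncor_dis=30,
                   n_cor_nondis=100, n_uncor_nondis=100):
    """Hypothetical generator for one omic block (assumed additive model)."""
    n = labels.shape[0]
    # Group effect: -fold_change/2 for group 0, +fold_change/2 for group 1.
    shift = ((labels - 0.5) * fold_change)[:, None]
    shared = latent[:, None]  # the same latent factor is reused in every block

    def noise(p):
        return rng.normal(0.0, noise_sd, size=(n, p))

    cor_dis = shared + shift + noise(n_cor_dis)      # corDis
    uncor_dis = shift + noise(n_uncor_dis)           # unCorDis
    cor_nondis = shared + noise(n_cor_nondis)        # corNonDis
    uncor_nondis = noise(n_uncor_nondis)             # unCorNonDis
    return np.hstack([cor_dis, uncor_dis, cor_nondis, uncor_nondis])

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)            # 200 samples, two equal groups
covariance = 1.0                           # assumed variance of the shared factor
latent = rng.normal(0.0, np.sqrt(covariance), size=labels.shape[0])
blocks = [simulate_block(latent, labels, fold_change=1.0, noise_sd=0.5, rng=rng)
          for _ in range(3)]               # three omic datasets
```

Under this assumed model, columns 0-29 of each block are corDis, 30-59 unCorDis, 60-159 corNonDis and 160-259 unCorNonDis, which makes it straightforward to tabulate which variable types a method selects.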

DIABLO performs competitively with other integrative methods

Simulation: vary noise and fold-change and compare with other schemes (concatenation/ensembles)

Three integrative classification approaches were applied to generate multi-omic biomarker panels of 180 variables each. The first was a DIABLO model with either a full design (where the sum of correlations between all pairwise combinations of datasets, as well as between each dataset and the phenotypic outcome, was maximised) or the null design (where only the sum of correlations between each dataset and the phenotypic outcome was maximised, see Methods). The second was a concatenation-based sPLSDA classifier, which naively combines all datasets into one. The third was an ensemble of sPLSDA classifiers, in which a separate sPLSDA classifier was fitted to each omics dataset and the individual predictions were combined using a majority vote scheme.
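The structural differences between these schemes can be sketched in a few lines. This is an illustration only: it assumes the design is encoded as a square matrix over the omic datasets with a zero diagonal (the link to the phenotypic outcome is assumed to be handled separately), and the helper names `make_design`, `concatenate_blocks` and `majority_vote` are hypothetical. The sPLSDA classifiers themselves are not implemented here.

```python
import numpy as np

def make_design(n_blocks, value):
    """Square design matrix: `value` between distinct datasets, 0 on the diagonal."""
    design = np.full((n_blocks, n_blocks), float(value))
    np.fill_diagonal(design, 0.0)
    return design

full_design = make_design(3, 1.0)   # every pair of datasets is connected
null_design = make_design(3, 0.0)   # no dataset-to-dataset connections

def concatenate_blocks(blocks):
    """Concatenation scheme: stack all datasets column-wise into one matrix."""
    return np.hstack(blocks)

def majority_vote(predictions):
    """Ensemble scheme: combine per-dataset class predictions.

    predictions has shape (n_classifiers, n_samples); ties go to the
    smallest class label, which cannot occur with three voters and two classes.
    """
    predictions = np.asarray(predictions)
    classes = np.unique(predictions)
    counts = np.stack([(predictions == c).sum(axis=0) for c in classes])
    return classes[counts.argmax(axis=0)]

votes = majority_vote([[0, 1, 1],
                       [0, 0, 1],
                       [1, 0, 1]])
```

With three per-dataset classifiers and two groups, each sample always has a strict majority, so the tie-breaking rule never comes into play in this setting.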

Integrative prediction frameworks, including multi-step approaches (concatenation, ensemble) and DIABLO, for identifying multi-omics molecular signatures. Concatenation-based integration combines multiple datasets into a single large dataset, with the aim of predicting a phenotype of interest. Ensemble-based classification methods construct a predictive model on each individual dataset before combining the model predictions. Neither approach accounts for or models the relationships between datasets, which limits our understanding of molecular interactions at multiple functional levels. DIABLO simultaneously maximizes the associations between datasets and a phenotype of interest to identify a correlated set of variables of different omics types that are also discriminatory. The prediction is based on each omics-associated component derived from the model (see Supplementary Note). All methods presented here are data-driven approaches, which do not use any prior knowledge such as that from curated biological databases (e.g., protein-protein interactions).

The covariance between datasets was held constant at 1, whereas the fold-change was varied from 0 to 2 and the noise from 0.2 to 1. As expected, the error rate was roughly 50% for all methods when the fold-change was zero, regardless of noise level. When the fold-change was set to 1, the DIABLO model with the full design (DIABLO_full) had a higher error rate than the other methods for noise levels less than 1. However, when the noise level was increased to match the fold-change, all methods performed similarly. When the fold-change was increased to 2 (higher than both the covariance and noise levels), the DIABLO_full model performed similarly to the other methods.

Trade-off between correlation and discrimination

Contour plots of the average error rate (10-fold cross-validation) over 20 simulations at different levels of covariance and fold-change, using DIABLO with the full and null design, retaining 1 component.

Panels: full design; null design; types of variables selected.

For a given fold-change, increasing the covariance between datasets significantly increases the error rate (blue to red in the contour plots) under the full design but not under the null design.
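The error surface summarised here is an average of 10-fold cross-validated error rates over 20 simulated datasets per setting. A minimal sketch of that averaging is given below. It does not implement DIABLO: a nearest-centroid classifier is swapped in purely as a stand-in, and the function names and the `simulate(rng)` callback (assumed to return a data matrix and a label vector) are hypothetical.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (not DIABLO): assign each sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def cv_error_rate(X, y, predict, n_folds=10, rng=None):
    """Misclassification rate estimated by k-fold cross-validation."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = 0
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False          # hold out the current fold
        y_hat = predict(X[train_mask], y[train_mask], X[test_idx])
        errors += int((y_hat != y[test_idx]).sum())
    return errors / len(y)

def mean_error_over_simulations(simulate, predict, n_sim=20, seed=0):
    """Average the cross-validated error over n_sim independently simulated datasets."""
    rng = np.random.default_rng(seed)
    rates = [cv_error_rate(*simulate(rng), predict, rng=rng) for _ in range(n_sim)]
    return float(np.mean(rates))
```

Evaluating `mean_error_over_simulations` once per (covariance, fold-change) pair would give the grid of values that a contour plot of this kind displays.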

Improving predictive performance by incorporating independent information (ncomp = 2)

Panels: full design; null design; types of variables selected.

When additional independent information is added to the DIABLO (full design) model by retaining additional components (while keeping the number of variables the same), the error rate improves significantly and becomes more similar to that of the DIABLO (null design) model.

Conclusion

This simulation highlights how the design (the connection between datasets) affects the flexibility of the DIABLO model, resulting in a trade-off between discrimination and correlation. DIABLO_null (null design) focused on selecting discriminatory variables and disregarded most of the correlation between datasets, whereas DIABLO_full selected variables that were highly correlated across all datasets. Since the variables selected by DIABLO_full reflect the correlation structure between biological compartments, we hypothesized that they might provide a balance between prediction accuracy and biological insight.