Case study 1: Breast cancer
- Analysis 1: Comparison of classification performance between DIABLO models and other classification schemes (Concatenation and Ensemble) - only using mRNA, miRNA and CpGs
- Number of samples and variables per dataset
- Phenotype breakdown
- Tune DIABLO models
- estimate DIABLO train and test error rates
- Tune sPLSDA models
- estimate sPLSDA train and test error rates (Concatenation and Ensemble)
- tune and estimate Enet train and test error rates (Concatenation and Ensemble)
- plot of error rates
- table of error rates
- number of variables in each panel
- Analysis 2: DIABLO identified known and novel multi-omics biomarkers of breast cancer subtypes - using mRNA, miRNA, CpGs and proteins
- tuning a DIABLO_full model using 4 datasets
- Optimal DIABLO model
- run DIABLO - with optimal keepX
- Feature Plot
- Evaluate performance of diablo panel using additional data (test datasets)
- Individual class error rate per PAM50 subtype
- Component plots
- Heatmap
- Network
- Geneset enrichment analysis
- References
Analysis 1: Comparison of classification performance between DIABLO models and other classification schemes (Concatenation and Ensemble) - only using mRNA, miRNA and CpGs
Number of samples and variables per dataset
## mRNA miRNA CpGs Proteins Attribute
## 1 379 379 379 379 Number of Samples
## 2 16851 349 9482 115 Number of features
Phenotype breakdown
## Y.train
## Basal Her2 LumA LumB
## 76 38 188 77
Tune DIABLO models
estimate DIABLO train and test error rates
Tune sPLSDA models
estimate sPLSDA train and test error rates (Concatenation and Ensemble)
tune and estimate Enet train and test error rates (Concatenation and Ensemble)
plot of error rates
table of error rates
mean_err | sd_err | cohort | modName |
---|---|---|---|
0.21 | 0.0091 | train | DIABLO_Null |
0.19 | NA | test | DIABLO_Null |
0.22 | 0.0057 | train | DIABLO_Full |
0.21 | NA | test | DIABLO_Full |
0.15 | 0.0130 | train | Concatenation_sPLSDA |
0.18 | NA | test | Concatenation_sPLSDA |
0.25 | 0.0140 | train | Ensemble_sPLSDA |
0.28 | NA | test | Ensemble_sPLSDA |
0.14 | 0.0072 | train | Concatenation_Enet |
0.20 | NA | test | Concatenation_Enet |
0.11 | 0.0016 | train | Ensemble_Enet |
0.23 | NA | test | Ensemble_Enet |
number of variables in each panel
modName | CpGs | miRNA | mRNA |
---|---|---|---|
DIABLO_Null | 22 | 42 | 60 |
DIABLO_Full | 17 | 17 | 55 |
Concatenation_sPLSDA | NA | NA | 60 |
Ensemble_sPLSDA | 40 | 55 | 60 |
Concatenation_Enet | 38 | 2 | 118 |
Ensemble_Enet | 127 | 45 | 96 |
Analysis 2: DIABLO identified known and novel multi-omics biomarkers of breast cancer subtypes - using mRNA, miRNA, CpGs and proteins
We next demonstrate that DIABLO can identify novel biomarkers in addition to biomarkers with known biological associations using a case study of human breast cancer. We applied our biomarker analysis workflow to breast cancer datasets to characterize and predict PAM50 breast cancer subtypes.
A standard DIABLO workflow. The first step takes as input multiple omics datasets measured on the same individuals, previously normalized and filtered, along with the phenotype information indicating the class membership of each sample (two or more groups). Optional preprocessing steps include multilevel transformation for repeated-measures study designs and pathway module summary transformations. DIABLO is a multivariate dimension reduction method that seeks latent components (linear combinations of variables from each omics dataset) that are maximally correlated as specified by a design matrix (see Methods section). The multi-omics panel is identified using l1 penalties in the model that shrink the variable coefficients defining the components to zero. Numerous visualizations are provided to give insight into the multi-omics panel and guide the interpretation of the selected omics variables, including sample and variable plots. Downstream analyses include gene set enrichment analysis.
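The sparsity mechanism mentioned in this caption can be illustrated with soft-thresholding, the operator usually associated with an l1 penalty. The sketch below is not the mixOmics implementation: the loading values and the helper names `soft_threshold` and `keep_top_k` are hypothetical, and it assumes the penalty is chosen so that a fixed number of variables (the role played by keepX later in this report) retains a non-zero coefficient.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def keep_top_k(loadings, k):
    """Soft-threshold so that only the k largest |loadings| stay non-zero.

    Illustrative assumption: no ties at the cut-off, and the penalty is set
    to the (k+1)-th largest absolute loading.
    """
    if k >= len(loadings):
        return list(loadings)
    ranked = sorted((abs(v) for v in loadings), reverse=True)
    penalty = ranked[k]
    return [soft_threshold(v, penalty) for v in loadings]


# Hypothetical loading weights for six variables on one component.
loadings = [0.9, -0.1, 0.4, -0.7, 0.05, 0.2]
sparse = keep_top_k(loadings, k=3)

# Indices of the variables that survive the penalty (the "selected" panel).
selected = [i for i, v in enumerate(sparse) if v != 0.0]
print(sparse)
print(selected)
```

Variables whose absolute loading does not exceed the penalty are set exactly to zero, which is why an l1-type penalty yields a variable selection rather than merely small coefficients.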
After preprocessing and normalization of each omics data-type, the samples were divided into training and test sets.
Overview of multi-omics datasets analyzed for method benchmarking and in two case studies. In the breast cancer case study, training and test datasets are available for all omics types except proteins, which were measured only in the training samples.
The training data consisted of four omics datasets (mRNA, miRNA, CpGs and proteins), whereas the test data comprised all remaining samples, for which protein expression data were missing. The optimal multi-omics biomarker panel size was identified using a grid search: for each combination of numbers of variables to select, classification performance was assessed using 5-fold cross-validation repeated 5 times.
Breast cancer multi-omics study: optimal multi-omics biomarker panel for PAM50 subtypes. A grid was used to identify the optimal number of variables to select from each omics dataset. The following grid values were used for each omics dataset: mRNA = [5, 10, 15, 20], miRNA = [5, 10, 15, 20], CpGs = [5, 10, 15, 20], Proteins = [5, 10, 15, 20], across 3 components. The centroids distance measure was used to compute the error rate. The optimal multi-omics panel consisted of 20 mRNAs, 20 miRNAs, 15 CpGs and 15 proteins on component 1; 5 mRNAs, 5 miRNAs, 5 CpGs and 20 proteins on component 2; and 20 mRNAs, 20 miRNAs, 5 CpGs and 20 proteins on component 3.
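To give a sense of the size of this search, the grid can be enumerated directly. The following is a minimal sketch in Python rather than the code used for this report. The grid values, fold count and repeat count come from the text; the final count additionally assumes that each of the 3 components is tuned over the full grid in turn.

```python
from itertools import product

# Candidate numbers of variables to keep per omics block (from the text above).
grid = {
    "mRNA": [5, 10, 15, 20],
    "miRNA": [5, 10, 15, 20],
    "CpGs": [5, 10, 15, 20],
    "Proteins": [5, 10, 15, 20],
}
folds, repeats, n_components = 5, 5, 3

# Every combination of one value per block, e.g. (20, 20, 15, 15).
combinations = list(product(*grid.values()))

# Each combination is evaluated with 5-fold cross-validation repeated 5 times.
fits_per_combination = folds * repeats

print(len(combinations))        # number of combinations per component
print(fits_per_combination)     # cross-validation fits per combination
# Assumption: every component is tuned over the full grid in turn.
print(len(combinations) * fits_per_combination * n_components)
```

Each tuple corresponds to one keepX label of the form `20_20_15_15`, in the block order mRNA, miRNA, CpGs, Proteins.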
tuning a DIABLO_full model using 4 datasets
Optimal DIABLO model
The optimal multi-omics panel consisted of 45 mRNAs, 45 miRNAs, 25 CpGs and 55 proteins selected across three components, with a balanced error rate of 17.9 ± 1.9%.
## # A tibble: 3 x 4
## # Groups: comp [3]
## keepX meanError sdError comp
## <fct> <dbl> <dbl> <fct>
## 1 20_20_15_15 0.315 0.0161 comp1
## 2 5_5_5_20 0.219 0.0239 comp2
## 3 20_20_5_20 0.179 0.0194 comp3
optimal keepX
## $mRNA
## [1] 20 5 20
##
## $miRNA
## [1] 20 5 20
##
## $CpGs
## [1] 15 5 5
##
## $Proteins
## [1] 15 20 20
run DIABLO - with optimal keepX
Number of variables of each omic-type in the diablo panel
## mRNA miRNA CpGs Proteins
## 45 45 25 55
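These panel sizes follow from the optimal keepX values by summing over the three components. The short check below is a sketch in Python, not part of the original analysis, and it assumes that no variable is selected on more than one component, so that the sum equals the number of distinct variables.

```python
# Optimal keepX per component, copied from the output above.
keep_x = {
    "mRNA": [20, 5, 20],
    "miRNA": [20, 5, 20],
    "CpGs": [15, 5, 5],
    "Proteins": [15, 20, 20],
}

# Panel size per omics type: sum of the variables kept on each component
# (assumes no variable is selected on more than one component).
panel_size = {block: sum(per_comp) for block, per_comp in keep_x.items()}

print(panel_size)
print(sum(panel_size.values()))  # total number of variables in the panel
```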
overlap between the different omics compartments (mRNA, miRNA, CpGs and proteins)
- all mRNA, CpGs and proteins have been converted to gene symbols
overlap between the mRNA and CpGs
## [1] "ORMDL3"
overlap between the mRNA and Proteins
## [1] "GATA3" "INPP4B" "AR"
overlap between the DIABLO panel features (mRNA, miRNA, CpGs and proteins) and curated databases
Feature Plot
This panel identified many variables with previously known associations with breast cancer, as assessed by the overlap between the panel features and gene sets related to breast cancer in the Molecular Signatures Database (MSigDB) [23], miRCancer [24], Online Mendelian Inheritance in Man (OMIM) [25], and DriverDBv2 [26]. The feature plot depicts the contribution of each variable within each omics type, as indicated by its loading weight (variable importance). Variables not found in any database may represent novel biomarkers of breast cancer.
Evaluate performance of diablo panel using additional data (test datasets)
Number of samples in the train and test datasets
## Basal Her2 LumA LumB Set
## 1 76 38 188 77 Train
## 2 102 40 346 122 Test
Individual class error rate per PAM50 subtype
## Basal Her2 LumA LumB Overall.ER Overall.BER
## 0.04901961 0.20000000 0.13294798 0.53278689 0.20327869 0.22868862
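The two overall figures in this output summarize the per-class errors differently: the overall error rate pools all misclassified samples, whereas the balanced error rate (BER) averages the per-class error rates so that each subtype counts equally regardless of its size. The sketch below reproduces both in Python. The class sizes come from the train/test table above; the misclassified counts are not printed in this report and are inferred here from the reported rates (e.g. 0.04901961 × 102 = 5).

```python
# Test-set class sizes, copied from the train/test table above.
n_test = {"Basal": 102, "Her2": 40, "LumA": 346, "LumB": 122}

# Misclassified counts inferred from the reported per-class error rates;
# these counts are an assumption, not values printed in the report.
n_wrong = {"Basal": 5, "Her2": 8, "LumA": 46, "LumB": 65}

# Per-class error rate: misclassified samples divided by class size.
class_error = {k: n_wrong[k] / n_test[k] for k in n_test}

# Overall error rate pools all samples, so large classes dominate.
overall_er = sum(n_wrong.values()) / sum(n_test.values())

# Balanced error rate is the unweighted mean of the per-class error rates.
overall_ber = sum(class_error.values()) / len(class_error)

print({k: round(v, 4) for k, v in class_error.items()})
print(round(overall_er, 4), round(overall_ber, 4))
```

Because LumA is the largest class and has a comparatively low error rate, the overall error rate is lower than the BER, which gives the poorly classified LumB subtype the same weight as the others.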
Component plots
The plot below depicts the consensus and individual omics component plots based on the optimal biomarker panel, along with 95% confidence ellipses obtained from the training data, with the test samples superimposed. The majority of the test samples fell within the ellipses, suggesting that the multi-omics biomarker panel was reproducible from the training to the test set and was predictive of breast cancer subtypes (balanced error rate = 22.9%). The consensus plot corresponded strongly with the mRNA component plot, depicting a strong separation of the Basal (error rate = 4.9%) and Her2 (error rate = 20.0%) subtypes. We observed a weak separation of the Luminal A (LumA, error rate = 13.3%) and Luminal B (LumB, error rate = 53.3%) subtypes.
Heatmap
Similarly, the heatmap of the scaled expression of all features in the multi-omics biomarker panel showed strong clustering of the Basal and Her2 samples, whereas the Luminal A and B samples were mixed.
Network
Overall, the features of the multi-omics biomarker panel formed a densely connected network comprising four communities, where the variables in each community (cluster) were densely connected with one another and sparsely connected with other clusters.
Number of variables of each omic-type in the red cluster
## mRNA miRNA CpGs Proteins
## 20 21 15 16
Geneset enrichment analysis
The largest cluster in the network (red bubble) consisted of 72 variables (20 mRNAs, 21 miRNAs, 15 CpGs and 16 proteins) and was further investigated using gene set enrichment analysis. We identified many cancer-associated pathways (e.g. FOXM1 pathway, p53 signaling pathway), DNA damage and repair pathways (e.g. E2F mediated regulation of DNA replication, G2M DNA damage checkpoint) and various cell-cycle pathways (e.g. G1S transition, mitotic G1/G1S phases), demonstrating the ability of DIABLO to identify a biologically plausible multi-omics biomarker panel. This panel generalized to new breast cancer samples and implicated previously unknown molecular features in breast cancer, which could be further validated in experimental studies.
References
- Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods [Internet]. 2014 [cited 2016 Jan 19];11:333–7. Available from: http://www.nature.com/doifinder/10.1038/nmeth.2810
- Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: An R package for ‘omics feature selection and multiple data integration. PLOS Comput Biol [Internet]. 2017 [cited 2018 Jan 29];13:e1005752. Available from: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752
- The TCGA Research Network. The Cancer Genome Atlas [Internet]. Available from: http://cancergenome.nih.gov/
- Singh A, Yamamoto M, Kam SHY, Ruan J, Gauvreau GM, O’Byrne PM, et al. Gene-metabolite expression in blood can discriminate allergen-induced isolated early from dual asthmatic responses. Hsu Y-H, editor. PLoS ONE [Internet]. 2013 [cited 2015 Jul 18];8:e67907. Available from: http://dx.plos.org/10.1371/journal.pone.0067907
- Singh A, Yamamoto M, Ruan J, Choi JY, Gauvreau GM, Olek S, et al. Th17/Treg ratio derived using DNA methylation analysis is associated with the late phase asthmatic response. Allergy Asthma Clin Immunol [Internet]. 2014 [cited 2016 Mar 2];10:32. Available from: http://www.biomedcentral.com/content/pdf/1710-1492-10-32.pdf
- Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst [Internet]. 2015 [cited 2018 Jan 30];1:417–25. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2405471215002185
- Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics [Internet]. 2013 [cited 2018 Jan 30];29:638–44. Available from: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt014
- Hamosh A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res [Internet]. 2004 [cited 2018 Jan 30];33:D514–7. Available from: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gki033
- Chung I-F, Chen C-Y, Su S-C, Li C-Y, Wu K-J, Wang H-W, et al. DriverDBv2: a database for human cancer driver gene research. Nucleic Acids Res [Internet]. 2016 [cited 2018 Jan 30];44:D975–9. Available from: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkv1314