Case study 1: Breast cancer

Analysis 1: Comparison of classification performance between DIABLO models and other classification schemes (Concatenation and Ensemble) - only using mRNA, miRNA and CpGs

Number of samples and variables per dataset

##    mRNA miRNA CpGs Proteins          Attribute
## 1   379   379  379      379  Number of Samples
## 2 16851   349 9482      115 Number of features

Phenotype breakdown

## Y.train
## Basal  Her2  LumA  LumB 
##    76    38   188    77

Tune DIABLO models

Estimate DIABLO train and test error rates

Tune sPLSDA models

Estimate sPLSDA train and test error rates (Concatenation and Ensemble)

Tune and estimate Enet train and test error rates (Concatenation and Ensemble)

Plot of error rates

Table of error rates

mean_err  sd_err  cohort  modName
0.21      0.0091  train   DIABLO_Null
0.19      NA      test    DIABLO_Null
0.22      0.0057  train   DIABLO_Full
0.21      NA      test    DIABLO_Full
0.15      0.0130  train   Concatenation_sPLSDA
0.18      NA      test    Concatenation_sPLSDA
0.25      0.0140  train   Ensemble_sPLSDA
0.28      NA      test    Ensemble_sPLSDA
0.14      0.0072  train   Concatenation_Enet
0.20      NA      test    Concatenation_Enet
0.11      0.0016  train   Ensemble_Enet
0.23      NA      test    Ensemble_Enet
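One way to read this table is to compare each model's test error with its mean training error. The following is a minimal sketch of that arithmetic, with the values copied from the table; the dictionary layout and the name `gap` are illustrative choices, not part of the original analysis.

```python
# (mean train error, test error) per model, copied from the table above.
errors = {
    "DIABLO_Null": (0.21, 0.19),
    "DIABLO_Full": (0.22, 0.21),
    "Concatenation_sPLSDA": (0.15, 0.18),
    "Ensemble_sPLSDA": (0.25, 0.28),
    "Concatenation_Enet": (0.14, 0.20),
    "Ensemble_Enet": (0.11, 0.23),
}

# Test error minus train error; a larger positive gap means the
# training estimate was more optimistic than the test result.
gap = {model: round(test - train, 2) for model, (train, test) in errors.items()}

for model, value in sorted(gap.items(), key=lambda item: item[1]):
    print(f"{model:<22} {value:+.2f}")
```

The table's values are rounded to two decimals, so these differences are only approximate.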

Number of variables in each panel

modName               CpGs  miRNA  mRNA
DIABLO_Null             22     42    60
DIABLO_Full             17     17    55
Concatenation_sPLSDA    NA     NA    60
Ensemble_sPLSDA         40     55    60
Concatenation_Enet      38      2   118
Ensemble_Enet          127     45    96

Analysis 2: DIABLO identified known and novel multi-omics biomarkers of breast cancer subtypes - using mRNA, miRNA, CpGs and proteins

We next demonstrate that DIABLO can identify novel biomarkers in addition to biomarkers with known biological associations using a case study of human breast cancer. We applied our biomarker analysis workflow to breast cancer datasets to characterize and predict PAM50 breast cancer subtypes.

A standard DIABLO workflow. The first step takes as input multiple omics datasets measured on the same individuals, previously normalized and filtered, along with the phenotype information indicating the class membership of each sample (two or more groups). Optional preprocessing steps include multilevel transformation for repeated-measures study designs and pathway module summary transformations. DIABLO is a multivariate dimension reduction method that seeks latent components, that is, linear combinations of variables from each omics dataset, that are maximally correlated as specified by a design matrix (see Methods section). The multi-omics panel is identified with l1 penalties in the model that shrink the variable coefficients defining the components to zero. Several visualizations, including sample and variable plots, provide insights into the multi-omics panel and guide the interpretation of the selected omics variables. Downstream analyses include gene set enrichment analysis.
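To illustrate how an l1 penalty can set coefficients to zero, the following is a minimal sketch that assumes the penalty is applied by soft-thresholding a loading vector, as is common in sparse PLS-type methods. It is not the mixOmics implementation; the function name `sparsify_loadings` and the example loadings are hypothetical.

```python
import math

def sparsify_loadings(loadings, keep):
    """Soft-threshold a loading vector so that at most `keep` entries stay non-zero."""
    if keep >= len(loadings):
        penalty = 0.0
    else:
        # The (keep + 1)-th largest absolute loading serves as the l1 threshold.
        penalty = sorted((abs(w) for w in loadings), reverse=True)[keep]
    # Shrink every loading toward zero by the penalty; small ones become zero.
    shrunk = [math.copysign(max(abs(w) - penalty, 0.0), w) for w in loadings]
    # Rescale to unit length so the component keeps a fixed norm.
    norm = math.sqrt(sum(w * w for w in shrunk))
    return [w / norm for w in shrunk] if norm > 0 else shrunk

# Hypothetical loadings for five variables of one omics block.
loadings = [0.9, -0.5, 0.1, 0.05, -0.3]
sparse = sparsify_loadings(loadings, keep=2)
selected = [i for i, w in enumerate(sparse) if w != 0]
print(selected)  # indices of the variables retained in the panel
```

With ties at the threshold, fewer than `keep` variables may survive; a production implementation would need to handle that case explicitly.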

After preprocessing and normalization of each omics data-type, the samples were divided into training and test sets.

Overview of multi-omics datasets analyzed for method benchmarking and in two case studies. The breast cancer case study includes training and test datasets for all omics types except proteins.

The training data consisted of four omics datasets (mRNA, miRNA, CpGs and proteins), whereas the test data included all remaining samples, for which the protein expression data were missing. The optimal multi-omics biomarker panel size was identified using a grid approach: for each combination of numbers of variables to select, we assessed the classification performance using 5-fold cross-validation repeated 5 times.

Breast cancer multi-omics study: optimal multi-omics biomarker panel for PAM50 subtypes. A grid was used to identify the optimal number of variables to select from each omics dataset. The following grid values were used for each omics dataset: mRNA = [5, 10, 15, 20], miRNA = [5, 10, 15, 20], CpGs = [5, 10, 15, 20], Proteins = [5, 10, 15, 20], across 3 components. The centroids distance measure was used to compute the error rate. The optimal multi-omics panel consisted of 20 mRNAs, 20 miRNAs, 15 CpGs and 15 proteins on component 1, 5 mRNAs, 5 miRNAs, 5 CpGs and 20 proteins on component 2, and 20 mRNAs, 20 miRNAs, 5 CpGs and 20 proteins on component 3.
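The size of this search can be made concrete with a short sketch. It enumerates the grid described above and counts model fits under the assumption that every combination is refit on each fold of each repeat for a given component; the variable names are illustrative, and the actual tuning routine may organize the search differently.

```python
from itertools import product

# Candidate numbers of variables to keep per omics dataset (from the text).
grid = {
    "mRNA": [5, 10, 15, 20],
    "miRNA": [5, 10, 15, 20],
    "CpGs": [5, 10, 15, 20],
    "Proteins": [5, 10, 15, 20],
}

# Every combination of one value per dataset.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds, n_repeats = 5, 5  # 5-fold cross-validation repeated 5 times
fits_per_component = len(combos) * n_folds * n_repeats

print(len(combos), fits_per_component)  # 256 6400
```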

Tuning a DIABLO_Full model using 4 datasets

Optimal DIABLO model

The optimal multi-omics panel consisted of 45 mRNAs, 45 miRNAs, 25 CpGs and 55 proteins selected across three components, with a balanced error rate of 17.9±1.9%.

## # A tibble: 3 x 4
## # Groups:   comp [3]
##   keepX       meanError sdError comp 
##   <fct>           <dbl>   <dbl> <fct>
## 1 20_20_15_15     0.315  0.0161 comp1
## 2 5_5_5_20        0.219  0.0239 comp2
## 3 20_20_5_20      0.179  0.0194 comp3

Optimal keepX

## $mRNA
## [1] 20  5 20
## 
## $miRNA
## [1] 20  5 20
## 
## $CpGs
## [1] 15  5  5
## 
## $Proteins
## [1] 15 20 20

Run DIABLO with optimal keepX

Number of variables of each omics type in the DIABLO panel

##     mRNA    miRNA     CpGs Proteins 
##       45       45       25       55
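These panel sizes are consistent with summing the optimal keepX values over the three components, which holds when no variable is selected on more than one component. A minimal sketch of that check, with the values copied from the output above (the variable names are illustrative):

```python
# Optimal number of variables per component, copied from the keepX output.
keepX = {
    "mRNA": [20, 5, 20],
    "miRNA": [20, 5, 20],
    "CpGs": [15, 5, 5],
    "Proteins": [15, 20, 20],
}

# Total variables per omics type if selections do not overlap across components.
panel_size = {block: sum(per_comp) for block, per_comp in keepX.items()}

# Per-component labels in the mRNA_miRNA_CpGs_Proteins order used by the tuning table.
labels = ["_".join(str(keepX[block][comp]) for block in keepX) for comp in range(3)]

print(panel_size)  # {'mRNA': 45, 'miRNA': 45, 'CpGs': 25, 'Proteins': 55}
print(labels)      # ['20_20_15_15', '5_5_5_20', '20_20_5_20']
```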

Overlap between the different omics compartments (mRNA, miRNA, CpGs and proteins)

  • all mRNA, CpGs and proteins have been converted to gene symbols

Overlap between the mRNA and CpGs

## [1] "ORMDL3"

Overlap between the mRNA and proteins

## [1] "GATA3"  "INPP4B" "AR"

Overlap between the DIABLO panel features (mRNA, miRNA, CpGs and proteins) and curated databases

Feature Plot

This panel identified many variables with previously known associations with breast cancer, as assessed by the overlap between the panel features and gene sets related to breast cancer from the Molecular Signatures Database (MSigDB) [6], miRCancer [7], Online Mendelian Inheritance in Man (OMIM) [8], and DriverDBv2 [9]. The feature plot depicts the contribution of each variable within each omics type, as indicated by its loading weight (variable importance). Variables not found in any database may represent novel biomarkers of breast cancer.

Evaluate performance of the DIABLO panel using additional data (test datasets)

Number of samples in the train and test datasets

##   Basal Her2 LumA LumB   Set
## 1    76   38  188   77 Train
## 2   102   40  346  122  Test

Individual class error rate per PAM50 subtype

##       Basal        Her2        LumA        LumB  Overall.ER Overall.BER 
##  0.04901961  0.20000000  0.13294798  0.53278689  0.20327869  0.22868862
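The overall and balanced error rates above can be recovered from the per-class error rates and the test set class sizes. A minimal sketch, assuming the overall error rate is the sample-weighted mean of the class error rates and the balanced error rate is their unweighted mean; the values are copied from the two outputs above and the variable names are illustrative.

```python
# Test set class sizes and per-class error rates, copied from the outputs above.
test_counts = {"Basal": 102, "Her2": 40, "LumA": 346, "LumB": 122}
class_error = {
    "Basal": 0.04901961,
    "Her2": 0.20000000,
    "LumA": 0.13294798,
    "LumB": 0.53278689,
}

n_total = sum(test_counts.values())

# Overall error rate: misclassified samples over all samples.
overall_er = sum(class_error[k] * test_counts[k] for k in test_counts) / n_total

# Balanced error rate: every class counts equally, regardless of its size.
ber = sum(class_error.values()) / len(class_error)

print(round(overall_er, 4), round(ber, 4))  # 0.2033 0.2287
```

Because LumB has the highest error rate but is not the largest class, the balanced error rate exceeds the overall error rate here.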

Component plots

The plot below depicts the consensus and individual omics component plots based on the optimal biomarker panel, along with 95% confidence ellipses obtained from the training data and superimposed with the samples from the test data. The majority of the samples were within the ellipses, suggesting a reproducible multi-omics biomarker panel from the training to the test set, that was predictive of breast cancer subtypes (balanced error rate = 22.9%). The consensus plot corresponded strongly with the mRNA component plot, depicting a strong separation of the Basal (error rate = 4.9%) and Her2 (error rate = 20%) subtypes. We observed a weak separation of Luminal A (LumA, error rate = 13.3%) and Luminal B (LumB, error rate = 53.3%) subtypes.

Heatmap

Similarly, the heatmap showing the scaled expression of all features of the multi-omics biomarker panel depicted a strong clustering of the Basal and Her2 samples, whereas the Luminal A and B samples were mixed.

Network

Overall, the features of the multi-omics biomarker panel formed a densely connected network comprising four communities, where variables in each community (cluster) were densely connected with one another and sparsely connected with other clusters.

Number of variables of each omic-type in the red cluster

##     mRNA    miRNA     CpGs Proteins 
##       20       21       15       16

Geneset enrichment analysis

The largest cluster in the network consisted of 72 variables (20 mRNAs, 21 miRNAs, 15 CpGs and 16 proteins; red bubble) and was further investigated using gene set enrichment analysis. We identified many cancer-associated pathways (e.g. FOXM1 pathway, p53 signaling pathway), DNA damage and repair pathways (e.g. E2F mediated regulation of DNA replication, G2M DNA damage checkpoint) and various cell-cycle pathways (e.g. G1S transition, mitotic G1/G1S phases), demonstrating the ability of DIABLO to identify a biologically plausible multi-omics biomarker panel. This panel generalized to new breast cancer samples and implicated previously unknown molecular features in breast cancer, which could be further validated in experimental studies.
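The enrichment method is not specified in this section. A common choice for this kind of over-representation question is a one-sided hypergeometric test, sketched below under that assumption. The function name and every number in the example are hypothetical and do not come from the analysis.

```python
from math import comb

def hypergeom_upper_tail(overlap, universe, pathway_size, panel_size):
    """P(X >= overlap) when panel_size genes are drawn without replacement
    from `universe` genes, of which `pathway_size` belong to the pathway."""
    max_overlap = min(pathway_size, panel_size)
    total = comb(universe, panel_size)
    hits = sum(
        comb(pathway_size, i) * comb(universe - pathway_size, panel_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return hits / total

# Hypothetical example: 6 of 50 cluster genes fall in a 200-gene pathway,
# against a background of 20000 genes.
p_value = hypergeom_upper_tail(overlap=6, universe=20000, pathway_size=200, panel_size=50)
print(f"{p_value:.3g}")
```

In practice the p-values for all tested gene sets would also be adjusted for multiple testing.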

References

  1. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods [Internet]. 2014 [cited 2016 Jan 19];11:333–7. Available from: http://www.nature.com/doifinder/10.1038/nmeth.2810
  2. Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: An R package for ‘omics feature selection and multiple data integration. PLOS Comput Biol [Internet]. 2017 [cited 2018 Jan 29];13:e1005752. Available from: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752
  3. The TCGA Research Network. The Cancer Genome Atlas [Internet]. Available from: http://cancergenome.nih.gov/
  4. Singh A, Yamamoto M, Kam SHY, Ruan J, Gauvreau GM, O’Byrne PM, et al. Gene-metabolite expression in blood can discriminate allergen-induced isolated early from dual asthmatic responses. Hsu Y-H, editor. PLoS ONE [Internet]. 2013 [cited 2015 Jul 18];8:e67907. Available from: http://dx.plos.org/10.1371/journal.pone.0067907
  5. Singh A, Yamamoto M, Ruan J, Choi JY, Gauvreau GM, Olek S, et al. Th17/Treg ratio derived using DNA methylation analysis is associated with the late phase asthmatic response. Allergy Asthma Clin Immunol [Internet]. 2014 [cited 2016 Mar 2];10:32. Available from: http://www.biomedcentral.com/content/pdf/1710-1492-10-32.pdf
  6. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst [Internet]. 2015 [cited 2018 Jan 30];1:417–25. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2405471215002185
  7. Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics [Internet]. 2013 [cited 2018 Jan 30];29:638–44. Available from: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt014
  8. Hamosh A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res [Internet]. 2004 [cited 2018 Jan 30];33:D514–7. Available from: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gki033
  9. Chung I-F, Chen C-Y, Su S-C, Li C-Y, Wu K-J, Wang H-W, et al. DriverDBv2: a database for human cancer driver gene research. Nucleic Acids Res [Internet]. 2016 [cited 2018 Jan 30];44:D975–9. Available from: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkv1314