Benchmarking
- DIABLO identifies molecular networks with superior biological enrichment
- SNF data description
- number of samples in each group
- number of variables in each dataset
- Multi-omic biomarker panels
- Number of features per panel
- Component plots
- Silhouette coefficients per dataset per phenotype
- Overlap in panels
- Gene set enrichment analysis
- Connectivity
- Network attributes
DIABLO identifies molecular networks with superior biological enrichment
To assess the biological relevance of the features selected by each method, we turned to real biological datasets. We applied various integrative approaches to four cancer multi-omics datasets (colon, kidney, glioblastoma [gbm] and lung), each comprising mRNA, miRNA and CpG data, and identified multi-omics biomarker panels that were predictive of high and low survival times. We then compared the network properties and biological enrichment of the selected features across approaches.
Overview of multi-omics datasets analyzed for method benchmarking and in two case studies. The breast cancer case study includes training and test datasets for all omics types except proteins.
SNF data description
- The SNF datasets were part of the datasets used in the Nature Methods paper on Similarity Network Fusion (SNF); https://www.nature.com/articles/nmeth.2810
- The cancer datasets include GBM (Brain), Colon, Kidney, Lung and Breast (the Breast cancer dataset was excluded in order to avoid confusion with the case study on Breast Cancer)
- The datasets were obtained from: http://compbio.cs.toronto.edu/SNF/SNF/Software.html
- Survival times were provided for each disease cohort. The median survival time was used to dichotomize each response variable into low and high survival times.
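The median split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the survival times are hypothetical, and assigning times equal to the median to the "low" group is an assumption, since the source does not say how ties were handled.

```python
from statistics import median

def dichotomize_by_median(survival_times):
    """Label each sample 'high' or 'low' relative to the cohort median."""
    cutoff = median(survival_times)
    # Assumption: a time equal to the median is labeled "low";
    # the source does not state how ties were resolved.
    return ["high" if t > cutoff else "low" for t in survival_times]

# Hypothetical survival times (in days) for one disease cohort
times = [120, 450, 300, 980, 75, 610]
labels = dichotomize_by_median(times)
print(labels)
```

Each cohort would be dichotomized separately, using its own median.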
number of samples in each group
survival | colon | kidney | gbm | lung | Sum |
---|---|---|---|---|---|
high | 33 | 61 | 105 | 53 | 252 |
low | 59 | 61 | 108 | 53 | 281 |
Sum | 92 | 122 | 213 | 106 | 533 |
number of variables in each dataset
- mRNA transcripts or CpG probes that mapped to the same gene were averaged.
omics | colon | kidney | gbm | lung |
---|---|---|---|---|
mrna | 17814 | 17665 | 12042 | 12042 |
mirna | 312 | 329 | 534 | 352 |
cpg | 23088 | 24960 | 1305 | 23074 |
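Averaging transcripts or probes that map to the same gene could look like the sketch below. The probe identifiers, gene names and values are hypothetical, and the dictionary-based layout is an assumption made for illustration only.

```python
from collections import defaultdict

def average_by_gene(values, feature_to_gene):
    """Average the per-sample values of all features mapping to one gene.

    values: dict of feature id -> list of per-sample measurements
    feature_to_gene: dict of feature id -> gene symbol
    """
    grouped = defaultdict(list)
    for feature, row in values.items():
        grouped[feature_to_gene[feature]].append(row)
    # zip(*rows) walks the samples; each column holds one sample's values
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Hypothetical CpG probes measured in two samples
probes = {"cg01": [0.25, 0.5], "cg02": [0.75, 1.0], "cg03": [0.5, 0.125]}
mapping = {"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"}
gene_level = average_by_gene(probes, mapping)
print(gene_level)
```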
Multi-omic biomarker panels
Multi-omics biomarker panels were developed using component-based integrative approaches that also performed variable selection. Supervised methods included concatenation and ensemble schemes using the sPLSDA classifier [14], and DIABLO with either the null or the full design (DIABLO_null and DIABLO_full). Unsupervised approaches included sparse generalized canonical correlation analysis [15] (sGCCA), Multi-Omics Factor Analysis (MOFA), and Joint and Individual Variation Explained (JIVE) [23] (see Supplementary Note for parameter settings). Both supervised and unsupervised approaches were considered in order to compare and contrast the types of omics variables selected, the network properties and the biological enrichment results. A distinction was made between DIABLO models in which the correlation between omics datasets was not maximized (DIABLO_null) and those in which it was maximized (DIABLO_full).
Unsupervised
JIVE
MOFA
sGCCA
Supervised
Concatenation_sPLSDA
Ensemble_sPLSDA
DIABLO_null
DIABLO_full
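The difference between the null and full designs can be pictured as a matrix of weights linking each pair of omics datasets. The sketch below assumes a 0/1 encoding (0 for no link between two datasets, 1 for a link whose correlation is maximized) with a zero diagonal. The helper name and this encoding are illustrative assumptions, not the parameter settings from the Supplementary Note.

```python
def design_matrix(blocks, between_block_weight):
    """Square matrix of weights between omics blocks, with a zero diagonal."""
    return [
        [0.0 if a == b else between_block_weight for b in blocks]
        for a in blocks
    ]

blocks = ["mrna", "mirna", "cpg"]

# Assumed encoding: 0 = correlation between datasets is not maximized
null_design = design_matrix(blocks, 0.0)
# Assumed encoding: 1 = correlation between datasets is maximized
full_design = design_matrix(blocks, 1.0)

for name, row in zip(blocks, full_design):
    print(name, row)
```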
Number of features per panel
Each multi-omics biomarker panel included 180 features (60 features of each omics type across 2 components). Approaches generally identified distinct sets of features. The plots below depict the distinct and shared features between the seven multi-omics panels obtained from the unsupervised (purple, sGCCA, MOFA and JIVE) and supervised (green, Concatenation, Ensemble, DIABLO_null and DIABLO_full) methods. Supervised methods selected many of the same features (blue), but DIABLO_full had greater feature overlap with unsupervised methods (orange).
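Shared and distinct features between panels reduce to set operations. The sketch below uses three hypothetical panels with made-up feature names. It counts the features shared by every pair of panels and those shared by all of them, which is the kind of information an intersection plot or Venn diagram summarizes.

```python
from itertools import combinations

def pairwise_overlap(panels):
    """Number of shared features for every pair of panels."""
    return {
        (a, b): len(panels[a] & panels[b])
        for a, b in combinations(sorted(panels), 2)
    }

def shared_by_all(panels):
    """Features selected by every panel."""
    return set.intersection(*panels.values())

# Hypothetical panels; real panels contain 180 features each
panels = {
    "DIABLO_full": {"f1", "f2", "f3", "f4"},
    "DIABLO_null": {"f2", "f3", "f5"},
    "sGCCA": {"f1", "f3", "f6"},
}
overlaps = pairwise_overlap(panels)
core = shared_by_all(panels)
print(overlaps, core)
```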
Component plots
colon component plot
Silhouette coefficients per dataset per phenotype
Overlap in panels
Colon
Intersection plot
Venn diagram
Kidney
Intersection plot
Venn diagram
GBM
Intersection plot
Venn diagram
Lung
Intersection plot
Venn diagram
Gene set enrichment analysis
Finally, we carried out gene set enrichment analysis on each multi-omics biomarker panel (using gene symbols of mRNAs and CpGs) against 10 gene set collections (see Methods) and tabulated the number of significant gene sets at an FDR of 5%. In the colon, gbm and lung cancer datasets, the DIABLO_full model identified the greatest total number of significant gene sets across the 10 collections and ranked first in 7, 5 and 5 individual collections, respectively. In the kidney cancer dataset, JIVE ranked first in the most collections (6). DIABLO_full, which unlike the other approaches aimed to explain both the correlation structure between multiple omics layers and a phenotype of interest, thus implicated the greatest number of known biological gene sets (pathways, functions, processes, etc.) in three of the four cancer datasets.
- We wished to assess the enrichment of the selected features across a variety of annotated gene sets in the MSigDB collection (http://software.broadinstitute.org/gsea/msigdb), in particular:
- C1 - positional gene sets for each human chromosome and cytogenetic band.
- C2 – curated gene sets (Pathway Interaction DB [PID], Biocarta [BIOCARTA], Kyoto Encyclopedia of Genes and Genomes [KEGG], Reactome [REACTOME], and others)
- C3 - motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
- C4 – computational gene sets (from the Cancer Gene Neighbourhoods [CGN] and Cancer Modules [CM] – citation available via: http://www.broadinstitute.org/gsea/msigdb/collections.jsp)
- C5 - GO gene sets consist of genes annotated by the same GO terms.
- C6 - oncogenic gene sets (gene sets that represent signatures of cellular pathways which are often dysregulated in cancer).
- C7 - immunologic gene sets defined directly from microarray gene expression data from immunologic studies.
- H - hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
- BTM - Blood Transcriptional Modules (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2727981/)
- TISSUES - cell-specific expression from Benita et al. Blood 2008 http://www.bloodjournal.org/content/115/26/5376
- Significance of enrichment was determined using a hypergeometric test of the overlap between the selected features (mapped to official HUGO gene symbols or official miRNA symbols) and the gene sets contained in the collections. The resulting p-values were corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure applied across all gene sets (more than 10,000 tests, the most conservative choice). Adjusted p-values are reported in the fdr column.
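The enrichment test and the multiple-testing correction can be sketched with the standard library alone. The function names and the numbers in the example are hypothetical. The size of the gene universe (the background) is an assumption here, as the source does not state which background was used.

```python
from math import comb

def hypergeom_pvalue(overlap, panel_size, set_size, universe_size):
    """Upper-tail probability P(X >= overlap) for a panel of panel_size genes
    drawn from universe_size genes, set_size of which belong to the gene set."""
    upper = min(panel_size, set_size)
    total = comb(universe_size, panel_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, panel_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 3 of 3 panel genes fall in a 4-gene set, universe of 10
p = hypergeom_pvalue(overlap=3, panel_size=3, set_size=4, universe_size=10)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, fdr)
```

Applying the correction once over the p-values pooled from all collections, rather than per collection, corresponds to the pooled correction described above.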
All gene sets
- only using mRNA and CpGs
disease | method | type | BTM | C1 | C2 | C3 | C4 | C5 | C6 | C7 | H | TISSUES |
---|---|---|---|---|---|---|---|---|---|---|---|---|
colon | Concatenation | supervised | 0 | 0 | 12 | 11 | 1 | 7 | 0 | 61 | 0 | 0 |
colon | DIABLO_full | supervised | 23 | 0 | 113 | 0 | 46 | 216 | 0 | 218 | 7 | 16 |
colon | DIABLO_null | supervised | 0 | 0 | 21 | 6 | 1 | 0 | 0 | 62 | 2 | 0 |
colon | Ensemble | supervised | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 10 | 0 | 0 |
colon | JIVE | unsupervised | 0 | 0 | 15 | 8 | 0 | 19 | 0 | 1 | 0 | 2 |
colon | MOFA | unsupervised | 4 | 0 | 14 | 5 | 1 | 36 | 0 | 87 | 0 | 12 |
colon | sGCCA | unsupervised | 0 | 0 | 5 | 14 | 0 | 147 | 0 | 11 | 0 | 0 |
gbm | Concatenation | supervised | 10 | 0 | 258 | 14 | 47 | 526 | 30 | 432 | 19 | 10 |
gbm | DIABLO_full | supervised | 30 | 0 | 426 | 34 | 125 | 693 | 21 | 869 | 19 | 44 |
gbm | DIABLO_null | supervised | 10 | 0 | 312 | 15 | 62 | 776 | 24 | 147 | 20 | 14 |
gbm | Ensemble | supervised | 9 | 0 | 358 | 15 | 50 | 669 | 24 | 173 | 23 | 12 |
gbm | JIVE | unsupervised | 0 | 0 | 275 | 94 | 49 | 825 | 22 | 460 | 12 | 18 |
gbm | MOFA | unsupervised | 0 | 0 | 337 | 64 | 43 | 708 | 25 | 82 | 8 | 29 |
gbm | sGCCA | unsupervised | 19 | 0 | 193 | 37 | 68 | 706 | 18 | 526 | 8 | 21 |
kidney | Concatenation | supervised | 0 | 0 | 10 | 4 | 7 | 55 | 0 | 93 | 1 | 0 |
kidney | DIABLO_full | supervised | 0 | 1 | 4 | 1 | 0 | 0 | 0 | 18 | 0 | 0 |
kidney | DIABLO_null | supervised | 0 | 0 | 15 | 23 | 3 | 46 | 0 | 10 | 1 | 0 |
kidney | Ensemble | supervised | 0 | 0 | 5 | 35 | 1 | 27 | 0 | 13 | 0 | 0 |
kidney | JIVE | unsupervised | 1 | 0 | 42 | 8 | 17 | 157 | 0 | 0 | 6 | 2 |
kidney | MOFA | unsupervised | 0 | 0 | 33 | 80 | 6 | 110 | 0 | 74 | 3 | 0 |
kidney | sGCCA | unsupervised | 0 | 1 | 7 | 1 | 0 | 1 | 0 | 15 | 0 | 0 |
lung | Concatenation | supervised | 0 | 1 | 0 | 50 | 0 | 0 | 3 | 0 | 0 | 0 |
lung | DIABLO_full | supervised | 0 | 1 | 33 | 19 | 13 | 193 | 7 | 100 | 0 | 20 |
lung | DIABLO_null | supervised | 2 | 0 | 1 | 21 | 18 | 22 | 5 | 72 | 0 | 9 |
lung | Ensemble | supervised | 0 | 0 | 0 | 26 | 0 | 25 | 2 | 7 | 1 | 0 |
lung | JIVE | unsupervised | 0 | 0 | 4 | 48 | 17 | 35 | 1 | 18 | 0 | 0 |
lung | MOFA | unsupervised | 0 | 0 | 17 | 20 | 0 | 127 | 0 | 13 | 2 | 0 |
lung | sGCCA | unsupervised | 0 | 0 | 2 | 57 | 47 | 42 | 1 | 78 | 0 | 0 |
All gene sets combined
- only using mRNA and CpGs
method | colon | gbm | kidney | lung |
---|---|---|---|---|
Concatenation | 92 | 1346 | 170 | 54 |
DIABLO_full | 639 | 2261 | 24 | 386 |
DIABLO_null | 92 | 1380 | 98 | 150 |
Ensemble | 17 | 1333 | 81 | 61 |
JIVE | 45 | 1755 | 233 | 123 |
MOFA | 159 | 1296 | 306 | 179 |
sGCCA | 177 | 1596 | 25 | 227 |
Which method leads to the greatest number of significant pathways?
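The combined counts can be recomputed from the per-collection table. The sketch below does so for the colon dataset, using the counts from the table above. It also counts the collections in which a method is the sole leader, where treating ties as no win is an assumption of this sketch.

```python
collection_names = ["BTM", "C1", "C2", "C3", "C4", "C5", "C6", "C7", "H", "TISSUES"]

# Significant gene sets per collection for the colon dataset (from the table)
colon = {
    "Concatenation": [0, 0, 12, 11, 1, 7, 0, 61, 0, 0],
    "DIABLO_full": [23, 0, 113, 0, 46, 216, 0, 218, 7, 16],
    "DIABLO_null": [0, 0, 21, 6, 1, 0, 0, 62, 2, 0],
    "Ensemble": [0, 0, 3, 2, 2, 0, 0, 10, 0, 0],
    "JIVE": [0, 0, 15, 8, 0, 19, 0, 1, 0, 2],
    "MOFA": [4, 0, 14, 5, 1, 36, 0, 87, 0, 12],
    "sGCCA": [0, 0, 5, 14, 0, 147, 0, 11, 0, 0],
}

def sole_leader_counts(counts, n_collections):
    """Number of collections in which each method has the strictly highest count."""
    wins = {method: 0 for method in counts}
    for j in range(n_collections):
        top = max(row[j] for row in counts.values())
        leaders = [m for m, row in counts.items() if row[j] == top]
        if len(leaders) == 1:  # assumption: ties do not count as a win
            wins[leaders[0]] += 1
    return wins

totals = {method: sum(row) for method, row in colon.items()}
best = max(totals, key=totals.get)
wins = sole_leader_counts(colon, len(collection_names))
print(totals, best, wins)
```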
Connectivity
The level of connectivity of each of the seven multi-omics panels was assessed by generating networks from the feature adjacency matrix at various Pearson correlation coefficient cut-offs. At all cut-offs, unsupervised approaches produced networks with greater connectivity (number of edges) than supervised approaches. In addition, the networks of biomarker panels identified by DIABLO_full were more similar to those of unsupervised approaches, with high graph density, a low number of communities and a large number of triads, indicating that DIABLO_full identified discriminative sets of features that were tightly correlated across biological compartments.
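Building a network from pairwise Pearson correlations at a given cut-off, and then counting edges and computing graph density, can be sketched as below. The feature names and values are hypothetical. Thresholding the absolute correlation is an assumption, since the source does not say whether negative correlations were treated as edges.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def edges_at_cutoff(features, cutoff):
    """Feature pairs whose absolute correlation reaches the cut-off.
    Assumption: the sign of the correlation is ignored."""
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) >= cutoff
    ]

def graph_density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# Hypothetical panel of three features measured in four samples
features = {
    "mrna_1": [1, 2, 3, 4],
    "cpg_1": [2, 4, 6, 8.5],
    "mirna_1": [4, 1, 3, 2],
}
for cutoff in (0.3, 0.9):
    edges = edges_at_cutoff(features, cutoff)
    print(cutoff, len(edges), graph_density(len(features), len(edges)))
```

Repeating this over a grid of cut-offs gives the number of connections per panel as a function of the threshold.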
Number of connections
Colon only
Network attributes
all network plots
network and component plot of the multi-omic panels derived using the colon cancer dataset
The plots below depict the networks of all multi-omics biomarker panels for the colon cancer dataset, which show higher modularity (a limited number of large clusters of variables; circled) for the DIABLO_full and the unsupervised approaches as compared to the supervised ones. The corresponding component plots show a clear separation between the high and low survival groups for the panels derived using supervised approaches, whereas the unsupervised approaches could not segregate the survival groups.
networks
References
- Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods [Internet]. 2014 [cited 2016 Jan 19];11:333–7. Available from: http://www.nature.com/doifinder/10.1038/nmeth.2810
- Lê Cao K-A, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics [Internet]. 2011 [cited 2015 Jul 15];12:253. Available from: http://www.biomedcentral.com/1471-2105/12/253/
- Tenenhaus A, Philippe C, Guillemot V, Le Cao K-A, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics [Internet]. 2014 [cited 2015 Jul 15];15:569–83. Available from: http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxu001
- The TCGA Research Network. The Cancer Genome Atlas [Internet]. Available from: http://cancergenome.nih.gov/
- Singh A, Yamamoto M, Kam SHY, Ruan J, Gauvreau GM, O’Byrne PM, et al. Gene-metabolite expression in blood can discriminate allergen-induced isolated early from dual asthmatic responses. Hsu Y-H, editor. PLoS ONE [Internet]. 2013 [cited 2015 Jul 18];8:e67907. Available from: http://dx.plos.org/10.1371/journal.pone.0067907
- Singh A, Yamamoto M, Ruan J, Choi JY, Gauvreau GM, Olek S, et al. Th17/Treg ratio derived using DNA methylation analysis is associated with the late phase asthmatic response. Allergy Asthma Clin Immunol [Internet]. 2014 [cited 2016 Mar 2];10:32. Available from: http://www.biomedcentral.com/content/pdf/1710-1492-10-32.pdf
- Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst [Internet]. 2015 [cited 2018 Jan 30];1:417–25. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2405471215002185