Benchmarking

DIABLO identifies molecular networks with superior biological enrichment

To assess this, we turn to real biological datasets. We applied various integrative approaches to cancer multi-omics datasets (mRNA, miRNA, and CpG) – colon, kidney, glioblastoma (gbm) and lung – and identified multi-omics biomarker panels that were predictive of high and low survival times. We then compared the network properties and biological enrichment of the selected features across approaches.

Overview of multi-omics datasets analyzed for method benchmarking and in two case studies. The breast cancer case study includes training and test datasets for all omics types except proteins.

SNF data description

  • The SNF datasets were part of the datasets used in the Nature Methods paper on Similarity Network Fusion (SNF); https://www.nature.com/articles/nmeth.2810
  • The cancer datasets include GBM (Brain), Colon, Kidney, Lung and Breast (the Breast cancer dataset was excluded in order to avoid confusion with the case study on Breast Cancer)
  • The datasets were obtained from: http://compbio.cs.toronto.edu/SNF/SNF/Software.html
  • Survival times were provided for each disease cohort. The median survival time was used to dichotomize each response variable into low and high survival times.
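The median split described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name, the toy survival times, and the handling of ties with the median (assigned to the low group here) are all assumptions.

```python
from statistics import median

def dichotomize_by_median(survival_times):
    """Label each survival time 'high' if above the cohort median, else 'low'."""
    cutoff = median(survival_times)
    # Assumption: values equal to the median are assigned to the 'low' group.
    return ["high" if t > cutoff else "low" for t in survival_times]

# Hypothetical survival times (in days) for a toy cohort of six patients
times = [120, 340, 560, 90, 410, 700]
labels = dichotomize_by_median(times)
print(labels)
```

How ties and censored observations are handled affects group sizes, which may be one reason the high and low groups in a real cohort are not always balanced.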

number of samples in each group

##      colon kidney gbm lung Sum
## high    33     61 105   53 252
## low     59     61 108   53 281
## Sum     92    122 213  106 533

number of variables in each dataset

  • mRNA transcripts or CpG probes that mapped to the same gene were averaged
##       colon kidney   gbm  lung
## mrna  17814  17665 12042 12042
## mirna   312    329   534   352
## cpg   23088  24960  1305 23074
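The averaging of transcripts or probes that map to the same gene can be illustrated with a short sketch. The probe identifiers, gene symbols and values below are hypothetical, and the data layout (one list of per-sample values per probe) is an assumption made for the example.

```python
from collections import defaultdict

def average_by_gene(values, probe_to_gene):
    """Collapse probe-level rows to gene-level rows by averaging per sample.

    values: dict mapping probe id -> list of measurements (one per sample)
    probe_to_gene: dict mapping probe id -> gene symbol
    """
    grouped = defaultdict(list)
    for probe, row in values.items():
        grouped[probe_to_gene[probe]].append(row)
    # Average the rows of each gene sample by sample (column-wise)
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Hypothetical probe identifiers and values for two samples
values = {"cg01": [0.2, 0.4], "cg02": [0.6, 0.8], "cg03": [0.5, 0.1]}
probe_to_gene = {"cg01": "TP53", "cg02": "TP53", "cg03": "EGFR"}
gene_level = average_by_gene(values, probe_to_gene)
print(gene_level)
```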

Multi-omic biomarker panels

Multi-omics biomarker panels were developed using component-based integrative approaches that also performed variable selection. Supervised methods included concatenation and ensemble schemes using the sPLSDA classifier [14], and DIABLO with either the null or full design (DIABLO_null and DIABLO_full). Unsupervised approaches included sparse generalized canonical correlation analysis (sGCCA) [15], Multi-Omics Factor Analysis (MOFA), and Joint and Individual Variation Explained (JIVE) [23] (see Supplementary Note for parameter settings). Both supervised and unsupervised approaches were considered in order to compare and contrast the types of omics variables selected, the network properties and the biological enrichment results. A distinction was made between DIABLO models in which the correlation between omics datasets was not maximized (DIABLO_null) and those in which it was maximized (DIABLO_full).

Unsupervised

JIVE

MOFA

sGCCA

Supervised

Concatenation_sPLSDA

Ensemble_sPLSDA

DIABLO_null

DIABLO_full

Number of features per panel

Each multi-omics biomarker panel included 180 features (60 features of each omics type across 2 components). Approaches generally identified distinct sets of features. The plots below depict the distinct and shared features between the seven multi-omics panels obtained from the unsupervised (purple, sGCCA, MOFA and JIVE) and supervised (green, Concatenation, Ensemble, DIABLO_null and DIABLO_full) methods. Supervised methods selected many of the same features (blue), but DIABLO_full had greater feature overlap with unsupervised methods (orange).
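The kind of overlap summarized in the intersection plots and Venn diagrams can be computed with plain set operations. The sketch below uses three small hypothetical panels; the feature names and the helper function are illustrative assumptions, not the panels or code from the analysis.

```python
from itertools import combinations

def pairwise_overlap(panels):
    """Return the number of shared features for every pair of panels."""
    return {
        (a, b): len(panels[a] & panels[b])
        for a, b in combinations(sorted(panels), 2)
    }

# Hypothetical toy panels (real panels contain 180 features each)
panels = {
    "DIABLO_full": {"geneA", "geneB", "mirX", "cpg1"},
    "DIABLO_null": {"geneA", "geneC", "mirY", "cpg1"},
    "sGCCA": {"geneB", "geneD", "mirX", "cpg2"},
}
overlap = pairwise_overlap(panels)
# Features selected by every method
shared_by_all = set.intersection(*panels.values())
print(overlap, shared_by_all)
```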

Component plots

colon component plot

Silhouette coefficients per dataset per phenotype

Overlap in panels

Colon

Intersection plot

Venn diagram

Kidney

Intersection plot

Venn diagram

GBM

Intersection plot

Venn diagram

Lung

Intersection plot

Venn diagram

Gene set enrichment analysis

Finally, we carried out gene set enrichment analysis on each multi-omics biomarker panel (using gene symbols of mRNAs and CpGs) against 10 gene set collections (see Methods) and tabulated the number of significant (FDR = 5%) gene sets. The DIABLO_full model identified the greatest number of significant gene sets and generally ranked higher than the other methods in the colon (7 of the 10 collections), gbm (5 collections) and lung (5 collections) cancer datasets, whereas JIVE outperformed all other methods in the kidney cancer dataset (6 collections). DIABLO_full, which aims to explain both the correlation structure between multiple omics layers and a phenotype of interest, thus implicated the greatest number of known biological gene sets (pathways, functions, processes, etc.) in three of the four cancer types.

  1. C1 - positional gene sets for each human chromosome and cytogenetic band.
  2. C2 – curated gene sets (Pathway Interaction DB [PID], Biocarta [BIOCARTA], Kyoto Encyclopedia of Genes and Genomes [KEGG], Reactome [REACTOME], and others)
  3. C3 - motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
  4. C4 – computational gene sets (from the Cancer Gene Neighbourhoods [CGN] and Cancer Modules [CM] – citation available via: http://www.broadinstitute.org/gsea/msigdb/collections.jsp)
  5. C5 - GO gene sets consist of genes annotated by the same GO terms.
  6. C6 – oncogenic gene sets (signatures of cellular pathways that are often dysregulated in cancer).
  7. C7 - immunologic gene sets defined directly from microarray gene expression data from immunologic studies.
  8. H - hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
  9. BTM - Blood Transcriptional Modules (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2727981/)
  10. TISSUES - cell-specific expression from Benita et al. Blood 2008 (http://www.bloodjournal.org/content/115/26/5376)
  • Significance of enrichment was determined using a hypergeometric test of the overlap between the selected features (mapped to official HUGO gene symbols or official miRNA symbols) and the gene sets contained in each collection. The resulting p-values were corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure applied across all gene sets at once (more than 10,000 tests, the most conservative choice). Adjusted p-values are reported in the fdr column.
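The enrichment test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code: the function names, the toy universe of 10 genes and the three gene sets are hypothetical, and the p-value is taken as the upper tail of the hypergeometric distribution, P(X ≥ k).

```python
from math import comb

def hypergeom_pvalue(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the gene set."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy example: a panel of 3 genes tested against 3 gene sets
universe_size = 10
selected = {"g1", "g2", "g3"}
gene_sets = {
    "setA": {"g1", "g2", "g4", "g5"},
    "setB": {"g6", "g7"},
    "setC": {"g1", "g2", "g3"},
}
pvals = [
    hypergeom_pvalue(len(selected & gs), universe_size, len(gs), len(selected))
    for gs in gene_sets.values()
]
fdr = benjamini_hochberg(pvals)
print(dict(zip(gene_sets, fdr)))
```

Pooling the p-values from all collections before adjustment, as the text describes, makes the correction stricter than adjusting within each collection separately, because the number of tests m is larger.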

All gene sets

  • only using mRNA and CpGs
disease  method         type          BTM  C1   C2  C3   C4   C5  C6   C7   H  TISSUES
colon    Concatenation  supervised      0   0   12  11    1    7   0   61   0        0
colon    DIABLO_full    supervised     23   0  113   0   46  216   0  218   7       16
colon    DIABLO_null    supervised      0   0   21   6    1    0   0   62   2        0
colon    Ensemble       supervised      0   0    3   2    2    0   0   10   0        0
colon    JIVE           unsupervised    0   0   15   8    0   19   0    1   0        2
colon    MOFA           unsupervised    4   0   14   5    1   36   0   87   0       12
colon    sGCCA          unsupervised    0   0    5  14    0  147   0   11   0        0
gbm      Concatenation  supervised     10   0  258  14   47  526  30  432  19       10
gbm      DIABLO_full    supervised     30   0  426  34  125  693  21  869  19       44
gbm      DIABLO_null    supervised     10   0  312  15   62  776  24  147  20       14
gbm      Ensemble       supervised      9   0  358  15   50  669  24  173  23       12
gbm      JIVE           unsupervised    0   0  275  94   49  825  22  460  12       18
gbm      MOFA           unsupervised    0   0  337  64   43  708  25   82   8       29
gbm      sGCCA          unsupervised   19   0  193  37   68  706  18  526   8       21
kidney   Concatenation  supervised      0   0   10   4    7   55   0   93   1        0
kidney   DIABLO_full    supervised      0   1    4   1    0    0   0   18   0        0
kidney   DIABLO_null    supervised      0   0   15  23    3   46   0   10   1        0
kidney   Ensemble       supervised      0   0    5  35    1   27   0   13   0        0
kidney   JIVE           unsupervised    1   0   42   8   17  157   0    0   6        2
kidney   MOFA           unsupervised    0   0   33  80    6  110   0   74   3        0
kidney   sGCCA          unsupervised    0   1    7   1    0    1   0   15   0        0
lung     Concatenation  supervised      0   1    0  50    0    0   3    0   0        0
lung     DIABLO_full    supervised      0   1   33  19   13  193   7  100   0       20
lung     DIABLO_null    supervised      2   0    1  21   18   22   5   72   0        9
lung     Ensemble       supervised      0   0    0  26    0   25   2    7   1        0
lung     JIVE           unsupervised    0   0    4  48   17   35   1   18   0        0
lung     MOFA           unsupervised    0   0   17  20    0  127   0   13   2        0
lung     sGCCA          unsupervised    0   0    2  57   47   42   1   78   0        0

All gene sets combined

  • only using mRNA and CpGs
method         colon   gbm  kidney  lung
Concatenation     92  1346     170    54
DIABLO_full      639  2261      24   386
DIABLO_null       92  1380      98   150
Ensemble          17  1333      81    61
JIVE              45  1755     233   123
MOFA             159  1296     306   179
sGCCA            177  1596      25   227

Which method leads to the greatest number of significant pathways?
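One way to answer this from the combined table is to take, for each disease, the method with the largest total. The sketch below copies the totals from the combined table above; the variable names are illustrative.

```python
# Total number of significant gene sets per method and disease,
# copied from the "All gene sets combined" table
totals = {
    "Concatenation": {"colon": 92, "gbm": 1346, "kidney": 170, "lung": 54},
    "DIABLO_full": {"colon": 639, "gbm": 2261, "kidney": 24, "lung": 386},
    "DIABLO_null": {"colon": 92, "gbm": 1380, "kidney": 98, "lung": 150},
    "Ensemble": {"colon": 17, "gbm": 1333, "kidney": 81, "lung": 61},
    "JIVE": {"colon": 45, "gbm": 1755, "kidney": 233, "lung": 123},
    "MOFA": {"colon": 159, "gbm": 1296, "kidney": 306, "lung": 179},
    "sGCCA": {"colon": 177, "gbm": 1596, "kidney": 25, "lung": 227},
}

diseases = ["colon", "gbm", "kidney", "lung"]
# For each disease, pick the method with the highest total
best = {d: max(totals, key=lambda m: totals[m][d]) for d in diseases}
print(best)
# {'colon': 'DIABLO_full', 'gbm': 'DIABLO_full', 'kidney': 'MOFA', 'lung': 'DIABLO_full'}
```

By total count, DIABLO_full leads in colon, gbm and lung, while MOFA has the largest total in kidney. This differs slightly from the per-collection ranking, in which JIVE ranks first in the largest number of collections for kidney.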

Connectivity

The level of connectivity of each of the seven multi-omics panels was assessed by generating networks from the feature adjacency matrix at various Pearson correlation coefficient cut-offs. At all cut-offs, unsupervised approaches produced networks with greater connectivity (number of edges) than supervised approaches. In addition, the networks of biomarker panels identified by DIABLO_full were more similar to those identified by unsupervised approaches, with high graph density, a low number of communities and a large number of triads, indicating that DIABLO_full identified discriminative sets of features that were tightly correlated across biological compartments.
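The construction of a network from a correlation cut-off can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: edges are drawn when the absolute Pearson correlation reaches the cut-off (whether the original analysis thresholded absolute or signed correlations is not stated here), and the feature names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def network_summary(features, cutoff):
    """Count edges with |r| >= cutoff and compute the graph density."""
    names = sorted(features)
    edges = [
        (a, b)
        for a, b in combinations(names, 2)
        if abs(pearson(features[a], features[b])) >= cutoff
    ]
    possible = len(names) * (len(names) - 1) // 2
    return {"edges": len(edges), "density": len(edges) / possible}

# Hypothetical features measured on five samples
features = {
    "gene1": [1, 2, 3, 4, 5],
    "gene2": [2, 4, 6, 8, 10],
    "mir1": [5, 4, 3, 2, 1],
    "cpg1": [1, 3, 2, 5, 4],
}
for cutoff in (0.5, 0.9):
    print(cutoff, network_summary(features, cutoff))
```

Raising the cut-off removes weaker edges, so the edge count and density can only stay the same or decrease, which is why panels are compared across a range of cut-offs rather than at a single value.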

Number of connections

Colon only

Network attributes

all network plots

network and component plot of the multi-omic panels derived using the colon cancer dataset

The plots below depict the networks of all multi-omics biomarker panels for the colon cancer dataset, which show higher modularity (a limited number of large clusters of variables; circled) for the DIABLO_full and the unsupervised approaches as compared to the supervised ones. The corresponding component plots show a clear separation between the high and low survival groups for the panels derived using supervised approaches, whereas the unsupervised approaches could not segregate the survival groups.

networks

References

  1. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods [Internet]. 2014 [cited 2016 Jan 19];11:333–7. Available from: http://www.nature.com/doifinder/10.1038/nmeth.2810
  2. Lê Cao K-A, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics [Internet]. 2011 [cited 2015 Jul 15];12:253. Available from: http://www.biomedcentral.com/1471-2105/12/253/
  3. Tenenhaus A, Philippe C, Guillemot V, Le Cao K-A, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics [Internet]. 2014 [cited 2015 Jul 15];15:569–83. Available from: http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxu001
  4. The TCGA Research Network. The Cancer Genome Atlas [Internet]. Available from: http://cancergenome.nih.gov/
  5. Singh A, Yamamoto M, Kam SHY, Ruan J, Gauvreau GM, O’Byrne PM, et al. Gene-metabolite expression in blood can discriminate allergen-induced isolated early from dual asthmatic responses. Hsu Y-H, editor. PLoS ONE [Internet]. 2013 [cited 2015 Jul 18];8:e67907. Available from: http://dx.plos.org/10.1371/journal.pone.0067907
  6. Singh A, Yamamoto M, Ruan J, Choi JY, Gauvreau GM, Olek S, et al. Th17/Treg ratio derived using DNA methylation analysis is associated with the late phase asthmatic response. Allergy Asthma Clin Immunol [Internet]. 2014 [cited 2016 Mar 2];10:32. Available from: http://www.biomedcentral.com/content/pdf/1710-1492-10-32.pdf
  7. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst [Internet]. 2015 [cited 2018 Jan 30];1:417–25. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2405471215002185