Benchmarking

DIABLO identifies molecular networks with superior biological enrichment
SNF data description
number of samples in each group
number of variables in each dataset
Multi-omic biomarker panels
- Unsupervised
- Supervised
Number of features per panel
Component plots
- colon component plot
Silhouette coefficients per dataset per phenotype
Overlap in panels
- Colon
- Kidney
- GBM
- Lung
Gene set enrichment analysis
- All gene sets
- All gene sets combined
Connectivity
- Number of connections
Network attributes

DIABLO identifies molecular networks with superior biological enrichment

To assess whether the selected features are biologically meaningful, we turned to real biological datasets. We applied various integrative approaches to four cancer multi-omics datasets (mRNA, miRNA, and CpG), namely colon, kidney, glioblastoma (gbm) and lung, and identified multi-omics biomarker panels that were predictive of high and low survival times. We then compared the network properties and biological enrichment of the selected features across approaches.

Overview of multi-omics datasets analyzed for method benchmarking and in two case studies. The breast cancer case study includes training and test datasets for all omics types except proteins.

SNF data description

The SNF datasets are a subset of the datasets used in the Nature Methods paper on Similarity Network Fusion (SNF): https://www.nature.com/articles/nmeth.2810
The cancer datasets include GBM (brain), colon, kidney, lung and breast. The breast cancer dataset was excluded to avoid confusion with the breast cancer case study.
The datasets were obtained from: http://compbio.cs.toronto.edu/SNF/SNF/Software.html
Survival times were provided for each disease cohort. The median survival time was used to dichotomize each response variable into low and high survival times.
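The median split described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the survival times are hypothetical, and sending values equal to the median to the "high" group is an assumption, since the handling of ties is not stated here.

```python
from statistics import median

def dichotomize_by_median(survival_times):
    """Label each survival time 'high' or 'low' relative to the cohort median."""
    cutoff = median(survival_times)
    # Assumption: values equal to the median are labelled "high";
    # the report does not say how ties were handled.
    return ["high" if t >= cutoff else "low" for t in survival_times]

# Hypothetical survival times (in days) for one cohort.
times = [120, 340, 560, 90, 800, 410]
labels = dichotomize_by_median(times)
print(list(zip(times, labels)))
```

Each cohort would be split separately, so the cut-off differs between colon, kidney, gbm and lung.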

number of samples in each group

##      colon kidney gbm lung Sum
## high    33     61 105   53 252
## low     59     61 108   53 281
## Sum     92    122 213  106 533

number of variables in each dataset

mRNA transcripts or CpG probes that mapped to the same gene were averaged.
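A minimal sketch of this averaging step, assuming each probe carries one value per sample and a probe-to-gene mapping is available. The probe names, gene names and values below are hypothetical.

```python
from collections import defaultdict

def average_by_gene(values_by_probe, probe_to_gene):
    """Average the per-sample values of all probes that map to the same gene."""
    grouped = defaultdict(list)
    for probe, values in values_by_probe.items():
        grouped[probe_to_gene[probe]].append(values)
    averaged = {}
    for gene, rows in grouped.items():
        # zip(*rows) walks the samples column by column across the gene's probes.
        averaged[gene] = [sum(col) / len(col) for col in zip(*rows)]
    return averaged

# Hypothetical data: three probes measured in three samples.
values_by_probe = {
    "probe_1": [1.0, 2.0, 3.0],
    "probe_2": [3.0, 4.0, 5.0],
    "probe_3": [0.5, 0.5, 0.5],
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_level = average_by_gene(values_by_probe, probe_to_gene)
print(gene_level)
```

After this step each gene contributes a single variable per dataset, which is what the variable counts below refer to.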

##       colon kidney   gbm  lung
## mrna  17814  17665 12042 12042
## mirna   312    329   534   352
## cpg   23088  24960  1305 23074

Multi-omic biomarker panels

Multi-omics biomarker panels were developed using component-based integrative approaches that also perform variable selection. Supervised methods included concatenation and ensemble schemes using the sPLSDA classifier [14], and DIABLO with either the null or the full design (DIABLO_null and DIABLO_full). Unsupervised approaches included sparse generalized canonical correlation analysis (sGCCA) [15], Multi-Omics Factor Analysis (MOFA), and Joint and Individual Variation Explained (JIVE) [23] (see Supplementary Note for parameter settings). Both supervised and unsupervised approaches were considered in order to compare and contrast the types of omics variables selected, the network properties and the biological enrichment results. A distinction was made between DIABLO models in which the correlation between omics datasets was not maximized (DIABLO_null) and those in which it was maximized (DIABLO_full).
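The difference between the two DIABLO designs can be pictured as a block-by-block matrix. The sketch below assumes that the design is encoded as a square matrix over the omics blocks, with 0 off the diagonal for the null design and 1 for the full design. The helper `make_design` is hypothetical, and the actual parameter settings are those given in the Supplementary Note.

```python
def make_design(blocks, link):
    """Square design matrix: `link` between different blocks, 0 on the diagonal."""
    n = len(blocks)
    return [[0.0 if i == j else float(link) for j in range(n)] for i in range(n)]

blocks = ["mrna", "mirna", "cpg"]

# Assumption: 0 means the correlation between two blocks is not maximized,
# 1 means it is maximized.
design_null = make_design(blocks, 0)
design_full = make_design(blocks, 1)

for name, design in [("DIABLO_null", design_null), ("DIABLO_full", design_full)]:
    print(name)
    for block, row in zip(blocks, design):
        print(f"  {block:6s} {row}")
```

In both cases the model remains supervised; only the weight placed on between-omics correlation changes.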

Unsupervised

JIVE

MOFA

sGCCA

Supervised

Concatenation_sPLSDA

Ensemble_sPLSDA

DIABLO_null

DIABLO_full

Number of features per panel

Each multi-omics biomarker panel included 180 features (60 features of each omics type across 2 components). Approaches generally identified distinct sets of features. The plots below depict the distinct and shared features between the seven multi-omics panels obtained from the unsupervised (purple, sGCCA, MOFA and JIVE) and supervised (green, Concatenation, Ensemble, DIABLO_null and DIABLO_full) methods. Supervised methods selected many of the same features (blue), but DIABLO_full had greater feature overlap with unsupervised methods (orange).
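The shared and distinct features shown in intersection plots and Venn diagrams can be tabulated from the panels themselves. A minimal sketch with small hypothetical panels follows (real panels contain 180 features each). The function names are illustrative only.

```python
from itertools import combinations

def pairwise_overlap(panels):
    """Number of features shared by each pair of panels."""
    return {(a, b): len(panels[a] & panels[b])
            for a, b in combinations(sorted(panels), 2)}

def exclusive_intersections(panels):
    """Count features by the exact set of methods that selected them."""
    membership = {}
    for method, features in panels.items():
        for feature in features:
            membership.setdefault(feature, set()).add(method)
    counts = {}
    for methods in membership.values():
        key = tuple(sorted(methods))
        counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical panels of selected features.
panels = {
    "DIABLO_full": {"g1", "g2", "g3", "g4"},
    "DIABLO_null": {"g1", "g5", "g6"},
    "sGCCA": {"g2", "g3", "g7"},
}
overlaps = pairwise_overlap(panels)
exclusive = exclusive_intersections(panels)
print(overlaps)
print(exclusive)
```

The exclusive counts correspond to the bars of an intersection plot, while the pairwise counts correspond to the overlapping regions of a two-set Venn diagram.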

Component plots

colon component plot

Silhouette coefficients per dataset per phenotype

Overlap in panels

Colon

Intersection plot

Venn diagram

Kidney

Intersection plot

Venn diagram

GBM

Intersection plot

Venn diagram

Lung

Intersection plot

Venn diagram

Gene set enrichment analysis

Finally, we carried out gene set enrichment analysis on each multi-omics biomarker panel (using gene symbols of mRNAs and CpGs) against 10 gene set collections (see Methods) and tabulated the number of significant gene sets (FDR = 5%). The DIABLO_full model identified the greatest number of significant gene sets across the 10 gene set collections and generally ranked higher than the other methods in the colon (7 collections), gbm (5 collections) and lung (5 collections) cancer datasets, whereas JIVE outperformed all other methods in the kidney cancer datasets (6 collections). Of all the approaches considered, DIABLO_full, which aims to explain both the correlation structure between multiple omics layers and a phenotype of interest, implicated the greatest number of known biological gene sets (pathways, functions, processes, etc.).

We wished to assess the enrichment of the selected features across a variety of annotated gene sets in the MSigDB collection (http://software.broadinstitute.org/gsea/msigdb), in particular:

C1 - positional gene sets for each human chromosome and cytogenetic band.
C2 - curated gene sets (Pathway Interaction DB [PID], Biocarta [BIOCARTA], Kyoto Encyclopedia of Genes and Genomes [KEGG], Reactome [REACTOME], and others).
C3 - motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
C4 - computational gene sets (from the Cancer Gene Neighbourhoods [CGN] and Cancer Modules [CM]; citation available via: http://www.broadinstitute.org/gsea/msigdb/collections.jsp).
C5 - GO gene sets consisting of genes annotated by the same GO terms.
C6 - oncogenic gene sets (signatures of cellular pathways that are often dysregulated in cancer).
C7 - immunologic gene sets defined directly from microarray gene expression data from immunologic studies.
H - hallmark gene sets: coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.

Two additional collections were also included:

A. BTM - Blood Transcriptional Modules (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2727981/)
B. TISSUES - cell-specific expression from Benita et al. Blood 2008 (http://www.bloodjournal.org/content/115/26/5376)

Significance of enrichment was determined using a hypergeometric test of the overlap between the selected features (mapped to official HUGO gene symbols or official miRNA symbols) and the various gene sets contained in the collections. The resulting p-values were corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure applied across all gene sets (more than 10,000 tests, which is as pessimistic as possible). Adjusted p-values are reported in the fdr column.
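The two steps above (hypergeometric test, then Benjamini-Hochberg adjustment) can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the test is taken to be one-sided (probability of an overlap at least as large as observed), and the gene universe size and overlap counts in the example are hypothetical.

```python
from math import comb

def hypergeom_pvalue(overlap, panel_size, set_size, universe_size):
    """P(X >= overlap) when panel_size genes are drawn from a universe
    that contains set_size members of the gene set."""
    upper = min(panel_size, set_size)
    total = comb(universe_size, panel_size)
    tail = sum(comb(set_size, k) * comb(universe_size - set_size, panel_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the original p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: a 120-gene panel tested against three gene sets
# in a universe of 15000 genes; each tuple is (overlap, gene set size).
tests = [(8, 200), (2, 150), (0, 50)]
pvals = [hypergeom_pvalue(k, 120, size, 15000) for k, size in tests]
fdr = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in fdr]
print(pvals, fdr, significant)
```

Pooling the p-values of every collection before adjustment, as described above, makes each individual gene set harder to call significant than adjusting within a single collection would.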

All gene sets

only using mRNA and CpGs

disease	method	type	BTM	C1	C2	C3	C4	C5	C6	C7	H	TISSUES
colon	Concatenation	supervised	0	0	12	11	1	7	0	61	0	0
colon	DIABLO_full	supervised	23	0	113	0	46	216	0	218	7	16
colon	DIABLO_null	supervised	0	0	21	6	1	0	0	62	2	0
colon	Ensemble	supervised	0	0	3	2	2	0	0	10	0	0
colon	JIVE	unsupervised	0	0	15	8	0	19	0	1	0	2
colon	MOFA	unsupervised	4	0	14	5	1	36	0	87	0	12
colon	sGCCA	unsupervised	0	0	5	14	0	147	0	11	0	0
gbm	Concatenation	supervised	10	0	258	14	47	526	30	432	19	10
gbm	DIABLO_full	supervised	30	0	426	34	125	693	21	869	19	44
gbm	DIABLO_null	supervised	10	0	312	15	62	776	24	147	20	14
gbm	Ensemble	supervised	9	0	358	15	50	669	24	173	23	12
gbm	JIVE	unsupervised	0	0	275	94	49	825	22	460	12	18
gbm	MOFA	unsupervised	0	0	337	64	43	708	25	82	8	29
gbm	sGCCA	unsupervised	19	0	193	37	68	706	18	526	8	21
kidney	Concatenation	supervised	0	0	10	4	7	55	0	93	1	0
kidney	DIABLO_full	supervised	0	1	4	1	0	0	0	18	0	0
kidney	DIABLO_null	supervised	0	0	15	23	3	46	0	10	1	0
kidney	Ensemble	supervised	0	0	5	35	1	27	0	13	0	0
kidney	JIVE	unsupervised	1	0	42	8	17	157	0	0	6	2
kidney	MOFA	unsupervised	0	0	33	80	6	110	0	74	3	0
kidney	sGCCA	unsupervised	0	1	7	1	0	1	0	15	0	0
lung	Concatenation	supervised	0	1	0	50	0	0	3	0	0	0
lung	DIABLO_full	supervised	0	1	33	19	13	193	7	100	0	20
lung	DIABLO_null	supervised	2	0	1	21	18	22	5	72	0	9
lung	Ensemble	supervised	0	0	0	26	0	25	2	7	1	0
lung	JIVE	unsupervised	0	0	4	48	17	35	1	18	0	0
lung	MOFA	unsupervised	0	0	17	20	0	127	0	13	2	0
lung	sGCCA	unsupervised	0	0	2	57	47	42	1	78	0	0

All gene sets combined

only using mRNA and CpGs

method	colon	gbm	kidney	lung
Concatenation	92	1346	170	54
DIABLO_full	639	2261	24	386
DIABLO_null	92	1380	98	150
Ensemble	17	1333	81	61
JIVE	45	1755	233	123
MOFA	159	1296	306	179
sGCCA	177	1596	25	227

Which method leads to the greatest number of significant pathways?
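One way to read the combined table is to find, for each disease, the method with the largest count, and to sum the counts over diseases. The sketch below copies the counts from the combined table. The tallying itself is only an illustration.

```python
# Number of significant gene sets per method and disease (combined table).
combined = {
    "Concatenation": {"colon": 92, "gbm": 1346, "kidney": 170, "lung": 54},
    "DIABLO_full": {"colon": 639, "gbm": 2261, "kidney": 24, "lung": 386},
    "DIABLO_null": {"colon": 92, "gbm": 1380, "kidney": 98, "lung": 150},
    "Ensemble": {"colon": 17, "gbm": 1333, "kidney": 81, "lung": 61},
    "JIVE": {"colon": 45, "gbm": 1755, "kidney": 233, "lung": 123},
    "MOFA": {"colon": 159, "gbm": 1296, "kidney": 306, "lung": 179},
    "sGCCA": {"colon": 177, "gbm": 1596, "kidney": 25, "lung": 227},
}
diseases = ["colon", "gbm", "kidney", "lung"]

# Method with the most significant gene sets in each disease.
top_by_disease = {d: max(combined, key=lambda m: combined[m][d]) for d in diseases}

# Total number of significant gene sets per method across the four diseases.
totals = {method: sum(counts.values()) for method, counts in combined.items()}
best_overall = max(totals, key=totals.get)

print(top_by_disease)
print(totals)
print(best_overall)
```

On these combined counts, DIABLO_full leads in colon, gbm and lung and has the largest total, while MOFA has the largest combined count in kidney. This differs from the per-collection ranking, in which JIVE ranks first in the most kidney collections.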

Connectivity

The level of connectivity of each of the seven multi-omics panels was assessed by generating networks from the feature adjacency matrix at various Pearson correlation coefficient cut-offs. At all cut-offs, unsupervised approaches produced networks with greater connectivity (number of edges) than supervised approaches. In addition, the biomarker panels identified by DIABLO_full were more similar to those identified by unsupervised approaches, with high graph density, a low number of communities and a large number of triads, indicating that DIABLO_full identified discriminative sets of features that were tightly correlated across biological compartments.
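The construction of a network from a correlation cut-off can be sketched as follows. Several choices here are assumptions, since the exact definitions are not given in this section: edges are drawn when the absolute Pearson correlation reaches the cut-off, and triads are counted as closed triangles. Community detection is omitted. The feature values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def network_stats(features, cutoff):
    """Edges, density and triangle count of the thresholded correlation network."""
    names = sorted(features)
    # Assumption: an edge links two features when |r| >= cutoff.
    edges = {frozenset(pair) for pair in combinations(names, 2)
             if abs(pearson(features[pair[0]], features[pair[1]])) >= cutoff}
    n = len(names)
    density = len(edges) / (n * (n - 1) / 2)
    # Assumption: a triad is three features that are all pairwise connected.
    triads = sum(1 for a, b, c in combinations(names, 3)
                 if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edges)
    return {"edges": len(edges), "density": density, "triads": triads}

# Hypothetical panel of four features measured in four samples.
features = {
    "mrna_1": [1.0, 2.0, 3.0, 4.0],
    "mirna_1": [2.0, 4.0, 6.0, 8.0],
    "cpg_1": [4.0, 3.0, 2.0, 1.0],
    "cpg_2": [1.0, -1.0, -1.0, 1.0],
}
stats_by_cutoff = {c: network_stats(features, c) for c in (0.25, 0.5, 0.75)}
print(stats_by_cutoff)
```

Repeating the calculation over a range of cut-offs gives one connectivity curve per panel, which is how the panels are compared here.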

Number of connections

Colon only

Network attributes

all network plots

network and component plot of the multi-omic panels derived using the colon cancer dataset

The plots below depict the networks of all multi-omics biomarker panels for the colon cancer dataset. The networks show higher modularity (a limited number of large clusters of variables; circled) for DIABLO_full and the unsupervised approaches than for the other supervised approaches. The corresponding component plots show a clear separation between the high and low survival groups for the panels derived using supervised approaches, whereas the unsupervised approaches could not segregate the survival groups.

networks

References

Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods [Internet]. 2014 [cited 2016 Jan 19];11:333–7. Available from: http://www.nature.com/doifinder/10.1038/nmeth.2810
Lê Cao K-A, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics [Internet]. 2011 [cited 2015 Jul 15];12:253. Available from: http://www.biomedcentral.com/1471-2105/12/253/
Tenenhaus A, Philippe C, Guillemot V, Le Cao K-A, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics [Internet]. 2014 [cited 2015 Jul 15];15:569–83. Available from: http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxu001
The TCGA Research Network. The Cancer Genome Atlas [Internet]. Available from: http://cancergenome.nih.gov/
Singh A, Yamamoto M, Kam SHY, Ruan J, Gauvreau GM, O’Byrne PM, et al. Gene-metabolite expression in blood can discriminate allergen-induced isolated early from dual asthmatic responses. Hsu Y-H, editor. PLoS ONE [Internet]. 2013 [cited 2015 Jul 18];8:e67907. Available from: http://dx.plos.org/10.1371/journal.pone.0067907
Singh A, Yamamoto M, Ruan J, Choi JY, Gauvreau GM, Olek S, et al. Th17/Treg ratio derived using DNA methylation analysis is associated with the late phase asthmatic response. Allergy Asthma Clin Immunol [Internet]. 2014 [cited 2016 Mar 2];10:32. Available from: http://www.biomedcentral.com/content/pdf/1710-1492-10-32.pdf
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst [Internet]. 2015 [cited 2018 Jan 30];1:417–25. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2405471215002185