Multiclass Disease Classification from Microbial Whole-Community Metagenomes

Saad Khan, Libusha Kelly

Research output: Contribution to journalArticle

Abstract

The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75% average test-set accuracy), receiver-operator-characteristics (92.1% average area-under-ROC (AUC)), and precision-recall (50% average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes less Type-II errors (false-negatives). Lastly, we are able to achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.

Original languageEnglish (US)
Pages (from-to)55-66
Number of pages12
JournalPacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Volume25
StatePublished - Jan 1 2020

Fingerprint

Metagenome
Microbiota
Neural networks
Microorganisms
Learning systems
Screening
Classifiers
Forests

ASJC Scopus subject areas

  • Biomedical Engineering
  • Computational Theory and Mathematics

Cite this

@article{20910197735647ee956cbb2b2be07460,
title = "Multiclass Disease Classification from Microbial Whole-Community Metagenomes",
abstract = "The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75{\%} average test-set accuracy), receiver-operator-characteristics (92.1{\%} average area-under-ROC (AUC)), and precision-recall (50{\%} average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes less Type-II errors (false-negatives). Lastly, we are able to achieve over 90{\%} average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.",
author = "Saad Khan and Libusha Kelly",
year = "2020",
month = "1",
day = "1",
language = "English (US)",
volume = "25",
pages = "55--66",
journal = "Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing",
issn = "2335-6936",

}

TY - JOUR

T1 - Multiclass Disease Classification from Microbial Whole-Community Metagenomes

AU - Khan, Saad

AU - Kelly, Libusha

PY - 2020/1/1

Y1 - 2020/1/1

N2 - The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75% average test-set accuracy), receiver-operator-characteristics (92.1% average area-under-ROC (AUC)), and precision-recall (50% average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes less Type-II errors (false-negatives). Lastly, we are able to achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.

AB - The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75% average test-set accuracy), receiver-operator-characteristics (92.1% average area-under-ROC (AUC)), and precision-recall (50% average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes less Type-II errors (false-negatives). Lastly, we are able to achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.

UR - http://www.scopus.com/inward/record.url?scp=85076098634&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85076098634&partnerID=8YFLogxK

M3 - Article

C2 - 31797586

AN - SCOPUS:85076098634

VL - 25

SP - 55

EP - 66

JO - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

JF - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

SN - 2335-6936

ER -