TY - JOUR
T1 - Multiclass disease classification from microbial whole-community metagenomes
AU - Khan, Saad
AU - Kelly, Libusha
N1 - Funding Information:
Saad Khan was supported by the Einstein Medical Scientist Training Program (2T32GM007288-45) and an NIH T32 fellowship on Geographic Medicine and Emerging Infectious Diseases (2T32AI070117-13). Libusha Kelly is supported in part by a Peer Reviewed Cancer Research Program Career Development Award from the United States Department of Defense (CA171019).
Publisher Copyright:
© 2019 The Authors.
PY - 2020
Y1 - 2020
N2 - The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75% average test-set accuracy), receiver-operator-characteristics (92.1% average area-under-ROC (AUC)), and precision-recall (50% average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes less Type-II errors (false-negatives). Lastly, we are able to achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.
AB - The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and healthy. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75% average test-set accuracy), receiver-operator-characteristics (92.1% average area-under-ROC (AUC)), and precision-recall (50% average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes less Type-II errors (false-negatives). Lastly, we are able to achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.
KW - Machine learning
KW - Metagenomics
KW - Microbiome
UR - http://www.scopus.com/inward/record.url?scp=85076098634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076098634&partnerID=8YFLogxK
M3 - Conference article
C2 - 31797586
AN - SCOPUS:85076098634
VL - 25
SP - 55
EP - 66
JO - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
JF - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
SN - 2335-6936
IS - 2020
T2 - 25th Pacific Symposium on Biocomputing, PSB 2020
Y2 - 3 January 2020 through 7 January 2020
ER -