TY - JOUR
T1 - The shape of gene expression distributions matter
T2 - how incorporating distribution shape improves the interpretation of cancer transcriptomic data
AU - de Torrenté, Laurence
AU - Zimmerman, Samuel
AU - Suzuki, Masako
AU - Christopeit, Maximilian
AU - Greally, John M.
AU - Mar, Jessica C.
N1 - Funding Information:
This research was supported by the Albert Einstein College of Medicine start-up funds (LDT, JCM). JCM is supported by an Australian Research Council Future Fellowship (FT170100047) and by a Metcalf Prize from the National Stem Cell Foundation of Australia. The funding bodies listed did not play a role in the design of the study, the collection, analysis, and interpretation of the data, nor in writing the manuscript. Publication costs are funded by start-up funds provided by the University of Queensland to JCM.
Publisher Copyright:
© 2020, The Author(s).
PY - 2020/12
Y1 - 2020/12
N2 - Background: In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption, or consider how broadly applicable it may be to all genes that are in the transcriptome. Our study investigated the prevalence of a range of gene expression distributions in three different tumor types from The Cancer Genome Atlas (TCGA). Results: Surprisingly, the expression of fewer than 50% of all genes was Normally distributed, with other distributions including Gamma, Bimodal, Cauchy, and Lognormal also represented. Most of the distribution categories contained genes that were significantly enriched for unique biological processes. Different assumptions based on the shape of the expression profile were used to identify genes that could discriminate between patients with good versus poor survival. The prognostic marker genes that were identified when the shape of the distribution was accounted for reflected functional insights into cancer biology that were not observed when standard assumptions were applied. We showed that when multiple types of distributions were permitted, i.e., the shape of the expression profile was used, the statistical classifiers had greater predictive accuracy for determining the prognosis of a patient versus those that assumed only one type of gene expression distribution. Conclusions: Our results highlight the value of studying a gene’s distribution shape to model heterogeneity of transcriptomic data and the impact of using analyses that permit more than one type of gene expression distribution. These insights would have been overlooked when using standard approaches that assume all genes follow the same type of distribution in a patient cohort.
AB - Background: In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption, or consider how broadly applicable it may be to all genes that are in the transcriptome. Our study investigated the prevalence of a range of gene expression distributions in three different tumor types from The Cancer Genome Atlas (TCGA). Results: Surprisingly, the expression of fewer than 50% of all genes was Normally distributed, with other distributions including Gamma, Bimodal, Cauchy, and Lognormal also represented. Most of the distribution categories contained genes that were significantly enriched for unique biological processes. Different assumptions based on the shape of the expression profile were used to identify genes that could discriminate between patients with good versus poor survival. The prognostic marker genes that were identified when the shape of the distribution was accounted for reflected functional insights into cancer biology that were not observed when standard assumptions were applied. We showed that when multiple types of distributions were permitted, i.e., the shape of the expression profile was used, the statistical classifiers had greater predictive accuracy for determining the prognosis of a patient versus those that assumed only one type of gene expression distribution. Conclusions: Our results highlight the value of studying a gene’s distribution shape to model heterogeneity of transcriptomic data and the impact of using analyses that permit more than one type of gene expression distribution. These insights would have been overlooked when using standard approaches that assume all genes follow the same type of distribution in a patient cohort.
KW - Cancer genomics
KW - Gene expression
KW - Multi-modality
KW - Non-normal distribution
KW - Survival analysis
UR - http://www.scopus.com/inward/record.url?scp=85098195178&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098195178&partnerID=8YFLogxK
U2 - 10.1186/s12859-020-03892-w
DO - 10.1186/s12859-020-03892-w
M3 - Article
C2 - 33371881
AN - SCOPUS:85098195178
SN - 1471-2105
VL - 21
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 562
ER -