TY - JOUR
T1 - Robust integration of multiple single-cell RNA sequencing datasets using a single reference space
AU - Liu, Yang
AU - Wang, Tao
AU - Zhou, Bin
AU - Zheng, Deyou
N1 - Funding Information:
We thank all the research groups that generated and shared the scRNA-seq data used in this study. We thank the members of the Zheng lab for valuable discussions, software testing and comments on the manuscript. We also acknowledge funding support from the National Institutes of Health (grants HL133120 to D.Z. and B.Z., HL153920 to D.Z., HD092944 to D.Z. and B.Z., and HD070454 to D.Z.).
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Nature America, Inc.
PY - 2021/7
Y1 - 2021/7
N2 - In many biological applications of single-cell RNA sequencing (scRNA-seq), an integrated analysis of data from multiple batches or studies is necessary. Current methods typically achieve integration using shared cell types or covariance correlation between datasets, which can distort biological signals. Here we introduce an algorithm that uses the gene eigenvectors from a reference dataset to establish a global frame for integration. Using simulated and real datasets, we demonstrate that this approach, called Reference Principal Component Integration (RPCI), consistently outperforms other methods by multiple metrics, with clear advantages in preserving genuine cross-sample gene expression differences in matching cell types, such as those present in cells at distinct developmental stages or in perturbated versus control studies. Moreover, RPCI maintains this robust performance when multiple datasets are integrated. Finally, we applied RPCI to scRNA-seq data for mouse gut endoderm development and revealed temporal emergence of genetic programs helping establish the anterior–posterior axis in visceral endoderm.
UR - http://www.scopus.com/inward/record.url?scp=85103158311&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103158311&partnerID=8YFLogxK
U2 - 10.1038/s41587-021-00859-x
DO - 10.1038/s41587-021-00859-x
M3 - Article
C2 - 33767393
AN - SCOPUS:85103158311
SN - 1087-0156
VL - 39
SP - 877
EP - 884
JO - Nature Biotechnology
JF - Nature Biotechnology
IS - 7
ER -