Mixture models for undiagnosed prevalent disease and interval-censored incident disease

Applications to a cohort assembled from electronic health records

Li C. Cheung, Qing Pan, Noorie Hyun, Mark Schiffman, Barbara Fetterman, Philip E. Castle, Thomas Lorey, Hormuzd A. Katki

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan-Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence-incidence models. Parameters for parametric prevalence-incidence models, such as the logistic regression and Weibull survival (logistic-Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan-Meier, logistic-Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan-Meier provided poor estimates while the logistic-Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic-Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.
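
As a rough illustration of the mixture idea in the abstract, the sketch below computes cumulative risk when prevalent disease is a point mass at time zero and incident disease follows a Weibull distribution. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the Weibull shape/scale parameterization, and all numeric values are hypothetical and are not taken from the article.

```python
import math


def logistic(x):
    """Inverse logit; maps a linear predictor to a prevalence probability."""
    return 1.0 / (1.0 + math.exp(-x))


def weibull_cdf(t, shape, scale):
    """Weibull CDF, F(t) = 1 - exp(-(t / scale) ** shape), for t >= 0."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def cumulative_risk(t, prevalence, shape, scale):
    """Cumulative risk at time t under a prevalence-incidence mixture.

    Assumed form: prevalence + (1 - prevalence) * F(t), i.e. a point mass
    at time zero for prevalent disease plus Weibull incident disease among
    those who were disease-free at time zero.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    return prevalence + (1.0 - prevalence) * weibull_cdf(t, shape, scale)


# Hypothetical values chosen only for illustration.
p = logistic(-4.0)  # prevalence from an intercept-only logistic model
risks = [cumulative_risk(t, p, shape=1.2, scale=50.0) for t in (0, 1, 3, 5)]
```

At t = 0 the risk equals the prevalence rather than zero, which is the feature a naive estimator that starts at zero risk cannot represent.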

Original language: English (US)
Journal: Statistics in Medicine
DOI: 10.1002/sim.7380
State: Accepted/In press - 2017

Fingerprint

Electronic Health Records
Mixture Model
Health
Electronics
Interval
Logistic Regression
Kaplan-Meier
Weibull Model
Survival Model
Logistic Models
Public Sector
Early Detection of Cancer
Uterine Cervical Neoplasms
Screening
Incidence
Cancer
Estimate
Preexisting Condition Coverage
Cost Efficiency
Cohort Study

Keywords

  • Cervical cancer
  • Cumulative risk estimation
  • HPV
  • Kaplan-Meier
  • Prevalence-incidence models

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability

Cite this

Mixture models for undiagnosed prevalent disease and interval-censored incident disease: Applications to a cohort assembled from electronic health records. / Cheung, Li C.; Pan, Qing; Hyun, Noorie; Schiffman, Mark; Fetterman, Barbara; Castle, Philip E.; Lorey, Thomas; Katki, Hormuzd A.

In: Statistics in Medicine, 2017.

@article{14f7c18ceada49258677b5dd4a59f294,
title = "Mixture models for undiagnosed prevalent disease and interval-censored incident disease: Applications to a cohort assembled from electronic health records",
abstract = "For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan-Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence-incidence models. Parameters for parametric prevalence-incidence models, such as the logistic regression and Weibull survival (logistic-Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan-Meier, logistic-Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan-Meier provided poor estimates while the logistic-Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic-Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.",
keywords = "Cervical cancer, Cumulative risk estimation, HPV, Kaplan-Meier, Prevalence-incidence models",
author = "Cheung, {Li C.} and Qing Pan and Noorie Hyun and Mark Schiffman and Barbara Fetterman and Castle, {Philip E.} and Thomas Lorey and Katki, {Hormuzd A.}",
year = "2017",
doi = "10.1002/sim.7380",
language = "English (US)",
journal = "Statistics in Medicine",
issn = "0277-6715",
publisher = "John Wiley and Sons Ltd",
}

TY - JOUR

T1 - Mixture models for undiagnosed prevalent disease and interval-censored incident disease

T2 - Applications to a cohort assembled from electronic health records

AU - Cheung, Li C.

AU - Pan, Qing

AU - Hyun, Noorie

AU - Schiffman, Mark

AU - Fetterman, Barbara

AU - Castle, Philip E.

AU - Lorey, Thomas

AU - Katki, Hormuzd A.

PY - 2017

Y1 - 2017

N2 - For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan-Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence-incidence models. Parameters for parametric prevalence-incidence models, such as the logistic regression and Weibull survival (logistic-Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan-Meier, logistic-Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan-Meier provided poor estimates while the logistic-Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic-Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.

AB - For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan-Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence-incidence models. Parameters for parametric prevalence-incidence models, such as the logistic regression and Weibull survival (logistic-Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan-Meier, logistic-Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan-Meier provided poor estimates while the logistic-Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic-Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.

KW - Cervical cancer

KW - Cumulative risk estimation

KW - HPV

KW - Kaplan-Meier

KW - Prevalence-incidence models

UR - http://www.scopus.com/inward/record.url?scp=85021437043&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85021437043&partnerID=8YFLogxK

U2 - 10.1002/sim.7380

DO - 10.1002/sim.7380

M3 - Article

JO - Statistics in Medicine

JF - Statistics in Medicine

SN - 0277-6715

ER -