Challenges in risk estimation using routinely collected clinical data

The example of estimating cervical cancer risks from electronic health-records

Rebecca Landy, Li C. Cheung, Mark Schiffman, Julia C. Gage, Noorie Hyun, Nicolas Wentzensen, Walter K. Kinney, Philip E. Castle, Barbara Fetterman, Nancy E. Poitras, Thomas Lorey, Peter D. Sasieni, Hormuzd A. Katki

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Electronic health-records (EHR) are increasingly used by epidemiologists studying disease following surveillance testing to provide evidence for screening intervals and referral guidelines. Although EHR data are cost-effective, undiagnosed prevalent disease and interval censoring (in which asymptomatic disease is only observed at the time of testing) raise substantial analytic issues when estimating risk, and these issues cannot be addressed using Kaplan-Meier methods. Based on our experience analysing EHR from cervical cancer screening, we previously proposed the logistic-Weibull model to address these issues. Here we demonstrate how the choice of statistical method can impact risk estimates. We use observed data on 41,067 women in the cervical cancer screening program at Kaiser Permanente Northern California, 2003-2013, as well as simulations, to evaluate the ability of different methods (Kaplan-Meier, Turnbull, Weibull and logistic-Weibull) to accurately estimate risk within a screening program. Cumulative risk estimates from the statistical methods varied considerably, with the largest differences occurring for prevalent disease risk when baseline disease ascertainment was random but incomplete. Kaplan-Meier underestimated risk at earlier times and overestimated risk at later times in the presence of interval censoring or undiagnosed prevalent disease. Turnbull performed well, though it was inefficient and not smooth. The logistic-Weibull model performed well, except when event times did not follow a Weibull distribution. We have demonstrated that methods for right-censored data, such as Kaplan-Meier, result in biased estimates of disease risks when applied to interval-censored data, such as data from screening programs using EHR. The logistic-Weibull model is attractive, but the model fit must be checked against Turnbull non-parametric risk estimates.
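
The early-time underestimation described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design: it assumes a simple mixture in which a fixed fraction of women have prevalent disease at baseline (fully ascertained here, which is a simplification) and the rest develop incident disease at Weibull-distributed times, with disease only observable at regularly spaced screening visits. All parameter values and function names are hypothetical and chosen purely for illustration. The naive estimate treats each detection time as if it were the exact event time; with no right censoring, the Kaplan-Meier estimate reduces to this empirical proportion.

```python
import math
import random

# All parameter values below are hypothetical, chosen only for illustration.
P_PREVALENT = 0.02     # assumed probability of disease already present at baseline
WEIBULL_SCALE = 30.0   # assumed Weibull scale (years) for incident disease
WEIBULL_SHAPE = 1.5    # assumed Weibull shape
VISIT_INTERVAL = 3.0   # assumed years between screening visits


def simulate_onset(rng):
    """True onset time: 0 for prevalent disease, otherwise a Weibull draw."""
    if rng.random() < P_PREVALENT:
        return 0.0
    return rng.weibullvariate(WEIBULL_SCALE, WEIBULL_SHAPE)


def detection_time(onset):
    """Asymptomatic disease is only seen at the first visit at or after onset."""
    return math.ceil(onset / VISIT_INTERVAL) * VISIT_INTERVAL


def true_cumulative_risk(t):
    """Mixture risk: prevalent fraction plus Weibull incident risk by time t."""
    incident = 1.0 - math.exp(-((t / WEIBULL_SCALE) ** WEIBULL_SHAPE))
    return P_PREVALENT + (1.0 - P_PREVALENT) * incident


def naive_cumulative_risk(detect_times, t):
    """Treat detection times as exact event times (empirical proportion by t).

    With no right censoring, the Kaplan-Meier estimate equals this proportion.
    """
    return sum(d <= t for d in detect_times) / len(detect_times)


def run_simulation(n=20000, seed=1):
    """Simulate n women and return the time at which each one's disease is detected."""
    rng = random.Random(seed)
    return [detection_time(simulate_onset(rng)) for _ in range(n)]


if __name__ == "__main__":
    detected = run_simulation()
    print("time  naive  true")
    for t in (1.0, 2.0, 3.0, 5.0, 6.0):
        print(t, round(naive_cumulative_risk(detected, t), 4),
              round(true_cumulative_risk(t), 4))
```

Under these assumptions the naive estimate sits below the true cumulative risk at times between visits, because disease that has already begun is not counted until the next test, and the two roughly agree at the visit times themselves. The sketch does not attempt to reproduce the later-time overestimation or the incomplete baseline ascertainment scenarios reported in the abstract.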

Original language: English (US)
Journal: Preventive Medicine
DOI: https://doi.org/10.1016/j.ypmed.2017.12.004
State: Accepted/In press - Jan 1 2017

Fingerprint

Electronic Health Records
Uterine Cervical Neoplasms
Logistic Models
Early Detection of Cancer
Asymptomatic Diseases
Referral and Consultation
Guidelines
Costs and Cost Analysis

Keywords

  • Cervix
  • Electronic health-records
  • Epidemiology
  • Risk estimation
  • Screening
  • Statistical methods

ASJC Scopus subject areas

  • Epidemiology
  • Public Health, Environmental and Occupational Health

Cite this

Landy, R, Cheung, LC, Schiffman, M, Gage, JC, Hyun, N, Wentzensen, N, Kinney, WK, Castle, PE, Fetterman, B, Poitras, NE, Lorey, T, Sasieni, PD & Katki, HA 2017, 'Challenges in risk estimation using routinely collected clinical data: The example of estimating cervical cancer risks from electronic health-records', Preventive Medicine. https://doi.org/10.1016/j.ypmed.2017.12.004
@article{e2d45f84afc046219b45f3cae6a33d6e,
title = "Challenges in risk estimation using routinely collected clinical data: The example of estimating cervical cancer risks from electronic health-records",
keywords = "Cervix, Electronic health-records, Epidemiology, Risk estimation, Screening, Statistical methods",
author = "Rebecca Landy and Cheung, {Li C.} and Mark Schiffman and Gage, {Julia C.} and Noorie Hyun and Nicolas Wentzensen and Kinney, {Walter K.} and Castle, {Philip E.} and Barbara Fetterman and Poitras, {Nancy E.} and Thomas Lorey and Sasieni, {Peter D.} and Katki, {Hormuzd A.}",
year = "2017",
month = "1",
day = "1",
doi = "10.1016/j.ypmed.2017.12.004",
language = "English (US)",
journal = "Preventive Medicine",
issn = "0091-7435",
publisher = "Academic Press Inc.",

}

TY - JOUR
T1 - Challenges in risk estimation using routinely collected clinical data
T2 - The example of estimating cervical cancer risks from electronic health-records
AU - Landy, Rebecca
AU - Cheung, Li C.
AU - Schiffman, Mark
AU - Gage, Julia C.
AU - Hyun, Noorie
AU - Wentzensen, Nicolas
AU - Kinney, Walter K.
AU - Castle, Philip E.
AU - Fetterman, Barbara
AU - Poitras, Nancy E.
AU - Lorey, Thomas
AU - Sasieni, Peter D.
AU - Katki, Hormuzd A.
PY - 2017/1/1
Y1 - 2017/1/1
KW - Cervix
KW - Electronic health-records
KW - Epidemiology
KW - Risk estimation
KW - Screening
KW - Statistical methods
UR - http://www.scopus.com/inward/record.url?scp=85037606766&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85037606766&partnerID=8YFLogxK
U2 - 10.1016/j.ypmed.2017.12.004
DO - 10.1016/j.ypmed.2017.12.004
M3 - Article
JO - Preventive Medicine
JF - Preventive Medicine
SN - 0091-7435
ER -