Modeling ChIP sequencing in silico with applications

Zhengdong Zhang, Joel Rozowsky, Michael Snyder, Joseph Chang, Mark Gerstein

Research output: Contribution to journal (Article)

47 Citations (Scopus)

Abstract

ChIP sequencing (ChIP-seq) is a new method for genomewide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.
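The contrast the abstract draws between a uniform and a markedly nonuniform background can be illustrated with a small simulation. The following is a minimal sketch only, not the authors' in silico ChIP-seq implementation: the bin count, tag count, gamma shape parameter, and the names `simulate_tag_counts` and `dispersion` are all assumptions chosen for illustration. It places tags into equal-sized genomic bins, either with equal weight per bin or with per-bin weights drawn from a gamma distribution, and compares the variance-to-mean ratio of the resulting tag counts.

```python
import random
from collections import Counter


def simulate_tag_counts(n_bins, n_tags, shape=None, seed=0):
    """Place n_tags tags into n_bins bins and return the per-bin tag counts.

    shape=None gives every bin the same weight (uniform background).
    Otherwise each bin's weight is drawn from a gamma distribution with
    the given shape (hypothetical parameter, chosen for illustration).
    """
    rng = random.Random(seed)
    if shape is None:
        weights = [1.0] * n_bins
    else:
        weights = [rng.gammavariate(shape, 1.0) for _ in range(n_bins)]
    # Each tag lands in a bin with probability proportional to the bin weight.
    placed = Counter(rng.choices(range(n_bins), weights=weights, k=n_tags))
    return [placed.get(b, 0) for b in range(n_bins)]


def dispersion(counts):
    """Variance-to-mean ratio of the counts (close to 1 for uniform placement)."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean


uniform = simulate_tag_counts(n_bins=2000, n_tags=10000)
gamma = simulate_tag_counts(n_bins=2000, n_tags=10000, shape=0.5)

print("uniform background dispersion:", round(dispersion(uniform), 2))
print("gamma background dispersion:  ", round(dispersion(gamma), 2))
```

Under these assumptions the gamma-weighted background produces tag counts that are far more variable from bin to bin than the uniform one, which is the kind of overdispersion a uniform background model cannot reproduce.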

Original language: English (US)
Article number: e1000158
Journal: PLoS Computational Biology
Volume: 4
Issue number: 8
DOI: 10.1371/journal.pcbi.1000158
State: Published - Aug 2008
Externally published: Yes

Fingerprint

Binding sites
Computer Simulation
Sequencing
Genomics
Chip
Modeling
Count
Power-law distribution
Functional Genomics
Transcription factors
Protein binding
Gamma distribution
Computational methods
Scoring

ASJC Scopus subject areas

  • Cellular and Molecular Neuroscience
  • Ecology
  • Molecular Biology
  • Genetics
  • Ecology, Evolution, Behavior and Systematics
  • Modeling and Simulation
  • Computational Theory and Mathematics

Cite this

Modeling ChIP sequencing in silico with applications. / Zhang, Zhengdong; Rozowsky, Joel; Snyder, Michael; Chang, Joseph; Gerstein, Mark.

In: PLoS Computational Biology, Vol. 4, No. 8, e1000158, 08.2008.

Zhang, Zhengdong ; Rozowsky, Joel ; Snyder, Michael ; Chang, Joseph ; Gerstein, Mark. / Modeling ChIP sequencing in silico with applications. In: PLoS Computational Biology. 2008 ; Vol. 4, No. 8.
@article{efa6248ef2cc4624856b94aa5a1793cf,
title = "Modeling ChIP sequencing in silico with applications",
author = "Zhengdong Zhang and Joel Rozowsky and Michael Snyder and Joseph Chang and Mark Gerstein",
year = "2008",
month = "8",
doi = "10.1371/journal.pcbi.1000158",
language = "English (US)",
volume = "4",
journal = "PLoS Computational Biology",
issn = "1553-734X",
publisher = "Public Library of Science",
number = "8",

}

TY - JOUR

T1 - Modeling ChIP sequencing in silico with applications

AU - Zhang, Zhengdong

AU - Rozowsky, Joel

AU - Snyder, Michael

AU - Chang, Joseph

AU - Gerstein, Mark

PY - 2008/8

Y1 - 2008/8

UR - http://www.scopus.com/inward/record.url?scp=50949097455&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=50949097455&partnerID=8YFLogxK

U2 - 10.1371/journal.pcbi.1000158

DO - 10.1371/journal.pcbi.1000158

M3 - Article

VL - 4

JO - PLoS Computational Biology

JF - PLoS Computational Biology

SN - 1553-734X

IS - 8

M1 - e1000158

ER -