A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: Systematically incorporating validated biological knowledge

Jiang Du; Joel S. Rozowsky; Jan O. Korbel; Zhengdong D. Zhang; Thomas E. Royce; Martin H. Schultz; Michael Snyder; Mark Gerstein

doi:10.1093/bioinformatics/btl515

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

Motivation: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into 'active regions' (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion, based mainly on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for this task. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing.

Methodology: We use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of the state duration density. We introduce a formal definition of the tiling-array analysis problem and explain how it can be used to describe sampling small genomic regions for experimental validation, building up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively).

Results: For the practical sampling and training strategies, we show how the size of and noise in the validated training data affect the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches. For the idealized sampling strategies, we show how we can assess their performance in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the best-performing gold standard. This latter result has strong implications for the optimal way to carry out medium-scale validation experiments that verify the results of genome-scale tiling array experiments.
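The supervised segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain two-state HMM (0 = background, 1 = active region) with Gaussian emissions, estimates the parameters by counting over a labeled "gold-standard" probe sequence, and decodes new probe intensities with the Viterbi algorithm. The function names, the pseudocount of 1, the Gaussian emission model, and the toy data are all assumptions made for illustration; the generalized HMM with explicit state durations is not covered.

```python
import numpy as np

def train_supervised_hmm(signal, labels, n_states=2):
    """Estimate HMM parameters by counting over a labeled probe sequence.

    Assumes every state occurs at least once in `labels`.
    """
    signal = np.asarray(signal, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Transition counts, with a pseudocount of 1 (an assumption) so that
    # no transition ends up with zero probability.
    trans = np.ones((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):
        trans[a, b] += 1
    trans /= trans.sum(axis=1, keepdims=True)
    # Initial state distribution from smoothed label frequencies.
    start = np.bincount(labels, minlength=n_states) + 1.0
    start /= start.sum()
    # Gaussian emission parameters per state (assumed emission model).
    means = np.array([signal[labels == s].mean() for s in range(n_states)])
    stds = np.array([signal[labels == s].std() + 1e-6 for s in range(n_states)])
    return start, trans, means, stds

def log_gauss(x, mean, std):
    """Log density of a normal distribution, evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def viterbi(signal, start, trans, means, stds):
    """Return the most probable state path for a sequence of probe intensities."""
    signal = np.asarray(signal, dtype=float)
    n, k = len(signal), len(start)
    emit = log_gauss(signal[:, None], means[None, :], stds[None, :])  # shape (n, k)
    log_trans = np.log(trans)
    score = np.log(start) + emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # cand[i, j]: best log score of a path ending in state i, then moving to j.
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    # Trace the best path backwards from the final probe.
    path = np.empty(n, dtype=int)
    path[-1] = score.argmax()
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy gold-standard set: one validated active region flanked by background.
labels = np.array([0] * 20 + [1] * 10 + [0] * 20)
train_signal = 6.0 * labels + 0.1 * np.sin(np.arange(labels.size))
params = train_supervised_hmm(train_signal, labels)

# Segment a new, unlabeled stretch of probe intensities.
test_signal = np.array([0.1, -0.2, 0.0, 5.9, 6.1, 6.0, 0.2, -0.1])
segmentation = viterbi(test_signal, *params)
print(segmentation)
```

Because the parameters come directly from labeled counts, the quality of the segmentation in this sketch depends on how many validated probes are available and how noisy their labels are, which is the dependence the Results paragraph examines.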

Original language	English (US)
Pages (from-to)	3016-3024
Number of pages	9
Journal	Bioinformatics
Volume	22
Issue number	24
DOIs	https://doi.org/10.1093/bioinformatics/btl515
State	Published - Dec 15 2006
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

Du, J., Rozowsky, J. S., Korbel, J. O., Zhang, Z. D., Royce, T. E., Schultz, M. H., Snyder, M., & Gerstein, M. (2006). A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: Systematically incorporating validated biological knowledge. Bioinformatics, 22(24), 3016-3024. https://doi.org/10.1093/bioinformatics/btl515

@article{31cd1654b729472e889c70b2e8593173,

title = "A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: Systematically incorporating validated biological knowledge",

author = "Jiang Du and Rozowsky, {Joel S.} and Korbel, {Jan O.} and Zhang, {Zhengdong D.} and Royce, {Thomas E.} and Schultz, {Martin H.} and Michael Snyder and Mark Gerstein",

note = "Funding Information: The authors thank the anonymous reviewers for their advices and comments. The authors acknowledge support from the NIH (1U01HG003156-01). J.O.K. is supported by a European Molecular Biology Organization Long-Term Fellowship. Funding to pay the Open Access publication charges for this article was provided by NIH (1U01HG003156-01).",

year = "2006",

month = dec,

day = "15",

doi = "10.1093/bioinformatics/btl515",

language = "English (US)",

volume = "22",

pages = "3016--3024",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "24",

}

TY - JOUR

T1 - A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments

T2 - Systematically incorporating validated biological knowledge

AU - Du, Jiang

AU - Rozowsky, Joel S.

AU - Korbel, Jan O.

AU - Zhang, Zhengdong D.

AU - Royce, Thomas E.

AU - Schultz, Martin H.

AU - Snyder, Michael

AU - Gerstein, Mark

N1 - Funding Information: The authors thank the anonymous reviewers for their advices and comments. The authors acknowledge support from the NIH (1U01HG003156-01). J.O.K. is supported by a European Molecular Biology Organization Long-Term Fellowship. Funding to pay the Open Access publication charges for this article was provided by NIH (1U01HG003156-01).

PY - 2006/12/15

Y1 - 2006/12/15

UR - http://www.scopus.com/inward/record.url?scp=33845369532&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33845369532&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btl515

DO - 10.1093/bioinformatics/btl515

M3 - Article

C2 - 17038339

AN - SCOPUS:33845369532

SN - 1367-4803

VL - 22

SP - 3016

EP - 3024

JO - Bioinformatics

JF - Bioinformatics

IS - 24

ER -
