The DART classification of unannotated transcription within the ENCODE regions: Associating transcription with known and novel loci

Joel S. Rozowsky, Daniel Newburger, Fred Sayward, Jiaqian Wu, Greg Jordan, Jan O. Korbel, Ugrappa Nagalakshmi, Jin Yang, Deyou Zheng, Roderic Guigó, Thomas R. Gingeras, Sherman Weissman, Perry Miller, Michael Snyder, Mark B. Gerstein

Research output: Contribution to journal › Article

22 Citations (Scopus)

Abstract

For the ∼1% of the human genome in the ENCODE regions, only about half of the transcriptionally active regions (TARs) identified with tiling microarrays correspond to annotated exons. Here we categorize this large amount of "unannotated transcription." We use a number of disparate features to classify the 6988 novel TARs: array expression profiles across cell lines and conditions, sequence composition, phylogenetic profiles (presence/absence of syntenic conservation across 17 species), and locations relative to genes. In the classification, we first filter out TARs with unusual sequence composition and those likely resulting from cross-hybridization. We then associate some of those remaining with proximal exons having correlated expression profiles. Finally, we cluster unclassified TARs into putative novel loci, based on similar expression and phylogenetic profiles. To encapsulate our classification, we construct a Database of Active Regions and Tools (DART.gersteinlab.org). DART has special facilities for rapidly handling and comparing many sets of TARs and their heterogeneous features, synchronizing across builds, and interfacing with other resources. Overall, we find that ∼14% of the novel TARs can be associated with known genes, while ∼21% can be clustered into ∼200 novel loci. We observe that TARs associated with genes are enriched in the potential to form structural RNAs, and that many novel TAR clusters are associated with nearby promoters. To benchmark our classification, we design a set of experiments for testing the connectivity of novel TARs. Overall, we find that 18 of the 46 connections tested validate by RT-PCR and four of five sequenced PCR products confirm connectivity unambiguously.
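The three classification steps named in the abstract (filter, associate with proximal exons, cluster the remainder) can be illustrated with a small sketch. This is not the authors' implementation: the field names, the GC-fraction filter standing in for "unusual sequence composition", the boolean cross-hybridization flag, the Pearson correlation measure, the greedy grouping, and every threshold below are assumptions chosen only to make the flow concrete.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Sequence, Tuple


@dataclass
class Exon:               # hypothetical record for an annotated exon
    gene: str
    position: int         # genomic midpoint (assumed coordinate summary)
    expression: List[float]


@dataclass
class Tar:                # hypothetical record for a novel TAR
    name: str
    position: int
    gc_fraction: float            # assumed proxy for sequence composition
    cross_hyb_suspect: bool       # assumed flag from a separate check
    expression: List[float]       # signal across cell lines/conditions
    phylo: Tuple[int, ...]        # 1/0 presence of conservation per species


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def phylo_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of species on which two presence/absence profiles agree."""
    return sum(p == q for p, q in zip(a, b)) / len(a)


def classify(tars, exons, gc_range=(0.3, 0.7), max_dist=10_000,
             min_corr=0.9, min_phylo=0.8) -> Dict[str, str]:
    """Label each TAR; all thresholds are illustrative assumptions."""
    labels: Dict[str, str] = {}

    # Step 1: filter unusual composition and suspected cross-hybridization.
    kept = []
    for t in tars:
        if t.cross_hyb_suspect or not gc_range[0] <= t.gc_fraction <= gc_range[1]:
            labels[t.name] = "filtered"
        else:
            kept.append(t)

    # Step 2: associate with a proximal exon whose expression correlates.
    unclassified = []
    for t in kept:
        hit = next((e for e in exons
                    if abs(e.position - t.position) <= max_dist
                    and pearson(e.expression, t.expression) >= min_corr), None)
        if hit is not None:
            labels[t.name] = f"gene:{hit.gene}"
        else:
            unclassified.append(t)

    # Step 3: greedy grouping by similar expression and phylogenetic profile.
    groups: List[List[Tar]] = []
    for t in unclassified:
        for g in groups:
            if any(pearson(t.expression, m.expression) >= min_corr
                   and phylo_agreement(t.phylo, m.phylo) >= min_phylo
                   for m in g):
                g.append(t)
                break
        else:
            groups.append([t])

    locus_id = 0
    for g in groups:
        if len(g) < 2:            # a lone TAR does not form a putative locus
            labels[g[0].name] = "unclassified"
            continue
        for t in g:
            labels[t.name] = f"locus:{locus_id}"
        locus_id += 1
    return labels


# Toy data, invented purely for illustration.
demo_exons = [Exon("GENE_A", 1_000, [1, 2, 3, 4])]
demo_tars = [
    Tar("t1", 1_500, 0.50, False, [2, 4, 6, 8], (1, 1, 0, 0)),
    Tar("t2", 50_000, 0.95, False, [1, 1, 2, 1], (0, 0, 0, 0)),
    Tar("t3", 90_000, 0.50, False, [5, 1, 5, 1], (1, 1, 1, 0)),
    Tar("t4", 91_000, 0.45, False, [6, 2, 6, 2], (1, 1, 1, 0)),
    Tar("t5", 200_000, 0.50, False, [1, 5, 1, 5], (0, 0, 0, 1)),
]
labels = classify(demo_tars, demo_exons)
print(labels)
```

In this toy run, one TAR is filtered on composition, one is tied to the nearby exon, two form a putative locus, and one remains unclassified. The published analysis draws on richer features (for example, phylogenetic profiles across 17 species), so the sketch should be read only as an outline of the decision order.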

Original language: English (US)
Pages (from-to): 732-745
Number of pages: 14
Journal: Genome Research
Volume: 17
Issue number: 6
DOI: https://doi.org/10.1101/gr.5696007
State: Published - Jun 2007
Externally published: Yes

Fingerprint

Exons
Genes
Synteny
Benchmarking
Polymerase Chain Reaction
Human Genome
Databases
RNA
Cell Line

ASJC Scopus subject areas

  • Genetics

Cite this

Rozowsky, J. S., Newburger, D., Sayward, F., Wu, J., Jordan, G., Korbel, J. O., ... Gerstein, M. B. (2007). The DART classification of unannotated transcription within the ENCODE regions: Associating transcription with known and novel loci. Genome Research, 17(6), 732-745. https://doi.org/10.1101/gr.5696007

@article{0724603bb5b14015adc0001a3efb9f4b,
title = "The DART classification of unannotated transcription within the ENCODE regions: Associating transcription with known and novel loci",
author = "Rozowsky, {Joel S.} and Daniel Newburger and Fred Sayward and Jiaqian Wu and Greg Jordan and Korbel, {Jan O.} and Ugrappa Nagalakshmi and Jin Yang and Deyou Zheng and Roderic Guig{\'o} and Gingeras, {Thomas R.} and Sherman Weissman and Perry Miller and Michael Snyder and Gerstein, {Mark B.}",
year = "2007",
month = "6",
doi = "10.1101/gr.5696007",
language = "English (US)",
volume = "17",
pages = "732--745",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "6",

}

TY - JOUR
T1 - The DART classification of unannotated transcription within the ENCODE regions
T2 - Associating transcription with known and novel loci
AU - Rozowsky, Joel S.
AU - Newburger, Daniel
AU - Sayward, Fred
AU - Wu, Jiaqian
AU - Jordan, Greg
AU - Korbel, Jan O.
AU - Nagalakshmi, Ugrappa
AU - Yang, Jin
AU - Zheng, Deyou
AU - Guigó, Roderic
AU - Gingeras, Thomas R.
AU - Weissman, Sherman
AU - Miller, Perry
AU - Snyder, Michael
AU - Gerstein, Mark B.
PY - 2007/6
Y1 - 2007/6
UR - http://www.scopus.com/inward/record.url?scp=34250340889&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34250340889&partnerID=8YFLogxK
U2 - 10.1101/gr.5696007
DO - 10.1101/gr.5696007
M3 - Article
C2 - 17567993
AN - SCOPUS:34250340889
VL - 17
SP - 732
EP - 745
JO - Genome Research
JF - Genome Research
SN - 1088-9051
IS - 6
ER -