Model-based clustering in gene expression microarrays: An application to breast cancer data

J. C. Mar, G. J. McLachlan

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

In microarray studies, clustering techniques are often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes, which is a non-standard problem because the number of genes greatly exceeds the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.

Original language: English (US)
Pages (from-to): 579-592
Number of pages: 14
Journal: International Journal of Software Engineering and Knowledge Engineering
Volume: 13
Issue number: 6
DOI: 10.1142/S0218194003001482
State: Published - Dec 2003
Externally published: Yes

Fingerprint

Microarrays
Genes
Tissue
Cluster analysis

Keywords

  • Cluster analysis
  • Microarray
  • Mixture modelling

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Software

Cite this

Model-based clustering in gene expression microarrays: An application to breast cancer data. / Mar, J. C.; McLachlan, G. J.

In: International Journal of Software Engineering and Knowledge Engineering, Vol. 13, No. 6, 12.2003, p. 579-592.

@article{d744ffd01e1b4e53859591a1d2ea8a17,
title = "Model-based clustering in gene expression microarrays: An application to breast cancer data",
abstract = "In microarray studies, clustering techniques are often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes, which is a non-standard problem because the number of genes greatly exceeds the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.",
keywords = "Cluster analysis, Microarray, Mixture modelling",
author = "Mar, {J. C.} and McLachlan, {G. J.}",
year = "2003",
month = "12",
doi = "10.1142/S0218194003001482",
language = "English (US)",
volume = "13",
pages = "579--592",
journal = "International Journal of Software Engineering and Knowledge Engineering",
issn = "0218-1940",
publisher = "World Scientific Publishing Co. Pte Ltd",
number = "6",
}

TY - JOUR

T1 - Model-based clustering in gene expression microarrays

T2 - An application to breast cancer data

AU - Mar, J. C.

AU - McLachlan, G. J.

PY - 2003/12

Y1 - 2003/12

N2 - In microarray studies, clustering techniques are often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes, which is a non-standard problem because the number of genes greatly exceeds the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.

AB - In microarray studies, clustering techniques are often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes, which is a non-standard problem because the number of genes greatly exceeds the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.

KW - Cluster analysis

KW - Microarray

KW - Mixture modelling

UR - http://www.scopus.com/inward/record.url?scp=1142276170&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1142276170&partnerID=8YFLogxK

U2 - 10.1142/S0218194003001482

DO - 10.1142/S0218194003001482

M3 - Article

AN - SCOPUS:1142276170

VL - 13

SP - 579

EP - 592

JO - International Journal of Software Engineering and Knowledge Engineering

JF - International Journal of Software Engineering and Knowledge Engineering

SN - 0218-1940

IS - 6

ER -