Fuzzy C-Means for Clustering Microarray Data

Doulaye Dembélé and Philippe Kastner
IGBMC
CNRS-INSERM-ULP, BP 10142
67404 Illkirch Cedex, France

see : http://bioinformatics.oupjournals.org/
volume 19 ; number 8 ; pages 973-980

Contacts : Doulaye Dembélé or Philippe Kastner

Abstract

Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or Self-Organizing Maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene on the overall shape of the clusters. Here we apply a fuzzy partitioning method, Fuzzy C-Means (FCM), to attribute cluster membership values to genes.

A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m=2 is not appropriate for some datasets, and that optimal values for m vary widely from one dataset to another. We propose an empirical method, based on the distribution of distances between genes in a given dataset, to determine an adequate value for m. By setting threshold levels for the membership values, the genes that are most tightly associated with a given cluster can be selected. Using a yeast cell cycle dataset as an example, we show that this selection increases the overall biological significance of the genes within the cluster.

Datasets used

All files listed below are tab-delimited ASCII text files.

Serum dataset : serum.txt
can be downloaded also from : http://www.sciencemag.org/feature/data/984559.shl
the entire dataset is available at : http://genome-www.stanford.edu/serum

Yeast dataset : yeast.txt (normalized data used)
The entire dataset can be downloaded from : all dataset or from http://genomics.stanford.edu
see also http://arep.med.harvard.edu/network_discovery

Human cancer dataset : hc728g.txt
The complete dataset can be downloaded from : discover.nci.nih.gov/nature2000/

Iris dataset : iris.txt
can be downloaded also from : ftp://ftp.ics.uci.edu/pub/machine-learning-databases/

Synthetic dataset 1 : y3c.txt

Synthetic dataset 2 : y14c.txt

Color figures

figure 1 :
color figure 1
Influence of the fuzziness parameter m on the distribution of membership values.
Boxplot representations of sorted membership values from FCM clustering are shown.
For fixed values of m, the K membership values of each gene were sorted
in decreasing order. At each position in a plot, the horizontal segments mark, in order, the 99th centile,
third quartile, median, first quartile and first centile ; isolated segments represent outliers.
a) distribution of membership values when m is fixed to 2. Note that, in the case
of the yeast and cancer datasets, all membership values for all genes were equal to 1/K.
b) distribution of membership values when m is equal to the computed upper bound value m_ub.
c) effect of varying m in the case of the serum data.
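
The sorting step behind these boxplots can be sketched as follows. This is only an illustration in Python with NumPy, not the authors' Matlab code; the function name sorted_membership_summary and the small example matrix are hypothetical, and the membership matrix U is assumed to have one row per gene and one column per cluster.

```python
import numpy as np

def sorted_membership_summary(U):
    """U: (N genes, K clusters) membership matrix (assumed layout)."""
    # Sort the K membership values of each gene in decreasing order.
    U_sorted = np.sort(U, axis=1)[:, ::-1]
    # For each rank (highest, second highest, ...), summarize across genes
    # with the centiles drawn in the boxplots.
    centiles = np.percentile(U_sorted, [99, 75, 50, 25, 1], axis=0)
    return U_sorted, centiles

# Hypothetical membership values for 3 genes and K=3 clusters.
U = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.3, 0.4, 0.3]])
U_sorted, centiles = sorted_membership_summary(U)
```

Each column of centiles then describes one box of the plot: row 0 is the 99th centile, row 2 the median and row 4 the first centile.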

figure 2 :
color figure 2
Boxplot of sorted membership values for real and randomized datasets (see figure 1 for
description of the representation). Only the first 15 significant values are shown for the
cancer dataset. Randomized datasets were generated as follows. To the first gene in the list of
the dataset, we assigned an expression value selected randomly from the N values of
experiment j. To the second gene in the list, we assigned an expression value selected
randomly from the remaining (N-1) values of experiment j. We repeated this process until
the last remaining expression value was assigned to the last gene in the list. The overall randomized
dataset was formed by applying this procedure to each of the p experiments.
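
Drawing values one by one without replacement from an experiment is equivalent to applying a random permutation to that column of the data matrix. A minimal sketch in Python with NumPy, assuming the data are stored as an N x p array (genes in rows, experiments in columns); the function name randomize_dataset is hypothetical.

```python
import numpy as np

def randomize_dataset(X, seed=None):
    """Shuffle each experiment (column) independently across genes (rows)."""
    rng = np.random.default_rng(seed)
    X_rand = np.empty_like(X)
    for j in range(X.shape[1]):
        # A random permutation assigns every value of experiment j
        # to exactly one gene, as in the procedure described above.
        X_rand[:, j] = rng.permutation(X[:, j])
    return X_rand

# Hypothetical data: 5 genes, 3 experiments.
X = np.arange(15, dtype=float).reshape(5, 3)
X_rand = randomize_dataset(X, seed=0)
```

Each experiment keeps exactly the same set of values, so its marginal distribution is unchanged; only the association between genes and values is destroyed.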

figure 3 :
color figure 3
Scatterplot of the two highest membership values of all genes in the datasets.
Vertical lines correspond to median value of the highest membership value.
(a) serum dataset, (b) yeast dataset, (c) cancer dataset.

figure 4 :

color figure 4
Cancer data, threshold-based selection of genes and expression profile representation.
We use the value of the threshold (0.67) to divide the data set into two groups :
genes with a U_max greater than 0.67 (red in part (a)) and the genes which
have a U_max lower than 0.67 (magenta in part (a)). Finally, we replace the
normalized data of each gene by color codes in which red stands for the highest
expression value while green is used for the lowest value. Panel (c) of this figure
shows a clean separation of clusters for genes having a membership value greater
than 0.67. In contrast, the expression profiles of the genes in magenta (part (a))
show a fuzzier pattern (panel (b) of the figure).
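
The threshold-based split can be illustrated with a few lines of Python. This is a sketch only; the function name select_by_membership and the example matrix are hypothetical, and U is assumed to hold one row of membership values per gene.

```python
import numpy as np

def select_by_membership(U, threshold):
    """Split genes by their highest membership value U_max."""
    u_max = U.max(axis=1)        # highest membership value of each gene
    labels = U.argmax(axis=1)    # cluster in which that value is reached
    keep = u_max > threshold     # True for genes tightly associated with a cluster
    return keep, labels

# Hypothetical membership values for 3 genes and 2 clusters.
U = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8]])
keep, labels = select_by_membership(U, threshold=0.67)
```

Here the first and third genes pass the 0.67 threshold, while the second falls in the low-membership group.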

figure 5 :
color figure 5
Boxplot of silhouette values of genes in clusters. For each gene a silhouette value is computed
(see text). When this value is lower than zero, the corresponding gene is poorly classified.
(a) serum data set, top : no selection, bottom : gene selection with a threshold equal to 0.87 ;
(b) yeast data set, top : no selection, bottom : gene selection with a threshold equal to 0.80 ;
(c) cancer data set, top : no selection, bottom : gene selection with a threshold equal to 0.67.
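
For readers who want to reproduce this kind of plot, the standard silhouette value of a gene is s = (b - a) / max(a, b), where a is its mean distance to the other genes of its own cluster and b is the smallest mean distance to the genes of any other cluster. The sketch below assumes Euclidean distances and hard cluster labels (for instance the cluster of highest membership); the distance actually used in the article may differ, and the function name is hypothetical.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value of each row of X (Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full matrix of pairwise Euclidean distances.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: silhouette left at 0 by convention
        a = D[i, own].sum() / (n_own - 1)  # excludes the zero self-distance
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical one-dimensional example with two well-separated clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
s = silhouette_values(X, [0, 0, 1, 1])
```

Values close to 1 indicate genes lying well inside their cluster, and negative values indicate genes that are on average closer to another cluster.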

Supplementary material

Iris dataset
Using this dataset, we ran the FCM algorithm with various values of the fuzziness parameter m, using 30 independent runs for each value. In each run, K samples (3 when K=3, 2 when K=2) were randomly selected as initial centroids. After convergence of FCM for all runs, the solution giving the smallest value of J(K,m) was kept, and we estimated U_{max} and U_{min} as defined above.
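
The multi-start procedure described here can be sketched in Python as follows. This is not the Matlab code distributed on this page: it is a minimal version of the textbook FCM updates with squared Euclidean distances, and the function names, the convergence test and the default values of max_iter and tol are assumptions made for the illustration.

```python
import numpy as np

def memberships(X, centroids, m):
    """Textbook FCM membership update (squared Euclidean distance assumed)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)                  # guard against zero distances
    ratio = d2.min(axis=1, keepdims=True) / d2  # in (0, 1], avoids overflow
    w = ratio ** (1.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True), d2

def fcm_single(X, centroids, m, max_iter=300, tol=1e-6):
    """One FCM run from given initial centroids; returns U, centroids, J."""
    for _ in range(max_iter):
        U, _ = memberships(X, centroids, m)
        W = U ** m
        new_centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    U, d2 = memberships(X, centroids, m)
    J = float(((U ** m) * d2).sum())            # objective function J(K, m)
    return U, centroids, J

def fcm_best_of_runs(X, K, m, n_runs=30, seed=None):
    """Keep the run with the smallest J among n_runs random initializations."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        # K samples drawn at random serve as initial centroids.
        init = X[rng.choice(len(X), size=K, replace=False)]
        result = fcm_single(X, init, m)
        if best is None or result[2] < best[2]:
            best = result
    return best

# Hypothetical data: two compact groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
U, centroids, J = fcm_best_of_runs(X, K=2, m=2.0, n_runs=30, seed=1)
```

Keeping only the run with the smallest J reduces the dependence of the final partition on the random choice of initial centroids.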

For the same values of the fuzziness parameter m, we computed the sample mean, standard deviation and the coefficient of variation of the derivative distances Y_m. We also computed the ratio of the coefficient of variation.
The results obtained are summarized in the following two tables.

iris table 1

iris table 2

Synthetic dataset 1
We varied the fuzziness parameter m and we performed the same computations as for the iris dataset. The results obtained are summarized in the following table

synthetic dataset 1 table

Synthetic dataset 2
We varied the fuzziness parameter m and computed U_{min} and U_{max} and the sample statistics of the distances Y_m (see the following table).

synthetic dataset 2 table

Serum dataset
The sample statistics of the normalized Y_m are summarized in the following table

serum dataset table

Yeast dataset
Two different selections of genes were made from the 6200 ORFs in the yeast dataset. The first selection contains 2945 genes while the second contains 1159 genes. The sample statistics of the normalized Y_m of these datasets are summarized in the following tables.

yeast dataset tables

Direct computation of the upper bound value for m
The flowchart in the following figure summarizes the computation steps used to define the upper bound value of the fuzziness parameter m.

figure 6 :
mub flowchart
Computation flowchart for determining the upper bound value of the fuzziness
parameter m. The algorithm stops when the absolute value of the difference between
the computed cv and 0.03p is lower than e. It also stops when the number of
iterations is greater than maxIter. Default values of e and maxIter are 0.001
and 500 respectively, and can be adjusted by the user. The maximum value of
m, m_2=1000, can also be changed by the user.
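
Since the flowchart itself is only available as an image, the following Python sketch shows one plausible way to implement the stopping rules stated in the caption. It rests on several assumptions that are not confirmed by this page: Y_m is taken to be the set of squared Euclidean distances between pairs of genes raised to the power 1/(m-1), the search is a plain bisection, and the lower starting value m_1=1.01 is hypothetical. Only e, maxIter, m_2 and the target 0.03p come from the caption.

```python
import numpy as np

def pairwise_sq_distances(X):
    """Squared Euclidean distances between all distinct pairs of rows."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return d2[np.triu_indices(len(X), k=1)]

def cv_of_Ym(d2, m):
    """Coefficient of variation of Y_m, ASSUMED to be d2 ** (1 / (m - 1))."""
    # The cv is scale invariant, so dividing by the maximum avoids overflow.
    y = (d2 / d2.max()) ** (1.0 / (m - 1.0))
    return y.std(ddof=1) / y.mean()

def upper_bound_m(X, m1=1.01, m2=1000.0, e=1e-3, max_iter=500):
    """Bisection on m until |cv - 0.03 p| < e or max_iter is reached."""
    d2 = pairwise_sq_distances(X)
    target = 0.03 * X.shape[1]      # p = number of experiments (columns)
    lo, hi = m1, m2                 # m1 is a hypothetical lower starting value
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        cv = cv_of_Ym(d2, mid)
        if abs(cv - target) < e:
            break
        if cv > target:
            lo = mid                # cv too large: a larger m is needed
        else:
            hi = mid                # cv too small: a smaller m is needed
    return mid, cv

# Hypothetical data: 60 genes, p=4 experiments.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
m_ub, cv = upper_bound_m(X)
```

The bisection relies on the cv of Y_m decreasing as m increases, which holds for the assumed definition because the exponent 1/(m-1) shrinks towards zero. Storing all pairwise differences at once is convenient for a sketch but memory-hungry for datasets with thousands of genes.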

Matlab functions

fcm.ZIP
