A major problem in applying the FCM method for clustering microarray data is the
choice of the fuzziness parameter m. We show that the commonly used value
m=2 is not appropriate for some datasets, and that optimal values for
m vary widely from one
dataset to another. We propose an empirical method, based on the distribution of
distances between genes in a given dataset, to determine an adequate value for
m.

For the same values of the fuzziness parameter m, we computed the sample
mean, standard deviation and coefficient of variation of the derivative distances
Y_m. We also computed the ratio of the coefficient of variation.

Datasets used
All files listed below are tab-delimited ASCII text files.

Serum dataset : serum.txt
Can also be downloaded from :
http://www.sciencemag.org/feature/data/984559.shl
The entire dataset is available at :
http://genome-www.stanford.edu/serum

Yeast dataset : yeast.txt (normalized data used)
The entire dataset can be downloaded from :
all dataset
or from http://genomics.stanford.edu
See also
http://arep.med.harvard.edu/network_discovery

Human cancer dataset : hc728g.txt
The complete dataset can be downloaded from :
discover.nci.nih.gov/nature2000/

Iris dataset : iris.txt
Can also be downloaded from :
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/

Synthetic dataset 1 : y3c.txt
Synthetic dataset 2 : y14c.txt

Color figures

Figure 1 : Influence of the fuzziness parameter m on the distribution of
membership values.
Boxplot representations of sorted membership values from FCM clustering
are shown.
For fixed values of m, the K membership values of each gene were sorted
in decreasing order. For each point in a plot, the horizontal segments are the
99th centile, third quartile, median, first quartile and first centile values
respectively; isolated segments represent outliers.
a) Distribution of membership values when m is fixed to 2.
Note that, in the case of the yeast and cancer datasets, all membership values
for all genes were equal to 1/K.
b) Distribution of membership values when m is equal to the computed upper
bound value m_ub.
c) Effect of varying m in the case of the serum data.
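
The sorting step behind these boxplots can be sketched in Python as follows. This is a minimal illustration and not the authors' Matlab code: the membership matrix U and the function names are hypothetical, and the centiles are obtained with the standard-library `statistics.quantiles` using its inclusive interpolation, which is an assumption about how the plotted values were computed.

```python
import statistics


def sorted_memberships(U):
    """Sort the K membership values of each gene in decreasing order."""
    return [sorted(row, reverse=True) for row in U]


def rank_summary(U):
    """For each rank 1..K, return the five values drawn in the boxplot:
    (99th centile, third quartile, median, first quartile, first centile)."""
    ranked = sorted_memberships(U)
    K = len(ranked[0])
    summary = []
    for r in range(K):
        column = [row[r] for row in ranked]
        # 99 cut points; cut point i (1-based) is stored at index i - 1.
        cuts = statistics.quantiles(column, n=100, method="inclusive")
        summary.append((cuts[98], cuts[74], cuts[49], cuts[24], cuts[0]))
    return summary


# Hypothetical membership matrix: 4 genes, K = 3 clusters (each row sums to 1).
U = [[0.70, 0.20, 0.10],
     [0.10, 0.80, 0.10],
     [0.30, 0.30, 0.40],
     [0.05, 0.05, 0.90]]
ranked = sorted_memberships(U)
summary = rank_summary(U)
```

Each tuple in `summary` describes one point of the plot: the first tuple summarizes the highest membership value of every gene, the second the next highest, and so on.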

Figure 2 : Boxplot of sorted membership values for real and randomized datasets
(see figure 1 for a description of the representation). Only the first 15
significant values are shown for the cancer dataset. Randomized datasets were
generated as follows. To the first gene in the list of the dataset, we
associated an expression value selected randomly from the N values of
experiment j. To the second gene in the list, we associated an expression value
selected randomly from the remaining (N-1) values of experiment j. We repeated
this process until we associated the remaining expression value with the last
gene in the list. The overall randomized dataset was formed by doing these
operations for the p experiments.
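
Drawing values without replacement, gene by gene, amounts to permuting each experiment's values across genes independently of the other experiments. A minimal Python sketch, assuming the data are stored as a list of N rows (genes) with p values (experiments) each; the function name and the `seed` argument are hypothetical:

```python
import random


def randomize_dataset(X, seed=None):
    """Return a randomized copy of X (N genes x p experiments).

    For each experiment j, the N expression values are reassigned to the
    genes by random draws without replacement, i.e. the column is shuffled.
    """
    rng = random.Random(seed)
    n_genes = len(X)
    n_experiments = len(X[0])
    R = [[None] * n_experiments for _ in range(n_genes)]
    for j in range(n_experiments):
        column = [X[i][j] for i in range(n_genes)]
        rng.shuffle(column)  # equivalent to successive draws without replacement
        for i in range(n_genes):
            R[i][j] = column[i]
    return R
```

The randomized dataset keeps the distribution of values within every experiment and removes only the association between experiments for a given gene.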

Figure 3 : Scatterplot of the two highest membership values of all genes in the datasets.
Vertical lines correspond to the median value of the highest membership value.
(a) serum dataset, (b) yeast dataset, (c) cancer dataset.

Figure 4 : Cancer data, threshold-based selection of genes and expression profile
representation.
We use the value of the threshold (0.67) to divide the dataset into two groups :
genes with a U_max greater than 0.67 (red in part (a)) and genes with a U_max
lower than 0.67 (magenta in part (a)). Finally, we replace the normalized data
of each gene by color codes in which red stands for the highest expression
value while green is used for the lowest value. Panel (c) of this figure
shows a clean separation of clusters for genes having a membership value greater
than 0.67. In contrast, the expression profiles of the genes in magenta (part (a))
show a fuzzier pattern (panel (b) of the figure).
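
The threshold-based split can be sketched in Python as below. The function name and the toy matrix are hypothetical; U_max is taken here to be the highest membership value of a gene, and a gene whose highest membership equals the threshold exactly is placed in the lower group, which is an assumption since the caption does not cover that case.

```python
def split_by_max_membership(U, threshold=0.67):
    """Split gene indices according to their highest membership value.

    Returns (selected, rejected): genes whose highest membership is greater
    than the threshold, and all remaining genes.
    """
    selected, rejected = [], []
    for gene, memberships in enumerate(U):
        if max(memberships) > threshold:
            selected.append(gene)
        else:
            rejected.append(gene)  # includes the (unspecified) equality case
    return selected, rejected


# Hypothetical memberships for 4 genes and K = 2 clusters.
U = [[0.90, 0.10], [0.50, 0.50], [0.30, 0.70], [0.67, 0.33]]
selected, rejected = split_by_max_membership(U)
```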

Figure 5 : Boxplot of silhouette values of genes in clusters. For each gene a
silhouette value is computed (see text). When this value is lower than zero,
the corresponding gene is poorly classified.
(a) serum dataset, top : no selection, bottom : gene selection with a threshold
equal to 0.87 ;
(b) yeast dataset, top : no selection, bottom : gene selection with a threshold
equal to 0.80 ;
(c) cancer dataset, top : no selection, bottom : gene selection with a threshold
equal to 0.67.
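
For readers without the main text at hand, the usual silhouette value of a gene is s = (b - a) / max(a, b), where a is its mean distance to the other genes of its own cluster and b is the smallest mean distance to the genes of any other cluster. The Python sketch below assumes this standard definition, Euclidean distance and a hard assignment of each gene to one cluster; the text may use different choices, and the function name is hypothetical.

```python
import math


def silhouette_values(X, labels):
    """Silhouette value of every sample, given hard cluster labels."""
    n = len(X)
    values = []
    for i in range(n):
        # Mean distance from sample i to the members of each cluster.
        mean_dist = {}
        for c in set(labels):
            members = [k for k in range(n) if labels[k] == c and k != i]
            if members:
                total = sum(math.dist(X[i], X[k]) for k in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        others = [d for c, d in mean_dist.items() if c != own]
        if own not in mean_dist or not others:
            values.append(0.0)  # singleton cluster or single cluster: convention
            continue
        a, b = mean_dist[own], min(others)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values
```

A value close to 1 indicates a gene well inside its cluster, while a negative value indicates a gene that is on average closer to another cluster than to its own.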

Supplementary material
Iris dataset
Using this dataset, we ran the FCM algorithm with various values for the fuzziness
parameter m, using 30 independent runs.
In each run, 3 (case where K=3) or 2 (case where K=2)
samples were randomly selected as initial centroids.
After convergence of FCM for all runs, the solution that gave the smallest
value for J(K,m)
was kept and we estimated U_{max} and U_{min} as defined above.
The results obtained are summarized in the following two tables.
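
The multiple-run procedure can be sketched in plain Python as follows. This is a minimal illustration and not the Matlab implementation distributed with this page: it assumes the standard FCM updates with squared Euclidean distance, and the function names, the convergence tolerance and the toy data are hypothetical. The estimation of U_{max} and U_{min} is left out because their definition is not repeated here.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between a sample and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def fcm(X, K, m, rng, max_iter=100, tol=1e-6):
    """One FCM run started from K randomly selected samples."""
    centroids = [list(x) for x in rng.sample(X, K)]
    p = len(X[0])
    J_prev = None
    for _ in range(max_iter):
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        U = []
        for x in X:
            d2 = [sq_dist(x, c) for c in centroids]
            zeros = [d == 0.0 for d in d2]
            if any(zeros):  # sample coincides with one or more centroids
                U.append([1.0 / sum(zeros) if z else 0.0 for z in zeros])
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                s = sum(inv)
                U.append([v / s for v in inv])
        # Centroid update: means weighted by u_ik^m.
        for k in range(K):
            w = [row[k] ** m for row in U]
            total = sum(w)
            centroids[k] = [sum(wi * x[j] for wi, x in zip(w, X)) / total
                            for j in range(p)]
        # Objective function J(K, m).
        J = sum(U[i][k] ** m * sq_dist(X[i], centroids[k])
                for i in range(len(X)) for k in range(K))
        if J_prev is not None and abs(J_prev - J) < tol:
            break
        J_prev = J
    return J, U, centroids


def best_fcm(X, K, m, n_runs=30, seed=0):
    """Keep the run with the smallest objective value among n_runs runs."""
    rng = random.Random(seed)
    return min((fcm(X, K, m, rng) for _ in range(n_runs)), key=lambda r: r[0])


# Hypothetical toy data: two well-separated groups of three samples.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
J, U, centroids = best_fcm(X, K=2, m=2.0)
```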


Synthetic dataset 1
We varied the fuzziness parameter m and performed the same computations
as for the iris dataset. The results obtained are summarized in the following table.

Synthetic dataset 2
We varied the fuzziness parameter m and computed U_{min}, U_{max} and the
sample statistics of the distances Y_m (see the following table).

Serum dataset
The sample statistics of the normalized Y_m are summarized in the
following table.

Yeast dataset
Two different selections of genes were made from the 6200 ORFs in the yeast
dataset. The first selection contains 2945 genes while the second contains
1159 genes.
The sample statistics of the normalized Y_m of these datasets are
summarized in the following tables.

Direct computation of the upper bound value for m
The flowchart in the following figure summarizes the computation steps used
to define the upper bound value of the fuzziness parameter m.

Figure 6 : Computation flowchart for determining the upper bound value of the fuzziness
parameter m. The algorithm stops when the absolute value of the error between the
computed cv and 0.03p is lower than e. It also stops when the number of
iterations is greater than maxIter. Default values of e and maxIter are 0.001
and 500 respectively, and can be adjusted by the user. The maximum value of
m, m_2 = 1000, can also be changed by the user.
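
The stopping rules in this caption are compatible with a bisection search on m, sketched below in Python. Several points are assumptions rather than facts taken from this page: Y_m is taken to be d^(2/(m-1)) over the pairwise Euclidean distances d between genes, p is taken to be the number of experiments, the search is a bisection between a hypothetical lower value m_1 = 1.1 and m_2, and all function names are invented for the illustration. The flowchart and the Matlab functions remain the reference.

```python
import math
import statistics


def pairwise_distances(X):
    """Euclidean distances between all pairs of samples."""
    n = len(X)
    return [math.dist(X[i], X[k]) for i in range(n) for k in range(i + 1, n)]


def cv_of_Ym(distances, m):
    """Coefficient of variation of Y_m = d**(2/(m-1)) (assumed definition)."""
    d_max = max(distances)
    # The cv is scale-invariant, so dividing by d_max only avoids overflow.
    y = [(d / d_max) ** (2.0 / (m - 1.0)) for d in distances]
    return statistics.stdev(y) / statistics.mean(y)


def upper_bound_m(X, m1=1.1, m2=1000.0, e=0.001, max_iter=500):
    """Search for m such that cv{Y_m} is within e of 0.03 * p."""
    target = 0.03 * len(X[0])
    distances = pairwise_distances(X)
    low, high = m1, m2
    m = 0.5 * (low + high)
    for _ in range(max_iter):
        m = 0.5 * (low + high)
        cv = cv_of_Ym(distances, m)
        if abs(cv - target) < e:
            break
        if cv > target:
            low = m   # cv too large: a larger m flattens Y_m
        else:
            high = m  # cv too small: a smaller m spreads Y_m
    return m


# Hypothetical toy data: 6 samples measured in p = 2 experiments.
X = [(0, 0), (1, 0), (0, 2), (3, 1), (4, 4), (2, 5)]
m_ub = upper_bound_m(X)
```

The sketch relies on cv{Y_m} decreasing as m grows, which holds for the toy data used here; a search interval in which the target is never reached would simply run until maxIter is exceeded.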

Matlab functions
fcm.ZIP
