Discussion:
kmeans with chi-square metrics
(too old to reply)
Petr K.
2006-01-31 09:50:40 UTC
Permalink
Hello,

is it possible to extend the standard implementation of the kmeans
algorithm in matlab by additional metrics by just adding the equation
for calculation of the metric (e.g. by exchange the euclid. distance
with chi-square) to kmeans.m. I would need this for clustering of
histograms, for these it is better to using chi-square metric instead
of euclidian.
thanks and regards petr k.
Peter Perkins
2006-01-31 15:06:05 UTC
Permalink
Post by Petr K.
is it possible to extend the standard implementation of the kmeans
algorithm in matlab by additional metrics by just adding the equation
for calculation of the metric (e.g. by exchange the euclid. distance
with chi-square) to kmeans.m. I would need this for clustering of
histograms, for these it is better to using chi-square metric instead
of euclidian.
Hi Petr -

Maybe.

The K-Means algorithm not only computes distances, it also computes centroids.
The centroid of a cluster is defined as the point to which the sum of distances
within the cluster is minimized. "Distances", meaning whatever distance metric
you're using to cluster. By default, the distance metric is _squared_ Euclidean
distance, and so the point that minimizes the sum of that is the arithmetic
mean. All of the other distances that KMEANS supports have a centroid
calculation that corresponds to the particular distance, and it's the arithmetic
mean only in that one special case.

It kind of doesn't make sense to cluster based on a particular distance and to
compute the centroids with a formula that does not minimize the within-cluster
sum of distances. SOme other K-Means implementations do it, but it kind of
doesn't make sense.

So if you want to use a different distance metric, you really ought to figure
out the corresponding centroid calculation. This can be hard -- even for
(unsquared) Euclidean distance, it is hard. I'd be interested to know if there
is an appropriate formula for your chi-squared distance.

Barring that, have you thought about using hierarchical clustering? You can use
any distance metric you want, because there's no centroid calculation.

Hope this helps.

- Peter Perkins
The MathWorks, Inc.

Loading...