cluster-analysis – Page 4

Python k-means algorithm

May 10, 2023 by Tarik

Update: (Eleven years after this original answer, it’s probably time for an update.) First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. I’d suggest that, beyond the graphic, you look especially at the parameters that each method requires and decide whether you can provide the … Read more

Scikit Learn – K-Means – Elbow – criterion

April 27, 2023 by Tarik

If the true labels are not known in advance (as in your case), then K-Means clustering can be evaluated using either the Elbow Criterion or the Silhouette Coefficient. Elbow Criterion Method: The idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g., k = 1 to 10), and … Read more
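The elbow procedure can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: `kmeans_sse`, `farthest_point_init`, and the synthetic three-blob dataset are assumptions made for the example, and the deterministic farthest-point seeding stands in for the random or k-means++ seeding a library would typically use. It computes the within-cluster sum of squared errors (SSE) for k = 1 to 6.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption for this sketch): start from the
    first point, then repeatedly add the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster SSE."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Synthetic data (hypothetical): three well-separated 2-D blobs.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 1))
```

On data like this, the SSE should fall sharply up to k = 3 and flatten afterwards; that bend is the "elbow" used to pick k.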

plotting results of hierarchical clustering on top of a matrix of data

April 23, 2023 by Tarik

The question does not define matrix very well: “matrix of values”, “matrix of data”. I assume that you mean a distance matrix. In other words, element D_ij in the symmetric nonnegative N-by-N distance matrix D denotes the distance between two feature vectors, x_i and x_j. Is that correct? If so, then try this (edited June … Read more
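To make the definition concrete, here is a small sketch that builds such a distance matrix from feature vectors using only the standard library. The function name `distance_matrix` and the sample vectors are assumptions for illustration, and Euclidean distance is assumed as the metric.

```python
import math


def distance_matrix(vectors):
    """Return the symmetric, nonnegative N-by-N matrix D in which
    D[i][j] is the Euclidean distance between vectors[i] and vectors[j]."""
    n = len(vectors)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both triangles at once so the matrix stays symmetric.
            D[i][j] = D[j][i] = math.dist(vectors[i], vectors[j])
    return D


# Hypothetical feature vectors x_0, x_1, x_2.
features = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(features)
for row in D:
    print(row)
```

The diagonal is zero because each vector is at distance zero from itself, and only the upper triangle needs to be computed.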

K-means algorithm variation with equal cluster size

April 2, 2023 by Tarik

This might do the trick: apply Lloyd’s algorithm to get k centroids. Sort the centroids by descending size of their associated clusters in an array. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j and … Read more

Unsupervised clustering with unknown number of clusters

January 14, 2023 by Tarik

You can use hierarchical clustering. It is a rather basic approach, so there are lots of implementations available. It is, for example, included in Python’s scipy. See, for example, the following script:

    import matplotlib.pyplot as plt
    import numpy
    import scipy.cluster.hierarchy as hcluster

    # generate 3 clusters of around 100 points each and one orphan point
    … Read more
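As a library-free illustration of why hierarchical clustering does not need the number of clusters up front, the sketch below cuts a single-linkage hierarchy at a distance threshold. It relies on the fact that this cut is equivalent to finding the connected components of the graph linking every pair of points no farther apart than the threshold. The function name `threshold_clusters`, the threshold value, and the toy data are assumptions for the example; this is not the scipy script referred to above.

```python
import math


def threshold_clusters(points, threshold):
    """Single-linkage clustering cut at `threshold`, via union-find.
    Returns one integer label per point, numbered by first appearance."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair of points within the threshold.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    label_of = {}
    return [label_of.setdefault(find(i), len(label_of)) for i in range(len(points))]


# Toy data (hypothetical): three tight groups and one orphan point.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
    (50, 50),
]
labels = threshold_clusters(points, threshold=1.5)
print(labels)
print("clusters found:", len(set(labels)))
```

The number of clusters falls out of the threshold rather than being supplied, which is the property that makes this family of methods attractive when k is unknown. The pairwise loop is quadratic in the number of points, so a library implementation is the better choice for anything large.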

What is an intuitive explanation of the Expectation Maximization technique? [closed]

December 21, 2022 by Tarik

Note: the code behind this answer can be found here. Suppose we have some data sampled from two different groups, red and blue: Here, we can see which data point belongs to the red or blue group. This makes it easy to find the parameters that characterise each group. For example, the mean of the … Read more
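When the group of each point is hidden, Expectation Maximization alternates between guessing soft group memberships (E-step) and re-estimating each group's parameters from those guesses (M-step). The following is a minimal one-dimensional sketch for two Gaussian groups, not the code linked from the answer; the function names, the crude initial guesses, and the synthetic "red" and "blue" samples are all assumptions for illustration.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def em_two_gaussians(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Crude starting guesses (an assumption): the extremes of the data.
    means = [min(data), max(data)]
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: probability that each point belongs to each group.
        resp = []
        for x in data:
            likes = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: weighted re-estimates of mean, spread, and mixing weight.
        for g in range(2):
            r_sum = sum(r[g] for r in resp)
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / r_sum
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / r_sum
            stds[g] = max(math.sqrt(var), 1e-6)  # guard against collapse
            weights[g] = r_sum / len(data)
    return means, stds, weights


# Synthetic data (hypothetical): "red" centred at 0, "blue" centred at 6.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(6, 1) for _ in range(200)]

means, stds, weights = em_two_gaussians(data)
print("means:", [round(m, 2) for m in means])
print("stds:", [round(s, 2) for s in stds])
print("weights:", [round(w, 2) for w in weights])
```

With groups this well separated, the estimated means should land close to the true centres even though the colour of each point was never given to the algorithm.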

How do I determine k when using k-means clustering?

November 20, 2022 by Tarik

You can maximize the Bayesian Information Criterion (BIC):

    BIC(C | X) = L(X | C) - (p / 2) * log n

where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in … Read more
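As a worked illustration of the formula, the sketch below scores two candidate models. The helper name `bic` and every number in the example are made up for illustration, and the logarithm is assumed to be the natural log.

```python
import math


def bic(log_likelihood, num_params, num_points):
    """BIC in the 'larger is better' form quoted above:
    L(X | C) - (p / 2) * log n, with a natural logarithm (assumed)."""
    return log_likelihood - (num_params / 2) * math.log(num_points)


# Hypothetical candidates fitted to the same n = 1000 points.
n = 1000
model_a = bic(log_likelihood=-2500.0, num_params=6, num_points=n)
model_b = bic(log_likelihood=-2490.0, num_params=12, num_points=n)

print("model A:", round(model_a, 2))
print("model B:", round(model_b, 2))
print("preferred:", "A" if model_a > model_b else "B")
```

In this made-up case, model B fits slightly better but pays a larger penalty for its extra parameters, so model A has the higher BIC. That trade-off is what makes the criterion usable for choosing k.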

Difference between classification and clustering in data mining? [closed]

October 16, 2022 by Tarik

In general, in classification you have a set of predefined classes and want to know which class a new object belongs to. Clustering tries to group a set of objects and find whether there is some relationship between the objects. In the context of machine learning, classification is supervised learning and clustering is unsupervised learning. … Read more

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

October 12, 2022 by Tarik

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function. Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric?

    #!/usr/bin/env python
    # kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
    # kmeanssample … Read more
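The general idea of plugging a distance function into k-means can be sketched without any libraries. This is not the kmeans.py from the answer: the function `kmeans`, its `distance` parameter, the farthest-point seeding, and the toy data are all assumptions for illustration. One caveat: the centroid update still takes the arithmetic mean, which is only the natural minimizer for squared Euclidean distance, so with other metrics this is a heuristic.

```python
def manhattan(a, b):
    """City-block distance, used here as the example user-supplied metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, k, distance, max_iter=100):
    """Lloyd-style k-means whose assignment step uses `distance`."""
    # Deterministic farthest-point seeding (an assumption for this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(distance(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the supplied metric.
        new_labels = [
            min(range(k), key=lambda i: distance(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: coordinate-wise mean of each cluster's members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Toy data (hypothetical): two small, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centroids = kmeans(points, k=2, distance=manhattan)
print(labels)
print(centroids)
```

Any callable taking two points and returning a nonnegative number can be passed as `distance`, including a function from a library.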

Cluster analysis in R: determine the optimal number of clusters

September 16, 2022 by Tarik

If your question is “how can I determine how many clusters are appropriate for a kmeans analysis of my data?”, then here are some options. The Wikipedia article on determining numbers of clusters has a good review of some of these methods. First, some reproducible data (the data in the Q are… unclear to me): … Read more