k-means – Page 2 – Tarik Billa

How to get the samples in each cluster?

May 26, 2023 by Tarik

I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the cluster labels as columns.

    data = pd.read_csv('filename')
    km = KMeans(n_clusters=5).fit(data)
    cluster_map = pd.DataFrame()
    cluster_map['data_index'] = data.index.values
    cluster_map['cluster'] = km.labels_

Once the DataFrame is available, it is quite easy to filter. For example, to filter … Read more
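The excerpt stops before showing the filter itself, so the following is only a minimal sketch of what that step might look like. It assumes scikit-learn and pandas are installed, replaces the CSV file with synthetic data, and picks cluster 3 arbitrarily; the `n_init` and `random_state` arguments are added here for reproducibility and are not part of the original snippet.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for pd.read_csv('filename'): 60 rows, 2 numeric columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(60, 2)), columns=["x", "y"])

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

# Map each row index of the dataset to its cluster label.
cluster_map = pd.DataFrame()
cluster_map["data_index"] = data.index.values
cluster_map["cluster"] = km.labels_

# Row indices of the samples assigned to cluster 3 (an arbitrary choice).
cluster_3_index = cluster_map.loc[cluster_map["cluster"] == 3, "data_index"]

# The corresponding rows of the original data.
cluster_3_samples = data.loc[cluster_3_index]
```

Because `cluster_map` keeps the original index, the same pattern works for any label: change the `3` to the cluster of interest.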

Simple approach to assigning clusters for new data after k-means clustering

May 22, 2023 by Tarik

Python k-means algorithm

May 10, 2023 by Tarik

Update: (Eleven years after this original answer, it’s probably time for an update.) First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. Beyond the graphic, I’d suggest looking especially at the parameters that each method requires and deciding whether you can provide the … Read more

Scikit Learn – K-Means – Elbow – criterion

April 27, 2023 by Tarik

If the true labels are not known in advance (as in your case), then K-Means clustering can be evaluated using either the Elbow Criterion or the Silhouette Coefficient. Elbow Criterion Method: The idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (num_clusters, e.g. k = 1 to 10), and … Read more
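Both evaluation approaches named above can be sketched in a few lines. This is an illustrative example rather than the post's own code: it assumes scikit-learn is installed, uses synthetic blobs from `make_blobs`, and records the within-cluster sum of squares (`inertia_`) as the quantity to inspect for an elbow, which is a common choice but not one the truncated excerpt confirms.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}      # k -> within-cluster sum of squares (elbow criterion)
silhouettes = {}   # k -> mean silhouette coefficient (defined for k >= 2)

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    if k >= 2:
        silhouettes[k] = silhouette_score(X, km.labels_)

# The silhouette coefficient is maximized; the elbow is read off by eye
# from a plot of inertias against k.
best_k_by_silhouette = max(silhouettes, key=silhouettes.get)

for k in sorted(inertias):
    print(k, round(inertias[k], 1), round(silhouettes.get(k, float("nan")), 3))
```

Inertia always falls as k grows, so the elbow is the point where the drop flattens out, whereas the silhouette coefficient gives a single value per k that can be compared directly.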

K-means algorithm variation with equal cluster size

April 2, 2023 by Tarik

This might do the trick: apply Lloyd’s algorithm to get k centroids. Sort the centroids by descending size of their associated clusters in an array. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j and … Read more

Will scikit-learn utilize GPU?

December 30, 2022 by Tarik

TensorFlow only uses the GPU if it is built against CUDA and cuDNN. By default it does not use the GPU, especially if it is running inside Docker, unless you use nvidia-docker and an image with built-in GPU support. Scikit-learn is not intended to be used as a deep-learning framework and it does not provide any GPU … Read more

How do I determine k when using k-means clustering?

November 20, 2022 by Tarik

You can maximize the Bayesian Information Criterion (BIC): BIC(C | X) = L(X | C) - (p / 2) * log n, where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in … Read more
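As a rough illustration of the formula, the sketch below evaluates it for several values of k and keeps the largest score. The excerpt does not say how the log-likelihood is obtained for k-means, so this example swaps in scikit-learn's `GaussianMixture` as the model C; that substitution, the synthetic data, and the helper name `bic_to_maximize` are all assumptions made for the example. Note that scikit-learn's own `bic` method uses the convention -2 L + p log n (lower is better), which equals -2 times the quantity defined above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


def bic_to_maximize(log_likelihood, p, n):
    """BIC(C | X) = L(X | C) - (p / 2) * log n, as written in the text."""
    return log_likelihood - (p / 2) * np.log(n)


# Synthetic data: three well-separated blobs (illustration only).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=0,
)
n, d = X.shape

models = {}
scores = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # score() is the mean per-sample log-likelihood, so multiply by n.
    log_likelihood = gm.score(X) * n
    # Free parameters of a full-covariance mixture:
    # k means (d each), k covariances (d(d+1)/2 each), k - 1 weights.
    p = k * d + k * d * (d + 1) // 2 + (k - 1)
    models[k] = gm
    scores[k] = bic_to_maximize(log_likelihood, p, n)

best_k = max(scores, key=scores.get)
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

The penalty term (p / 2) * log n grows with k, so adding components only pays off while the log-likelihood improves by more than the penalty.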

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

October 12, 2022 by Tarik

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function. Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric?

    #!/usr/bin/env python
    # kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
    # kmeanssample … Read more
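The post's own script is cut off in this excerpt, so what follows is not that code. It is a separate, minimal sketch of the general idea, assuming NumPy and SciPy are installed: a Lloyd-style loop in which the assignment step calls `scipy.spatial.distance.cdist` with either a metric name or a user-supplied function. The function name `kmeans_with_metric` and its parameters are hypothetical. The centre update uses the plain mean, which is only the true minimizer for squared Euclidean distance; for other metrics it is a heuristic.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans_with_metric(X, k, metric="euclidean", max_iter=100, seed=0):
    """Lloyd-style k-means whose assignment step uses any cdist metric.

    `metric` may be a name accepted by cdist or a callable f(u, v) -> float.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct data points.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        distances = cdist(X, centres, metric=metric)  # shape (N, k)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return centres, labels


# Two tight synthetic blobs (illustration only).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])

# A built-in metric, selected by name.
centres, labels = kmeans_with_metric(X, k=2, metric="cityblock")


# A user function: the largest coordinate-wise difference.
def max_abs_difference(u, v):
    return np.max(np.abs(u - v))


centres_user, labels_user = kmeans_with_metric(X, k=2, metric=max_abs_difference)
```

A callable metric is invoked once per pair of points in Python, so it is much slower than a named metric on large inputs.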

Cluster analysis in R: determine the optimal number of clusters

September 16, 2022 by Tarik

If your question is “how can I determine how many clusters are appropriate for a kmeans analysis of my data?”, then here are some options. The Wikipedia article on determining numbers of clusters has a good review of some of these methods. First, some reproducible data (the data in the question are… unclear to me): … Read more