cluster-analysis
Grid search for hyperparameter evaluation of clustering in scikit-learn
The clusteval library helps you evaluate the data and find the optimal number of clusters. It contains five methods for evaluating clusterings: silhouette, dbindex, derivative, dbscan and hdbscan (pip install clusteval). Which evaluation method to choose depends on your data. # Import library from clusteval import clusteval … Read more
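The idea behind silhouette-based evaluation can be sketched in plain scikit-learn without clusteval: run the clusterer over a grid of candidate cluster counts and keep the one with the best silhouette score. This is a minimal sketch of that grid search, using synthetic data; it illustrates the technique, not clusteval's own API.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.5, random_state=0)

# Grid search over the number of clusters, scored by silhouette.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # the k with the highest silhouette
```

The same loop works for any clusterer that produces labels; silhouette is only one of the criteria the snippet above lists (dbindex, derivative, etc.).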
Calculating the percentage of variance measure for k-means?
The distortion, as far as k-means is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence. If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB using the pdist2 function, … Read more
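For readers not using MATLAB, the same quantities translate directly to NumPy. This sketch computes the within-cluster sum of squares (the distortion) from points and centroids, and the "percentage of variance explained" as its complement relative to the total sum of squares; scikit-learn's inertia_ is used only as a cross-check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.5, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Within-cluster sum of squares (the distortion): squared distance of
# every point to the centroid it is assigned to.
wss = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

# Total sum of squares around the global mean.
tss = ((X - X.mean(axis=0)) ** 2).sum()

# Fraction of variance "explained" by the clustering.
explained = 1.0 - wss / tss
```

Plotting `explained` against k gives the usual elbow curve for choosing the number of clusters.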
How Could One Implement the K-Means++ Algorithm?
Interesting question. Thank you for bringing this paper to my attention – K-Means++: The Advantages of Careful Seeding In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers. Here is … Read more
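The seeding scheme described above is short enough to write out directly. This is a sketch of k-means++ initialization in NumPy (the helper name kmeans_pp_init is ours, not from any library): the first center is uniform at random, and each subsequent center is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: pick k initial centers, favoring points
    that are far from every center already chosen."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        c = np.array(centers)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Sample the next center with probability proportional to d2;
        # already-chosen points have d2 == 0 and cannot be re-picked.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

After seeding, the centers are handed to ordinary Lloyd-iteration k-means; scikit-learn's KMeans uses this same init="k-means++" strategy by default.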
How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
Write code yourself. Then it fits your problem best! Boilerplate: Never assume code you download from the net to be correct or optimal… make sure to fully understand it before using it. %matplotlib inline from numpy import array, linspace from sklearn.neighbors import KernelDensity from matplotlib.pyplot import plot a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1) kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a) … Read more
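Continuing the snippet's setup, one way to turn the density estimate into clusters is to evaluate it on a grid and split the data at the local minima of the density. This is a sketch of that idea (the split points and np.digitize step are our addition, not part of the original answer):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a)

# Evaluate the estimated density on a fine grid.
s = np.linspace(0, 50, 500)
density = np.exp(kde.score_samples(s.reshape(-1, 1)))

# Local minima of the density are natural cluster boundaries.
interior = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
boundaries = s[1:-1][interior]

# Assign each point to the segment between boundaries it falls into.
labels = np.digitize(a.ravel(), boundaries)
```

The bandwidth controls how aggressively nearby bumps merge: a larger bandwidth yields fewer, coarser clusters.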
1D Number Array Clustering
Don’t use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier. In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization. You might want to look at Jenks … Read more
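To make the point concrete: because 1-D data can be sorted, a crude but effective segmentation is to cut at the largest gaps between consecutive sorted values. This is a sketch of that idea (the natural_breaks helper is ours, a cheap stand-in for proper Jenks natural breaks; it is exact only when groups are well separated):

```python
import numpy as np

def natural_breaks(values, k):
    """Split sorted 1-D data into k groups at the k-1 widest gaps."""
    x = np.sort(np.asarray(values))
    gaps = np.diff(x)                              # gap after each sorted value
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])    # positions of the widest gaps
    return np.split(x, cuts + 1)
```

Real Jenks optimization minimizes within-group variance rather than just cutting the widest gaps, but for cleanly separated 1-D data the two agree, and sorting makes both trivially cheap compared to running a multidimensional clusterer.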
Plot dendrogram using sklearn.AgglomerativeClustering
Here is a simple function for taking a hierarchical clustering model from sklearn and plotting it using the scipy dendrogram function. Seems like graphing functions are often not directly supported in sklearn. You can find an interesting discussion of that related to the pull request for this plot_dendrogram code snippet here. I’d clarify that the … Read more
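The core of that bridge between sklearn and scipy is building a scipy-style linkage matrix from the fitted model's children_ and distances_ attributes. This is a sketch of that conversion (the helper name linkage_from_model is ours; the model must be fitted with distance_threshold=0 and n_clusters=None so that distances_ is populated):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def linkage_from_model(model):
    """Build a scipy linkage matrix [child1, child2, distance, count]
    from a fitted AgglomerativeClustering model."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        count = 0
        for child in merge:
            # Indices < n_samples are leaves; larger ones are earlier merges.
            count += 1 if child < n_samples else counts[child - n_samples]
        counts[i] = count
    return np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
```

The result can be handed straight to scipy.cluster.hierarchy.dendrogram, e.g. dendrogram(linkage_from_model(model)).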
How to get the samples in each cluster?
I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the labels as columns. data = pd.read_csv('filename') km = KMeans(n_clusters=5).fit(data) cluster_map = pd.DataFrame() cluster_map['data_index'] = data.index.values cluster_map['cluster'] = km.labels_ Once the DataFrame is available, it is quite easy to filter. For example, to filter … Read more
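A self-contained version of that pattern, using synthetic data in place of the CSV, looks like this; the boolean-mask filter at the end is the step the snippet is leading up to (cluster 0 here is just an example label):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
data = pd.DataFrame(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Map each row index of the dataset to its cluster label.
cluster_map = pd.DataFrame({"data_index": data.index.values,
                            "cluster": km.labels_})

# All rows assigned to a given cluster, via an ordinary boolean mask.
members_of_0 = cluster_map[cluster_map["cluster"] == 0]
```

The data_index column then indexes back into the original DataFrame, e.g. data.loc[members_of_0["data_index"]] recovers the actual samples in that cluster.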
scikit-learn: Predicting new points with DBSCAN
While Anony-Mousse has some good points (clustering is indeed not classifying), I think the ability to assign new points has its usefulness. * Based on the original paper on DBSCAN and robertlayton's ideas on github.com/scikit-learn, I suggest running through the core points and assigning to the cluster of the first core point that is within eps … Read more
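That assignment rule is easy to express against a fitted scikit-learn DBSCAN, since the model exposes its core samples via components_, core_sample_indices_, and labels_. This is a sketch (the dbscan_predict helper is ours, not part of scikit-learn; it assigns each new point to the cluster of its nearest core point within eps, else labels it noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def dbscan_predict(db, X_new):
    """Label new points with a fitted DBSCAN model: cluster of the
    nearest core sample if it lies within eps, otherwise -1 (noise)."""
    labels = np.full(len(X_new), -1)
    for i, x in enumerate(np.asarray(X_new)):
        dists = np.linalg.norm(db.components_ - x, axis=1)
        j = np.argmin(dists)
        if dists[j] <= db.eps:
            labels[i] = db.labels_[db.core_sample_indices_[j]]
    return labels

X, _ = make_blobs(n_samples=100, centers=[[0, 0], [10, 10]],
                  cluster_std=0.3, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
```

Note this is only an approximation of re-running DBSCAN: a new point can never create new core points or merge clusters, which is exactly why it behaves like prediction rather than re-clustering.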