cluster-analysis
Grid search for hyperparameter evaluation of clustering in scikit-learn
The clusteval library helps you evaluate the data and find the optimal number of clusters. It contains five methods for evaluating clusterings: silhouette, dbindex, derivative, dbscan and hdbscan (pip install clusteval). Which evaluation method to choose depends on your data. # Import library from clusteval import clusteval … Read more
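The idea behind silhouette-based evaluation can be sketched in plain scikit-learn without clusteval: run the clusterer over a grid of candidate cluster counts and keep the one with the best silhouette score. This is a minimal sketch of that grid search, using synthetic data; it illustrates the technique, not clusteval's own API.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.5, random_state=0)

# Grid search over the number of clusters, scored by silhouette.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # the k with the highest silhouette
```

The same loop works for any clusterer that produces labels; silhouette is only one of the criteria the snippet above lists (dbindex, derivative, etc.).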
Calculating the percentage of variance measure for k-means?
The distortion, as far as k-means is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence. If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB using the pdist2 function, … Read more
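For readers not using MATLAB, the same quantities translate directly to NumPy. This sketch computes the within-cluster sum of squares (the distortion) from points and centroids, and the "percentage of variance explained" as its complement relative to the total sum of squares; scikit-learn's inertia_ is used only as a cross-check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.5, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Within-cluster sum of squares (the distortion): squared distance of
# every point to the centroid it is assigned to.
wss = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

# Total sum of squares around the global mean.
tss = ((X - X.mean(axis=0)) ** 2).sum()

# Fraction of variance "explained" by the clustering.
explained = 1.0 - wss / tss
```

Plotting `explained` against k gives the usual elbow curve for choosing the number of clusters.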
How Could One Implement the K-Means++ Algorithm?
Interesting question. Thank you for bringing this paper to my attention – K-Means++: The Advantages of Careful Seeding In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers. Here is … Read more
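The seeding scheme described above is short enough to write out directly. This is a sketch of k-means++ initialization in NumPy (the helper name kmeans_pp_init is ours, not from any library): the first center is uniform at random, and each subsequent center is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: pick k initial centers, favoring points
    that are far from every center already chosen."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        c = np.array(centers)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Sample the next center with probability proportional to d2;
        # already-chosen points have d2 == 0 and cannot be re-picked.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

After seeding, the centers are handed to ordinary Lloyd-iteration k-means; scikit-learn's KMeans uses this same init="k-means++" strategy by default.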
How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
Write code yourself. Then it fits your problem best! Boilerplate: Never assume code you download from the net to be correct or optimal… make sure to fully understand it before using it. %matplotlib inline from numpy import array, linspace from sklearn.neighbors import KernelDensity from matplotlib.pyplot import plot a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1) kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a) … Read more
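Continuing the snippet's setup, one way to turn the density estimate into clusters is to evaluate it on a grid and split the data at the local minima of the density. This is a sketch of that idea (the split points and np.digitize step are our addition, not part of the original answer):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a)

# Evaluate the estimated density on a fine grid.
s = np.linspace(0, 50, 500)
density = np.exp(kde.score_samples(s.reshape(-1, 1)))

# Local minima of the density are natural cluster boundaries.
interior = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
boundaries = s[1:-1][interior]

# Assign each point to the segment between boundaries it falls into.
labels = np.digitize(a.ravel(), boundaries)
```

The bandwidth controls how aggressively nearby bumps merge: a larger bandwidth yields fewer, coarser clusters.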
1D Number Array Clustering
Don’t use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier. In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization. You might want to look at Jenks … Read more
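To make the point concrete: because 1-D data can be sorted, a crude but effective segmentation is to cut at the largest gaps between consecutive sorted values. This is a sketch of that idea (the natural_breaks helper is ours, a cheap stand-in for proper Jenks natural breaks; it is exact only when groups are well separated):

```python
import numpy as np

def natural_breaks(values, k):
    """Split sorted 1-D data into k groups at the k-1 widest gaps."""
    x = np.sort(np.asarray(values))
    gaps = np.diff(x)                              # gap after each sorted value
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])    # positions of the widest gaps
    return np.split(x, cuts + 1)
```

Real Jenks optimization minimizes within-group variance rather than just cutting the widest gaps, but for cleanly separated 1-D data the two agree, and sorting makes both trivially cheap compared to running a multidimensional clusterer.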
Plot dendrogram using sklearn.AgglomerativeClustering
Here is a simple function for taking a hierarchical clustering model from sklearn and plotting it using the scipy dendrogram function. Seems like graphing functions are often not directly supported in sklearn. You can find an interesting discussion of that related to the pull request for this plot_dendrogram code snippet here. I’d clarify that the … Read more
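The core of that bridge between sklearn and scipy is building a scipy-style linkage matrix from the fitted model's children_ and distances_ attributes. This is a sketch of that conversion (the helper name linkage_from_model is ours; the model must be fitted with distance_threshold=0 and n_clusters=None so that distances_ is populated):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def linkage_from_model(model):
    """Build a scipy linkage matrix [child1, child2, distance, count]
    from a fitted AgglomerativeClustering model."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        count = 0
        for child in merge:
            # Indices < n_samples are leaves; larger ones are earlier merges.
            count += 1 if child < n_samples else counts[child - n_samples]
        counts[i] = count
    return np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
```

The result can be handed straight to scipy.cluster.hierarchy.dendrogram, e.g. dendrogram(linkage_from_model(model)).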
How to get the samples in each cluster?
I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the labels as columns. data = pd.read_csv('filename') km = KMeans(n_clusters=5).fit(data) cluster_map = pd.DataFrame() cluster_map['data_index'] = data.index.values cluster_map['cluster'] = km.labels_ Once the DataFrame is available, it is quite easy to filter. For example, to filter … Read more
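A self-contained version of that pattern, using synthetic data in place of the CSV, looks like this; the boolean-mask filter at the end is the step the snippet is leading up to (cluster 0 here is just an example label):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
data = pd.DataFrame(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Map each row index of the dataset to its cluster label.
cluster_map = pd.DataFrame({"data_index": data.index.values,
                            "cluster": km.labels_})

# All rows assigned to a given cluster, via an ordinary boolean mask.
members_of_0 = cluster_map[cluster_map["cluster"] == 0]
```

The data_index column then indexes back into the original DataFrame, e.g. data.loc[members_of_0["data_index"]] recovers the actual samples in that cluster.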
scikit-learn: Predicting new points with DBSCAN
While Anony-Mousse has some good points (clustering is indeed not classifying), I think the ability to assign new points has its usefulness. * Based on the original paper on DBSCAN and robertlayton's ideas on github.com/scikit-learn, I suggest running through the core points and assigning to the cluster of the first core point that is within eps … Read more
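That assignment rule is easy to express against a fitted scikit-learn DBSCAN, since the model exposes its core samples via components_, core_sample_indices_, and labels_. This is a sketch (the dbscan_predict helper is ours, not part of scikit-learn; it assigns each new point to the cluster of its nearest core point within eps, else labels it noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def dbscan_predict(db, X_new):
    """Label new points with a fitted DBSCAN model: cluster of the
    nearest core sample if it lies within eps, otherwise -1 (noise)."""
    labels = np.full(len(X_new), -1)
    for i, x in enumerate(np.asarray(X_new)):
        dists = np.linalg.norm(db.components_ - x, axis=1)
        j = np.argmin(dists)
        if dists[j] <= db.eps:
            labels[i] = db.labels_[db.core_sample_indices_[j]]
    return labels

X, _ = make_blobs(n_samples=100, centers=[[0, 0], [10, 10]],
                  cluster_std=0.3, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
```

Note this is only an approximation of re-running DBSCAN: a new point can never create new core points or merge clusters, which is exactly why it behaves like prediction rather than re-clustering.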