cluster-analysis – Tarik Billa

python scikit-learn clustering with missing data

January 6, 2024 by Tarik

I think you can use an iterative EM-type algorithm:

1. Initialize missing values to their column means.
2. Repeat until convergence:
   - Perform K-means clustering on the filled-in data.
   - Set the missing values to the centroid coordinates of the clusters to which they were assigned.

Implementation:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_missing(X, n_clusters, max_iter=10): … Read more
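The excerpt above is cut off before the function body, so the following is only a minimal sketch of the loop it describes, not the original implementation. The function name `impute_and_cluster`, the warm start from the previous centroids, and the stopping rule (labels unchanged between passes) are assumptions made for illustration. It also assumes every column has at least one observed value.

```python
import numpy as np
from sklearn.cluster import KMeans


def impute_and_cluster(X, n_clusters, max_iter=10, random_state=0):
    """Hypothetical helper: alternate k-means with centroid-based imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start by filling each missing cell with the mean of its column.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)

    centroids = None
    prev_labels = None
    for _ in range(max_iter):
        if centroids is None:
            km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        else:
            # Assumption: warm-start from the previous centroids so that
            # cluster ids stay comparable from one pass to the next.
            km = KMeans(n_clusters=n_clusters, init=centroids, n_init=1)
        labels = km.fit_predict(X_filled)
        centroids = km.cluster_centers_

        # Overwrite only the missing cells with their cluster's centroid values.
        X_filled[missing] = centroids[labels][missing]

        # Assumed stopping rule: assignments did not change since the last pass.
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break
        prev_labels = labels

    return labels, centroids, X_filled
```

Observed entries are never modified; only the cells that were `NaN` in the input are updated on each pass.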

How can I find the center of a cluster of data points?

January 5, 2024 by Tarik

The following solution works even if the points are scattered all over the Earth, by converting latitude and longitude to Cartesian coordinates. It does a kind of KDE (kernel density estimation), but in a first pass the sum of kernels is evaluated only at the data points. The kernel should be chosen to fit the … Read more
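As a rough illustration of the first pass described above, the sketch below converts latitude and longitude to unit vectors and evaluates a kernel sum only at the data points. The Gaussian kernel, the 50 km bandwidth, the use of chord distance, and the function names are all assumptions; the excerpt is truncated before it says how the kernel should be chosen.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3-D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def densest_point(points, bandwidth_km=50.0, earth_radius_km=6371.0):
    """Return the input (lat, lon) pair with the largest kernel sum.

    Assumption: a Gaussian kernel on the straight-line (chord) distance.
    """
    vecs = [to_unit_vector(lat, lon) for lat, lon in points]
    best_idx, best_score = 0, -1.0
    for i, a in enumerate(vecs):
        score = 0.0
        for b in vecs:
            chord_km = earth_radius_km * math.dist(a, b)
            score += math.exp(-0.5 * (chord_km / bandwidth_km) ** 2)
        if score > best_score:
            best_idx, best_score = i, score
    return points[best_idx]
```

Working in Cartesian coordinates avoids the wrap-around problems that averaging raw longitudes has near the antimeridian and the poles. The double loop is quadratic in the number of points, which is acceptable for a sketch but not for large data sets.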

What is the difference between “k means” and “fuzzy c means” objective functions?

January 5, 2024 by Tarik

BTW, the Fuzzy-C-Means (FCM) clustering algorithm is also known as Soft K-Means. The objective functions are virtually identical, the only difference being the introduction of a vector that expresses the degree to which a given point belongs to each of the clusters. This vector is raised to a “stiffness” exponent aimed at giving more importance … Read more
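For reference, the two objectives can be written side by side in their common textbook form. The symbols here are assumptions of this note rather than notation from the original answer: N points x_i, C centroids c_j, hard assignments r_ij, memberships u_ij, and stiffness exponent m.

```latex
% Hard k-means: r_{ij} is 1 if point x_i is assigned to cluster j, else 0.
J_{\text{KM}} = \sum_{i=1}^{N} \sum_{j=1}^{C} r_{ij} \, \lVert x_i - c_j \rVert^2,
\qquad r_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{C} r_{ij} = 1

% Fuzzy c-means: u_{ij} is a graded membership, raised to the exponent m > 1.
J_{\text{FCM}} = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{\,m} \, \lVert x_i - c_j \rVert^2,
\qquad u_{ij} \in [0, 1], \quad \sum_{j=1}^{C} u_{ij} = 1
```

The only structural change is that the binary indicator becomes a graded membership raised to the power m.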

Python Implementation of OPTICS (Clustering) Algorithm

December 27, 2023 by Tarik

I’m not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be just rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or, more likely, even O(n^3). OPTICS has a number of tricky things besides the obvious idea. … Read more

Reordering matrix elements to reflect column and row clustering in naive python [duplicate]

December 17, 2023 by Tarik

I’m not sure I completely understand, but it appears you are trying to re-index each axis of the array based on sorts of the dendrogram indices. I guess that assumes there is some comparative logic in each branch delineation. If this is the case, then would this work?

>>> x_idxs = [(0,1,0,0),(0,1,1,1),(0,1,1),(0,0,1),(1,1,1,1),(0,0,0,0)]
>>> y_idxs = [(1,1),(0,1),(1,0),(0,0)] … Read more

Clustering Algorithm for Mapping Application

December 17, 2023 by Tarik

For a virtual earth application I’ve used the clustering described here. It’s lightning fast and easily extensible.

Which machine learning library to use [closed]

December 15, 2023 by Tarik

There are only a few ML libraries that I have used enough to be comfortable recommending them; dlib ml is certainly one of them. SourceForge download here, and bleeding-edge check-out:

hg clone http://hg.code.sf.net/p/dclib/code dclib-code

The original library creator and current maintainer is Davis King. Your wishlist versus the relevant dlib features: good documentation: … Read more

Scikit Learn GridSearchCV without cross validation (unsupervised learning)

December 11, 2023 by Tarik

After much searching, I was able to find this thread. It appears that you can get rid of cross validation in GridSearchCV if you use: cv=[(slice(None), slice(None))] I have tested this against my own coded version of grid search without cross validation and I get the same results from both methods. I am posting this … Read more
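A small sketch of how that `cv` value might be used with an unsupervised estimator follows. The toy data, the choice of `KMeans`, and the parameter grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

# Toy data (assumed): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)]
)

# A single "fold" whose train and test indices both select every row,
# so each candidate is fitted and scored on the full data set.
full_data_split = [(slice(None), slice(None))]

search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    cv=full_data_split,
)
search.fit(X)  # no y is needed for an unsupervised estimator

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

With no `scoring` argument, the search falls back on `KMeans.score`, which is the negative inertia and therefore tends to favor larger `n_clusters`. In practice a custom scorer would probably be passed instead.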

DBSCAN for clustering of geographic location data

December 8, 2023 by Tarik

You can cluster spatial latitude-longitude data with scikit-learn’s DBSCAN without precomputing a distance matrix.

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))

This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2 km, but it’s divided by 6371 to convert it to radians. Also, notice that … Read more
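To make the unit conversion concrete, here is a dependency-free sketch of the haversine distance in radians and of the same `eps` arithmetic. The helper name and the sample coordinates are assumptions for illustration; 6371 km is the mean Earth radius used in the call above.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance, in radians, between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (
        math.sin((phi2 - phi1) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2
    )
    return 2 * math.asin(math.sqrt(a))


# A 2 km neighborhood radius expressed as an angle on the unit sphere.
eps_km = 2.0
eps_rad = eps_km / EARTH_RADIUS_KM

# Two sample points 0.01 degrees of latitude apart (roughly 1.1 km).
d_rad = haversine_radians(48.85, 2.35, 48.86, 2.35)
d_km = d_rad * EARTH_RADIUS_KM
is_neighbor = d_rad <= eps_rad
```

Because the haversine metric works on angles, both the coordinates and `eps` have to be in radians. Multiplying a returned distance by the Earth radius converts it back to kilometers.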

Cluster one-dimensional data optimally? [closed]

December 7, 2023 by Tarik