Kmeans without knowing the number of clusters? [duplicate]

One approach is cross-validation. In essence, you pick a subset of your data and cluster it into k clusters, and then you ask how well that clustering carries over to the rest of the data: are the remaining points assigned to the same cluster memberships, or do they fall into different clusters? If the memberships are roughly … Read more
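A minimal sketch of this idea, assuming scikit-learn's KMeans on synthetic make_blobs data; the half/half split and the adjusted_rand_score stability criterion are illustrative choices, not the only way to run such a check:

```python
# Sketch: pick k by checking how stable cluster memberships are across
# two halves of the data (a cross-validation-style stability check).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)
rng = np.random.default_rng(0)

for k in range(2, 8):
    idx = rng.permutation(len(X))
    a, b = X[idx[:300]], X[idx[300:]]
    # Cluster each half separately, then compare: do the points of half b
    # get the same memberships whether we cluster them directly or just
    # assign them to the nearest centroid learned from half a?
    km_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(a)
    km_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit(b)
    transferred = km_a.predict(b)   # memberships induced by the other half
    native = km_b.labels_           # memberships from clustering directly
    print(k, adjusted_rand_score(transferred, native))
# A k whose agreement score stays high across repeats is a stable choice.
```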

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

Write the code yourself; then it fits your problem best! Boilerplate: never assume code you download from the net is correct or optimal… make sure to fully understand it before using it. %matplotlib inline from numpy import array, linspace from sklearn.neighbors import KernelDensity from matplotlib.pyplot import plot a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1) kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a) … Read more
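The excerpt cuts off after the fit, so here is one plausible continuation as a self-contained sketch (an assumption on my part, not the post's verbatim code): evaluate the fitted density on a grid and split the 1D data at local minima of the density.

```python
# Hedged continuation sketch: cut 1D data at the valleys of the fitted
# kernel density estimate. The grid range and bandwidth match the excerpt.
import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a)

s = np.linspace(0, 50, 500).reshape(-1, 1)
log_dens = kde.score_samples(s)                 # log-density on the grid
minima = argrelextrema(log_dens, np.less)[0]    # indices of density valleys
cut_points = s[minima].ravel()                  # 1D cluster boundaries
print("cut points:", cut_points)

# Assign each value to the segment between consecutive cut points.
labels = np.searchsorted(cut_points, a.ravel())
print("labels:", labels)
```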

1D Number Array Clustering

Don’t use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier. In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization. You might want to look at Jenks … Read more
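To make the sorting point concrete, here is a minimal sketch of my own (a simple gap-based stand-in for natural-breaks segmentation, not Jenks' exact algorithm): sort the values, then cut at the k-1 widest gaps between neighbors.

```python
# Illustrative sketch: exploit the sortability of 1D data by cutting at
# the k-1 largest gaps between consecutive sorted values.
import numpy as np

def gap_segmentation(values, k):
    order = np.argsort(values)
    x = np.asarray(values)[order]
    gaps = np.diff(x)                      # distances between sorted neighbors
    cuts = (np.sort(np.argsort(gaps)[-(k - 1):])
            if k > 1 else np.array([], dtype=int))
    labels_sorted = np.zeros(len(x), dtype=int)
    for c in cuts:
        labels_sorted[c + 1:] += 1         # a new segment starts after each cut
    labels = np.empty(len(x), dtype=int)
    labels[order] = labels_sorted          # map back to the input order
    return labels

print(gap_segmentation([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], k=3))
# -> the 9-12 run, the 20-23 run, and 45 fall into three segments
```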

How to calculate the regularization parameter in linear regression

The regularization parameter (lambda) is an input to your model, so what you probably want to know is how to select the value of lambda. The regularization parameter reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda … Read more
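A minimal sketch of selecting lambda by cross-validation, assuming ridge regression in scikit-learn (where lambda is named alpha); the log-spaced grid is an illustrative choice:

```python
# Sketch: pick the regularization strength by cross-validated fit quality.
# Note that scikit-learn calls lambda "alpha" in its linear models.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)              # candidate lambdas on a log grid
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected lambda:", model.alpha_)
```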

scikit-learn: Predicting new points with DBSCAN

While Anony-Mousse has some good points (clustering is indeed not classifying), I think the ability to assign new points has its uses. Based on the original DBSCAN paper and robertlayton’s ideas on github.com/scikit-learn, I suggest running through the core points and assigning the new point to the cluster of the first core point that is within eps … Read more
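A minimal sketch of that idea, assuming scikit-learn's DBSCAN; dbscan_predict is a hypothetical helper name of mine, not part of scikit-learn's API:

```python
# Sketch: assign a new point to the cluster of the first fitted core point
# within eps, otherwise treat it as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_predict(model, X_new):
    labels = np.full(len(X_new), -1, dtype=int)
    core_labels = model.labels_[model.core_sample_indices_]
    for i, x in enumerate(X_new):
        dists = np.linalg.norm(model.components_ - x, axis=1)
        hit = np.flatnonzero(dists < model.eps)
        if hit.size:
            labels[i] = core_labels[hit[0]]   # first core point within eps
    return labels

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1]])
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(dbscan_predict(db, np.array([[1.05, 1.0], [50.0, 50.0]])))  # e.g. [0, -1]
```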

How many principal components to take?

To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it to reduce storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don’t have any strict constraints, I recommend plotting the cumulative sum of eigenvalues (assuming … Read more
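A minimal sketch of that plot, assuming scikit-learn's PCA (which exposes the normalized eigenvalue spectrum as explained_variance_ratio_); the 95% threshold is an illustrative choice, not a universal rule:

```python
# Sketch: keep enough components to explain a chosen fraction of variance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.95)) + 1
print("components for 95% variance:", n_keep)

plt.plot(np.arange(1, len(cumvar) + 1), cumvar)
plt.axhline(0.95, linestyle="--")
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()
```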

Decision tree vs. Naive Bayes classifier [closed]

Decision Trees are very flexible, easy to understand, and easy to debug. They work with both classification and regression problems, so whether you are trying to predict a categorical value like (red, green, up, down) or a continuous value like 2.9 or 3.4, Decision Trees will handle both … Read more
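A minimal sketch of that point, assuming scikit-learn and toy data of my own: the same tree family covers a categorical target (DecisionTreeClassifier) and a continuous one (DecisionTreeRegressor).

```python
# Sketch: one model family, both prediction tasks.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_class = ["red", "red", "green", "green"]   # categorical target
y_reg = [2.9, 3.4, 5.1, 6.0]                 # continuous target

clf = DecisionTreeClassifier().fit(X, y_class)
reg = DecisionTreeRegressor().fit(X, y_reg)

print(clf.predict([[2.5, 2.5]]))  # e.g. ['green']
print(reg.predict([[2.5, 2.5]]))  # e.g. [5.1]
```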
