Python scikit-learn clustering with missing data

I think you can use an iterative EM-type algorithm: initialize the missing values to their column means, then repeat until convergence: perform K-means clustering on the filled-in data, and set the missing values to the centroid coordinates of the clusters to which they were assigned. Implementation: import numpy as np from sklearn.cluster import KMeans def kmeans_missing(X, n_clusters, max_iter=10): … Read more
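Since the excerpt cuts off before the body of kmeans_missing, here is a minimal sketch of what such a function might look like, following the steps described above; the function name matches the excerpt, but the body and defaults are assumptions:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_missing(X, n_clusters, max_iter=10):
    # X may contain NaNs; alternate between imputation and K-means.
    missing = np.isnan(X)
    X_hat = np.where(missing, np.nanmean(X, axis=0), X)  # start from column means

    prev_labels = None
    for _ in range(max_iter):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_hat)
        labels = km.labels_
        # Overwrite only the missing entries with their cluster's centroid values.
        X_hat[missing] = km.cluster_centers_[labels][missing]
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break  # assignments stopped changing: converged
        prev_labels = labels
    return labels, km.cluster_centers_, X_hat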

Using the predict_proba() function of RandomForestClassifier in the safe and right way

A RandomForestClassifier is a collection of DecisionTreeClassifiers. No matter how big your training set, a decision tree simply returns a decision: one class has probability 1, the other classes have probability 0. The RandomForest simply votes among the results. predict_proba() returns the number of votes for each class (each tree in the forest makes its … Read more
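A minimal sketch of that voting behaviour, using the iris data purely as an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba() averages the per-tree class probabilities; for fully grown
# trees these are essentially hard 0/1 votes, so with 100 trees the output
# moves in steps of roughly 1/100.
print(clf.predict_proba(X[:3]))
print(clf.predict(X[:3]))  # the class with the most votes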

Python scikit-learn (metrics): difference between r2_score and explained_variance_score?

Most of the answers I found (including here) emphasize the difference between R2 and the Explained Variance Score, namely the mean residual (i.e. the mean error). However, there is an important question left behind: why on earth do I need to consider the mean error? Refresher: R2 is the Coefficient of … Read more
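A small illustration of why the mean error matters (the toy numbers here are made up for the example): a constant bias of +1 leaves the variance of the residuals at zero, so explained_variance_score is still 1.0, while r2_score is penalized.

import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0  # systematically biased predictions

print(r2_score(y_true, y_pred))                  # 0.2 -- the bias is penalized
print(explained_variance_score(y_true, y_pred))  # 1.0 -- residual variance is zero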

How to get a non-shuffled train_test_split in sklearn

I’m not adding much to Psidom’s answer except an easy-to-copy-paste function:

import numpy as np

def non_shuffling_train_test_split(X, y, test_size=0.2):
    i = int((1 - test_size) * X.shape[0]) + 1
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test

Update: At some point this feature became built in, so now you can … Read more
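The update presumably refers to the shuffle flag of train_test_split, which does the same thing; a small self-contained sketch:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the original row order, so the test set is simply the tail.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(X_test)  # the last 20% of the rows, untouched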

PCA projection and reconstruction in scikit-learn

You can do proj = pca.inverse_transform(X_train_pca); that way you do not have to worry about how to do the multiplications. What you obtain after pca.fit_transform or pca.transform is what is usually called the “loadings” for each sample, meaning how much of each component you need to describe it best using a linear combination of the … Read more
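A short sketch of the round trip, with the equivalent matrix arithmetic spelled out (the iris data and two components are illustrative assumptions, and the manual formula assumes the default whiten=False):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)  # the per-sample "loadings"

# Reconstruction via the convenience method ...
X_rec = pca.inverse_transform(X_pca)

# ... which undoes the projection by hand: back-project and re-add the mean.
X_rec_manual = X_pca @ pca.components_ + pca.mean_
print(np.allclose(X_rec, X_rec_manual))  # True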

How can I know the probability of the class predicted by the predict() function in Support Vector Machine?

Definitely read this section of the docs, as there are some subtleties involved. See also Scikit-learn predict_proba gives wrong answers. Basically, if you have a multi-class problem with plenty of data, predict_proba, as suggested earlier, works well. Otherwise, you may have to make do with an ordering that doesn’t yield probability scores from decision_function. Here’s a … Read more
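A brief sketch of the two options on the iris data (an illustrative assumption); probability=True fits Platt scaling internally, which adds an extra cross-validation cost:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# With probability=True, calibrated class probabilities are available.
clf = SVC(probability=True, random_state=0).fit(X, y)
print(clf.predict_proba(X[:3]))

# Without it, decision_function only gives a ranking of the classes,
# not probabilities.
clf2 = SVC().fit(X, y)
print(clf2.decision_function(X[:3]))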

Scikit-Learn Linear Regression: how to get coefficients’ respective features?

What I found to work was:

X = your independent variables

coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(logistic.coef_))], axis=1)

The assumption you stated, that the order of regression.coef_ is the same as in the TRAIN set, holds true in my experience (it works with the underlying data and also checks out with correlations between X and y).
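A small self-contained sketch of the same idea for LinearRegression, using the diabetes dataset as an illustrative assumption; coef_ follows the column order of the training matrix, so the pairing is safe:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

reg = LinearRegression().fit(X, y)

# One coefficient per column, in the same order as X.columns.
coef_table = pd.DataFrame({"feature": X.columns, "coefficient": reg.coef_})
print(coef_table)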
