Python scikit-learn clustering with missing data

I think you can use an iterative EM-type algorithm: initialize the missing values to their column means, then repeat until convergence: perform K-means clustering on the filled-in data, and set the missing values to the centroid coordinates of the clusters to which they were assigned. Implementation: import numpy as np from sklearn.cluster import KMeans def kmeans_missing(X, n_clusters, max_iter=10): … Read more
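Since the excerpt cuts off before the body of kmeans_missing, here is a minimal sketch of what such a function might look like, following the steps described above; the function name matches the excerpt, but the body and defaults are assumptions:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_missing(X, n_clusters, max_iter=10):
    # X may contain NaNs; alternate between imputation and K-means.
    missing = np.isnan(X)
    X_hat = np.where(missing, np.nanmean(X, axis=0), X)  # start from column means

    prev_labels = None
    for _ in range(max_iter):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_hat)
        labels = km.labels_
        # Overwrite only the missing entries with their cluster's centroid values.
        X_hat[missing] = km.cluster_centers_[labels][missing]
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break  # assignments stopped changing: converged
        prev_labels = labels
    return labels, km.cluster_centers_, X_hat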

Using the predict_proba() function of RandomForestClassifier in the safe and right way

A RandomForestClassifier is a collection of DecisionTreeClassifiers. No matter how big your training set, a decision tree simply returns a decision: one class has probability 1, the other classes have probability 0. The RandomForest simply votes among the results. predict_proba() returns the number of votes for each class (each tree in the forest makes its … Read more
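A minimal sketch of that voting behaviour, using the iris data purely as an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba() averages the per-tree class probabilities; for fully grown
# trees these are essentially hard 0/1 votes, so with 100 trees the output
# moves in steps of roughly 1/100.
print(clf.predict_proba(X[:3]))
print(clf.predict(X[:3]))  # the class with the most votes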

Python scikit-learn (metrics): difference between r2_score and explained_variance_score?

Most of the answers I found (including here) emphasize the difference between R2 and the Explained Variance Score, namely the mean residual (i.e. the mean error). However, there is an important question left behind: why on earth do I need to consider the mean error? Refresher: R2 is the Coefficient of … Read more
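A small illustration of why the mean error matters (the toy numbers here are made up for the example): a constant bias of +1 leaves the variance of the residuals at zero, so explained_variance_score is still 1.0, while r2_score is penalized.

import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0  # systematically biased predictions

print(r2_score(y_true, y_pred))                  # 0.2 -- the bias is penalized
print(explained_variance_score(y_true, y_pred))  # 1.0 -- residual variance is zero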

How to get a non-shuffled train_test_split in sklearn

I’m not adding much to Psidom’s answer except an easy-to-copy-paste function:

import numpy as np

def non_shuffling_train_test_split(X, y, test_size=0.2):
    i = int((1 - test_size) * X.shape[0]) + 1
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test

Update: At some point this feature became built in, so now you can … Read more
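The update presumably refers to the shuffle flag of train_test_split, which does the same thing; a small self-contained sketch:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the original row order, so the test set is simply the tail.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(X_test)  # the last 20% of the rows, untouched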

PCA projection and reconstruction in scikit-learn

You can do proj = pca.inverse_transform(X_train_pca); that way you do not have to worry about how to do the multiplications. What you obtain after pca.fit_transform or pca.transform is what is usually called the “loadings” for each sample, meaning how much of each component you need to describe it best using a linear combination of the … Read more
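A short sketch of the round trip, with the equivalent matrix arithmetic spelled out (the iris data and two components are illustrative assumptions, and the manual formula assumes the default whiten=False):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)  # the per-sample "loadings"

# Reconstruction via the convenience method ...
X_rec = pca.inverse_transform(X_pca)

# ... which undoes the projection by hand: back-project and re-add the mean.
X_rec_manual = X_pca @ pca.components_ + pca.mean_
print(np.allclose(X_rec, X_rec_manual))  # True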

How can I know the probability of the class predicted by the predict() function in Support Vector Machine?

Definitely read this section of the docs, as there are some subtleties involved. See also Scikit-learn predict_proba gives wrong answers. Basically, if you have a multi-class problem with plenty of data, predict_proba, as suggested earlier, works well. Otherwise, you may have to make do with an ordering that doesn’t yield probability scores from decision_function. Here’s a … Read more
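A brief sketch of the two options on the iris data (an illustrative assumption); probability=True fits Platt scaling internally, which adds an extra cross-validation cost:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# With probability=True, calibrated class probabilities are available.
clf = SVC(probability=True, random_state=0).fit(X, y)
print(clf.predict_proba(X[:3]))

# Without it, decision_function only gives a ranking of the classes,
# not probabilities.
clf2 = SVC().fit(X, y)
print(clf2.decision_function(X[:3]))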

Scikit-Learn Linear Regression: how to get coefficients’ respective features?

What I found to work was:

X = your independent variables

coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(logistic.coef_))], axis=1)

The assumption you stated, that the order of regression.coef_ is the same as in the TRAIN set, holds true in my experience (it works with the underlying data and also checks out with correlations between X and y).
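A small self-contained sketch of the same idea for LinearRegression, using the diabetes dataset as an illustrative assumption; coef_ follows the column order of the training matrix, so the pairing is safe:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

reg = LinearRegression().fit(X, y)

# One coefficient per column, in the same order as X.columns.
coef_table = pd.DataFrame({"feature": X.columns, "coefficient": reg.coef_})
print(coef_table)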
