scikit-learn – Tarik Billa

Scikit-Learn Linear Regression how to get coefficient’s respective features?

January 1, 2024 by Tarik

What I found to work was: coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(logistic.coef_))], axis=1), where X holds your independent variables and logistic is the fitted model. The assumption you stated, that the order of regression.coef_ is the same as the column order in the training set, holds true in my experience. (It works with the underlying data and also checks out against the correlations between X and y.)
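As a minimal sketch of the same idea, the snippet below pairs each column name with its coefficient. The toy data, the column names a, b, c and the variable name regression are assumptions made for illustration; the target is built from known weights so the pairing can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y depends on "a" and "b" with known weights, not on "c".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
y = 3.0 * X["a"] - 2.0 * X["b"]

regression = LinearRegression().fit(X, y)

# coef_ follows the column order of the X passed to fit, so the two
# sequences can be placed side by side.
coefficients = pd.DataFrame(
    {"feature": X.columns, "coefficient": regression.coef_}
)
print(coefficients)
```

Building the frame from a dict of two equal-length sequences avoids the transpose and concat steps, at least for a single-target model where coef_ is one-dimensional.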

Issue with OneHotEncoder for categorical features

September 28, 2023 by Tarik

If you read the docs for OneHotEncoder (in scikit-learn releases before 0.20) you’ll see the input for fit is “Input array of type int”. So you need two steps to one-hot encode your data:

    from sklearn import preprocessing
    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    new_cat_features = enc.transform(cat_features)
    print(new_cat_features)  # [1 2 0]

new_cat_features … Read more
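For comparison, from scikit-learn 0.20 onward OneHotEncoder accepts string categories directly, so the separate LabelEncoder step is not required there. The sketch below assumes such a release; the colour values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical values for a single categorical column (one row per sample).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; densify it for display.
encoded = enc.fit_transform(colors).toarray()

print(enc.categories_)  # learned categories, sorted per column
print(encoded)          # one indicator column per category
```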

What’s the difference between predict_proba and decision_function in scikit-learn?

August 2, 2023 by Tarik

predict_proba is a method of a (soft) classifier that outputs the probability of the instance being in each of the classes. decision_function, by contrast, finds the distance to the separating hyperplane. For example, an SVM classifier finds hyperplanes separating the space into areas associated with classification outcomes. This function, given a point, finds the … Read more
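A rough illustration of the two outputs, using a binary LogisticRegression on synthetic data (the model choice and the data are assumptions, picked because this estimator exposes both methods without extra settings). For a linear model the decision score is w·x + b, and dividing by the norm of w gives a signed geometric distance to the hyperplane.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

scores = clf.decision_function(X[:5])  # signed scores, shape (5,)
proba = clf.predict_proba(X[:5])       # class probabilities, shape (5, 2)

# For a linear model the score is w.x + b ...
manual = X[:5] @ clf.coef_.ravel() + clf.intercept_[0]
# ... and dividing by ||w|| gives the signed distance to the hyperplane.
distance = scores / np.linalg.norm(clf.coef_)

print(scores)
print(proba)
print(distance)
```

The sign of the score decides the predicted class, while each row of predict_proba sums to 1.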

How to insert Keras model into scikit-learn pipeline?

July 20, 2023 by Tarik

You need to wrap your Keras model as a scikit-learn model first and then proceed as usual. Here’s a quick example (I’ve omitted the imports for brevity). Here is a full blog post with this one and many other examples: Scikit-learn Pipeline Examples. # create a function that returns a model, taking as parameters … Read more

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

April 13, 2023 by Tarik

Some quick preliminaries: Let’s say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the “impurity” of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate: Pr(Class=k) = #(examples … Read more
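The excerpt is cut off before it reaches sample_weight, so the following is only a sketch under a stated assumption: that weighting replaces plain example counts with sums of weights when class proportions are estimated. The helper weighted_gini is hypothetical and written in plain NumPy; it is not scikit-learn's internal code.

```python
import numpy as np

def weighted_gini(labels, sample_weight=None):
    """Gini impurity of a node, with class proportions taken from weight sums."""
    labels = np.asarray(labels)
    if sample_weight is None:
        sample_weight = np.ones(len(labels))  # unweighted: every example counts once
    sample_weight = np.asarray(sample_weight, dtype=float)
    total = sample_weight.sum()
    # Weighted estimate of Pr(Class=k) for each class present in the node.
    probs = np.array(
        [sample_weight[labels == k].sum() / total for k in np.unique(labels)]
    )
    return 1.0 - np.sum(probs ** 2)

labels = [0, 0, 0, 1]
unweighted = weighted_gini(labels)                # proportions 0.75 and 0.25
reweighted = weighted_gini(labels, [1, 1, 1, 3])  # proportions 0.5 and 0.5

print(unweighted, reweighted)
```

Up-weighting the single class-1 example makes the node look evenly mixed, which raises its impurity and therefore changes which splits look attractive.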

A progress bar for scikit-learn?

January 6, 2023 by Tarik

If you initialize the model with verbose=1 before calling fit you should get some kind of output indicating the progress. For example, sklearn.ensemble.GradientBoostingClassifier(verbose=1) provides progress output that looks like this:

    Iter  Train Loss  Remaining Time
       1      1.2811           0.71s
       2      1.2595           0.58s
       3      1.2402           0.50s
       4      1.2263           0.46s
       5      1.2121           0.43s
       6      1.1999           0.41s
       7      1.1876 …

Read more
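A small runnable version of this, assuming synthetic data and a deliberately low n_estimators so the progress table stays short. The exact loss values and timings printed will differ from the ones quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# verbose=1 prints a progress table to stdout while fit runs.
clf = GradientBoostingClassifier(n_estimators=5, verbose=1, random_state=0)
clf.fit(X, y)
```

Not every estimator accepts a verbose parameter, so it is worth checking the constructor signature of the model in use.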

RandomForestClassifier vs ExtraTreesClassifier in scikit learn

January 5, 2023 by Tarik

Yes, both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling. In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs … Read more
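One way to look at these points on your own data is to compare the bootstrap defaults and the total node counts of the two ensembles. The sketch below uses synthetic data and a hypothetical helper, total_nodes; the sizes it reports depend on the dataset, so it makes no claim about which ensemble ends up larger.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=20, random_state=0).fit(X, y)

def total_nodes(forest):
    """Sum of node counts over all fitted trees (a rough size measure)."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)

# Bootstrap resampling is on by default for RF and off for ET;
# both estimators accept bootstrap=True/False to change that.
print("bootstrap:", rf.bootstrap, et.bootstrap)
print("total nodes:", total_nodes(rf), total_nodes(et))
```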

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

December 27, 2022 by Tarik

It looks like sklearn requires data of shape (number of rows, number of columns). If your data has shape (number of rows,), such as (999,), it does not work. Use numpy.reshape() to change the shape of the array to (999, 1), e.g. data = data.reshape((999, 1)). In my case, that worked.
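A short sketch of the reshape, using a made-up array of 10 values in place of the 999 mentioned above. Passing -1 for the row count lets NumPy infer it, so the same line works for any length.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

data = np.arange(10, dtype=float)  # shape (10,): a single feature, 1-D
target = 2.0 * data + 1.0          # hypothetical target with known slope and intercept

# Turn the 1-D array into a single-column 2-D array of shape (10, 1).
X = data.reshape(-1, 1)

model = LinearRegression().fit(X, target)
print(X.shape, model.coef_, model.intercept_)
```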

How are feature_importances in RandomForestClassifier determined?

November 30, 2022 by Tarik

There are indeed several ways to get feature “importances”. As often, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
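As an illustration of reading these values, the sketch below fits a forest on synthetic data (the dataset and parameter values are assumptions) and compares feature_importances_ with the average of the per-tree importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_

# Each fitted tree exposes its own importances; averaging them
# reproduces the forest-level values in this example.
per_tree_mean = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)

for index, value in enumerate(importances):
    print(f"feature {index}: {value:.3f}")
```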