scikit-learn – Tarik Billa

adding words to stop_words list in TfidfVectorizer in sklearn

April 11, 2024 by Tarik

This is how you can do it:

    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # recent scikit-learn releases require stop_words to be a list, so convert the frozenset
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))
    vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
    X = vectorizer.fit_transform(["this is an apple.", "this is a book."])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)

    # printing the vocabulary
    print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for … Read more

Custom transformer for sklearn Pipeline that alters both X and y

April 10, 2024 by Tarik

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, you should do it outside any calls to scikit-learn, as preprocessing. As it is now, the transformer API is used to transform the features of a given sample into something new. … Read more
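A minimal sketch of the preprocessing approach described above: drop the unwanted samples from X and y together, before anything reaches the pipeline. The toy data, the rule of dropping rows with a missing target, and the pipeline steps are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 samples, 2 features, two targets missing.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
              [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]])
y = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])

# Filter the sample axis outside scikit-learn, applying one mask to X and y.
keep = ~np.isnan(y)
X_clean, y_clean = X[keep], y[keep]

# The pipeline only transforms features; it never changes the sample count.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_clean, y_clean)
print(X_clean.shape, y_clean.shape)
```

Using a single boolean mask for both arrays keeps X and y aligned, which is the part a feature-only transformer cannot do for you.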

Python scikit learn Linear Model Parameter Standard Error

April 7, 2024 by Tarik

tl;dr: not with scikit-learn, but you can compute this manually with some linear algebra. I do this for your example below. Also, here’s a Jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31

What and why

The standard errors of your estimates are just the square roots of the variances of your estimates. What’s the variance of your … Read more
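As a rough illustration of the "linear algebra" route, here is a sketch that uses the textbook OLS result Var(beta_hat) = sigma^2 (X'X)^-1 under the assumption of homoskedastic, uncorrelated errors. The synthetic data and variable names are assumptions made for this example; it is not the code from the linked notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): intercept 1.5, slopes 2.0 and -1.0, noise sd 0.5.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.5 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)

# Design matrix with an explicit intercept column, matching fit_intercept=True.
X_design = np.column_stack([np.ones(n), X])
residuals = y - model.predict(X)

# Unbiased estimate of the error variance: RSS / (n - number of parameters).
dof = n - X_design.shape[1]
sigma2 = residuals @ residuals / dof

# Covariance of the estimates is sigma^2 * (X'X)^-1; the standard errors are
# the square roots of its diagonal (intercept first, then the two slopes).
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)
std_errors = np.sqrt(np.diag(cov))
print(std_errors)
```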

using confusion matrix as scoring metric in cross validation in scikit learn

April 5, 2024 by Tarik

You could use cross_val_predict (see the scikit-learn docs) instead of cross_val_score. Instead of doing:

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(clf, x, y, cv=10)

you can do:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    y_pred = cross_val_predict(clf, x, y, cv=10)
    conf_mat = confusion_matrix(y, y_pred)
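For a version that runs end to end, here is a small sketch of the same idea. The iris dataset and the LogisticRegression classifier are stand-ins chosen for illustration; any classifier and labelled dataset should work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Stand-in data and classifier (assumptions for this example).
x, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Each sample is predicted once, by a model trained on the folds that exclude it.
y_pred = cross_val_predict(clf, x, y, cv=10)

# Rows are true classes, columns are predicted classes.
conf_mat = confusion_matrix(y, y_pred)
print(conf_mat)
```

Because every sample receives exactly one out-of-fold prediction, the entries of the matrix sum to the number of samples.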

What are different options for objective functions available in xgboost.XGBClassifier?

January 8, 2024 by Tarik

It’s true that binary:logistic is the default objective for XGBClassifier, but I don’t see any reason why you couldn’t use other objectives offered by the XGBoost package. For example, you can see in the sklearn.py source code that multi:softprob is used explicitly in the multiclass case. Moreover, if it’s really necessary, you can provide a custom objective function … Read more

XGBoost for multilabel classification?

January 8, 2024 by Tarik

One possible approach, instead of using OneVsRestClassifier, which is for multi-class tasks, is to use MultiOutputClassifier from the sklearn.multioutput module. Below is a small reproducible sample code with the number of input features and target outputs requested by the OP:

    import xgboost as xgb
    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import … Read more
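A minimal sketch of the MultiOutputClassifier pattern follows. To keep it runnable without XGBoost installed, it swaps in scikit-learn's DecisionTreeClassifier as the base estimator; that swap, the dataset sizes, and the split are assumptions made for this example, and an xgb.XGBClassifier instance could presumably be passed in the same position.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sizes: 200 samples, 10 features, 3 binary labels per sample.
X, y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in base estimator (assumption): one independent copy is fitted per label.
clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X_train, y_train)

# Predictions have one column per label, like y_test.
y_pred = clf.predict(X_test)
print(y_pred.shape)
```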

Understanding “score” returned by scikit-learn KMeans

January 7, 2024 by Tarik

The wording chosen by the documentation is a bit confusing. It says “Opposite of the value of X on the K-means objective.” It means the negative of the K-means objective.

K-Means Objective

The objective in K-means is to minimize the sum of squared distances of points from their respective cluster centroids. It has … Read more
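To make this concrete, the sketch below computes the K-means objective by hand and compares it with the value returned by score. The blob data and the number of clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every point to every centroid: (n_samples, n_clusters).
sq_dists = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# K-means objective: each point contributes its distance to the nearest centroid.
objective = sq_dists.min(axis=1).sum()

# score() returns the negative of that objective, so higher is better.
score = km.score(X)
print(objective, score)
```

The sign flip follows the scikit-learn convention that larger scores are better. On the training data, the hand-computed objective should also be close to the fitted model's inertia_ attribute.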

Predicting how long an scikit-learn classification will take to run

January 7, 2024 by Tarik

Only certain classes of classifiers or regressors directly report the remaining time or progress of the algorithm (number of iterations, etc.). Most of this can be turned on by passing verbose=2 (or any number > 1) to the constructor of the individual model. Note: this behavior is as of sklearn-0.14. Earlier versions have … Read more
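As one illustration of the verbose option, the sketch below uses GradientBoostingClassifier, whose progress output includes a remaining-time column. The choice of estimator and the synthetic data are assumptions for this example; other models print different information, or none at all.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical data for the demonstration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# With verbose > 1 this estimator prints a progress line for every tree.
clf = GradientBoostingClassifier(n_estimators=20, verbose=2, random_state=0)
clf.fit(X, y)
```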

OLS Regression: Scikit vs. Statsmodels? [closed]

January 7, 2024 by Tarik

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here’s an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Generate artificial data … Read more

What’s the best way to test whether an sklearn model has been fitted?

January 6, 2024 by Tarik

You can do something like:

    from sklearn.exceptions import NotFittedError

    for model in models:
        try:
            model.predict(some_test_data)
        except NotFittedError as e:
            print(repr(e))

Ideally you would check the results of model.predict against expected results, but if all you want to know is whether the model is fitted or not, that should suffice.

Update: Some commenters have suggested using … Read more
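Separately from the try/except loop above, scikit-learn also provides check_is_fitted in sklearn.utils.validation, which raises the same NotFittedError without needing any test data. The sketch below wraps it in a hypothetical is_fitted helper; the helper name and the toy models are assumptions for this example, and calling check_is_fitted without an attributes argument assumes a reasonably recent scikit-learn release.

```python
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted


def is_fitted(model):
    """Hypothetical helper: True if the estimator has been fitted."""
    try:
        check_is_fitted(model)
        return True
    except NotFittedError:
        return False


unfitted = LinearRegression()
fitted = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
print(is_fitted(unfitted), is_fitted(fitted))  # prints: False True
```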