adding words to stop_words list in TfidfVectorizer in sklearn

This is how you can do it:

    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Recent scikit-learn releases validate stop_words as a list, so convert the frozenset
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
    X = vectorizer.fit_transform(["this is an apple.", "this is a book."])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)

    # printing the vocabulary
    print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for …

Custom transformer for sklearn Pipeline that alters both X and y

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, you should do it outside any calls to scikit-learn, as preprocessing. As it is now, the transformer API is used to transform the features of a given sample into something new. …
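A minimal sketch of the "do it as preprocessing" advice: samples are dropped from X and y together before the pipeline ever sees them. The helper name `drop_rows_with_nan`, the toy data, and the choice of pipeline steps are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def drop_rows_with_nan(X, y):
    """Hypothetical helper: drop samples that contain NaN, keeping X and y aligned."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


# Toy data (made up for illustration); two rows contain a missing value.
X = np.array([[0.0, 1.0], [1.0, np.nan], [2.0, 3.0],
              [3.0, 2.0], [np.nan, 0.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Sample removal happens outside scikit-learn ...
X_clean, y_clean = drop_rows_with_nan(X, y)

# ... and the pipeline only transforms features of the remaining samples.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_clean, y_clean)
```

The same filter has to be applied deliberately at prediction time if needed, since the pipeline itself knows nothing about it.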

Python scikit learn Linear Model Parameter Standard Error

tl;dr not with scikit-learn, but you can compute this manually with some linear algebra. i do this for your example below. also here’s a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31

what and why

the standard errors of your estimates are just the square root of the variances of your estimates. what’s the variance of your …
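As a rough sketch of the manual computation described above, the following assumes the classical OLS setting (homoskedastic errors, full-rank design matrix with an explicit intercept column), where the coefficient covariance is estimated as sigma^2 (X'X)^-1. The synthetic data and all variable names are assumptions for illustration; this is not necessarily the code in the linked notebook.

```python
import numpy as np

# Synthetic data (assumption): y = 1.5 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.5 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, then the least-squares fit.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Unbiased estimate of the residual variance: RSS / (n - number of parameters).
residuals = y - X_design @ beta
dof = n - X_design.shape[1]
sigma2 = residuals @ residuals / dof

# Covariance of the estimates; standard errors are the square roots of its diagonal.
cov_beta = sigma2 * np.linalg.inv(X_design.T @ X_design)
std_errors = np.sqrt(np.diag(cov_beta))

print("coefficients:   ", beta)
print("standard errors:", std_errors)
```

The first entry of each array corresponds to the intercept, the rest to the feature coefficients in column order.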

using confusion matrix as scoring metric in cross validation in scikit learn

You could use cross_val_predict (see the scikit-learn docs) instead of cross_val_score. Instead of doing:

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(clf, x, y, cv=10)

you can do:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    y_pred = cross_val_predict(clf, x, y, cv=10)
    conf_mat = confusion_matrix(y, y_pred)
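For a self-contained run of the same idea, here is a sketch on the iris dataset. The dataset and the LogisticRegression classifier are stand-ins chosen for illustration; the answer's `clf`, `x`, and `y` are whatever you already have.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Stand-in data and classifier (assumptions, not from the original answer).
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Each sample is predicted once, by the model trained on the folds that exclude it.
y_pred = cross_val_predict(clf, X, y, cv=10)

# Rows are true classes, columns are predicted classes, aggregated over all folds.
conf_mat = confusion_matrix(y, y_pred)
print(conf_mat)
```

Because the matrix pools predictions from all ten folds, it is a single aggregate rather than one matrix per fold.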

What are different options for objective functions available in xgboost.XGBClassifier?

It’s true that binary:logistic is the default objective for XGBClassifier, but I don’t see any reason why you couldn’t use other objectives offered by the XGBoost package. For example, you can see in the sklearn.py source code that multi:softprob is used explicitly in the multiclass case. Moreover, if it’s really necessary, you can provide a custom objective function …
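To give a feel for what a custom objective looks like, here is a sketch that reimplements the logistic loss by hand using only NumPy. It assumes the scikit-learn wrapper convention of a callable taking `(y_true, y_pred)` and returning the per-sample gradient and Hessian, with `y_pred` being raw margin scores rather than probabilities. The function name `logistic_objective` is hypothetical.

```python
import numpy as np


def logistic_objective(y_true, y_pred):
    """Gradient and Hessian of the log loss with respect to raw margin scores.

    Assumes y_true holds 0/1 labels and y_pred holds untransformed margins.
    """
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid of the margin
    grad = prob - y_true                  # first derivative of the log loss
    hess = prob * (1.0 - prob)            # second derivative of the log loss
    return grad, hess


# Quick check on two samples with a margin of 0 (probability 0.5).
grad, hess = logistic_objective(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
print(grad, hess)
```

Under that assumed convention, the function would be passed as `XGBClassifier(objective=logistic_objective)`; check the documentation of your installed XGBoost version before relying on it.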

XGBoost for multilabel classification?

One possible approach, instead of using OneVsRestClassifier, which is for multi-class tasks, is to use MultiOutputClassifier from the sklearn.multioutput module. Below is a small reproducible code sample with the number of input features and target outputs requested by the OP:

    import xgboost as xgb
    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import …

Predicting how long a scikit-learn classification will take to run

Only certain classifiers and regressors directly report the remaining time or progress of the algorithm (number of iterations, etc.). Most of this can be turned on by passing the verbose=2 option (any number > 1) to the constructor of the individual model. Note: this behavior is according to sklearn-0.14. Earlier versions have …

OLS Regression: Scikit vs. Statsmodels? [closed]

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here’s an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Generate artificial data …

What’s the best way to test whether an sklearn model has been fitted?

You can do something like:

    from sklearn.exceptions import NotFittedError

    for model in models:
        try:
            model.predict(some_test_data)
        except NotFittedError as e:
            print(repr(e))

Ideally you would check the results of model.predict against expected results, but if all you want to know is whether the model is fitted or not, that should suffice.

Update: Some commenters have suggested using …
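The try/except check above can be wrapped into a small reusable helper. This is only a sketch: the name `is_fitted`, the probe data, and the use of LinearRegression are assumptions for illustration, and it relies on the estimator raising NotFittedError from predict when it has not been fitted.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression


def is_fitted(model, X_probe):
    """Hypothetical helper: True if model.predict runs without NotFittedError."""
    try:
        model.predict(X_probe)
    except NotFittedError:
        return False
    return True


# Toy data (made up) with the same number of features the model is trained on.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])

unfitted = LinearRegression()
fitted = LinearRegression().fit(X, y)

for name, model in [("unfitted", unfitted), ("fitted", fitted)]:
    print(name, is_fitted(model, X))
```

The probe array must have the feature count the model expects; otherwise predict can fail for reasons unrelated to fitting.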
