Scikit-Learn Linear Regression how to get coefficient’s respective features?

What I found to work was:

    X = your independent variables
    coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(logistic.coef_))], axis=1)

The assumption you stated, that the order of regression.coef_ is the same as the column order of the training set, has held true in my experience (it works with the underlying data and also checks out against the correlations between X and y).
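A slightly more direct way to get the same mapping is to index the coefficients by the DataFrame's column names. A minimal sketch, where load_diabetes and the variable names are illustrative assumptions rather than part of the answer above:

    import pandas as pd
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    # any DataFrame of features with named columns works; load_diabetes is just an example
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = LinearRegression().fit(X, y)

    # coef_ follows the column order of X, so indexing by X.columns pairs each
    # coefficient with its feature name
    coef_by_feature = pd.Series(model.coef_, index=X.columns)
    print(coef_by_feature.sort_values())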

Information Gain calculation with Scikit-learn

You can use scikit-learn's mutual_info_classif. Here is an example:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.feature_extraction.text import CountVectorizer

    categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
    newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
    X, Y = newsgroups_train.data, newsgroups_train.target

    cv = CountVectorizer(max_df=0.95, min_df=2, max_features=10000, stop_words='english')
    X_vec = cv.fit_transform(X)

    res = dict(zip(cv.get_feature_names(),
                   mutual_info_classif(X_vec, Y, discrete_features=True)))
    print(res)

this will output … Read more

Understanding max_features parameter in RandomForestRegressor

Straight from the documentation: [max_features] is the size of the random subsets of features to consider when splitting a node. So max_features is what you call m. When max_features=”auto”, m = p and no feature subset selection is performed in the trees, so the “random forest” is actually a bagged ensemble of ordinary regression trees. … Read more
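To actually get random feature subsets at each split (m < p), max_features has to be set to something smaller than the number of features. A minimal sketch of the contrast; the dataset and the value 1/3 are illustrative assumptions, not part of the original answer:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=500, n_features=12, noise=0.5, random_state=0)

    # m = p: every split sees all features, i.e. a bagged ensemble of ordinary regression trees
    bagged = RandomForestRegressor(n_estimators=200, max_features=None, random_state=0).fit(X, y)

    # m = p/3: each split considers a random third of the features
    forest = RandomForestRegressor(n_estimators=200, max_features=1/3, random_state=0).fit(X, y)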

Feature selection using scikit-learn

The error message Input X must be non-negative says it all: Pearson's chi-square test (goodness of fit) does not apply to negative values. That is logical, because the chi-square test assumes a frequency distribution, and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative. You are saying that your features … Read more
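For reference, a minimal sketch of chi2-based selection on non-negative inputs; rescaling with MinMaxScaler is one common way to make negative-valued features usable here (the scaler and the iris data are assumptions, not part of the original answer):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_iris(return_X_y=True)

    # chi2 requires non-negative features, so map everything into [0, 1] first
    X_scaled = MinMaxScaler().fit_transform(X)

    # keep the 2 features with the highest chi-square statistic
    X_selected = SelectKBest(chi2, k=2).fit_transform(X_scaled, y)
    print(X_selected.shape)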

Correlated features and classification accuracy

Correlated features do not affect classification accuracy per se. The problem in realistic situations is that we have a finite number of training examples with which to train a classifier. For a fixed number of training examples, increasing the number of features typically increases classification accuracy up to a point, but as the number of features … Read more

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

    >>> v = CountVectorizer(ngram_range=(1, 2))
    >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
    {u'an': 0,
     u'an apple': 1,
     u'apple': 2,
     u'apple day': 3,
     u'away': 4,
     u'day': 5,
     u'day keeps': 6,
     u'doctor': 7,
     u'doctor away': 8,
     u'keeps': … Read more
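To illustrate the contrast the first sentence draws, a small sketch: with vocabulary= the features are fixed up front and nothing is learned, while without it fit learns both unigrams and bigrams (the explicit two-term vocabulary is an illustrative assumption):

    from pprint import pprint
    from sklearn.feature_extraction.text import CountVectorizer

    text = ["an apple a day keeps the doctor away"]

    # vocabulary learned from the data: unigrams and bigrams
    learned = CountVectorizer(ngram_range=(1, 2)).fit(text)
    pprint(learned.vocabulary_)

    # vocabulary fixed up front: only these terms are counted, nothing is learned
    fixed = CountVectorizer(ngram_range=(1, 2), vocabulary=["apple", "doctor away"]).fit(text)
    pprint(fixed.vocabulary_)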

Random Forest Feature Importance Chart using Python

Here is an example using the iris data set.

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> iris = load_iris()
    >>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
    >>> rnd_clf.fit(iris["data"], iris["target"])
    >>> for name, importance in zip(iris["feature_names"], rnd_clf.feature_importances_):
    ...     print(name, "=", importance)
    sepal length (cm) = 0.112492250999
    sepal width (cm) = 0.0231192882825
    petal length (cm) = 0.441030464364
    petal width … Read more
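To turn those importances into the chart the question asks about, one common approach is a horizontal bar plot via pandas and matplotlib. A minimal sketch that repeats the fit above so it runs standalone (the sorting and figure size are assumptions):

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
    rnd_clf.fit(iris["data"], iris["target"])

    # pair each importance with its feature name and sort for a readable chart
    importances = pd.Series(rnd_clf.feature_importances_, index=iris["feature_names"])
    importances.sort_values().plot.barh(figsize=(6, 3))
    plt.xlabel("feature importance")
    plt.tight_layout()
    plt.show()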

Feature/Variable importance after a PCA analysis

First of all, I assume that by "features" you mean the variables, not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data. Before the example, please note that the basic idea when … Read more
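The excerpt cuts off before the code, but the usual starting point is that pca.components_ holds the loadings of each original variable on each principal component. A minimal sketch along those lines (standardizing first and keeping two components are assumptions, and this is not the answer's actual biplot function):

    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    iris = load_iris(as_frame=True)
    X = StandardScaler().fit_transform(iris.data)

    pca = PCA(n_components=2).fit(X)

    # rows are components, columns are original variables; large absolute
    # loadings indicate the variables that contribute most to each component
    loadings = pd.DataFrame(pca.components_, columns=iris.data.columns, index=["PC1", "PC2"])
    print(loadings)
    print(pca.explained_variance_ratio_)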
