scikit-learn – Page 3

Does TensorFlow have cross validation implemented?

December 28, 2023 by Tarik

As already discussed, TensorFlow doesn’t provide its own way to cross-validate a model. The recommended way is to use scikit-learn’s KFold. It’s a bit tedious, but doable. Here’s a complete example of cross-validating an MNIST model with TensorFlow and KFold:

from sklearn.model_selection import KFold
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Parameters
learning_rate = 0.01
…
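The fold loop itself does not depend on TensorFlow. Here is a minimal sketch of the pattern, assuming scikit-learn and NumPy are installed. The synthetic data and the nearest-centroid classifier are hypothetical stand-ins (not from the original post) for MNIST and the TensorFlow model, so the only part to swap out is the train-and-evaluate step inside the loop.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic two-class data standing in for MNIST (assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)


def train_and_score(X_train, y_train, X_val, y_val):
    """Stand-in for building, training and evaluating a fresh model per fold."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # Assign each validation point to the closest class centroid.
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_val).mean())


kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kfold.split(X):
    # A new model is trained from scratch on every fold.
    scores.append(train_and_score(X[train_idx], y[train_idx],
                                  X[val_idx], y[val_idx]))

print("per-fold accuracy:", scores)
print("mean accuracy:", np.mean(scores))
```

With TensorFlow, train_and_score would presumably build the model, fit it on the training fold, and return the validation metric; re-creating the model inside the function keeps weights from leaking between folds.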

How is the TFIDFVectorizer in scikit-learn supposed to work?

December 28, 2023 by Tarik

Data Standardization vs Normalization vs Robust Scaler

December 27, 2023 by Tarik

“Am I right to say that standardization also gets affected negatively by extreme values?” Indeed you are; the scikit-learn docs themselves clearly warn about such a case: “However, when data contains outliers, StandardScaler can often be mislead. In such cases, it is better to use a scaler that is robust against outliers.” …
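The effect is easy to see on a toy column. This is a small sketch with made-up numbers (four ordinary values and one outlier), comparing StandardScaler with RobustScaler at its default settings, which center on the median and scale by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (made-up numbers).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# Spread of the four ordinary points after scaling.
standard_spread = float(standard[:4].max() - standard[:4].min())
robust_spread = float(robust[:4].max() - robust[:4].min())

print("StandardScaler:", standard.ravel())
print("RobustScaler:  ", robust.ravel())
print("inlier spread:", standard_spread, "vs", robust_spread)
```

The outlier inflates the mean and standard deviation, so StandardScaler squeezes the ordinary points into a narrow band, while the median and interquartile range used by RobustScaler are barely moved by it.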

Information Gain calculation with Scikit-learn

December 27, 2023 by Tarik

You can use scikit-learn’s mutual_info_classif. Here is an example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
X, Y = newsgroups_train.data, newsgroups_train.target

cv = CountVectorizer(max_df=0.95, min_df=2, max_features=10000, stop_words="english")
X_vec = cv.fit_transform(X)

# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
res = dict(zip(cv.get_feature_names_out(), mutual_info_classif(X_vec, Y, discrete_features=True)))
print(res)

This will output …
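For discrete features, information gain is the mutual information between a feature and the label, which is why mutual_info_classif fits here. This is a minimal sketch on toy data (an assumption, not the newsgroups set): one column that determines the label and one that is independent of it. scikit-learn reports the value in natural-log units, so dividing by ln 2 converts it to bits.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy data (assumption): column 0 equals the label, column 1 is independent of it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 2)
y = X[:, 0]

mi_nats = mutual_info_classif(X, y, discrete_features=True)
mi_bits = mi_nats / np.log(2)  # convert natural-log units to bits

print("mutual information (nats):", mi_nats)
print("information gain (bits):  ", mi_bits)
```

The informative column recovers the full entropy of a balanced binary label (one bit), and the independent column scores zero.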

Plotting a ROC curve in scikit yields only 3 points

December 26, 2023 by Tarik

The number of points depends on the number of unique values in the input scores. Since the input vector has only 2 unique values, the function gives the correct output.
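A short sketch of this behaviour with scikit-learn's roc_curve, on made-up labels and scores: hard 0/1 predictions give two thresholds plus the (0, 0) starting point that roc_curve prepends, while continuous scores give one threshold per distinct value.

```python
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]

# Hard 0/1 predictions: only two unique score values.
fpr_hard, tpr_hard, thr_hard = roc_curve(y_true, [0, 1, 1, 1])

# Continuous scores (e.g. predicted probabilities): four unique values.
fpr_soft, tpr_soft, thr_soft = roc_curve(y_true, [0.1, 0.4, 0.35, 0.8])

print(len(fpr_hard), "points from hard labels")
print(len(fpr_soft), "points from continuous scores")
```

To get a fuller curve, pass scores such as the output of predict_proba or decision_function rather than the output of predict.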

Why does sklearn Imputer need to fit?

December 26, 2023 by Tarik

The Imputer fills missing values with a statistic (e.g. mean, median, …) of the data. To avoid data leakage during cross-validation, it computes the statistic on the training data during fit, stores it, and applies it to the test data during transform.

from sklearn.preprocessing import Imputer
obj = Imputer(strategy='mean')
obj.fit([[1, 2, 3], [2, …
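The Imputer class was removed from sklearn.preprocessing in scikit-learn 0.22; its replacement is sklearn.impute.SimpleImputer. Here is a minimal sketch of the fit/transform split with that class, using made-up arrays: the column means are learned from the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_test = np.array([[np.nan, np.nan], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                 # learns the column means of the training data
learned_means = imputer.statistics_  # stored for later use

# Missing test values are filled with the training means, not the test means.
X_test_filled = imputer.transform(X_test)

print("learned means:", learned_means)
print(X_test_filled)
```

If the imputer were fitted on the test set instead, information from the held-out data would leak into preprocessing, which is exactly what the separate fit step prevents.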

TypeError: only integer arrays with one element can be converted to an index

December 26, 2023 by Tarik

I finally managed to solve the problem. Two things had to be done: train_argcands_target is a list and it has to be a NumPy array. I’m surprised it worked well before when I just used the estimator directly. For some reason (I don’t know why, yet), it doesn’t work either if I use the sparse …
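The excerpt is cut off, so the exact cause in the original code is not shown. One common way to hit this kind of TypeError, sketched here with hypothetical data, is indexing a plain Python list with an array of indices, which is what cross-validation utilities do with the target.

```python
import numpy as np

targets = ["a", "b", "c", "d"]       # plain Python list (hypothetical data)
fold_indices = np.array([0, 2, 3])   # the kind of index array a CV splitter produces

try:
    targets[fold_indices]            # a list cannot be indexed by an index array
    list_indexing_failed = False
except TypeError as exc:
    list_indexing_failed = True
    print("TypeError:", exc)

# Converting to a NumPy array enables this kind of indexing.
targets_arr = np.asarray(targets)
selected = targets_arr[fold_indices]
print(selected)
```

The exact wording of the error message varies between NumPy versions, but the exception type is the same.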

SKlearn import MLPClassifier fails

December 26, 2023 by Tarik

MLPClassifier is not yet available in scikit-learn v0.17 (as of 1 Dec 2015). If you really want to use it, you could clone 0.18dev (however, I don’t know how stable this branch currently is).
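MLPClassifier has since shipped in the stable 0.18 release, under sklearn.neural_network. This is a small sketch of a version check before importing it; the parsing assumes the version string starts with numeric major and minor parts.

```python
import sklearn

# MLPClassifier lives in sklearn.neural_network from release 0.18 onwards.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
has_mlp = (major, minor) >= (0, 18)

if has_mlp:
    from sklearn.neural_network import MLPClassifier
    print("MLPClassifier available in scikit-learn", sklearn.__version__)
else:
    print("scikit-learn", sklearn.__version__, "is too old; upgrade to 0.18 or later")
```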

Using GridSearchCV with AdaBoost and DecisionTreeClassifier

December 25, 2023 by Tarik

There are several things wrong in the code you posted: The keys of the param_grid dictionary need to be strings. You should be getting a NameError. The key “abc__n_estimators” should just be “n_estimators”: you are probably mixing this up with the pipeline syntax. Here nothing tells Python that the string “abc” represents your AdaBoostClassifier. None (and …
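The first two points can be sketched as follows, on a synthetic dataset with hypothetical grid values. The tree is passed positionally because the keyword for the base estimator has been renamed across scikit-learn versions, and get_params() is used to list the keys a param_grid may contain.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The tree is passed positionally as the base estimator.
abc = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), random_state=0)

# Keys are strings naming AdaBoostClassifier's own parameters: no "abc__" prefix.
param_grid = {
    "n_estimators": [5, 10],
    "learning_rate": [0.5, 1.0],
}

grid = GridSearchCV(abc, param_grid=param_grid, cv=3)
grid.fit(X, y)

valid_keys = sorted(abc.get_params().keys())  # every name param_grid may use
print(grid.best_params_)
print(valid_keys)
```

Parameters of the wrapped tree also appear in valid_keys, with a prefix ending in a double underscore; the exact prefix depends on the installed scikit-learn version, so printing the keys is the safest way to find it.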

What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

December 25, 2023 by Tarik

Here are a couple of approaches:

Find the ratio of the number of unique values to the total number of values. Something like the following:

likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05  # or some other threshold

Check if the top n unique values account for more than a certain proportion …
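Both heuristics can be sketched on a toy frame. The 0.05 ratio comes from the excerpt; the column names, TOP_N and SHARE_THRESHOLD are hypothetical values chosen only for illustration, since the excerpt is cut off before it gives any.

```python
import pandas as pd

# Toy frame (assumption): "color" repeats 3 values, "user_id" is unique per row.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"] * 25,
    "user_id": range(100),
})

RATIO_THRESHOLD = 0.05   # cut-off for unique/total, as in the excerpt
TOP_N = 3                # hypothetical number of top values to inspect
SHARE_THRESHOLD = 0.8    # hypothetical minimum share covered by the top values

# Approach 1: few distinct values relative to the number of non-null rows.
likely_cat_ratio = {
    col: df[col].nunique() / df[col].count() < RATIO_THRESHOLD
    for col in df.columns
}

# Approach 2: the TOP_N most frequent values cover most of the rows.
likely_cat_top = {
    col: df[col].value_counts(normalize=True).head(TOP_N).sum() > SHARE_THRESHOLD
    for col in df.columns
}

print(likely_cat_ratio)
print(likely_cat_top)
```

Both checks flag "color" and reject "user_id" here; on real data the thresholds need tuning, and short columns can look categorical under the ratio test simply because they have few rows.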