cross-validation – Page 2

Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead

July 20, 2023 by Tarik

keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs: split(X, y, groups=None) […] y : array-like, shape (n_samples,). The target variable for supervised learning problems. Stratification is done based on the y labels. I.e. your …
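As a minimal sketch of the usual workaround, the one-hot matrix can be collapsed back to 1-D class labels with argmax before it is handed to StratifiedKFold. The toy data below is made up, and np.eye stands in for keras.utils.to_categorical so the snippet does not need Keras installed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 samples, 3 classes, 4 samples per class.
labels = np.repeat([0, 1, 2], 4)
X = np.arange(24).reshape(12, 2)

# One-hot encode with NumPy (a stand-in for keras.utils.to_categorical).
y_onehot = np.eye(3, dtype=int)[labels]

# Recover 1-D integer class labels; this is what split() expects for y.
y_labels = y_onehot.argmax(axis=1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_class_counts = []
for train_idx, test_idx in skf.split(X, y_labels):
    # The one-hot targets can still be indexed here for model fitting.
    y_train, y_test = y_onehot[train_idx], y_onehot[test_idx]
    # Count how many samples of each class ended up in this test fold.
    fold_class_counts.append(np.bincount(y_labels[test_idx], minlength=3).tolist())
```

Only the stratification step needs the 1-D labels; the network itself can keep training on the one-hot targets.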

How to split data on balanced training set and test set on sklearn

June 6, 2023 by Tarik

Although Christian’s suggestion is correct, train_test_split can produce a stratified split directly via its stratify parameter. So you could do: X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target) The catch is that stratify is only available from sklearn version 0.17 onward. In scikit-learn 0.20 and later, sklearn.cross_validation no longer exists, so import train_test_split from sklearn.model_selection. From the documentation about the parameter stratify: stratify : array-like or None …
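For current scikit-learn releases, the same call can be sketched as follows. The Data and Target arrays here are invented for illustration (an 80/20 class imbalance); the point is only that both halves of the split keep roughly the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1.
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

# Fraction of class 1 in each part; both should stay close to 0.2.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
```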

module 'sklearn' has no attribute 'cross_validation'

May 14, 2023 by Tarik

sklearn does not automatically import its subpackages. If you only ran import sklearn, then sklearn.cross_validation is not available as an attribute. On older releases, import it explicitly with import sklearn.cross_validation. Note that sklearn.cross_validation was deprecated in version 0.18 and removed in version 0.20; on current releases, use sklearn.model_selection.train_test_split instead.
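A short sketch of the explicit import on a current release, assuming scikit-learn 0.18 or later. The alias ms and the tiny lists are just for illustration.

```python
# Import the subpackage explicitly; a bare "import sklearn" is not enough.
import sklearn.model_selection as ms

# Hypothetical toy data, just enough to show the call works.
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.25, random_state=0
)
```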

Using explicit (predefined) validation set for grid search with sklearn

April 15, 2023 by Tarik

Use PredefinedSplit: ps = PredefinedSplit(test_fold=your_test_fold), then set cv=ps in GridSearchCV. From the docs, test_fold : “array-like, shape (n_samples,). test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.” Also see here when …
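Here is a minimal sketch of how that fits together. The dataset, the 80/20 cut, the LogisticRegression estimator and the grid over C are all assumptions chosen for illustration; only PredefinedSplit and cv=ps come from the answer above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Hypothetical split: the first 80 samples are always training data (-1),
# the last 20 samples form the single validation fold (fold index 0).
your_test_fold = np.concatenate([np.full(80, -1), np.zeros(20, dtype=int)])
ps = PredefinedSplit(test_fold=your_test_fold)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid, not from the post
    cv=ps,
)
search.fit(X, y)

# With one fold index there is exactly one train/validation split.
n_splits = ps.get_n_splits()
train_idx, val_idx = next(ps.split())
```

Keep in mind that with the default refit=True, GridSearchCV refits the best configuration on all of X, including the validation samples.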

What is the difference between cross-validation and grid search?

April 11, 2023 by Tarik

Cross-validation is when you reserve part of your data to use in evaluating your model. There are different cross-validation methods. The simplest conceptually is to just take 70% (just making up a number here, it doesn’t have to be 70%) of your data and use that for training, and then use the remaining 30% of …
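To make the distinction concrete, here is a rough sketch on the iris dataset. The KNeighborsClassifier model and the candidate values for n_neighbors are arbitrary choices for illustration: a hold-out split or k-fold cross-validation scores one fixed configuration, while grid search runs that cross-validation once per candidate configuration and keeps the best one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Simplest scheme: a single 70/30 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
holdout_score = model.score(X_test, y_test)

# k-fold cross-validation: scores ONE fixed configuration on 5 splits.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

# Grid search: cross-validates EVERY candidate and keeps the best one.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
```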

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

April 4, 2023 by Tarik

You have to fit your data before you can get the best parameter combination. from sklearn.model_selection import GridSearchCV from sklearn.datasets import make_classification from sklearn.ensemble import RandomForestClassifier # Build a classification task using 3 informative features X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=0, n_repeated=0, n_classes=2, random_state=0, shuffle=False) rfc = RandomForestClassifier(n_jobs=-1, max_features="sqrt", n_estimators=50, oob_score=True) param_grid = …
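A complete, smaller-scale sketch of the same idea follows. The param_grid here is hypothetical (the one in the post is cut off), and the sample and tree counts are reduced so it runs quickly. The key point is that best_estimator_ only exists after fit has been called.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Build a classification task using 3 informative features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)

rfc = RandomForestClassifier(n_estimators=20, max_features="sqrt", random_state=0)

# Hypothetical grid, chosen only for illustration.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)

# Before fitting, the search object has no best_estimator_ attribute yet.
has_best_before_fit = hasattr(search, "best_estimator_")

search.fit(X, y)

# After fitting, the refitted best model and its parameters are available.
best_model = search.best_estimator_
best_params = search.best_params_
```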

scikit-learn cross validation, negative values with mean squared error

March 13, 2023 by Tarik

To close this out, I am providing the answer that David and larsmans have eloquently described in the comments section: Yes, this is supposed to happen. The actual MSE is simply the positive version of the number you’re getting. The unified scoring API always maximizes the score, so scores which need to be minimized …
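A small sketch of what that looks like in practice, assuming a recent scikit-learn where the scorer is named "neg_mean_squared_error" (older releases used "mean_squared_error"). The regression data and the LinearRegression model are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical noisy regression problem.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# The scorer returns the NEGATED mean squared error, so higher is better.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover the ordinary, non-negative MSE per fold.
mse = -neg_mse
```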

difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

February 15, 2023 by Tarik

In StratifiedKFold, the test sets never overlap, even when shuffle=True. With StratifiedKFold and shuffle=True, the data is shuffled once at the start and then divided into the desired number of splits. The test data is always one of the splits; the train data is the rest. In ShuffleSplit, the data is shuffled …
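The difference can be checked with a quick sketch on made-up data. The sizes are chosen so that three 50% test sets drawn from 20 samples must share at least some indices, while the StratifiedKFold test sets partition the data exactly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Hypothetical balanced data: 20 samples, 10 per class.
X = np.zeros((20, 1))
y = np.array([0] * 10 + [1] * 10)

# StratifiedKFold: shuffle once, then cut into 4 disjoint test folds.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
kfold_test = np.concatenate([test for _, test in skf.split(X, y)])

# StratifiedShuffleSplit: every split is drawn independently,
# so the same sample can land in several test sets.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
shuffle_test = np.concatenate([test for _, test in sss.split(X, y)])

# Number of distinct samples that appear in any test set.
kfold_unique = len(np.unique(kfold_test))
shuffle_unique = len(np.unique(shuffle_test))
```

Every sample appears in exactly one StratifiedKFold test fold, whereas the 30 test slots of the shuffle splits have to reuse some of the 20 samples.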