Sklearn StratifiedKFold: ValueError: Supported target types are: (‘binary’, ‘multiclass’). Got ‘multilabel-indicator’ instead

keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs: split(X, y, groups=None) […] y : array-like, shape (n_samples,) The target variable for supervised learning problems. Stratification is done based on the y labels. i.e. your … Read more
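
A minimal sketch of one common workaround, which is not necessarily the one the truncated answer goes on to give: recover 1-D class indices with argmax, pass those to the splitter, and keep using the one-hot targets for training. The toy data and the names labels, y_onehot and y_indices are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 samples, 3 classes, one-hot encoded labels
# (the same shape keras.utils.to_categorical would produce).
labels = np.array([0, 1, 2] * 4)
y_onehot = np.eye(3)[labels]
X = np.arange(24).reshape(12, 2)

# Recover 1-D class indices; only the splitter needs them.
y_indices = y_onehot.argmax(axis=1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
folds = list(skf.split(X, y_indices))

for train_idx, test_idx in folds:
    # The one-hot targets can still be used for model training.
    y_train, y_test = y_onehot[train_idx], y_onehot[test_idx]
```

Passing y_onehot directly to skf.split is what triggers the error in the title.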

How to split data on balanced training set and test set on sklearn

Although Christian’s suggestion is correct, train_test_split can give you stratified results directly through the stratify param. So you could do: X_train, X_test, y_train, y_test = train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target), with train_test_split imported from sklearn.model_selection (sklearn.cross_validation in releases before 0.18). The catch is that the stratify param is only available from sklearn version 0.17 onward. From the documentation about the parameter stratify: stratify : array-like or None … Read more
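
As a rough illustration of what stratify does, the sketch below splits a made-up imbalanced dataset and compares the class ratio on both sides. The Data and Target arrays are hypothetical stand-ins, and the import assumes a scikit-learn release that has sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1.
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

# Fraction of class 1 in each part; both should match the 20% overall.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
```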

Using explicit (predefined) validation set for grid search with sklearn

Use PredefinedSplit: ps = PredefinedSplit(test_fold=your_test_fold), then set cv=ps in GridSearchCV. From the documentation: test_fold : array-like, shape (n_samples,). test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold. Also see here when … Read more
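
The following sketch shows how a test_fold array might be built for a single fixed validation set. The data, the LogisticRegression estimator and the grid of C values are assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical data: 40 samples, alternating classes, one feature.
y = np.tile([0, 1], 20)
X = y.reshape(-1, 1) + np.linspace(0.0, 0.5, 40).reshape(-1, 1)

# -1: sample always stays in the training set.
#  0: sample belongs to the single validation fold.
test_fold = np.array([-1] * 30 + [0] * 10)
ps = PredefinedSplit(test_fold=test_fold)

# Hypothetical parameter grid, for illustration only.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=ps)
search.fit(X, y)

train_idx, val_idx = next(ps.split())
```

With this test_fold, every candidate is scored on the same last 10 samples, and the first 30 are used for training in that one split.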

What is the difference between cross-validation and grid search?

Cross-validation is when you reserve part of your data to use in evaluating your model. There are different cross-validation methods. The simplest conceptually is to just take 70% (just making up a number here, it doesn’t have to be 70%) of your data and use that for training, and then use the remaining 30% of … Read more

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

You have to fit your data before you can get the best parameter combination.

    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Build a classification task using 3 informative features
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               random_state=0, shuffle=False)

    rfc = RandomForestClassifier(n_jobs=-1, max_features="sqrt", n_estimators=50, oob_score=True)

    param_grid = … Read more
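
Since the excerpt is cut off before the grid is defined, here is a self-contained sketch of the fit-then-inspect pattern. The param_grid values, the smaller sample size and cv=3 are assumptions picked for a quick run, not the values from the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)

rfc = RandomForestClassifier(max_features="sqrt", random_state=0)

# Hypothetical grid, chosen only for illustration.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)

# best_estimator_ and best_params_ only exist after fit() has run.
search.fit(X, y)

best_model = search.best_estimator_
best_params = search.best_params_
```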

scikit-learn cross validation, negative values with mean squared error

Trying to close this out, so am providing the answer that David and larsmans have eloquently described in the comments section: Yes, this is supposed to happen. The actual MSE is simply the positive version of the number you’re getting. The unified scoring API always maximizes the score, so scores which need to be minimized … Read more
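
A small sketch of the sign convention, assuming a scikit-learn release where the scorer is named "neg_mean_squared_error" (0.18 or later). The linear data with a deterministic perturbation is made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical regression data: a line plus a deterministic perturbation.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + np.sin(np.arange(20))

# The scorer returns negated MSE so that "higher is better" still holds.
scores = cross_val_score(LinearRegression(), X, y, cv=4,
                         scoring="neg_mean_squared_error")

# Flip the sign to get the actual per-fold MSE.
mse_per_fold = -scores
```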

difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

In StratifiedKFold, the test sets never overlap, even when shuffle=True. With StratifiedKFold and shuffle=True, the data is shuffled once at the start and then divided into the desired number of splits. The test data is always one of the splits, the train data is the rest. In ShuffleSplit, the data is shuffled … Read more
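
To make the overlap difference concrete, the sketch below collects the test indices produced by each splitter on a made-up dataset. The data, the number of splits and test_size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Hypothetical data: 20 samples, two balanced classes.
X = np.zeros((20, 1))
y = np.array([0, 1] * 10)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
kfold_test = np.concatenate([test for _, test in skf.split(X, y)])

sss = StratifiedShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
shuffle_test = np.concatenate([test for _, test in sss.split(X, y)])

# StratifiedKFold: every sample lands in exactly one test set.
kfold_unique = len(np.unique(kfold_test))

# StratifiedShuffleSplit: each split is drawn independently, so a sample
# may appear in several test sets and others in none.
shuffle_unique = len(np.unique(shuffle_test))
```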
