How to engineer features for machine learning [closed]

Good feature engineering involves two components. The first is an understanding of the properties of the task you’re trying to solve and how they might interact with the strengths and limitations of the classifier you’re using. The second is experimental work, where you test your expectations and find out what actually works and what … Read more
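As a minimal sketch of that experimental loop, the toy example below (entirely hypothetical data) compares cross-validated accuracy before and after adding one engineered interaction feature; the target depends on the product of two features, which a linear model cannot express on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: the label depends on the *product* of two features,
# a pattern a plain linear model cannot capture from the raw columns.
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Baseline: raw features only.
base = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Engineered feature: explicit interaction term x0 * x1.
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])
eng = cross_val_score(LogisticRegression(), X_eng, y, cv=5).mean()

print(f"baseline accuracy: {base:.2f}, with engineered feature: {eng:.2f}")
```

Here the cross-validation score, not intuition alone, decides whether the engineered feature earns its place.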

Unbalanced classification using RandomForestClassifier in sklearn

You can pass the sample_weight argument to Random Forest’s fit method: sample_weight : array-like, shape = [n_samples] or None. Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, … Read more
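A minimal sketch of the idea on a hypothetical 90/10 imbalanced dataset: weight each minority sample by the class ratio so both classes contribute equal total weight to the trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)

# Upweight the minority class by the 9:1 imbalance ratio, so each class
# contributes equal total weight (90 * 1.0 == 10 * 9.0).
weights = np.where(y == 1, 9.0, 1.0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)
```

Alternatively, `class_weight="balanced"` in the constructor computes equivalent weights automatically from the class frequencies.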

Dealing with unbalanced datasets in Spark MLlib

Class weight with Spark ML. As of this very moment, class weighting for the Random Forest algorithm is still under development (see here), but if you’re willing to try other classifiers, this functionality has already been added to Logistic Regression. Consider a case where we have 80% positives (label == 1) in … Read more
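The weighting scheme behind that answer can be sketched without Spark: with 80% positives, give each positive the negatives’ fraction (0.2) as its weight and each negative the positives’ fraction (0.8), so both classes end up with equal total weight. The fractions below are the hypothetical 80/20 split from the excerpt; in Spark this array would become a weight column passed to LogisticRegression’s weightCol parameter.

```python
# Hypothetical dataset matching the excerpt: 80% positives, 20% negatives.
labels = [1] * 80 + [0] * 20

# Fraction of negatives; positives get this as their weight, negatives get
# the complement, so each class contributes equal total weight:
# 80 * 0.2 == 20 * 0.8 == 16.
balancing_ratio = labels.count(0) / len(labels)
weights = [balancing_ratio if y == 1 else 1.0 - balancing_ratio
           for y in labels]

print(sum(w for w, y in zip(weights, labels) if y == 1))  # positive total
print(sum(w for w, y in zip(weights, labels) if y == 0))  # negative total
```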

What is the difference between OneVsRestClassifier and MultiOutputClassifier in scikit learn?

Multiclass classification To better illustrate the differences, let us assume that your goal is that of classifying SO questions into n_classes different, mutually exclusive classes. For the sake of simplicity in this example we will only consider four classes, namely ‘Python’, ‘Java’, ‘C++’ and ‘Other language’. Let us assume that you have a dataset formed … Read more
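A short sketch of the contrast (synthetic data standing in for the four-class question-tagging example): OneVsRestClassifier takes a 1-D multiclass target and predicts exactly one label per sample, while MultiOutputClassifier expects a 2-D target and fits one classifier per output column, for problems where each sample can carry several independent labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the 4-class, mutually exclusive example
# ('Python', 'Java', 'C++', 'Other language').
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# OneVsRestClassifier: one binary classifier per class; prediction is a
# single class label per sample.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:1]).shape)  # (1,) -- one label per sample

# MultiOutputClassifier: one classifier per *output column*; here we recast
# the same problem as 4 independent binary outputs via one-hot columns.
Y = np.column_stack([(y == k).astype(int) for k in range(4)])
moc = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(moc.predict(X[:1]).shape)  # (1, 4) -- one label per output column
```

The shape of `y` is the practical tell: mutually exclusive classes call for OneVsRestClassifier (or any native multiclass estimator), whereas multiple simultaneous labels call for MultiOutputClassifier.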