How to engineer features for machine learning [closed]

Good feature engineering involves two components. The first is an understanding of the properties of the task you’re trying to solve and how they might interact with the strengths and limitations of the classifier you’re using. The second is experimental work, where you test your expectations and find out what actually works and what … Read more
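As a minimal sketch of that experimental loop, the toy example below (entirely hypothetical data) compares cross-validated accuracy before and after adding one engineered interaction feature; the target depends on the product of two features, which a linear model cannot express on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: the label depends on the *product* of two features,
# a pattern a plain linear model cannot capture from the raw columns.
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Baseline: raw features only.
base = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Engineered feature: explicit interaction term x0 * x1.
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])
eng = cross_val_score(LogisticRegression(), X_eng, y, cv=5).mean()

print(f"baseline accuracy: {base:.2f}, with engineered feature: {eng:.2f}")
```

Here the cross-validation score, not intuition alone, decides whether the engineered feature earns its place.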

Unbalanced classification using RandomForestClassifier in sklearn

You can pass the sample_weight argument to Random Forest’s fit method: sample_weight : array-like, shape = [n_samples] or None. Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, … Read more
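A minimal sketch of the idea on a hypothetical 90/10 imbalanced dataset: weight each minority sample by the class ratio so both classes contribute equal total weight to the trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)

# Upweight the minority class by the 9:1 imbalance ratio, so each class
# contributes equal total weight (90 * 1.0 == 10 * 9.0).
weights = np.where(y == 1, 9.0, 1.0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)
```

Alternatively, `class_weight="balanced"` in the constructor computes equivalent weights automatically from the class frequencies.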

Dealing with unbalanced datasets in Spark MLlib

Class weight with Spark ML. As of this very moment, class weighting for the Random Forest algorithm is still under development (see here), but if you’re willing to try other classifiers, this functionality has already been added to Logistic Regression. Consider a case where we have 80% positives (label == 1) in … Read more
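The weighting scheme behind that answer can be sketched without Spark: with 80% positives, give each positive the negatives’ fraction (0.2) as its weight and each negative the positives’ fraction (0.8), so both classes end up with equal total weight. The fractions below are the hypothetical 80/20 split from the excerpt; in Spark this array would become a weight column passed to LogisticRegression’s weightCol parameter.

```python
# Hypothetical dataset matching the excerpt: 80% positives, 20% negatives.
labels = [1] * 80 + [0] * 20

# Fraction of negatives; positives get this as their weight, negatives get
# the complement, so each class contributes equal total weight:
# 80 * 0.2 == 20 * 0.8 == 16.
balancing_ratio = labels.count(0) / len(labels)
weights = [balancing_ratio if y == 1 else 1.0 - balancing_ratio
           for y in labels]

print(sum(w for w, y in zip(weights, labels) if y == 1))  # positive total
print(sum(w for w, y in zip(weights, labels) if y == 0))  # negative total
```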

What is the difference between OneVsRestClassifier and MultiOutputClassifier in scikit learn?

Multiclass classification To better illustrate the differences, let us assume that your goal is that of classifying SO questions into n_classes different, mutually exclusive classes. For the sake of simplicity in this example we will only consider four classes, namely ‘Python’, ‘Java’, ‘C++’ and ‘Other language’. Let us assume that you have a dataset formed … Read more
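A short sketch of the contrast (synthetic data standing in for the four-class question-tagging example): OneVsRestClassifier takes a 1-D multiclass target and predicts exactly one label per sample, while MultiOutputClassifier expects a 2-D target and fits one classifier per output column, for problems where each sample can carry several independent labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the 4-class, mutually exclusive example
# ('Python', 'Java', 'C++', 'Other language').
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# OneVsRestClassifier: one binary classifier per class; prediction is a
# single class label per sample.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:1]).shape)  # (1,) -- one label per sample

# MultiOutputClassifier: one classifier per *output column*; here we recast
# the same problem as 4 independent binary outputs via one-hot columns.
Y = np.column_stack([(y == k).astype(int) for k in range(4)])
moc = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(moc.predict(X[:1]).shape)  # (1, 4) -- one label per output column
```

The shape of `y` is the practical tell: mutually exclusive classes call for OneVsRestClassifier (or any native multiclass estimator), whereas multiple simultaneous labels call for MultiOutputClassifier.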