regression – Page 2 – Tarik Billa

Distinguishing overfitting vs good prediction

December 16, 2023 by Tarik

How would you normally tell that the model is overfitting? One useful rule of thumb is that you may be overfitting when your model's performance on its own training set is much better than on its held-out validation set or in a cross-validation setting. That's not all there is to it, though. The blog entry …
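The rule of thumb above can be checked with a short sketch. The dataset, the unpruned decision tree, and the 5-fold setup are illustrative assumptions, not taken from the original post; they are chosen only because an unpruned tree tends to memorize its training data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)

# An unpruned tree can fit the training set almost perfectly.
model = DecisionTreeRegressor(random_state=0)

# R^2 on the same data the model was trained on.
train_r2 = model.fit(X, y).score(X, y)

# Mean R^2 over 5 held-out folds.
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# A large positive gap is a warning sign of overfitting.
gap = train_r2 - cv_r2
print(f"train R^2 = {train_r2:.3f}, CV R^2 = {cv_r2:.3f}, gap = {gap:.3f}")
```

A large gap is a hint rather than proof; as the excerpt notes, there is more to the diagnosis than this one comparison.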

sklearn LogisticRegression and changing the default threshold for classification

December 14, 2023 by Tarik

I would like to give a practical answer: from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score import numpy as np X, y = make_classification( n_classes=2, class_sep=1.5, weights=[0.9, 0.1], n_features=20, n_samples=1000, random_state=10 ) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) clf = …
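The excerpt is cut off before the classifier is built, so the following is only one common way to apply a custom threshold, not necessarily the author's. It reuses the data setup from the excerpt; the `max_iter=1000` setting, the helper `predict_with_threshold`, and the 0.25 threshold are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same imbalanced data setup as in the excerpt.
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Assumed classifier settings; the original post's are not visible.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each test sample.
proba_pos = clf.predict_proba(X_test)[:, 1]


def predict_with_threshold(proba, threshold):
    """Hypothetical helper: label a sample 1 when its probability >= threshold."""
    return (proba >= threshold).astype(int)


pred_default = predict_with_threshold(proba_pos, 0.5)
pred_low = predict_with_threshold(proba_pos, 0.25)

recall_default = recall_score(y_test, pred_default)
recall_low = recall_score(y_test, pred_low)
precision_default = precision_score(y_test, pred_default, zero_division=0)
precision_low = precision_score(y_test, pred_low, zero_division=0)

print(f"threshold 0.50: recall={recall_default:.3f}, precision={precision_default:.3f}")
print(f"threshold 0.25: recall={recall_low:.3f}, precision={precision_low:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never goes down; precision usually pays for it.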

ValueError: feature_names mismatch: in xgboost in the predict() function

September 26, 2023 by Tarik

This happens when the order of the column names at model-building time differs from the order of the column names at scoring time. I used the following steps to overcome this error. First, load the pickle file: model = pickle.load(open("saved_model_file", "rb")). Then extract all the columns in the order in which they were used: cols_when_model_builds = model.get_booster().feature_names. Then reorder …
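The excerpt stops at the reordering step. A minimal sketch of what such a step can look like with pandas follows; it is an assumption, not the author's code. The column names and the `scoring_df` frame are hypothetical, and the hard-coded list stands in for what `model.get_booster().feature_names` would return from a real fitted model.

```python
import pandas as pd

# Stand-in for model.get_booster().feature_names (hypothetical column names).
cols_when_model_builds = ["age", "income", "score"]

# Scoring data whose columns arrive in a different order (hypothetical data).
scoring_df = pd.DataFrame(
    {"score": [0.2, 0.9], "age": [31, 45], "income": [52000, 61000]}
)

# Fail early with a readable message if a training column is missing.
missing = [c for c in cols_when_model_builds if c not in scoring_df.columns]
if missing:
    raise ValueError(f"Scoring data is missing columns: {missing}")

# Select the columns in training order; this also drops any extra columns.
reordered_df = scoring_df[cols_when_model_builds]
print(list(reordered_df.columns))
```

The reordered frame is what would then be passed to the model's prediction call.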

Scikit-learn cross validation scoring for regression

September 20, 2023 by Tarik

I don't have the reputation to comment, but I want to provide this link for you and/or passersby, where the negative output of the MSE in scikit-learn is discussed: https://github.com/scikit-learn/scikit-learn/issues/2439 In addition (to make this a real answer), your first option is correct in that not only is MSE the metric you …
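To make the sign convention concrete, here is a small sketch using the `neg_mean_squared_error` scorer. The synthetic data and the linear model are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# scikit-learn scorers follow "greater is better", so MSE is reported negated.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover the usual, non-negative MSE per fold.
mse = -neg_mse
rmse = np.sqrt(mse)

print("per-fold MSE: ", np.round(mse, 2))
print("per-fold RMSE:", np.round(rmse, 2))
```

The negation only exists so that model-selection tools can always maximize the score; the underlying metric is still ordinary MSE.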

What is the difference between xgb.train and xgb.XGBRegressor (or xgb.XGBClassifier)?

September 18, 2023 by Tarik

xgboost.train is the low-level API to train the model via the gradient boosting method. xgboost.XGBRegressor and xgboost.XGBClassifier are the wrappers (Scikit-Learn-like wrappers, as they call it) that prepare the DMatrix and pass in the corresponding objective function and parameters. In the end, the fit call simply boils down to: self._Booster = train(params, dmatrix, self.n_estimators, evals=evals, early_stopping_rounds=early_stopping_rounds, …

Show confidence limits and prediction limits in scatter plot

September 15, 2023 by Tarik

Here’s what I put together. I tried to closely emulate your screenshot. Given: import numpy as np import scipy as sp import scipy.stats as stats import matplotlib.pyplot as plt %matplotlib inline # Raw Data heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65]) weights = np.array([25,50,55,75,80,85,50,65,85,55,45,45,50,75,95,65,50,40,45]) Two detailed options to plot confidence intervals: def plot_ci_manual(t, s_err, n, x, x2, y2, ax=None): …
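The plotting functions are cut off in the excerpt, so the sketch below shows only the numbers behind such bands, using the textbook formulas for a simple linear fit. It reuses the excerpt's data and the names `t`, `s_err`, `n`, `x`, `x2`, `y2`; the 95% level and the 100-point grid are assumptions, and this is not the author's `plot_ci_manual`.

```python
import numpy as np
import scipy.stats as stats

# Raw data from the excerpt.
heights = np.array([50, 52, 53, 54, 58, 60, 62, 64, 66, 67, 68, 70, 72, 74, 76, 55, 50, 45, 65])
weights = np.array([25, 50, 55, 75, 80, 85, 50, 65, 85, 55, 45, 45, 50, 75, 95, 65, 50, 40, 45])

x, y = heights, weights
n = x.size

# Straight-line fit: y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

dof = n - 2                                # two fitted parameters
s_err = np.sqrt(np.sum(resid**2) / dof)    # residual standard error
t = stats.t.ppf(0.975, dof)                # two-sided 95% critical value (assumed level)

# Grid on which the bands are evaluated.
x2 = np.linspace(x.min(), x.max(), 100)
y2 = slope * x2 + intercept

# Shared leverage term: grows as x2 moves away from the mean of x.
leverage = 1.0 / n + (x2 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

ci = t * s_err * np.sqrt(leverage)         # half-width of the confidence band (mean response)
pi = t * s_err * np.sqrt(1.0 + leverage)   # half-width of the prediction band (new observation)

print(f"narrowest CI half-width: {ci.min():.2f}")
print(f"narrowest PI half-width: {pi.min():.2f}")
```

The bands would be drawn as `y2 - ci` to `y2 + ci` and `y2 - pi` to `y2 + pi`; the prediction band is always the wider of the two because it also accounts for the scatter of a single new observation.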

Difference between cross_val_score and cross_val_predict

September 11, 2023 by Tarik

cross_val_score returns the score of each test fold, whereas cross_val_predict returns the predicted y values for the test fold. With cross_val_score(), you are using the average of the output, which is affected by the number of folds, because some folds may have a high error (not fit correctly). Whereas cross_val_predict() returns, for …
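The difference in what the two functions return can be seen from the shapes of their outputs. The data, the linear model, and the shuffled 5-fold splitter below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold: an array of length 5.
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

# One out-of-fold prediction per sample: an array of length 200.
oof_pred = cross_val_predict(model, X, y, cv=cv)

# A single score computed on the pooled out-of-fold predictions.
pooled_r2 = r2_score(y, oof_pred)

print("fold scores:", fold_scores.round(3), "mean:", round(fold_scores.mean(), 3))
print("pooled R^2 from cross_val_predict:", round(pooled_r2, 3))
```

The mean of the per-fold scores and the pooled score are generally close but not identical, since they aggregate the same predictions in different ways.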

GridSearchCV – XGBoost – Early Stopping

September 1, 2023 by Tarik

When using early_stopping_rounds, you also have to give eval_metric and eval_set as input parameters for the fit method. Early stopping is done by calculating the error on an evaluation set. The error has to decrease at least once every early_stopping_rounds rounds; otherwise, the generation of additional trees is stopped early. See the documentation of xgboost's fit method for details. …

predict.lm() with an unknown factor level in test data

August 17, 2023 by Tarik

how to use the Box-Cox power transformation in R

August 8, 2023 by Tarik