adding words to stop_words list in TfidfVectorizer in sklearn

This is how you can do it:

    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])
    vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
    X = vectorizer.fit_transform(["this is an apple.", "this is a book."])
    idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)
    # printing the vocabulary
    print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for … Read more
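One caveat: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of get_feature_names_out(), so on a recent install the idf lookup line becomes:

    # scikit-learn >= 1.0: use get_feature_names_out() instead of get_feature_names()
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))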

Predicting how long a scikit-learn classification will take to run

There are very specific classes of classifiers or regressors that directly report the remaining time or progress of your algorithm (number of iterations, etc.). Most of this can be turned on by passing the verbose=2 option (any number greater than 1) to the constructor of the individual model. Note: this behaviour is as of sklearn 0.14. Earlier versions have … Read more
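For example, an iterative ensemble model will log each boosting iteration as it trains when verbose is raised, which gives a rough sense of how far along the fit is. A minimal sketch; the dataset here is synthetic and only for illustration:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # verbose=2 prints progress for every tree/iteration during fit()
    clf = GradientBoostingClassifier(n_estimators=200, verbose=2)
    clf.fit(X, y)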

General approach to developing an image classification algorithm for Dilbert cartoons

So I think you are on the right track w/r/t your step 1 (apply some algorithm to the image, which converts it into a set of features). This project is more challenging than most ML problems because here you will actually have to create your training data set from the raw data (the individual frames … Read more
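As a deliberately simple illustration of that step 1, a raw frame can be reduced to a flat numeric feature vector before any classifier sees it. A minimal sketch; the file name is hypothetical and the 64x64 grayscale representation is just one possible choice of features:

    import numpy as np
    from PIL import Image

    # load one cartoon frame (hypothetical path), shrink it to a fixed size,
    # and flatten the grayscale pixels into a feature vector
    frame = Image.open("frame_0001.png").convert("L").resize((64, 64))
    features = np.asarray(frame, dtype=np.float32).ravel() / 255.0  # shape: (4096,)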

Precision/recall for multiclass-multilabel classification

For multi-label classification you have two ways to go. First consider the following: n is the number of examples, Y_i is the ground-truth label assignment of the i-th example, x_i is the i-th example, and h(x_i) is the set of labels predicted for the i-th example. Example based: the metrics are computed in a per-datapoint manner. For each predicted label, only its … Read more
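A minimal sketch of the example-based (per-datapoint) flavour, using scikit-learn's multilabel indicator format; the label matrices below are made up:

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    # rows = examples, columns = labels (1 = label assigned)
    y_true = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0]])
    y_pred = np.array([[1, 0, 0],
                       [0, 1, 1],
                       [1, 0, 0]])

    # average='samples' computes precision/recall per example and then averages,
    # which is the example-based approach described above
    print(precision_score(y_true, y_pred, average='samples'))
    print(recall_score(y_true, y_pred, average='samples'))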

Compute class weight function issue in ‘sklearn’ library when used in ‘Keras’ classification (Python 3.8, only in VS code)

After spending a lot of time, this is how I fixed it. I still don't know why, but when the code is modified as follows, it works fine. I got the idea after seeing this solution for a similar but slightly different issue.

    class_weights = compute_class_weight(
        class_weight="balanced",
        classes=np.unique(train_classes),
        y=train_classes … Read more
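For context, in recent scikit-learn releases compute_class_weight takes classes and y as keyword-only arguments, which is presumably why the keyword form works. A self-contained sketch with made-up labels, including the conversion into the dict that Keras expects for its class_weight argument:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    train_classes = np.array([0, 0, 0, 1, 1, 2])  # dummy labels for illustration

    class_weights = compute_class_weight(
        class_weight="balanced",
        classes=np.unique(train_classes),
        y=train_classes,
    )

    # Keras' model.fit(..., class_weight=...) expects a {class_index: weight} dict
    class_weight_dict = dict(enumerate(class_weights))
    print(class_weight_dict)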

sklearn LogisticRegression and changing the default threshold for classification

I would like to give a practical answer:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score
    import numpy as np

    X, y = make_classification(
        n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
        n_features=20, n_samples=1000, random_state=10
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    clf = … Read more
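The snippet is cut off above, but the usual way to move the decision threshold is to classify on predict_proba rather than predict. A minimal sketch continuing from the variables and imports defined in the snippet above; the 0.3 threshold is an arbitrary choice for illustration:

    clf = LogisticRegression().fit(X_train, y_train)

    # probability of the positive class for each test sample
    proba = clf.predict_proba(X_test)[:, 1]

    # apply a custom threshold instead of the default 0.5
    custom_threshold = 0.3
    y_pred_custom = (proba >= custom_threshold).astype(int)

    print(confusion_matrix(y_test, y_pred_custom))
    print(recall_score(y_test, y_pred_custom), precision_score(y_test, y_pred_custom))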

What are the 15 classifications of types in C++?

I spoke with Walter directly, and it was simply a miscount. “Alas, I realized shortly thereafter that I’d miscounted and hence committed an off-by-one error during the talk: there are 14 (not 15) type classifications. See the list of primary type category predicates in clause [meta.unary.cat] in the C++ standard; these correspond to the classifications … Read more