text-classification – Tarik Billa

adding words to stop_words list in TfidfVectorizer in sklearn

April 11, 2024 by Tarik

This is how you can do it:

    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # recent scikit-learn versions expect stop_words to be a list, not a frozenset
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

    vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words=my_stop_words)
    X = vectorizer.fit_transform(["this is an apple.", "this is a book."])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)

    # printing the vocabulary
    print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for … Read more

Information Gain calculation with Scikit-learn

December 27, 2023 by Tarik

You can use scikit-learn’s mutual_info_classif. Here is an example:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.feature_extraction.text import CountVectorizer

    categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
    newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)

    X, Y = newsgroups_train.data, newsgroups_train.target
    cv = CountVectorizer(max_df=0.95, min_df=2, max_features=10000, stop_words="english")
    X_vec = cv.fit_transform(X)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    res = dict(zip(cv.get_feature_names_out(),
                   mutual_info_classif(X_vec, Y, discrete_features=True)))
    print(res)

This will output … Read more
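To make the quantity behind the excerpt above concrete, here is a minimal pure-Python sketch of mutual information between one discrete feature and the class label, measured in nats. The function name `mutual_information` and the toy data are hypothetical and only for illustration; this is not scikit-learn's implementation, although for discrete features it should correspond to the same definition.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), n_fy in joint.items():
        # p(f, y) * log( p(f, y) / (p(f) * p(y)) ), written with raw counts
        mi += (n_fy / n) * math.log(n_fy * n / (feature_counts[f] * label_counts[y]))
    return mi


# Hypothetical toy corpus: 1 if the word occurs in the document, else 0.
labels = ["space", "space", "graphics", "graphics"]
has_orbit = [1, 1, 0, 0]  # occurs only in "space" documents
has_the = [1, 0, 1, 0]    # occurs equally often in both classes

mi_orbit = mutual_information(has_orbit, labels)
mi_the = mutual_information(has_the, labels)
print(mi_orbit, mi_the)
```

A word that separates the classes perfectly gets a high score, while a word spread evenly across classes scores zero, which is why the resulting dictionary can be sorted to rank features.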

Multilabel Text Classification using TensorFlow

September 23, 2023 by Tarik

Change the activation of the output layer from relu to sigmoid, and replace the cross-entropy loss with the explicit mathematical formula for sigmoid cross-entropy loss (the explicit loss worked in my case and my version of TensorFlow).

    import tensorflow as tf

    # hidden Layer
    class HiddenLayer(object):
        def __init__(self, input, n_in, n_out):
            self.input = input
            w_h = tf.Variable(tf.random_normal([n_in, n_out], mean=0.0, stddev=0.05))

… Read more
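As an illustration of the explicit sigmoid cross-entropy formula the excerpt above refers to, here is a framework-free sketch in plain Python. It is not the author's TensorFlow code: the function names and the example logits and targets are assumptions made for this sketch. Each label gets its own independent sigmoid, which is what allows several labels to be active at once.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_cross_entropy(logit, target):
    """Stable form of -[z * log(s(x)) + (1 - z) * log(1 - s(x))]."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multilabel_loss(logits, targets):
    """Mean of the per-label losses for a single example."""
    per_label = [sigmoid_cross_entropy(x, z) for x, z in zip(logits, targets)]
    return sum(per_label) / len(per_label)


# Hypothetical example with three labels; labels 0 and 2 are both active.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]

loss = multilabel_loss(logits, targets)
probs = [sigmoid(x) for x in logits]  # independent per-label probabilities
print(loss, probs)
```

Unlike softmax outputs, the per-label probabilities here do not have to sum to one.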

ROC for multiclass classification

June 14, 2023 by Tarik

As people mentioned in the comments, you have to convert your problem into a set of binary problems using the one-vs-all (OneVsAll) approach, so you’ll have n_class ROC curves. A simple example:

    from sklearn.metrics import roc_curve, auc
    from sklearn import datasets
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.preprocessing import label_binarize
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot … Read more
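The one-vs-all idea above can be sketched without any library: for each class, relabel the samples as "this class" versus "everything else" and compute one ROC curve from that class's scores. The helpers `roc_points` and `auc_trapezoid` and the score matrix below are hypothetical, written for illustration only; they are not scikit-learn's `roc_curve` or `auc`.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists for a binary problem, sweeping thresholds high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i]:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties share one threshold.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


# Hypothetical true labels and per-class scores for 6 samples and 3 classes.
y = [0, 0, 1, 1, 2, 2]
scores = [
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]

n_classes = 3
roc_auc = {}
for c in range(n_classes):
    y_bin = [1 if label == c else 0 for label in y]  # class c versus the rest
    class_scores = [row[c] for row in scores]
    fpr, tpr = roc_points(y_bin, class_scores)
    roc_auc[c] = auc_trapezoid(fpr, tpr)

print(roc_auc)
```

Each `(fpr, tpr)` pair could be passed to a plotting library to draw one curve per class.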

How to use Bert for long text classification?

February 24, 2023 by Tarik

You have basically three options:

1. You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
2. You can split your text into multiple subtexts, classify each of them and combine the results … Read more
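The first two options above can be sketched in plain Python on a list of tokens. Everything here is illustrative: the helper names, the overlap (`stride`), and the choice to combine chunk predictions by averaging are assumptions of this sketch, and the per-chunk probabilities are made-up stand-ins for the output of a real classifier.

```python
def split_into_chunks(tokens, max_len=512, stride=256):
    """Split tokens into windows of at most max_len tokens that start every `stride` tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


def combine_by_mean(chunk_probs):
    """Average class probabilities over chunks (one assumed way to combine results)."""
    n = len(chunk_probs)
    return [sum(column) / n for column in zip(*chunk_probs)]


# Hypothetical long document of 1200 tokens.
tokens = [f"tok{i}" for i in range(1200)]

# Option 1: truncate and keep only the first 512 tokens.
# (In practice a couple of positions are reserved for special tokens such as [CLS] and [SEP].)
truncated = tokens[:512]

# Option 2: split into overlapping subtexts, classify each, and combine.
chunks = split_into_chunks(tokens)
chunk_probs = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2]]  # made-up classifier outputs
document_probs = combine_by_mean(chunk_probs)
print(len(chunks), document_probs)
```

Overlapping windows reduce the chance that a sentence relevant to the label is cut in half at a chunk boundary; other ways of combining, such as taking the maximum or a majority vote, fit the same structure.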

How can I plot a confusion matrix?

November 22, 2022 by Tarik

You can use plt.matshow() instead of plt.imshow(), or you can use the seaborn module’s heatmap (see documentation) to plot the confusion matrix:

    import seaborn as sn
    import pandas as pd
    import matplotlib.pyplot as plt

    array = [[33,2,0,0,0,0,0,0,0,1,3],
             [3,31,0,0,0,0,0,0,0,0,0],
             [0,4,41,0,0,0,0,0,0,0,1],
             [0,1,0,30,0,6,0,0,0,0,1],
             [0,0,0,0,38,10,0,0,0,0,0],
             [0,0,0,3,1,39,0,0,0,0,4],
             [0,2,2,0,4,1,31,0,0,0,2],
             [0,1,0,0,0,0,0,36,0,2,0],
             [0,0,0,0,0,0,1,5,37,5,1],
             [3,0,0,0,0,0,0,0,0,39,0],
             [0,0,0,0,0,0,0,0,0,0,38]]
    df_cm = pd.DataFrame(array, index = [i for i in … Read more