categorical-data – Page 2

How to handle categorical features with spark-ml?

May 10, 2023 by Tarik

I just wanted to complete Holden’s answer. Since Spark 2.3.0, OneHotEncoder has been deprecated, and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. (In Spark 3.0.0 the old class was removed and OneHotEncoderEstimator was renamed to OneHotEncoder.) In Scala:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

    val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
    val … Read more
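To make the two stages named in this excerpt concrete, here is a minimal pure-Python sketch of what a string indexer followed by a one-hot encoder computes. It does not call Spark. The helper names `fit_string_index` and `one_hot` are hypothetical, and the sketch assumes Spark's documented defaults (labels indexed by descending frequency, last category dropped from the one-hot vector).

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption mirroring Spark's default).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1 if position == index else 0 for position in range(size)]


# The category1 column from the excerpt's DataFrame.
category1 = ["a", "b", "c", "a", "a", "c"]

mapping = fit_string_index(category1)
encoded = [one_hot(mapping[value], len(mapping)) for value in category1]

print(mapping)
print(encoded)
```

Because "b" is the least frequent label it receives the last index, so with the last category dropped it is represented by the all-zero vector.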

Make Frequency Histogram for Factor Variables

April 4, 2023 by Tarik

XGBoost Categorical Variables: Dummification vs encoding

April 1, 2023 by Tarik

XGBoost only deals with numeric columns. Suppose you have a feature [a,b,b,c] that describes a categorical variable (i.e. its values have no numeric relationship). Using LabelEncoder, you will simply get this:

    array([0, 1, 1, 2])

XGBoost will wrongly interpret this feature as having a numeric relationship! LabelEncoder just maps each string ('a', 'b', 'c') to an integer, nothing more. Proper … Read more
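The point about a spurious numeric relationship can be illustrated with a small pure-Python sketch. It does not use scikit-learn or XGBoost. It compares distances between categories under integer labels and under a one-hot layout, which is one common alternative and is shown here only as an assumption about what a non-ordinal encoding looks like.

```python
from itertools import combinations

labels = ["a", "b", "b", "c"]
classes = sorted(set(labels))

# Integer label encoding: each class gets its position in sorted order.
label_codes = {c: i for i, c in enumerate(classes)}

# One-hot encoding: each class gets its own 0/1 indicator vector.
one_hot = {
    c: [1 if i == j else 0 for j in range(len(classes))]
    for i, c in enumerate(classes)
}


def label_distance(x, y):
    """Absolute difference between the integer codes of two classes."""
    return abs(label_codes[x] - label_codes[y])


def one_hot_distance(x, y):
    """Manhattan distance between the indicator vectors of two classes."""
    return sum(abs(p - q) for p, q in zip(one_hot[x], one_hot[y]))


label_gaps = {pair: label_distance(*pair) for pair in combinations(classes, 2)}
one_hot_gaps = {pair: one_hot_distance(*pair) for pair in combinations(classes, 2)}

print([label_codes[v] for v in labels])
print(label_gaps)
print(one_hot_gaps)
```

With integer labels, 'a' ends up twice as far from 'c' as from 'b', an ordering the data never contained. With indicator vectors, every pair of classes is equally far apart.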

Any way to get mappings of a label encoder in Python pandas?

March 31, 2023 by Tarik

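You can create an additional dictionary with the mapping:

    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    le.fit(data['name'])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    print(le_name_mapping)

For a column holding the names Tom, Nick and Kate, this gives the following mapping (LabelEncoder sorts classes_, so the codes follow alphabetical order):

    {'Kate': 0, 'Nick': 1, 'Tom': 2}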

Scikit-learn’s LabelBinarizer vs. OneHotEncoder

March 29, 2023 by Tarik

A simple example that encodes an array using LabelEncoder, OneHotEncoder, and LabelBinarizer is shown below. I see that OneHotEncoder needs the data in integer-encoded form first before it can convert it into its respective encoding, which is not required in the case of LabelBinarizer. (Since scikit-learn 0.20, OneHotEncoder also accepts string categories directly.)

    from numpy import array
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import … Read more

Plotting with ggplot2: “Error: Discrete value supplied to continuous scale” on categorical y-axis

February 16, 2023 by Tarik

As mentioned in the comments, there cannot be a continuous scale on a variable of the factor type. You could change the factor to numeric as follows, just after you define the meltDF variable:

    meltDF$variable = as.numeric(levels(meltDF$variable))[meltDF$variable]

Then execute the ggplot command:

    ggplot(meltDF[meltDF$value == 1,]) + geom_point(aes(x = MW, y = variable)) + scale_x_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, … Read more

pandas dataframe convert column type to string or categorical

January 24, 2023 by Tarik

You need astype:

    df['zipcode'] = df.zipcode.astype(str)
    #df.zipcode = df.zipcode.astype(str)

For converting to categorical:

    df['zipcode'] = df.zipcode.astype('category')
    #df.zipcode = df.zipcode.astype('category')

Another solution is Categorical:

    df['zipcode'] = pd.Categorical(df.zipcode)

Sample with data:

    import pandas as pd

    df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155}, 'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, … Read more

Pandas: convert categories to numbers

December 20, 2022 by Tarik

First, change the type of the column:

    df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

    df['code'] = df.cc.cat.codes

Now you have:

       cc  temp  code
    0  US  37.0     2
    1  CA  12.0     1
    2  US  35.0     2
    3  AU  20.0     0

If you don’t want to modify … Read more
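Where the codes in this table come from can be sketched without pandas. The snippet below is a pure-Python illustration, assuming (as pandas does when no explicit category list is given) that the categories are the sorted unique values and that each code is a value's position in that list. The variable names are hypothetical.

```python
# The cc column from the table above.
cc = ["US", "CA", "US", "AU"]

# Categories are the sorted unique values; a code is the index into this list.
categories = sorted(set(cc))
code_of = {category: index for index, category in enumerate(categories)}

codes = [code_of[value] for value in cc]

# The original labels can be recovered by indexing back into the categories.
decoded = [categories[code] for code in codes]

print(categories)
print(codes)
```

The resulting codes match the `code` column shown above, and the round trip through `categories` returns the original labels.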

How to force R to use a specified factor level as reference in a regression?

November 21, 2022 by Tarik

See the relevel() function. Here is an example:

    set.seed(123)
    x <- rnorm(100)
    DF <- data.frame(x = x, y = 4 + (1.5*x) + rnorm(100, sd = 2), b = gl(5, 20))
    head(DF)
    str(DF)

    m1 <- lm(y ~ x + b, data = DF)
    summary(m1)

Now alter the factor b in DF by use of the … Read more
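What changing the reference level does to the regression's dummy columns can be sketched in Python. This is not R's implementation. The helpers `relevel` and `treatment_columns` are hypothetical stand-ins, and the sketch assumes treatment contrasts, where the first factor level is the baseline and every other level gets its own 0/1 column.

```python
def relevel(levels, ref):
    """Move `ref` to the front of the level list, keeping the others in order."""
    if ref not in levels:
        raise ValueError(f"{ref!r} is not an existing level")
    return [ref] + [level for level in levels if level != ref]


def treatment_columns(values, levels):
    """Build one 0/1 column per non-reference level (the first level is the baseline)."""
    return {
        level: [1 if value == level else 0 for value in values]
        for level in levels[1:]
    }


levels = ["1", "2", "3", "4", "5"]   # the levels produced by gl(5, 20)
values = ["1", "3", "5", "3"]        # hypothetical observations of the factor b

default_cols = treatment_columns(values, levels)
releveled = relevel(levels, "3")
new_cols = treatment_columns(values, releveled)

print(sorted(default_cols))
print(releveled)
print(sorted(new_cols))
```

With the default ordering, level "1" has no column and acts as the baseline. After moving "3" to the front, "3" becomes the baseline and "1" gets a column of its own, so the fitted coefficients are then read as differences from level "3".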