How to handle categorical features with spark-ml?

I just wanted to complete Holden’s answer. Since Spark 2.3.0, OneHotEncoder has been deprecated, and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

    val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
    val …

XGBoost Categorical Variables: Dummification vs encoding

XGBoost only deals with numeric columns. If you have a feature [a, b, b, c] that describes a categorical variable (i.e. one with no numeric relationship between its values), LabelEncoder will simply give you array([0, 1, 1, 2]). XGBoost will wrongly interpret this feature as having a numeric relationship! LabelEncoder just maps each string ('a', 'b', 'c') to an integer, nothing more. Proper …
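The contrast named in the question title can be sketched in plain Python on the same toy feature. This is only an illustration of the two encodings, not the truncated answer's own code; the variable names are hypothetical, and sorting the categories is an assumption chosen to match the array([0, 1, 1, 2]) shown above.

```python
# Toy feature from the text: categories with no numeric relationship.
feature = ["a", "b", "b", "c"]

# Label encoding: map each distinct category to an integer.
# Categories are sorted here so the result matches array([0, 1, 1, 2]).
categories = sorted(set(feature))
index_of = {cat: i for i, cat in enumerate(categories)}
label_encoded = [index_of[value] for value in feature]

# One-hot ("dummy") encoding: one 0/1 column per category, so no
# ordering between 'a', 'b' and 'c' is implied.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in feature]

print(label_encoded)
for row in one_hot:
    print(row)
```

With the label-encoded column a model sees 'c' (2) as "greater than" 'a' (0); with the one-hot columns each category is just present or absent.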

Scikit-learn’s LabelBinarizer vs. OneHotEncoder

A simple example that encodes an array using LabelEncoder, OneHotEncoder, and LabelBinarizer is shown below. I see that OneHotEncoder needs the data in integer-encoded form before it can convert it into its respective encoding, which is not required in the case of LabelBinarizer.

    from numpy import array
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import …
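Since the excerpt's own example is cut off, here is a minimal sketch of the same comparison. The sample values are hypothetical. One caveat on the observation above: it describes older scikit-learn releases; to the best of my knowledge, from version 0.20 onward OneHotEncoder also accepts string categories directly, which the sketch assumes.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Hypothetical sample data (not from the original answer).
values = np.array(["cold", "warm", "hot", "cold"])

# Step 1: LabelEncoder maps each string to an integer (classes are sorted).
integer_encoded = LabelEncoder().fit_transform(values)

# Step 2: OneHotEncoder expects a 2-D column, hence the reshape.
# It returns a sparse matrix by default, so convert it for display.
onehot_from_ints = (
    OneHotEncoder().fit_transform(integer_encoded.reshape(-1, 1)).toarray()
)

# Assumes scikit-learn >= 0.20: strings can be passed without step 1.
onehot_from_strings = OneHotEncoder().fit_transform(values.reshape(-1, 1)).toarray()

# LabelBinarizer goes from the 1-D string array to indicator columns directly.
binarized = LabelBinarizer().fit_transform(values)

print(integer_encoded)
print(onehot_from_ints)
print(binarized)
```

For three or more classes the indicator matrices agree; the practical difference is that LabelBinarizer is meant for a single 1-D label array, while OneHotEncoder works on 2-D feature matrices.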

Plotting with ggplot2: “Error: Discrete value supplied to continuous scale” on categorical y-axis

As mentioned in the comments, a continuous scale cannot be used on a variable of the factor type. You could convert the factor to numeric as follows, just after you define the meltDF variable:

    meltDF$variable = as.numeric(levels(meltDF$variable))[meltDF$variable]

Then execute the ggplot command:

    ggplot(meltDF[meltDF$value == 1,]) + geom_point(aes(x = MW, y = variable)) + scale_x_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, …

pandas dataframe convert column type to string or categorical

You need astype:

    df['zipcode'] = df.zipcode.astype(str)
    #df.zipcode = df.zipcode.astype(str)

For converting to categorical:

    df['zipcode'] = df.zipcode.astype('category')
    #df.zipcode = df.zipcode.astype('category')

Another solution is Categorical:

    df['zipcode'] = pd.Categorical(df.zipcode)

Sample with data:

    import pandas as pd
    df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155}, 'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, …
