Updating a DataFrame column in Spark

While you cannot modify a column in place, you can operate on a column and return a new DataFrame reflecting that change. To do so, first create a UserDefinedFunction implementing the operation, then apply that function to the targeted column only. In Python:
    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import StringType
    …

Coerce multiple columns to factors at once

Choose some columns to coerce to factors:
    cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
    data[cols] <- lapply(data[cols], factor)  ## as.factor() could also be used
Check the result:
    sapply(data, class)
    #         A         B         C         D         E         F         G
    #  "factor" "integer"  "factor"  "factor" "integer" "integer" "integer"
    #         H …

Removing display of row names from data frame

You have successfully removed the row names. The print.data.frame method just shows the row numbers if no row names are present.
    df1 <- data.frame(values = rnorm(3), group = letters[1:3],
                      row.names = paste0("RowName", 1:3))
    print(df1)
    #            values group
    #RowName1 -1.469809     a
    #RowName2 -1.164943     b
    #RowName3  0.899430     c
    rownames(df1) <- NULL
    print(df1)
    #     values group
    #1 -1.469809 …

R Apply() function on specific dataframe columns

lapply is probably a better choice than apply here, as apply first coerces your data.frame to an array, which means all the columns must have the same type. Depending on your context, this could have unintended consequences. The pattern is:
    df[cols] <- lapply(df[cols], FUN)
The 'cols' vector can be variable names or indices. I prefer …

Create a data.frame where a column is a list

Slightly obscurely, from ?data.frame:
    If a list or data frame or matrix is passed to 'data.frame' it is as if each component or column had been passed as a separate argument (except for matrices of class '"model.matrix"' and those protected by 'I').
(Emphasis added.) So the following seems to work:
    data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

Elegant way to create empty pandas DataFrame with NaN of type float

Simply pass the desired value as the first argument, like 0, math.inf or, here, np.nan. The constructor then initializes and fills the value array to the size specified by the arguments index and columns:
    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
    >>> df
         A …
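
As a quick check of the claim about the resulting type, the sketch below builds the same frame and inspects its shape and dtypes. It also builds a second frame filled with `math.inf` for comparison; the variable names `df_nan` and `df_inf` are illustrative and not from the original.

```python
import math

import numpy as np
import pandas as pd

# Scalar first argument: pandas broadcasts it over the index x columns grid.
df_nan = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=["A", "B"])

# Same constructor call with a different float fill value.
df_inf = pd.DataFrame(math.inf, index=[0, 1, 2, 3], columns=["A", "B"])

# Inspect the result: size, per-column dtype, and whether every cell is NaN.
shape = df_nan.shape
nan_dtypes = [str(t) for t in df_nan.dtypes]
inf_dtypes = [str(t) for t in df_inf.dtypes]
all_nan = bool(df_nan.isna().all().all())

print(shape, nan_dtypes, all_nan)
```

Since `np.nan` and `math.inf` are both Python floats, each column comes out as a float column without any explicit `dtype` argument.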