sklearn train_test_split on pandas stratify by multiple columns

If you want train_test_split to behave as you expected (stratifying by multiple columns, with no duplicates), create a new column that concatenates the values of the other columns, then stratify on that new column.

from sklearn.model_selection import train_test_split
df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['bc'])
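
To check that the split really is stratified on the combined key, here is a minimal self-contained sketch. The toy DataFrame and its column values are made up for illustration; only the column names b, c and bc come from the snippet above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 distinct (b, c) combinations, 10 rows each.
df = pd.DataFrame({
    "a": range(40),
    "b": [0, 1] * 20,
    "c": ["x"] * 20 + ["y"] * 20,
})

# Combined key used only for stratification.
df["bc"] = df["b"].astype(str) + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["bc"]
)

# Per-combination row counts in each split.
train_counts = train["bc"].value_counts()
test_counts = test["bc"].value_counts()
print(train_counts)
print(test_counts)
```

Each (b, c) combination should appear in train and test in the same 80/20 proportion as in the full frame, and no row index should appear in both splits. Note that every combination needs at least two rows, otherwise train_test_split raises an error when stratifying.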

If you're worried about collisions, where for example the pairs (11, 3) and (1, 13) both concatenate to 113, you can put an arbitrary separator string between the values:

df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
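
A small sketch of the collision described above, using a made-up two-row DataFrame, shows what the separator changes:

```python
import pandas as pd

# Hypothetical rows chosen to collide: (11, 3) and (1, 13).
df = pd.DataFrame({"b": [11, 1], "c": [3, 13]})

# Plain concatenation maps both rows to the same key.
plain = df["b"].astype(str) + df["c"].astype(str)

# With a separator the two keys stay distinct.
separated = df["b"].astype(str) + "_" + df["c"].astype(str)

print(plain.tolist())      # ['113', '113']
print(separated.tolist())  # ['11_3', '1_13']
```

The separator only helps if it cannot occur inside the values themselves, so for string columns pick a character that never appears in the data.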
