What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

Here are a couple of approaches:

1) Find the ratio of the number of unique values to the total number of (non-null) values. Something like the following:

likely_cat = {}
for var in df.columns:
    # nunique() and count() both ignore NaN, so this is the share of distinct non-null values
    likely_cat[var] = 1.0 * df[var].nunique() / df[var].count() < 0.05  # or some other threshold

2) Check if the top n unique values account for more than a certain proportion of all values:

top_n = 10
likely_cat = {}
for var in df.columns:
    # value_counts(normalize=True) gives each value's share of non-null entries, most frequent first
    likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold

Approach 1) has generally worked better for me than approach 2). But approach 2) is better if the column has a ‘long-tailed distribution’, where a small number of categories have high frequency while a large number of categories have low frequency.
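
Since the two checks can disagree, it may help to compute both per column and compare them side by side. The following is only a sketch: the function name likely_categorical, its parameter names, and the sample data are made up for illustration, the thresholds are the same arbitrary ones used above, and Python 3 division is assumed.

```python
import pandas as pd


def likely_categorical(df, max_unique_ratio=0.05, top_n=10, min_top_share=0.8):
    """Flag each column with both heuristics (hypothetical helper, illustrative thresholds)."""
    rows = {}
    for col in df.columns:
        non_null = df[col].count()
        if non_null == 0:
            # An all-null column gives no evidence either way; also avoids dividing by zero.
            rows[col] = {"low_unique_ratio": False, "top_n_dominant": False}
            continue
        # Approach 1: distinct non-null values relative to all non-null values.
        unique_ratio = df[col].nunique() / non_null
        # Approach 2: combined share of the top_n most frequent values.
        top_share = df[col].value_counts(normalize=True).head(top_n).sum()
        rows[col] = {
            "low_unique_ratio": bool(unique_ratio < max_unique_ratio),
            "top_n_dominant": bool(top_share > min_top_share),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up sample data: one obviously categorical column, one identifier-like column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"] * 250,  # 3 distinct values in 1000 rows
    "row_id": range(1000),                             # every value distinct
})
flags = likely_categorical(df)
print(flags)
```

Columns where the two flags disagree are probably the ones worth inspecting by hand, for example a long-tailed column that fails approach 1) but passes approach 2).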
