What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

Here are a couple of approaches:

  1. Find the ratio of the number of unique values to the total number of non-null values. Something like the following:

    likely_cat = {}
    for var in df.columns:
        # nunique() and count() both ignore NaN, so this is unique / non-null
        likely_cat[var] = df[var].nunique() / df[var].count() < 0.05  # or some other threshold
    
  2. Check if the top n unique values account for more than a certain proportion of all values

    top_n = 10 
    likely_cat = {}
    for var in df.columns:
        # value_counts(normalize=True) returns relative frequencies, most frequent first
        likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
    

Approach 1 has generally worked better for me than approach 2. But approach 2 is better when a column has a long-tailed distribution, where a small number of categories have high frequency while a large number of categories have low frequency. In that case the many rare categories push the unique-value ratio above the threshold, even though a handful of categories cover most of the rows.
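
To see where the two approaches disagree, here is a minimal sketch on made-up data. The helper names `unique_ratio` and `top_n_share`, the toy DataFrame, and its column names are all my own assumptions for illustration; the 0.05 and 0.8 thresholds are just the ones used above and would likely need tuning on real data.

```python
import numpy as np
import pandas as pd


def unique_ratio(s):
    """Approach 1: unique values divided by non-null values."""
    return s.nunique() / s.count()


def top_n_share(s, top_n=10):
    """Approach 2: share of non-null rows covered by the top_n most frequent values."""
    return s.value_counts(normalize=True).head(top_n).sum()


# Hypothetical toy data: 1000 rows per column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    # long tail: 3 frequent labels cover 90% of rows, plus 100 one-off labels
    "long_tail": ["a"] * 400 + ["b"] * 300 + ["c"] * 200
                 + [f"rare_{i}" for i in range(100)],
    # plain continuous column: essentially every value is distinct
    "continuous": rng.normal(size=1000),
    # ordinary low-cardinality column
    "few_levels": ["x", "y"] * 500,
})

summary = pd.DataFrame({
    "unique_ratio": pd.Series({c: unique_ratio(df[c]) for c in df.columns}),
    "top_n_share": pd.Series({c: top_n_share(df[c]) for c in df.columns}),
})
summary["cat_by_ratio"] = summary["unique_ratio"] < 0.05  # approach 1
summary["cat_by_top_n"] = summary["top_n_share"] > 0.8    # approach 2
print(summary)
```

On this toy data both approaches agree on `continuous` and `few_levels`, but only approach 2 flags `long_tail`: its unique ratio is 103/1000, above the 0.05 cutoff, while its top 10 values cover about 90% of the rows.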
