How to count occurrences of each distinct value for every column in a dataframe?

countDistinct is probably the first choice:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))

If speed is more important than the accuracy you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))

To get values and counts:

df.groupBy("some_column").count()

In SQL (spark-sql):

SELECT COUNT(DISTINCT some_column) FROM df

and

SELECT approx_count_distinct(some_column) FROM df

Leave a Comment

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)