Spark DataFrame: count distinct values of every column

In PySpark you could do something like this, using countDistinct():

    from pyspark.sql.functions import col, countDistinct

    df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))

Similarly in Scala:

    import org.apache.spark.sql.functions.countDistinct
    import org.apache.spark.sql.functions.col

    df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
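
For completeness, a minimal sketch of the approximate variant, written against Spark's Java API (the same functions object the Scala snippet imports); approx_count_distinct superseded the deprecated approxCountDistinct in Spark 2.1, and df here is an assumed, already-loaded Dataset<Row>:

    import static org.apache.spark.sql.functions.approx_count_distinct;
    import static org.apache.spark.sql.functions.col;

    import java.util.Arrays;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Approximate distinct count per column; 0.05 is the maximum
    // relative standard deviation allowed for the estimate.
    Column[] aggs = Arrays.stream(df.columns())
            .map(c -> approx_count_distinct(col(c), 0.05).alias(c))
            .toArray(Column[]::new);
    Dataset<Row> counts = df.agg(aggs[0], Arrays.copyOfRange(aggs, 1, aggs.length));
    counts.show();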

PySpark: count distinct over a window

EDIT: as noleto mentions in his answer below, PySpark 2.1 added approx_count_distinct, which works over a window.

Original answer – exact distinct count (not an approximation)

We can use a combination of size and collect_set to mimic the functionality of countDistinct over a window:

    from pyspark.sql import functions as F, Window

… Read more
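
Since the excerpt is cut off, here is an independent sketch of the same size/collect_set idea, this time through the Java API; df and the column names group and value are assumptions for illustration:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.collect_set;
    import static org.apache.spark.sql.functions.size;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // collect_set gathers the distinct values of "value" inside each
    // window partition; size then counts them, an exact distinct count.
    WindowSpec byGroup = Window.partitionBy(col("group"));
    Dataset<Row> withCounts = df.withColumn(
            "distinct_values", size(collect_set(col("value")).over(byGroup)));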

How to maintain a Unique List in Java?

You can use a Set implementation. Some info from the Javadoc:

A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction. Note: Great care must be … Read more
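
By way of illustration (a sketch of my own, not from the answer): when insertion order matters, a LinkedHashSet gives you a duplicate-free collection that still iterates in the order elements were added:

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class UniqueListDemo {
        public static void main(String[] args) {
            // LinkedHashSet rejects duplicates but keeps insertion order.
            Set<String> unique = new LinkedHashSet<>();
            unique.add("a");
            unique.add("b");
            unique.add("a");              // add() returns false; set unchanged
            System.out.println(unique);   // prints [a, b]
        }
    }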

Java 8 Distinct by property

Consider distinct to be a stateful filter. Here is a function that returns a predicate that maintains state about what it's seen previously, and that returns whether the given element was seen for the first time:

    public static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return t -> seen.add(keyExtractor.apply(t));
    }

… Read more
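
To show how the predicate composes with Stream.filter, here is a self-contained sketch; the Person record and the sample data are hypothetical (and need Java 16+), not part of the original answer:

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;
    import java.util.function.Predicate;
    import java.util.stream.Stream;

    public class DistinctByKeyDemo {
        record Person(String name, int age) {}   // hypothetical example type

        // The answer's predicate: remembers keys it has seen across calls.
        static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
            Set<Object> seen = ConcurrentHashMap.newKeySet();
            return t -> seen.add(keyExtractor.apply(t));
        }

        public static void main(String[] args) {
            List<Person> firstByName = Stream.of(
                            new Person("Ann", 30),
                            new Person("Bob", 25),
                            new Person("Ann", 41))
                    .filter(distinctByKey(Person::name))
                    .toList();
            // The second "Ann" is dropped: her key was already seen.
            System.out.println(firstByName);
        }
    }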
