PySpark: count rows on condition

count doesn't sum True values; it only counts the number of non-null values. To count the True values, convert the condition to 1/0 and then sum:

import pyspark.sql.functions as F

cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))

test.groupBy('x').agg(
    cnt_cond(F.col('y') > 12453).alias('y_cnt'),
    cnt_cond(F.col('z') > 230).alias('z_cnt')
).show()

+---+-----+-----+
|  x|y_cnt|z_cnt|
+---+-----+-----+
| bn| … Read more
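For readers who want to run the answer end to end, here is a minimal self-contained sketch; the test DataFrame, its column names, and the sample values are assumptions chosen to make the snippet reproducible, not data from the original question.

# Hedged, runnable sketch; the contents of `test` are invented for illustration.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

test = spark.createDataFrame(
    [("bn", 12452, 221), ("mb", 14521, 330), ("mb", 14843, 231), ("bn", 2, 220)],
    ["x", "y", "z"],
)

# Sum 1 for every row where the condition holds, 0 otherwise.
cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))

test.groupBy("x").agg(
    cnt_cond(F.col("y") > 12453).alias("y_cnt"),
    cnt_cond(F.col("z") > 230).alias("z_cnt"),
).show()

An equivalent formulation is F.sum(cond.cast("int")), since casting a boolean column to int yields 1 for True and 0 for False.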

Counting frequency of values by date using pandas

It might be easiest to turn your Series into a DataFrame and use pandas' groupby functionality (if you already have a DataFrame, skip straight to adding another column below). If your Series is called s, turn it into a DataFrame like so:

>>> df = pd.DataFrame({'Timestamp': s.index, 'Category': s.values})
>>> df
   Category   Timestamp … Read more
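Here is a hedged, self-contained sketch of the full workflow, counting how often each category occurs per date; the Series s and its sample timestamps are invented for illustration.

import pandas as pd

# Invented sample data: a Series of category labels indexed by timestamps.
s = pd.Series(
    ["a", "b", "a", "a", "b"],
    index=pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:00", "2024-01-02 09:30",
        "2024-01-02 11:00", "2024-01-02 12:00",
    ]),
)

df = pd.DataFrame({"Timestamp": s.index, "Category": s.values})
df["Date"] = df["Timestamp"].dt.date  # the extra column mentioned above

# Frequency of each category per date.
counts = df.groupby(["Date", "Category"]).size().unstack(fill_value=0)
print(counts)
# Category    a  b
# Date
# 2024-01-01  1  1
# 2024-01-02  2  1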

Spark: How to translate count(distinct(value)) in the DataFrame API

What you need is the DataFrame aggregation function countDistinct:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Log(page: String, visitor: String)

val logs = data.map(p => Log(p._1, p._2)).toDF()

val result = logs.select("page", "visitor")
  .groupBy('page)
  .agg('page, countDistinct('visitor))

result.foreach(println)
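For comparison, here is a hedged PySpark sketch of the same aggregation in Python; the sample page/visitor rows are invented for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

logs = spark.createDataFrame(
    [("home", "v1"), ("home", "v1"), ("home", "v2"), ("about", "v1")],
    ["page", "visitor"],
)

# countDistinct mirrors SQL's count(distinct(...)) in the DataFrame API.
logs.groupBy("page").agg(
    F.countDistinct("visitor").alias("distinct_visitors")
).show()
# home -> 2 distinct visitors, about -> 1 (row order may vary)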
