pyspark: count distinct over a window

EDIT: as noleto mentions in his answer below, PySpark 2.1 introduced approx_count_distinct, which works over a window. Original answer – exact distinct count (not an approximation): we can use a combination of size and collect_set to mimic the functionality of countDistinct over a window: from pyspark.sql import functions as F, Window … Read more
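A minimal sketch of that idea (the data and column names here are made up, not from the original answer):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy data, purely for illustration
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 1), ("b", 3)],
    ["group", "value"],
)

w = Window.partitionBy("group")

# collect_set gathers the distinct values in the window;
# size then counts them -- an exact distinct count, not an approximation.
df.withColumn("distinct_ct", F.size(F.collect_set("value").over(w))).show()
```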

pyspark: rolling average using timeseries data

I figured out the correct way to calculate a moving/rolling average using this Stack Overflow answer: Spark Window Functions – rangeBetween dates. The basic idea is to convert your timestamp column to seconds; then you can use the rangeBetween method of the pyspark.sql.Window class to include the correct rows in your window. Here’s the solved example: … Read more
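A minimal sketch of that approach, with made-up data and column names:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

def days(i):
    return i * 86400  # rangeBetween here works in epoch seconds

# Toy timeseries, purely for illustration
df = spark.createDataFrame(
    [(1, "2017-01-01", 10.0), (1, "2017-01-04", 20.0), (1, "2017-01-09", 30.0)],
    ["id", "dt", "value"],
)

# Convert the timestamp to epoch seconds so rangeBetween can use it
df = df.withColumn("ts", F.col("dt").cast("timestamp").cast("long"))

# Frame: from 7 days before the current row's timestamp through the row itself
w = Window.partitionBy("id").orderBy("ts").rangeBetween(-days(7), 0)

df.withColumn("rolling_avg", F.avg("value").over(w)).show()
```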

What is the difference between rowsBetween and rangeBetween?

It is simple: ROWS BETWEEN doesn’t care about the exact values. It cares only about the order of rows, and takes a fixed number of preceding and following rows when computing the frame. RANGE BETWEEN considers the values themselves when computing the frame. Let’s look at an example with two window definitions: ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT … Read more
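The difference is easiest to see on data with a gap in the ordering column; a hypothetical PySpark illustration:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Note the gap between x = 3 and x = 10
df = spark.createDataFrame([(1,), (2,), (3,), (10,)], ["x"])

rows_w = Window.orderBy("x").rowsBetween(-2, 0)    # two physical rows back
range_w = Window.orderBy("x").rangeBetween(-2, 0)  # values in [x - 2, x]

df.select(
    "x",
    F.sum("x").over(rows_w).alias("rows_sum"),
    F.sum("x").over(range_w).alias("range_sum"),
).show()
# For x = 10: rows_sum = 2 + 3 + 10 = 15, but range_sum = 10,
# because no other row has x in the range 8..10.
```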

PostgreSQL: running count of rows for a query ‘by minute’

Return only minutes with activity. Shortest version: SELECT DISTINCT date_trunc('minute', "when") AS minute, count(*) OVER (ORDER BY date_trunc('minute', "when")) AS running_ct FROM mytable ORDER BY 1; Use date_trunc(); it returns exactly what you need. Don’t include id in the query, since you want to GROUP BY minute slices. count() is typically used as a plain aggregate … Read more
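For comparison, roughly the same pattern in PySpark, with made-up events (the default frame of an ordered window is RANGE UNBOUNDED PRECEDING, matching the Postgres behaviour):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy events; "when" is the event timestamp
df = spark.createDataFrame(
    [("2024-01-01 10:00:05",), ("2024-01-01 10:00:30",), ("2024-01-01 10:02:10",)],
    ["when"],
).withColumn("when", F.col("when").cast("timestamp"))

per_minute = df.withColumn("minute", F.date_trunc("minute", "when"))

# Peers in the same minute share the same running count,
# so DISTINCT collapses them to one row per minute
w = Window.orderBy("minute")

(per_minute
 .select("minute", F.count(F.lit(1)).over(w).alias("running_ct"))
 .distinct()
 .orderBy("minute")
 .show())
```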

Spark Window Functions – rangeBetween dates

Spark >= 2.3 Since Spark 2.3 it is possible to use interval objects with the SQL API, but DataFrame API support is still a work in progress. df.createOrReplaceTempView("df") spark.sql( """SELECT *, mean(some_value) OVER ( PARTITION BY id ORDER BY CAST(start AS timestamp) RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW ) AS mean FROM df""").show() … Read more
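A self-contained version of that snippet (the data is made up; the SQL matches the excerpt):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data, one value per id per date
df = spark.createDataFrame(
    [(1, "2017-01-01", 10.0), (1, "2017-01-05", 20.0), (1, "2017-01-07", 30.0)],
    ["id", "start", "some_value"],
)
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT *,
           mean(some_value) OVER (
               PARTITION BY id
               ORDER BY CAST(start AS timestamp)
               RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
           ) AS mean
    FROM df
""").show()
```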

Using window functions in an update statement

The error is from Postgres, not Django. You can rewrite this as: WITH v_table_name AS ( SELECT row_number() over (partition by col2 order by col3) AS rn, primary_key FROM table_name ) UPDATE table_name set col1 = v_table_name.rn FROM v_table_name WHERE table_name.primary_key = v_table_name.primary_key; (note that Postgres does not allow the target column in SET to be table-qualified). Or alternatively: UPDATE table_name set col1 = v_table_name.rn FROM ( SELECT row_number() … Read more
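For what it's worth, Spark has no UPDATE statement; in PySpark the equivalent effect (overwriting a column with a per-partition row number) is just a withColumn. All names below are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy table standing in for table_name
df = spark.createDataFrame(
    [(1, "x", 30), (2, "x", 10), (3, "y", 20)],
    ["primary_key", "col2", "col3"],
)

w = Window.partitionBy("col2").orderBy("col3")

# No UPDATE needed: write the window result straight into col1
df.withColumn("col1", F.row_number().over(w)).show()
```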

Spark SQL Row_number() PartitionBy Sort Desc

desc should be applied to a column, not a window definition. You can use either a method on a column: from pyspark.sql.functions import col, row_number from pyspark.sql.window import Window row_number().over( Window.partitionBy("driver").orderBy(col("unit_count").desc()) ) or a standalone function: from pyspark.sql.functions import desc, row_number from pyspark.sql.window import Window row_number().over( Window.partitionBy("driver").orderBy(desc("unit_count")) )
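A runnable sketch of the second variant, with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 10), ("a", 5), ("b", 7)],
    ["driver", "unit_count"],
)

# Highest unit_count per driver gets row number 1
w = Window.partitionBy("driver").orderBy(desc("unit_count"))

df.withColumn("rn", row_number().over(w)).show()
```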

What is ROWS UNBOUNDED PRECEDING used for in Teradata?

It’s the “frame” or “range” clause of window functions, which are part of the SQL standard and implemented in many databases, including Teradata. A simple example would be to calculate the average amount in a frame of three days. I’m using PostgreSQL syntax for the example, but it will be the same for Teradata: WITH … Read more
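The answer's example is in SQL, but the frame semantics carry over directly; a hypothetical PySpark version of the classic running total that ROWS UNBOUNDED PRECEDING gives you:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy daily amounts
df = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-02", 20.0), ("2024-01-03", 5.0)],
    ["day", "amount"],
)

# ROWS UNBOUNDED PRECEDING: from the first row of the partition
# up to the current row, i.e. a running total
w = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_total", F.sum("amount").over(w)).show()
```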

OVER clause in Oracle

The OVER clause specifies the partitioning, ordering and window “over which” the analytic function operates. Example #1: calculate a moving average AVG(amt) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

date   amt   avg_amt
=====  ====  =======
1-Jan  10.0  10.5
2-Jan  11.0  17.0
3-Jan  30.0  17.0
4-Jan  10.0  18.0
5-Jan  14.0  12.0

It … Read more
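The same moving average in PySpark, using the table above as toy data (the string dates happen to sort correctly here):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1-Jan", 10.0), ("2-Jan", 11.0), ("3-Jan", 30.0),
     ("4-Jan", 10.0), ("5-Jan", 14.0)],
    ["date", "amt"],
)

# One row before through one row after the current row
w = Window.orderBy("date").rowsBetween(-1, 1)

df.withColumn("avg_amt", F.avg("amt").over(w)).show()
```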