pyspark: count distinct over a window

EDIT: as noleto mentions in his answer below, PySpark 2.1 introduced approx_count_distinct, which works over a window. Original answer – exact distinct count (not an approximation): we can use a combination of size and collect_set to mimic the functionality of countDistinct over a window: from pyspark.sql import functions as F, Window … Read more
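A minimal sketch of that idea (the data and column names here are made up, not from the original answer):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy data, purely for illustration
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 1), ("b", 3)],
    ["group", "value"],
)

w = Window.partitionBy("group")

# collect_set gathers the distinct values in the window;
# size then counts them -- an exact distinct count, not an approximation.
df.withColumn("distinct_ct", F.size(F.collect_set("value").over(w))).show()
```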

pyspark: rolling average using timeseries data

I figured out the correct way to calculate a moving/rolling average using this Stack Overflow answer: Spark Window Functions – rangeBetween dates. The basic idea is to convert your timestamp column to seconds; then you can use the rangeBetween method of the pyspark.sql.Window class to include the correct rows in your window. Here’s the solved example: … Read more
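A minimal sketch of that approach, with made-up data and column names:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

def days(i):
    return i * 86400  # rangeBetween here works in epoch seconds

# Toy timeseries, purely for illustration
df = spark.createDataFrame(
    [(1, "2017-01-01", 10.0), (1, "2017-01-04", 20.0), (1, "2017-01-09", 30.0)],
    ["id", "dt", "value"],
)

# Convert the timestamp to epoch seconds so rangeBetween can use it
df = df.withColumn("ts", F.col("dt").cast("timestamp").cast("long"))

# Frame: from 7 days before the current row's timestamp through the row itself
w = Window.partitionBy("id").orderBy("ts").rangeBetween(-days(7), 0)

df.withColumn("rolling_avg", F.avg("value").over(w)).show()
```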

What is the difference between rowsBetween and rangeBetween?

It is simple: ROWS BETWEEN doesn’t care about the exact values. It cares only about the order of rows, and takes a fixed number of preceding and following rows when computing the frame. RANGE BETWEEN considers the values themselves when computing the frame. Let’s look at an example with two window definitions: ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT … Read more
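The difference is easiest to see on data with a gap in the ordering column; a hypothetical PySpark illustration:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Note the gap between x = 3 and x = 10
df = spark.createDataFrame([(1,), (2,), (3,), (10,)], ["x"])

rows_w = Window.orderBy("x").rowsBetween(-2, 0)    # two physical rows back
range_w = Window.orderBy("x").rangeBetween(-2, 0)  # values in [x - 2, x]

df.select(
    "x",
    F.sum("x").over(rows_w).alias("rows_sum"),
    F.sum("x").over(range_w).alias("range_sum"),
).show()
# For x = 10: rows_sum = 2 + 3 + 10 = 15, but range_sum = 10,
# because no other row has x in the range 8..10.
```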

PostgreSQL: running count of rows for a query ‘by minute’

Return only minutes with activity. Shortest version: SELECT DISTINCT date_trunc('minute', "when") AS minute, count(*) OVER (ORDER BY date_trunc('minute', "when")) AS running_ct FROM mytable ORDER BY 1; Use date_trunc(); it returns exactly what you need. Don’t include id in the query, since you want to GROUP BY minute slices. count() is typically used as a plain aggregate … Read more
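For comparison, roughly the same pattern in PySpark, with made-up events (the default frame of an ordered window is RANGE UNBOUNDED PRECEDING, matching the Postgres behaviour):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy events; "when" is the event timestamp
df = spark.createDataFrame(
    [("2024-01-01 10:00:05",), ("2024-01-01 10:00:30",), ("2024-01-01 10:02:10",)],
    ["when"],
).withColumn("when", F.col("when").cast("timestamp"))

per_minute = df.withColumn("minute", F.date_trunc("minute", "when"))

# Peers in the same minute share the same running count,
# so DISTINCT collapses them to one row per minute
w = Window.orderBy("minute")

(per_minute
 .select("minute", F.count(F.lit(1)).over(w).alias("running_ct"))
 .distinct()
 .orderBy("minute")
 .show())
```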

Spark Window Functions – rangeBetween dates

Spark >= 2.3 Since Spark 2.3 it is possible to use interval objects with the SQL API, but DataFrame API support is still a work in progress. df.createOrReplaceTempView("df") spark.sql( """SELECT *, mean(some_value) OVER ( PARTITION BY id ORDER BY CAST(start AS timestamp) RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW ) AS mean FROM df""").show() … Read more
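A self-contained version of that snippet (the data is made up; the SQL matches the excerpt):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data, one value per id per date
df = spark.createDataFrame(
    [(1, "2017-01-01", 10.0), (1, "2017-01-05", 20.0), (1, "2017-01-07", 30.0)],
    ["id", "start", "some_value"],
)
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT *,
           mean(some_value) OVER (
               PARTITION BY id
               ORDER BY CAST(start AS timestamp)
               RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
           ) AS mean
    FROM df
""").show()
```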

Using window functions in an update statement

The error is from Postgres, not Django. You can rewrite this as: WITH v_table_name AS ( SELECT row_number() over (partition by col2 order by col3) AS rn, primary_key FROM table_name ) UPDATE table_name set col1 = v_table_name.rn FROM v_table_name WHERE table_name.primary_key = v_table_name.primary_key; (note that Postgres does not allow the target column in SET to be table-qualified). Or alternatively: UPDATE table_name set col1 = v_table_name.rn FROM ( SELECT row_number() … Read more
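For what it's worth, Spark has no UPDATE statement; in PySpark the equivalent effect (overwriting a column with a per-partition row number) is just a withColumn. All names below are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy table standing in for table_name
df = spark.createDataFrame(
    [(1, "x", 30), (2, "x", 10), (3, "y", 20)],
    ["primary_key", "col2", "col3"],
)

w = Window.partitionBy("col2").orderBy("col3")

# No UPDATE needed: write the window result straight into col1
df.withColumn("col1", F.row_number().over(w)).show()
```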

Spark SQL Row_number() PartitionBy Sort Desc

desc should be applied to a column, not a window definition. You can use either a method on a column: from pyspark.sql.functions import col, row_number from pyspark.sql.window import Window row_number().over( Window.partitionBy("driver").orderBy(col("unit_count").desc()) ) or a standalone function: from pyspark.sql.functions import desc, row_number from pyspark.sql.window import Window row_number().over( Window.partitionBy("driver").orderBy(desc("unit_count")) )
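A runnable sketch of the second variant, with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 10), ("a", 5), ("b", 7)],
    ["driver", "unit_count"],
)

# Highest unit_count per driver gets row number 1
w = Window.partitionBy("driver").orderBy(desc("unit_count"))

df.withColumn("rn", row_number().over(w)).show()
```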

What is ROWS UNBOUNDED PRECEDING used for in Teradata?

It’s the “frame” or “range” clause of window functions, which are part of the SQL standard and implemented in many databases, including Teradata. A simple example would be to calculate the average amount in a frame of three days. I’m using PostgreSQL syntax for the example, but it will be the same for Teradata: WITH … Read more
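The answer's example is in SQL, but the frame semantics carry over directly; a hypothetical PySpark version of the classic running total that ROWS UNBOUNDED PRECEDING gives you:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy daily amounts
df = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-02", 20.0), ("2024-01-03", 5.0)],
    ["day", "amount"],
)

# ROWS UNBOUNDED PRECEDING: from the first row of the partition
# up to the current row, i.e. a running total
w = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_total", F.sum("amount").over(w)).show()
```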

OVER clause in Oracle

The OVER clause specifies the partitioning, ordering and window “over which” the analytic function operates. Example #1: calculate a moving average AVG(amt) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

date   amt   avg_amt
=====  ====  =======
1-Jan  10.0  10.5
2-Jan  11.0  17.0
3-Jan  30.0  17.0
4-Jan  10.0  18.0
5-Jan  14.0  12.0

It … Read more
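The same moving average in PySpark, using the table above as toy data (the string dates happen to sort correctly here):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1-Jan", 10.0), ("2-Jan", 11.0), ("3-Jan", 30.0),
     ("4-Jan", 10.0), ("5-Jan", 14.0)],
    ["date", "amt"],
)

# One row before through one row after the current row
w = Window.orderBy("date").rowsBetween(-1, 1)

df.withColumn("avg_amt", F.avg("amt").over(w)).show()
```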