Anatomy of SQL Window Functions. Back To Basics | SQL fundamentals for… | by Iffat Malik Gore | Mar, 2023

Anatomy of SQL Window Functions. Back To Basics | SQL fundamentals for… | by Iffat Malik Gore | Mar, 2023

[ad_1]

Image by author, created on canva

In order to understand the enterprise data; you have to query it a lot. When I say ‘A lot’, I mean it. Working with unfamiliar piles of data is often daunting and it’s always a good practice to take some time to explore and understand the data itself. It’s good to have basic data retrieval skills but knowing analytical functions to derive some useful insights out of your data is cherry on top of a cake and it’s fun too!

I come from data visualisation back ground and it’s crucial for me to not just understand the data but also figuring out any noteworthy findings to highlight it to the wider teams. Also building complex dashboards is quite often a to-and-fro process at where you go back to your data source to tally the data and SQL Window Functions have always accompanied me in my data analysis journey.

Even though they are very useful for data analysis, there is some sort of confusion and people are often frighten to use them. While writing a detailed guide to SQL Window Functions, I realised it was becoming way too descriptive and yet I did not want to skip details especially about the syntax and clauses used along with it. It’s important to understand the building blocks, yeah? So, I will try and break down the building blocks of Window Function in this article so that it’s not overwhelming to process and implement it.

As usual, we will be using classicmodels MySQL sample database for demonstration, which holds the business data of a car retailer. Below is the ER Diagram for reference,

Image by author

First things first, What is a Window Function?

Textbook definition of a Window Function is,

A Window Function performs calculation across a set of table rows that are somehow related to the current row.

Image source: Masterfile

What do you think this little fella sees from the window? A partial view from the scene outside of the window of this room or building. right? That’s exactly what a Window Function does. It allows you to perform calculation against a subset of data without aggregating current rows.

What’s the need? Why Window Function? Where does Aggregate Function fall behind?

Here is the sample data from table PRODUCTS, for demo purpose I have limited it to PRODUCTLINEs – Planes, Ships and Trains.

--sample data from table PRODUCTS.
SELECT
*
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains');
Image by author

Now find out the total quantity in stock for each PRODUCTLINE,

--total quantity in stock for each productline
SELECT
PRODUCTLINE,
SUM(QUANTITYINSTOCK) AS TOTAL_QUNATITY
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains')
GROUP BY PRODUCTLINE;
Image by author

This is pretty straight forward. SUM(QUANTITYINSTOCK) summarised the data from multiple rows into 3 rows, since we are using GROUP BY clause it provided single row for each product line. Great! that’s expected from it.

Image by author

Let’s revamp the requirement now,

  1. Display the quantity of each product within the PRODUCTLINE along with the the total quantity in stock for that particular PRODUCTLINE.
  2. Arrange the result set grouped by PRODUCTLINE.

Now, can this be done with the help of Aggregate Function? We surely want the total quantity in stock for each PRODUCTLINE but we do not wish to summarise it all. That’s where Window Functions come to the picture,

--sum() as a window function
SELECT
PRODUCTNAME,
PRODUCTLINE,
QUANTITYINSTOCK,
SUM(QUANTITYINSTOCK) OVER (PARTITION BY PRODUCTLINE) AS TOTAL_QUANTITY
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains');
Image by author

Here, SUM() as a Window Function performed calculations on a set of rows but unlike Aggregate Function it does not summarise the result set into a single row. Instead, all the rows (PRODUCTNAME, PRODUCTLINE and QUANTITYINSTOCK ) maintain their original form/identity and the calculated row(TOTAL_QUANTITY ) is added in the result set for every row.

Image by author

Types of Window Functions

Honestly there is no official categorisation of Window Functions but based on the usage, we can briefly categorise them in 3 ways,

Image by author
  • Aggregate Functions – Regular Aggregate Function can be used as a Window Function to calculate aggregations for numeric columns within window partitions such as running total sales, minimum or maximum value within partition etc.
  • Ranking Functions – These functions return a ranking value for each row in a partition.
  • Value Functions – These functions are useful for generating simple statistics or time series analysis.

Syntax of a Window Function

The common syntax of a Window Function is,

Image by author

Before we go deeper into it; let’s first understand significance of each clause within it,

OVER() Clause

OVER() clause specifies a function as a Window Function and hence it must always be included in the statement. It defines a user-specified subset(a window)of rows on which the Window Function will be applied. If you do not supply anything inside OVER(), the Window Function will be applied on the entire result set.

Continuing to the above example,

--empty OVER() clause
SELECT
PRODUCTNAME,
PRODUCTLINE,
QUANTITYINSTOCK,
SUM(QUANTITYINSTOCK) OVER () AS TOTAL_QUANTITY
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains');
Image by author

Here, since we supplied empty OVER() clause, Window Function SUM(QUANTITYINSTOCK) was applied to all the records of PRODUCTLINE- Planes, Ships and Trains resulting into 105816 as TOTAL_QUANTITY.

PARTITION BY Clause

PARTITION BY is used with OVER clause. It divides the query result set into partitions or buckets based on user specified expression and then the Window Function applies on each partition or bucket.

It’s optional, so if you do not specify PARTITION BY clause, then the function treats all rows as a single partition. Exactly what we did in the above example, we simply specified an empty OVER() clause without PARTITION BY clause and hence total quantity was calculated for all the PRODUCTLINEs.

What happens if we do specify one,

--OVER() with PARTITION BY
SELECT
PRODUCTNAME,
PRODUCTLINE,
QUANTITYINSTOCK,
SUM(QUANTITYINSTOCK) OVER (PARTITION BY PRODUCTLINE) AS TOTAL_QUANTITY
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains');
Image by author

Since we are asking OVER() clause to partition the TOTAL_QUANTITY result-set based on PRODUCYLINE, now SUM(QUANTITYINSTOCK) will be calculating quantity in stock for each one of them .

Now, what’s the deal with PARTITION BY and GROUP BY clauses? Are they alike or different?

GROUP BY Clause,

  • It groups multiple rows into summarised rows based on a single or multiple columns/expressions (returning 1 row for each group). In simpler words, it reduces the number of rows in your result set.
  • It is used along with Aggregate Functions such as SUM(), AVG(), MAX() etc.
  • It’s placed after the WHERE clause but before HAVING, ORDER BY clauses. Common syntax is,
Image by author

PARTITION BY Clause,

  • It is used with OVER() clause in Window Function. It divides the query result set into partitions and then the Window Function applies on each partition.
  • PARTITION BY is similar to GROUP BY since it aggregates the result based on the expression; however, the major difference is, it will not reduce the rows of result-set.
  • It’s optional, so if you do not specify PARTITION BY clause, then the function treats all rows as a single partition.
  • Common syntax is,
Image by author

If we need to figure out MIN and MAX values of MSRP for each PRODUCTLINE, then how will the result sets differ for both GROUP BY and PARTITION BY clause,

Image by author

ORDER BY Clause

It is used to sort the result set in either ascending or descending order within each partition of the result set. By default it’s in ascending order.

ROWS/RANGE Clause

Now we already know that the key feature of a Window Function is to create a window or a partition of the result set using PARTITION BY clause and then perform calculations on each partition. What if we further want to create subsets within these partitions? Woah! partition within partition? Yes, that’s why we have FRAME clause.

FRAME clause further defines a subset of the current partition. It uses ROW or RANGE to define the start and end points of this subset. It requires ORDER BY clause.

Frames are determined with respect to the current row, which simply means you take your current row’s location as a base point and with that reference you define your frame within the partition.

  • ROWS — This defines the beginning and ending of the frame by specifying the number of rows that precede or follow the current row.
  • RANGE — Contrary to the ROWS, RANGE specifies the range of values compared to the value of the current row to define a frame within the partition.

Generic syntax is,

{ROWS | RANGE} BETWEEN <frame_starting> AND <frame_ending>

Image by author

Before we go any further, let’s understand some basic terms defining the frame.

Image source: mysqltutorials
  • UNBOUNDED PRECEDING – This specifies all the rows (starting from the first row) before the current row in the partition.
  • N PRECEDING – This specifies ’N’ number of rows before your current row in the partition.
  • UNBOUNDED FOLLOWING – This specifies all the rows after your current row (all the way to the very last row) in the partition.
  • M FOLLOWING – This specifies ‘M’ number of rows below your current row in the partition.

Let’s quickly understand this with an example,

SELECT 
PRODUCTNAME,
PRODUCTLINE,
QUANTITYINSTOCK,
SUM(QUANTITYINSTOCK) OVER (PARTITION BY PRODUCTLINE ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS TOTAL
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains');
Image by author

ROWS BETWEEN 1 PRECEDING AND CURRENT ROW indicates the size of the frame on which SUM(QUANTITYINSTOCK) needs to be calculated within the partition.

Image by author

Here are some examples of FRAME clauses,

  • ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING — This means consider the frame from the first row of the partition to the last row of the partition.
  • ROWS BETWEEN UNBOUNDED PRECEDING AND 4 FOLLOWING – This means consider the frame from the first row of the partition to 4 row after the current row.
  • ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING – Frame will be the previous 4 rows up to 1 row before the current row.

The default frame differs depending on the presence or the absence of the ORDER BY clause; with it the default frame is,

{ROWS/RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

This means consider the frame as all the rows starting from the row number one to the current row in the partition.

Without ORDER BY clause, the default frame as,

{ROWS/RANGE} BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

This simply means entire partition.

Defining a window alias,

If there are more than one window functions in your query which utilises the same window, then you might want to use window alias.

--finding out minimum and maximum MSRP for each productline
SELECT
PRODUCTNAME,
PRODUCTLINE,
MSRP,
MIN(MSRP) OVER(PARTITION BY PRODUCTLINE) AS MIN_MSRP,
MAX(MSRP) OVER(PARTITION BY PRODUCTLINE) AS MAX_MSRP
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains');

The alternate way of writing the same query by using window alias is,

--using window alias
SELECT
PRODUCTNAME,
PRODUCTLINE,
MSRP,
MIN(MSRP) OVER MSRP_WINDOW AS MIN_MSRP,
MAX(MSRP) OVER MSRP_WINDOW AS MAX_MSRP
FROM
CLASSICMODELS.PRODUCTS
WHERE PRODUCTLINE IN ('Planes','Ships','Trains')
WINDOW MSRP_WINDOW AS (PARTITION BY PRODUCTLINE);

📌 Side Note

During the query execution, Window Functions are performed on the result set,

  • After JOIN, WHERE , GROUP BY and HAVING clauses and
  • Before the ORDER BY clause, LIMIT and SELECT DISTINCT.
[ad_2]
Source link

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *