Diesel: Window functions
As part of a NLnet NGI0 Zero grant I'm currently working on extending Diesel to support SQL window function calls. I would like to use this possibility to express my sincere thanks for this opportunity to all involved persons.
Window functions are analytical SQL functions which turn values from one or more rows in a single value per row. They aggregate values while not aggregating the complete query results. Window functions are supported by all database systems supported by Diesel. This includes PostgreSQL, MySQL and SQLite. As part of this blog post I want to describe the current process on this project and present the current drafted API to be implemented. The actual changes are available as pull request at the Diesel GitHub repo.
Window Function Syntax
Before looking at the designed DSL for window functions we need to look at the actual underlying SQL to understand which expressions exists and how they interact. Window functions are commonly used as part of SELECT
clauses in your queries. The PostgreSQL documentation describes the following syntax:
function_name ([expression [, expression ... ]]) OVER window_name
function_name ([expression [, expression ... ]]) OVER ( window_definition )
function_name ( * ) OVER window_name
function_name ( * ) OVER ( window_definition )
where window_definition
has syntax:
[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]
frame_clause
has the syntax of one of the following:
{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]
frame_start
and frame_end
can be
UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING
frame_exclusion
can be
EXCLUDE CURRENT ROW
EXCLUDE GROUP
EXCLUDE TIES
EXCLUDE NO OTHERS
expression
is an arbitrary SQL expression, window_name
refers to a named window_definition
in the WINDOW
part of a SELECT
statement.
Each part in []
is optional, each part in {}
lists different possible variants.
The simplest possible window function call you can write is
SELECT avg(age) OVER () FROM users;
This means everything but the OVER()
part is optional. The OVER()
part itself signals that an ordinary function call is turned into a window function call.
By using all possible parts of the window function syntax you end up with queries like this one:
SELECT avg(age) OVER (
PARTITION BY name
ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING EXCLUDE CURRENT ROW
) FROM users;
For other database systems the syntax very similar, although there are subtile differences. MySQL doesn't support the GROUPS
frame clause unit. Additionally there is no support for any variant of frame_exclusion
. SQLite documents that UNBOUNDED PRECEDING
can only be used in place of frame_start
and UNBOUNDED FOLLOWING
can be only used in place of frame_end
. Fun fact: The actual implementation in PostgreSQL and MySQL has the same requirements here, it just seems to be not documented correctly.
On top of that both the SQLite and the PostgreSQL documentation state that any aggregate function can be used as window function. The MySQL function restricts this to specific functions (which cover almost all built-in aggregate functions). Additionally there are window only functions like rank()
, lag()
or lead
in all three database systems.
Diesel Integration Requirements
Diesel therefore wants to enforce that:
- You cannot use unsupported syntax variants with your current backend
- You cannot construct invalid queries
- You cannot use queries that would be valid with different backends with a backend where they are not valid
For the last point it's also important to know how window function interact with other query parts. Previously we encountered the following relatively common problems while designing new DSL for Diesel:
- How do expressions in your
SELECT
clause interact with theGROUP BY
clause? Are they treated as aggregate expressions or not? - How does the nullability of the return type of an expression interact with the arguments of the expression?
The first point is important to ensure that you cannot mix aggregate expressions, like max(age)
with none-aggregate expressions, like name
in your SELECT
clause without having a proper GROUP BY
clause. Mixing such expressions is either rejected by the database system (PostgreSQL) or produces non-deterministic results (MySQL). For this reasons Diesel decided to reject such queries at compile time. For window functions it turns out that they propagate whether the inner expression is aggregate or not. That means min(max(id)) OVER ()
is actually an aggregate expression, because the argument of the window function max(id)
is an aggregate function call. (Another fun fact: You cannot nest window function calls that way).
The second point is important to ensure that you don't accidentally end up trying to deserialize a NULL
value from the database into a non-Option
type on the Rust side. Window functions must be considered to always be able to return a NULL
value as it is possible to construct an empty window frame e.g. via ROWS BETWEEN 0 PRECEDING AND 0 FOLLOWING EXCLUDE CURRENT ROW
.
Diesel API Design
The presented window function syntax divides using window functions into two steps:
- Creating the window function call by adding an
OVER()
clause to the function call - Potentially adding options (
PARTITION BY
,ORDER BY
,{ RANGE | ROWS | GROUPS } BETWEEN …
) to the function call
The first step is required to get any window function call, while the second step is optional. For this reason I've decided to closely follow these steps by requiring users to first call an WindowExpressionMethods::over(self)
function to turn any function that is allowed to be used as window function into an actual window function call. After that various functions to specify the additional options exist.
This allows us to write our two example queries from above as:
// SELECT avg(age) OVER () FROM users;
users::table.select(dsl::max(users::age).over())
// SELECT avg(age) OVER (
// PARTITION BY name
// ORDER BY id DESC
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING EXCLUDE CURRENT ROW
// ) FROM users;
users::table.select(
dsl::max(users::age)
.over()
.partition_by(users::name)
.order_by(users::id.desc())
.frame_by(dsl::frame::Rows.between_with_exclusion(
dsl::frame::UnboundedPreceding,
dsl::frame::UnboundedFollowing,
dsl::frame::ExcludeCurrentRow
))
)
For specifying whether or not a SQL function is a window function, I introduced a #[window]
attribute that can be used as part of the function declaration. This closely mirrors the existing #[aggregate]
attribute to mark aggregate functions.
For enforcing the various previously invariant Diesel has already existing infrastructure. Preventing mixing aggregate and non-aggregate expressions is enforced by the ValidGrouping
trait. By using an implementation that exposes the aggregate type state from the inner expressions. Handling nullability is implemented in a similar way
Finally enforcing that certain parts of the syntax cannot be used with specific backends is enforced by the QueryFragment
trait. This allows us to either restrict the implementation to a specific backend or to use an existing trait based system that allows several implementations to coexist.
Missing Parts
As indicated by the not-yet-merged PR this work is not completed yet. The current presented version in a draft of how I imagine the API. One of the motivations for writing this blog post is to get early feedback on the proposed methods.
For the actual implementation both tests for compile time constraints and runtime behaviour are missing. Additionally documentation needs to be written and any existing suitable built-in function needs marked as window function. Finally I want to add some built-in window only functions to the set of functions provided by Diesel. I'm optimistic to finish this work in time for an upcoming Diesel 2.3 release.