简介
MySQL从8.0开始支持窗口函数,这个功能在大多商业数据库和部分开源数据库中早已支持,有的也叫分析函数。
窗口是什么
窗口的概念非常重要,它可以理解为记录集合,窗口函数也就是在满足某种条件的记录集合上执行的特殊函数。对于每条记录都要在此窗口内执行函数,有的函数随着记录不同,窗口大小都是固定的,这种属于静态窗口;有的函数则相反,不同的记录对应着不同的窗口,这种动态变化的窗口叫滑动窗口。
窗口函数和聚合函数
- 聚合函数是将多条记录聚合为一条;而窗口函数是每条记录都会执行,有几条记录执行完还是几条。
- 聚合函数也可以用于窗口函数中,这个后面会举例说明。
基本用法
函数名 OVER 子句
over关键字用来指定函数执行的窗口范围,若后面括号中什么都不写,则意味着窗口包含满足WHERE条件的所有行,窗口函数基于所有行进行计算;如果不为空,则支持以下4中语法来设置窗口。
window_name:给窗口指定一个别名。如果SQL中涉及的窗口较多,采用别名可以看起来更清晰易读;
PARTITION BY 子句:窗口按照哪些字段进行分组,窗口函数在不同的分组上分别执行;
ORDER BY子句:按照哪些字段进行排序,窗口函数将按照排序后的记录顺序进行编号;
FRAME子句:FRAME是当前分区的一个子集,子句用来定义子集的规则,通常用来作为滑动窗口使用。
分类
按照功能划分,可以把MySQL支持的窗口函数分为如下几类:
序号函数:row_number() / rank() / dense_rank()
分布函数:percent_rank() / cume_dist()
前后函数:lag() / lead()
头尾函数:first_val() / last_val()
其他函数:nth_value() / nfile()
序号函数
ROW_NUMBER()
用途:显示分区中的当前行号
应用场景:查询每个学生的分数最高的前3门课程
-- 学生-课程-成绩表
+-------+----------+-------+------------+
| stu_id | lesson_id | score | create_time |
+-------+----------+-------+------------+
| 1 | L005 | 98 | 2021-10-01 |
| 1 | L004 | 88 | 2021-09-30 |
| 1 | L001 | 98 | 2021-09-21 |
| 2 | L002 | 90 | 2021-09-13 |
| 2 | L003 | 86 | 2021-09-16 |
| 2 | L001 | 84 | 2021-09-26 |
| 3 | L002 | 91 | 2021-09-02 |
| 3 | L001 | 100 | 2021-09-03 |
| 3 | L003 | 85 | 2021-09-17 |
| 4 | L001 | 99 | 2021-09-01 |
| 4 | L005 | 98 | 2021-09-10 |
| 4 | L002 | 88 | 2021-09-19 |
| 4 | L010 | 100 | 2021-09-22 |
+-------+----------+-------+------------+
查询SQL
SELECT
*
FROM
( SELECT stu_id, ROW_NUMBER() OVER ( PARTITION BY stu_id ORDER BY score DESC ) AS score_order, lesson_id, score FROM t_score ) t
WHERE
score_order <= 3;
查询结果
+-------+------------+----------+-------+
| stu_id | score_order | lesson_id | score |
+-------+------------+----------+-------+
| 1 | 1 | L005 | 98 |
| 1 | 2 | L001 | 98 |
| 1 | 3 | L004 | 88 |
| 2 | 1 | L002 | 90 |
| 2 | 2 | L003 | 86 |
| 2 | 3 | L001 | 84 |
| 3 | 1 | L001 | 100 |
| 3 | 2 | L002 | 91 |
| 3 | 3 | L003 | 85 |
| 4 | 1 | L010 | 100 |
| 4 | 2 | L001 | 99 |
| 4 | 3 | L005 | 98 |
+-------+------------+----------+-------+
对于stu_id=1的同学,有两门课程的成绩均为98,序号随机排了1和2。但很多情况下二者应该是并列第一,则他的成绩为88的这门课的序号可能是第2名,也可能为第3名。
这时候,ROW_NUMBER()就不能满足需求,需要RANK()和DENSE_RANK()出场,它们和ROW_NUMBER()非常类似,只是在出现重复值时处理逻辑有所不同。
SELECT
*
FROM
(
SELECT
stu_id,
ROW_NUMBER() OVER ( PARTITION BY stu_id ORDER BY score DESC ) AS score_order1,
RANK() OVER ( PARTITION BY stu_id ORDER BY score DESC ) AS score_order2,
DENSE_RANK() OVER ( PARTITION BY stu_id ORDER BY score DESC ) AS score_order3,
lesson_id,
score
FROM
t_score
) t
WHERE
score_order1 <= 3
AND score_order2 <= 3
AND score_order3 <= 3;
查询结果
+-------+-------------+-------------+-------------+----------+-------+
| stu_id | score_order1 | score_order2 | score_order3 | lesson_id | score |
+-------+-------------+-------------+-------------+----------+-------+
| 1 | 1 | 1 | 1 | L005 | 98 |
| 1 | 2 | 1 | 1 | L001 | 98 |
| 1 | 3 | 3 | 2 | L004 | 88 |
| 2 | 1 | 1 | 1 | L002 | 90 |
| 2 | 2 | 2 | 2 | L003 | 86 |
| 2 | 3 | 3 | 3 | L001 | 84 |
| 3 | 1 | 1 | 1 | L001 | 100 |
| 3 | 2 | 2 | 2 | L002 | 91 |
| 3 | 3 | 3 | 3 | L003 | 85 |
| 4 | 1 | 1 | 1 | L010 | 100 |
| 4 | 2 | 2 | 2 | L001 | 99 |
| 4 | 3 | 3 | 3 | L005 | 98 |
+-------+-------------+-------------+-------------+----------+-------+
ROW_NUMBER():顺序排序——1、2、3
RANK():并列排序,跳过重复序号——1、1、3
DENSE_RANK():并列排序,不跳过重复序号——1、1、2
分布函数
PERCENT_RANK()
用途:每行按照公式(rank-1) / (rows-1)进行计算。其中,rank为RANK()函数产生的序号,rows为当前窗口的记录总行数
应用场景:不常用
SELECT
RANK() OVER w_s AS rk,
PERCENT_RANK() OVER w_s AS prk,
stu_id,
lesson_id,
score
FROM
t_score
WHERE
stu_id = 1 WINDOW w_s AS ( PARTITION BY stu_id ORDER BY score );
查询结果
+----+-----+-------+----------+-------+
| rk | prk | stu_id | lesson_id | score |
+----+-----+-------+----------+-------+
| 1 | 0 | 1 | L004 | 88 |
| 2 | 0.5 | 1 | L005 | 98 |
| 2 | 0.5 | 1 | L001 | 98 |
+----+-----+-------+----------+-------+
CUME_DIST()
用途:分组内小于、等于当前rank值的行数 / 分组内总行数
应用场景:查询小于等于当前成绩(score)的比例
SELECT
stu_id,
lesson_id,
score,
ROUND( CUME_DIST() OVER ( ORDER BY score ), 2 ) AS cd1,
ROUND( CUME_DIST() OVER ( PARTITION BY lesson_id ORDER BY score ), 2 ) AS cd2
FROM
t_score
WHERE
lesson_id IN ( 'L001', 'L002' );
查询结果
+-------+----------+-------+------+------+
| stu_id | lesson_id | score | cd1 | cd2 |
+-------+----------+-------+------+------+
| 2 | L001 | 84 | 0.14 | 0.25 |
| 1 | L001 | 98 | 0.71 | 0.5 |
| 4 | L001 | 99 | 0.86 | 0.75 |
| 3 | L001 | 100 | 1 | 1 |
| 4 | L002 | 88 | 0.29 | 0.33 |
| 2 | L002 | 90 | 0.43 | 0.67 |
| 3 | L002 | 91 | 0.57 | 1 |
+-------+----------+-------+------+------+
前后函数
LAG(expr,n)、LEAD(expr,n)
用途:返回位于当前行的前n行(LAG(expr,n))或后n行(LEAD(expr,n))的expr的值
应用场景:查询前1名同学的成绩和当前同学成绩的差值
SELECT
stu_id,
lesson_id,
score,
pre_score,
score - pre_score AS diff1,
next_score,
score - next_score AS diff2
FROM
(
SELECT
stu_id,
lesson_id,
score,
LAG( score, 1 ) OVER w AS pre_score,
LEAD( score, 1 ) OVER w AS next_score
FROM
t_score
WHERE
lesson_id IN ( 'L001', 'L002' ) WINDOW w AS ( PARTITION BY lesson_id ORDER BY score )) t;
查询结果
+-------+----------+-------+-----------+------+------------+------+
| stu_id | lesson_id | score | pre_score | diff1 | next_score | diff2 |
+-------+----------+-------+-----------+------+------------+------+
| 2 | L001 | 84 | NULL | NULL | 98 | -14 |
| 1 | L001 | 98 | 84 | 14 | 99 | -1 |
| 4 | L001 | 99 | 98 | 1 | 100 | -1 |
| 3 | L001 | 100 | 99 | 1 | NULL | NULL |
| 4 | L002 | 88 | NULL | NULL | 90 | -2 |
| 2 | L002 | 90 | 88 | 2 | 91 | -1 |
| 3 | L002 | 91 | 90 | 1 | NULL | NULL |
+-------+----------+-------+-----------+------+------------+------+
头尾函数
FIRST_VALUE(expr)、LAST_VALUE(expr)
用途:返回第一个(FIRST_VALUE(expr))或最后一个(LAST_VALUE(expr))expr的值
应用场景:截止到当前成绩,按照日期排序查询第1个和最后1个同学的分数
SELECT
stu_id,
lesson_id,
score,
create_time,
FIRST_VALUE( score ) OVER w AS first_score,
LAST_VALUE( score ) OVER w AS last_score
FROM
t_score
WHERE
lesson_id IN ( 'L001', 'L002' ) WINDOW w AS ( PARTITION BY lesson_id ORDER BY create_time );
查询结果
+-------+----------+-------+------------+-----------+-----------+
| stu_id | lesson_id | score | create_time | first_score | last_score |
+-------+----------+-------+------------+-----------+-----------+
| 4 | L001 | 99 | 2021-09-01 | 99 | 99 |
| 3 | L001 | 100 | 2021-09-03 | 99 | 100 |
| 1 | L001 | 98 | 2021-09-21 | 99 | 98 |
| 2 | L001 | 84 | 2021-09-26 | 99 | 84 |
| 3 | L002 | 91 | 2021-09-02 | 91 | 91 |
| 2 | L002 | 90 | 2021-09-13 | 91 | 90 |
| 4 | L002 | 88 | 2021-09-19 | 91 | 88 |
+-------+----------+-------+------------+-----------+-----------+
其它函数
NTH_VALUE(expr,n)
用途:返回窗口中第n个expr的值。expr可以是表达式,也可以是列名
应用场景:截止到当前成绩,显示每个同学的成绩中排名第2和第3的成绩的分数
SELECT
stu_id,
lesson_id,
score,
NTH_VALUE( score, 2 ) OVER w AS second_score,
NTH_VALUE( score, 3 ) OVER w AS third_score
FROM
t_score
WHERE
stu_id IN ( 1, 2 ) WINDOW w AS ( PARTITION BY stu_id ORDER BY score );
查询结果
+-------+----------+-------+--------------+------------+
| stu_id | lesson_id | score | second_score | third_score |
+-------+----------+-------+--------------+------------+
| 1 | L004 | 88 | NULL | NULL |
| 1 | L005 | 98 | 98 | 98 |
| 1 | L001 | 98 | 98 | 98 |
| 2 | L001 | 84 | NULL | NULL |
| 2 | L003 | 86 | 86 | NULL |
| 2 | L002 | 90 | 86 | 90 |
+-------+----------+-------+--------------+------------+
NTILE(n)
用途:将分区中的有序数据分为n个等级,记录等级数
应用场景:将每门课程按照成绩分成3组
SELECT
NTILE( 3 ) OVER w AS nf,
stu_id,
lesson_id,
score
FROM
t_score
WHERE
lesson_id IN ( 'L001', 'L002' ) WINDOW w AS ( PARTITION BY lesson_id ORDER BY score );
查询结果
+----+-------+----------+-------+
| nf | stu_id | lesson_id | score |
+----+-------+----------+-------+
| 1 | 2 | L001 | 84 |
| 1 | 1 | L001 | 98 |
| 2 | 4 | L001 | 99 |
| 3 | 3 | L001 | 100 |
| 1 | 4 | L002 | 88 |
| 2 | 2 | L002 | 90 |
| 3 | 3 | L002 | 91 |
+----+-------+----------+-------+
NTILE(n)函数在数据分析中应用较多,比如由于数据量大,需要将数据平均分配到n个并行的进程分别计算,此时就可以用NTILE(n)对数据进行分组(由于记录数不一定被n整除,所以数据不一定完全平均),然后将不同桶号的数据再分配。
聚合函数作为窗口函数
用途:在窗口中每条记录动态地应用聚合函数(SUM()、AVG()、MAX()、MIN()、COUNT()),可以动态计算在指定的窗口内的各种聚合函数值
应用场景:截止到当前时间,查询stu_id=1的学生的累计分数、分数最高的科目、分数最低的科目
SELECT
stu_id,
lesson_id,
score,
create_time,
SUM( score ) OVER w AS score_sum,
MAX( score ) OVER w AS score_max,
MIN( score ) OVER w AS score_min
FROM
t_score
WHERE
stu_id = 1 WINDOW w AS ( PARTITION BY stu_id ORDER BY create_time );
查询结果
+-------+----------+-------+------------+-----------+-----------+-----------+
| stu_id | lesson_id | score | create_time | score_sum | score_max | score_min |
+-------+----------+-------+------------+-----------+-----------+-----------+
| 1 | L001 | 98 | 2021-09-21 | 98 | 98 | 98 |
| 1 | L004 | 88 | 2021-09-30 | 186 | 98 | 88 |
| 1 | L005 | 98 | 2021-10-01 | 284 | 98 | 88 |
+-------+----------+-------+------------+-----------+-----------+-----------+