mql数据分析(使用HQL窗口函数进行商务数据分析)
mql数据分析(使用HQL窗口函数进行商务数据分析)WITH monthly_revenue as ( SELECT trunc(add_time 'MM') as month sum(amount) as revenue FROM orders GROUP BY 1 ) SELECT month revenue sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total FROM monthly_revenue ORDER BY 1 结果输出累计汇总,即当前元素和所有先前元素的总和,如下面的SQL:WITH monthly_revenue as ( SELECT trunc(add_time 'MM')
通过本文的5个需求分析,可以看出SQL窗口函数的功能十分强大,不仅能够使我们编写的SQL逻辑更加清晰,而且在某种程度上可以简化需求开发。
数据准备本文主要分析只涉及一张订单表orders,操作过程在Hive中完成,具体数据如下:
-- 建表
CREATE TABLE orders(
order_id int
customer_id string
city string
add_time string
amount decimal(10 2));
-- 准备数据
INSERT INTO orders VALUES
(1 "A" "上海" "2020-01-01 00:00:00.000000" 200)
(2 "B" "上海" "2020-01-05 00:00:00.000000" 250)
(3 "C" "北京" "2020-01-12 00:00:00.000000" 200)
(4 "A" "上海" "2020-02-04 00:00:00.000000" 400)
(5 "D" "上海" "2020-02-05 00:00:00.000000" 250)
(5 "D" "上海" "2020-02-05 12:00:00.000000" 300)
(6 "C" "北京" "2020-02-19 00:00:00.000000" 300)
(7 "A" "上海" "2020-03-01 00:00:00.000000" 150)
(8 "E" "北京" "2020-03-05 00:00:00.000000" 500)
(9 "F" "上海" "2020-03-09 00:00:00.000000" 250)
(10 "B" "上海" "2020-03-21 00:00:00.000000" 600);
需求1:收入增长
在业务方面,第m1个月的收入增长计算如下:100 *(m1-m0)/ m0
其中,m1是给定月份的收入,m0是上个月的收入。因此,从技术上讲,我们需要找到每个月的收入,然后以某种方式将每个月的收入与上一个收入相关联,以便进行上述计算。计算当时如下:
WITH
monthly_revenue as (
SELECT
trunc(add_time 'MM') as month
sum(amount) as revenue
FROM orders
GROUP BY 1
)
prev_month_revenue as (
SELECT
month
revenue
lag(revenue) over (order by month) as prev_month_revenue -- 上一月收入
FROM monthly_revenue
)
SELECT
month
revenue
prev_month_revenue
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
结果输出
我们还可以按照按城市分组进行统计,查看某个城市某个月份的收入增长情况
WITH
monthly_revenue as (
SELECT
trunc(add_time 'MM') as month
city
sum(amount) as revenue
FROM orders
GROUP BY 1 2
)
prev_month_revenue as (
SELECT
month
city
revenue
lag(revenue) over (partition by city order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month
city
revenue
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2 1
结果输出
累计汇总,即当前元素和所有先前元素的总和,如下面的SQL:
WITH
monthly_revenue as (
SELECT
trunc(add_time 'MM') as month
sum(amount) as revenue
FROM orders
GROUP BY 1
)
SELECT
month
revenue
sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1
结果输出
我们还可以使用下面的组合方式进行分析,SQL如下:
SELECT
order_id
customer_id
city
add_time
amount
sum(amount) over () as amount_total -- 所有数据求和
sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum -- 累计求和
sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer
avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg -- 滚动求平均
FROM orders
ORDER BY 1
结果输出:
从上面的数据可以看出,存在两条重复的数据**(5 "D" "上海" "2020-02-05 00:00:00.000000" 250) (5 "D" "上海" "2020-02-05 12:00:00.000000" 300) **显然需要对其进行清洗去重,保留最新的一条数据,SQL如下:
我们先进行分组排名,然后保留最新的那条数据即可:
SELECT *
FROM (
SELECT *
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
) t
WHERE rank=1
结果输出:
经过上面的清洗过程,对数据进行了去重。重新计算上面的需求1,正确SQL脚本为:
WITH
orders_cleaned as (
SELECT *
FROM (
SELECT *
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
)
monthly_revenue as (
SELECT
trunc(add_time 'MM') as month
sum(amount) as revenue
FROM orders_cleaned
GROUP BY 1
)
prev_month_revenue as (
SELECT
month
revenue
lag(revenue) over (order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month
revenue
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue 1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
结果输出:
将清洗后的数据创建成视图,方便以后使用
CREATE VIEW orders_cleaned AS
SELECT
order_id
customer_id
city
add_time
amount
FROM (
SELECT *
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
需求4:分组取TopN
分组取topN是最常见的SQL窗口函数使用场景,下面的SQL是计算每个月份的top2订单金额,如下:
WITH orders_ranked as (
SELECT
trunc(add_time 'MM') as month
*
row_number() over (partition by trunc(add_time 'MM') order by amount desc add_time) as rank
FROM orders_cleaned
)
SELECT
month
order_id
customer_id
city
add_time
amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1
需求5:重复购买行为
下面的SQL计算重复购买率:重复购买的人数/总人数*100%以及第一笔订单金额与第二笔订单金额之间的典型差额:avg(第二笔订单金额/第一笔订单金额)
WITH customer_orders as (
SELECT *
row_number() over (partition by customer_id order by add_time) as customer_order_n
lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
FROM orders_cleaned
)
SELECT
round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id) 1) as repeat_purchases -- 重复购买率
avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- 重复购买较上次购买差异,第一笔订单金额与第二笔订单金额之间的典型差额
FROM customer_orders
结果输出:
WITH结果输出:
最终结果输出: