When looking at timeseries data, it is often useful to look at metrics that reveal an underlying trend, ie something robust enough to deal with volatility and short term fluctuations. A frequently proposed question is how does average sale price compare over time? This is an example of accomplishing a moving average and sum calculation within SQL using a monthly interval. There are several ways of accomplishing this. I'm going to cover examples using correlated subselects and derived table.
Here is a sample of what the raw data looks like in the table below:
I'm going to use a simple purchases table to create rolling sums on revenue. The sample code can be modified to calculate many different aggregates and can for comparing other timeframes (such as daily or hourly):
TABLE: purchases | ||
---|---|---|
id | timestamp | price |
1 | 2014-03-03 00:00:04 | 230 |
2 | 2014-03-03 00:01:14 | 210 |
3 | 2014-03-03 00:02:34 | 250 |
4 | ... | ... |
Here is the result set we expect:
Derived Table: Purchases weeekly comparison | ||||
---|---|---|---|---|
week | revenue | avg_rev | revenue_4_weeks_prior | avg_rev_4_weeks_prior |
2014-03-03 | 51,312 | 208.12 | 193,210 | 202.91 |
2014-03-10 | 54,310 | 202.87 | 221,409 | 210.41 |
2014-03-17 | 57,942 | 218.82 | 259,682 | 205.23 |
... | ... | ... | ... | ... |
The finished result set is faster and easier to create reports and discover interesting insights. We can also add dimensions to the final form so we can see how different facets can alter our rolling aggregate! Now let's take a look at how to get to this form.
The easiest way to produced our desired result set is by using a simple correlated subquery (LINK TO http://en.wikipedia.org/wiki/Correlated_subquery).
-- 2nd pass
SELECT
pw.week AS week
, pw.revenue AS revenue
, (
SELECT
SUM(p2.revenue)
FROM purchases AS p2
WHERE p2 <= pw.week
AND p2 >= pw.week + 'INTERVAL -3 weeks'
) AS revenue_4_weeks_prior
, (
SELECT
AVG(p2.revenue)
FROM purchases AS p2
WHERE p2 <= pw.week
AND p2 >= pw.week + 'INTERVAL -3 weeks'
) AS avg_rev_4_weeks_prior
FROM
-- 1st pass
(SELECT
EXTRACT(week FROM p.timestamp) AS week
, SUM(p.revenue) AS revenue
FROM purchases AS p
GROUP BY 1) AS pw
GROUP BY week
In the first pass, you can see we're grabbing all the weeks for which we have data and summing the revenue for that week. In the second pass, we're calculating for each week the revenue from four weeks prior.
Unfortunately, there are some drawbacks to using this technique. The simplicity and elegance of this query comes at a great cost to performance. Correlated subqueries run a SELECT statement for each row and column within the result set. With a large dataset, one that has many weeks, this query may end up running for a long time. This query pattern is ideal for small datasets or for looking at a small subset of the data (using a WHERE clause to limit query range).
Please note that correlated subqueries are not implemented in every database. Popular MPP databases, such as Redshift and Vertica, only partially support correlated subqueries (MAYBE LINK TO http://docs.aws.amazon.com/redshift/latest/dg/r_correlated_subqueries.html#r_correlated_subqueries-correlated-subquery-patterns-that-are-not-supported).
Let's go over how to create this result when dealing with very large datasets. We want to avoid scanning any row of the raw data more than once. The best way to do this is to use derived tables (http://en.wikipedia.org/wiki/Relational_database#Base_and_derived_relations). Here's the code:
-- 1st pass
WITH weekly_revenue AS (
SELECT
EXTRACT(week FROM p.timestamp) AS week
, SUM(p.revenue) AS weekly_revenue
, COUNT(*) AS purchases
FROM purchases
)
-- 2nd pass
SELECT
wr.week AS week
, wr.weekly_revenue AS revenue
, wr.weekly_revenue/wr.purchases AS avg_rev
, SUM(wr2.revenue) AS revenue_4_weeks_prior
, SUM(wr2.revenue)/SUM(wr2.purchases) AS avg_rev_4_weeks_prior
FROM weekly_revenue AS wr
JOIN weekly_revenue AS wr2
ON wr.week < wr2.week
AND wr.week >= wr2.week
GROUP BY 1,2
This query may seem a bit strange, so let's go over it step by step. First, we create a simple CTE (LINK TO http://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL#Common_table_expression) with every week and its revenue. Next, we join this CTE onto itself to create four rows, one for every prior week. We're fanning out (WE DON'T HAVE AN ARTICLE ON FANOUTS YET :( ) the data and then summing on each of the new joined-on weekly_revenues. Here's how the form looks right after the JOIN but before the SUM:
Derived Table: Fanned out weekly_revenue | |||||
---|---|---|---|---|---|
wr.week | wr.revenue | wr.purchases | wr2.week | wr2.revenue | wr2.purchases |
2014-03-31 | 61,582 | 296 | 2014-03-10 | 51,312 | 247 |
2014-03-31 | 61,582 | 296 | 2014-03-17 | 54,310 | 317 |
2014-03-31 | 61,582 | 296 | 2014-03-24 | 57,942 | 286 |
2014-03-31 | 61,582 | 296 | 2014-03-31 | 59,429 | 272 |
... | ... | ... | ... | ... | ... |
There are now four rows for the week of 2014-03-31. Each row has a weekly revenue from one of the four prior weeks. To get to the result set we want, we simply sum and average the wr2.weekly_revenue and then group by the original week's date and revenue value.
Derived Table: Purchases weeekly comparison | ||||
---|---|---|---|---|
week | revenue | avg_rev | revenue_4_weeks_prior | avg_rev_4_weeks_prior |
2014-03-31 | 61,582 | 208.12 | 222,993 | 198.76 |
This query will run much faster than the subselect method, since we only scan the raw table once.
(I used a WITH statement in this example query. Your syntax may be different, depending on your SQL database. You may perfer to use CREATE TEMPORARY TABLE or CREATE TABLE.)
Sometimes our data isn't perfect. There may be some weeks when nothing occurs, and we want to make sure our report captures that. We can accomplish this by simply adding a table with all the weeks we're interested in:
Derived Table: weeks |
---|
week |
2014-03-03 |
2014-03-10 |
2014-03-17 |
2014-03-24 |
... |
Now we can use this table for our queries by incorporating it with the second pass of our subselect or within the join of the derived table.
**** Whats next?
With these patterns we can now create rolling reports in SQL! We can do more than just averages and sums and start calculating very complex metrics such as ratios and medians. Hope you enjoyed this article.