Aggregating the most recent joined records per week

I have an updates table in Postgres 9.4.5 that looks like this:

goal_id    | created_at | status
1          | 2016-01-01 | green
1          | 2016-01-02 | red
2          | 2016-01-02 | amber

And a goals table like this:

id | company_id
1  | 1
2  | 2

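I want to create a chart for each company that shows the status of all of its goals, per week.

[Image: example]

I imagine this requires generating a series of the past 8 weeks, finding the most recent update for each goal that came before that week, and then counting the different statuses of the updates found.

What I have so far: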

SELECT EXTRACT(year FROM generate_series) AS year,
       EXTRACT(week FROM generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week')
LEFT OUTER JOIN (
  SELECT DISTINCT ON (year, week)
         goals.company_id,
         updates.status,
         EXTRACT(week FROM updates.created_at) week,
         EXTRACT(year FROM updates.created_at) AS year,
         updates.created_at
  FROM updates
  JOIN goals ON goals.id = updates.goal_id
  ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3

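But this has two problems. First, the join on u does not work the way I thought it would: it seems to join on every row (?) returned from the inner query. Second, it only selects the most recent update that occurred within that week; it should fall back to the most recent update from before that week if it needs to.

This is some pretty complicated SQL and I would love some input on how to accomplish it.

Table structure and info

The goals table has around 1,000 goals at the moment and grows by about 100 a week: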

                                          Table "goals"
     Column      |            Type             |                         Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
 id              | integer                     | not null default nextval('goals_id_seq'::regclass)
 company_id      | integer                     | not null
 name            | text                        | not null
 created_at      | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at      | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goals_pkey" PRIMARY KEY, btree (id)
    "entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
    "goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT

The updates table has around 1,000 rows and grows by about 100 a week:

                                         Table "updates"
   Column   |            Type             |                            Modifiers
------------+-----------------------------+------------------------------------------------------------------
 id         | integer                     | not null default nextval('updates_id_seq'::regclass)
 status     | entity.goalstatus           | not null
 goal_id    | integer                     | not null
 created_at | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goal_updates_pkey" PRIMARY KEY, btree (id)
    "entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
    "updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE

 Schema |       Name        | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
 entity | entity.goalstatus | goalstatus    | 4    | green   +|                   |
        |                   |               |      | amber   +|                   |
        |                   |               |      | red      |                   |


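You need one data item per week and goal (before aggregating counts per company). That is a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is getting the current state from updates for each of them. As @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.

And simplify date handling with date_trunc():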

SELECT w_start
     , g.company_id
     , COUNT(*) FILTER (WHERE u.status = 'green') AS green_count
     , COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , COUNT(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(date_trunc('week', NOW() - INTERVAL '2 months')
                     , date_trunc('week', NOW())
                     , INTERVAL '1 week') w_start
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON TRUE
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;

To make this fast, you need a multicolumn index:

CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);

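Descending order for created_at is best, but not strictly necessary. Postgres can scan an index backwards almost exactly as fast. (That does not hold for mixed sort orders across multiple columns, though.)

Index the columns in that order. Why?

  • Multicolumn index and performance

The third column, status, is only appended to allow fast index-only scans on updates. Related case:

  • Slow index scans in large table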
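One way to check whether the index serves the inner lookup is to run the LATERAL subquery on its own for a single goal and a fixed cutoff. The following is only a sketch: the goal id 1 and the cutoff timestamp are placeholder values, and it assumes the index updates_special_idx has been created. An index-only scan also depends on the visibility map being reasonably current, which is why a VACUUM comes first.

```sql
-- Refresh statistics and the visibility map so an index-only scan is possible.
VACUUM ANALYZE updates;

-- Same shape as the LATERAL subquery, with placeholder values filled in.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status
FROM   updates
WHERE  goal_id = 1                          -- placeholder goal id
AND    created_at < timestamp '2016-01-04'  -- placeholder week start
ORDER  BY created_at DESC
LIMIT  1;
```

If the plan reports an Index Only Scan using updates_special_idx, the third index column is doing its job. With tables as small as the ones described, the planner may still reasonably choose a different plan.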

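1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups on the second table, which itself has only 1k rows. For small tables like these, performance should not be much of a problem. But once each table holds a few thousand more rows, performance will deteriorate with sequential scans.

w_start represents the start of each week. Consequently, the counts reflect the state at the start of the week. You can still extract year and week (or any other detail that represents your week), if you insist: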

   EXTRACT(isoyear FROM w_start) AS year
 , EXTRACT(week    FROM w_start) AS week

It is best to use ISOYEAR, as @Paul explained.
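To see how date_trunc('week', ...) and the ISO fields behave around a year boundary, a small standalone query can help. The three dates below are illustrative values chosen for this sketch, not data from the question.

```sql
-- date_trunc('week', ...) returns the Monday that starts the ISO week.
-- isoyear pairs correctly with week; the plain calendar year may not.
SELECT d::date                      AS day
     , date_trunc('week', d)::date  AS w_start
     , EXTRACT(year    FROM d)      AS year
     , EXTRACT(isoyear FROM d)      AS isoyear
     , EXTRACT(week    FROM d)      AS week
FROM  (VALUES (timestamp '2015-12-31')
            , (timestamp '2016-01-01')
            , (timestamp '2016-01-04')) t(d);

--    day     |  w_start   | year | isoyear | week
-- 2015-12-31 | 2015-12-28 | 2015 |    2015 |   53
-- 2016-01-01 | 2015-12-28 | 2016 |    2015 |   53
-- 2016-01-04 | 2016-01-04 | 2016 |    2016 |    1
```

The middle row shows the pitfall: combining year with week would label January 1, 2016 as week 53 of 2016, whereas isoyear keeps it in week 53 of 2015.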

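SQL Fiddle.

Related:

  • What is the difference between LATERAL and a subquery in PostgreSQL?
  • Optimize GROUP BY query to retrieve latest record per user
  • Select first row in each GROUP BY group?
  • PostgreSQL: running count of rows for a query 'by minute'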


This seems like a good use for a LATERAL join:

SELECT  EXTRACT(ISOYEAR FROM s.w) AS year,
        EXTRACT(WEEK FROM s.w) AS week,
        u.company_id,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
FROM    generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
LEFT OUTER JOIN LATERAL (
  -- latest update per goal as of the series timestamp s.w
  SELECT  DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
  FROM    updates u2
  INNER JOIN goals g
  ON      g.id = u2.goal_id
  WHERE   u2.created_at <= s.w
  ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
) u
ON TRUE
WHERE   u.company_id IS NOT NULL
GROUP BY year, week, u.company_id
ORDER BY u.company_id, year, week
;

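By the way, I am extracting ISOYEAR rather than YEAR to make sure I get sensible results around the beginning of January. For instance, EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08') is 2016, but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08') is 53!

EDIT: You should test on your real data, but I feel like this ought to be faster: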

SELECT  year,
        week,
        company_id,
        COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
FROM    (
  SELECT  EXTRACT(ISOYEAR FROM s.t) AS year,
          EXTRACT(WEEK FROM s.t) AS week,
          u.company_id,
          u.goal_id,
          -- first element of the status array ordered newest-first = latest status
          (array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
  FROM    generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
  LEFT OUTER JOIN (
    SELECT  g.company_id, u2.goal_id, u2.created_at, u2.status
    FROM    updates u2
    INNER JOIN goals g
    ON      g.id = u2.goal_id
  ) u
  ON      s.t >= u.created_at
  WHERE   u.company_id IS NOT NULL
  GROUP BY year, week, u.company_id, u.goal_id
) x
GROUP BY year, week, company_id
ORDER BY company_id, year, week
;

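Still no window functions, though. :-) You can also speed things up a bit more by replacing (array_agg(...))[1] with a real first function. You will have to define it yourself, but the implementation on the Postgres wiki is easy to find on Google.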
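For reference, a first aggregate along the lines of the one published on the Postgres wiki can be sketched as follows. The schema (public) and the names first_agg and first follow the wiki's convention as far as I recall it; treat them as assumptions and adjust them to your setup.

```sql
-- State function: keep the first value seen and ignore later ones.
-- STRICT means NULL inputs are skipped, so the result is the first non-NULL value.
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $1 $$;

-- Aggregate wrapper around the state function.
CREATE AGGREGATE public.first (
   sfunc    = public.first_agg
 , basetype = anyelement
 , stype    = anyelement
);
```

With that in place, the expression (array_agg(u.status ORDER BY u.created_at DESC))[1] could be written as first(u.status ORDER BY u.created_at DESC). Whether it is actually faster is worth measuring on real data.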


I use PostgreSQL 9.3. Your question interested me, so I examined your data structure and then created the following tables.

[Image: Data]

SELECT c.id company_id, c.name company_name, u.status goal_status,
       EXTRACT(week FROM u.created_at) goal_status_week,
       EXTRACT(year FROM u.created_at) AS goal_status_year
FROM company c
INNER JOIN goals g ON g.company_id = c.id
INNER JOIN updates u ON u.goal_id = g.id
ORDER BY goal_status_year DESC, goal_status_week DESC;

I got the following result:

[Image: Inner]

Finally, I joined this query with the week series:

SELECT gs.company_id,
       gs.company_name,
       gs.goal_status,
       EXTRACT(year FROM w) AS year,
       EXTRACT(week FROM w) AS week,
       COUNT(gs.*) cnt
FROM generate_series(NOW() - INTERVAL '3 months', NOW(), '1 week') w
LEFT JOIN (
  SELECT c.id company_id, c.name company_name, u.status goal_status,
         EXTRACT(week FROM u.created_at) goal_status_week,
         EXTRACT(year FROM u.created_at) AS goal_status_year
  FROM company c
  INNER JOIN goals g ON g.company_id = c.id
  INNER JOIN updates u ON u.goal_id = g.id
) gs
ON gs.goal_status_week = EXTRACT(week FROM w) AND gs.goal_status_year = EXTRACT(year FROM w)
GROUP BY company_id, company_name, goal_status, year, week
ORDER BY year DESC, week DESC;

I got this result:

[Image: Final]

Have a nice day.