Aggregating the most recent joined records per week
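I have an updates table in Postgres like this:

    goal_id | created_at | status
    1       | 2016-01-01 | green
    1       | 2016-01-02 | red
    2       | 2016-01-02 | amber

And a goals table like this:

    id | company_id
    1  | 1
    2  | 2

I want to create a chart for each company that shows the status of all of their goals, per week.
I imagine this would require generating a series of the past 8 weeks, finding the most recent update for each goal that came before that week, and then counting the different statuses of the updates found.
What I have so far: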
    SELECT EXTRACT(year FROM generate_series) AS year,
           EXTRACT(week FROM generate_series) AS week,
           u.company_id,
           COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
           COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
           COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
    FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
    LEFT OUTER JOIN (
      SELECT DISTINCT ON(year, week)
             goals.company_id,
             updates.status,
             EXTRACT(week FROM updates.created_at) week,
             EXTRACT(year FROM updates.created_at) AS year,
             updates.created_at
      FROM updates
      JOIN goals ON goals.id = updates.goal_id
      ORDER BY year, week, updates.created_at DESC
    ) u ON u.week = week AND u.year = year
    GROUP BY 1,2,3
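But this has two problems. It seems
This is some pretty complicated SQL, and I'd love some input on how to pull it off.
Table structure and info
The goals table has about 1,000 goals at the moment and grows by about 100 per week: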
                                  Table "goals"
       Column   |            Type             |                     Modifiers
    ------------+-----------------------------+----------------------------------------------------
     id         | integer                     | not null default nextval('goals_id_seq'::regclass)
     company_id | integer                     | not null
     name       | text                        | not null
     created_at | timestamp without time zone | not null default timezone('utc'::text, now())
     updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
    Indexes:
        "goals_pkey" PRIMARY KEY, btree (id)
        "entity_goals_company_id_fkey" btree (company_id)
    Foreign-key constraints:
        "goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT

                                  Table "updates"
       Column   |            Type             |                      Modifiers
    ------------+-----------------------------+------------------------------------------------------
     id         | integer                     | not null default nextval('updates_id_seq'::regclass)
     status     | entity.goalstatus           | not null
     goal_id    | integer                     | not null
     created_at | timestamp without time zone | not null default timezone('utc'::text, now())
     updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
    Indexes:
        "goal_updates_pkey" PRIMARY KEY, btree (id)
        "entity_goal_updates_goal_id_fkey" btree (goal_id)
    Foreign-key constraints:
        "updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE

     Schema |       Name        | Internal name | Size | Elements | Access privileges | Description
    --------+-------------------+---------------+------+----------+-------------------+-------------
     entity | entity.goalstatus | goalstatus    | 4    | green   +|                   |
            |                   |               |      | amber   +|                   |
            |                   |               |      | red      |                   |
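You need one data item per week and goal (before aggregating counts per company). That is the CROSS JOIN between the weekly series and goals in the query below. Fetch the latest status per goal with a faster technique in the LATERAL subquery (ORDER BY created_at DESC LIMIT 1), and use date_trunc() to simplify date handling: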
    SELECT w_start
         , g.company_id
         , count(*) FILTER (WHERE u.status = 'green') AS green_count
         , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
         , count(*) FILTER (WHERE u.status = 'red')   AS red_count
    FROM   generate_series(date_trunc('week', now() - interval '2 months')
                         , date_trunc('week', now())
                         , interval '1 week') w_start
    CROSS  JOIN goals g
    LEFT   JOIN LATERAL (
       SELECT status
       FROM   updates
       WHERE  goal_id = g.id
       AND    created_at < w_start
       ORDER  BY created_at DESC
       LIMIT  1
       ) u ON true
    GROUP  BY w_start, g.company_id
    ORDER  BY w_start, g.company_id;
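To try the query above on the three sample rows from the question, here is a small self-contained sketch. It rests on a few assumptions that are not in the original: the tables are replaced by CTEs of the same names, status is plain text rather than the entity.goalstatus enum, and the series is fixed to two Mondays (2015-12-28 and 2016-01-04) instead of being relative to now(), so the result is reproducible.

```sql
-- Sample data from the question as CTEs (assumption: status as plain text).
WITH goals (id, company_id) AS (
   VALUES (1, 1), (2, 2)
), updates (goal_id, created_at, status) AS (
   VALUES (1, timestamp '2016-01-01', 'green')
        , (1, timestamp '2016-01-02', 'red')
        , (2, timestamp '2016-01-02', 'amber')
)
SELECT w_start
     , g.company_id
     , count(*) FILTER (WHERE u.status = 'green') AS green_count
     , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , count(*) FILTER (WHERE u.status = 'red')   AS red_count
-- Fixed two-week series (assumption) instead of one based on now().
FROM   generate_series(timestamp '2015-12-28'
                     , timestamp '2016-01-04'
                     , interval '1 week') w_start
CROSS  JOIN goals g                 -- one row per week and goal
LEFT   JOIN LATERAL (               -- latest update before the week starts
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;
```

For the week starting 2015-12-28 no update exists yet, so both companies should show zero in every column; the LEFT JOIN keeps those rows. For the week starting 2016-01-04, company 1 should show one red goal (the red update replaces the earlier green one) and company 2 one amber goal.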
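To make this query fast, you need a multicolumn index:

    CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);

Index the columns in that order. Why?
- Multicolumn index and performance
The third column, status, is only there to allow index-only scans on updates. Related:
- Slow index scans in large table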
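One way to check whether such an index is used for the per-goal lookup is to look at the plan of the inner query on its own. The following is only a sketch under assumptions: it builds a throwaway temp table (updates_demo and updates_demo_idx are made-up names, status is plain text) filled with generated rows, and the plan on real data may well differ.

```sql
-- Throwaway stand-in for the relevant columns of updates (hypothetical table).
CREATE TEMP TABLE updates_demo (
    goal_id    int       NOT NULL,
    created_at timestamp NOT NULL,
    status     text      NOT NULL
);

-- 1000 goals with 20 updates each, one update per day.
INSERT INTO updates_demo (goal_id, created_at, status)
SELECT g, timestamp '2016-01-01' + n * interval '1 day', 'green'
FROM   generate_series(1, 1000) g
CROSS  JOIN generate_series(0, 19) n;

-- Same column order as the index suggested above.
CREATE INDEX updates_demo_idx ON updates_demo (goal_id, created_at DESC, status);

-- Refresh statistics and the visibility map; index-only scans depend on the latter.
-- Note: VACUUM cannot run inside a transaction block.
VACUUM ANALYZE updates_demo;

-- Plan for a single lookup of the kind the LATERAL subquery performs.
EXPLAIN
SELECT status
FROM   updates_demo
WHERE  goal_id = 42
AND    created_at < timestamp '2016-01-11'
ORDER  BY created_at DESC
LIMIT  1;
```

If the plan shows an index-only scan on updates_demo_idx, the lookup is answered from the index alone: the equality column (goal_id) comes first, the sort column (created_at) second, and status is only carried along so the table itself does not have to be visited.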
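9 weeks for 1k goals (your 2-month interval overlaps with at least 9 weeks) only require 9k index look-ups on the second table, which has only about 1k rows. For small tables like these, performance shouldn't be much of a problem. But once each table holds a few thousand more rows, performance will deteriorate with sequential scans.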
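The "at least 9 weeks" figure can be checked by counting the rows of the weekly series for a fixed reference time instead of now(). A minimal sketch; the two reference dates are arbitrary picks for illustration, and 1000 stands in for the approximate number of goals:

```sql
-- Count series rows (weeks) and the resulting lookups for ~1000 goals.
SELECT r.ref_time
     , count(*)        AS weeks_in_series
     , count(*) * 1000 AS lookups_for_1000_goals
FROM  (VALUES (timestamp '2015-03-01')      -- arbitrary reference dates
            , (timestamp '2016-03-15')) AS r(ref_time)
CROSS  JOIN LATERAL generate_series(
          date_trunc('week', r.ref_time - interval '2 months')
        , date_trunc('week', r.ref_time)
        , interval '1 week') AS w_start
GROUP  BY r.ref_time
ORDER  BY r.ref_time;
```

For these two dates the series should have 9 and 10 rows respectively, so the number of lookups is roughly 9k to 10k, depending on how the two-month window falls relative to week boundaries.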
      EXTRACT(isoyear FROM w_start) AS year
    , EXTRACT(week    FROM w_start) AS week

Better to use isoyear (rather than year) together with week.
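The reason for pairing week with isoyear shows up around New Year: week is the ISO 8601 week number, and the first days of January can still belong to the last ISO week of the previous year. A quick sketch with two fixed dates chosen for illustration:

```sql
-- Compare calendar year, ISO year and ISO week around the turn of the year.
SELECT d::date                  AS calendar_day
     , EXTRACT(year    FROM d) AS year
     , EXTRACT(isoyear FROM d) AS isoyear
     , EXTRACT(week    FROM d) AS week
FROM  (VALUES (timestamp '2016-01-01')   -- a Friday
            , (timestamp '2016-01-04')   -- the following Monday
      ) AS t(d);
```

2016-01-01 falls into ISO week 53 of ISO year 2015, while 2016-01-04 starts week 1 of 2016. Combining year with week would label the first date as week 53 of 2016, which sorts and groups wrongly.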
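SQL Fiddle.
Related:
- What is the difference between LATERAL and a subquery in PostgreSQL?
- Optimize GROUP BY query to retrieve latest record per user
- Select first row in each GROUP BY group?
- PostgreSQL: running count of rows for a query 'by minute'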
This seems like a good fit for a LATERAL join:
    SELECT
        EXTRACT(ISOYEAR FROM s.w) AS year,
        EXTRACT(WEEK FROM s.w) AS week,
        u.company_id,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
    FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
    LEFT OUTER JOIN LATERAL (
        SELECT DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
        FROM updates u2
        INNER JOIN goals g ON g.id = u2.goal_id
        WHERE u2.created_at <= s.w
        ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
    ) u ON true
    WHERE u.company_id IS NOT NULL
    GROUP BY year, week, u.company_id
    ORDER BY u.company_id, year, week ;
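By the way, I'm extracting ISOYEAR rather than YEAR, so that it matches the ISO week number.
EDIT: you should test this on your real data, but I have a feeling this should be faster: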
    SELECT
        year, week, company_id,
        COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
    FROM
    (
        SELECT
            EXTRACT(ISOYEAR FROM s.t) AS year,
            EXTRACT(WEEK FROM s.t) AS week,
            u.company_id, u.goal_id,
            (array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
        FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
        LEFT OUTER JOIN (
            SELECT g.company_id, u2.goal_id, u2.created_at, u2.status
            FROM updates u2
            INNER JOIN goals g ON g.id = u2.goal_id
        ) u ON s.t >= u.created_at
        WHERE u.company_id IS NOT NULL
        GROUP BY year, week, u.company_id, u.goal_id
    ) x
    GROUP BY year, week, company_id
    ORDER BY company_id, year, week ;
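The part of the query above that picks the latest status is the ordered aggregate `(array_agg(... ORDER BY ... DESC))[1]`: it collects the statuses of a group newest first and keeps the first array element. A minimal sketch of just that step, using the sample rows from the question as inline VALUES (status as plain text is an assumption):

```sql
-- Latest status per goal via an ordered array_agg, no window function needed.
SELECT goal_id
     , (array_agg(status ORDER BY created_at DESC))[1] AS last_status
FROM  (VALUES (1, timestamp '2016-01-01', 'green')
            , (1, timestamp '2016-01-02', 'red')
            , (2, timestamp '2016-01-02', 'amber')
      ) AS u(goal_id, created_at, status)
GROUP  BY goal_id
ORDER  BY goal_id;
```

Goal 1 should come out as red and goal 2 as amber. The trade-off is that the aggregate builds the full array per group before discarding all but one element, which is why testing on real data matters.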
Still no window functions, though. :-) Also, you could
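I use PostgreSQL 9.3. Your question interested me, so I examined your data structure and then created the following tables.
I inserted the following records into the company, goals and updates tables;
After that I wrote the following query as a check: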
    SELECT c.id company_id, c.name company_name, u.status goal_status,
           EXTRACT(week FROM u.created_at) goal_status_week,
           EXTRACT(year FROM u.created_at) AS goal_status_year
    FROM company c
    INNER JOIN goals g ON g.company_id = c.id
    INNER JOIN updates u ON u.goal_id = g.id
    ORDER BY goal_status_year DESC, goal_status_week DESC;
I got the following result;
Finally, I joined this query with the week series:
    SELECT gs.company_id, gs.company_name, gs.goal_status,
           EXTRACT(year FROM w) AS year, EXTRACT(week FROM w) AS week,
           COUNT(gs.*) cnt
    FROM generate_series(NOW() - INTERVAL '3 MONTHS', NOW(), '1 week') w
    LEFT JOIN (
        SELECT c.id company_id, c.name company_name, u.status goal_status,
               EXTRACT(week FROM u.created_at) goal_status_week,
               EXTRACT(year FROM u.created_at) AS goal_status_year
        FROM company c
        INNER JOIN goals g ON g.company_id = c.id
        INNER JOIN updates u ON u.goal_id = g.id
    ) gs ON gs.goal_status_week = EXTRACT(week FROM w)
        AND gs.goal_status_year = EXTRACT(year FROM w)
    GROUP BY company_id, company_name, goal_status, year, week
    ORDER BY year DESC, week DESC;
I got this result.
Have a nice day.