Select first row in each GROUP BY group?

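As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I have a purchases table that looks like this:

SELECT * FROM purchases;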

My output:

id | customer | total
---+----------+------
1 | Joe | 5
2 | Sally | 3
3 | Joe | 2
4 | Sally | 1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY total DESC;

Expected output:

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
1 | Joe | 5
2 | Sally | 3

In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
id, customer, total
FROM purchases
ORDER BY customer, total DESC, id;

Or shorter (if not as clear), using ordinal numbers of the output columns:

SELECT DISTINCT ON (2)
id, customer, total
FROM purchases
ORDER BY 2, 3 DESC, 1;

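If total can be NULL (it won't hurt either way, but you'll want to match existing indexes):

... ORDER BY customer, total DESC NULLS LAST, id;

Major points

DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).
List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

Obviously, two rows are considered distinct if they differ in at least
one column value. Null values are considered equal in this comparison.

Emphasis mine.
DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY have to match the leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as the last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."
To order results in a way that disagrees with the sort order determining the first row per group, you can nest the above query in an outer query with another ORDER BY. Like:
- PostgreSQL DISTINCT ON with different ORDER BY
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated. Details:
- PostgreSQL sort by datetime asc, null first?
The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):
- You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
- You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
I tested with Postgres versions 8.3–11. But the feature has been there at least since version 7.1, so basically always.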
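
To make the nesting point concrete, here is a minimal sketch against the sample data from the question. The temporary table and the alias sub are only for illustration; the outer ORDER BY is free to disagree with the inner one, which decides the first row per customer.

```sql
-- Sample data from the question, in a temporary table so nothing persists.
CREATE TEMP TABLE purchases (id int PRIMARY KEY, customer text, total int);
INSERT INTO purchases VALUES
   (1, 'Joe',   5)
 , (2, 'Sally', 3)
 , (3, 'Joe',   2)
 , (4, 'Sally', 1);

SELECT *
FROM  (
   -- inner query: pick one row per customer (highest total, smallest id on ties)
   SELECT DISTINCT ON (customer)
          id, customer, total
   FROM   purchases
   ORDER  BY customer, total DESC NULLS LAST, id
   ) sub
-- outer query: present the winners by total instead of by customer
ORDER  BY total DESC NULLS LAST, id;
```

With the four sample rows this returns (1, Joe, 5) followed by (2, Sally, 3).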

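Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

This may be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that the sort order matches and the index is applicable.

Effectiveness / Performance optimization

Weigh cost and benefit before creating tailored indexes for each query. The potential of the above index largely depends on data distribution.

The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index-only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer. Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally, setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find out how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text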
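
As a rough sketch of the SET LOCAL advice: the setting only lasts until the end of the enclosing transaction, so it does not leak into the rest of the session. The value '64MB' is an arbitrary placeholder, not a recommendation; pick one based on what EXPLAIN ANALYZE reports for your data.

```sql
BEGIN;

-- Applies to this transaction only; reverts at COMMIT or ROLLBACK.
SET LOCAL work_mem = '64MB';   -- hypothetical value

-- Check the Sort node in the output: a "Disk:" figure means the sort spilled.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

COMMIT;
```

If the plan reads pre-sorted data from the index there may be no Sort node at all, in which case work_mem is not the bottleneck for this query.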
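For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that is not implemented up to Postgres 11. (An implementation for index-only scans was planned for Postgres 12 at the time of writing.) For now, there are faster query techniques to substitute for this, in particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
- Optimize GROUP BY query to retrieve latest record per user
- Optimize groupwise maximum query
- Query last N related rows per row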

Benchmark

I had a simple benchmark here which is outdated now. I replaced it with a detailed benchmark in a separate answer.

Related discussion

On Oracle 9.2+ (not 8i+ as originally stated), SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:

WITH summary AS (
SELECT p.id,
p.customer,
p.total,
ROW_NUMBER() OVER(PARTITION BY p.customer
ORDER BY p.total DESC) AS rk
FROM PURCHASES p)
SELECT s.*
FROM summary s
WHERE s.rk = 1
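
One caveat worth spelling out: with ORDER BY p.total DESC alone, the row that gets rk = 1 is arbitrary when a customer has several purchases sharing the highest total. A minimal sketch of a deterministic variant, assuming you want the smallest id among ties (that tiebreaker is my choice, not part of the query above):

```sql
WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           -- p.id as a second sort key makes the pick among ties deterministic
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC, p.id) AS rk
    FROM purchases p
)
SELECT s.id, s.customer, s.total
FROM summary s
WHERE s.rk = 1;
```

If you would rather get every tied row back, swapping ROW_NUMBER() for RANK() with only p.total DESC in the ORDER BY does that.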

Supported by any database:

But you need to add logic to break ties:

SELECT MIN(x.id), -- change to MAX if you want the highest
x.customer,
x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
MAX(total) AS max_total
FROM PURCHASES p
GROUP BY p.customer) y ON y.customer = x.customer
AND y.max_total = x.total
GROUP BY x.customer, x.total

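Related discussion

Benchmark

Testing the most interesting candidates with Postgres 9.4 and 9.5, using a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

For Postgres 9.5 I ran a second test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

Setup

Main table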

CREATE TABLE purchases (
id serial
, customer_id INT -- REFERENCES customer
, total INT -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

I use a serial and an integer customer_id since that is a more typical setup. I also added some_column to make up for the typically larger number of columns.

Dummy data, PK, index. A typical table also has some dead tuples:

INSERT INTO purchases (customer_id, total, some_column) -- insert 200k rows
SELECT (random() * 10000)::INT AS customer_id -- 10k customers
, (random() * random() * 100000)::INT AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::INT)
FROM generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::INT AS customer_id -- 10k customers
, (random() * random() * 100000)::INT AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::INT)
FROM generate_series(1,20000) g; -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table, for the superior query

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM purchases
GROUP BY 1
ORDER BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id, to get only a few rows per customer_id.

Object sizes for table purchases

Generated with this query (linked in the original answer).

what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
core_relation_size | 20496384 | 20 MB | 102
visibility_map | 0 | 0 bytes | 0
free_space_map | 24576 | 24 kB | 0
table_size_incl_toast | 20529152 | 20 MB | 102
indexes_size | 10977280 | 10 MB | 54
total_size_incl_toast_and_indexes | 31506432 | 30 MB | 157
live_rows_in_text_representation | 13729802 | 13 MB | 68
------------------------------ | | |
ROW_COUNT | 200045 | |
live_tuples | 200045 | |
dead_tuples | 19955 | |
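
The exact query behind the table above is not reproduced here. As a rough sketch only, a few of the same figures can be obtained with PostgreSQL's built-in size functions; the column aliases are borrowed from the table for readability, and this is not the query the numbers were produced with.

```sql
SELECT pg_size_pretty(pg_relation_size('purchases'))       AS core_relation_size
     , pg_size_pretty(pg_table_size('purchases'))          AS table_size_incl_toast
     , pg_size_pretty(pg_indexes_size('purchases'))        AS indexes_size
     , pg_size_pretty(pg_total_relation_size('purchases')) AS total_size_incl_toast_and_indexes;
```

Live and dead tuple estimates are available separately as n_live_tup and n_dead_tup in the pg_stat_user_tables view.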

Queries

1. row_number() in CTE (see other answer)

WITH cte AS (
SELECT id, customer_id, total
, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
)
SELECT id, customer_id, total
FROM cte
WHERE rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM (
SELECT id, customer_id, total
, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
) sub
WHERE rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id > c.customer_id -- lateral reference
ORDER BY customer_id, total DESC
LIMIT 1
) u
)
SELECT id, customer_id, total
FROM cte
ORDER BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM customer c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id = c.customer_id -- lateral reference
ORDER BY total DESC
LIMIT 1
) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
, customer_id
, MAX(total) AS total
FROM purchases
GROUP BY customer_id;

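Results

Execution time for the above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an index-only scan on purchases2_3c_idx (among other steps). Some of them only for the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~20 rows per customer_id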

1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4. 92.922 ms
5. 37.679 ms -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4. 78.032 ms
5. 33.944 ms -- winner
6. 211.540 ms

C. Same as B, but with ~2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

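Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real-life table of 65579 rows, with single-column btree indexes on each of the three columns involved, and took the best execution time of 5 runs. Comparing @omgponies' first query (A) to the above DISTINCT ON solution (B):

1. Select the whole table, which results in 5958 rows in this case.

A: 567.218 ms
B: 386.673 ms

2. Use the condition WHERE customer BETWEEN x AND y, resulting in 1000 rows.

A: 249.136 ms
B: 55.111 ms

3. Select a single customer with WHERE customer = x.

A: 0.143 ms
B: 0.072 ms

Same tests repeated with the index described in the other answer:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);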

1A: 277.953 ms
1B: 193.547 ms

2A: 249.796 ms -- special index not used
2B: 28.679 ms

3A: 0.120 ms
3B: 0.048 ms

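Related discussion

This is a common greatest-n-per-group problem, which already has well-tested and highly optimized solutions. Personally, I prefer the left join solution by Bill Karwin (the original post has lots of other solutions).

Mind you that a bunch of solutions to this common problem can, surprisingly, be found in one of the most official sources, the MySQL manual! See Examples of Common Queries: The Rows Holding the Group-wise Maximum of a Certain Column.

Related discussion

In Postgres you can use array_agg like this: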

SELECT customer,
(array_agg(id ORDER BY total DESC))[1],
MAX(total)
FROM purchases
GROUP BY customer

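This will give you the id of each customer's largest purchase.

Some things to note:

- array_agg is an aggregate function, so it works with GROUP BY.
- array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how NULLs are sorted, if you need something different from the default.
- Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed.)
- You could use array_agg in a similar way for your third output column, but max(total) is simpler.
- Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.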
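
To illustrate the point about ordering inside the aggregate, here is a small sketch that sorts NULL totals last and uses id as a tiebreaker. Both additions are assumptions about what you want; drop them if total is NOT NULL and ties don't matter.

```sql
SELECT customer
     -- the ORDER BY applies only inside this aggregate call
     , (array_agg(id ORDER BY total DESC NULLS LAST, id))[1] AS id
     , max(total) AS total   -- max() ignores NULLs, so it agrees with NULLS LAST
FROM   purchases
GROUP  BY customer;
```

With the sample data from the question this yields id 1 for Joe and id 2 for Sally.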

This solution is not very efficient, as pointed out by Erwin, because of the presence of subqueries:

SELECT * FROM purchases p1 WHERE total IN (SELECT MAX(total) FROM purchases WHERE p1.customer=customer) ORDER BY total DESC;

Related discussion

I use this way (PostgreSQL only): https://wiki.postgresql.org/wiki/first/last%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
sfunc = public.first_agg,
basetype = anyelement,
stype = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

Caveat: it ignores NULL rows.

Edit 1 - Use the Postgres extension instead

Now I use this way: http://pgxn.org/dist/first-last-agg/

To install on Ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'

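It is a Postgres extension that gives you first and last functions; apparently faster than the approach above.

Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results without needing the data to be ordered already:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering, would be something like: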

SELECT FIRST(id ORDER BY id), customer, FIRST(total ORDER BY id)
FROM purchases
GROUP BY customer
ORDER BY FIRST(total);

Of course you can order and filter as you see fit within the aggregate; it's very powerful syntax.

Related discussion

Very fast solution

SELECT a.*
FROM
purchases a
JOIN (
SELECT customer, MIN( id ) AS id
FROM purchases
GROUP BY customer
) b USING ( id );

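It is really very fast if the table is indexed by id:

CREATE INDEX purchases_id ON purchases (id);

Note that this picks the row with the lowest id per customer, not the row with the highest total.

Related discussion

The query: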

SELECT purchases.*
FROM purchases
LEFT JOIN purchases AS p
ON
p.customer = purchases.customer
AND
purchases.total < p.total
WHERE p.total IS NULL

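How does that work? (I've been there.)

We want to make sure that we only keep the highest total for each customer.

Some theoretical stuff (skip this part if you only want to understand the query)

Let total be a function T(customer, id) that returns a value given the name and id. To prove that the given total (T(customer, id)) is the highest, we have to prove either

∀x T(customer, id) > T(customer, x) (this total is higher than all other totals for that customer)

or

¬∃x T(customer, id) < T(customer, x) (there exists no higher total for that customer)

The first approach needs us to get all the records for that name, which I do not really like.

The second one needs a smart way to say that there can be no record higher than this one.

Back to SQL

If we left join the table to itself on the name, with the total being less than in the joined table: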

LEFT JOIN purchases AS p
ON
p.customer = purchases.customer
AND
purchases.total < p.total

This makes sure that every record that has another record with a higher total for the same user gets joined:

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
1 , Tom , 200 , 2 , Tom , 300
2 , Tom , 300
3 , Bob , 400 , 4 , Bob , 500
4 , Bob , 500
5 , Alice , 600 , 6 , Alice , 700
6 , Alice , 700

That helps us filter for the highest total for each customer, with no grouping needed:

WHERE p.total IS NULL

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
2 , Tom , 300
4 , Bob , 500
6 , Alice , 700

And that's the answer we need.
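
One thing the walkthrough does not cover is ties: if a customer has two purchases with the same highest total, the plain anti-join returns both. A minimal sketch of a tie-aware variant, assuming total is NOT NULL and that the smallest id should win (the aliases p1 and p2 are mine):

```sql
SELECT p1.*
FROM   purchases p1
LEFT   JOIN purchases p2
       ON  p2.customer = p1.customer
       -- p2 "beats" p1 if it has a higher total,
       -- or the same total and a smaller id
       AND (p2.total > p1.total
            OR (p2.total = p1.total AND p2.id < p1.id))
WHERE  p2.id IS NULL;   -- keep only rows that nothing beats
```

Testing p2.id rather than p2.total in the WHERE clause keeps the check unambiguous, since id is never NULL in a matched row.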

Use the ARRAY_AGG function for PostgreSQL, U-SQL, IBM DB2, and Google BigQuery SQL:

SELECT customer, (ARRAY_AGG(id ORDER BY total DESC))[1], MAX(total)
FROM purchases
GROUP BY customer

In SQL Server you can do this:

SELECT *
FROM (
SELECT ROW_NUMBER()
OVER(PARTITION BY customer
ORDER BY total DESC) AS StRank, *
FROM Purchases) n
WHERE StRank = 1

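Explanation: rows are partitioned by customer and ordered by total within each partition; every row in a partition then gets a sequence number, StRank, and we keep only the row whose StRank is 1.

The accepted "Supported by any database" solution by OMG Ponies has good speed in my tests.

Here I provide an any-database solution that follows the same approach but is more complete and cleaner. Ties are considered (assuming you want only one row per customer, even when several records share the customer's maximum total), and the other purchase fields (e.g. purchase_payment_id) are selected from the actual matching rows in the purchase table.

Supported by any database: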

SELECT * FROM purchase
JOIN (
SELECT MIN(id) AS id FROM purchase
JOIN (
SELECT customer, MAX(total) AS total FROM purchase
GROUP BY customer
) t1 USING (customer, total)
GROUP BY customer
) t2 USING (id)
ORDER BY customer

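This query is reasonably fast, especially when there is a composite index like (customer, total) on the purchase table.

Remark:

t1, t2 are subquery aliases, which can be removed depending on the database.

Caveat: as of this edit in January 2017, the using (...) clause is not supported in MS-SQL and Oracle db. You have to expand it yourself to e.g. on t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.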
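
For databases without USING, the expansion mentioned in the caveat might look like the following sketch. The aliases p and p2 are added here only so that the ON conditions and the select list are unambiguous.

```sql
SELECT p.*
FROM purchase p
JOIN (
    -- smallest id among the rows holding each customer's maximum total
    SELECT MIN(p2.id) AS id
    FROM purchase p2
    JOIN (
        SELECT customer, MAX(total) AS total
        FROM purchase
        GROUP BY customer
    ) t1 ON t1.customer = p2.customer
        AND t1.total = p2.total
    GROUP BY p2.customer
) t2 ON t2.id = p.id
ORDER BY p.customer;
```

Unlike the USING form, SELECT p.* here returns only the columns of purchase, without the extra id column from t2.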

For SQL Server, the most efficient way is:

WITH
ids AS ( --condition for split table into groups
SELECT i FROM (VALUES (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) AS v(i)
)
,src AS (
SELECT * FROM yourTable WHERE <condition> --use this as filter for other conditions
)
,joined AS (
SELECT tops.* FROM ids
CROSS apply --it`s like for each rows
(
SELECT top(1) *
FROM src
WHERE CommodityId = ids.i
) AS tops
)
SELECT * FROM joined

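And don't forget to create a clustered index on the columns used.

If you want to select any row (by some specific condition of your own) from the set of aggregated rows, or if you want to use another aggregate function (sum/avg) in addition to max/min, DISTINCT ON will not help.

You can use the following subquery: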

SELECT
(
SELECT id FROM t2
WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
) id,
name,
MAX(amount) ma,
SUM( ratio )
FROM t2 tf
GROUP BY name

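You can replace amount = MAX( tf.amount ) with any condition you want, with one restriction: this subquery must not return more than one row.

But if you want to do such things, you are probably looking for window functions.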
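
For comparison, a window-function version of the same idea could look like this in PostgreSQL. It is only a sketch and assumes the same hypothetical table t2(id, name, amount, ratio) as above. Window functions are evaluated before DISTINCT ON is applied, so the sum still covers every row of the group.

```sql
SELECT DISTINCT ON (name)
       id
     , name
     , amount AS ma
     , sum(ratio) OVER (PARTITION BY name) AS sum_ratio  -- computed over the whole group
FROM   t2
ORDER  BY name, amount DESC, id;   -- highest amount first, smallest id on ties
```

This returns one row per name: the id and amount of the row with the highest amount, plus the sum of ratio across all rows for that name.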