R: Select the first row by group

From a data frame like this:


test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
   id string
1   1      A
2   1      F
3   2      B
4   2      G
5   3      C
6   3      H
7   4      D
8   4      I
9   5      E
10  5      J

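I want to create a new data frame containing the first row of each id / string pair. If sqldf accepted R code inside the query, it might look like this: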


res <- sqldf("select id, min(rownames(test)), string
from test
group by id, string")

> res
  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?


You can do this very quickly with duplicated():

test[!duplicated(test$id), ]
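To see why this works, it may help to look at the logical vector that duplicated() produces. The following is a minimal sketch on the question's ten-row example; the names dup and first_rows are made up for illustration.

```r
# Rebuild the question's example data
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# duplicated() is FALSE the first time a value appears and TRUE afterwards
dup <- duplicated(test$id)
dup

# Negating it keeps only the first row encountered for each id
first_rows <- test[!dup, ]
first_rows
```

Note that "first" here means first in the current row order, so the data does not need to be sorted by id, but it should already be in the order you care about.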

Benchmarks, for the speed freaks:


ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd,id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
          replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with only the front-runners from the first round, and with more data and more replications.


set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15


What about


DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult ="first"]

Edit

There is also a unique method for data.tables, which returns the first row by key:

jdtu <- function() unique(DT)
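One caveat, offered as a hedged note: to the best of my knowledge, more recent data.table releases (1.9.8 onward, going by the package NEWS) changed the default by of unique() from the key to all columns, so a bare unique(DT) may only drop fully duplicated rows. Assuming a data.table version that supports the by argument, passing it explicitly should give the first row per id on either side of that change. The name first_rows is made up for illustration.

```r
library(data.table)

# Rebuild the question's example data and key it by id
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
DT <- data.table(test, key = "id")

# Name the grouping column explicitly instead of relying on the default `by`
first_rows <- unique(DT, by = "id")
first_rows
```

Using by = key(DT) would be an equivalent way to write the same call for a keyed table.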

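I think that if you are ordering test outside the benchmark, you can also move the setkey and data.table conversion out of the benchmark (since setkey basically sorts by id, the same as order).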


set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult = 'first']

library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self
## 2 jdt()            5    0.01        1      0.02        0
## 1  ju()            5    0.05        5      0.05        0

And with more data

**Edit with the unique method**


set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
benchmark(ju(), jdt(), jdtu(), replications = 5)
##     test replications elapsed relative user.self sys.self
## 2  jdt()            5    0.09     2.25      0.09     0.00
## 3 jdtu()            5    0.04     1.00      0.05     0.00
## 1   ju()            5    0.22     5.50      0.19     0.03

The unique method is the fastest here.


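I favor the dplyr approach.

group_by(id) followed by any one of

filter(row_number()==1) or
slice(1) or
top_n(n = -1)
- top_n() uses the rank function internally and, when no wt is given, ranks on the last column.
  A negative n selects from the bottom of the ranking.

In some instances it may be necessary to arrange the ids after the group_by.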


library(dplyr)

# using filter(), top_n() or slice()

m1 <-
  test %>%
  group_by(id) %>%
  filter(row_number()==1)

m2 <-
  test %>%
  group_by(id) %>%
  slice(1)

m3 <-
  test %>%
  group_by(id) %>%
  top_n(n = -1)

On this data, all three methods return the same result:


# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct>
1     1 A
2     2 B
3     3 C
4     4 D
5     5 E


A simple ddply option:

ddply(test,.(id),function(x) head(x,1))

If speed is an issue, a similar approach can be taken with data.table:


testd <- data.table(test)
setkey(testd,id)
testd[,.SD[1],by = key(testd)]

or this might be faster still:

testd[testd[, .I[1], by = key(testd)]$V1]
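The idea behind the .I idiom is to compute the row number of the first row in each group and then subset once by those row numbers. As a rough base R analogy (not data.table's implementation; idx and first_rows are made-up names), the same two steps might be sketched as:

```r
# Rebuild the question's example data
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# Step 1: for each id, take the first of its row numbers
idx <- as.vector(tapply(seq_len(nrow(test)), test$id, function(i) i[1]))
idx

# Step 2: subset the data frame once with those row numbers
first_rows <- test[idx, ]
first_rows
```

Changing i[1] to i[2] would pick the second row of each group instead, which mirrors the flexibility of the row-index approach.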


Now, for dplyr, adding a distinct counter.


df %>%
  group_by(aa, bb) %>%
  summarise(first=head(value,1), count=n_distinct(value))

You create the groups, then summarise within each group.

If the data is numeric, you can use
first(value) [there is also last(value)] in place of head(value, 1).

See:
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

In full:


> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

> library(dplyr)
> df %>%
+   group_by(aa, bb) %>%
+   summarise(first=head(value,1), count=n_distinct(value))

Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2


(1) SQLite has a built-in rowid pseudo-column, so this works:


sqldf("select min(rowid) rowid, id, string
from test
group by id")

giving:


  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) sqldf itself also has a row.names= argument:


sqldf("select min(cast(row_names as real)) row_names, id, string
from test
group by id", row.names = TRUE)

giving:


  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative that mixes elements of the two above might be even better:


sqldf("select min(rowid) row_names, id, string
from test
group by id", row.names = TRUE)

giving:


  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

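Note that all three of these rely on a SQLite extension to SQL, under which using min or max is guaranteed to take the other columns from the same row. (In other SQL-based databases this may not be guaranteed.)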


A base R option is the split()-lapply()-do.call() idiom:


> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E
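The one-liner above packs three steps together. Unrolled, with intermediate names (pieces, firsts, res) made up purely for illustration, it might look like this:

```r
# Rebuild the question's example data
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# split(): one data frame per id, returned as a named list
pieces <- split(test, test$id)

# lapply(): keep the first row of each piece
firsts <- lapply(pieces, head, 1)

# do.call(rbind, ...): stack the one-row pieces back into a single data frame
res <- do.call(rbind, firsts)
res
```

Because each piece contributes a single row, rbind() takes the row names of the result from the list names, which is why they read 1 to 5 rather than 1, 3, 5, 7, 9.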

A more direct option is to lapply() the [ function:


> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The comma-space 1, ) at the end of the lapply() call is essential, as it is equivalent to calling [1, ] to select the first row and all columns.
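A small sketch of what the trailing empty argument does, calling `[` directly as a function. The names a, b and col_only are made up for illustration.

```r
# Rebuild the question's example data
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# The usual bracket syntax: first row, all columns
a <- test[1, ]

# The same call written in function form; the empty last argument
# plays the role of the blank after the comma in test[1, ]
b <- `[`(test, 1, )
identical(a, b)

# Without the empty argument the call becomes test[1], which selects
# the first column rather than the first row
col_only <- `[`(test, 1)
names(col_only)
```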
