Collapse / concatenate / aggregate a column to a single comma separated string within each group
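I want to aggregate one column of a data frame by two grouping variables and separate the individual values with commas.
Here is some data: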
    data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = rep(1:2, 3), C = c(5:10))
    data
    #     A B  C
    # 1 111 1  5
    # 2 111 2  6
    # 3 111 1  7
    # 4 222 2  8
    # 5 222 1  9
    # 6 222 2 10
"A" and "B" are the grouping variables, and "C" is the variable I want to collapse into a comma-separated character string. I tried:
    library(plyr)
    ddply(data, .(A,B), summarise, test = list(C))
    #     A B  test
    # 1 111 1  5, 7
    # 2 111 2     6
    # 3 222 1     9
    # 4 222 2 8, 10
But when I try to convert the test column to character, it becomes this:
    ddply(data, .(A,B), summarise, test = as.character(list(C)))
    #     A B     test
    # 1 111 1  c(5, 7)
    # 2 111 2        6
    # 3 222 1        9
    # 4 222 2 c(8, 10)
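How can I keep the values as a plain comma-separated character string, rather than something like c(5, 7)?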
Here are some options using toString:
data.table
    # alternative using data.table
    library(data.table)
    as.data.table(data)[, toString(C), by = list(A, B)]
aggregate
This does not use any packages:
    # alternative using aggregate from the stats package in the core of R
    aggregate(C ~ ., data, toString)
sqldf
This one uses the SQL function group_concat via the sqldf package:
    library(sqldf)
    sqldf("select A, B, group_concat(C) C from data group by A, B", method = "raw")
dplyr
A dplyr alternative:
    library(dplyr)
    data %>%
      group_by(A, B) %>%
      summarise(test = toString(C)) %>%
      ungroup()
plyr
    # plyr
    library(plyr)
    ddply(data, .(A,B), summarize, C = toString(C))
Here is a stringr / tidyverse solution:
    library(tidyverse)
    library(stringr)

    data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = rep(1:2, 3), C = c(5:10))

    # Collapse C within each (A, B) group; the separator is ", " to match the output below
    data %>%
      group_by(A, B) %>%
      summarize(test = str_c(C, collapse = ", "))

    # # A tibble: 4 x 3
    # # Groups:   A [2]
    #       A     B test
    #   <dbl> <int> <chr>
    # 1   111     1 5, 7
    # 2   111     2 6
    # 3   222     1 9
    # 4   222     2 8, 10
Change where you put as.character:
    > out <- ddply(data, .(A, B), summarise, test = list(as.character(C)))
    > str(out)
    'data.frame':   4 obs. of  3 variables:
     $ A   : num  111 111 222 222
     $ B   : int  1 2 1 2
     $ test:List of 4
      ..$ : chr  "5" "7"
      ..$ : chr "6"
      ..$ : chr "9"
      ..$ : chr  "8" "10"
    > out
        A B  test
    1 111 1  5, 7
    2 111 2     6
    3 222 1     9
    4 222 2 8, 10
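Note, however, that in this case each item is still a set of separate character values, not a single character string. That is, it is not an actual string that looks like "5, 7", but two strings, "5" and "7", which R displays with a comma between them.
Compare with the following: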
    > out2 <- ddply(data, .(A, B), summarise, test = paste(C, collapse = ", "))
    > str(out2)
    'data.frame':   4 obs. of  3 variables:
     $ A   : num  111 111 222 222
     $ B   : int  1 2 1 2
     $ test: chr  "5, 7" "6" "9" "8, 10"
    > out2
        A B  test
    1 111 1  5, 7
    2 111 2     6
    3 222 1     9
    4 222 2 8, 10
The comparable solution in base R is, of course, aggregate:
    > A1 <- aggregate(C ~ A + B, data, function(x) c(as.character(x)))
    > str(A1)
    'data.frame':   4 obs. of  3 variables:
     $ A: num  111 222 111 222
     $ B: int  1 1 2 2
     $ C:List of 4
      ..$ 0: chr  "5" "7"
      ..$ 1: chr "9"
      ..$ 2: chr "6"
      ..$ 3: chr  "8" "10"
    > A2 <- aggregate(C ~ A + B, data, paste, collapse = ", ")
    > str(A2)
    'data.frame':   4 obs. of  3 variables:
     $ A: num  111 222 111 222
     $ B: int  1 1 2 2
     $ C: chr  "5, 7" "9" "6" "8, 10"
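If you are not sure which of the two you ended up with, a quick check in base R can help. The following is only a minimal sketch on one group's worth of values; the names `x`, `as_list` and `as_string` are illustrative and do not come from the answers above.

```r
# Illustrative values only: the C values of one (A, B) group
x <- c(5, 7)

# A list element that still holds two separate strings, "5" and "7"
as_list <- list(as.character(x))

# A single string with the values joined by ", "
as_string <- paste(x, collapse = ", ")

length(as_list[[1]])               # how many strings sit inside the list element
length(as_string)                  # how many strings the collapsed result holds
identical(toString(x), as_string)  # toString() collapses with ", " as well
```

If `length()` of an element is greater than 1, the column is a list that merely prints with commas, not a real comma-separated string.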
Here is a small improvement that avoids duplicates:
    library(dplyr)

    # 1. Original data set
    data <- data.frame(
      A = c(rep(111, 3), rep(222, 3)),
      B = rep(1:2, 3),
      C = c(5:10))

    # 2. Add duplicate row (a plain data.frame, so no extra package is needed)
    data <- rbind(data, data.frame(
      A = 111, B = 1, C = 5
    ))

    # 3. Solution with duplicates
    data %>%
      group_by(A, B) %>%
      summarise(test = toString(C)) %>%
      ungroup()

    #       A     B test
    #   <dbl> <dbl> <chr>
    # 1   111     1 5, 7, 5
    # 2   111     2 6
    # 3   222     1 9
    # 4   222     2 8, 10

    # 4. Solution without duplicates
    data %>%
      select(A, B, C) %>%
      unique() %>%
      group_by(A, B) %>%
      summarise(test = toString(C)) %>%
      ungroup()

    #       A     B test
    #   <dbl> <dbl> <chr>
    # 1   111     1 5, 7
    # 2   111     2 6
    # 3   222     1 9
    # 4   222     2 8, 10
Hope it will be useful.
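If you would rather drop the repeats without dplyr, one possible base R sketch is to call unique() before toString() inside aggregate(). This assumes A, B and C are the only columns that matter, in which case removing repeated C values within each group should give the same result as removing duplicate rows first. The name `res` is just illustrative.

```r
# Same data as above, including the duplicated row (A = 111, B = 1, C = 5)
data <- data.frame(
  A = c(rep(111, 3), rep(222, 3), 111),
  B = c(rep(1:2, 3), 1),
  C = c(5:10, 5))

# Drop repeated values within each (A, B) group, then collapse with ", "
res <- aggregate(C ~ A + B, data, function(x) toString(unique(x)))
res
```

Note that unique() keeps the first occurrence of each value, so the order of the values within a group follows the original row order.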