R: dplyr - Multiple summary functions

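I am trying to compute several statistics for a data frame.

I tried dplyr's summarise_each. However, the result comes back as a flat single row, with the function names appended as suffixes.

Is there a direct way, using dplyr or base R, to get the result as a data frame whose columns are the columns of the original data frame and whose rows are the summary functions?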

library(dplyr)

df = data.frame(A = sample(1:100, 20),
                B = sample(110:200, 20),
                C = sample(c(0,1), 20, replace = TRUE))

df %>% summarise_each(funs(min, max))
#   A_min B_min C_min A_max B_max C_max
# 1    13   117     0    98   188     1

# Desired format
summary(df)
#        A               B               C
#  Min.   :13.00   Min.   :117.0   Min.   :0.00
#  1st Qu.:34.75   1st Qu.:134.2   1st Qu.:0.00
#  Median :45.00   Median :148.0   Median :1.00
#  Mean   :52.35   Mean   :149.9   Mean   :0.65
#  3rd Qu.:62.50   3rd Qu.:168.8   3rd Qu.:1.00
#  Max.   :98.00   Max.   :188.0   Max.   :1.00

Related discussion

Why not simply use sapply with summary?

sapply(df, summary)

This gives:

            A     B    C
Min.     1.00 112.0 0.00
1st Qu. 23.75 134.5 0.00
Median  57.00 148.5 1.00
Mean    50.15 149.9 0.55
3rd Qu. 77.50 167.2 1.00
Max.    94.00 191.0 1.00

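To get a data frame back, just wrap the sapply call in data.frame(): data.frame(sapply(df, summary)). If you want to keep the summary statistic names in a column, you can extract them with rownames(df) and df$rn <- rownames(df) (where df is now the summary data frame), or use the keep.rownames argument of data.table:

library(data.table)
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)

This gives: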

> dt
        rn     A     B   C
1:    Min. 11.00 113.0 0.0
2: 1st Qu. 21.50 126.8 0.0
3:  Median 55.00 138.0 0.5
4:    Mean 53.65 145.2 0.5
5: 3rd Qu. 83.25 160.5 1.0
6:    Max. 98.00 193.0 1.0
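
For readers who prefer to stay in base R, the row-name route mentioned in this answer could look roughly like the sketch below. The data values and the names res and rn are illustrative assumptions, not taken from the thread.

```r
# Deterministic toy data (hypothetical values, chosen so the results are easy to check)
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(10, 20, 30, 40),
                 C = c(0, 1, 1, 0))

# sapply() returns a numeric matrix: statistics in rows, variables in columns
res <- data.frame(sapply(df, summary))

# Copy the statistic names from the row names into an ordinary column
res$rn <- rownames(res)
rownames(res) <- NULL

# Put the name column first
res <- res[, c("rn", "A", "B", "C")]
print(res)
```

This keeps everything in a plain data.frame, at the cost of one extra step compared with keep.rownames = TRUE.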

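How about:

library(dplyr)
library(tidyr)
# funs() is deprecated in current dplyr; a named list of functions is the replacement
gather(df) %>%
  group_by(key) %>%
  summarise_all(list(min = min, max = max))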

# A tibble: 3 × 3
    key   min   max
  <chr> <dbl> <dbl>
1     A     2    92
2     B   111   194
3     C     0     1
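
Since gather() is superseded in tidyr 1.0.0 and later, the same long-format idea might be written with pivot_longer() instead. This is only a sketch under that version assumption; the toy data and the name long_summary are hypothetical.

```r
library(dplyr)
library(tidyr)

# Deterministic toy data (hypothetical values)
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(10, 20, 30, 40),
                 C = c(0, 1, 1, 0))

long_summary <- df %>%
  # Stack all columns into key/value pairs, one row per cell
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  # One row per original column, one column per statistic
  summarise(min = min(value), max = max(value))

print(long_summary)
```

Note that this layout has the variables in rows and the statistics in columns, which is the transpose of what the question asked for.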

Using the data.frame you suggested, together with the purrr library:

library(purrr)
out <- df %>% map(~summary(.)) %>% rbind.data.frame
row.names(out) <- c("Min.","1st Qu.","Median","Mean","3rd Qu.","Max.")
####             A     B   C
#### Min.     7.00 110.0 0.0
#### 1st Qu. 36.75 132.5 0.0
#### Median  53.50 143.5 0.5
#### Mean    55.45 151.8 0.5
#### 3rd Qu. 82.00 167.0 1.0
#### Max.    99.00 199.0 1.0

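There you go. I just want to mention that this code only works for an input data.frame whose variables are all numeric. If there are character or factor variables, it will return an error, because the output of summary is completely different for those types.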
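
One possible way around the numeric-only limitation is to keep just the numeric columns before summarising. The sketch below uses base R; the label column and the names is_num and num_summary are made up for illustration.

```r
# Toy data with one non-numeric column (hypothetical values)
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(10, 20, 30, 40),
                 label = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric
is_num <- vapply(df, is.numeric, logical(1))

# Summarise only the numeric columns; the character column is skipped
num_summary <- sapply(df[is_num], summary)
print(num_summary)
```

Columns that contain NA values would still need separate care, because summary() then returns an extra "NA's" entry and the results no longer all have the same length.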

This is not the only way, but you can use dplyr and tidyr to reshape the data.frame as you need (plus stringr or something similar to trim the strings).

library(dplyr)

df = data.frame(A = sample(1:100, 20),
                B = sample(110:200, 20),
                C = sample(c(0,1), 20, replace = TRUE))

# as_data_frame() is deprecated in current tibble; as_tibble() is the replacement
as_tibble(summary(df)) %>%
  # trim the blank characters around the column names
  mutate(Var2 = stringr::str_trim(Var2)) %>%
  # you don't need Var1
  select(-Var1) %>%
  # get the type of summary and the value
  tidyr::separate(n, c("Type","value"), sep =":") %>%
  # convert value to numeric
  mutate(value = as.numeric(value)) %>%
  # reshape as you wish
  tidyr::spread(Var2, value, drop = TRUE)
#> # A tibble: 6 x 4
#>      Type     A     B     C
#> *   <chr> <dbl> <dbl> <dbl>
#> 1 1st Qu. 36.25 122.2  1.00
#> 2 3rd Qu. 77.25 164.5  1.00
#> 3    Max. 95.00 193.0  1.00
#> 4    Mean 57.30 144.6  0.85
#> 5  Median 63.00 143.5  1.00
#> 6    Min.  8.00 111.0  0.00

An approach that does not use tidyr or dplyr:

library(magrittr)  # only for the %>% pipe

df <- data.frame(A = sample(1:100, 20),
                 B = sample(110:200, 20),
                 C = sample(c(0,1), 20, replace = TRUE))
df %>%
  lapply(summary) %>%
  do.call("rbind", .)

Output:

  Min. 1st Qu. Median   Mean 3rd Qu. Max.
A    9    32.5   50.5  49.65   70.25   84
B  116   137.2  162.5 157.70  178.20  196
C    0     0.0    0.0   0.45    1.00    1

If you want to do it with dplyr, try:

library(dplyr)
library(tidyr)  # for gather()

df %>%
  gather(attribute, value) %>%
  group_by(attribute) %>%
  do(as.data.frame(as.list(summary(.$value))))

Output:

Source: local data frame [3 x 7]
Groups: attribute [3]

  attribute  Min. X1st.Qu. Median   Mean X3rd.Qu.  Max.
      <chr> <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl>
1         A     9     32.5   50.5  49.65    70.25    84
2         B   116    137.2  162.5 157.70   178.20   196
3         C     0      0.0    0.0   0.45     1.00     1

Thanks a lot for the help, everyone! After some cherry-picking, I went with the following approach.

# Dataframe
df = data.frame(A = sample(1:100, 20),
                B = sample(110:200, 20),
                C = sample(c(0,1), 20, replace = TRUE))

# Add summary functions to a list
summaryFns = list(
  NA.n = function(x) sum(is.na(x)),
  NA.percent = function(x) sum(is.na(x))/length(x),
  unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
  min = function(x) min(x, na.rm=TRUE),
  max = function(x) max(x, na.rm=TRUE))

# Summarise data frame with each function
# Using dplyr:
library(dplyr)
sapply(summaryFns, function(fn){df %>% summarise_all(fn)})
#   NA.n NA.percent unique.n min max
# A 0    0          20       1   98
# B 0    0          20       114 200
# C 0    0          2        0   1

# Using base-r:
sapply(summaryFns, function(fn){sapply(df, fn)})
#   NA.n NA.percent unique.n min max
# A    0          0       20   1  98
# B    0          0       20 114 200
# C    0          0        2   0   1

I think this is the most direct and most flexible approach.
Further comments, corrections, and suggestions are much appreciated.
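
To see how the NA-aware functions in this list behave, here is a small check of the base R variant on data that actually contains a missing value. The data frame df_na and its values are hypothetical and chosen only so the results are easy to verify by hand.

```r
# Toy data with one missing value in column A (hypothetical values)
df_na <- data.frame(A = c(1, NA, 3, 3),
                    B = c(5, 6, 7, 8))

# Same list of summary functions as in the approach above
summaryFns <- list(
  NA.n = function(x) sum(is.na(x)),
  NA.percent = function(x) sum(is.na(x)) / length(x),
  unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
  min = function(x) min(x, na.rm = TRUE),
  max = function(x) max(x, na.rm = TRUE))

# Outer sapply loops over the functions, inner sapply over the columns;
# the result is a matrix with one row per column and one column per function
res <- sapply(summaryFns, function(fn) sapply(df_na, fn))
print(res)
```

Column A has one NA out of four values, so NA.n is 1 and NA.percent is 0.25, and unique.n counts only the two distinct non-missing values.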