dplyr - Multiple summary functions
I am trying to compute several summary statistics for a data frame.
I tried `summarise_each`, but the result comes back as a single flat row.
Is there a direct way to get the output in a format like that of `summary()`?

```r
library(dplyr)

df = data.frame(A = sample(1:100, 20),
                B = sample(110:200, 20),
                C = sample(c(0, 1), 20, replace = TRUE))

df %>%
  summarise_each(funs(min, max))
#   A_min B_min C_min A_max B_max C_max
# 1    13   117     0    98   188     1

# Desired format
summary(df)
#        A               B               C
#  Min.   :13.00   Min.   :117.0   Min.   :0.00
#  1st Qu.:34.75   1st Qu.:134.2   1st Qu.:0.00
#  Median :45.00   Median :148.0   Median :1.00
#  Mean   :52.35   Mean   :149.9   Mean   :0.65
#  3rd Qu.:62.50   3rd Qu.:168.8   3rd Qu.:1.00
#  Max.   :98.00   Max.   :188.0   Max.   :1.00
```
Why not simply use:

```r
sapply(df, summary)
```

which gives:

```
             A     B    C
Min.      1.00 112.0 0.00
1st Qu.  23.75 134.5 0.00
Median   57.00 148.5 1.00
Mean     50.15 149.9 0.55
3rd Qu.  77.50 167.2 1.00
Max.     94.00 191.0 1.00
```
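To get a data frame back, just wrap the `sapply` call, for example in `data.table`:

```r
library(data.table)

# keep.rownames = TRUE turns the statistic names into a column called "rn"
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)
```

which gives:

```
> dt
        rn     A     B   C
1:    Min. 11.00 113.0 0.0
2: 1st Qu. 21.50 126.8 0.0
3:  Median 55.00 138.0 0.5
4:    Mean 53.65 145.2 0.5
5: 3rd Qu. 83.25 160.5 1.0
6:    Max. 98.00 193.0 1.0
```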
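If you would rather not load data.table, roughly the same shape can be built in base R. This is only a sketch: the column name `stat` and the `set.seed()` call are my own choices for illustration, not part of the answer above.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible
df <- data.frame(A = sample(1:100, 20),
                 B = sample(110:200, 20),
                 C = sample(c(0, 1), 20, replace = TRUE))

# sapply() returns a numeric matrix: one row per statistic, one column per variable
m <- sapply(df, summary)

# Move the statistic names into an ordinary column ("stat" is a made-up name);
# passing row.names = NULL explicitly gives plain integer row names
out <- data.frame(stat = rownames(m), m, row.names = NULL)
out
```

The result is a plain data frame with columns `stat`, `A`, `B` and `C`, so it can be filtered or joined like any other.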
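How about:

```r
library(dplyr)
library(tidyr)

# funs() is deprecated in newer dplyr versions; a named list does the same job
gather(df) %>%
  group_by(key) %>%
  summarise_all(list(min = min, max = max))
```

```
# A tibble: 3 × 3
    key   min   max
  <chr> <dbl> <dbl>
1     A     2    92
2     B   111   194
3     C     0     1
```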
Using the data.frame you suggested, and the purrr library:

```r
library(purrr)

out <- df %>% map(~summary(.)) %>% rbind.data.frame
row.names(out) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")

####             A     B   C
#### Min.     7.00 110.0 0.0
#### 1st Qu. 36.75 132.5 0.0
#### Median  53.50 143.5 0.5
#### Mean    55.45 151.8 0.5
#### 3rd Qu. 82.00 167.0 1.0
#### Max.    99.00 199.0 1.0
```

There you go. I just want to mention that this code only works for an input data.frame whose variables are all numeric. If there are character or factor variables it will return an error, because the output of `summary` is completely different for those types.
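One way around that limitation is to keep only the numeric columns before summarising. A minimal sketch, assuming a made-up data frame `df_mixed` with one character column:

```r
# Hypothetical mixed-type input: two numeric columns and one character column
df_mixed <- data.frame(A = c(3, 7, 5, 9),
                       B = c(110, 150, 130, 170),
                       G = c("x", "y", "x", "y"),
                       stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric
is_num <- vapply(df_mixed, is.numeric, logical(1))

# Summarise only the numeric columns; the result is a 6 x 2 matrix
num_summary <- sapply(df_mixed[is_num], summary)
num_summary
```

The character column `G` is simply dropped here; whether that is acceptable depends on what you need from the non-numeric variables.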
This is not the only way, but you can reshape the output of `summary()` as needed:

```r
library(dplyr)

df = data.frame(A = sample(1:100, 20),
                B = sample(110:200, 20),
                C = sample(c(0, 1), 20, replace = TRUE))

as_data_frame(summary(df)) %>%
  # trim stray whitespace from the variable names
  mutate(Var2 = stringr::str_trim(Var2)) %>%
  # you don't need Var1
  select(-Var1) %>%
  # split each cell into the type of summary and its value
  tidyr::separate(n, c("Type", "value"), sep = ":") %>%
  # convert value to numeric
  mutate(value = as.numeric(value)) %>%
  # reshape as you wish
  tidyr::spread(Var2, value, drop = TRUE)
#> # A tibble: 6 x 4
#>      Type     A     B     C
#> *   <chr> <dbl> <dbl> <dbl>
#> 1 1st Qu. 36.25 122.2  1.00
#> 2 3rd Qu. 77.25 164.5  1.00
#> 3    Max. 95.00 193.0  1.00
#> 4    Mean 57.30 144.6  0.85
#> 5  Median 63.00 143.5  1.00
#> 6    Min.  8.00 111.0  0.00
```
A way without the dplyr verbs, using only `lapply` and `do.call`:

```r
library(magrittr)  # provides the %>% pipe

df <- data.frame(A = sample(1:100, 20),
                 B = sample(110:200, 20),
                 C = sample(c(0, 1), 20, replace = TRUE))

df %>%
  lapply(summary) %>%
  do.call("rbind", .)
```

Output:

```
  Min. 1st Qu. Median   Mean 3rd Qu. Max.
A    9    32.5   50.5  49.65   70.25   84
B  116   137.2  162.5 157.70  178.20  196
C    0     0.0    0.0   0.45    1.00    1
```
If you do want to use dplyr and tidyr:

```r
library(dplyr)
library(tidyr)
df %>%
  gather(attribute, value) %>%
  group_by(attribute) %>%
  do(as.data.frame(as.list(summary(.$value))))
```

Output:

```
Source: local data frame [3 x 7]
Groups: attribute [3]

  attribute  Min. X1st.Qu. Median   Mean X3rd.Qu.  Max.
      <chr> <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl>
1         A     9     32.5   50.5  49.65    70.25    84
2         B   116    137.2  162.5 157.70   178.20   196
3         C     0      0.0    0.0   0.45     1.00     1
```
Thanks a lot for the help, everyone! After some cherry-picking, I ended up with the following approach.

```r
# Dataframe
df = data.frame(A = sample(1:100, 20),
                B = sample(110:200, 20),
                C = sample(c(0, 1), 20, replace = TRUE))

# Add summary functions to a list
summaryFns = list(
  NA.n = function(x) sum(is.na(x)),
  NA.percent = function(x) sum(is.na(x)) / length(x),
  unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
  min = function(x) min(x, na.rm = TRUE),
  max = function(x) max(x, na.rm = TRUE))

# Summarise data frame with each function
# Using dplyr:
library(dplyr)
sapply(summaryFns, function(fn){df %>% summarise_all(fn)})
#   NA.n NA.percent unique.n min max
# A 0    0          20       1   98
# B 0    0          20       114 200
# C 0    0          2        0   1

# Using base-r:
sapply(summaryFns, function(fn){sapply(df, fn)})
#   NA.n NA.percent unique.n min max
# A    0          0       20   1  98
# B    0          0       20 114 200
# C    0          0        2   0   1
```

I think this is the most direct and flexible approach.
Further comments, edits and suggestions are much appreciated.
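For reuse, the base-R variant could be wrapped in a small helper that always returns a data frame. This is just a sketch: the name `summarise_with` is made up, and it assumes every function in the list returns a single number per column.

```r
# Hypothetical helper: apply each function in `fns` to each column of `data`
# and return one row per column, one column per statistic.
summarise_with <- function(data, fns) {
  # vapply() checks that each function returns exactly one numeric value
  cols <- lapply(fns, function(fn) vapply(data, fn, numeric(1)))
  data.frame(cols, row.names = names(data), check.names = FALSE)
}

# Small example input with one missing value
df_na <- data.frame(A = c(5, 1, NA, 9), B = c(2, 2, 3, 4))

fns <- list(
  NA.n = function(x) sum(is.na(x)),
  min  = function(x) min(x, na.rm = TRUE),
  max  = function(x) max(x, na.rm = TRUE))

res <- summarise_with(df_na, fns)
res
```

Unlike the bare `sapply` call, this keeps its shape even when the input has a single column, since the result is assembled explicitly rather than left to `sapply`'s simplification.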