Get a long-format data frame from a list
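I have a list of lists that contain strings. The first string of each sublist describes the category that the following strings belong to. I want a (long-format) data frame with one column for the category and one for the content.
How can I get a long-format data frame from this list: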
```r
mylist <- list(
  c("A", "lorem", "ipsum"),
  c("B", "sed", "eiusmod", "tempor", "inci"),
  c("C", "aliq", "ex", "ea"))

> mylist
[[1]]
[1] "A"     "lorem" "ipsum"

[[2]]
[1] "B"       "sed"     "eiusmod" "tempor"  "inci"

[[3]]
[1] "C"    "aliq" "ex"   "ea"
```
It should look like this data frame:
```r
mydf <- data.frame(
  cate = c("A", "A", "B", "B", "B", "B", "C", "C", "C"),
  cont = c("lorem", "ipsum", "sed", "eiusmod", "tempor", "inci", "aliq", "ex", "ea"))

> mydf
  cate    cont
1    A   lorem
2    A   ipsum
3    B     sed
4    B eiusmod
5    B  tempor
6    B    inci
7    C    aliq
8    C      ex
9    C      ea
```
I have already separated the categories from the contents.
```r
cate <- sapply(mylist, "[[", 1)    # first string of each vector
cont <- sapply(mylist, "[", -(1))  # everything except the first string
```
How do I proceed from here to get mydf?
Using the original list rather than the split objects you created, you could try the following:
```r
library(data.table)
setorder(melt(as.data.table(transpose(mylist)),
              id.vars = "V1", na.rm = TRUE), V1, variable)[]
#    V1 variable   value
# 1:  A       V2   lorem
# 2:  A       V3   ipsum
# 3:  B       V2     sed
# 4:  B       V3 eiusmod
# 5:  B       V4  tempor
# 6:  B       V5    inci
# 7:  C       V2    aliq
# 8:  C       V3      ex
# 9:  C       V4      ea
```
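To make the one-liner above easier to follow, here is a rough step-by-step sketch of the same idea. The intermediate names `wide` and `long` are mine, not from the answer, and `transpose` is called with an explicit `data.table::` prefix in case another loaded package exports a function of the same name.

```r
library(data.table)

mylist <- list(
  c("A", "lorem", "ipsum"),
  c("B", "sed", "eiusmod", "tempor", "inci"),
  c("C", "aliq", "ex", "ea"))

# transpose() collects element i of every vector into one list element,
# padding the shorter vectors with NA. As a data.table this gives one row
# per original vector and columns V1..V5, with V1 holding the category.
wide <- as.data.table(data.table::transpose(mylist))
print(wide)

# melt() stacks V2..V5 under V1; na.rm = TRUE drops the NA padding rows.
long <- melt(wide, id.vars = "V1", na.rm = TRUE)

# Sort by category, then by original position within the vector.
setorder(long, V1, variable)
print(long)
```

The `variable` column records the original position (V2, V3, ...) and can be dropped afterwards if only the category and content are needed.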
For fun, you could also try one of the following:
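```r
library(dplyr)
library(tidyr)

# tibble() replaces the deprecated data_frame()
tibble(id = seq_along(mylist), mylist) %>%
  unnest(mylist) %>%            # one row per string
  group_by(id) %>%
  mutate(ind = mylist[1]) %>%   # first string in each group is the category
  slice(2:n())                  # drop the category row itself
```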
```r
library(purrr)
library(tibble)  # tibble() replaces the deprecated data_frame()

tibble(
  value = mylist %>% map(~ .x[-1]) %>% unlist,
  ind = mylist %>% map(~ rep(.x[1], length(.x) - 1)) %>% unlist
)
```
Note that "purrr" also has
We can also build the data frame directly from the split objects:
```r
dat <- data.frame(cat = rep(cate, lengths(cont)),
                  cont = unlist(cont))
```
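To see why this works, it may help to look at the intermediate pieces on the example data. This is only a sketch: I build `cate` and `cont` with `vapply` and `lapply` instead of the `sapply` calls from the question (my substitution, so the return types are fixed), and `n_per_cat` is a name I made up for illustration.

```r
mylist <- list(
  c("A", "lorem", "ipsum"),
  c("B", "sed", "eiusmod", "tempor", "inci"),
  c("C", "aliq", "ex", "ea"))

# First element of each vector is the category; the rest is the content.
cate <- vapply(mylist, `[[`, character(1), 1)
cont <- lapply(mylist, `[`, -1)

# Number of content strings per category: 2, 4 and 3 for this example.
n_per_cat <- lengths(cont)

# Repeat each category once per content string, then flatten the contents.
# Both vectors end up with the same length, so they line up row by row.
dat <- data.frame(cat  = rep(cate, n_per_cat),
                  cont = unlist(cont))
print(dat)
```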
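Since there was some discussion about what the "best" answer is (or even, for anyone who doubted it, whether something is an answer at all), here are some benchmarks in case performance matters. They are based on a list of 100,000 vectors to process: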
```
Unit: milliseconds
   expr       min        lq     mean    median       uq      max neval cld
 heroka  56.24516  67.98583 122.1209  82.35606 117.6017 391.8297    50  a
  akrun 258.86939 283.10408 363.5425 331.50263 448.9134 578.1818    50   b
 ananda  47.72320  61.05269 132.2678  76.22913 218.8286 385.5709    50  a
```
The benchmark code, together with the variables it assumes:
```r
library(data.table)
library(microbenchmark)

heroka <- function() {
  data.frame(cat = rep(cate, lengths(cont)), cont = unlist(cont))
}
akrun <- function() {
  setNames(stack(setNames(cont, cate))[2:1], c("cate", "cont"))
}
ananda <- function() {
  setorder(melt(as.data.table(transpose(mylist)),
                id.vars = "V1", na.rm = TRUE), V1, variable)[]
}

# 100,000 vectors: one category letter followed by 1 to 5 content letters.
# sample(5, 1) draws a single size; the original passed sample(5) as the size.
mylist <- replicate(100000, c(sample(LETTERS[1:10], 1),
                              sample(LETTERS[1:10], sample(5, 1))))
cate <- sapply(mylist, "[[", 1)
cont <- sapply(mylist, "[", -(1))

tests <- microbenchmark(
  heroka = heroka(),
  akrun = akrun(),
  ananda = ananda(),
  times = 50
)
```
After setting the names of 'cont' to 'cate', the list can be stacked into a data frame:
```r
setNames(stack(setNames(cont, cate))[2:1], c('cate', 'cont'))
#  cate    cont
#1    A   lorem
#2    A   ipsum
#3    B     sed
#4    B eiusmod
#5    B  tempor
#6    B    inci
#7    C    aliq
#8    C      ex
#9    C      ea
```
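One detail that is easy to miss with this approach: `stack()` returns its grouping column (`ind`, renamed to `cate` here) as a factor. If a plain character column is wanted, it can be converted afterwards. A small sketch on the example data, where the name `res` and the `vapply`/`lapply` split are my own choices:

```r
mylist <- list(
  c("A", "lorem", "ipsum"),
  c("B", "sed", "eiusmod", "tempor", "inci"),
  c("C", "aliq", "ex", "ea"))

cate <- vapply(mylist, `[[`, character(1), 1)
cont <- lapply(mylist, `[`, -1)

# stack() yields columns (values, ind); [2:1] swaps them to (ind, values)
# before they are renamed to (cate, cont).
res <- setNames(stack(setNames(cont, cate))[2:1], c("cate", "cont"))

# The category column comes back as a factor.
cate_was_factor <- is.factor(res$cate)

# Optional: convert it to character.
res$cate <- as.character(res$cate)
str(res)
```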
Another option, working directly on the original list:
```r
do.call(rbind, lapply(mylist, function(x)
  data.frame(cate = x[1], cont = x[-1])))
#  cate    cont
#1    A   lorem
#2    A   ipsum
#3    B     sed
#4    B eiusmod
#5    B  tempor
#6    B    inci
#7    C    aliq
#8    C      ex
#9    C      ea
```
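As a quick sanity check, the two base R approaches can be compared on the example data. This is a sketch only: the names `by_rep`, `by_rbind` and `same` are hypothetical, and `stringsAsFactors = FALSE` is set explicitly (my addition) so that the comparison behaves the same on older R versions, where strings became factors by default.

```r
mylist <- list(
  c("A", "lorem", "ipsum"),
  c("B", "sed", "eiusmod", "tempor", "inci"),
  c("C", "aliq", "ex", "ea"))

cate <- vapply(mylist, `[[`, character(1), 1)
cont <- lapply(mylist, `[`, -1)

# rep()/unlist() approach
by_rep <- data.frame(cate = rep(cate, lengths(cont)),
                     cont = unlist(cont),
                     stringsAsFactors = FALSE)

# One small data frame per vector, bound together row-wise
by_rbind <- do.call(rbind, lapply(mylist, function(x) {
  data.frame(cate = x[1], cont = x[-1], stringsAsFactors = FALSE)
}))

# Compare the two results column by column.
same <- identical(by_rep$cate, by_rbind$cate) &&
  identical(by_rep$cont, by_rbind$cont)
print(same)
```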