R, Complement on aggregating data.table
Is it possible to aggregate with the complement in R data.tables? An example follows.

    library(data.table)
    dt <- data.table(a = c("word1", "word2", "word2", "word2"),
                     b = c("cat1", "cat1", "cat1", "cat2"))
Getting the count of a specific word within a category:

    newdt <- dt[, (.N), by = .(a, b)]
    # word1, cat1 - 1
    # word2, cat1 - 2
    # word2, cat2 - 1
How do I count all the other words in the category? Or, relatedly, does the word also belong to other categories? Something like the following?

    # doesn't work
    # newdt2 <- dt[a != a, (.N), by = .(a, b)]
    # the expected answer would be
    # word1, cat1 - 2
    # word2, cat1 - 1
    # word2, cat2 - 0
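I could not find any help in online tutorials or existing questions. Is there an easy way to get the complement? A data.table solution would be preferable, since this has to handle a table with 50M rows. Thanks!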
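This follows Bruno's idea of computing the difference between the total count per category and the number of occurrences of each word in that category, but using data.table syntax with an update on join:

    library(data.table)
    dt <- data.table(a = c("word1", rep("word2", 3L)),
                     b = c(rep("cat1", 3L), "cat2"))
    dt[, .N, by = .(a, b)][dt[, .N, by = b], on = "b", Nc := i.N - N][]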
           a    b N Nc
    1: word1 cat1 1  2
    2: word2 cat1 2  1
    3: word2 cat2 1  0
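The chained expression above can be unpacked into separate steps, which may make the update join easier to follow. This is only a sketch of the same computation; the names `pair_counts` and `cat_totals` are illustrative and do not appear in the original answer.

```r
library(data.table)

dt <- data.table(a = c("word1", rep("word2", 3L)),
                 b = c(rep("cat1", 3L), "cat2"))

# Step 1: count occurrences of each (word, category) pair
pair_counts <- dt[, .N, by = .(a, b)]

# Step 2: count all rows per category
cat_totals <- dt[, .N, by = b]

# Step 3: update join on b. For each matching row of pair_counts,
# i.N is the category total (from cat_totals) and N is the pair count,
# so Nc is the number of rows in the category that hold other words.
# := adds the column by reference; no copy of pair_counts is made.
pair_counts[cat_totals, on = "b", Nc := i.N - N]

print(pair_counts)
```

Because `:=` modifies `pair_counts` in place, the only extra intermediate table is the small per-category total, which should matter for a table with tens of millions of rows.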
Here is your code (I added the double quotes to make it run):

    library(data.table)
    dt <- data.table(a = c("word1", "word2", "word2", "word2"),
                     b = c("cat1", "cat1", "cat1", "cat2"))
    newdt <- dt[, (.N), by = .(a, b)]
    names(newdt) = c("a", "b", "cnt")  # rename the count column
The following lines count the number of occurrences of each category:

    catCnt = dt[, (.N), by = .(b)]
    names(catCnt) = c("b", "tot_b")
    catCnt
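The number of words in each category other than the current word is the difference between the number of words belonging to the category and the count of the (a, b) couple.
To get the result, I merged the two tables on the category column b:

    aux = merge(newdt, catCnt, by = "b")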
Then I compute the difference between the total count and the "couple" count:

    aux$cnt_not_a = aux$tot_b - aux$cnt
If you only want to keep the required columns:

    res = aux[, c("b", "a", "cnt_not_a")]
    res
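As a possible alternative to the merge, the category total can be computed with a grouped `sum()` on the aggregated counts, so no second table is needed. The sketch below also addresses the second part of the question (whether a word appears in other categories) using `uniqueN()`. The names `counts`, `word_cats` and `n_other_cat` are illustrative, not taken from the thread.

```r
library(data.table)

dt <- data.table(a = c("word1", "word2", "word2", "word2"),
                 b = c("cat1", "cat1", "cat1", "cat2"))

# Count each (word, category) pair, then within each category subtract
# the pair count from the category total to get the complement count.
counts <- dt[, .N, by = .(a, b)][, Nc := sum(N) - N, by = b]
print(counts)

# For each word, count the distinct categories it appears in, minus one:
# the number of categories beyond the first in which the word occurs.
word_cats <- dt[, .(n_other_cat = uniqueN(b) - 1L), by = a]
print(word_cats)
```

On the example data, `Nc` should match `cnt_not_a` from the merge-based steps, and `n_other_cat` is 0 for `word1` and 1 for `word2`.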
I don't know whether you can do this using only