How do I apply a function using comparisons of data in two or more columns with R data.table
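I want to apply a function to an R data.table object that compares the values in two columns and returns a result. Here is an example data table X: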

    library(data.table)

    # Each column has 8 values; row 8 repeats the FIRST/SECOND values of row 1
    X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
                            FIRST=c("A","BB","AA","B","AAA","B","A,B","A"),
                            SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A","B")))

       POSITION FIRST SECOND
    1:        1     A      B
    2:        4    BB     AA
    3:        5    AA      B
    4:        9     B    AAA
    5:       24   AAA    BBB
    6:       36     B AB,ABB
    7:       42   A,B    B,A
    8:       56     A      B

I want to perform the following logical comparisons on the data in the "FIRST" and "SECOND" columns to create a "RESULT" column:

    SAME  = FIRST and SECOND are both exactly one character long
    BLOCK = FIRST and SECOND have the same character length, greater than one, and neither is mixed (i.e. no comma)
    LESS  = SECOND has fewer characters than FIRST, and neither is mixed
    MORE  = SECOND has more characters than FIRST, and neither is mixed
    MIXED = either FIRST or SECOND contains a comma

So the desired result would look like this:

    POSITION  FIRST  SECOND  RESULTS
    1         A      B       SAME
    4         BB     AA      BLOCK
    5         A      B,A     MIXED
    9         AA     B       LESS
    24        B      AAA     MORE
    28        BBB    A,B     MIXED
    36        AAA    BBB     BLOCK
    42        B      AB,ABB  MIXED
    56        A,B    B,A     MIXED

The following works, but it is slow on a file with 4 million rows!

    # Nested ifelse: the first matching condition decides the label
    X[, RESULT := ifelse(nchar(FIRST)+nchar(SECOND)==2,"SAME",
                  ifelse(grepl(",", FIRST) | grepl(",",SECOND),"MIXED",
                  ifelse(nchar(FIRST) > nchar(SECOND),"LESS",
                  ifelse(nchar(FIRST) < nchar(SECOND),"MORE","BLOCK"))))]

But it does give the desired result:

       POSITION FIRST SECOND RESULT
    1:        1     A      B   SAME
    2:        4    BB     AA  BLOCK
    3:        5    AA      B   LESS
    4:        9     B    AAA   MORE
    5:       24   AAA    BBB  BLOCK
    6:       36     B AB,ABB  MIXED
    7:       42   A,B    B,A  MIXED
    8:       56     A      B   SAME

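I actually have many more conditions to test, and some of them are more complicated than simple character counts. Instead of a long ifelse statement, is it possible to apply a function that takes the two columns as input? For example: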

    # Classifies a single row x (a one-row data.table)
    checkType <- function(x) {
      if(nchar(x$FIRST)+nchar(x$SECOND)==2) {
        type <-"SAME"
      } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) > nchar(x$SECOND))) {
        type <-"LESS"
      } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) < nchar(x$SECOND))) {
        type <-"MORE"
      } else if (!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) == nchar(x$SECOND)) & nchar(x$SECOND)>1) {
        type <-"BLOCK"
      } else {
        type <-"MIXED"
      }
      return(type)
    }

    > checkType(X[1,])
    [1] "SAME"

    # Row-by-row loop: works, but slow
    for(i in 1:nrow(X)) X[i, RESULT := checkType(X[i,])]

So while the above works, it is obviously not the best way to do things with data.table. However, I tried lapply and apply, and neither worked:

    X[, RESULT3 := lapply(.SD, checkType)]

    Error in x$FIRST : $ operator is invalid for atomic vectors
    nchar(x$FIRST)
    FUN(X[[1L]], ...)
    lapply(.SD, checkType)
    eval(expr, envir, enclos)
    eval(jsub, SDenv, parent.frame())
    `[.data.table`(X, , `:=`(RESULT3, lapply(.SD, checkType)))
    X[, `:=`(RESULT3, lapply(.SD, checkType))]

I get the same result with apply(.SD, 1, checkType). Is what I am trying to do possible by applying a function?
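For context on the error above: lapply(.SD, f) calls f once per column, so x is an atomic vector and x$FIRST fails; apply(.SD, 1, f) passes each row as a plain character vector, which fails the same way. A minimal sketch of one alternative is to write the function so that it takes the two columns as vectors and returns one label per row. The names classify and classify_one are hypothetical, and the sketch assumes a data.table version that provides fcase (1.13.0 or later); the rule order follows the nested ifelse in the question.

```r
library(data.table)

X <- data.table(POSITION = c(1, 4, 5, 9, 24, 36, 42, 56),
                FIRST    = c("A", "BB", "AA", "B", "AAA", "B", "A,B", "A"),
                SECOND   = c("B", "AA", "B", "AAA", "BBB", "AB,ABB", "B,A", "B"))

# Hypothetical vectorised classifier: whole columns in, one label per row out.
# fcase() returns the value of the first condition that is TRUE for each row.
classify <- function(first, second) {
  nf <- nchar(first)
  ns <- nchar(second)
  mixed <- grepl(",", first, fixed = TRUE) | grepl(",", second, fixed = TRUE)
  fcase(
    nf == 1L & ns == 1L, "SAME",
    mixed,               "MIXED",
    nf > ns,             "LESS",
    nf < ns,             "MORE",
    default = "BLOCK"
  )
}
X[, RESULT := classify(FIRST, SECOND)]

# Hypothetical scalar fallback for logic that cannot be vectorised:
# mapply() walks the two columns in parallel, one pair of values at a time.
classify_one <- function(f, s) {
  if (nchar(f) == 1L && nchar(s) == 1L) return("SAME")
  if (grepl(",", f, fixed = TRUE) || grepl(",", s, fixed = TRUE)) return("MIXED")
  if (nchar(f) > nchar(s)) return("LESS")
  if (nchar(f) < nchar(s)) return("MORE")
  "BLOCK"
}
X[, RESULT_ROWWISE := mapply(classify_one, FIRST, SECOND, USE.NAMES = FALSE)]
```

The vectorised form is usually the one to prefer on millions of rows, since the mapply variant still makes one R-level function call per row.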
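So, the answers from @Frank and @jlhoward both give the desired result and are much faster than my initial attempts. However, building on those answers, this approach (createResult1) is about 4 times faster on a file with 1,000,000 rows: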

    createResult1 <- function(X) {
      # Helper columns: character counts and comma flags
      X[,`:=`( cf=nchar(FIRST),
               cs=nchar(SECOND),
               mf=grepl(',',FIRST),
               ms=grepl(',',SECOND)
      )]

      # Later assignments overwrite earlier ones, so MIXED goes last
      X[cf==1&cs==1, RESULT:="SAME"]
      X[cf > cs, RESULT:="LESS"]
      X[cf < cs, RESULT:="MORE"]
      X[cf==cs & cs>1, RESULT:="BLOCK"]
      X[(mf)|(ms), RESULT:="MIXED"]

      X[,c('cf','cs','mf','ms'):=NULL]

      return(X)
    }

    createResult2 <- function(X) { #@Frank
      X[,`:=`( cf=nchar(FIRST),
               cs=nchar(SECOND),
               mf=grepl(',',FIRST),
               ms=grepl(',',SECOND)
      )][,RESULT:=ifelse(cf==1&cs==1,"SAME",
                  ifelse(mf | ms,"MIXED",
                  ifelse(cf > cs,"LESS",
                  ifelse(cf < cs,"MORE","BLOCK"))))
      ][
        ,c('cf','cs','mf','ms'):=NULL
      ]
      return(X)
    }

    createResult3 <- function(X) { #@jlhoward
      X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
      X$nf <- nchar(X$FIRST)
      X$ns <- nchar(X$SECOND)
      X$RESULT <- ""

      setkey(X,nf,ns)
      X[J(1,1),RESULT:="SAME"]
      X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <-"BLOCK"
      X[!mixed & nf > ns]$RESULT <-"LESS"
      X[!mixed & nf < ns]$RESULT <-"MORE"
      X[(mixed)]$RESULT <-"MIXED"
      X[,c('nf','ns','mixed'):=NULL]
      setkey(X,POSITION)
      return(X)
    }

Create the same data table as above, but with the rows repeated 1,000,000 times:

    X <- as.data.table(list(POSITION=rep(c(1,4,5,9,24,36,42,56),1000000),
                            FIRST=rep(c("A","BB","AA","B","AAA","B","A,B","A"),1000000),
                            SECOND=rep(c("B","AA","B","AAA","BBB","AB,ABB","B,A","B"),1000000)))
    # Independent copies, one per function being timed
    Y <- copy(X)
    Z <- copy(X)

The results are as follows:

    > system.time(X <- createResult1(X))
       user  system elapsed
       4.06    0.05    4.12
    > system.time(Y <- createResult2(Y))
       user  system elapsed
      18.53    0.36   18.94
    > system.time(Z <- createResult2(Z))
       user  system elapsed
      18.63    0.29   18.97
    > identical(X,Y)
    [1] TRUE
    > identical(X,Z)
    [1] TRUE

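A note on why the timing setup uses copy(): the functions above add and remove columns with `:=`, which modifies a data.table by reference. The following minimal sketch, with hypothetical tables A, B, and C, illustrates the difference between plain assignment and copy(); without the copies, each timed function would be working on the same underlying table.

```r
library(data.table)

A <- data.table(x = 1:3)
B <- A        # plain assignment: B refers to the same table as A
C <- copy(A)  # copy(): C is an independent table

A[, y := x * 2L]  # adds column y to A by reference

"y" %in% names(B)  # B sees the new column
"y" %in% names(C)  # C does not
```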
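Note that the data table produced by your code (the first statement below, pasted from your snippet above) is different from the one shown in your "desired result" box.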
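Nevertheless, this may actually be faster, and it is definitely easier to understand. It produces results that I believe are consistent with your rules.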

    X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
                            FIRST=c("A","BB","AA","B","AAA","B","A,B","A"),
                            SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A","B")))

    # Helper columns: comma flag and character counts
    X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
    X$nf <- nchar(X$FIRST)
    X$ns <- nchar(X$SECOND)
    X$RESULT <- ""

    # Key on the counts so the (1,1) rows can be selected with a keyed join
    setkey(X,nf,ns)
    X[J(1,1),RESULT:="SAME"]
    X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <-"BLOCK"
    X[!mixed & nf > ns]$RESULT <-"LESS"
    X[!mixed & nf < ns]$RESULT <-"MORE"
    X[(mixed)]$RESULT <-"MIXED"
    # Restore the original row order
    setkey(X,POSITION)

Your categories are not mutually exclusive, so I assume the rules apply in order (e.g.
Also, I think your definitions of MORE and LESS are identical.