Un-intersect values in R

I have two datasets, each with at least 420,500 observations, for example:

dataset1 <- data.frame(col1=c("microsoft","apple","vmware","delta","microsoft"),
                     col2=paste0(c("a","b","c",4,"asd"),".exe"),
                     col3=rnorm(5))

dataset2 <- data.frame(col1=c("apple","cisco","proactive","dtex","microsoft"),
                     col2=paste0(c("a","b","c",4,"asd"),".exe"),
                     col3=rnorm(5))
> dataset1
       col1    col2 col3
1 microsoft   a.exe    2
2     apple   b.exe    1
3    vmware   c.exe    3
4     delta   4.exe    4
5 microsoft asd.exe    5
> dataset2
       col1    col2 col3
1     apple   a.exe    3
2     cisco   b.exe    4
3    vmware   d.exe    1
4     delta   5.exe    5
5 microsoft asd.exe    2

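I want to print all observations in dataset1 that do not intersect with any observation in dataset2 (comparing both col1 and col2 in each). In this case, that would print everything except the last observation.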


You can use anti_join from dplyr:

 library(dplyr)
 anti_join(df1, df2, by = c('col1', 'col2'))
 #      col1  col2       col3
 #1     delta 4.exe -0.5836272
 #2    vmware c.exe  0.4196231
 #3     apple b.exe  0.5365853
 #4 microsoft a.exe -0.5458808

Data

 set.seed(24)
 df1 <- data.frame(col1 = c('microsoft', 'apple', 'vmware', 'delta',
 'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
    col3=rnorm(5), stringsAsFactors=FALSE)
 set.seed(22)
 df2 <- data.frame(col1 = c( 'apple', 'cisco', 'proactive', 'dtex',
 'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
  col3=rnorm(5), stringsAsFactors=FALSE)
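
For comparison, the same anti-join can be sketched in base R without dplyr: build one composite key per row from col1 and col2, then keep the rows of df1 whose key does not occur in df2. This is only a minimal sketch. The helper name make_key and the "\r" separator are assumptions made for illustration (the separator is assumed never to appear in the data), and col3 is left out because it plays no part in the comparison.

```r
# Same key columns as the example data above; col3 is omitted on purpose
df1 <- data.frame(col1 = c("microsoft", "apple", "vmware", "delta", "microsoft"),
                  col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("apple", "cisco", "proactive", "dtex", "microsoft"),
                  col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: collapse the key columns into one string per row.
# "\r" is assumed not to occur in the data, so distinct pairs stay distinct.
make_key <- function(d) paste(d$col1, d$col2, sep = "\r")

# Keep the rows of df1 whose (col1, col2) pair is absent from df2
res <- df1[!(make_key(df1) %in% make_key(df2)), ]
res
```

With this data, only the last row of df1 (microsoft, asd.exe) has a matching pair in df2, so res holds the first four rows in their original order.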

A data.table solution, inspired by this:

library(data.table) #1.9.5+
setDT(dataset1,key=c("col1","col2"))
setDT(dataset2,key=key(dataset1))
dataset1[!dataset2]

        col1  col2 col3
1:     apple b.exe    1
2:     delta 4.exe    4
3: microsoft a.exe    2
4:    vmware c.exe    3

You can also try it without setting any keys:

library(data.table) #1.9.5+
setDT(dataset1); setDT(dataset2)
dataset1[!dataset2,on=c("col1","col2")]
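
One practical difference between the two data.table variants is row order. Setting a key sorts the table by the key columns, while the on= form leaves the rows in their original order. The sketch below illustrates this under some assumptions: it uses the printed values of dataset1 and dataset2 from the question (col3 omitted), the names dt1, dt2, keyed and unkeyed are made up for the example, and it needs a data.table version that supports on=.

```r
library(data.table)  # assumes a version that supports `on=` (the answer says 1.9.5+)

# Key columns as printed in the question; col3 is omitted
dt1 <- data.table(col1 = c("microsoft", "apple", "vmware", "delta", "microsoft"),
                  col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"))
dt2 <- data.table(col1 = c("apple", "cisco", "vmware", "delta", "microsoft"),
                  col2 = c("a.exe", "b.exe", "d.exe", "5.exe", "asd.exe"))

# Ad hoc anti-join: the result keeps the row order of dt1
unkeyed <- dt1[!dt2, on = c("col1", "col2")]

# Keyed anti-join: setkey() sorts each copy by col1, then col2
keyed1 <- setkey(copy(dt1), col1, col2)
keyed2 <- setkey(copy(dt2), col1, col2)
keyed <- keyed1[!keyed2]

unkeyed$col1  # order as in dt1
keyed$col1    # sorted by the key
```

Both results contain the same four rows. Only the ordering differs, which matters if later steps rely on the original row order.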