How to create unique rows in a data frame
I have a data frame with duplicate rows and need to create unique rows from it. I tried a few options, but they don't seem to work.

    l1 <- summarise(group_by(l, bowler, wickets), economyRate, d = unique(date))

This works for some rows, but it also gives the error "expecting a single value". The data frame `l` looks like this:

         bowler overs maidens  runs wickets economyRate       date opposition
         (fctr) (int)   (int) (dbl)   (dbl)       (dbl)     (date)      (chr)
    1  MA Starc     9       0    51       0        5.67 2010-10-20      India
    2  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
    3  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
    4  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
    5  MA Starc     9       0    27       4        3.00 2010-11-07  Sri Lanka
    6  MA Starc     6       0    33       2        5.50 2012-02-05      India
    7  MA Starc     6       0    33       2        5.50 2012-02-05      India
    8  MA Starc    10       0    50       2        5.00 2012-02-10  Sri Lanka
    9  MA Starc    10       0    50       2        5.00 2012-02-10  Sri Lanka
    10 MA Starc     8       0    49       0        6.12 2012-02-12      India

The date is unique and can be used to pick out the rows to keep. Please let me know how this can be done.
Data

    l <- read.table(text = "bowler overs maidens runs wickets economyRate date opposition
    1 MA_Starc 9 0 51 0 5.67 2010-10-20 India
    2 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
    3 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
    4 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
    5 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
    6 MA_Starc 6 0 33 2 5.50 2012-02-05 India
    7 MA_Starc 6 0 33 2 5.50 2012-02-05 India
    8 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
    9 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
    10 MA_Starc 8 0 49 0 6.12 2012-02-12 India")
Distinct

Use `dplyr::distinct` to remove the duplicate rows.

    ldistinct <- distinct(l)
    #     bowler overs maidens runs wickets economyRate       date
    # 1 MA_Starc     9       0   51       0        5.67 2010-10-20
    # 2 MA_Starc     9       0   27       4        3.00 2010-11-07
    # 3 MA_Starc     6       0   33       2        5.50 2012-02-05
    # 4 MA_Starc    10       0   50       2        5.00 2012-02-10
    # 5 MA_Starc     8       0   49       0        6.12 2012-02-12
    #   opposition
    # 1      India
    # 2  Sri-Lanka
    # 3      India
    # 4  Sri-Lanka
    # 5      India

    l2 <- summarise(group_by(ldistinct, bowler, wickets),
                    economyRate, d = unique(date))
    # Error: expecting a single value

But this is not enough: there are still several values of `date` for a single combination of `bowler` and `wickets`.
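To see why `summarise()` complains, it can help to count the distinct dates in each bowler/wickets group. The following is a minimal base R sketch rather than part of the original answer; it assumes a hand-typed, trimmed copy of the example data (four columns only) and swaps in `aggregate()` for the dplyr calls.

```r
# Trimmed copy of the example data (assumption: only the columns needed here)
l <- data.frame(
  bowler      = "MA_Starc",
  wickets     = c(0, 4, 4, 4, 4, 2, 2, 2, 2, 0),
  economyRate = c(5.67, 3, 3, 3, 3, 5.5, 5.5, 5, 5, 6.12),
  date        = c("2010-10-20", rep("2010-11-07", 4), rep("2012-02-05", 2),
                  rep("2012-02-10", 2), "2012-02-12"),
  stringsAsFactors = FALSE
)

# Number of distinct dates per (bowler, wickets) group
n_dates <- aggregate(date ~ bowler + wickets, data = l,
                     FUN = function(x) length(unique(x)))
n_dates
# wickets 0 and 2 each have 2 distinct dates, wickets 4 has 1
```

Any group with more than one distinct date cannot be reduced to a single `unique(date)` value, which is what the error message is pointing at.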
Collapse values together

By pasting the multiple values together, you can see that a single combination of bowler and wickets has several dates and several economy rates.

    l3 <- summarise(group_by(l, bowler, wickets),
                    economyRate = paste(unique(economyRate), collapse = ", "),
                    d = paste(unique(date), collapse = ", "))
    l3
    #     bowler wickets economyRate                      d
    #     (fctr)   (int)       (chr)                  (chr)
    # 1 MA_Starc       0  5.67, 6.12 2010-10-20, 2012-02-12
    # 2 MA_Starc       2      5.5, 5 2012-02-05, 2012-02-10
    # 3 MA_Starc       4           3             2010-11-07
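If I understand the OP's intent correctly, they are simply asking to remove the duplicate rows. So I would use

    unique(l1)

As the documentation for `unique` puts it:

unique returns a vector, data frame or array like x but with duplicate elements/rows removed.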
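As a small illustration of that behaviour on a data frame, here is a sketch in base R. It assumes a hand-typed, trimmed copy of the question's data (four columns only) and shows `unique()` next to the equivalent `duplicated()` subsetting.

```r
# Trimmed copy of the example data (assumption: only the columns needed here)
l <- data.frame(
  bowler      = "MA_Starc",
  wickets     = c(0, 4, 4, 4, 4, 2, 2, 2, 2, 0),
  economyRate = c(5.67, 3, 3, 3, 3, 5.5, 5.5, 5, 5, 6.12),
  date        = c("2010-10-20", rep("2010-11-07", 4), rep("2012-02-05", 2),
                  rep("2012-02-10", 2), "2012-02-12"),
  stringsAsFactors = FALSE
)

# unique() drops rows that repeat an earlier row in every column
deduped <- unique(l)
nrow(deduped)  # 5

# The same selection written with duplicated(): keep first occurrences only
same_rows <- l[!duplicated(l), ]
nrow(same_rows)  # 5
```

Both forms compare whole rows, so two rows that differ in any column are both kept.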
In the example dataset, each 'bowler', 'wickets' combination has more than one unique 'date', so the dates can be pasted together into a single string:

    l %>%
      group_by(bowler, wickets) %>%
      summarise(economyRate = mean(economyRate), d = toString(unique(date)))

Or create 'd' as a list column:

    l %>%
      group_by(bowler, wickets) %>%
      summarise(economyRate = mean(economyRate), d = list(unique(date)))
As for 'economyRate', I am guessing the OP needs its mean.

If we need to create the dates as a list column in the original dataset:

    l %>%
      group_by(bowler, wickets) %>%
      mutate(d = list(unique(date)))
Since the OP did not provide the expected output, the following could also be the desired result:

    l %>%
      group_by(bowler, wickets) %>%
      distinct(date)

Or, as @Frank mentioned:

    l %>%
      group_by(bowler, wickets, date) %>%
      slice(1L)
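One thing worth checking with the `distinct(date)` variant: since dplyr 0.5.0, `distinct()` keeps only the columns it is given (plus any grouping columns) unless `.keep_all = TRUE` is set. The sketch below is not from the original answers; it assumes a reasonably recent dplyr and a hand-typed, trimmed copy of the example data.

```r
library(dplyr)

# Trimmed copy of the example data (assumption: only the columns needed here)
l <- data.frame(
  bowler      = "MA_Starc",
  wickets     = c(0, 4, 4, 4, 4, 2, 2, 2, 2, 0),
  economyRate = c(5.67, 3, 3, 3, 3, 5.5, 5.5, 5, 5, 6.12),
  date        = c("2010-10-20", rep("2010-11-07", 4), rep("2012-02-05", 2),
                  rep("2012-02-10", 2), "2012-02-12"),
  stringsAsFactors = FALSE
)

# One row per bowler/wickets/date combination, keeping the other columns
# (the first row of each combination is the one retained)
res <- distinct(l, bowler, wickets, date, .keep_all = TRUE)
dim(res)  # 5 rows, 4 columns
```

This gives the same rows as the `slice(1L)` approach without leaving the result grouped.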
So, I took an unusual approach to splitting this up, but when the dates came in from the csv file I created, I still had them as a factor. You can easily convert the date column to a factor using

    l1$date <- as.factor(l1$date)

This makes the column a non-date column. You could also convert it to character; either works fine. This is what the structure looks like:

    str(l1)
    'data.frame': 10 obs. of 10 variables:
     $ bowler     : Factor w/ 2 levels "(fctr)","MA": 2 2 2 2 2 2 2 2 2 2
     $ overs      : Factor w/ 2 levels "(int)","Starc": 2 2 2 2 2 2 2 2 2 2
     $ maidens    : Factor w/ 5 levels "(int)","10","6",..: 5 5 5 5 5 3 3 2 2 4
     $ runs       : Factor w/ 2 levels "(dbl)","0": 2 2 2 2 2 2 2 2 2 2
     $ wickets    : Factor w/ 6 levels "(dbl)","27","33",..: 6 2 2 2 2 3 3 5 5 4
     $ economyRate: Factor w/ 4 levels "(dbl)","0","2",..: 2 4 4 4 4 3 3 3 3 2
     $ date       : Factor w/ 6 levels "(date)","3","5",..: 5 2 2 2 2 4 4 3 3 6
     $ opposition : Factor w/ 6 levels "(chr)","10/20/2010",..: 2 3 3 3 3 6 6 4 4 5
     $ X.1        : Factor w/ 3 levels "","India","Sri": 2 3 3 3 3 2 2 3 3 2
     $ X.2        : Factor w/ 2 levels "","Lanka": 1 2 2 2 2 1 1 2 2 1
After that, it is a matter of getting the subsetting syntax right, using the most concise query:

    l2 <- l1[!duplicated(l1$date), ]

This returns the 5 unique rows of data:

       bowler overs maidens runs wickets economyRate date opposition   X.1   X.2
    2      MA Starc       9    0      51           0 5.67 10/20/2010 India
    3      MA Starc       9    0      27           4    3  11/7/2010   Sri Lanka
    7      MA Starc       6    0      33           2  5.5   2/5/2012 India
    9      MA Starc      10    0      50           2    5  2/10/2012   Sri Lanka
    11     MA Starc       8    0      49           0 6.12  2/12/2012 India
The only thing you need to watch is where the comma goes in the subsetting expression.
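A short sketch of why the comma matters, assuming a hand-typed, trimmed copy of the question's data (the data frame `l` here is not the answerer's `l1`): inside `[ ]`, whatever comes before the comma selects rows, and leaving the part after the comma empty keeps every column. Without the comma, a data frame is indexed by column instead.

```r
# Trimmed copy of the example data (assumption: only the columns needed here)
l <- data.frame(
  bowler      = "MA_Starc",
  wickets     = c(0, 4, 4, 4, 4, 2, 2, 2, 2, 0),
  economyRate = c(5.67, 3, 3, 3, 3, 5.5, 5.5, 5, 5, 6.12),
  date        = c("2010-10-20", rep("2010-11-07", 4), rep("2012-02-05", 2),
                  rep("2012-02-10", 2), "2012-02-12"),
  stringsAsFactors = FALSE
)

# TRUE for the first occurrence of each date, FALSE for later repeats
keep <- !duplicated(l$date)

# Row condition before the comma, nothing after it: all columns are kept
rows <- l[keep, ]
dim(rows)  # 5 rows, 4 columns
```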
If you need a date or a character column instead, the column can be converted back afterwards.
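Such a conversion might look like the sketch below. The factor `f` is a hypothetical stand-in for a date column stored in the month/day/year style seen in the output above; the format string is an assumption that has to match however the dates were actually written.

```r
# Hypothetical factor column holding dates as month/day/year text
f <- factor(c("10/20/2010", "11/7/2010", "2/5/2012"))

# Factor -> character: returns the labels, not the internal integer codes
as_text <- as.character(f)

# Character -> Date, telling R how the text is laid out
as_date <- as.Date(as_text, format = "%m/%d/%Y")
format(as_date)  # "2010-10-20" "2010-11-07" "2012-02-05"
```

Going through `as.character()` first avoids accidentally working with the factor's integer codes.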
I hope this works for you!