R .csv not read in correctly because there are double quotes in the text

I have a .csv file in which every field is text. However, some of the text fields contain unescaped double quote characters, for example:

"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote"every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"

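Lines 1 and 2 are fine, but line 3 is not read in correctly. At the moment I am going through the file by hand in Notepad++ to try to remove such quotes. Ideally I would like R to handle this, but I suspect that unmatched, unescaped double quotes make that an unreasonable expectation.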

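In Notepad++ I have been trying to build a regular expression that identifies a double quote that is neither preceded nor followed by a comma. The logic is that a valid double quote appears at the start or end of a field and is therefore marked by an adjacent comma. This might help identify most of my cases, which I could then deal with.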
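The rule described above (a double quote is suspect when it is not next to a comma and not at the start or end of a line) can be sketched in R itself with a Perl-style pattern. This is only an illustration under stated assumptions: the sample lines are copied from the example data, the names `suspect_quote` and `flagged` are hypothetical, and records that span several physical lines are not treated specially.

```r
# Sample lines mirroring the example data above
lines <- c(
  '"ID","Text","Optional text","Date"',
  '"1","Today is going to be a good day","","2013-02-03"',
  '"2","And I am inspired by the quote"every dog must have it\'s day"","Hi","2013-01-01"',
  '"3","Did not the bard say All the World\'s a stage" this quote is so true","Terrible","2013-05-05"'
)

# A double quote is treated as suspect when it is
#   - not at the start of the line and not preceded by a comma, and
#   - not at the end of the line and not followed by a comma.
# Lookarounds need perl = TRUE in base R.
suspect_quote <- '(?<!^)(?<!,)"(?!,)(?!$)'

# TRUE for every line that contains at least one suspect quote
flagged <- grepl(suspect_quote, lines, perl = TRUE)

# Indices of the lines that need attention
which(flagged)
```

On a real file, `lines` could come from `readLines()`, and `mean(flagged)` would give a rough estimate of the share of affected lines.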

Suffice it to say that I have about 3.4 million records, of which roughly 0.1% appear to be problematic.

Edit:
fread from data.table has been suggested as an alternative, but I have had even less success with fread:

1: In fread(paste(infilename,"1",".csv", sep ="")) :
  Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line

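None of the suggested options worked. I think this is because the "Text" field can also contain CRLF characters. read.csv seems to simply ignore these (good), whereas fread takes exception to them. I am sorry that I cannot supply the actual text, but here is some more comprehensive test data that has both unmatched double quotes (a problem for read.csv) and CRLFs (a problem for fread).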

"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote"every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"

Help with a regular expression for use in Notepad++ would be much appreciated.


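Perhaps one option is to use a conditional replacement in Notepad++.

You can first find every string that starts with a double quote preceded by either a comma or the start of the line.

Then match up to the next double quote that is followed by a comma or the end of the line. Those are the fields that are quoted correctly, so for the other part of the alternation, the part to capture and replace, match a double quote that is not adjacent to a comma.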

Find what:

(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)

Replace with:

A conditional replacement: if group 1 matched, replace with nothing; otherwise replace with the whole match.

(?{1}:$0)

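Explanation

  • (?:^|,) matches a comma or asserts the start of the line
  • "[^"\n]*" matches a double quote, then any characters other than a double quote or a newline, then the closing double quote
  • (?=$|,) asserts that what follows is the end of the line or a comma
  • | or
  • (?<!,)(")(?!,) captures a double quote in group 1 while asserting that it is neither preceded nor followed by a comma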
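For anyone who would rather stay in R, the same idea can be approximated with `gsub()`. Base R has no conditional replacement, so this sketch swaps in a different technique: the PCRE backtracking verbs `(*SKIP)(*F)` discard every correctly quoted field, and only the remaining stray quotes are deleted. The sample lines and the names `pattern` and `cleaned` are assumptions made for illustration, and each physical line is processed on its own.

```r
# Sample lines mirroring the question's data
lines <- c(
  '"ID","Text","Optional text","Date"',
  '"1","Today is going to be a good day","","2013-02-03"',
  '"2","And I am inspired by the quote"every dog must have it\'s day"","Hi","2013-01-01"',
  '"3","Did not the bard say All the World\'s a stage" this quote is so true","Terrible","2013-05-05"'
)

# First alternative: a well-formed quoted field; (*SKIP)(*F) makes the
# engine give up on it and resume searching after the field.
# Second alternative: a double quote with no comma on either side,
# which is the only thing that actually gets replaced.
pattern <- '(?:^|,)"[^"\\n]*"(?=$|,)(*SKIP)(*F)|(?<!,)"(?!,)'

# Delete the stray quotes; backtracking verbs require perl = TRUE
cleaned <- gsub(pattern, "", lines, perl = TRUE)

# Well-formed lines are unchanged; the stray quotes are gone from the others
print(cleaned)
```

For a file on disk, `lines` could be read with `readLines()` and the result written back with `writeLines()` before calling `read.csv` or `fread`. Because the stray quotes are simply deleted, the quotation inside the text is no longer marked; replacing them with a single quote instead would be a small change to the second argument of `gsub()`.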


It seems to work fine with data.table::fread

fread("E:/temp/test.txt")
#   ID                                                                 Text Optional text    "Date"
#1:  1                                      Today is going to be a good day               2013-02-03
#2:  2        And I am inspired by the quote"every dog must have it's day"            Hi 2013-01-01
#3:  3 Did not the bard say"All the World's a stage" this quote is so true      Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
#  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.