errors=surrogateescape vs errors=replace
I am trying to open a file like this:

```python
with open("myfile.txt", encoding="utf-8") as f:
```

but I get a `UnicodeDecodeError` on some lines.
I have googled and found some Stack Overflow answers saying I can open my file like this:

```python
with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
```

but other answers say to use:

```python
with open("myfile.txt", encoding="utf-8", errors="replace") as f:
```

So what is the difference between them, and which one should I use?
The docs say:

> `'replace'`: Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding. Implemented in `replace_errors()`.
>
> `'surrogateescape'`: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the `'surrogateescape'` error handler is used when encoding the data. (See PEP 383 for more.)
This means that with `replace` you will get valid Unicode characters but lose the original content of the file, while with `surrogateescape` you can still know the original bytes (and can even rebuild them exactly with `.encode(errors='surrogateescape')`).

Long story short: if the original offending bytes do not matter and you just want to get rid of the error, `replace` is a good choice; if you need to keep them for later processing, use `surrogateescape`.
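To make the difference concrete, here is a minimal sketch (the sample bytes are made up: `0xe9` is a stray latin-1 "é" inside otherwise valid UTF-8):

```python
data = b"caf\xe9"  # 0xe9 is not a valid UTF-8 sequence by itself

# replace: the bad byte becomes U+FFFD and the original byte is lost
s_replace = data.decode("utf-8", errors="replace")
print(ascii(s_replace))   # 'caf\ufffd'

# surrogateescape: the bad byte becomes the lone surrogate U+DCE9
s_escape = data.decode("utf-8", errors="surrogateescape")
print(ascii(s_escape))    # 'caf\udce9'

# encoding with the same handler restores the exact original bytes
assert s_escape.encode("utf-8", errors="surrogateescape") == data
```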
`surrogateescape` has an additional nice feature when your file mainly contains ASCII characters plus a few (accented) non-ASCII ones that some other tool wrote in a different charset, often cp1252 on Windows. In that case you can build a translation table that maps the surrogate characters to their cp1252 equivalents:

```python
# first map all surrogates in the range 0xdc80-0xdcff to codes 0x80-0xff
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
# then decode all bytes in the range 0x80-0xff as cp1252, and map the undecoded ones
#  to latin1 (using previous transtable)
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
# finally use above string to build a transtable mapping surrogates in the range 0xdc80-0xdcff
#  to their cp1252 equivalent, or latin1 if byte has no value in cp1252 charset
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)
```
You can then decode a file containing a mojibake of utf8 and cp1252:

```python
with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
    for line in f:                  # ok, utf8 has been decoded here
        line = line.translate(tab)  # and cp1252 bytes are recovered here
```
I have successfully used that method several times to recover csv files that were produced as utf8 and had been edited with Excel on Windows machines.
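As a quick self-contained check of that recovery trick (rebuilding the same translation table, and decoding a made-up line that mixes valid UTF-8 with one stray cp1252 byte):

```python
# rebuild the surrogate -> cp1252/latin1 translation table from above
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)

# a made-up mixed line: valid UTF-8 plus one cp1252 byte (0x92 = right single quote)
raw = "déjà vu".encode("utf-8") + b", isn\x92t it?"
line = raw.decode("utf-8", errors="surrogateescape").translate(tab)
print(line)   # déjà vu, isn't it? (with a curly apostrophe)
```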
The same method could be used for other charsets derived from ASCII.
My problem was that my file contained lines with mixed encoding types.
My workaround was to drop the explicit encoding and open the file with:

```python
with open("myfile.txt", errors="replace") as f:
```
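A hedged sketch of that workaround (the file and its bytes are invented; note that omitting `encoding=` falls back to `locale.getpreferredencoding()`, so exactly where the replacement characters appear is platform-dependent):

```python
import os
import tempfile

# a made-up file whose single line mixes UTF-8 and cp1252 encodings
raw = "déjà".encode("utf-8") + b" \x92 " + "déjà".encode("cp1252") + b"\n"
fd, path = tempfile.mkstemp()
with open(fd, "wb") as f:
    f.write(raw)

# errors="replace" guarantees the read never raises UnicodeDecodeError,
# whatever the locale's default encoding turns out to be
with open(path, errors="replace") as f:
    text = f.read()

print(repr(text))  # undecodable bytes appear as U+FFFD
os.remove(path)
```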
If I find a way to detect each file's encoding type, I will add it as an edit.