Python: errors=surrogateescape vs errors=replace


I am trying to open a file like this:

with open("myfile.txt", encoding="utf-8") as f:

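But myfile.txt comes from the users of my application, and about 90% of the time the file arrives in a non-UTF-8 encoding. That makes the application exit, because it cannot read the file. The error looks like: 'utf-8' codec can't decode byte 0x9c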

I searched on Google and found some Stack Overflow answers saying I can open my file like this:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:

But other answers say to use:

with open("myfile.txt", encoding="utf-8", errors="replace") as f:

So what is the difference between errors="replace" and errors="surrogateescape"? Which one will fix the non-UTF-8 bytes in the file?

Related discussion

The documentation says:

'replace': Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding. Implemented in replace_errors()...
'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)

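This means that with replace, every offending byte is replaced by the same U+FFFD replacement character, whereas with surrogateescape each byte is replaced by a different value. For example, the byte b'\xe9' becomes '\udce9' and the byte b'\xe8' becomes '\udce8'.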
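To make the difference concrete, here is a minimal sketch that decodes the same bytes with both handlers. The sample bytes are made up for illustration (cp1252/latin-1 accented letters, which are not valid UTF-8 on their own):

```python
# Illustrative sample: "café crème" encoded as latin-1/cp1252, not UTF-8
raw = b"caf\xe9 cr\xe8me"

# 'replace': both bad bytes collapse into the same U+FFFD character
replaced = raw.decode("utf-8", errors="replace")

# 'surrogateescape': each bad byte 0xNN becomes the lone surrogate U+DCNN
escaped = raw.decode("utf-8", errors="surrogateescape")

print(ascii(replaced))  # 'caf\ufffd cr\ufffdme'
print(ascii(escaped))   # 'caf\udce9 cr\udce8me'
```

Note that after 'replace' there is no way to tell 0xe9 and 0xe8 apart any more, while the surrogates still carry the low byte.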

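So with replace you get a valid Unicode string but lose the original content of the file, whereas with surrogateescape you can still tell what the original bytes were (and can even rebuild them exactly with .encode(errors='surrogateescape')), but the resulting Unicode string is not well-formed, because it contains lone surrogate code points.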
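A short sketch of that trade-off, again with made-up sample bytes: the surrogateescape string round-trips to the original bytes but cannot be encoded strictly, while the replace string encodes fine but no longer contains the original byte.

```python
raw = b"caf\xe9"  # illustrative input, not valid UTF-8

escaped = raw.decode("utf-8", errors="surrogateescape")
# Encoding with the same handler gives back the exact original bytes
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

replaced = raw.decode("utf-8", errors="replace")
# Encodes without error, but 0xe9 is gone: U+FFFD is written as EF BF BD
lossy = replaced.encode("utf-8")

print(roundtrip == raw, strict_ok, lossy)
```

In practice this means a surrogateescape string can also fail later, for example when printing it or passing it to code that encodes strictly.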

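Long story short: if the original offending bytes do not matter and you just want to get rid of the error, replace is a good choice; if you need to keep them for later processing, surrogateescape is the way to go.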
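As one possible use of "keeping the bytes for later", here is a sketch that edits a file as text and writes it back with the undecodable bytes untouched. The file names, the sample content, and the header rename are all hypothetical; the only thing that matters is that both open() calls use errors="surrogateescape".

```python
import os
import tempfile

# Hypothetical CSV whose accented letters are cp1252 bytes, not UTF-8
raw = b"name,city\nRen\xe9,Li\xe8ge\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "wb") as f:
        f.write(raw)

    # newline="" keeps line endings as they are on every platform
    with open(src, encoding="utf-8", errors="surrogateescape", newline="") as fin, \
         open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as fout:
        for line in fin:
            # Edit the ASCII part; the escaped bytes pass through unchanged
            fout.write(line.replace("name", "first_name"))

    with open(dst, "rb") as f:
        result = f.read()

print(result)
```

With errors="replace" on the reading side, the same script would write U+FFFD characters into the output instead of the original bytes.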

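surrogateescape has a very nice feature when your file mainly contains ASCII characters plus a few (accented) non-ASCII characters, and some of your users occasionally edit the file with a non-UTF-8 editor (or fail to declare the UTF-8 encoding). In that case you end up with a file that contains mostly UTF-8 data and some bytes in a different encoding, often CP1252 for Windows users of non-English Western European languages (such as French, Portuguese, or Spanish). It is then possible to build a translation table that maps the surrogate characters to their equivalents in the cp1252 charset: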

# first map all surrogates in the range 0xdc80-0xdcff to code points 0x80-0xff
# (str.join needs strings, so each code point goes through chr())
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
# then decode all bytes in the range 0x80-0xff as cp1252, and map the undecoded ones
# to latin1 (using the previous translation table)
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
# finally use the string above to build a translation table mapping surrogates in the
# range 0xdc80-0xdcff to their cp1252 equivalent, or latin1 if the byte is undefined in cp1252
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)

You can then decode a file that contains a mojibake mix of utf8 and cp1252:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
    for line in f:                  # ok utf8 has been decoded here
        line = line.translate(tab)  # and cp1252 bytes are recovered here
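For anyone who wants to try this end to end, here is a self-contained sketch under a few assumptions: the sample bytes, the temporary file name, and the helper name build_cp1252_table are made up for the demo. The file mixes a proper UTF-8 "é" with two raw cp1252 bytes (0x9c for "œ", 0x80 for "€").

```python
import os
import tempfile

def build_cp1252_table():
    """Hypothetical helper: map surrogates U+DC80-U+DCFF to cp1252 characters."""
    surrogates = ''.join(chr(i) for i in range(0xdc80, 0xdd00))
    latin1 = ''.join(chr(i) for i in range(0x80, 0x100))
    tab0 = str.maketrans(surrogates, latin1)
    # Bytes undefined in cp1252 come out as surrogates and fall back to latin1
    decoded = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape')
    return str.maketrans(surrogates, decoded.translate(tab0))

tab = build_cp1252_table()

# "café" in real UTF-8, then "œuvre" and "€" as raw cp1252 bytes
mixed = "café ".encode("utf-8") + b"\x9cuvre \x80 5\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    with open(path, "wb") as f:
        f.write(mixed)

    lines = []
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        for line in f:
            lines.append(line.translate(tab))

print(lines)  # ['café œuvre € 5\n']
```

One caveat worth keeping in mind: a run of cp1252 bytes that happens to form a valid UTF-8 sequence is decoded as UTF-8 and never reaches the table, so this works best when the stray bytes are isolated accented characters.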

I have successfully used this method several times to recover csv files that were produced as utf8 and then edited with Excel on Windows machines.

The same method can be used for other charsets derived from ascii.
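As a sketch of that generalization, the table construction can be wrapped in a function that takes the fallback codec name as a parameter. The function name build_fallback_table is an assumption for illustration, and it only makes sense for single-byte, ASCII-compatible codecs:

```python
def build_fallback_table(encoding):
    """Hypothetical helper: map surrogates U+DC80-U+DCFF to `encoding` characters.

    Assumes `encoding` is a single-byte, ASCII-compatible codec
    (for example 'cp1252', 'cp1251', 'iso-8859-15').
    """
    surrogates = ''.join(chr(i) for i in range(0xdc80, 0xdd00))
    latin1 = ''.join(chr(i) for i in range(0x80, 0x100))
    # Bytes the codec leaves undefined fall back to their latin1 code point
    to_latin1 = str.maketrans(surrogates, latin1)
    decoded = bytes(range(0x80, 0x100)).decode(encoding, errors='surrogateescape')
    return str.maketrans(surrogates, decoded.translate(to_latin1))

latin9 = build_fallback_table('iso-8859-15')
cyrillic = build_fallback_table('cp1251')

# 0xa4 is the euro sign in ISO-8859-15; 0xe6 is the letter zhe in cp1251
euro = b"10 \xa4".decode("utf-8", errors="surrogateescape").translate(latin9)
zhe = b"\xe6 ".decode("utf-8", errors="surrogateescape").translate(cyrillic)
print(euro, zhe)
```

You still have to know (or guess) which legacy encoding your users' editors produce; the table cannot detect that for you.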