u'\ufeff' in Python string

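I'm getting an error with the following exception message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

I'm not sure what u'\ufeff' is; it shows up when I'm scraping web pages. How can I remedy this? The .replace() string method doesn't work on it.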

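I ran into this on Python 3 and found this question (and solution).
When opening a file, Python 3 supports the encoding keyword to handle the encoding automatically.

Without it, the BOM is included in the read result (this assumes the platform's default encoding is UTF-8):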

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

With the correct encoding given, the BOM is omitted from the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

Just my 2 cents.
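
To try the two reads above without hunting for a BOM-prefixed file, the following sketch writes one to a temporary location first. The helper name `read_text` and the temporary file are made up for illustration; only `open(..., encoding=...)` comes from the answer above.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file using an explicit encoding (hypothetical helper)."""
    with open(path, mode='r', encoding=encoding) as f:
        return f.read()


# Create a throwaway file whose bytes start with the UTF-8 BOM (EF BB BF).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbftest')

with_bom = read_text(path, 'utf-8')         # plain utf-8 keeps the BOM as '\ufeff'
without_bom = read_text(path, 'utf-8-sig')  # utf-8-sig strips a leading BOM
os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

Passing the encoding explicitly in both calls keeps the result independent of the platform's default encoding.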

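The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell big-endian and little-endian UTF-16 encoding apart. If you decode the web page using the right codec, Python will remove it for you. Examples: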

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')

Note that EF BB BF is the UTF-8-encoded BOM. UTF-8 does not require it; it serves only as a signature (usually on Windows).

Output:

utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'

utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC' # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.

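Note that the utf-16 codec expects a BOM to be present; without one, nothing in the data tells Python whether it is big- or little-endian (CPython then falls back to the platform's native byte order).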
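
The BOM byte sequences shown above are also available as constants in the standard `codecs` module, so scraped bytes can be inspected before a codec is picked. This is only a sketch: the function name `decode_with_bom` and the fallback to plain UTF-8 are assumptions, and UTF-32 BOMs are deliberately left out to keep it short.

```python
import codecs

# (BOM bytes, codec that consumes that BOM while decoding).
# Assumption: UTF-32 is not handled here; its little-endian BOM starts with
# the same two bytes as the UTF-16 one and would need to be checked first.
_BOM_CODECS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def decode_with_bom(data, default='utf-8'):
    """Decode bytes, choosing a BOM-aware codec when a known BOM is present."""
    for bom, codec in _BOM_CODECS:
        if data.startswith(bom):
            return data.decode(codec)
    # No BOM found: fall back to the caller's guess.
    return data.decode(default)


print(repr(decode_with_bom(b'\xef\xbb\xbfABC')))
print(repr(decode_with_bom(b'\xff\xfeA\x00B\x00C\x00')))
print(repr(decode_with_bom(b'\xfe\xff\x00A\x00B\x00C')))
```

For real web pages, the charset declared in the HTTP headers or in the page itself is usually a better first choice than sniffing; the BOM check is a fallback.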

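The content you're scraping is encoded in unicode rather than ascii text, and you're getting a character that doesn't convert to ascii. The right "translation" depends on what the original web page thought it was. Python's unicode page gives the background on how it works.

Are you trying to print the result or write it to a file? The error suggests that it is writing the data that causes the problem, not reading it. This question is a good place to look for fixes.

That character is the BOM, or "byte order mark". It usually arrives as the first few bytes of a file and tells you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. However, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.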
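
As a rough illustration of both suggestions (dropping the character, and not encoding to ASCII), here is a Python 3 sketch. The sample string is invented; note also that `str.replace()` returns a new string, so its result has to be assigned, which is one possible reason a bare `.replace()` call appears to do nothing.

```python
text = '\ufeffHello'  # decoded text that still carries the BOM (made-up sample)

# Encoding to ASCII fails because U+FEFF is outside the ASCII range.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii failed:', exc.reason)

# Strings are immutable: keep the value returned by lstrip()/replace().
cleaned = text.lstrip('\ufeff')    # or: text.replace('\ufeff', '')
encoded = cleaned.encode('utf-8')  # UTF-8 can represent any Unicode text

print(repr(cleaned))
print(encoded)
```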

This builds on Mark Tolonen's answer. The string contains the word 'test' in different languages, separated by '|', so you can see the difference.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print('utf-8 %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16 %r' % e16)
print('utf-16le %r' % e16le)
print('utf-16be %r' % e16be)
print()
print('utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8'))
print('utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8') # encode without BOM
>>> e8s = u.encode('utf-8-sig') # encode with BOM
>>> e16 = u.encode('utf-16') # encode with BOM
>>> e16le = u.encode('utf-16le') # encode without BOM
>>> e16be = u.encode('utf-16be') # encode without BOM
>>> print('utf-8 %r' % e8)
utf-8 b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16 %r' % e16)
utf-16 b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le %r' % e16le)
utf-16le b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be %r' % e16be)
utf-16be b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8'))
utf-8 w/ BOM decoded with utf-8 '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8 w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

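It is worth knowing that, for BOM-prefixed data like this, only utf-8-sig and utf-16 give back the original string after both encode and decode; decoding with utf-8 or utf-16le leaves '\ufeff' at the front.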

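This problem basically arises when you save your Python code in UTF-8 or UTF-16 encoding, because the editor saving the file may automatically add a special character at the beginning of the code (which text editors do not display) to identify the encoding format. When you then try to execute the code, you can get a syntax error on line 1, i.e. at the very start of the code, because the Python compiler does not expect that character there.
When you view the file's contents with the read() function, you can see '\ufeff' at the beginning of the returned text.
The simplest solution is to change the encoding back to ASCII (for this, you can copy your code into Notepad and save it). Remember to choose the ASCII encoding!
Hope this helps.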
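
If re-saving in an editor is inconvenient, one alternative (not part of the answer above) is to keep the file as UTF-8 and drop only the leading BOM bytes. The sketch below assumes a UTF-8 BOM; the function name `strip_utf8_bom` and the demo file are hypothetical.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite the file without a leading UTF-8 BOM; return True if one was removed."""
    with open(path, 'rb') as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do
    with open(path, 'wb') as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


# Demo on a throwaway file that starts with EF BB BF.
fd, path = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b"print('hello')\n")

removed = strip_utf8_bom(path)
with open(path, 'rb') as f:
    remaining = f.read()
os.remove(path)

print(removed, remaining)
```

Working on raw bytes avoids re-encoding the rest of the file, so any non-ASCII text in it stays untouched.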