UTF-8: How to decode a Unicode string with hex characters in Python

I extracted a string from a web-crawling script, shown below:

    u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'

I want to decode u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91' as UTF-8.
Using http://ddecode.com/hexdecoder/, I can see that the result is '【中字】'.

I tried the following, but it failed.

    msg = u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'
    result = msg.decode('utf8')

Error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11: ordinal not in range(128)

How do I decode the string correctly?

Thanks for your help.

Related discussion

The problem with

    msg = u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'
    result = msg.decode('utf8')

is that you are trying to decode Unicode, which doesn't really make sense. You can encode Unicode into some encoding, or you can decode a byte string into Unicode.
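To make the two directions concrete, here is a minimal sketch in Python 3, where the text type is str and the byte type is bytes. The variable names are only illustrative.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "\u3010\u4e2d\u5b57\u3011"      # the text '【中字】' as a str
data = text.encode("utf-8")            # encode: str -> bytes

# The UTF-8 form is exactly the twelve byte values from the question.
assert data == b"\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91"

round_trip = data.decode("utf-8")      # decode: bytes -> str
assert round_trip == text
```

Encoding always goes from text to bytes, and decoding always goes from bytes to text. The string in the question has the byte values stored as text, which is why neither direction applies to it directly.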

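When you do

    msg.decode('utf8')

Python 2 sees that msg is Unicode. It knows it can't decode Unicode, so it "helpfully" assumes you want to encode msg with the default ASCII codec first, so that the result of that conversion can then be decoded to Unicode with the UTF-8 codec. That implicit ASCII encoding step is what raises the UnicodeEncodeError in the traceback. Python 3 behaves much more sensibly: that code simply fails with

    AttributeError: 'str' object has no attribute 'decode'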
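The hidden step can be spelled out explicitly. The following is a rough Python 3 sketch of what Python 2 effectively does; python2_style_decode is a hypothetical helper written for illustration, not a real library function.

```python
msg = "\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91"  # 12 code points, all above 127

def python2_style_decode(text, encoding="utf-8"):
    """Hypothetical helper: mimic Python 2's unicode.decode() in Python 3."""
    # Step 1 (implicit in Python 2): encode with the default ASCII codec.
    raw = text.encode("ascii")
    # Step 2: decode the resulting bytes with the requested codec.
    return raw.decode(encoding)

try:
    python2_style_decode(msg)
    error_name = None
except UnicodeEncodeError as exc:
    # Step 1 fails before UTF-8 decoding is ever attempted.
    error_name = type(exc).__name__
    error_span = (exc.start, exc.end)  # half-open range of offending characters

# Python 3's str type has no decode method at all.
has_decode = hasattr(msg, "decode")
```

Here error_span covers all twelve characters, which matches the "position 0-11" in the traceback. With a pure ASCII string such as "abc" the helper succeeds, which is why this kind of bug often goes unnoticed in Python 2 until non-ASCII data shows up.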

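The technique given in kennytm's answer:

    msg.encode('latin1').decode('utf-8')

works because the Unicode code points below 256 correspond directly to the characters of the Latin-1 encoding (also known as ISO 8859-1). Encoding to Latin-1 therefore turns each code point back into the byte with the same value.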
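For Python 3 users, the same technique might look like the sketch below. The function name fix_mojibake is my own and purely illustrative.

```python
def fix_mojibake(text):
    """Hypothetical helper: recover text whose UTF-8 bytes were stored as code points."""
    # Latin-1 maps code points 0-255 to the same byte values, so this
    # recovers the original bytes without changing them.
    raw = text.encode("latin-1")
    # Now the bytes can be decoded with the encoding they were really in.
    return raw.decode("utf-8")

msg = "\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91"
fixed = fix_mojibake(msg)
print(fixed)
```

Note that this only works when every character in the string is below U+0100; otherwise the Latin-1 encoding step raises a UnicodeEncodeError.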

Here is some Python 2 code that illustrates this:

    for i in xrange(256):
        lat = chr(i)       # one-byte str
        uni = unichr(i)    # one-character unicode string
        assert lat == uni.encode('latin1')
        assert lat.decode('latin1') == uni

Here is the equivalent Python 3 code:

    for i in range(256):
        lat = bytes([i])   # one-byte bytes object
        uni = chr(i)       # one-character str
        assert lat == uni.encode('latin1')
        assert lat.decode('latin1') == uni

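You may find this article helpful: Pragmatic Unicode, written by SO veteran Ned Batchelder.

Unless you are forced to use Python 2, I strongly recommend that you switch to Python 3. It will make handling Unicode much less painful.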

Maybe you should fix your crawling script. A Unicode string should already contain u'【中字】' (u'\u3010\u4e2d\u5b57\u3011'), not the raw UTF-8 bytes.
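The question doesn't show how the crawler produced the string, so the following Python 3 sketch is only a guess at the cause: it assumes the raw page bytes were decoded with a one-byte codec such as Latin-1 instead of UTF-8. The name page_bytes is a stand-in for whatever the crawler actually reads.

```python
# Stand-in for the raw bytes a crawler would read from the network.
page_bytes = b"\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91"

# Assumed cause: decoding with the wrong codec reproduces the broken
# string from the question, one code point per byte.
broken = page_bytes.decode("latin-1")
assert broken == "\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91"

# Decoding once, with the page's real encoding, gives proper text.
title = page_bytes.decode("utf-8")
assert title == "\u3010\u4e2d\u5b57\u3011"
```

If that is what is happening, decoding the bytes as UTF-8 at the point where they are read avoids the need for any repair step later.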

To convert msg to the correct text, you first need to turn the wrong Unicode string back into a byte string (by encoding it as Latin-1), then decode that as UTF-8:

    >>> print msg.encode('latin1').decode('utf-8')
    【中字】