Python: Converting from ISO-8859-1/latin1 to UTF-8
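I have a string that was decoded from Quoted-printable to ISO-8859-1 with the email module. This gives me strings like "\xC4pple", which correspond to "Äpple" (Apple in Swedish).
However, I can't convert those strings to UTF-8.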
    >>> apple = "\xC4pple"
    >>> apple
    '\xc4pple'
    >>> apple.encode("UTF-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
What should I do?
This is a common problem, so here is a relatively thorough illustration.
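For non-unicode strings (that is, those without the u prefix), decode to unicode first, then encode to the target encoding.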
First, here is a handy utility function that helps show how Python 2.7 str and unicode behave:

    >>> def tell_me_about(s): return (type(s), s)
A plain string

    >>> v = "\xC4pple"    # iso-8859-1 aka latin1 encoded string
    >>> tell_me_about(v)
    (<type 'str'>, '\xc4pple')
    >>> v
    '\xc4pple'            # representation in memory
    >>> print v
    ?pple                 # the raw byte 0xC4 is written to the terminal unchanged;
                          # on a UTF-8 terminal a lone 0xC4 is not a valid sequence,
                          # so it is printed as "?".
Decoding an iso8859-1 string - converting a plain string to unicode

    >>> uv = v.decode("iso-8859-1")
    >>> uv
    u'\xc4pple'           # decoding iso-8859-1 becomes unicode, in memory
    >>> tell_me_about(uv)
    (<type 'unicode'>, u'\xc4pple')
    >>> print v.decode("iso-8859-1")
    Äpple                 # convert unicode to the default character set
                          # (utf-8, based on sys.stdout.encoding)
    >>> v.decode('iso-8859-1') == u'\xc4pple'
    True                  # one could have just used a unicode representation
                          # from the start
A little more illustration - with "Ä"

    >>> u"Ä" == u"\xc4"
    True                  # the native unicode char and escaped versions are the same
    >>> "Ä" == u"\xc4"
    False                 # typed in a UTF-8 terminal, the byte string "Ä" is '\xc3\x84'
    >>> "Ä".decode('utf8') == u"\xc4"
    True                  # one can decode the string to get unicode
    >>> "Ä" == "\xc4"
    False                 # the native character and the escaped string are
                          # of course not equal ('\xc3\x84' != '\xc4').
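For comparison, here is a minimal Python 3 sketch of the same "Ä" checks. It assumes Python 3, where string literals are unicode text and byte strings carry a b prefix; the variable names are illustrative only.

```python
# Python 3 sketch: text (str) versus encoded bytes for the character "Ä".
char = "\u00c4"  # "Ä", written as an escape so the source file encoding does not matter

# The \u and \x escapes denote the same single code point.
same_code_point = char == "\xc4"

# The same character encodes to different byte sequences.
as_latin1 = char.encode("iso-8859-1")  # one byte
as_utf8 = char.encode("utf-8")         # two bytes

# Decoding each byte sequence with its own codec gives back the same text.
round_trip = as_latin1.decode("iso-8859-1") == as_utf8.decode("utf-8") == char

print(same_code_point, as_latin1, as_utf8, round_trip)
```

The point is the same as in the Python 2 session: the character is one thing, and its byte representation depends on the encoding chosen.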
Encoding to UTF

    >>> u8 = v.decode("iso-8859-1").encode("utf-8")
    >>> u8
    '\xc3\x84pple'        # convert iso-8859-1 to unicode to utf-8
    >>> tell_me_about(u8)
    (<type 'str'>, '\xc3\x84pple')
    >>> u16 = v.decode('iso-8859-1').encode('utf-16')
    >>> tell_me_about(u16)
    (<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')
    >>> tell_me_about(u8.decode('utf8'))
    (<type 'unicode'>, u'\xc4pple')
    >>> tell_me_about(u16.decode('utf16'))
    (<type 'unicode'>, u'\xc4pple')
Relationship between unicode and UTF and latin1

    >>> print u8
    Äpple                 # printing utf-8 - because of the encoding we now know
                          # how to print the characters
    >>> print u8.decode('utf-8')    # printing unicode
    Äpple
    >>> print u16         # printing 'bytes' of u16
    ???pple
    >>> print u16.decode('utf16')
    Äpple                 # printing unicode
    >>> v == u8
    False                 # v is a iso8859-1 string; u8 is a utf-8 string
    >>> v.decode('iso8859-1') == u8
    False                 # v.decode(...) returns unicode
    >>> u8.decode('utf-8') == v.decode('latin1') == u16.decode('utf-16')
    True                  # all decode to the same unicode memory representation
                          # (latin1 is iso-8859-1)
Unicode Exceptions

    >>> u8.encode('iso8859-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
    >>> u16.encode('iso8859-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
    >>> v.encode('iso8859-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
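One gets around these errors by first decoding from the specific encoding (latin-1, utf8, utf16) to unicode. Calling encode on a Python 2 str implicitly decodes it as ASCII first, which is why the failure is reported as a UnicodeDecodeError.
So perhaps one could draw the following principles and generalizations: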
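The implicit step behind these tracebacks can be imitated in Python 3, where it has to be written out by hand. The following is only a sketch: py2_style_encode is a hypothetical helper invented here to mimic what Python 2 does when encode is called on a str, not a real library function.

```python
# Python 3 sketch of the implicit step Python 2 performs in str.encode().
raw = b"\xc4pple"  # "Äpple" encoded as iso-8859-1


def py2_style_encode(data: bytes, target: str) -> bytes:
    """Hypothetical helper: decode as ASCII first, then encode, as Python 2 does."""
    return data.decode("ascii").encode(target)


try:
    py2_style_encode(raw, "utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc  # byte 0xc4 is outside the ASCII range

# Naming the real source encoding avoids the implicit ASCII step.
fixed = raw.decode("iso-8859-1").encode("utf-8")

print(type(failure).__name__, fixed)
```

The failing call never reaches the encode step; it stops at the ASCII decode, exactly as the error message says.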
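- type str is a set of bytes, which may be in one of a number of encodings such as Latin-1, UTF-8, and UTF-16
- type unicode is a sequence of Unicode code points that can be encoded into any number of encodings, most commonly UTF-8 and latin-1 (iso8859-1)
- the print command has its own logic for encoding: it uses sys.stdout.encoding, which is taken from the terminal (UTF-8 in the session above)
- one must decode a str to unicode before converting it to another encoding.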
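The last principle can be condensed into a small helper. This is a minimal Python 3 sketch, not code from the answer: transcode is a hypothetical name, and utf-16-le is used instead of utf-16 only so that the output has a fixed byte order and no BOM.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: re-encode bytes by going through text (unicode)."""
    text = data.decode(source)   # bytes -> text, using the encoding the bytes are really in
    return text.encode(target)   # text -> bytes, in the encoding you want


latin1_bytes = b"\xc4pple"
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")
utf16_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-16-le")  # explicit byte order, no BOM

print(utf8_bytes)
print(utf16_bytes)
```

The helper only works if source really is the encoding of the input; guessing wrong either raises an error or silently produces garbled text.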
Of course, all of this changes in Python 3.x.
Hope that is illuminating.
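As a rough sketch of what changes, assuming Python 3: text (str) and encoded data (bytes) are separate types, each with only one of the two conversion methods, and mixing them raises an error instead of triggering a silent ASCII decode. The variable names are illustrative.

```python
# Python 3 sketch: text and bytes are distinct types with one-way methods.
text = "\u00c4pple"          # str holds unicode text
data = text.encode("utf-8")  # bytes holds encoded data

str_can_decode = hasattr(text, "decode")    # str has no decode()
bytes_can_encode = hasattr(data, "encode")  # bytes has no encode()

# Mixing the two raises TypeError instead of silently decoding as ASCII.
try:
    text + data
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(str_can_decode, bytes_can_encode, type(mixed_error).__name__)
```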
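Further reading
- Characters vs. Bytes, by Tim Bray.
And the very illustrative rants by Armin Ronacher:
- The Updated Guide to Unicode on Python (July 2, 2013)
- More About Unicode in Python 2 and 3 (January 5, 2014)
- UCS vs UTF-8 as Internal String Encoding (January 9, 2014)
- Everything you did not want to know about Unicode in Python 3 (May 12, 2014)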
Try decoding it first, then encoding:

    apple.decode('iso-8859-1').encode('utf8')

For Python 3:

    bytes(apple, 'iso-8859-1').decode('utf-8')
I used this on text that had been wrongly decoded as iso-8859-1 instead of utf-8 (showing words like VeÅ\x99ejnÃ©). This code produces the correct version, Veřejné.
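The repair described in this comment can be reproduced with the following Python 3 sketch. The garbled value is built from escapes so that the example does not depend on the source file encoding; the variable names are illustrative.

```python
# Python 3 sketch: repairing UTF-8 text that was wrongly decoded as iso-8859-1.
garbled = "Ve\u00c5\u0099ejn\u00c3\u00a9"  # what the mis-decoded text looks like

# Encoding as iso-8859-1 recovers the original bytes one-for-one...
original_bytes = bytes(garbled, "iso-8859-1")

# ...which can then be decoded with the encoding they were really in.
repaired = original_bytes.decode("utf-8")

# ascii() keeps the output printable on any terminal.
print(ascii(repaired))
```

Note that this is the reverse of the question's problem: it undoes a wrong decode rather than converting valid iso-8859-1 data to UTF-8.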
Decode to Unicode, then encode the result to UTF8.

    apple.decode('latin1').encode('utf8')
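Put together with the Quoted-printable step from the question, the whole path might look like the sketch below. It assumes Python 3 and uses the standard quopri module directly rather than the email module the asker mentions; the payload is an illustrative value.

```python
import quopri

# Python 3 sketch: Quoted-printable -> raw bytes -> text -> UTF-8 bytes.
qp_payload = b"=C4pple"  # illustrative payload, "Äpple" in iso-8859-1

raw = quopri.decodestring(qp_payload)  # undo Quoted-printable; still iso-8859-1 bytes
text = raw.decode("latin1")            # bytes -> unicode text
utf8 = text.encode("utf8")             # unicode text -> UTF-8 bytes

print(raw, utf8)
```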
    import MySQLdb

    # Assumes concept is a unicode object; 'ignore' silently drops every non-ASCII character.
    concept = concept.encode('ascii', 'ignore')
    concept = MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())
This is what I do. I am not sure it is a good approach, but it works every time!
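One thing worth checking before adopting this approach is what the 'ignore' error handler does to the data. The Python 3 sketch below uses an illustrative value for concept and leaves out the MySQLdb call entirely.

```python
# Python 3 sketch: what the 'ignore' error handler does to non-ASCII text.
concept = "\u00c4pple juice  "  # illustrative value with trailing spaces

ascii_only = concept.encode("ascii", "ignore")  # the "Ä" is silently dropped
utf8_kept = concept.rstrip().encode("utf-8")    # the "Ä" survives as two bytes

print(ascii_only, utf8_kept)
```

So the approach never raises, but only because characters outside ASCII are discarded before the latin1-to-utf8 step runs.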