
Python CSV DictReader with UTF-8 data

AFAIK, the Python (v2.6) csv module can't handle Unicode data by default, correct? In the Python docs there is an example of how to read from a UTF-8 encoded file, but that example only returns the CSV rows as lists. I'd like to access the row columns by name, as csv.DictReader does, but with a UTF-8 encoded CSV input file.

Can anybody tell me how to do this efficiently? I have to process CSV files that are 100 MB in size.


I came up with an answer myself:

import csv

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8')
               for key, value in row.iteritems()}

Note: This has been updated so that the keys are decoded too, per a suggestion in the comments.


For me, the key was not manipulating the csv.DictReader arguments, but the file opener itself. This did the trick:

with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)

No special class needed. Now I can open files with or without a BOM without crashing. (This is Python 3, where the encoding argument to open() handles the decoding.)
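As a quick check of that behavior (Python 3; the data and field names here are invented for illustration), the "utf-8-sig" codec strips a leading BOM when present and is harmless when absent:

```python
import csv
import io

# UTF-8 bytes with a leading BOM, as Excel often writes them
payload = "name,city\nÅsa,Göteborg\n".encode("utf-8-sig")

# io.TextIOWrapper(io.BytesIO(...)) stands in for
# open(filepath, mode="r", encoding="utf-8-sig")
with io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["name"])  # prints "Åsa" — the BOM was stripped, so "name" is a clean key
```

With plain "utf-8" instead, the BOM would survive decoding and end up glued to the first fieldname.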


That answer doesn't have a DictWriter counterpart, so here is the updated class:

import codecs
import cStringIO
import csv

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames    # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
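For comparison, under Python 3 no such wrapper class is needed: csv.DictWriter writes Unicode text directly once the file is opened with an encoding. A minimal sketch (the field names and row values are made up; io.StringIO stands in for a real file opened with encoding="utf-8" and newline=""):

```python
import csv
import io

fieldnames = ["name", "city"]

# io.StringIO stands in for open("out.csv", "w", encoding="utf-8", newline="")
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({"name": "Åsa", "city": "Göteborg"})

# buf.getvalue() now holds plain Unicode CSV text; the encoding to
# bytes happens in the file object, not in the csv module
```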

The csvw package has other functionality too (for metadata-enriched CSV on the web), but it defines a UnicodeDictReader class wrapping its UnicodeReader class, whose core is:

class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    ...

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]

It did trip me up a couple of times: csvw.UnicodeDictReader really does need to be used inside a with block and breaks otherwise. Beyond that, the module is quite generic and compatible with both PY2 and PY3.


Building on @LMatter's answer with a subclass-based approach, you still get all the benefits of DictReader, such as the fieldnames and the line number, plus it handles UTF-8:

import csv

class UnicodeDictReader(csv.DictReader, object):

    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
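Note that this subclass is Python 2 only (next(), unicode, and iteritems() are all gone in Python 3). Under Python 3 the csv module works on text, so csv.DictReader already yields fully decoded str keys and values and no subclass is needed. A minimal sketch (the sample data is invented; io.StringIO stands in for an opened file):

```python
import csv
import io

# In Python 3, DictReader consumes text and returns str keys/values,
# so there is nothing left to decode by hand.
f = io.StringIO("name,city\nÅsa,Göteborg\n")
rows = list(csv.DictReader(f))
assert isinstance(rows[0]["name"], str)  # already Unicode text
```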

First of all, use the 2.6 version of the documentation; it can change with each release. It says plainly that it doesn't support Unicode but does support UTF-8, which technically is not the same thing. As the docs say:

The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.

The example below (adapted from the docs) shows how to build a generator that correctly reads UTF-8 text as CSV. Keep in mind that csv.DictReader returns each row as a dict, so both the keys and the values need to be decoded.

import csv

def utf_8_encoder(unicode_csv_data):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, key by key and value by value:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8')
               for key, value in row.iteritems()}