Counting duplicate words in Python the fastest way

I am trying to count duplicate words in a list of 230 thousand words, and I used a Python dictionary to do it. The code is below:

    for words in word_list:
        if words in word_dict.keys():
            word_dict[words] += 1
        else:
            word_dict[words] = 1

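The code above took 3 minutes! I ran the same code over 1.5 million words; it ran for more than 25 minutes before I lost patience and killed the job. Then I found that I could use the code shown below. The result was surprising: it finished within seconds! So my question is: what is the faster way to do this operation? I guess the dictionary creation process must be taking a lot of time. How is the Counter method able to finish in seconds and still build an exact dictionary with the words as keys and their frequencies as values?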

    from collections import Counter
    word_dict = Counter(word_list)


Related discussion

You can look at the Counter source code. Here is the update method, which is called on init:

(Note that it uses the performance trick of binding self.get to a local name, self_get, before the loop.)

    def update(self, iterable=None, **kwds):
        '''Like dict.update() but add counts instead of replacing them.

        Source can be an iterable, a dictionary, or another Counter instance.

        >>> c = Counter('which')
        >>> c.update('witch')           # add elements from another iterable
        >>> d = Counter('watch')
        >>> c.update(d)                 # add elements from another counter
        >>> c['h']                      # four 'h' in which, witch, and watch
        4

        '''
        # The regular dict.update() operation makes no sense here because the
        # replace behavior results in the some of original untouched counts
        # being mixed-in with all of the other counts for a mismash that
        # doesn't have a straight-forward interpretation in most counting
        # contexts.  Instead, we implement straight-addition.  Both the inputs
        # and outputs are allowed to contain zero and negative counts.

        if iterable is not None:
            if isinstance(iterable, Mapping):
                if self:
                    self_get = self.get
                    for elem, count in iterable.iteritems():
                        self[elem] = self_get(elem, 0) + count
                else:
                    super(Counter, self).update(iterable)  # fast path when counter is empty
            else:
                self_get = self.get
                for elem in iterable:
                    self[elem] = self_get(elem, 0) + 1
        if kwds:
            self.update(kwds)
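
For illustration, the last branch of update (the plain-iterable case) can be pulled out into a standalone function. This is only a minimal sketch: the name count_words and the sample list are made up for this example and are not part of the collections module.

```python
def count_words(words):
    """Count occurrences with a plain dict, mirroring the loop in update()."""
    counts = {}
    counts_get = counts.get  # bind the method once, outside the loop
    for word in words:
        # get() returns 0 for unseen words, so no membership test is needed
        counts[word] = counts_get(word, 0) + 1
    return counts


# Hypothetical sample data, just to exercise the function.
sample = ["spam", "eggs", "spam", "ham", "spam"]
print(count_words(sample))
```

Each word costs one hash lookup through get and one assignment, and no list of keys is ever built or scanned.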

You could also try defaultdict as a more competitive alternative. Try:

    from collections import defaultdict

    # int() returns 0, so a missing word starts at a count of zero
    word_dict = defaultdict(int)
    for word in word_list:
        word_dict[word] += 1

    print(word_dict)
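
As for why the loop in the question is so slow: in Python 2, word_dict.keys() returns a new list, so `words in word_dict.keys()` builds and scans a list on every iteration, whereas `words in word_dict` is a single hash lookup. The sketch below is a rough way to compare the approaches. It is written for Python 3, where keys() returns a view, so the list is built explicitly to emulate the Python 2 behaviour. The function names and the synthetic word list are assumptions for this example, and the timings will vary by machine.

```python
import timeit
from collections import Counter, defaultdict


def count_with_key_list(words):
    # Emulates Python 2's `in word_dict.keys()`: a fresh list of keys is
    # built and scanned linearly for every word, so the loop is quadratic.
    counts = {}
    for word in words:
        if word in list(counts.keys()):
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def count_with_dict_lookup(words):
    # `in counts` is a hash lookup, constant time on average.
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def count_with_defaultdict(words):
    # Missing keys are created with int(), i.e. 0, on first access.
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return counts


# Synthetic input (an assumption): 5,000 words drawn from 1,000 distinct values.
word_list = ["w%d" % (i % 1000) for i in range(5000)]

expected = dict(Counter(word_list))
for func in (count_with_key_list, count_with_dict_lookup, count_with_defaultdict):
    assert dict(func(word_list)) == expected  # every approach gives the same counts
    seconds = timeit.timeit(lambda: func(word_list), number=1)
    print("%-24s %.4f s" % (func.__name__, seconds))
```

The gap between the first function and the other two should widen as the number of distinct words grows, since only the first one does work proportional to the size of the dictionary on every iteration.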