Python: Finding all keys in a dictionary from a given list QUICKLY

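I have a (potentially quite large) dictionary and a list of "possible" keys. I want to quickly find which of the keys in the list have matching values in the dictionary. I've found lots of discussion of single dictionary values here and here, but no discussion of speed or of multiple entries.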

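I've come up with four methods, and for the three that work best I compare their speed on different sample sizes below. Are there better methods? If people can suggest sensible contenders, I'll subject them to the same analysis.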

The sample list and dictionary are created as follows:

import cProfile
from random import randint

length = 100000

# Floor division (//) keeps the upper bound an integer on both Python 2 and 3.
upperBound = length * length // 10 - 1

listOfRandomInts = [randint(0, upperBound) for _ in range(length)]
dictionaryOfRandomInts = {randint(0, upperBound): "It's here" for _ in range(length)}

Method 1: the 'in' keyword:

def way1(theList, theDict):
    resultsList = []
    for listItem in theList:
        # Membership test against the dict itself is a hash lookup.
        if listItem in theDict:
            resultsList.append(theDict[listItem])
    return resultsList

cProfile.run('way1(listOfRandomInts,dictionaryOfRandomInts)')

32 function calls in 0.018 seconds

Method 2: error handling:

def way2(theList, theDict):
    resultsList = []
    for listItem in theList:
        try:
            resultsList.append(theDict[listItem])
        except KeyError:
            # Key not present: skip it.
            pass
    return resultsList

cProfile.run('way2(listOfRandomInts,dictionaryOfRandomInts)')

32 function calls in 0.087 seconds

Method 3: set intersection:

def way3(theList, theDict):
    # Note: this returns the matching keys, not their values.
    return list(set(theList).intersection(set(theDict.keys())))

cProfile.run('way3(listOfRandomInts,dictionaryOfRandomInts)')

26 function calls in 0.046 seconds

Method 4: naive use of dict.keys():

A cautionary tale: this was my first attempt and BY FAR the slowest!

def way4(theList, theDict):
    resultsList = []
    keys = theDict.keys()
    for listItem in theList:
        if listItem in keys:
            resultsList.append(theDict[listItem])
    return resultsList

cProfile.run('way4(listOfRandomInts,dictionaryOfRandomInts)')

12 function calls in 248.552 seconds
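A likely explanation for Method 4's time, assuming Python 2.7 as the rest of this thread does: there dict.keys() builds a plain list, so every `in keys` test is a linear scan and the whole loop becomes quadratic. The sketch below reproduces the effect in a version-independent way by materialising the keys with list(). The function names and the small sample sizes are made up for illustration, and the timings will vary by machine.

```python
import timeit

def lookup_via_key_list(the_list, the_dict):
    keys = list(the_dict)  # mimics Python 2's dict.keys(): a plain list
    # Each "in keys" test scans the list, so this is O(len(keys)) per item.
    return [the_dict[k] for k in the_list if k in keys]

def lookup_via_dict(the_list, the_dict):
    # Each "in the_dict" test is a hash lookup, O(1) on average.
    return [the_dict[k] for k in the_list if k in the_dict]

# Illustrative data only: 1000 keys, 1500 candidates.
sample_dict = {i * 3: "It's here" for i in range(1000)}
sample_list = list(range(0, 3000, 2))

same = lookup_via_key_list(sample_list, sample_dict) == lookup_via_dict(sample_list, sample_dict)
slow = timeit.timeit(lambda: lookup_via_key_list(sample_list, sample_dict), number=3)
fast = timeit.timeit(lambda: lookup_via_dict(sample_list, sample_dict), number=3)
print("same results:", same)
print("list scan: %.4fs, dict lookup: %.4fs" % (slow, fast))
```

Both functions return the same values; only the cost of the membership test differs.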

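EDIT: Bringing the suggestions given in the answers into the same framework that I've used, for consistency. Many have noted that more performance gains can be achieved in Python 3.x, particularly for the list-comprehension-based methods. Many thanks for all of the help!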

Method 5: a better way of performing the intersection (thanks jonrsharpe):

def way5(theList, theDict):
    return list(set(theList).intersection(theDict))

25 function calls in 0.037 seconds

Method 6: list comprehension (thanks jonrsharpe):

def way6(theList, theDict):
    return [item for item in theList if item in theDict]

24 function calls in 0.020 seconds

Method 7: using the & operator (thanks jonrsharpe):

def way7(theList, theDict):
    return list(theDict.viewkeys() & theList)

25 function calls in 0.026 seconds

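For methods 1-3 and 5-7 I timed them as above with lists/dictionaries of length 1000, 10000, 100000, 1000000 and 10000000, and show a log-log plot of the time taken. Across all lengths the intersection and in-statement methods perform better. The gradients are all about 1 (maybe a bit higher), indicating O(n) or perhaps slightly super-linear scaling.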

Log-Log plot comparing time-scaling of the 6 sensible methods with list/dict length
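For anyone who wants to read a gradient off such a plot numerically, here is a minimal sketch of a least-squares fit of log(time) against log(n). The function name loglog_slope is hypothetical, and the timings fed to it are synthetic values constructed to be exactly linear and exactly quadratic; they are not the measurements behind the plot above.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

sizes = [10 ** k for k in range(3, 8)]

# Synthetic timings (NOT real measurements): exactly O(n) and exactly O(n^2).
linear_times = [2e-7 * n for n in sizes]
quadratic_times = [3e-9 * n * n for n in sizes]

print(loglog_slope(sizes, linear_times))     # close to 1
print(loglog_slope(sizes, quadratic_times))  # close to 2
```

A slope near 1 corresponds to linear scaling; a slope a little above 1 would match the "slightly super-linear" reading.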

Related discussion

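First, I think you're on 2.7, so I'll do most of this with 2.7. But it's worth noting that if you're really interested in optimizing your code, the 3.x branch continues to get performance improvements, and the 2.x branch never will. And why are you using CPython instead of PyPy?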

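Anyway, some further micro-optimizations to try (in addition to the ones in jonrsharpe's answer):

Caching attribute and/or global lookups in local variables (it's called LOAD_FAST for a reason). For example: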

def way1a(theList, theDict):
    resultsList = []
    # Cache the bound method in a local so the loop skips the attribute lookup.
    rlappend = resultsList.append
    for listItem in theList:
        if listItem in theDict:
            rlappend(theDict[listItem])
    return resultsList

In [10]: %timeit way1(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 13.2 ms per loop
In [11]: %timeit way1a(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 12.4 ms per loop

But for some kinds of special methods, like __contains__ and __getitem__, that may not be worth doing. Of course you won't know until you try:

def way1b(theList, theDict):
    resultsList = []
    rlappend = resultsList.append
    # Also cache the dict's special methods as locals.
    tdin = theDict.__contains__
    tdgi = theDict.__getitem__
    for listItem in theList:
        if tdin(listItem):
            rlappend(tdgi(listItem))
    return resultsList

In [14]: %timeit way1b(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 12.8 ms per loop

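Meanwhile, Jon's way6 answer already optimizes out the resultsList.append entirely by using a listcomp, and we just saw that optimizing out the lookups it still has probably won't help. Especially in 3.x, where the comprehension is compiled into a function of its own, but even in 2.7 I wouldn't expect any benefit, for the same reasons as in the explicit loop. But let's try just to be sure: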

def way6(theList, theDict):
    return [theDict[item] for item in theList if item in theDict]

def way6a(theList, theDict):
    tdin = theDict.__contains__
    tdgi = theDict.__getitem__
    return [tdgi(item) for item in theList if tdin(item)]

In [31]: %timeit way6(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 14.7 ms per loop
In [32]: %timeit way6a(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 13.9 ms per loop

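Surprisingly (at least to me), this time it actually helped. Not sure why.

But what I was really setting up for was this: another advantage of turning both the filter expression and the value expression into function calls is that we can use filter and map: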

from itertools import ifilter  # Python 2 only; needed by way6c

def way6b(theList, theDict):
    tdin = theDict.__contains__
    tdgi = theDict.__getitem__
    return map(tdgi, filter(tdin, theList))

def way6c(theList, theDict):
    tdin = theDict.__contains__
    tdgi = theDict.__getitem__
    # ifilter is lazy, so no intermediate list is built.
    return map(tdgi, ifilter(tdin, theList))

In [34]: %timeit way6b(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 10.7 ms per loop
In [35]: %timeit way6c(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 13 ms per loop

But that gain is largely 2.x-specific; 3.x has faster comprehensions, while its list(map(filter(…))) is slower than 2.x's map(filter(…)) or map(ifilter(…)).
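For readers on 3.x, a rough port of the two contenders might look like the sketch below. In Python 3 both map and filter return lazy iterators, so the result has to be wrapped in list(), and itertools.ifilter no longer exists because the built-in filter is already lazy. The names way6b_py3 and way6_py3 are made up here; this only shows the shape of the code and makes no claim about timings.

```python
def way6b_py3(the_list, the_dict):
    # filter() and map() are lazy on Python 3; list() forces evaluation.
    return list(map(the_dict.__getitem__, filter(the_dict.__contains__, the_list)))

def way6_py3(the_list, the_dict):
    # The list comprehension, returning values rather than keys.
    return [the_dict[item] for item in the_list if item in the_dict]

sample_dict = {1: "a", 3: "c"}
sample_list = [3, 2, 1, 3]

print(way6b_py3(sample_list, sample_dict))
print(way6_py3(sample_list, sample_dict))
```

Both calls should give the same list, with one value per matching list item in list order.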

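You don't need to convert both sides of a set intersection to a set, just the left side; the right side can be any iterable, and a dict is already an iterable of its keys.

But, even better, a dict's key view (dict.keys in 3.x, dict.viewkeys in 2.7) is already a set-like object, and one backed by the dict's hash table, so you don't need to transform anything. (It doesn't have quite the same interface: it has no intersection method, but its & operator takes iterables, unlike set, which has an intersection method that takes iterables but a & that only takes sets. That's annoying, but we only care about performance here, right?)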

def way3(theList, theDict):
    return list(set(theList).intersection(set(theDict.keys())))

def way3a(theList, theDict):
    # The right-hand side can be any iterable; a dict iterates over its keys.
    return list(set(theList).intersection(theDict))

def way3b(theList, theDict):
    # The key view is already set-like; & accepts any iterable.
    return list(theDict.viewkeys() & theList)

In [20]: %timeit way3(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 23.7 ms per loop
In [20]: %timeit way3a(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 15.5 ms per loop
In [20]: %timeit way3b(listOfRandomInts, dictionaryOfRandomInts)
100 loops, best of 3: 15.7 ms per loop

That last one didn't help (although using Python 3.4 instead of 2.7, it was 10% faster…), but the first one definitely did.
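The interface mismatch between set and the key view mentioned above is easy to check. This is a small sketch assuming Python 3, where the view comes from dict.keys(); on 2.7 the equivalent call is viewkeys(). The variable names are illustrative.

```python
the_dict = {1: "a", 2: "b", 3: "c"}
the_list = [2, 3, 4]

# set.intersection() accepts any iterable, including a dict (iterates its keys).
via_method = set(the_list).intersection(the_dict)

# The key view's & operator also accepts any iterable.
via_view = the_dict.keys() & the_list

# A real set's & operator insists on another set and raises TypeError otherwise.
try:
    set(the_dict) & the_list
    operator_error = None
except TypeError as exc:
    operator_error = exc

print(sorted(via_method), sorted(via_view), type(operator_error).__name__)
```

The first two expressions produce the same set of keys; only the third form is rejected.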

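In real life, you may also want to compare the sizes of the two collections to decide which one gets setified, but here that information is static, so there's no point writing the code to test it.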
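If the sizes were not known in advance, one way the size check could look is sketched below. This is a hypothetical helper (common_keys is not from the question or the answers), and whether either branch actually wins for a given workload would need measuring.

```python
def common_keys(the_list, the_dict):
    """Return the set of list items that are keys of the dict."""
    if len(the_list) <= len(the_dict):
        # Few candidates: probe the dict's hash table directly, no extra set.
        return {item for item in the_list if item in the_dict}
    # Many candidates: build a set from the list once, then walk the dict's keys.
    candidates = set(the_list)
    return {key for key in the_dict if key in candidates}

small_list = [1, 2, 2, 5]
big_dict = {2: "x", 5: "y", 7: "z", 8: "w", 9: "q"}
print(sorted(common_keys(small_list, big_dict)))

big_list = [1, 2, 3, 4, 5, 6, 7]
small_dict = {2: "x", 7: "z", 10: "t"}
print(sorted(common_keys(big_list, small_dict)))
```

Either branch returns the same set; the check only decides which collection is iterated and which is probed.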

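Anyway, my fastest result was the map(filter(…)) on 2.7, by a pretty good margin. On 3.4 (which I didn't show here), Jon's listcomp was fastest (even when fixed to return the values rather than the keys), and faster than any of the 2.7 methods. Also, 3.4's fastest set operation (using the key view as a set and the list as an iterable) was a lot closer to the iterative methods than in 2.7.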