Python: a weighted version of random.choice

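I need to write a weighted version of random.choice (each element in the list has a different probability of being selected). This is what I came up with: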

import random

def weightedChoice(choices):
    """Like random.choice, but each element can have a different chance of
    being selected.

    choices can be any iterable containing iterables with two items each.
    Technically, they can have more than two items, the rest will just be
    ignored. The first item is the thing being chosen, the second item is
    its weight. The weights can be any numeric values, what matters is the
    relative differences between them.
    """
    space = {}
    current = 0
    for choice, weight in choices:
        if weight > 0:
            space[current] = choice
            current += weight
    rand = random.uniform(0, current)
    # list(space) keeps this working on Python 3, where dict.keys() is a view
    for key in sorted(list(space) + [current]):
        if rand < key:
            return choice
        choice = space[key]
    return None

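To me, this function seems overly complex and ugly. I'm hoping everyone here can offer suggestions for improving it, or alternative ways of doing this. Efficiency isn't as important to me as code cleanliness and readability.

Related discussion

Since Python 3.6, the random module provides a choices method.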

Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import random

In [2]: random.choices(
   ...:     population=[['a','b'], ['b','a'], ['c','b']],
   ...:     weights=[0.2, 0.2, 0.6],
   ...:     k=10
   ...: )

Out[2]:
[['c', 'b'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['c', 'b']]

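People have also mentioned numpy.random.choice, which supports weights, but it does not support 2D arrays and so on.

So, if you have Python 3.6.x, ~~you can basically get whatever you need from the built-in random.choices (see the update).~~

Update:
As @roganjosh kindly mentioned, random.choices cannot return values without replacement, as stated in the docs: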

Return a k sized list of elements chosen from the population with replacement.

@ronan-paixão's excellent answer points out that numpy.random.choice has a replace argument that controls this behaviour.
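For readers without numpy, sampling without replacement can also be sketched with the standard library alone. This is an illustration, not something taken from the answers here: the helper name weighted_sample_without_replacement is hypothetical, and it assumes Python 3.6+ so that random.choices is available. It draws one index at a time and removes the chosen item before the next draw.

```python
import random

def weighted_sample_without_replacement(population, weights, k, rng=random):
    """Hypothetical helper: draw k distinct items, favouring heavier weights."""
    items = list(population)
    remaining = list(weights)
    if k > len(items):
        raise ValueError("k cannot exceed the population size")
    picked = []
    for _ in range(k):
        # Draw one index from what is left, proportionally to its weight.
        idx = rng.choices(range(len(items)), weights=remaining, k=1)[0]
        picked.append(items.pop(idx))
        remaining.pop(idx)
    return picked

print(weighted_sample_without_replacement(["a", "b", "c"], [0.2, 0.2, 0.6], k=2))
```

Each draw costs O(n), so this is only meant for small populations.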

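Related discussion

This can take weights, but there seems to be no way to turn off replacement, which np.random.choice can do.

This is much faster than numpy.random.choice. Picking from a list of 8 weighted items 10,000 times, numpy.random.choice took 0.3286 seconds while random.choices took 0.0416 seconds, roughly 8 times faster.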

import random

def weighted_choice(choices):
    total = sum(w for c, w in choices)
    r = random.uniform(0, total)
    upto = 0
    for c, w in choices:
        if upto + w >= r:
            return c
        upto += w
    assert False, "Shouldn't get here"

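Related discussion

I don't know why I thought I had to sort the weights... this is much better.

You can drop one operation and save a sliver of time by reversing the statements inside the for loop: upto += w; if upto > r

random.uniform(0, total) can return total (docs.python.org/2/library/random.html#random.uniform), in which case there would be an AssertionError

@knite, please don't suggest that. Did you even test it? It completely breaks the distribution. Running weighted_choice([(a,1.0),(b,2.0),(c,3.0)]) after your modification results in b never being selected...

@rsk, you are correct, although that is a very rare case. Changing > r to >= r resolves the issue for me.

@Cerin It doesn't seem to make sense that b would never be selected. Can you explain? Also, "> r" needs to be changed to ">= r", otherwise a would never be chosen (assuming you follow knite's modification).

Save a variable by removing upto and just decrementing r by the weight each time. The comparison is then if r < 0

@JnBrymn You need to check r <= 0. Consider an input set of 1 item and a roll of 1.0. The assertion will fail then. I corrected that error in the answer.

You can use the for ... else construct instead of the assert False

I used this in some code and ran my own coverage tool on it. It gave me: "no return from function, because the loop didn't complete". Is there a way to get 100% test coverage? Should I open a bug report or a separate question for this?

@Sardathrion You can use a pragma to mark the for loop as a partial branch: # pragma: no branch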
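Several of the comments above can be combined into one variant: subtract each weight from r instead of keeping upto, compare with r <= 0, and use for ... else rather than an assert. The sketch below is one possible reading of those suggestions, not the answer's own code. The rng parameter and the filtering of non-positive weights are assumptions added for illustration.

```python
import random

def weighted_choice(choices, rng=random):
    # Keep only positive weights so a zero-weight item can never be returned.
    pairs = [(c, w) for c, w in choices if w > 0]
    if not pairs:
        raise ValueError("need at least one positive weight")
    r = rng.uniform(0, sum(w for _, w in pairs))
    for c, w in pairs:
        r -= w
        if r <= 0:
            return c
    else:
        # Only reached if floating-point rounding leaves r slightly above 0.
        return pairs[-1][0]

print(weighted_choice([("a", 1.0), ("b", 2.0), ("c", 3.0)]))
```

The else branch replaces the assertion, so the function always returns an item even in the rounding edge case.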

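1) Arrange the weights into a cumulative distribution.

2) Use random.random() * total to pick a random float 0.0 <= x < total.

3) Search the distribution using bisect.bisect, as shown in the example at http://docs.python.org/dev/library/bisect.html#other-examples.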

from random import random
from bisect import bisect

def weighted_choice(choices):
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    x = random() * total
    i = bisect(cum_weights, x)
    return values[i]

>>> weighted_choice([("WHITE",90), ("RED",8), ("GREEN",2)])
'WHITE'

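If you need to make more than one choice, split this into two functions: one to build the cumulative weights and another to bisect to a random point.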
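The two-function split could look roughly like the sketch below. The names build_cum_weights and pick are hypothetical, and the optional rng parameter is an addition for illustration. The O(n) accumulation runs once, and each later draw is only a binary search.

```python
import bisect
import itertools
import random

def build_cum_weights(weights):
    """Run once per set of weights: O(n)."""
    return list(itertools.accumulate(weights))

def pick(values, cum_weights, rng=random):
    """Run once per draw: O(log n) binary search into the cumulative weights."""
    x = rng.random() * cum_weights[-1]
    return values[bisect.bisect(cum_weights, x)]

values = ["WHITE", "RED", "GREEN"]
cum_weights = build_cum_weights([90, 8, 2])   # [90, 98, 100]
draws = [pick(values, cum_weights) for _ in range(5)]
print(draws)
```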

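Related discussion

This is more efficient than Ned's answer. Basically, instead of doing a linear (O(n)) search through the choices, he does a binary search (O(log n)). +1!

Tuple index out of range if random() happens to return 1.0

This still runs in O(n) because of the cumulative distribution calculation.

I prefer this solution. The code is clearer and easier to understand.

This solution is better when weighted_choice needs to be called multiple times for the same set of choices. In that case you can create the cumulative sum once and do a binary search on each call.

If you don't mind using numpy, you can use numpy.random.choice.

For example: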

import numpy

items = [["item1", 0.2], ["item2", 0.3], ["item3", 0.45], ["item4", 0.05]]
elems = [i[0] for i in items]
probs = [i[1] for i in items]

trials = 1000
results = [0] * len(items)
for i in range(trials):
    # This is where the item is selected! choice() needs a 1-D sequence, so pass elems.
    res = numpy.random.choice(elems, p=probs)
    results[elems.index(res)] += 1
results = [r / float(trials) for r in results]
print("item\texpected\tactual")
for i in range(len(probs)):
    print("%s\t%0.4f\t%0.4f" % (elems[i], probs[i], results[i]))

If you know beforehand how many selections you need to make, you can do it without a loop like this:

numpy.random.choice(elems, trials, p=probs)

Crude, but may be sufficient:

import random
weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))

Does it work?

# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

# initialize tally dict, keyed by the value of each (value, weight) pair
tally = dict.fromkeys((value for value, weight in choices), 0)

# tally up 1000 weighted choices
for i in range(1000):
    tally[weighted_choice(choices)] += 1

print(list(tally.items()))

Prints:

[('WHITE', 904), ('GREEN', 22), ('RED', 74)]

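This assumes that all weights are integers. They don't have to add up to 100; I just did that to make the test results easier to interpret. (If the weights are floating-point numbers, multiply them all by 10 repeatedly until all weights are >= 1.)

weights = [.6, .2, .001, .199]
while any(w < 1.0 for w in weights):
    weights = [w*10 for w in weights]
# round rather than truncate, so 198.999... becomes 199 and not 198
weights = [int(round(w)) for w in weights]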

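Related discussion

Nice, though I'm not sure I can assume all weights are integers.

It seems your objects would be duplicated in this example. That is inefficient (and so is the function that converts weights to integers). Nevertheless, this solution is a good one-liner if the integer weights are small.

Primitives will be duplicated, but objects will only have their references duplicated, not the objects themselves. (This is why you can't create a list of lists using [[]]*10: all the elements in the outer list point to the same list.)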
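The reference-duplication point in the last comment can be seen directly in a short sketch (the variable names are only illustrative):

```python
rows = [[]] * 3          # three references to ONE inner list
rows[0].append("x")
print(rows)              # [['x'], ['x'], ['x']]

independent = [[] for _ in range(3)]   # three distinct inner lists
independent[0].append("x")
print(independent)       # [['x'], [], []]
```

So repeating a value with [v]*wt costs one reference per repetition, not a copy of the object.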

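If you have a weighted dictionary instead of a list, you can write this:

import random

items = {"a": 10, "b": 5, "c": 1}
random.choice([k for k in items for dummy in range(items[k])])

Note that [k for k in items for dummy in range(items[k])] produces a list such as ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b', 'b', 'b'] (the element order follows the dict's iteration order).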

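Related discussion

This works for small total population values, but not for large datasets (e.g. US population by state would end up creating a working list with 300 million items in it).

Works, kudos.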
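For the large-weight case raised in the comment above, the expanded working list can be avoided. A minimal sketch, assuming Python 3.6+ so that random.choices is available, passes the dict's keys and values straight in as population and weights:

```python
import random

items = {"a": 10, "b": 5, "c": 1}
# Keys and values iterate in the same order, so they line up as population/weights.
pick = random.choices(list(items), weights=list(items.values()), k=1)[0]
print(pick)
```

Memory use then depends on the number of keys rather than on the sum of the weights.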

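Since Python v3.6, random.choices can be used to return a list of elements of a specified size from the given population, with optional weights.

random.choices(population, weights=None, *, cum_weights=None, k=1)

population: a list containing unique observations. (If empty, raises IndexError.)

weights: more precisely, the relative weights used to make selections.

cum_weights: the cumulative weights used to make selections.

k: the size (len) of the list to be output. (Default: 1.)

A few caveats:

1) It uses weighted sampling with replacement, so drawn items are put back and can be drawn again. The values in the weights sequence do not matter in themselves; only their relative ratios do.

Unlike np.random.choice, which only accepts probabilities as weights and requires them to sum to 1, there is no such restriction here. As long as the weights are of a numeric type (int/float/Fraction, but not Decimal), it will work.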

>>> import random
# weights being integers
>>> random.choices(["white","green","red"], [12, 12, 4], k=10)
['green', 'red', 'green', 'white', 'white', 'white', 'green', 'white', 'red', 'white']
# weights being floats
>>> random.choices(["white","green","red"], [.12, .12, .04], k=10)
['white', 'white', 'green', 'green', 'red', 'red', 'white', 'green', 'white', 'green']
# weights being fractions
>>> random.choices(["white","green","red"], [12/100, 12/100, 4/100], k=10)
['green', 'green', 'white', 'red', 'green', 'red', 'white', 'green', 'green', 'green']

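2) If neither weights nor cum_weights is specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence.

Specifying both weights and cum_weights raises a TypeError.
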
>>> random.choices(["white","green","red"], k=10)
['white', 'white', 'green', 'red', 'red', 'red', 'white', 'white', 'white', 'green']
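The two error cases mentioned in point 2 can be checked with a short sketch. The variable names are illustrative; the exception types are the ones raised by the standard library implementation.

```python
import random

colors = ["white", "green", "red"]
both_error = None
length_error = None

try:
    # weights and cum_weights together are rejected
    random.choices(colors, weights=[12, 12, 4], cum_weights=[12, 24, 28])
except TypeError as exc:
    both_error = exc

try:
    # two weights for three population items
    random.choices(colors, weights=[12, 12])
except ValueError as exc:
    length_error = exc

print(type(both_error).__name__, type(length_error).__name__)
```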

3) cum_weights are typically the result of the itertools.accumulate function, which is really handy in such situations.

From the documentation linked:

Internally, the relative weights are converted to cumulative weights
before making selections, so supplying the cumulative weights saves
work.

So, supplying either weights=[12, 12, 4] or cum_weights=[12, 24, 28] for our contrived case produces the same outcome, and the latter seems to be faster and more efficient.
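A small sketch of that equivalence, assuming CPython's implementation (which accumulates relative weights internally): two generators seeded identically should give the same draws whether weights or the precomputed cum_weights is passed. The seed 42 is arbitrary.

```python
import itertools
import random

weights = [12, 12, 4]
cum_weights = list(itertools.accumulate(weights))   # [12, 24, 28]

colors = ["white", "green", "red"]
by_weights = random.Random(42).choices(colors, weights=weights, k=10)
by_cum = random.Random(42).choices(colors, cum_weights=cum_weights, k=10)
print(by_weights == by_cum)
```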

Here is the version included in the standard library for Python 3.6:

import random
import itertools as _itertools
import bisect as _bisect

class Random36(random.Random):
    "Show the code included in the Python 3.6 version of the Random class"

    def choices(self, population, weights=None, *, cum_weights=None, k=1):
        """Return a k sized list of population elements chosen with replacement.

        If the relative weights or cumulative weights are not specified,
        the selections are made with equal probability.

        """
        random = self.random
        if cum_weights is None:
            if weights is None:
                _int = int
                total = len(population)
                return [population[_int(random() * total)] for i in range(k)]
            cum_weights = list(_itertools.accumulate(weights))
        elif weights is not None:
            raise TypeError('Cannot specify both weights and cumulative weights')
        if len(cum_weights) != len(population):
            raise ValueError('The number of weights does not match the population')
        bisect = _bisect.bisect
        total = cum_weights[-1]
        return [population[bisect(cum_weights, random() * total)] for i in range(k)]

Source: https://hg.python.org/cpython/file/tip/Lib/random.py#l340

I'd require the sum of the choices to be 1, but this works anyway:

import random

def weightedChoice(choices):
    # Safety check, you can remove it
    for c, w in choices:
        assert w >= 0

    # sum the weights (w), not the choices (c)
    tmp = random.uniform(0, sum(w for c, w in choices))
    for choice, weight in choices:
        if tmp < weight:
            return choice
        else:
            tmp -= weight
    raise ValueError('Negative values in input')

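Related discussion

Out of curiosity, is there a reason you prefer random.random() * total over random.uniform(0, total)?

@Colin No, none at all. Updated.

You iterate over choices three times. Iterators may not support that.

That's a good point. I only pass in lists of tuples, so I hadn't hit that bug.

@liori You are right. However, weightedChoice cannot be computed without storing all items of the iterable in a list, so the input should be a list.

I think it's actually possible. utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf It's actually pretty simple... but who cares...

@liori I do care, and you are right: weightedChoice can be computed with just one pass over the iterator. However, this seems to require multiple calls to the pseudo-random generator.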
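One way to sketch the single-pass idea from these comments is a running-total scheme: keep a candidate and replace it with probability weight / total_so_far as each pair arrives. This is not necessarily the algorithm from the linked paper, and the name weighted_choice_single_pass and the rng parameter are hypothetical. It works on a one-shot iterator and, as noted in the last comment, calls the generator once per item.

```python
import random

def weighted_choice_single_pass(pairs, rng=random):
    """Pick one item from an iterable of (item, weight) pairs in a single pass."""
    total = 0
    chosen = None
    for item, weight in pairs:
        if weight <= 0:
            continue
        total += weight
        # Replace the current pick with probability weight / total.
        if rng.random() * total < weight:
            chosen = item
    if total == 0:
        raise ValueError("need at least one positive weight")
    return chosen

# Works with a generator that can only be consumed once.
stream = ((name, w) for name, w in [("WHITE", 90), ("RED", 8), ("GREEN", 2)])
print(weighted_choice_single_pass(stream))
```

Item i ends up chosen with probability w_i / sum(w), because each later item leaves the current pick in place with probability (total - weight) / total.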

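I'm probably too late to contribute anything useful, but here is a simple, short, and very efficient snippet:

import random

def choose_index(probabilities):
    cmf = 0.0
    choice = random.random()
    for k, p in enumerate(probabilities):
        cmf += p
        if choice <= cmf:
            return k
    # Rounding can leave the running sum just below 1.0; fall back to the last index.
    return len(probabilities) - 1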

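There is no need to sort your probabilities or to build a vector with your cmf, and it terminates as soon as it finds its choice. Memory: O(1), time: O(N), with an average running time of about N/2.

If you have weights, simply normalize them first:

import random

def choose_index(weights):
    total = sum(weights)
    probabilities = [w / total for w in weights]
    cmf = 0.0
    choice = random.random()
    for k, p in enumerate(probabilities):
        cmf += p
        if choice <= cmf:
            return k
    # Rounding can leave the running sum just below 1.0; fall back to the last index.
    return len(probabilities) - 1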

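If your list of weighted choices is relatively static and you want to sample frequently, you can do one O(N) preprocessing step and then do each selection in O(1), using the functions in this related answer.
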
# run only when `choices` changes.
preprocessed_data = prep(weight for _,weight in choices)

# O(1) selection
value = choices[sample(preprocessed_data)][0]
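The prep and sample functions used above come from the related answer, which is not reproduced here. One well-known technique with O(N) setup and O(1) draws is the alias method (Walker/Vose). The sketch below assumes that this is the kind of approach meant, and build_alias_table and alias_sample are hypothetical stand-ins rather than the related answer's actual functions.

```python
import random

def build_alias_table(weights):
    """O(N) preprocessing: returns (prob, alias) tables for alias_sample."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # average bucket height is 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]          # bucket s keeps its own share...
        alias[s] = l                 # ...and is topped up from bucket l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:          # leftovers are full buckets
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """O(1) draw: pick a bucket uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

choices = [("WHITE", 90), ("RED", 8), ("GREEN", 2)]
prob, alias = build_alias_table([w for _, w in choices])
print(choices[alias_sample(prob, alias)][0])
```

The tables only need rebuilding when the weights change, which matches the "run only when choices changes" comment above.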

import numpy as np
w=np.array([ 0.4, 0.8, 1.6, 0.8, 0.4])
np.random.choice(w, p=w/sum(w))

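Here is another version of weighted_choice that uses numpy. Pass in the weights vector and it returns an array of 0s containing a single 1 that indicates which bin was chosen. The code defaults to a single draw, but you can pass in the number of draws to make, and the counts per bin will be returned.

If the weights vector does not sum to 1, it is normalized so that it does.

import numpy as np

def weighted_choice(weights, n=1):
    # accept plain lists as well as arrays
    weights = np.asarray(weights, dtype=float)
    if np.sum(weights) != 1:
        weights = weights / np.sum(weights)

    draws = np.random.random_sample(size=n)

    # cumulative weights, with a leading 0.0, serve as the histogram bin edges
    weights = np.cumsum(weights)
    weights = np.insert(weights, 0, 0.0)

    counts = np.histogram(draws, bins=weights)
    return counts[0]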

A general solution:

import random

def weighted_choice(choices, weights):
    total = sum(weights)
    threshold = random.uniform(0, total)
    for k, weight in enumerate(weights):
        total -= weight
        if total <= threshold:
            return choices[k]
    # Guard against rounding leaving total slightly above the threshold.
    return choices[-1]

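It depends on how many times you want to sample the distribution.

Suppose you want to sample the distribution K times. Then the time complexity of calling np.random.choice() each time is O(K(n + log(n))), where n is the number of items in the distribution.

In my case, I needed to sample the same distribution multiple times, on the order of 10^3 times, with n on the order of 10^6. I used the code below, which precomputes the cumulative distribution and samples it in O(log(n)). The overall time complexity is O(n + K*log(n)).

import numpy as np

n, k = 10**6, 10**3

# Create dummy distribution
a = np.array([i+1 for i in range(n)])
p = np.array([1.0/n]*n)

cfd = p.cumsum()
for _ in range(k):
    x = np.random.uniform()
    # clamp in case rounding leaves cfd[-1] slightly below x
    idx = min(cfd.searchsorted(x, side='right'), n - 1)
    sampled_element = a[idx]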
I looked at the other thread that was pointed to and came up with this variation in my own coding style. It returns the index of the choice for tallying purposes, but it is simple to return the string instead (see the commented-out return alternative):

import random
import bisect

try:
    range = xrange
except NameError:
    pass

def weighted_choice(choices):
    total, cumulative = 0, []
    for c,w in choices:
        total += w
        cumulative.append((total, c))
    r = random.uniform(0, total)
    # return index
    return bisect.bisect(cumulative, (r,))
    # return item string
    #return choices[bisect.bisect(cumulative, (r,))][0]

# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

tally = [0 for item in choices]

n = 100000
# tally up n weighted choices
for i in range(n):
    tally[weighted_choice(choices)] += 1

print([t/sum(tally)*100 for t in tally])

Provide random.choice() with a pre-weighted list:

Solution and test:

import random

options = ['a', 'b', 'c', 'd']
weights = [1, 2, 5, 2]

weighted_options = [[opt]*wgt for opt, wgt in zip(options, weights)]
weighted_options = [opt for sublist in weighted_options for opt in sublist]
print(weighted_options)

# test

counts = {c: 0 for c in options}
for x in range(10000):
    counts[random.choice(weighted_options)] += 1

for opt, wgt in zip(options, weights):
    wgt_r = counts[opt] / 10000 * sum(weights)
    print(opt, counts[opt], wgt, wgt_r)

Output:

['a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd']
a 1025 1 1.025
b 1948 2 1.948
c 5019 5 5.019
d 2008 2 2.008

One way is to randomize on the total of all the weights and then use the cumulative values as the limit points for each item. Here is a crude implementation as a generator.

import random

def rand_weighted(weights):
    """
    Generator which uses the weights to generate a
    weighted random values
    """
    sum_weights = sum(weights.values())
    cum_weights = {}
    current_weight = 0
    for key, value in sorted(weights.items()):
        current_weight += value
        cum_weights[key] = current_weight
    while True:
        # truncating to int means this expects integer weights
        sel = int(random.uniform(0, 1) * sum_weights)
        for key, value in sorted(cum_weights.items()):
            if sel < value:
                break
        yield key

Using numpy:

import numpy as np

def choice(items, weights):
    return items[np.argmin((np.cumsum(weights) / sum(weights)) < np.random.rand())]

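I needed to do something like this quickly and very simply. After searching for ideas, I finally built this template. The idea is to receive the weighted values as JSON from an API, which is simulated here by a dict.

It is then translated into a list in which each value is repeated in proportion to its weight, and random.choice simply selects a value from that list.

I tried running it with 10, 100 and 1000 iterations. The distribution seems pretty solid.

import random

def weighted_choice(weighted_dict):
    """Input example: dict(apples=60, oranges=30, pineapples=10)"""
    weight_list = []
    for key in weighted_dict.keys():
        weight_list += [key] * weighted_dict[key]
    return random.choice(weight_list)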

I didn't love the syntax of any of those. I really just wanted to specify what the items are and what the weight of each one is. I realize I could have used random.choices, but instead I quickly wrote the class below.

import math
import random
from numpy import cumsum

class randomChoiceWithProportions:
    '''
    Accepts a dictionary of choices as keys and weights as values. Example if you want a unfair dice:

    choiceWeightDic = {"1":0.16666666666666666,"2": 0.16666666666666666,"3": 0.16666666666666666
    ,"4": 0.16666666666666666,"5": .06666666666666666,"6": 0.26666666666666666}
    dice = randomChoiceWithProportions(choiceWeightDic)

    samples = []
    for i in range(100000):
        samples.append(dice.sample())

    # Should be close to .26666
    samples.count("6")/len(samples)

    # Should be close to .16666
    samples.count("1")/len(samples)
    '''
    def __init__(self, choiceWeightDic):
        self.choiceWeightDic = choiceWeightDic
        weightSum = sum(self.choiceWeightDic.values())
        # compare with a tolerance: float weights rarely sum to exactly 1
        assert math.isclose(weightSum, 1.0), 'Weights sum to ' + str(weightSum) + ', not 1.'
        self.valWeightDict = self._compute_valWeights()

    def _compute_valWeights(self):
        valWeights = list(cumsum(list(self.choiceWeightDic.values())))
        valWeightDict = dict(zip(list(self.choiceWeightDic.keys()), valWeights))
        return valWeightDict

    def sample(self):
        num = random.uniform(0,1)
        for key, val in self.valWeightDict.items():
            if val >= num:
                return key