A weighted version of random.choice

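I needed to write a weighted version of random.choice (each element in the list has a different probability of being selected). This is what I came up with: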

import random

def weightedChoice(choices):
    """Like random.choice, but each element can have a different chance of
    being selected.

    choices can be any iterable containing iterables with two items each.
    Technically, they can have more than two items, the rest will just be
    ignored.  The first item is the thing being chosen, the second item is
    its weight.  The weights can be any numeric values, what matters is the
    relative differences between them.
    """
    space = {}
    current = 0
    for choice, weight in choices:
        if weight > 0:
            space[current] = choice
            current += weight
    rand = random.uniform(0, current)
    # list(space) works on both Python 2 and 3 (dict.keys() is a view in 3)
    for key in sorted(list(space) + [current]):
        if rand < key:
            return choice
        choice = space[key]
    return None

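This function seems overly complex to me, and ugly. I'm hoping everyone here can offer some suggestions for improving it or alternative ways of doing this. Efficiency isn't as important to me as code cleanliness and readability.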


Since version 1.7.0, NumPy has a choice function that supports probability distributions.

from numpy.random import choice
draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)

Note that probability_distribution is a sequence in the same order as list_of_candidates. You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.


Since Python 3.6, the random module provides the method choices:

Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import random

In [2]: random.choices(
...:     population=[['a','b'], ['b','a'], ['c','b']],
...:     weights=[0.2, 0.2, 0.6],
...:     k=10
...: )

Out[2]:
[['c', 'b'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['b', 'a'],
 ['c', 'b'],
 ['c', 'b']]

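People have also mentioned that numpy.random.choice supports weights, but it doesn't support 2d arrays, and so on.

So, basically you can get whatever you like (see the update) with the builtin random.choices if you have Python 3.6.x.

UPDATE:
As @roganjosh kindly mentioned, random.choices cannot return values without replacement, as mentioned in the docs: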

Return a k sized list of elements chosen from the population with replacement.

@ronan-paixão's brilliant answer states that numpy.random.choice has a replace argument that controls this behaviour.
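If NumPy is not an option, weighted sampling without replacement can be approximated with the standard library alone. The sketch below is only one possible approach, not something taken from the answers here: the helper name weighted_sample_without_replacement is hypothetical, it assumes Python 3.6+ and strictly positive weights, and it simply removes each drawn item before the next draw.

```python
import random

def weighted_sample_without_replacement(population, weights, k):
    """Hypothetical helper: draw k distinct items, favouring larger weights."""
    items = list(population)
    remaining = list(weights)
    if k > len(items):
        raise ValueError("k cannot exceed the population size")
    picked = []
    for _ in range(k):
        # Pick one index in proportion to the weights still in play
        i = random.choices(range(len(items)), weights=remaining, k=1)[0]
        # Remove the drawn item and its weight so it cannot be drawn again
        picked.append(items.pop(i))
        remaining.pop(i)
    return picked

picked_items = weighted_sample_without_replacement(["a", "b", "c", "d"], [5, 3, 1, 1], 3)
print(picked_items)
```

Each draw costs O(n), so this is meant for small populations; it trades speed for readability.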


import random

def weighted_choice(choices):
    total = sum(w for c, w in choices)
    r = random.uniform(0, total)
    upto = 0
    for c, w in choices:
        if upto + w >= r:
            return c
        upto += w
    assert False, "Shouldn't get here"


  • Arrange the weights into a cumulative distribution.
  • Use random.random() to pick a random float 0.0 <= x < total.
  • Search the distribution using bisect.bisect, as shown in the example at
    http://docs.python.org/dev/library/bisect.html#other-examples.

    from random import random
    from bisect import bisect

    def weighted_choice(choices):
        values, weights = zip(*choices)
        total = 0
        cum_weights = []
        for w in weights:
            total += w
            cum_weights.append(total)
        x = random() * total
        i = bisect(cum_weights, x)
        return values[i]

    >>> weighted_choice([("WHITE",90), ("RED",8), ("GREEN",2)])
    'WHITE'

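    If you need to make more than one choice, split this into two functions, one to build the cumulative weights and another to bisect to a random point.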
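The split suggested above could look roughly like this. It is a minimal sketch: the function names build_cum_weights and pick are made up for illustration, and itertools.accumulate assumes Python 3.2+.

```python
from bisect import bisect
from itertools import accumulate
from random import random

def build_cum_weights(choices):
    """Run once: separate the values and accumulate their weights."""
    values, weights = zip(*choices)
    return values, list(accumulate(weights))

def pick(values, cum_weights):
    """Run per draw: bisect to a random point in the cumulative weights."""
    x = random() * cum_weights[-1]
    # hi is capped at the last index so round-off can never overrun the list
    return values[bisect(cum_weights, x, 0, len(cum_weights) - 1)]

values, cum_weights = build_cum_weights([("WHITE", 90), ("RED", 8), ("GREEN", 2)])
draws = [pick(values, cum_weights) for _ in range(5)]
print(draws)
```

The O(n) accumulation is paid once, and every later draw is an O(log n) binary search.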


    If you don't mind using numpy, you can use numpy.random.choice.

    For example:

    import numpy

    items = [["item1", 0.2], ["item2", 0.3], ["item3", 0.45], ["item4", 0.05]]
    elems = [i[0] for i in items]
    probs = [i[1] for i in items]

    trials = 1000
    results = [0] * len(items)
    for i in range(trials):
        res = numpy.random.choice(elems, p=probs)  # This is where the item is selected!
        results[elems.index(res)] += 1
    results = [r / float(trials) for r in results]
    print("item\texpected\tactual")
    for i in range(len(probs)):
        print("%s\t%0.4f\t%0.4f" % (elems[i], probs[i], results[i]))

    If you know how many selections you need to make in advance, you can do it without a loop like this:

    numpy.random.choice(elems, trials, p=probs)

    Crude, but may be sufficient:

    import random
    weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))

    Does it work?

    # define choices and relative weights
    choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

    # initialize tally dict, keyed by the choice names
    tally = dict.fromkeys((c for c, _ in choices), 0)

    # tally up 1000 weighted choices
    for i in range(1000):
        tally[weighted_choice(choices)] += 1

    print(list(tally.items()))

    Prints something like:

    [('WHITE', 904), ('GREEN', 22), ('RED', 74)]

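    This assumes that all weights are integers. They don't have to add up to 100; I just did that to make the test results easier to interpret. (If the weights are floating point numbers, multiply them all by 10 repeatedly until all weights >= 1.)

    weights = [.6, .2, .001, .199]
    while any(w < 1.0 for w in weights):
        weights = [w*10 for w in weights]
    # round before truncating so float error such as 599.999... becomes 600
    weights = [int(round(w)) for w in weights]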


    If you have a weighted dictionary instead of a list, you can write this:

    items = {"a": 10,"b": 5,"c": 1 }
    random.choice([k for k in items for dummy in range(items[k])])

    Note that [k for k in items for dummy in range(items[k])] produces this list: ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b', 'b', 'b']
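On Python 3.6 or later, the same weighted dictionary can be passed to random.choices without building the expanded list first. This is a sketch of an alternative, not the code from the answer above; it also works when the weights are not integers.

```python
import random

items = {"a": 10, "b": 5, "c": 1}

# Keys are the population; values are the matching relative weights.
# Both come from the same dict, so their order lines up.
picks = random.choices(list(items), weights=list(items.values()), k=5)
print(picks)
```

Memory use stays proportional to the number of keys rather than to the sum of the weights.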


    As of Python v3.6, random.choices can be used to return a list of elements of a specified size from the given population, with optional weights.

    random.choices(population, weights=None, *, cum_weights=None, k=1)

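    • population: list containing unique observations. (If empty, raises IndexError)

    • weights: the relative weights used to make selections.

    • cum_weights: the cumulative weights used to make selections.

    • k: size (len) of the list to be output. (Default len()=1)

    Few Caveats:

    1) It makes use of weighted sampling with replacement, so the drawn items are later replaced. The values in the weights sequence in themselves do not matter, but their relative ratio does.

    Unlike np.random.choice, which only accepts probabilities as weights and requires them to sum to 1, there are no such restrictions here. As long as the weights belong to numeric types (int/float/fraction, but not Decimal), they will work.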

    >>> import random
    # weights being integers
    >>> random.choices(["white","green","red"], [12, 12, 4], k=10)
    ['green', 'red', 'green', 'white', 'white', 'white', 'green', 'white', 'red', 'white']
    # weights being floats
    >>> random.choices(["white","green","red"], [.12, .12, .04], k=10)
    ['white', 'white', 'green', 'green', 'red', 'red', 'white', 'green', 'white', 'green']
    # weights being fractions
    >>> random.choices(["white","green","red"], [12/100, 12/100, 4/100], k=10)
    ['green', 'green', 'white', 'red', 'green', 'red', 'white', 'green', 'green', 'green']
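The note about numeric types can be checked directly. The sketch below assumes CPython 3.6 or later, where Decimal weights fail because Decimal and float cannot be mixed in arithmetic; it uses real fractions.Fraction objects (12/100 in the session above is already a float in Python 3). The variable names are illustrative only.

```python
import random
from decimal import Decimal
from fractions import Fraction

colours = ["white", "green", "red"]

# Fraction weights interoperate with the floats returned by random()
frac_weights = [Fraction(12, 100), Fraction(12, 100), Fraction(4, 100)]
frac_pick = random.choices(colours, frac_weights, k=3)

# Decimal weights do not: mixing Decimal and float raises TypeError
dec_weights = [Decimal("0.12"), Decimal("0.12"), Decimal("0.04")]
try:
    random.choices(colours, dec_weights, k=3)
    decimal_ok = True
except TypeError:
    decimal_ok = False

# Converting to float first is one way around it
float_pick = random.choices(colours, [float(w) for w in dec_weights], k=3)
print(frac_pick, decimal_ok, float_pick)
```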

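    2) If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence.

    Specifying both weights and cum_weights raises a TypeError.

    >>> random.choices(["white","green","red"], k=10)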
    ['white', 'white', 'green', 'red', 'red', 'red', 'white', 'white', 'white', 'green']
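The two error cases from caveat 2 can be reproduced with a short script. This is only an illustration, assuming Python 3.6+; the names both_error and length_error are chosen for the example.

```python
import random

population = ["white", "green", "red"]

# Supplying both weights and cum_weights is rejected
try:
    random.choices(population, weights=[12, 12, 4], cum_weights=[12, 24, 28], k=2)
except TypeError as exc:
    both_error = exc

# A weights sequence of the wrong length is rejected as well
try:
    random.choices(population, weights=[12, 12], k=2)
except ValueError as exc:
    length_error = exc

print(type(both_error).__name__, type(length_error).__name__)
```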

    3) cum_weights are typically the result of the itertools.accumulate function, which is really handy in such situations.

    From the documentation linked:

    Internally, the relative weights are converted to cumulative weights
    before making selections, so supplying the cumulative weights saves
    work.

    So, supplying either weights=[12, 12, 4] or cum_weights=[12, 24, 28] for our contrived case produces the same outcome, and the latter seems to be faster / more efficient.
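One way to see that the two forms are interchangeable is to draw from two identically seeded generators. This is a small sketch assuming Python 3.6+; the seed value 42 is arbitrary.

```python
import random
from itertools import accumulate

population = ["white", "green", "red"]
weights = [12, 12, 4]
cum_weights = list(accumulate(weights))  # running totals of the weights

# Two generators with the same seed produce the same stream of random()
by_weights = random.Random(42).choices(population, weights=weights, k=10)
by_cum = random.Random(42).choices(population, cum_weights=cum_weights, k=10)

print(by_weights == by_cum)
```

Passing cum_weights skips the internal accumulate step, which matters mostly when the same weights are reused across many calls.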


    Here is the version that is included in the standard library for Python 3.6:

    import random
    import itertools as _itertools
    import bisect as _bisect

    class Random36(random.Random):
        "Show the code included in the Python 3.6 version of the Random class"

        def choices(self, population, weights=None, *, cum_weights=None, k=1):
            """Return a k sized list of population elements chosen with replacement.

            If the relative weights or cumulative weights are not specified,
            the selections are made with equal probability.

            """

            random = self.random
            if cum_weights is None:
                if weights is None:
                    _int = int
                    total = len(population)
                    return [population[_int(random() * total)] for i in range(k)]
                cum_weights = list(_itertools.accumulate(weights))
            elif weights is not None:
                raise TypeError('Cannot specify both weights and cumulative weights')
            if len(cum_weights) != len(population):
                raise ValueError('The number of weights does not match the population')
            bisect = _bisect.bisect
            total = cum_weights[-1]
            return [population[bisect(cum_weights, random() * total)] for i in range(k)]

    Source: https://hg.python.org/cpython/file/tip/Lib/random.py#l340


    I'd require the sum of choices to be 1, but this works anyway:

    import random

    def weightedChoice(choices):
        # Safety check, you can remove it
        for c,w in choices:
            assert w >= 0

        # Draw a point in [0, total weight]; sum the weights, not the items
        tmp = random.uniform(0, sum(w for c,w in choices))
        for choice,weight in choices:
            if tmp < weight:
                return choice
            else:
                tmp -= weight
        raise ValueError('Negative values in input')


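    I'm probably too late to contribute anything useful, but here's a simple, short, and very efficient snippet:

    import random

    def choose_index(probabilities):
        cmf = 0.0
        choice = random.random()
        for k, p in enumerate(probabilities):
            cmf += p  # running cumulative mass
            if choice <= cmf:
                return k
        # Floating-point round-off can leave the final cmf just below 1
        return len(probabilities) - 1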

    No need to sort your probabilities or create a vector with your cmf, and it terminates once it finds its choice. Memory: O(1), time: O(N), with average running time ~ N/2.

    If you have weights, simply add one line:

    def choose_index(weights):
        total = float(sum(weights))  # the added line: normalizing constant
        cmf = 0.0
        choice = random.random()
        for k, w in enumerate(weights):
            cmf += w / total
            if choice <= cmf:
                return k
        return len(weights) - 1

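    If your list of weighted choices is relatively static, and you want frequent sampling, you can do one O(N) preprocessing step, and then do the selection in O(1), using the functions in this related answer.

    # run only when `choices` changes.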
    preprocessed_data = prep(weight for _,weight in choices)

    # O(1) selection
    value = choices[sample(preprocessed_data)][0]
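The linked answer that defines prep and sample is not reproduced here. One well-known technique with exactly this O(N) setup and O(1) draw cost is the alias method (Vose's variant), so the sketch below shows what such a pair could look like. It is an assumption, not the code of the related answer, and it expects non-negative weights with a positive sum.

```python
import random

def prep(weights):
    """Build alias tables in O(N) from relative weights (hypothetical helper)."""
    weights = list(weights)
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]  # average becomes 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]   # chance of keeping slot s itself
        alias[s] = l          # otherwise fall through to l
        scaled[l] = scaled[l] + scaled[s] - 1.0  # l donated the remainder
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:   # leftovers are 1.0 up to round-off
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def sample(tables):
    """Draw one index in O(1): pick a slot, then keep it or take its alias."""
    prob, alias = tables
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

choices = [("WHITE", 90), ("RED", 8), ("GREEN", 2)]
tables = prep(weight for _, weight in choices)
value = choices[sample(tables)][0]
print(value)
```

The tables only need rebuilding when the weights change, which is what makes this attractive for a static list that is sampled often.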

    import numpy as np
    w=np.array([ 0.4,  0.8,  1.6,  0.8,  0.4])
    np.random.choice(w, p=w/sum(w))

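    Here is another version of weighted_choice that uses numpy. Pass in the weights vector and it will return an array of 0's containing a 1 indicating which bin was chosen. The code defaults to making just a single draw, but you can pass in the number of draws to be made and the counts per bin drawn will be returned.

    If the weights vector does not sum to 1, it will be normalized so that it does.

    import numpy as np

    def weighted_choice(weights, n=1):
        # Accept plain lists as well as arrays, then normalize to sum to 1
        weights = np.asarray(weights, dtype=float)
        weights = weights / np.sum(weights)

        draws = np.random.random_sample(size=n)

        # Bin edges: 0 followed by the cumulative weights
        edges = np.insert(np.cumsum(weights), 0, 0.0)

        counts = np.histogram(draws, bins=edges)
        return counts[0]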

    A general solution:

    import random
    def weighted_choice(choices, weights):
        total = sum(weights)
        threshold = random.uniform(0, total)
        for k, weight in enumerate(weights):
            total -= weight
            if total < threshold:
                return choices[k]

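    It depends on how many times you want to sample the distribution.

    Suppose you want to sample the distribution K times. Then the time complexity of calling np.random.choice() each time is O(K(n + log(n))), where n is the number of items in the distribution.

    In my case, I needed to sample the same distribution multiple times, on the order of 10^3 times, with n on the order of 10^6. I used the code below, which precomputes the cumulative distribution and samples it in O(log(n)). The overall time complexity is O(n + K*log(n)).

    import numpy as np

    n,k = 10**6,10**3

    # Create dummy distribution
    a = np.array([i+1 for i in range(n)])
    p = np.array([1.0/n]*n)

    cdf = p.cumsum()  # computed once, O(n)
    for _ in range(k):
        x = np.random.uniform()
        # Binary search, O(log n); clamp in case round-off leaves cdf[-1] < 1
        idx = min(cdf.searchsorted(x, side='right'), n - 1)
        sampled_element = a[idx]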
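    I looked at the other thread pointed to and came up with this variation in my coding style. It returns the index of the choice for the purpose of tallying, but it is simple to return the string instead (see the commented-out return alternative):

    import random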
    import bisect

    try:
        range = xrange  # Python 2: use the lazy range
    except NameError:
        pass  # Python 3: range is already lazy

    def weighted_choice(choices):
        total, cumulative = 0, []
        for c,w in choices:
            total += w
            cumulative.append((total, c))
        r = random.uniform(0, total)
        # return index
        return bisect.bisect(cumulative, (r,))
        # return item string
        #return choices[bisect.bisect(cumulative, (r,))][0]

    # define choices and relative weights
    choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

    tally = [0 for item in choices]

    n = 100000
    # tally up n weighted choices
    for i in range(n):
        tally[weighted_choice(choices)] += 1

    print([t/sum(tally)*100 for t in tally])

    Provide random.choice() with a pre-weighted list:

    Solution & Test:

    import random

    options = ['a', 'b', 'c', 'd']
    weights = [1, 2, 5, 2]

    weighted_options = [[opt]*wgt for opt, wgt in zip(options, weights)]
    weighted_options = [opt for sublist in weighted_options for opt in sublist]
    print(weighted_options)

    # test

    counts = {c: 0 for c in options}
    for x in range(10000):
        counts[random.choice(weighted_options)] += 1

    for opt, wgt in zip(options, weights):
        wgt_r = counts[opt] / 10000 * sum(weights)
        print(opt, counts[opt], wgt, wgt_r)

    Output:

    ['a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd']
    a 1025 1 1.025
    b 1948 2 1.948
    c 5019 5 5.019
    d 2008 2 2.008

    One way is to randomize on the total of all the weights and then use the values as the limit points for each var. Here is a crude implementation as a generator.

    import random

    def rand_weighted(weights):
        """
        Generator which uses the weights to generate a
        weighted random values
        """
        sum_weights = sum(weights.values())
        cum_weights = {}
        current_weight = 0
        for key, value in sorted(weights.items()):
            current_weight += value
            cum_weights[key] = current_weight
        while True:
            sel = int(random.uniform(0, 1) * sum_weights)
            for key, value in sorted(cum_weights.items()):
                if sel < value:
                    break
            yield key

    Using numpy:

    import numpy as np

    def choice(items, weights):
        return items[np.argmin((np.cumsum(weights) / sum(weights)) < np.random.rand())]

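    I needed to do something like this really fast and really simply; after searching for ideas I finally built this template. The idea is to receive the weighted values as JSON from an API, which is simulated here by the dict.

    Then translate it into a list in which each value repeats proportionally to its weight, and just use random.choice to select a value from the list.

    I tried running it with 10, 100 and 1000 iterations. The distribution seems pretty solid.

    import random

    def weighted_choice(weighted_dict):
        """Input example: dict(apples=60, oranges=30, pineapples=10)"""
        weight_list = []
        for key in weighted_dict.keys():
            weight_list += [key] * weighted_dict[key]
        return random.choice(weight_list)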

    I didn't love the syntax of any of those. I really wanted to just specify what the items were and what the weighting of each was. I realize I could have used random.choices, but instead I quickly wrote the class below.

    import random
    from numpy import cumsum

    class randomChoiceWithProportions:
        '''
        Accepts a dictionary of choices as keys and weights as values. Example if you want a unfair dice:


        choiceWeightDic = {"1":0.16666666666666666,"2": 0.16666666666666666,"3": 0.16666666666666666
        ,"4": 0.16666666666666666,"5": .06666666666666666,"6": 0.26666666666666666}
        dice = randomChoiceWithProportions(choiceWeightDic)

        samples = []
        for i in range(100000):
            samples.append(dice.sample())

        # Should be close to .26666
        samples.count("6")/len(samples)

        # Should be close to .16666
        samples.count("1")/len(samples)
        '''

        def __init__(self, choiceWeightDic):
            self.choiceWeightDic = choiceWeightDic
            weightSum = sum(self.choiceWeightDic.values())
            # Compare with a tolerance: float weights rarely sum to exactly 1
            assert abs(weightSum - 1) < 1e-9, 'Weights sum to ' + str(weightSum) + ', not 1.'
            self.valWeightDict = self._compute_valWeights()

        def _compute_valWeights(self):
            valWeights = list(cumsum(list(self.choiceWeightDic.values())))
            valWeightDict = dict(zip(list(self.choiceWeightDic.keys()), valWeights))
            return valWeightDict

        def sample(self):
            num = random.uniform(0,1)
            for key, val in self.valWeightDict.items():
                if val >= num:
                    return key
            # Round-off can leave the last cumulative weight just below num
            return key