Generate random numbers with a given (numerical) distribution

Question

I have a file with some probabilities for different values:

1 0.1
2 0.05
3 0.05
4 0.2
5 0.4
6 0.2

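I would like to generate random numbers using this distribution. Does an existing module that handles this exist? It is fairly simple to code yourself (build the cumulative distribution function, generate a random value in [0,1] and pick the corresponding value), but this seems like a common problem, and someone has probably already written a module for it.

I need this because I want to generate a list of birthdays (which do not follow any distribution in the standard random module).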

Sven Marnach · Accepted Answer

scipy.stats.rv_discrete might be what you want. You can supply your probabilities via the values parameter. You can then use the rvs() method of the distribution object to generate random numbers.
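A minimal sketch of that approach, assuming SciPy and NumPy are installed. The variable names and the distribution name "custom" are illustrative; the values and probabilities are the ones from the question.

```python
import numpy as np
from scipy import stats

# Values and probabilities from the question.
values = np.arange(1, 7)
probabilities = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

# rv_discrete takes a (values, probabilities) pair via the `values` parameter.
distribution = stats.rv_discrete(name="custom", values=(values, probabilities))

# Draw 10 random numbers that follow the distribution.
samples = distribution.rvs(size=10)
print(samples)
```

The values passed to rv_discrete have to be integers, which is the case here.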

As pointed out by Eugene Pakhomov in the comments, you can also pass a p keyword parameter to numpy.random.choice(), e.g.

numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])

If you are using Python 3.6 or above, you can use random.choices() from the standard library; see the answer by Mark Dickinson.

Mark Dickinson · Answer

Since Python 3.6, there is a solution for this in Python's standard library, namely random.choices.

Example usage: let's set up a population and weights matching those in the OP's question:

>>> from random import choices
>>> population = [1, 2, 3, 4, 5, 6]
>>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

choices(population, weights) generates a single sample, returned as a one-element list:

>>> choices(population, weights)
[4]

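The optional keyword-only argument k lets you request more than one sample at once. This is valuable because random.choices has some preparatory work to do every time it is called, before it generates any samples; by generating many samples at once, that preparatory work only has to be done once. Here we generate a million samples and use collections.Counter to check that the distribution we get roughly matches the weights we gave.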

>>> million_samples = choices(population, weights, k=10**6)
>>> from collections import Counter
>>> Counter(million_samples)
Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})
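If the samples really have to be drawn one call at a time, part of the per-call work can be moved up front: random.choices also accepts a cum_weights argument holding cumulative weights. A minimal sketch, with illustrative variable names:

```python
from itertools import accumulate
from random import choices

population = [1, 2, 3, 4, 5, 6]
weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

# Accumulate the weights once, up front.
cum_weights = list(accumulate(weights))

# Each call reuses the cumulative weights instead of re-accumulating `weights`.
samples = [choices(population, cum_weights=cum_weights)[0] for _ in range(1000)]
print(samples[:10])
```

A single call with k=1000 is still likely to be the simpler choice whenever the number of samples is known in advance.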

sdcvvc · Answer

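An advantage of generating the list using the CDF is that you can use binary search. Preprocessing needs O(n) time and space, but you can then get k numbers in O(k log n). Since normal Python lists are inefficient, you can use the array module.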
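A possible sketch of the CDF plus binary search idea, using only the standard library. The helper names build_cdf and sample are hypothetical, chosen for this illustration; they are not part of any module.

```python
import bisect
import itertools
import random
from array import array

def build_cdf(probabilities):
    """Return the running totals of `probabilities` as a compact array of doubles."""
    return array("d", itertools.accumulate(probabilities))

def sample(values, cdf, k):
    """Draw k values using one binary search per draw: O(k log n)."""
    total = cdf[-1]
    result = []
    for _ in range(k):
        r = random.random() * total  # uniform in [0, total)
        # Index of the first running total strictly greater than r.
        i = bisect.bisect_right(cdf, r)
        # The min() guards against r rounding up to `total`.
        result.append(values[min(i, len(values) - 1)])
    return result

values = [1, 2, 3, 4, 5, 6]
cdf = build_cdf([0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
print(sample(values, cdf, 10))
```

Scaling r by the last running total means the probabilities do not have to sum to exactly 1.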

If you insist on constant space, you can do the following: O(n) time, O(1) space.

import random

def random_distr(l):
    r = random.uniform(0, 1)
    s = 0
    for item, prob in l:
        s += prob
        if s >= r:
            return item
    return item  # Might occur because of floating point inaccuracies

Ramon Martinez · Answer

Maybe it is kind of late. But you can use numpy.random.choice(), passing the p parameter:

val = numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])

Marcelo Cantos · Answer

(OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)

import random

pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
cdf = [(i, sum(p for j,p in pdf if j < i)) for i,_ in pdf]
R = max(i for r in [random.random()] for i,c in cdf if c <= r)

I pseudo-confirmed that this works by eyeballing the output of this expression:

sorted(max(i for r in [random.random()] for i,c in cdf if c <= r) for _ in range(1000))

Markus Dutschke · Answer

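I wrote a solution for drawing random samples from a custom continuous distribution.

I needed this for a use case similar to yours (i.e. generating random dates with a given probability distribution).

You just need the function random_custDist and the line samples=random_custDist(x0,x1,custDist=custDist,size=1000). The rest is decoration ^^.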

import numpy as np

#function
def random_custDist(x0,x1,custDist,size=None, nControl=10**6):
    #generate a list of size random samples, obeying the distribution custDist
    #suggests random samples between x0 and x1 and accepts the suggestion with probability custDist(x)
    #custDist does not need to be normalized. Add this condition to increase performance.
    #Best performance for max_{x in [x0,x1]} custDist(x) = 1
    samples=[]
    nLoop=0
    while len(samples)<size and nLoop<nControl:
        x=np.random.uniform(low=x0,high=x1)
        prop=custDist(x)
        assert prop>=0 and prop<=1
        if np.random.uniform(low=0,high=1) <=prop:
            samples += [x]
        nLoop+=1
    return samples

#call
x0=2007
x1=2019
def custDist(x):
    if x<2010:
        return .3
    else:
        return (np.exp(x-2008)-1)/(np.exp(2019-2007)-1)
samples=random_custDist(x0,x1,custDist=custDist,size=1000)
print(samples)

#plot
import matplotlib.pyplot as plt
#hist
bins=np.linspace(x0,x1,int(x1-x0+1))
hist=np.histogram(samples, bins )[0]
hist=hist/np.sum(hist)
plt.bar( (bins[:-1]+bins[1:])/2, hist, width=.96, label='sample distribution')
#dist
grid=np.linspace(x0,x1,100)
discCustDist=np.array([custDist(x) for x in grid]) #discrete version
discCustDist*=1/(grid[1]-grid[0])/np.sum(discCustDist)
plt.plot(grid,discCustDist,label='custom distribution (custDist)', color='C1', linewidth=4)
#decoration
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()

The performance of this solution can certainly be improved, but I prefer readability.

khachik · Answer

Make a list of items, based on their weights:

import random

items = [1, 2, 3, 4, 5, 6]
probabilities = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
# if the list of probs is normalized (sum(probs) == 1), omit this part
prob = sum(probabilities)  # find sum of probs, to normalize them
c = (1.0)/prob  # a multiplier to make a list of normalized probs
probabilities = [c*x for x in probabilities]
print(probabilities)

# number of decimal places needed to turn every probability into an integer
ml = max(probabilities, key=lambda x: len(str(x)) - str(x).find('.'))
ml = len(str(ml)) - str(ml).find('.') - 1
# round() rather than plain int() so that e.g. 28.999999 becomes 29, not 28
amounts = [int(round(x*(10**ml))) for x in probabilities]
itemsList = list()
for i in range(0, len(items)):  # iterate through original items
    itemsList += items[i:i+1]*amounts[i]

# choose from itemsList randomly
print(itemsList)
print(random.choice(itemsList))

An optimization may be to normalize the amounts by their greatest common divisor, to make the target list smaller.
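A sketch of that optimization, assuming the amounts have already been computed as integer counts. The numbers below are what the question's distribution gives at two decimal places, and the names divisor, reduced and items_list are illustrative.

```python
import random
from functools import reduce
from math import gcd

items = [1, 2, 3, 4, 5, 6]
amounts = [10, 5, 5, 20, 40, 20]  # assumed integer counts for the question's distribution

# Divide every count by the greatest common divisor of all counts.
divisor = reduce(gcd, amounts)
reduced = [a // divisor for a in amounts]  # [2, 1, 1, 4, 8, 4]

# Build the (now smaller) lookup list and pick from it uniformly.
items_list = [item for item, n in zip(items, reduced) for _ in range(n)]
print(len(items_list))  # 20 instead of 100
print(random.choice(items_list))
```

The relative proportions are unchanged, so the sampled distribution stays the same while the list shrinks by the factor divisor.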

Also, this might be interesting.

Saksham Varma · Answer

from __future__ import division
import random
from collections import Counter

def num_gen(num_probs):
    # calculate minimum probability to normalize
    min_prob = min(prob for num, prob in num_probs)
    lst = []
    for num, prob in num_probs:
        # keep appending num to lst, proportional to its probability in the distribution
        for _ in range(int(prob/min_prob)):
            lst.append(num)
    # all elems in lst occur proportional to their distribution probabilities
    while True:
        # pick a random index from lst
        ind = random.randint(0, len(lst)-1)
        yield lst[ind]

Verification:

gen = num_gen([(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)])
lst = []
times = 10000
for _ in range(times):
    lst.append(next(gen))

# Verify the created distribution:
for item, count in Counter(lst).items():
    print('%d has %f probability' % (item, count/times))

1 has 0.099737 probability
2 has 0.050022 probability
3 has 0.049996 probability
4 has 0.200154 probability
5 has 0.399791 probability
6 has 0.200300 probability

Lucas Moeskops · Answer

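Another answer, probably faster :)

import bisect
import random

distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]

# init distribution
dlist = []
sumchance = 0
for value, chance in distribution:
    sumchance += chance
    dlist.append((value, sumchance))
# compare with a tolerance: exact float equality is not a good assert
assert abs(sumchance - 1.0) < 1e-9
sums = [s for _, s in dlist]  # cumulative chances only, for the binary search

def get_random_value():
    r = random.random()
    # for small distributions use linear search
    if len(distribution) < 64:  # don't know exact speed limit
        for value, sumchance in dlist:
            if r < sumchance:
                return value
        return dlist[-1][0]  # guard against floating point rounding
    else:
        # binary search (left unimplemented in the original; filled in here with bisect)
        i = bisect.bisect_right(sums, r)
        return dlist[min(i, len(dlist) - 1)][0]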

Manuel Salvadores · Answer

You might want to have a look at NumPy's random sampling distributions.

Muayyad Alsadi · Answer

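Based on the other solutions, you generate the cumulative distribution (as integers or floats, whichever you like), then you can use bisect to make it fast.

This is a simple example (I used integers here):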

import bisect
import random

l=[(20, 'foo'), (60, 'banana'), (10, 'monkey'), (10, 'monkey2')]

def get_cdf(l):
    ret=[]
    c=0
    for i in l:
        c+=i[0]
        ret.append((c, i[1]))
    return ret

def get_random_item(cdf):
    # draw from 1..total so that each item gets exactly its weight
    return cdf[bisect.bisect_left(cdf, (random.randint(1, cdf[-1][0]),))][1]

cdf=get_cdf(l)
for i in range(100):
    print(get_random_item(cdf), end=' ')

The get_cdf function converts 20, 60, 10, 10 into 20, 20+60, 20+60+10, 20+60+10+10.

Now we pick a random number from 1 up to 20+60+10+10 using random.randint, then use bisect to get the actual value quickly.

Cris Stringfellow · Answer

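None of these answers is particularly clear or simple.

Here is a clear, simple method that is guaranteed to work.

accumulate_normalize_values takes a dictionary p that maps symbols to probabilities or frequencies. It outputs a usable list of tuples from which to do the selection.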

def accumulate_normalize_values(p):
    pi = p.items() if isinstance(p,dict) else p
    accum_pi = []
    accum = 0
    for i in pi:
        accum_pi.append((i[0],i[1]+accum))
        accum += i[1]
    if accum == 0:
        raise Exception( "You are about to explode the universe. Continue ? Y/N " )
    normed_a = []
    for a in accum_pi:
        normed_a.append((a[0],a[1]*1.0/accum))
    return normed_a

Yields:

>>> accumulate_normalize_values( { 'a': 100, 'b' : 300, 'c' : 400, 'd' : 200 } )
[('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]

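Why it works

The accumulation step turns each symbol into an interval that runs from the accumulated probability or frequency of the previous symbols (or 0 in the case of the first symbol) up to its own accumulated value. These intervals can be used to select from (and thus sample the provided distribution) by simply stepping through the list until the random number in the interval 0.0 -> 1.0 (prepared earlier) is less than or equal to the current symbol's interval end-point.

The normalization releases us from the need to make sure everything sums to some value. After normalization the "vector" of probabilities sums to 1.0.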
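To make the stepping concrete, here is a small sketch with hand-written intervals. The pick helper, the symbol order and the fixed values of r are hypothetical, chosen only for illustration.

```python
# Cumulative, normalized end-points for frequencies a=100, b=300, c=400, d=200,
# taken in that order (an assumption made for this illustration).
intervals = [('a', 0.1), ('b', 0.4), ('c', 0.8), ('d', 1.0)]

def pick(intervals, r):
    """Return the first symbol whose interval end-point is >= r."""
    for symbol, end_point in intervals:
        if r <= end_point:
            return symbol
    raise ValueError("r must lie in the interval 0.0 -> 1.0")

print(pick(intervals, 0.05))  # a
print(pick(intervals, 0.45))  # c
print(pick(intervals, 0.95))  # d
```

Because 'c' owns the interval from 0.4 to 0.8, it is picked for 40% of uniformly drawn values of r, matching its frequency of 400 out of 1000.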

The rest of the code, for selecting and for generating an arbitrarily long sample from the distribution, is below:

def select(symbol_intervals,random):
    # print(symbol_intervals,random)  # debugging output
    i = 0
    while random > symbol_intervals[i][1]:
        i += 1
        if i >= len(symbol_intervals):
            raise Exception( "What did you DO to that poor list?" )
    return symbol_intervals[i][0]

def gen_random(alphabet,length,probabilities=None):
    from random import random
    from itertools import repeat
    if probabilities is None:
        probabilities = dict(zip(alphabet,repeat(1.0)))
    elif isinstance(probabilities,(list,tuple)) and len(probabilities) > 0 and isinstance(probabilities[0],(int,float)):
        probabilities = dict(zip(alphabet,probabilities)) #ordered
    usable_probabilities = accumulate_normalize_values(probabilities)
    gen = []
    while len(gen) < length:
        gen.append(select(usable_probabilities,random()))
    return gen

Usage:

>>> gen_random(['a','b','c','d'],10,[100,300,400,200])
['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c']   #<--- some of the time