Counting the frequency of items in a list of tuples

Question

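I have a list of tuples, shown below. I need to count how many items occur more than once. The code I have written so far is very slow, even with only around 10K tuples. In the example below the string 'example' appears twice, so that is the kind of string I need to pick out. My question is: what is the best way to get the counts of the strings here by iterating over a generator?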

The list:

 b_data=[('example',123),('example-one',456),('example',987),.....]

My code so far:

blockslst=[]
for line in b_data:
    blockslst.append(line[0])

blocklstgtone=[]
for item in blockslst:
    if(blockslst.count(item)>1):
        blocklstgtone.append(item)

cs95 · Accepted Answer

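You have the right idea in extracting the first item from each tuple. You can make your code more concise with a list/generator comprehension, as shown below.

From there, the most idiomatic way to find the frequency counts of the elements is to use a collections.Counter object.

1. Extract the first element from each tuple in the list (using a comprehension)
2. Pass this to Counter
3. Query the count of example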

    from collections import Counter

    counts = Counter(x[0] for x in b_data)
    print(counts['example'])

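Granted, if there is only one item whose frequency count you want to look up, you can use list.count, but in the general case a Counter is the way to go.

The advantage of a Counter is that it computes the frequency counts of all elements (not just example) in linear (O(N)) time. Say you also wanted to query the count of another element, such as foo. That would be:

    print(counts['foo'])

If 'foo' does not exist in the list, 0 is returned.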
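
As a small illustration of that behaviour, here is a sketch that assumes the three tuples shown in the question (the elided rest of the list is left out). A missing key yields 0 instead of raising KeyError, and the lookup does not add the key to the Counter.

```python
from collections import Counter

# Sample data from the question; the elided remainder of the list is omitted.
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

counts = Counter(x[0] for x in b_data)

present = counts['example']  # a key that was counted
missing = counts['foo']      # a key that was never seen: 0, not a KeyError

# Looking up a missing key does not insert it into the Counter.
foo_was_added = 'foo' in counts

print(present, missing, foo_was_added)
```

A plain dict would raise KeyError on the second lookup, which is one reason Counter is convenient here.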

If you want to find the most common elements, call counts.most_common:

    print(counts.most_common(n))

Here n is the number of elements you want displayed. If you want to see all of them, don't pass n.
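
A minimal sketch of both calls follows. The data is hypothetical: it extends the question's three tuples with an invented 'example-two' key so that the counts differ.

```python
from collections import Counter

# Hypothetical data: the question's tuples plus an invented 'example-two' key.
b_data = [('example', 123), ('example-one', 456), ('example', 987),
          ('example-two', 1), ('example-two', 2), ('example-two', 3)]

counts = Counter(x[0] for x in b_data)

top_two = counts.most_common(2)    # the two most frequent keys
everything = counts.most_common()  # all keys, highest count first

print(top_two)
print(everything)
```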

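To retrieve the counts of the most common elements, one efficient way is to query most_common and then use itertools to extract all elements whose count is greater than 1.

    from itertools import takewhile

    l = [1, 1, 2, 2, 3, 3, 1, 1, 5, 4, 6, 7, 7, 8, 3, 3, 2, 1]
    c = Counter(l)

    list(takewhile(lambda x: x[-1] > 1, c.most_common()))
    # [(1, 5), (3, 4), (2, 3), (7, 2)]

(OP edit) Alternatively, use a list comprehension to get a list of the items whose count is greater than 1:

    [item[0] for item in counts.most_common() if item[-1] > 1]

Note that this is not as efficient as the itertools.takewhile solution. For example, if you have one item with a count greater than 1 and a million items with a count of 1, the comprehension iterates over the list a million times when it doesn't need to (most_common returns the frequency counts in descending order). With takewhile that isn't the case, because iteration stops as soon as the condition count > 1 becomes false.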
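
To make the comparison concrete, here is a sketch that runs both variants on the question's three tuples and checks that they agree. The variable names are illustrative only.

```python
from collections import Counter
from itertools import takewhile

# Sample data from the question; the elided remainder of the list is omitted.
b_data = [('example', 123), ('example-one', 456), ('example', 987)]
counts = Counter(x[0] for x in b_data)

ranked = counts.most_common()  # (item, count) pairs, highest count first

# Stops at the first pair whose count is not greater than 1.
with_takewhile = [item for item, n in takewhile(lambda pair: pair[1] > 1, ranked)]

# Tests every pair, even after the counts have dropped to 1.
with_comprehension = [item for item, n in ranked if n > 1]

print(with_takewhile, with_comprehension)
```

Both produce the same list. Note that most_common() has already sorted every distinct item by the time either variant runs, so the saving from takewhile applies only to the scan that follows the sort.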

Aaditya Ura · Answer

First approach:

How about doing it without a loop?

print(list(map(lambda x:x[0],b_data)).count('example'))

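Output (for the three-tuple b_data used in the second approach below):

2

Second approach:

You can compute this with a plain dict, without importing any external module or making things complicated.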

    b_data = [('example', 123), ('example-one', 456), ('example', 987)]

    # count how often each first element occurs
    dict_1 = {}
    for i in b_data:
        if i[0] not in dict_1:
            dict_1[i[0]] = 1
        else:
            dict_1[i[0]] += 1

    print(dict_1)

    # keep only the (key, count) pairs whose count is greater than 1
    print(list(filter(lambda y: y is not None,
                      map(lambda x: (x, dict_1.get(x)) if dict_1.get(x) > 1 else None,
                          dict_1.keys()))))

Output:

[('example', 2)]

Test case:

b_data = [('example', 123), ('example-one', 456), ('example', 987),('example-one', 456),('example-one', 456),('example-two', 456),('example-two', 456),('example-two', 456),('example-two', 456)]

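Output (the order of the pairs follows the dict's iteration order, so it can differ between Python versions):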

[('example-two', 4), ('example-one', 3), ('example', 2)]

Patrick Artner · Answer

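In the time it took me to write this, ayodhyankit-paul posted the same approach. I am leaving this answer in nonetheless, for the test-data generator code and the timings:

Creating the 100001 items took about 5 seconds, and counting them took about 0.3 s. Filtering on the counts was too fast to measure (with datetime.now(); I didn't bother with perf_counter). Overall it took about 5.1 s from start to finish, for roughly 10 times the data you are operating on.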
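
For steps that are too fast for datetime.now() to resolve, time.perf_counter is the usual higher-resolution alternative. The following is only a sketch under assumptions: the data set is a smaller, hypothetical one (10,000 items rather than 100001), and the names t_data, counts and repeated are invented for the illustration.

```python
import random
from time import perf_counter

# Hypothetical test data, smaller than the 100001 items timed in this answer.
keys = ["example", "pose", "text", "someone"]
t_data = [(random.choice(keys) + str(i % 10), i) for i in range(10_000)]

start = perf_counter()

# Count with a plain dict, then keep the keys that occur more than once.
counts = {}
for key, _ in t_data:
    counts[key] = counts.get(key, 0) + 1
repeated = [key for key, n in counts.items() if n > 1]

elapsed = perf_counter() - start

print(f"counted and filtered {len(t_data)} items in {elapsed:.6f} s")
```

The measured time depends on the machine, so no expected figure is given here.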

I think this is similar to what Counter does in coldspeed's answer:

for each item in the list of tuples:

    if item[0] is not yet in the dict, put it into the dict with a count of 1
    otherwise increment its count in the dict by 1

Code:

    from collections import Counter
    import random
    from datetime import datetime  # good enough for a loong running op

    dt_datagen = datetime.now()
    numberOfKeys = 100000

    # basis for testdata
    textData = ["example", "pose", "text", "someone"]
    numData = [random.randint(100, 1000) for _ in range(1, 10)]  # irrelevant

    # create random testdata from above lists
    tData = [(random.choice(textData) + str(a % 10), random.choice(numData))
             for a in range(numberOfKeys)]
    tData.append(("aaa", 99))

    dt_dictioning = datetime.now()

    # create a dict
    countEm = {}

    # put all your data into dict, counting them
    for p in tData:
        if p[0] in countEm:
            countEm[p[0]] += 1
        else:
            countEm[p[0]] = 1

    dt_filtering = datetime.now()

    # comparison result-wise (commented out)
    # counts = Counter(x[0] for x in tData)
    # for c in sorted(counts):
    #     print(c, " = ", counts[c])
    # print()

    # output dict if count > 1
    subList = [x for x in countEm if countEm[x] > 1]  # without "aaa"

    dt_printing = datetime.now()

    for c in sorted(subList):
        if (countEm[c] > 1):
            print(c, " = ", countEm[c])

    dt_end = datetime.now()

    print("\n\nCreating ", len(tData), " testdataitems took:\t", (dt_dictioning - dt_datagen).total_seconds(), " seconds")
    print("Putting them into dictionary took \t", (dt_filtering - dt_dictioning).total_seconds(), " seconds")
    print("Filtering donw to those > 1 hits took \t", (dt_printing - dt_filtering).total_seconds(), " seconds")
    print("Printing all the items left took \t", (dt_end - dt_printing).total_seconds(), " seconds")
    print("\nTotal time: \t", (dt_end - dt_datagen).total_seconds(), " seconds")

Output:

    # reformatted for brevity
    example0 = 2520   example1 = 2535   example2 = 2415   example3 = 2511   example4 = 2511
    example5 = 2444   example6 = 2517   example7 = 2467   example8 = 2482   example9 = 2501
    pose0 = 2528      pose1 = 2449      pose2 = 2520      pose3 = 2503      pose4 = 2531
    pose5 = 2546      pose6 = 2511      pose7 = 2452      pose8 = 2538      pose9 = 2554
    someone0 = 2498   someone1 = 2521   someone2 = 2527   someone3 = 2456   someone4 = 2399
    someone5 = 2487   someone6 = 2463   someone7 = 2589   someone8 = 2404   someone9 = 2543
    text0 = 2454      text1 = 2495      text2 = 2538      text3 = 2530      text4 = 2559
    text5 = 2523      text6 = 2509      text7 = 2492      text8 = 2576      text9 = 2402

    Creating 100001 testdataitems took: 4.728604 seconds
    Putting them into dictionary took 0.273245 seconds
    Filtering donw to those > 1 hits took 0.0 seconds
    Printing all the items left took 0.031234 seconds

    Total time: 5.033083 seconds

Soudipta Dutta · Answer

This example is quite different from yours, but I found it very helpful when solving these kinds of questions.

    from collections import Counter

    a = [
        (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
        (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
        (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
        (3, "statistics"), (3, "regression"), (3, "probability"),
        (4, "machine learning"), (4, "regression"), (4, "decision trees"),
        (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
        (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
        (6, "probability"), (6, "mathematics"), (6, "theory"),
        (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
        (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
        (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
        (9, "Java"), (9, "MapReduce"), (9, "Big Data")
    ]

    # 1. Lowercase everything
    # 2. Split it into words.
    # 3. Count the results.
    dictionary = Counter(word for i, j in a for word in j.lower().split())
    print(dictionary)

    # print out every word if the count > 1
    for word, count in dictionary.most_common():
        if count > 1:
            print(word, count)

Here is your example, solved in the same way:

    from collections import Counter

    a = [('example', 123), ('example-one', 456), ('example', 987),
         ('example2', 987), ('example3', 987)]

    # named counts rather than dict, to avoid shadowing the built-in
    counts = Counter(word for i, j in a for word in i.lower().split())
    print(counts)

    # print out every word if the count > 1
    for word, count in counts.most_common():
        if count > 1:
            print(word, count)