How do I use concurrent.futures and queues in a real-time scenario?

Question

It is fairly easy to do parallel work with Python 3's concurrent.futures module, as shown below.

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        future_to = {executor.submit(do_work, input, 60): input for input in dictionary}
        for future in concurrent.futures.as_completed(future_to):
            data = future.result()

It is also very handy to insert items into a queue and retrieve them.

    q = queue.Queue()
    for task in tasks:
        q.put(task)
    while not q.empty():
        q.get()

I have a script running in the background that listens for updates. Now, in theory, assume that as those updates arrive I queue them and work on them concurrently using the ThreadPoolExecutor.

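Individually, all of these components work in isolation and make sense, but how do I use them together? I don't know whether it is possible to feed the ThreadPoolExecutor work from the queue in real time unless the data to work on is determined in advance.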

In a nutshell, all I want to do is receive updates of, say, 4 messages a second, put them into a queue, and have concurrent.futures work on them. If I can't, I am stuck with a slow sequential approach.

Let's take the canonical example from the Python documentation:

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))

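The list of URLS is fixed. Is it possible to feed this list in real time and have the workers process items as they arrive, perhaps from a queue for management purposes? I am a bit confused about whether my approach is actually possible.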

Stephen Rauch · Accepted Answer

Here is the example from the Python docs, expanded to take its work from a queue. Note: this code uses concurrent.futures.wait instead of concurrent.futures.as_completed, so that new work can be started while waiting for other work to complete.

    import concurrent.futures
    import urllib.request
    import time
    import queue

    q = queue.Queue()

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']

    def feed_the_workers(spacing):
        """ Simulate outside actors sending in work to do, request each url twice """
        for url in URLS + URLS:
            time.sleep(spacing)
            q.put(url)
        return "DONE FEEDING"

    def load_url(url, timeout):
        """ Retrieve a single page and report the URL and contents """
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return conn.read()

    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:

        # start a future for a thread which sends work in through the queue
        future_to_url = {
            executor.submit(feed_the_workers, 0.25): 'FEEDER DONE'}

        while future_to_url:
            # check for status of the futures which are currently working
            done, not_done = concurrent.futures.wait(
                future_to_url, timeout=0.25,
                return_when=concurrent.futures.FIRST_COMPLETED)

            # if there is incoming work, start a new future
            while not q.empty():

                # fetch a url from the queue
                url = q.get()

                # Start the load operation and mark the future with its URL
                future_to_url[executor.submit(load_url, url, 60)] = url

            # process any completed futures
            for future in done:
                url = future_to_url[future]
                try:
                    data = future.result()
                except Exception as exc:
                    print('%r generated an exception: %s' % (url, exc))
                else:
                    if url == 'FEEDER DONE':
                        print(data)
                    else:
                        print('%r page is %d bytes' % (url, len(data)))

                # remove the now completed future
                del future_to_url[future]

Output from fetching each url twice:

    'http://www.foxnews.com/' page is 67574 bytes
    'http://www.cnn.com/' page is 136975 bytes
    'http://www.bbc.co.uk/' page is 193780 bytes
    'http://some-made-up-domain.com/' page is 896 bytes
    'http://www.foxnews.com/' page is 67574 bytes
    'http://www.cnn.com/' page is 136975 bytes
    DONE FEEDING
    'http://www.bbc.co.uk/' page is 193605 bytes
    'http://some-made-up-domain.com/' page is 896 bytes
    'http://europe.wsj.com/' page is 874649 bytes
    'http://europe.wsj.com/' page is 874649 bytes
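If polling `q.empty()` on a timer is not desirable, one possible variation is to block on `q.get()` and stop when a sentinel arrives. The following is only a minimal sketch of that idea, not the approach used in the answer above: the names `STOP`, `dispatch` and `handler` are hypothetical, and results are collected through `Future.add_done_callback` rather than `concurrent.futures.wait`.

```python
import concurrent.futures
import queue
import threading
import time

STOP = object()  # hypothetical sentinel that marks the end of the stream


def dispatch(work_queue, handler, max_workers=5):
    """Submit each queued item to a thread pool as soon as it arrives.

    Blocks on work_queue.get() instead of polling. Returns a list of
    (item, result_or_exception) pairs after STOP has been received and
    every submitted task has finished.
    """
    results = []
    lock = threading.Lock()

    def record(item, future):
        # Runs when the future completes; store the result or the exception.
        exc = future.exception()
        with lock:
            results.append((item, exc if exc is not None else future.result()))

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while True:
            item = work_queue.get()  # blocks until a producer puts something
            if item is STOP:
                break
            future = executor.submit(handler, item)
            future.add_done_callback(lambda f, item=item: record(item, f))
    # Leaving the with block waits for all outstanding tasks to finish.
    return results


if __name__ == "__main__":
    q = queue.Queue()

    def feeder():
        # Simulate an outside producer delivering work over time.
        for n in range(5):
            time.sleep(0.05)
            q.put(n)
        q.put(STOP)

    threading.Thread(target=feeder).start()
    print(sorted(dispatch(q, lambda n: n * n)))
```

The trade-off is that the dispatching thread does nothing but wait on the queue, so it cannot also do periodic housekeeping the way the `wait(..., timeout=0.25)` loop can.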

Pedro M Duarte · Answer

At work I ran into a situation where I wanted to do parallel work on an unbounded stream of data. I created a small library inspired by the excellent answer already provided by Stephen Rauch.

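I originally approached this problem by thinking about two separate threads: one that submits work to a queue, and one that monitors the queue for completed tasks and makes room for new work to come in. This is similar to what Stephen Rauch proposed, where he consumes the stream using a feed_the_workers function that runs in a separate thread.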

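Talking to one of my colleagues, he helped me realize that you can get away with doing everything in a single thread if you define a buffered iterator that lets you control how many elements are released from the input stream each time you are ready to submit more work to the thread pool.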

So we introduce the BufferedIter class:

    class BufferedIter(object):
        def __init__(self, iterator):
            self.iter = iterator

        def nextN(self, n):
            vals = []
            for _ in range(n):
                vals.append(next(self.iter))
            return vals

This lets us define the stream processor as follows:

    import logging
    import queue
    import signal
    import sys
    import time
    from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

    level = logging.DEBUG
    log = logging.getLogger(__name__)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
    handler.setLevel(level)
    log.addHandler(handler)
    log.setLevel(level)

    WAIT_SLEEP = 1  # second, adjust this based on the timescale of your tasks


    def stream_processor(input_stream, task, num_workers):

        # Use a queue to signal shutdown.
        shutting_down = queue.Queue()

        def shutdown(signum, frame):
            log.warning('Caught signal %d, shutting down gracefully ...' % signum)
            # Put an item in the shutting down queue to signal shutdown.
            shutting_down.put(None)

        # Register the signal handler
        signal.signal(signal.SIGTERM, shutdown)
        signal.signal(signal.SIGINT, shutdown)

        def is_shutting_down():
            return not shutting_down.empty()

        futures = dict()
        buffer = BufferedIter(input_stream)
        with ThreadPoolExecutor(num_workers) as executor:
            num_success = 0
            num_failure = 0
            while True:
                idle_workers = num_workers - len(futures)

                if not is_shutting_down():
                    items = buffer.nextN(idle_workers)
                    for data in items:
                        futures[executor.submit(task, data)] = data

                done, _ = wait(futures, timeout=WAIT_SLEEP, return_when=ALL_COMPLETED)
                for f in done:
                    data = futures[f]
                    try:
                        f.result(timeout=0)
                    except Exception as exc:
                        log.error('future encountered an exception: %r, %s' % (data, exc))
                        num_failure += 1
                    else:
                        log.info('future finished successfully: %r' % data)
                        num_success += 1

                    del futures[f]

                if is_shutting_down() and len(futures) == 0:
                    break

            log.info("num_success=%d, num_failure=%d" % (num_success, num_failure))

Below is an example of how to use the stream processor:

    import itertools

    def integers():
        """Simulate an infinite stream of work."""
        for i in itertools.count():
            yield i


    def task(x):
        """The task we would like to perform in parallel.
        With some delay to simulate a time consuming job.
        With a baked in exception to simulate errors.
        """
        time.sleep(3)
        if x == 4:
            raise ValueError('bad luck')
        return x * x

    stream_processor(integers(), task, num_workers=3)

The output for this example is shown below:

    2019-01-15 22:34:40,193 future finished successfully: 1
    2019-01-15 22:34:40,193 future finished successfully: 0
    2019-01-15 22:34:40,193 future finished successfully: 2
    2019-01-15 22:34:43,201 future finished successfully: 5
    2019-01-15 22:34:43,201 future encountered an exception: 4, bad luck
    2019-01-15 22:34:43,202 future finished successfully: 3
    2019-01-15 22:34:46,208 future finished successfully: 6
    2019-01-15 22:34:46,209 future finished successfully: 7
    2019-01-15 22:34:46,209 future finished successfully: 8
    2019-01-15 22:34:49,215 future finished successfully: 11
    2019-01-15 22:34:49,215 future finished successfully: 10
    2019-01-15 22:34:49,215 future finished successfully: 9
    ^C <=== THIS IS WHEN I HIT Ctrl-C
    2019-01-15 22:34:50,648 Caught signal 2, shutting down gracefully ...
    2019-01-15 22:34:52,221 future finished successfully: 13
    2019-01-15 22:34:52,222 future finished successfully: 14
    2019-01-15 22:34:52,222 future finished successfully: 12
    2019-01-15 22:34:52,222 num_success=14, num_failure=1
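The back-pressure idea in this answer (never pull more from the stream than the pool has room for) could also be sketched with a semaphore instead of a buffered iterator. This is a different technique, shown only for comparison and not part of the library described above; the names `bounded_submit` and `slots` are hypothetical, and signal handling and logging are left out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def bounded_submit(stream, task, num_workers):
    """Run task over stream with at most num_workers items in flight.

    A slot is acquired before the next item is pulled from the stream,
    so the stream is never consumed faster than the pool can take work.
    Returns a list of (item, future) pairs in submission order.
    """
    slots = threading.BoundedSemaphore(num_workers)
    submitted = []
    iterator = iter(stream)
    with ThreadPoolExecutor(num_workers) as executor:
        while True:
            slots.acquire()  # blocks while num_workers tasks are running
            try:
                item = next(iterator)
            except StopIteration:
                slots.release()  # nothing was submitted for this slot
                break
            future = executor.submit(task, item)
            # Free the slot when the task finishes, whether it succeeded or failed.
            future.add_done_callback(lambda f: slots.release())
            submitted.append((item, future))
    # Leaving the with block waits for the remaining tasks.
    return submitted


if __name__ == "__main__":
    for item, future in bounded_submit(range(6), lambda x: x * x, num_workers=3):
        print(item, future.result())
```

Because this sketch returns only after the stream is exhausted, it suits finite input; for an unbounded stream the results would have to be handled inside the callback instead of being accumulated in a list.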

Vitalis · Answer

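I really liked the interesting approach by @pedro above. However, when processing thousands of files, I noticed that a StopIteration was thrown at the end and some files were always skipped. I had to make a small modification, shown below. Again, a very useful answer.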

    class BufferedIter(object):
        def __init__(self, iterator):
            self.iter = iterator

        def nextN(self, n):
            vals = []
            try:
                for _ in range(n):
                    vals.append(next(self.iter))
                return vals, False
            except StopIteration as e:
                return vals, True

Call it as follows:

    ...
    if not is_shutting_down():
        items, is_finished = buffer.nextN(idle_workers)
        if is_finished:
            stop()
    ...

Here, stop is simply a function that tells the processor to shut down:

    def stop():
        shutting_down.put(None)
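The same "return what you got plus a finished flag" behaviour can probably be written more compactly with `itertools.islice`, which simply stops early when the underlying iterator runs out. A minimal sketch, assuming the method is renamed to a hypothetical `next_n` and that any iterable may be passed in:

```python
import itertools


class BufferedIter:
    """Hand out up to n items at a time from an iterable."""

    def __init__(self, iterable):
        self._iter = iter(iterable)

    def next_n(self, n):
        # islice stops quietly at the end of the stream, so no
        # StopIteration escapes and no items are dropped.
        vals = list(itertools.islice(self._iter, n))
        # Fewer items than requested means the stream is exhausted.
        return vals, len(vals) < n


if __name__ == "__main__":
    buf = BufferedIter(range(5))
    print(buf.next_n(2))
    print(buf.next_n(2))
    print(buf.next_n(2))
```

As with the version above, a stream that ends exactly on a batch boundary is reported as finished only on the following call, and a request for zero items (no idle workers) never reports the stream as finished.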