A very simple multithreaded parallel URL fetcher (without a queue)

Question

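I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most of the scripts I found use queues, multiprocessing, or complex libraries.

In the end I wrote one myself, which I am posting as an answer. Please feel free to suggest improvements.

I suspect other people have been looking for something similar.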

abarnert · Accepted Answer

Simplifying your original version as far as possible:

    import threading
    import urllib2
    import time

    start = time.time()
    urls = ["http://www.google.com", "http://www.Apple.com",
            "http://www.Microsoft.com", "http://www.Amazon.com",
            "http://www.facebook.com"]

    def fetch_url(url):
        urlHandler = urllib2.urlopen(url)
        html = urlHandler.read()
        print "'%s' fetched in %ss" % (url, (time.time() - start))

    # One thread per URL: start them all, then wait for them all.
    threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    print "Elapsed Time: %s" % (time.time() - start)

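The only new tricks here are:

Keep track of the threads you create.
Don't bother with a counter of threads if you only want to know when they are all done; join already tells you that.
If you don't need any state or an external API, you don't need a Thread subclass, just a target function.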
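The three points above can be sketched in Python 3 syntax as follows. This is only an illustration, not part of the original answer: `fetch_all`, its `fetch` parameter, and the `results` dict are names assumed here, and `fake_fetch` is a hypothetical stand-in that sleeps instead of downloading so the example runs offline.

```python
import threading
import time


def fetch_all(urls, fetch):
    """Run fetch(url) in one thread per URL and wait for all of them."""
    results = {}  # each thread writes only its own key

    def worker(url):
        results[url] = fetch(url)

    # Keep track of the threads we create so we can join them later.
    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    # join() returns once the thread has finished, so no counter is needed.
    for thread in threads:
        thread.join()
    return results


def fake_fetch(url):
    """Hypothetical stand-in for urlopen(url).read(); sleeps instead of downloading."""
    time.sleep(0.1)
    return "<html>%s</html>" % url


if __name__ == "__main__":
    start = time.time()
    pages = fetch_all(["http://a.example", "http://b.example"], fake_fetch)
    print(sorted(pages))
    print("Elapsed Time: %.2fs" % (time.time() - start))
```

To fetch real pages, one could pass a function that wraps `urllib.request.urlopen(url).read()` in place of `fake_fetch`.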

jfs · Answer

multiprocessing has a thread pool that does not start any other processes:

    #!/usr/bin/env python
    from multiprocessing.pool import ThreadPool
    from time import time as timer
    from urllib2 import urlopen

    urls = ["http://www.google.com", "http://www.Apple.com",
            "http://www.Microsoft.com", "http://www.Amazon.com",
            "http://www.facebook.com"]

    def fetch_url(url):
        # Return errors as values so the main thread can report them.
        try:
            response = urlopen(url)
            return url, response.read(), None
        except Exception as e:
            return url, None, e

    start = timer()
    results = ThreadPool(20).imap_unordered(fetch_url, urls)
    for url, html, error in results:
        if error is None:
            print("%r fetched in %ss" % (url, timer() - start))
        else:
            print("error fetching %r: %s" % (url, error))
    print("Elapsed Time: %s" % (timer() - start,))

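Advantages compared with the Thread-based solution:

ThreadPool lets you limit the maximum number of concurrent connections (20 in the code example).
The output is not garbled, because all output happens in the main thread.
Errors are logged.
The code works on both Python 2 and 3 without other changes (assuming from urllib.request import urlopen on Python 3).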

abarnert · Answer

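concurrent.futures does everything you want, far more simply. It can also handle a huge number of URLs by running only 5 at a time, and it handles errors much more gracefully.

Of course, this module is only built in from Python 3.2 onward, but if you are using 2.5-3.1, you can just install the backport, futures, from PyPI. All you need to change in the sample code is to search-and-replace concurrent.futures with futures and, for 2.x, urllib.request with urllib2.

Here is the sample backported to 2.x, modified to use your URL list and to add the timings: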

    import concurrent.futures
    import urllib2
    import time

    start = time.time()
    urls = ["http://www.google.com", "http://www.Apple.com",
            "http://www.Microsoft.com", "http://www.Amazon.com",
            "http://www.facebook.com"]

    # Retrieve a single page and return its contents
    def load_url(url, timeout):
        conn = urllib2.urlopen(url, timeout=timeout)
        return conn.read()  # urllib2 responses provide read(), not readall()

    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print '%r generated an exception: %s' % (url, exc)
            else:
                print '"%s" fetched in %ss' % (url, (time.time() - start))

    print "Elapsed Time: %ss" % (time.time() - start)

But you can make this even simpler. All you really need is:

    def load_url(url):
        # 60-second timeout, as in the longer sample above
        conn = urllib2.urlopen(url, timeout=60)
        data = conn.read()
        print '"%s" fetched in %ss' % (url, (time.time() - start))
        return data

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        pages = executor.map(load_url, urls)

    print "Elapsed Time: %ss" % (time.time() - start)
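One thing worth knowing about the `executor.map` form: an exception raised inside a worker is only re-raised when that result is pulled from the returned iterator, so a version that never iterates over `pages` will not report failures. The sketch below, in Python 3 syntax, shows one way to surface them. `fetch_in_order` is a name assumed for illustration, and this `load_url` is a hypothetical stand-in that sleeps and fails on purpose rather than downloading.

```python
import concurrent.futures
import time


def load_url(url):
    """Hypothetical stand-in for urlopen(url, timeout=...).read()."""
    time.sleep(0.05)
    if "bad" in url:
        raise IOError("simulated failure for %s" % url)
    return "<html>%s</html>" % url


def fetch_in_order(urls, max_workers=5):
    """Return (pages, error); pages holds the results up to the first failure."""
    pages = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(load_url, urls)  # results come back in input order
        try:
            for page in results:
                pages.append(page)
        except IOError as exc:
            # map() re-raises a worker's exception when its result is retrieved.
            return pages, exc
    return pages, None


if __name__ == "__main__":
    pages, error = fetch_in_order(
        ["http://a.example", "http://bad.example", "http://c.example"])
    print("%d page(s) before the first error: %s" % (len(pages), error))
```

If every URL should be reported even when some fail, the `submit` plus `as_completed` form is probably the better fit, since each future can be checked on its own.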

Daniele B · Answer

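I am now posting a different solution: the worker threads are non-daemon and are joined to the main thread (which means the main thread blocks until all worker threads have finished), instead of each worker thread signalling the end of its execution through a callback to a global function (as I did in my previous answer). Some comments pointed out that the callback approach is not thread-safe.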

    import threading
    import urllib2
    import time

    start = time.time()
    urls = ["http://www.google.com", "http://www.Apple.com",
            "http://www.Microsoft.com", "http://www.Amazon.com",
            "http://www.facebook.com"]

    class FetchUrl(threading.Thread):
        def __init__(self, url):
            threading.Thread.__init__(self)
            self.url = url

        def run(self):
            urlHandler = urllib2.urlopen(self.url)
            html = urlHandler.read()
            print "'%s' fetched in %ss" % (self.url, (time.time() - start))

    for url in urls:
        FetchUrl(url).start()

    # Join all existing threads to main thread.
    for thread in threading.enumerate():
        if thread is not threading.currentThread():
            thread.join()

    print "Elapsed Time: %s" % (time.time() - start)
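Note that `threading.enumerate()` returns every live thread, so the loop above also waits for threads the script did not start itself. A possible variation, sketched here in Python 3 syntax, keeps its own list of workers and stores each result on the thread object, which is the kind of state that makes a Thread subclass worthwhile. The `fetch` argument, the `html` attribute, `fetch_with_subclass`, and `fake_fetch` are assumptions added for illustration, not part of the answer.

```python
import threading
import time


class FetchUrl(threading.Thread):
    """Worker thread that remembers its URL and the page it fetched."""

    def __init__(self, url, fetch):
        threading.Thread.__init__(self)
        self.url = url
        self.fetch = fetch  # hypothetical: function that downloads one URL
        self.html = None    # filled in by run()

    def run(self):
        self.html = self.fetch(self.url)


def fake_fetch(url):
    """Hypothetical stand-in for urlopen(url).read(); sleeps instead of downloading."""
    time.sleep(0.05)
    return "<html>%s</html>" % url


def fetch_with_subclass(urls, fetch):
    """Start one FetchUrl per URL, join only those threads, return url -> html."""
    workers = [FetchUrl(url, fetch) for url in urls]
    for worker in workers:
        worker.start()
    for worker in workers:  # join only the threads created here
        worker.join()
    return {worker.url: worker.html for worker in workers}


if __name__ == "__main__":
    start = time.time()
    pages = fetch_with_subclass(["http://a.example", "http://b.example"], fake_fetch)
    print(sorted(pages))
    print("Elapsed Time: %.2fs" % (time.time() - start))
```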