How to parallelize file downloads?

Question

I can download files one at a time with:

```python
import urllib.request

urls = ['foo.com/bar.gz', 'foobar.com/barfoo.gz', 'bar.com/foo.gz']

for u in urls:
    urllib.request.urlretrieve(u)
```

I could try to run it through subprocess like this:

```python
import subprocess
import os


def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()
            processes.difference_update(
                [p for p in processes if p.poll() is not None])

    # Check if all the child processes were closed
    for p in processes:
        if p.poll() is None:
            p.wait()


urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)
```

Is there any way to parallelize urlretrieve without "cheating" by using _os.system_ or subprocess?

Given that I have to resort to the "cheat" for now, is _subprocess.Popen_ the right way to download the data?

When I use parallelized_commandline() above, wget runs multi-threaded but not multi-core. Is that normal? Is there a way to make it multi-core instead of multi-threaded?

jfs · Accepted Answer

You could use a thread pool to download files in parallel:

```python
#!/usr/bin/env python3
from multiprocessing.dummy import Pool  # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls)  # download 4 files at a time
```
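If you also want to know which downloads failed, a variation on the same thread-pool idea is to use concurrent.futures from the standard library. The following is only a sketch: download_all and its fetch parameter are names made up for illustration, and fetch defaults to urlretrieve so that it can be swapped out for testing without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlretrieve


def download_all(urls, fetch=urlretrieve, max_workers=4):
    """Fetch every URL with a pool of threads (hypothetical helper).

    Returns (results, errors): two dicts keyed by URL. results holds the
    return value of fetch(url); errors holds the exception it raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be attributed.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # e.g. URLError, HTTPError, ValueError
                errors[url] = exc
    return results, errors
```

Called as `results, errors = download_all(urls)`, it runs at most four downloads at a time, and one bad URL does not abort the rest.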

You could also use asyncio to download multiple files at once in a single thread:

```python
#!/usr/bin/env python3
import asyncio
import logging

import aiohttp  # $ pip install aiohttp


async def download(url, session, semaphore, chunk_size=1 << 15):
    async with semaphore:  # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:
                while True:  # save file
                    chunk = await response.content.read(chunk_size)
                    if not chunk:
                        break
                    file.write(chunk)
        logging.info('done %s', filename)
        return filename, (response.status, tuple(response.headers.items()))


async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        download_tasks = (download(url, session, semaphore) for url in urls)
        return await asyncio.gather(*download_tasks)


urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))
```

where url2filename() is a helper that turns a URL into a local filename; it is defined separately.
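The definition of url2filename() is not reproduced in this answer. As a stand-in only (an assumption on my part, not necessarily the original helper), something along these lines would work for simple URLs: it takes the last component of the URL path, percent-decodes it, and refuses URLs that do not end in a filename.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the last path component of url as a local filename.

    Hypothetical stand-in: the query string and fragment are ignored,
    and a URL whose path ends in '/' is rejected.
    """
    path = urlsplit(url).path
    filename = posixpath.basename(unquote(path))
    if not filename:
        raise ValueError(f'no filename in url: {url!r}')
    return filename
```

Decoding before taking the basename means an encoded slash (%2F) in the last segment cannot smuggle a directory component into the result.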