How can I scrape faster?

Question

The task here is to scrape a site's API, from https://xxx.xxx.xxx/xxx/1.json through https://xxx.xxx.xxx/xxx/1417749.json, and write the results accurately to mongodb. For that I have the following code:

    import json
    import time

    import pymongo
    import requests

    client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
    db = client["thread1"]
    com = db["threadcol"]
    start_time = time.time()
    write_log = open("logging.log", "a")
    min = 1
    max = 1417749
    for n in range(min, max):
        response = requests.get("https:/xx.xxx.xxx/{}.json".format(str(n)))
        if response.status_code == 200:
            parsed = json.loads(response.text)
            inserted = com.insert_one(parsed)
            write_log.write(str(n) + "\t" + str(inserted) + "\n")
            print(str(n) + "\t" + str(inserted) + "\n")
    write_log.close()

But this takes a very long time. So my question is: how can I speed this process up?

Frans · Accepted Answer

If you don't want to use multithreading, asyncio is also a solution:

    import asyncio
    import json
    import time

    import pymongo
    from aiohttp import ClientSession


    async def get_url(url, session):
        async with session.get(url) as response:
            if response.status == 200:
                return await response.text()


    async def create_task(sem, url, session):
        async with sem:
            response = await get_url(url, session)
            if response:
                parsed = json.loads(response)
                n = url.rsplit('/', 1)[1]
                inserted = com.insert_one(parsed)
                write_log.write(str(n) + "\t" + str(inserted) + "\n")
                print(str(n) + "\t" + str(inserted) + "\n")


    async def run(minimum, maximum):
        url = 'https:/xx.xxx.xxx/{}.json'
        tasks = []
        # Cap the concurrent sessions at 1000 to stay below the max open sockets allowed
        sem = asyncio.Semaphore(1000)
        async with ClientSession() as session:
            for n in range(minimum, maximum):
                task = asyncio.ensure_future(create_task(sem, url.format(n), session))
                tasks.append(task)
            await asyncio.gather(*tasks)


    client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
    db = client["thread1"]
    com = db["threadcol"]
    start_time = time.time()
    write_log = open("logging.log", "a")
    min_item = 1
    max_item = 100

    # asyncio.run creates the event loop, runs the coroutine, and closes the loop
    asyncio.run(run(min_item, max_item))
    write_log.close()

keiv.fly · Answer

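There are a few things you can do:

Reuse the connection. According to the benchmark below, that is about 3 times faster.
Scrape in parallel, in multiple processes.

Parallel code, taken from here: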

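    import sys
    from queue import Queue  # this module is named Queue in Python 2
    from threading import Thread

    concurrent = 200  # placeholder value: not defined in the excerpt

    def doWork():
        # Placeholder worker (not defined in the excerpt): consume URLs from the queue
        while True:
            url = q.get()
            # ... fetch and store url here ...
            q.task_done()

    q = Queue(concurrent * 2)
    for i in range(concurrent):
        t = Thread(target=doWork)
        t.daemon = True
        t.start()
    try:
        for url in open('urllist.txt'):
            q.put(url.strip())
        q.join()
    except KeyboardInterrupt:
        sys.exit(1)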

Timings for a reused connection, taken from this question:

    >>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
    Starting new HTTPS connection (1): www.wikipedia.org
    Starting new HTTPS connection (1): www.wikipedia.org
    Starting new HTTPS connection (1): www.wikipedia.org
    ...
    Starting new HTTPS connection (1): www.wikipedia.org
    Starting new HTTPS connection (1): www.wikipedia.org
    Starting new HTTPS connection (1): www.wikipedia.org
    52.74904417991638
    >>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
    Starting new HTTPS connection (1): www.wikipedia.org
    15.770191192626953

albestro · Answer

You can improve your code in two ways:

Use a Session, so that the connection stays open instead of being re-established on every request.
Use parallelism in your code, with asyncio.

Take a look here: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
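As a rough illustration of the first point, here is a minimal sketch that sends every request through one session. The helper name fetch_range, the 10 second timeout, and the example.invalid URL are my own assumptions, not something from the question; the only requests calls used are requests.Session, session.get, and response.json.

```python
def fetch_range(session, url_template, ids):
    """Fetch url_template.format(n) for each n, reusing one session.

    Returns a dict mapping n to the parsed JSON body of every 200 response.
    """
    results = {}
    for n in ids:
        response = session.get(url_template.format(n), timeout=10)
        if response.status_code == 200:
            results[n] = response.json()
    return results


def main():
    # Imported here so fetch_range itself works with any session-like object.
    import requests

    # Hypothetical URL template; substitute the real API endpoint.
    url_template = "https://example.invalid/{}.json"
    with requests.Session() as session:  # one connection pool for all requests
        docs = fetch_range(session, url_template, range(1, 11))
    print(len(docs), "documents fetched")
```

Calling main() runs the loop against the placeholder URL. The session only removes the per-request connection setup; the requests are still sent one after another, so this combines well with the parallel approaches in the other answers.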

T Piper · Answer

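What you are probably looking for is asynchronous scraping. I would recommend creating batches of URLs, say 5 URLs at a time (try not to crash the website), and scraping each batch asynchronously. If you don't know much about async, google the asyncio library. I hope this helps.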
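The batching idea could look roughly like the sketch below. It is only an outline under my own assumptions: scrape_in_batches and fake_fetch are hypothetical names, and fake_fetch sleeps instead of making a real HTTP call, so you would swap in an actual async request (for example with aiohttp).

```python
import asyncio


async def scrape_in_batches(urls, fetch, batch_size=5):
    """Run fetch(url) concurrently within each batch, one batch at a time."""
    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # gather returns results in the same order as the awaitables passed in
        results.extend(await asyncio.gather(*(fetch(url) for url in batch)))
    return results


async def fake_fetch(url):
    # Stand-in for a real HTTP request; purely illustrative.
    await asyncio.sleep(0.01)
    return {"url": url}


def demo():
    # Hypothetical URLs; substitute the real API endpoint.
    urls = ["https://example.invalid/{}.json".format(n) for n in range(1, 13)]
    return asyncio.run(scrape_in_batches(urls, fake_fetch, batch_size=5))
```

With batch_size=5, at most five requests are in flight at once, which is what keeps the load on the site bounded.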

thuva4 · Answer

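Try chunking the requests and using the MongoDB bulk write operation.

Group the requests (100 requests per group)
Iterate over the groups
Use an asynchronous request model to fetch the data (the URLs in one group)
Update the DB after each group completes (bulk write operation)

This can save a lot of time in the following ways:

MongoDB write latency
Synchronous network call latency

However, do not increase the parallel request count (the chunk size) too far. It increases the network load on the server, and the server may treat it as a DDoS attack.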

https://api.mongodb.com/python/current/examples/bulk.html
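A minimal sketch of those steps, assuming a thread pool in place of a full async client (a swap I made to keep it short) and a caller-supplied fetch function. The names chunked and scrape_in_chunks and the workers parameter are hypothetical; insert_many with ordered=False is the pymongo call doing the bulk write.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(ids, size):
    """Yield consecutive lists of at most `size` ids."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def scrape_in_chunks(ids, fetch, collection, chunk_size=100, workers=20):
    """Fetch each chunk concurrently, then write it with one insert_many call.

    fetch(n) should return a dict, or None when the item could not be fetched.
    Returns the total number of documents written.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunked(ids, chunk_size):
            docs = [doc for doc in pool.map(fetch, chunk) if doc is not None]
            if docs:  # insert_many raises on an empty list
                collection.insert_many(docs, ordered=False)
                written += len(docs)
    return written
```

Here collection would be the question's com object, and fetch a function that requests one URL and returns the parsed JSON. One round trip to MongoDB per 100 documents replaces 100 separate insert_one calls.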

Ibrahim Dar · Answer

Assuming you don't get blocked by the API and there are no rate limits, this code should make the process about 50 times faster (possibly more, since all requests are now sent through the same session).

    import json
    import threading
    import time

    import pymongo
    import requests

    client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
    db = client["thread1"]
    com = db["threadcol"]
    start_time = time.time()
    logs = []
    number_of_json_objects = 1417750
    number_of_threads = 50
    session = requests.session()


    def scrap_write_log(session, start, end):
        for n in range(start, end):
            response = session.get("https:/xx.xxx.xxx/{}.json".format(n))
            if response.status_code == 200:
                try:
                    inserted = com.insert_one(json.loads(response.text))
                    line = str(n) + "\t" + str(inserted) + "\n"
                except Exception:
                    line = str(n) + "\t" + "Failed to insert" + "\n"
                logs.append(line)
                print(line)


    # Split the id range into one contiguous slice per thread
    chunk = number_of_json_objects // number_of_threads
    thread_ranges = [[x, x + chunk] for x in range(0, number_of_json_objects, chunk)]
    threads = [threading.Thread(target=scrap_write_log, args=(session, start_and_end[0], start_and_end[1]))
               for start_and_end in thread_ranges]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    with open("logging.log", "a") as f:
        for line in logs:
            f.write(line)

anonymous · Answer

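I happened to have the same question many years ago. I was never satisfied with the Python-based answers, which were either rather slow or too complicated. After I switched to other mature tools, the speed was good and I never went back.

These days I use the following steps to speed the process up:

Generate the list of URLs in a txt file
Use aria2c -x16 -d ~/Downloads -i /path/to/urls.txt to download these files
Parse locally

This is the fastest process I have come up with so far.

As for scraping web pages, I even download the *.html files I need instead of visiting the pages one at a time, which really makes no difference: when you visit a page with Python tools such as requests, scrapy, or urllib, they still download and cache the whole web content for you.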
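The first and last steps can be done with a few lines of standard-library Python. This is just a sketch: the function names write_url_list and load_downloaded are hypothetical, and it assumes the downloaded files end up as *.json files in one directory.

```python
import json
from pathlib import Path


def write_url_list(path, url_template, first, last):
    """Write one URL per line for ids first..last (inclusive)."""
    with open(path, "w", encoding="utf-8") as f:
        for n in range(first, last + 1):
            f.write(url_template.format(n) + "\n")


def load_downloaded(directory):
    """Yield (file name, parsed JSON) for every valid *.json file in directory."""
    for file in sorted(Path(directory).glob("*.json")):
        try:
            yield file.name, json.loads(file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # Skip files that are not valid JSON (e.g. an error page saved to disk).
            continue
```

The file written by write_url_list is what gets passed to aria2c with -i, and the documents yielded by load_downloaded can then be inserted into MongoDB in whatever batch size you like.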

mobin alhassan · Answer

First create a list of all the links, since they are all the same apart from the number, and then iterate over it.

    import threading


    class Demo:
        def __init__(self, url):
            self.json_url = url

        def get_json(self):
            try:
                pass  # your logic goes here
            except Exception as e:
                print(e)


    list_of_links = []
    for i in range(1, 1417749):
        list_of_links.append("https:/xx.xxx.xxx/{}.json".format(str(i)))

    t_no = 2  # number of threads started per batch
    for i in range(0, len(list_of_links), t_no):
        all_t = []
        batch_links = list_of_links[i:i + t_no]
        for link in batch_links:
            obj_new = Demo(link,)
            t = threading.Thread(target=obj_new.get_json)
            t.start()
            all_t.append(t)
        # Wait for the whole batch before starting the next one
        for t in all_t:
            t.join()

You can change the number of threads simply by increasing or decreasing t_no.