Scrapy - Reactor not Restartable

Question

With:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerProcess

I have always run this process successfully:

    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()

But since I moved this code into a web_crawler(self) function, like so:

    def web_crawler(self):
        # set up a crawler
        process = CrawlerProcess(get_project_settings())
        process.crawl(*args)
        # the script will block here until the crawling is finished
        process.start()
        # (...)
        return (result1, result2)

and started calling the method through an instance of the class, like this:

    def __call__(self):
        results1 = test.web_crawler()[1]
        results2 = test.web_crawler()[0]

and running:

    test()

I get the following error:

    Traceback (most recent call last):
      File "test.py", line 573, in <module>
        print (test())
      File "test.py", line 530, in __call__
        artists = test.web_crawler()
      File "test.py", line 438, in web_crawler
        process.start()
      File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
        reactor.run(installSignalHandlers=False)  # blocking call
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
        self.startRunning(installSignalHandlers=installSignalHandlers)
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
        ReactorBase.startRunning(self)
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
        raise error.ReactorNotRestartable()
    twisted.internet.error.ReactorNotRestartable

What is wrong?

Ferrard · Accepted Answer

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process for each run:

    import scrapy
    import scrapy.crawler as crawler
    from multiprocessing import Process, Queue
    from twisted.internet import reactor

    # your spider
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        def parse(self, response):
            for quote in response.css('div.quote'):
                print(quote.css('span.text::text').extract_first())


    # the wrapper to make it run more times
    def run_spider(spider):
        def f(q):
            try:
                runner = crawler.CrawlerRunner()
                deferred = runner.crawl(spider)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                q.put(e)

        q = Queue()
        p = Process(target=f, args=(q,))
        p.start()
        result = q.get()
        p.join()

        if result is not None:
            raise result

Run it twice:

    print('first run:')
    run_spider(QuotesSpider)

    print('\nsecond run:')
    run_spider(QuotesSpider)

Result:

    first run:
    “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
    “A day without sunshine is like, you know, night.”
    ...

    second run:
    “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
    “A day without sunshine is like, you know, night.”
    ...
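The same idea can be sketched without Scrapy, to show only the process-isolation part: run a job in a child process, send back either its result or its exception through a queue, and re-raise in the parent. Everything here is an illustrative assumption rather than part of the answer above: `run_in_child`, `_worker` and `fake_crawl` are hypothetical names, and `fake_crawl` merely stands in for a crawl. The sketch requests the `fork` start method explicitly, so it assumes a POSIX system. Under the `spawn` start method the process target has to be picklable, which a nested function such as `f` above is not, so a module-level worker is used here.

```python
import multiprocessing as mp


def _worker(queue, func, args):
    """Run func(*args) in the child and report the outcome to the parent."""
    try:
        queue.put(("ok", func(*args)))
    except Exception as exc:  # ship the exception back instead of losing it
        queue.put(("error", exc))


def run_in_child(func, *args):
    """Run func in a fresh child process and return its result.

    Assumption: a POSIX system, because the 'fork' start method is requested.
    """
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    child = ctx.Process(target=_worker, args=(queue, func, args))
    child.start()
    status, payload = queue.get()  # read before join() to avoid a deadlock
    child.join()
    if status == "error":
        raise payload
    return payload


def fake_crawl(pages):
    """Hypothetical stand-in for a crawl; returns fake page names."""
    if pages < 0:
        raise ValueError("negative page count")
    return ["page-{}".format(i) for i in range(pages)]


if __name__ == "__main__":
    print(run_in_child(fake_crawl, 2))  # first run, in its own process
    print(run_in_child(fake_crawl, 2))  # second run, in another fresh process
```

Because every call gets its own process, any run-once global state (such as a reactor) is started at most once per process, and the parent never starts it at all.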

Chiefir · Answer

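This is what helped me win the battle against the ReactorNotRestartable error (taken from the last answer by the author of the question):
0) pip install crochet
1) import it: from crochet import setup
2) call setup() at the top of the file
3) remove these 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()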

I had the same problem with this error and spent more than 4 hours solving it, reading every question about it here. I finally found that answer, so I am sharing it. This is how I solved it. The only meaningful lines left from the Scrapy docs are the last 2 lines of my code:

    # some more imports
    from crochet import setup
    setup()

    def run_spider(spiderName):
        module_name = "first_scrapy.spiders.{}".format(spiderName)
        scrapy_var = import_module(module_name)  # do some dynamic import of selected spider
        spiderObj = scrapy_var.mySpider()  # get mySpider-object from spider module
        crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
        crawler.crawl(spiderObj)  # from Scrapy docs

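With this code you can choose which spider to run just by passing its name to the run_spider function, and once scraping has finished you can select another spider and run it again.
Hope this helps somebody.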

Rejected · Answer

According to the Scrapy documentation, the start() method of the CrawlerProcess class does the following:

"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."

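The error you are receiving is thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you jimmy-rig some sort of code to restart it (I have seen it done), there is no guarantee it will work.

Honestly, if you think you need to restart the reactor, you are likely doing something wrong.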
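As a rough illustration of why creating a new CrawlerProcess does not help, here is a toy model in plain Python. It is not Twisted's or Scrapy's actual code: `ToyReactor` and `ToyCrawlerProcess` are hypothetical stand-ins, and a `RuntimeError` plays the role of `ReactorNotRestartable`. The only point is that several process objects share one run-once reactor.

```python
class ToyReactor:
    """Toy stand-in for a run-once event loop (not Twisted's real code)."""

    def __init__(self):
        self.started_before = False

    def run(self):
        if self.started_before:
            # Plays the role of twisted.internet.error.ReactorNotRestartable
            raise RuntimeError("reactor not restartable")
        self.started_before = True


class ToyCrawlerProcess:
    """Toy stand-in: every instance drives the same shared reactor."""

    def __init__(self, reactor):
        self.reactor = reactor

    def start(self):
        self.reactor.run()


def second_start_error():
    """Start two separate process objects on one shared reactor."""
    shared = ToyReactor()  # stands in for the single per-process reactor
    ToyCrawlerProcess(shared).start()  # first start succeeds
    try:
        ToyCrawlerProcess(shared).start()  # new object, same reactor
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(second_start_error())  # prints: reactor not restartable
```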

Depending on what you want to do, I would also review the "Running Scrapy from a Script" portion of the documentation.

data_garden · Answer

The mistake is in this code:

    def __call__(self):
        result1 = test.web_crawler()[1]
        result2 = test.web_crawler()[0]  # here

web_crawler() returns both results, but this code calls it twice, once per index. Each call starts the process again and so tries to restart the Reactor, as @Rejected points out.

The way to go here is to run the process once and store both results by unpacking the returned tuple:

    def __call__(self):
        result1, result2 = test.web_crawler()
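The difference between the two versions can be checked with a small stand-in that counts how often the crawl is started. `CrawlerStub` and its `crawl_count` attribute are hypothetical names used only for this sketch; no real crawl is run.

```python
class CrawlerStub:
    """Hypothetical stand-in for the class in the question."""

    def __init__(self):
        self.crawl_count = 0

    def web_crawler(self):
        # In the real code this is where process.start() would run.
        self.crawl_count += 1
        return ("result1", "result2")


def index_twice(stub):
    """Original pattern: one call per index, so the crawl starts twice."""
    result1 = stub.web_crawler()[1]
    result2 = stub.web_crawler()[0]
    return stub.crawl_count


def unpack_once(stub):
    """Fixed pattern: a single call, unpacked into two names."""
    result1, result2 = stub.web_crawler()
    return stub.crawl_count


if __name__ == "__main__":
    print(index_twice(CrawlerStub()))  # prints: 2
    print(unpack_once(CrawlerStub()))  # prints: 1
```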

Neeraj Yadav · Answer

This solved my problem. Put the code below after reactor.run() or process.start():

    time.sleep(0.5)
    os.execl(sys.executable, sys.executable, *sys.argv)

Granitosaurus · Answer

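As some people have already pointed out: you should not need to restart the reactor.

Ideally, if you want to chain your processes (crawl1, then crawl2, then crawl3), you simply add callbacks.

For example, I have been using this loop spider, which follows this pattern: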

1. Crawl A
2. Sleep N
3. goto 1

And this is how it looks in Scrapy:

    import time

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor


    class HttpbinSpider(scrapy.Spider):
        name = 'httpbin'
        allowed_domains = ['httpbin.org']
        start_urls = ['http://httpbin.org/ip']

        def parse(self, response):
            print(response.body)


    def sleep(_, duration=5):
        print(f'sleeping for: {duration}')
        time.sleep(duration)  # block here


    def crawl(runner):
        d = runner.crawl(HttpbinSpider)
        d.addBoth(sleep)
        d.addBoth(lambda _: crawl(runner))
        return d


    def loop_crawl():
        runner = CrawlerRunner(get_project_settings())
        crawl(runner)
        reactor.run()


    if __name__ == '__main__':
        loop_crawl()

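To explain the process further: the crawl function schedules a crawl and adds two extra callbacks that are called when the crawl is over. The first is the blocking sleep function, and the second is a recursive call to crawl itself, which schedules the next crawl.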

    $ python endless_crawl.py
    b'{\n "Origin": "000.000.000.000"\n}\n'
    sleeping for: 5
    b'{\n "Origin": "000.000.000.000"\n}\n'
    sleeping for: 5
    b'{\n "Origin": "000.000.000.000"\n}\n'
    sleeping for: 5
    b'{\n "Origin": "000.000.000.000"\n}\n'
    sleeping for: 5