ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.
This problem sometimes happens when running a scraper. Is there a way to catch this error and run a function when it occurs? I can't find how to do it anywhere online.
What you can do is define an errback in your Request instances.
errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.
Here is some sample code you can use (for scrapy 1.0):
# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError  # pre-1.0 location
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures;
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
And here's the output when running the spider (with only 1 retry and a 5-second download timeout):
$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines:
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
'downloader/request_bytes': 1748,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 12506,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/500': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
'log_count/DEBUG': 10,
'log_count/ERROR': 9,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)
Notice how Scrapy records the exceptions in its stats:
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
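If you want to act on those counters programmatically after a crawl, you can filter them out of the stats dict by key prefix. A minimal stdlib-only sketch; the `exception_breakdown` helper is hypothetical, and the hard-coded dict just copies the counters from the run above:

```python
# Sketch: tally downloader exceptions from a Scrapy stats dict.
stats = {
    'downloader/exception_count': 4,
    'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
    'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
}

def exception_breakdown(stats):
    # Keep only the per-exception-type counters, stripping the common prefix.
    prefix = 'downloader/exception_type_count/'
    return {k[len(prefix):]: v for k, v in stats.items() if k.startswith(prefix)}

print(exception_breakdown(stats))
# {'twisted.internet.error.DNSLookupError': 2, 'twisted.internet.error.TimeoutError': 2}
```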
I prefer to use a custom retry middleware like this:
# from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware  # pre-1.0 location
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from fake_useragent import FakeUserAgentError


class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        if isinstance(exception, FakeUserAgentError):
            return self._retry(request, exception, spider)
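For a middleware like this to run, it also has to be registered in the project settings. A sketch, assuming the class lives in a hypothetical `myproject/middlewares.py`; the priority value 550 is an arbitrary choice near Scrapy's built-in RetryMiddleware:

```python
# settings.py (sketch; 'myproject.middlewares' is a hypothetical module path)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FakeUserAgentErrorRetryMiddleware': 550,
}
```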