Scrapy: Save response.body as an HTML file?

Question

My spider works, but I can't save the body of the website I crawl to an .html file. If I write self.html_file.write('test'), it works fine. I don't know how to convert the tuple to a string.

I am using Python 3.6.

Spider:

    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ['google.com']
        start_urls = ['http://google.com/']

        def __init__(self):
            self.path_to_html = html_path + 'index.html'
            self.path_to_header = header_path + 'index.html'
            self.html_file = open(self.path_to_html, 'w')

        def parse(self, response):
            url = response.url
            self.html_file.write(response.body)
            self.html_file.close()
            yield {
                'url': url
            }

Traceback:

    Traceback (most recent call last):
      File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
        self.html_file.write(response.body)
    TypeError: write() argument must be str, not bytes

Somil · Accepted Answer

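The actual problem is that response.body is a bytes object, while a file opened in text mode only accepts str. You need to convert the bytes to a string first. There are several ways to do that. You can use

    self.html_file.write(response.body.decode("utf-8"))

instead of

    self.html_file.write(response.body)

You can also use

    self.html_file.write(response.text)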
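The error and the fix can be reproduced without Scrapy. The following is a minimal sketch: `body` is only a stand-in for `response.body`, and the temporary file path is an assumption made for illustration.

```python
import os
import tempfile

# Stand-in for response.body: the raw body is a bytes object.
body = "<html><body>héllo</body></html>".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "index.html")

# A file opened in text mode ('w') accepts only str, so writing bytes fails.
error_message = None
with open(path, "w", encoding="utf-8") as html_file:
    try:
        html_file.write(body)
    except TypeError as exc:
        error_message = str(exc)

# Decoding first turns the bytes into a str that text mode accepts.
with open(path, "w", encoding="utf-8") as html_file:
    html_file.write(body.decode("utf-8"))

# Read the file back to confirm what was saved.
with open(path, encoding="utf-8") as html_file:
    saved = html_file.read()
```

Here `error_message` holds the same kind of TypeError text as the traceback in the question, and `saved` holds the decoded page.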

nirvana-msu · Answer

The correct way is to use response.text, not response.body.decode("utf-8"). To quote the documentation:

Keep in mind that Response.body is always a bytes object. If you want the unicode version, use TextResponse.text (only available in TextResponse and subclasses).

and

text: Response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note: unicode(response.body) is not a correct way to convert the response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
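To illustrate why the response's own encoding matters, here is a rough sketch in plain Python. `FakeTextResponse` is a hypothetical stand-in written for this example, not Scrapy's class; it only mimics the decode-once-then-cache behavior described in the quote.

```python
class FakeTextResponse:
    """Hypothetical stand-in, not Scrapy's class: raw bytes plus a declared encoding."""

    def __init__(self, body, encoding):
        self.body = body
        self.encoding = encoding
        self._cached_text = None
        self.decode_calls = 0  # counter kept only to make the caching visible

    @property
    def text(self):
        # Decode once with the response's own encoding, then reuse the result.
        if self._cached_text is None:
            self.decode_calls += 1
            self._cached_text = self.body.decode(self.encoding)
        return self._cached_text


# A page served as Latin-1 rather than UTF-8.
response = FakeTextResponse("<p>café</p>".encode("latin-1"), encoding="latin-1")

first = response.text
second = response.text  # served from the cache, no second decode

# Hard-coding UTF-8 fails for this body: the byte 0xE9 is not valid UTF-8 here.
try:
    response.body.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

With a hard-coded "utf-8", any page in another encoding either raises or is decoded into the wrong characters, which is the case response.text avoids.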

Mariano Ruiz · Answer

Taking the answers above into account, and making it as Pythonic as possible by adding a with statement, the example should be rewritten like this:

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ['google.com']
        start_urls = ['http://google.com/']

        def __init__(self):
            # html_path and header_path come from the asker's project and are not shown here.
            self.path_to_html = html_path + 'index.html'
            self.path_to_header = header_path + 'index.html'

        def parse(self, response):
            # The with block closes the file even if the write fails.
            with open(self.path_to_html, 'w') as html_file:
                html_file.write(response.text)
            yield {
                'url': response.url
            }

Note, however, that html_file is then only accessible from within the parse method.
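One detail worth considering with the with-based version: open() in text mode without an encoding argument uses the platform's default encoding, which on Windows often cannot represent every character in a page. A possible sketch of two small helpers follows. The names `save_text` and `save_bytes`, the sample page, and the temporary paths are all assumptions made for illustration.

```python
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Write a str to disk; the file is closed when the with block exits."""
    with open(path, "w", encoding=encoding) as html_file:
        html_file.write(text)


def save_bytes(path, body):
    """Write raw bytes unchanged, so no decoding or encoding is involved."""
    with open(path, "wb") as html_file:
        html_file.write(body)


out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "text.html")
bytes_path = os.path.join(out_dir, "bytes.html")

page_text = "<html><body>日本語</body></html>"  # stand-in for response.text
page_body = page_text.encode("utf-8")          # stand-in for response.body

save_text(text_path, page_text)
save_bytes(bytes_path, page_body)

# Read both files back as raw bytes to compare what ended up on disk.
with open(text_path, "rb") as f:
    text_on_disk = f.read()
with open(bytes_path, "rb") as f:
    bytes_on_disk = f.read()
```

Inside a spider's parse method, this would amount to calling something like save_text(self.path_to_html, response.text) or save_bytes(self.path_to_html, response.body). Binary mode keeps the page exactly as the server sent it, while text mode with an explicit encoding fixes the encoding of the saved file.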