Download a PDF using urllib?

Question

I am trying to download a PDF file from a website using urllib. This is what I have so far:

    import urllib

    def download_file(download_url):
        web_file = urllib.urlopen(download_url)
        local_file = open('some_file.pdf', 'w')
        local_file.write(web_file.read())
        web_file.close()
        local_file.close()

    if __name__ == 'main':
        download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty PDF file. What am I doing wrong?

jamiemcg · Answer

Here is an example that works:

    import urllib2  # Python 2 only; Python 3 uses urllib.request instead

    def main():
        download_file("http://mensenhandel.nl/files/pdftest2.pdf")

    def download_file(download_url):
        response = urllib2.urlopen(download_url)
        # Open in binary mode ('wb') so the PDF bytes are written unchanged.
        file = open("document.pdf", 'wb')
        file.write(response.read())
        file.close()
        print("Completed")

    if __name__ == "__main__":
        main()

shockburner · Answer

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'). PDF files are binary files, so you need the 'b'. The same is true of almost any file that you cannot open in a text editor.
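To make the difference between the two modes concrete, here is a minimal sketch that needs no network access. PDF_BYTES, save_bytes and demo are hypothetical names made up for this illustration; the bytes only stand in for a real download. It assumes Python 3, where text mode rejects bytes with a TypeError. In Python 2 on Windows the write succeeds, but text mode translates newline bytes and so corrupts the file.

```python
import os
import tempfile

# Hypothetical stand-in for the bytes a server would return for a PDF.
PDF_BYTES = b"%PDF-1.4\n\x00\xff\r\n%%EOF\n"


def save_bytes(path, data, mode):
    """Write data to path with the given mode; return the size on disk."""
    with open(path, mode) as f:
        f.write(data)
    return os.path.getsize(path)


def demo():
    """Return (size written in 'wb', round trip intact?, error name in 'w')."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "some_file.pdf")

        # Binary mode stores the bytes exactly as they were received.
        size = save_bytes(path, PDF_BYTES, "wb")
        with open(path, "rb") as f:
            round_trip_ok = f.read() == PDF_BYTES

        # Text mode in Python 3 refuses to write bytes at all.
        try:
            save_bytes(path, PDF_BYTES, "w")
            text_mode_error = None
        except TypeError as exc:
            text_mode_error = type(exc).__name__

    return size, round_trip_ok, text_mode_error


if __name__ == "__main__":
    print(demo())
```

The point is only that 'wb' preserves every byte, which is what a PDF reader needs.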

romulomadu · Answer

Try using urllib.request.urlretrieve (Python 3):

    from urllib.request import urlretrieve

    def download_file(download_url):
        # urlretrieve fetches the URL and saves it straight to the given path.
        urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

    if __name__ == '__main__':
        download_file('http://www.example.com/some_file.pdf')

Piyush Rumao · Answer

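I tried the code above. It works in some cases, but for some websites with embedded PDFs you may get an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server security features that block known bots. By default urllib identifies itself with a User-Agent header like Python-urllib/3.3, so it is better to add a custom header through urllib's request module, as shown below.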

    from urllib.request import Request, urlopen

    url = "https://realpython.com/python-tricks-sample-pdf"

    # Send a browser-like User-Agent instead of urllib's default one.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    with urlopen(req) as response, open("<location to dump pdf>/<name of file>.pdf", "wb") as code:
        code.write(response.read())
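If you want to see which User-Agent urllib would send before and after the change, the following sketch inspects the headers locally without making any request. default_user_agent and custom_request are hypothetical helper names invented for this illustration. The exact default string depends on your Python version.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that a default urllib opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request.
    return dict(opener.addheaders).get("User-agent")


def custom_request(url, user_agent="Mozilla/5.0"):
    """Build a Request carrying a custom User-Agent (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    print(default_user_agent())
    req = custom_request("http://www.example.com/some_file.pdf")
    # Request normalises header names, so the key reads 'User-agent'.
    print(req.get_header("User-agent"))
```

A header set on the Request takes precedence over the opener's default, which is presumably why the 403 goes away on sites that filter on the default value.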

Piyush Rumao · Answer

I suggest using the following lines of code:

    import urllib.request
    import shutil

    url = "link to your website for pdf file to download"
    output_file = "local directory://name.pdf"

    with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
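One reason to prefer shutil.copyfileobj is that it copies in chunks instead of reading the whole response into memory first. The sketch below shows the same call in isolation, using io.BytesIO objects as stand-ins for the HTTP response and the output file. copy_stream and the 64 KiB chunk size are assumptions made for this illustration.

```python
import io
import shutil


def copy_stream(source, destination, chunk_size=64 * 1024):
    """Copy a binary file-like object to another in chunks; return bytes written."""
    shutil.copyfileobj(source, destination, chunk_size)
    return destination.tell()


if __name__ == "__main__":
    # Stand-ins for the HTTP response and the local file opened with 'wb'.
    fake_response = io.BytesIO(b"%PDF-1.4\n" + b"\x00" * 200000)
    fake_file = io.BytesIO()
    print(copy_stream(fake_response, fake_file))
```

With a real download you would pass the object returned by urllib.request.urlopen as source and a file opened in 'wb' mode as destination.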