Download image files from an HTML page's source using Python?

Question

I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page.

Ryan Ginstrom · Accepted Answer

Here is some code that downloads all the images from the supplied URL and saves them in the specified output folder. You can modify it to suit your own needs.

    """
    dumpimages.py
        Downloads all the images on the supplied URL, and saves them to the
        specified output folder ("/test/" by default)

    Usage:
        python dumpimages.py http://example.com/ [output]
    """
    import os
    import sys
    from urllib.parse import urlparse, urlunparse
    from urllib.request import urlopen, urlretrieve

    from bs4 import BeautifulSoup as bs


    def main(url, out_folder="/test/"):
        """Downloads all the images at 'url' to out_folder ("/test/" by default)"""
        soup = bs(urlopen(url), "html.parser")  # name the parser explicitly
        parsed = list(urlparse(url))

        for image in soup.find_all("img", src=True):  # skip <img> tags without src
            print("Image: %s" % image["src"])
            filename = image["src"].split("/")[-1]
            # Swap the page's path for the image's src to build the image URL
            parsed[2] = image["src"]
            outpath = os.path.join(out_folder, filename)
            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlunparse(parsed), outpath)


    def _usage():
        print("usage: python dumpimages.py http://example.com [outpath]")


    if __name__ == "__main__":
        url = sys.argv[-1]
        out_folder = "/test/"
        if not url.lower().startswith("http"):
            # The last argument is not a URL, so treat it as the output folder
            out_folder = sys.argv[-1]
            url = sys.argv[-2]
            if not url.lower().startswith("http"):
                _usage()
                sys.exit(-1)
        main(url, out_folder)

Edit: You can now specify the output folder.
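
One detail the script leaves open is that taking the last "/"-separated piece of src as the filename keeps any query string or fragment (for example "a.png?v=2"). A minimal sketch of one way to derive a cleaner name, assuming that only the path part of the URL should name the file; the helper filename_from_src and its default argument are hypothetical and not part of the original script:

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_src(src, default="image"):
    """Return a local filename for an <img> src value (hypothetical helper).

    Only the path component is used, so query strings and fragments are dropped.
    """
    path = urlsplit(src).path
    # Decode %XX escapes first, then keep only the last path segment
    name = posixpath.basename(unquote(path))
    # Fall back to a fixed name when the path ends in "/" or is empty
    return name or default


print(filename_from_src("http://example.com/img/a.png?v=2"))
print(filename_from_src("/img/b%20c.jpg#top"))
print(filename_from_src("http://example.com/img/"))
```

In the script, this would stand in for the image["src"].split("/")[-1] expression. Two images with the same basename would still overwrite each other.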

Catherine Devlin · Answer

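Ryan's solution is good, but it fails if the image source URLs are absolute URLs, or anything else that doesn't give a good result when simply concatenated to the main page URL. urljoin recognizes absolute vs. relative URLs, so replace the loop in the middle with: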

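    # needs: from urllib.parse import urljoin
    for image in soup.find_all("img", src=True):
        print("Image: %s" % image["src"])
        # urljoin resolves relative src values and leaves absolute ones alone
        image_url = urljoin(url, image["src"])
        filename = image["src"].split("/")[-1]
        outpath = os.path.join(out_folder, filename)
        urlretrieve(image_url, outpath)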
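
To see why urljoin helps here, a small sketch of how it resolves different kinds of src values against a page URL. The page URL and image paths are made up for illustration, and resolve_image_url is a hypothetical wrapper name:

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration
page_url = "http://example.com/gallery/index.html"


def resolve_image_url(page_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


# Relative to the page's directory
relative = resolve_image_url(page_url, "img/cat.png")
# Relative to the site root
root_relative = resolve_image_url(page_url, "/static/dog.png")
# Already absolute: returned unchanged
absolute = resolve_image_url(page_url, "http://cdn.example.org/bird.png")
# Parent-directory reference
parent = resolve_image_url(page_url, "../fish.png")

for resolved in (relative, root_relative, absolute, parent):
    print(resolved)
```

Because every case goes through the same call, the loop no longer needs the startswith("http") branch.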

user20955 · Answer

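You have to download the page and parse the HTML document, find your images with a regular expression, and download them. You can use urllib2 (urllib.request in Python 3) for downloading and Beautiful Soup for parsing the HTML file.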
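
A rough sketch of the regular-expression route, using only the standard library. The pattern and the names IMG_SRC_RE and find_image_sources are assumptions made for illustration; the pattern is deliberately naive (it only handles quoted src values and would also match attributes such as data-src), so an HTML parser is usually the safer choice for real pages:

```python
import re

# Naive pattern: an <img ...> tag that contains a quoted src attribute
IMG_SRC_RE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def find_image_sources(html):
    """Return the src values of the <img> tags found in an HTML string."""
    return IMG_SRC_RE.findall(html)


# Made-up HTML fragment for illustration
sample = '<p><img src="a.png" alt="A"><IMG class="x" SRC=\'/img/b.jpg\'></p>'
sources = find_image_sources(sample)
print(sources)
```

Each value returned this way still has to be resolved against the page URL before it can be downloaded.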

Dingo · Answer

And this is a function for downloading a single image:

    import urllib.request

    def download_photo(self, img_url, filename):
        # DOWNLOADED_IMAGE_PATH is assumed to be a directory prefix defined elsewhere
        file_path = "%s%s" % (DOWNLOADED_IMAGE_PATH, filename)
        with urllib.request.urlopen(img_url) as image_on_web, \
                open(file_path, "wb") as downloaded_image:
            # Copy in 64 KiB chunks so a large image is never held fully in memory
            while True:
                buf = image_on_web.read(65536)
                if len(buf) == 0:
                    break
                downloaded_image.write(buf)
        return file_path

Martin v. Löwis · Answer

Use htmllib to extract all the img tags (override do_img), then use urllib2 to download all the images. (Both are Python 2 modules.)
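
htmllib no longer exists in Python 3, so the following is a sketch of the same idea with html.parser.HTMLParser instead, which is a swapped-in technique and not the do_img hook itself. The class name ImageCollector and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


collector = ImageCollector()
collector.feed(
    '<html><body><img src="a.png"><p>text</p>'
    '<img alt="no src"><img src="/b.jpg"/></body></html>'
)
collector.close()
print(collector.sources)
```

The collected values can then be resolved against the page URL and fetched with urllib.request.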

Lerner Zhang · Answer

If the request requires authorization, see the following:

    import requests

    # img_url, username and password are assumed to be defined by the caller
    r_img = requests.get(img_url, auth=(username, password))
    with open("000000.jpg", "wb") as f:
        f.write(r_img.content)