How to extract and download all images from a website using BeautifulSoup

Question

I am trying to extract and download all images from a URL. I wrote this script:

    import urllib2
    import re
    from os.path import basename
    from urlparse import urlsplit

    url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
    urlContent = urllib2.urlopen(url).read()
    # HTML image tag: <img src="url" alt="some_text"/>
    imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

    # download all images
    for imgUrl in imgUrls:
        try:
            imgData = urllib2.urlopen(imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            output = open(fileName,'wb')
            output.write(imgData)
            output.close()
        except:
            pass

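I don't want to extract the image on this page; see this image: http://i.share.pho.to/1c9884b1_l.jpeg. I want to get all the images without clicking the "Next" button, but I can't work out how to get all the pictures inside the "Next" class.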

Jonathan · Answer

The following should extract all images from a given page and write them to the directory where the script is being run.

    import re
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    site = 'http://pixabay.com'

    response = requests.get(site)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Collect the src attribute of every <img> tag that has one.
    img_tags = soup.find_all('img')
    urls = [img['src'] for img in img_tags if img.get('src')]

    for url in urls:
        # Only keep sources that end in a recognizable image file name.
        filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
        if not filename:
            continue
        # Sometimes an image source is relative; if it is, resolve it
        # against the base URL, which here is the site variable.
        url = urljoin(site, url)
        response = requests.get(url)
        with open(filename.group(1), 'wb') as f:
            f.write(response.content)
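The relative-source case mentioned in the script's comments is easy to get wrong, so it may help to check it in isolation. Below is a minimal sketch, assuming Python 3 and using only the standard library. `resolve_image_urls` is a hypothetical helper name, and the example.com URLs are placeholders, not pages from the question.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def resolve_image_urls(page_url, srcs):
    """Return (absolute_url, filename) pairs for the given img src values."""
    results = []
    for src in srcs:
        # urljoin handles absolute, root-relative, path-relative and
        # protocol-relative ("//host/...") sources.
        absolute = urljoin(page_url, src)
        # Take the file name from the URL path only, so a query string
        # such as "?w=300" does not end up in the name.
        filename = posixpath.basename(urlsplit(absolute).path)
        if filename:  # skip sources whose path has no file name
            results.append((absolute, filename))
    return results


# Placeholder inputs for illustration only.
pairs = resolve_image_urls(
    "http://example.com/gallery/page1.html",
    ["/static/a.png", "b.jpg?w=300", "//cdn.example.com/c.gif", "http://example.com/"],
)
for absolute, filename in pairs:
    print(absolute, "->", filename)
```

Each pair could then be passed to `requests.get` and written to disk as in the loop above. Taking the name from the URL path rather than from a regex also keeps images whose source carries a query string or an extension other than jpg, gif or png.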