Downloading the contents of web pages

Question

I want to write a Python program that downloads the content of a web page and then downloads the content of the web pages that the first page links to.

For example, the main page is http://www.Adobe.com/support/security/ and the pages to download are http://www.Adobe.com/support/security/bulletins/apsb13-23.html and http://www.Adobe.com/support/security/bulletins/apsb13-22.html

There is one condition I need to meet: only the pages under bulletins should be downloaded, not the ones under advisories (for example http://www.Adobe.com/support/security/advisories/apsa13-02.html).

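    #!/usr/bin/env python
    # Python 2 script: urllib.urlopen is the Python 2 API.
    import os   # needed for the os.system call at the end
    import re
    import urllib

    page = urllib.urlopen("http://www.Adobe.com/support/security/")
    page = page.read()

    # Collect (href, link text) pairs from every anchor tag.
    links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)

    # Write one href per line to the file 'content'.
    fileHandle = open('content', 'w')
    for link in links:
        fileHandle.write('%s\n' % link[0])
    fileHandle.close()

    # Keep only the bulletin links and append them to 'content1'.
    os.system("grep -i '\/support\/security\/bulletins\/' content >> content1")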

I have already extracted the bulletin links into content1, but I don't know how to download the content of those pages using content1 as input.

The content1 file looks like this:

    /support/security/bulletins/apsb13-23.html
    /support/security/bulletins/apsb13-23.html
    /support/security/bulletins/apsb13-22.html
    /support/security/bulletins/apsb13-22.html
    /support/security/bulletins/apsb13-21.html
    /support/security/bulletins/apsb13-21.html
    /support/security/bulletins/apsb13-22.html
    /support/security/bulletins/apsb13-22.html
    /support/security/bulletins/apsb13-15.html
    /support/security/bulletins/apsb13-15.html
    /support/security/bulletins/apsb13-07.html

Radu Rădeanu · Accepted Answer

If I understand your question correctly, the following script should be what you want:

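    #!/usr/bin/env python
    # Python 2 script: urllib.urlopen is the Python 2 API.
    import os
    import re
    import urllib

    page = urllib.urlopen("http://www.Adobe.com/support/security/")
    page = page.read()

    # Collect (href, link text) pairs from every anchor tag.
    links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)

    # Write one href per line to the file 'content'.
    fileHandle = open('content', 'w')
    for link in links:
        fileHandle.write('%s\n' % link[0])
    fileHandle.close()

    # Keep the bulletin links, take the first 3 lines, drop adjacent
    # duplicates, and prefix each path with the host to get a full URL.
    os.system("grep -i '\/support\/security\/bulletins\/' content 2>/dev/null | head -n 3 | uniq | sed -e 's/^/http:\/\/www.Adobe.com/g' > content1")

    # Let wget download every URL listed in 'content1'.
    os.system("wget -i content1")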
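As a rough alternative, the same steps (extract the links, keep only the bulletins, drop duplicates, download each page) could also be done without grep, sed, or wget. The sketch below is an assumption-laden illustration rather than part of the answer above: it targets Python 3 (`urllib.request` instead of the Python 2 `urllib.urlopen`), the helper names `absolute_unique`, `bulletin_urls`, `download`, and `main` are made up for this example, and it decodes the index page as UTF-8, which may not match the real page.

```python
#!/usr/bin/env python3
# Sketch only: Python 3 variant of the script above, no shell tools.
import os
import re
import urllib.request
from urllib.parse import urljoin, urlparse

INDEX_URL = "http://www.Adobe.com/support/security/"  # main page from the question
BULLETIN_PATH = "/support/security/bulletins/"        # advisories do not match this


def absolute_unique(hrefs, base_url=INDEX_URL):
    """Turn hrefs into absolute bulletin URLs, keeping order, dropping duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href.strip())
        # Case-insensitive path check, like `grep -i` in the original script.
        if BULLETIN_PATH in urlparse(url).path.lower() and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def bulletin_urls(html, base_url=INDEX_URL):
    """Extract the href of every <a> tag in html and keep the bulletin links."""
    hrefs = re.findall(r'<a\b[^>]*?\bhref="([^"]*)"', html, flags=re.IGNORECASE)
    return absolute_unique(hrefs, base_url)


def download(url, directory="."):
    """Save the page at url under its own file name, e.g. apsb13-23.html."""
    filename = os.path.basename(urlparse(url).path) or "index.html"
    target = os.path.join(directory, filename)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


def main():
    """Fetch the index page, then download every bulletin it links to."""
    with urllib.request.urlopen(INDEX_URL) as response:
        html = response.read().decode("utf-8", errors="replace")
    for url in bulletin_urls(html):
        print("saved", download(url))
```

Calling `main()` runs the whole pipeline. If the relative paths are already in content1, something like `absolute_unique(open("content1"))` should give the list of URLs to pass to `download`, since each line of the file is treated as one href.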

bikram990 · Answer

This question probably belongs on Stack Overflow!

Anyway, you could take a look at HTTrack. It does this kind of thing, and it is open source as well.