How do I parse a website using Selenium and BeautifulSoup in Python?

Question

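I'm new to programming and have figured out how to use Selenium to navigate to the page I need. Now I'd like to parse the data on it, but I'm not sure where to start. Could someone hold my hand a little and point me in the right direction?

Any help is appreciated.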

RocketDonkey · Accepted Answer

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You can then pass page_source to BeautifulSoup as follows:

    In [8]: from bs4 import BeautifulSoup

    In [9]: from selenium import webdriver

    In [10]: driver = webdriver.Firefox()

    In [11]: driver.get('http://news.ycombinator.com')

    In [12]: html = driver.page_source

    In [13]: soup = BeautifulSoup(html, 'html.parser')

    In [14]: for tag in soup.find_all('title'):
       ....:     print(tag.text)
       ....:
    Hacker News
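Once the HTML is a plain string, BeautifulSoup does not care where it came from, so the parsing step can be tried without a browser. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is made-up markup standing in for `driver.page_source`, and `parse_page` and the `story` class are hypothetical names chosen for illustration, not anything from the site in the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for driver.page_source.
SAMPLE_HTML = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <table>
      <tr class="story"><td><a href="/item?id=1">First story</a></td></tr>
      <tr class="story"><td><a href="/item?id=2">Second story</a></td></tr>
    </table>
  </body>
</html>
"""


def parse_page(html):
    """Return the page title and the link text found in each 'story' row."""
    soup = BeautifulSoup(html, "html.parser")

    # soup.title is None when the document has no <title> tag.
    title = soup.title.get_text(strip=True) if soup.title else None

    stories = []
    for row in soup.find_all("tr", class_="story"):
        link = row.find("a")
        if link is not None:
            stories.append(link.get_text(strip=True))
    return title, stories


title, stories = parse_page(SAMPLE_HTML)
print(title)
print(stories)
```

With a live session, the same function could be called as `parse_page(driver.page_source)`; the tag names and classes to search for depend entirely on the page being scraped.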

root · Answer

Your question isn't very specific, so here is a simple example. To do anything more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage on this site.

    from selenium import webdriver
    from bs4 import BeautifulSoup

    browser = webdriver.Firefox()
    browser.get('http://webpage.com')
    soup = BeautifulSoup(browser.page_source, 'html.parser')

    # do something useful
    # prints all the links with corresponding text
    for link in soup.find_all('a'):
        print(link.get('href', None), link.get_text())
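Links on a page are often relative, and some anchor tags have no href at all. As a rough sketch of one way to handle both, the helper below (a hypothetical `extract_links`, run here against made-up sample markup and an assumed base URL) resolves each href against the page URL with the standard library's `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors without an href attribute.
    for a in soup.find_all("a", href=True):
        links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))
    return links


# Made-up markup and base URL, for illustration only.
SAMPLE = (
    '<a href="/about">About</a>'
    '<a name="top">No href</a>'
    '<a href="http://example.org/x">External</a>'
)
links = extract_links(SAMPLE, "http://example.com/index.html")
for url, text in links:
    print(url, text)
```

In a Selenium session this could be called as `extract_links(browser.page_source, browser.current_url)`, so that relative links resolve against the page the browser is actually on.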

Vor · Answer

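Are you sure you want to use Selenium? I have used PyQt4 for this kind of task; it is quite powerful and can do whatever you need.

Here is some sample code I wrote a while ago; you only need to change the URL.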

    #! /usr/bin/env python2.7
    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtWebKit import *
    from bs4 import BeautifulSoup
    import sys, signal


    class Browser(QWebView):
        def __init__(self):
            QWebView.__init__(self)
            self.loadProgress.connect(self._progress)
            self.loadFinished.connect(self._loadFinished)
            self.frame = self.page().currentFrame()

        def _progress(self, progress):
            # Report page-load progress as a percentage.
            print str(progress) + "%"

        def _loadFinished(self):
            print "Load Finished"
            # Take the rendered HTML (after JavaScript has run) and parse it.
            html = unicode(self.frame.toHtml()).encode('utf-8')
            soup = BeautifulSoup(html)
            print soup.prettify()
            self.close()


    if __name__ == "__main__":
        app = QApplication(sys.argv)
        br = Browser()
        url = QUrl('http://web site that can contain javascript.com')
        br.load(url)
        br.show()
        # Restore the default SIGINT handler so Ctrl+C stops the Qt event loop.
        if signal.signal(signal.SIGINT, signal.SIG_DFL):
            sys.exit(app.exec_())
        app.exec_()