BeautifulSoupでxpathを使用できますか？

Question

私はBeautifulSoupを使用してURLをスクレイピングしましたが、次のコードがありました

import urllib import urllib2 from BeautifulSoup import BeautifulSoup url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html" req = urllib2.Request(url) response = urllib2.urlopen(req) the_page = response.read() soup = BeautifulSoup(the_page) soup.findAll('td',attrs={'class':'empformbody'})

上記のコードでは、findAllを使用してタグとそれに関連する情報を取得できますが、xpathを使用します。 BeautifulSoupでxpathを使用することは可能ですか？可能であれば、誰でも私にサンプルコードを提供してください。

Martijn Pieters · Accepted Answer

いや、BeautifulSoup自体は、XPath式をサポートしていません。

代替ライブラリ lxml 、doesはXPath 1.0をサポートします。 BeautifulSoup互換モードがあり、Soupのように壊れたHTMLを解析しようとします。ただし、デフォルトのlxml HTMLパーサーは、壊れたHTMLを解析するのと同じくらい良い仕事をし、私はより速いと信じています。

ドキュメントをlxmlツリーに解析したら、.xpath()メソッドを使用して要素を検索できます。

import urllib2 from lxml import etree url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html" response = urllib2.urlopen(url) htmlparser = etree.HTMLParser() tree = etree.parse(response, htmlparser) tree.xpath(xpathselector)

興味を引く可能性があるのは、 CSS Selector support ;です。 CSSSelectorクラスはCSSステートメントをXPath式に変換し、td.empformbodyの検索をはるかに簡単にします。

from lxml.cssselect import CSSSelector td_empformbody = CSSSelector('td.empformbody') for elem in td_empformbody(tree): # Do something with these table cells.

完全なサークル：BeautifulSoup自体doesは非常に完全な CSSセレクターのサポート：

for cell in soup.select('table#foobar td.empformbody'): # Do something with these table cells.

Leonard Richardson · Answer

Beautiful SoupにはXPathがサポートされていないことを確認できます。

wordsforthewise · Answer

Martijnのコードは正しく機能しなくなりました（今では4年以上経過しています...）、etree.parse()行はコンソールに出力され、tree変数に値を割り当てません。 this を参照すると、リクエストとlxmlを使用してこれが機能することがわかりました。

from lxml import html import requests page = requests.get('http://econpy.pythonanywhere.com/ex/001.html') tree = html.fromstring(page.content) #This will create a list of buyers: buyers = tree.xpath('//div[@title="buyer-name"]/text()') #This will create a list of prices prices = tree.xpath('//span[@class="item-price"]/text()') print 'Buyers: ', buyers print 'Prices: ', prices

user3820561 · Answer

BeautifulSoupには、childern、directedの現在の要素から findNext という名前の関数があります。

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

上記のコードは、次のxpathを模倣できます。

div[class=class_value]/div[id=id_value]

Nikola · Answer

docs を検索しましたが、xpathオプションはないようです。また、SOの同様の質問で here を見るとわかるように、OPはxpathからBeautifulSoupへの翻訳を求めているので、私の結論は-いいえ、利用可能なxpath解析はありません。

Oleksandr Panchenko · Answer

lxmlをすべてシンプルに使用する場合：

tree = lxml.html.fromstring(html) i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

しかし、BeautifulSoup BS4をすべてシンプルに使用する場合：

最初に「//」と「@」を削除します
2番目-「=」の前にスターを追加します

この魔法を試してください：

soup = BeautifulSoup(html, "lxml") i_need_element = soup.select ('a[class*="shared-components"]')

ご覧のとおり、これはサブタグをサポートしていないため、「/ @ href」部分を削除します

David A · Answer

これはかなり古いスレッドですが、現在は回避策があり、その時点ではBeautifulSoupにはなかった可能性があります。

これが私がしたことの例です。「requests」モジュールを使用してRSSフィードを読み取り、そのテキストコンテンツを「rss_text」という変数で取得します。それを使用して、BeautifulSoupで実行し、xpath/rss/channel/titleを検索し、そのコンテンツを取得します。すべての栄光（ワイルドカード、複数のパスなど）が正確にXPathになっているわけではありませんが、探したい基本的なパスがある場合にのみ機能します。

from bs4 import BeautifulSoup rss_obj = BeautifulSoup(rss_text, 'xml') cls.title = rss_obj.rss.channel.title.get_text()