Getting content by class name using Beautiful Soup

Question

Using the Beautiful Soup module, how can I get the data of the div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

or:

soup.find_all('class')

This is the HTML source:

<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
        <div class="feeditembody">
            <span>The actual data is some where here</span>
        </div>
    </div>
</div>

This is the Python code:

from BeautifulSoup import BeautifulSoup

html_doc = open('home.jsp.html', 'r')
soup = BeautifulSoup(html_doc)
class="feeditemcontent cxfeeditemcontent"

jadkik94 · Accepted Answer

Try this. It may be overkill for something this simple, but it works:

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10

Leonard Richardson · Answer

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:

from bs4 import BeautifulSoup

def match_class(target):
    def do_match(tag):
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

soup = BeautifulSoup(html)
print soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"]))
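For anyone on Python 3, the same idea can be sketched as below. This is only a minimal sketch: it assumes bs4 is installed, reuses the HTML from the question, and passes "html.parser" explicitly, which is my own choice and not part of the answer above.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""


def match_class(target):
    """Return a filter that accepts tags carrying every class in target."""
    def do_match(tag):
        # bs4 exposes "class" as a list; tags without one get an empty list.
        classes = tag.get("class", [])
        return all(c in classes for c in target)
    return do_match


# Parser name is an assumption; any installed parser should work here.
soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"]))
print(len(matches))
```

Because the check is done per class, the order of the names in the list should not matter.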

Aziz Alto · Answer

soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")

So if you want to get all div tags of class header, <div class="header">, from stackoverflow.com, an example with BeautifulSoup would be something like:

from bs4 import BeautifulSoup as bs
import requests

url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html)

tags = soup.findAll("div", class_="header")

It is already in the bs4 documentation.
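One thing worth keeping in mind with the multi-class string form: as far as I understand the bs4 documentation, class_="a b" is compared against the exact attribute string, so the order of the classes matters. A minimal sketch of the difference, assuming a bs4 version with select() support and the HTML from the question (the variable names are mine):

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Same order as in the attribute: treated as an exact string match.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# Classes swapped: the string no longer equals the attribute value.
swapped = soup.find_all("div", class_="cxfeeditemcontent feeditemcontent")

# A CSS selector matches on both classes regardless of their order.
by_selector = soup.select("div.cxfeeditemcontent.feeditemcontent")

print(len(exact), len(swapped), len(by_selector))
```

If the class order in the page is not guaranteed, the selector form is probably the safer choice.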

user1438327 · Answer

from BeautifulSoup import BeautifulSoup

f = open('a.htm')
soup = BeautifulSoup(f)

list = soup.findAll('div', attrs={'id':'abc def'})
print list

Jordan Dimov · Answer

soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
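Since the question asks for the data inside the div, here is a minimal sketch of pulling the text out once the tag is found. It assumes bs4 on Python 3 and the HTML from the question; the None check and the variable names are my own additions.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
div = soup.find("div", {"class": "feeditemcontent cxfeeditemcontent"})

# get_text(strip=True) drops the whitespace-only text between nested tags.
text = div.get_text(strip=True) if div is not None else None
print(text)
```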

UltraInstinct · Answer

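Check out this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304

As you can see, Beautiful Soup does not really understand class="a b" as two classes a and b.

However, as pointed out in the first comment there, a simple regular expression should suffice. In your case:

import re

soup = BeautifulSoup(html_doc)
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print "result: ", x

Note: this has been fixed in the recent beta. I have not gone through the docs of the recent versions, so you may be able to do it there directly. Or, if you want to get it working with the older version, you can use the above.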
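For comparison, a regular expression can also be passed as a class filter in bs4. The following is a minimal sketch under the assumption that bs4 on Python 3 is in use, with the HTML from the question; as I understand it, the pattern is tested against each class of a tag.

```python
import re

from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# \b keeps the pattern from matching inside "cxfeeditemcontent".
pattern = re.compile(r"\bfeeditemcontent\b")
matches = soup.find_all("div", class_=pattern)

for tag in matches:
    print("result:", tag["class"])
```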