Extracting table contents from HTML with Python and BeautifulSoup

Question

I want to extract certain information from an HTML document. For example, it contains a table like this (among other tables with other contents):

    <table class="details">
      <tr>
        <th>Advisory:</th>
        <td>RHBA-2013:0947-1</td>
      </tr>
      <tr>
        <th>Type:</th>
        <td>Bug Fix Advisory</td>
      </tr>
      <tr>
        <th>Severity:</th>
        <td>N/A</td>
      </tr>
      <tr>
        <th>Issued on:</th>
        <td>2013-06-13</td>
      </tr>
      <tr>
        <th>Last updated on:</th>
        <td>2013-06-13</td>
      </tr>
      <tr>
        <th valign="top">Affected Products:</th>
        <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
      </tr>
    </table>

I want to extract information such as the date in "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:

    from bs4 import BeautifulSoup
    soup=BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    table_tag=soup.table
    if table_tag['class'] == ['details']:
        print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
        a=table_tag.next_sibling
        print unicode(a)
        print table_tag.contents

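This gets me the contents of the first table row, and also a listing of the contents. But `next_sibling` is not working right; I guess I am just using it wrong. Of course I could walk through `contents` myself, but it seems to me that BeautifulSoup was designed to spare us from doing exactly that (if I start parsing by hand, I might as well parse the whole document...). If someone could show me how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.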

falsetru · Accepted Answer

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    >>> table = soup.find('table', {'class': 'details'})
    >>> th = table.find('th', text='Issued on:')
    >>> th
    <th>Issued on:</th>
    >>> td = th.findNext('td')
    >>> td
    <td>2013-06-13</td>
    >>> td.text
    u'2013-06-13'
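
For readers on Python 3, the same lookup can be sketched roughly as follows. This is an illustration, not part of the original answer: the `html` sample (including the extra `other` table) and the helper `details_to_dict` are made up for the example, and it assumes a reasonably recent bs4 where `find_next` and the `string=` argument are the current spellings of `findNext` and `text=`. It also shows one likely reason `next_sibling` looked broken in the question: the node right after a tag is often a whitespace string, not the next tag.

```python
from bs4 import BeautifulSoup

# Sample document modeled on the table in the question (shortened);
# the "other" table is a made-up stand-in for the unrelated tables.
html = """
<table class="other"><tr><th>Foo:</th><td>bar</td></tr></table>
<table class="details">
  <tr> <th>Advisory:</th> <td>RHBA-2013:0947-1</td> </tr>
  <tr> <th>Type:</th> <td>Bug Fix Advisory</td> </tr>
  <tr> <th>Issued on:</th> <td>2013-06-13</td> </tr>
  <tr> <th valign="top">Affected Products:</th>
       <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td> </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the table by class instead of taking the first <table> in the document.
table = soup.find("table", class_="details")

# Same lookup as in the answer, with the Python 3 era method names.
th = table.find("th", string="Issued on:")
issued_on = th.find_next("td").get_text()
print(issued_on)


def details_to_dict(table):
    """Hypothetical helper: map each <th> label (colon stripped) to its <td> text."""
    result = {}
    for row in table.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None or cell is None:
            continue  # skip rows that are not label/value pairs
        label = header.get_text(strip=True).rstrip(":")
        result[label] = cell.get_text(strip=True)
    return result


details = details_to_dict(table)
print(details["Affected Products"])

# next_sibling returns the very next node, which here is the whitespace
# between </th> and <td>, not the <td> itself.
first_th = table.find("th")
print(repr(first_th.next_sibling))

# find_next_sibling skips text nodes and returns the next matching tag.
print(first_th.find_next_sibling("td").get_text())
```

Building a dictionary once is convenient when several fields are needed; for a single field, the `find` plus `find_next` pair from the answer is enough.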