How can I get all the rows from a particular table using BeautifulSoup?

Question

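I am learning Python and BeautifulSoup to scrape data from the web and read an HTML table. I can load the page into OpenOffice, and it reports that the table I want is Table #11.

BeautifulSoup seems to be the preferred choice, but can anyone tell me how to grab a particular table and all of its rows? I have looked at the module documentation, but I can't get my head around it. Many of the examples I have found online appear to do more than I need.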

JJ Geewax · Accepted Answer

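This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method; you can then get the text value inside each cell with the string property.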

    >>> from BeautifulSoup import BeautifulSoup
    >>>
    >>> html = """
    ... <html>
    ... <body>
    ... <table>
    ... <th><td>column 1</td><td>column 2</td></th>
    ... <tr><td>value 1</td><td>value 2</td></tr>
    ... </table>
    ... </body>
    ... </html>
    ... """
    >>>
    >>> soup = BeautifulSoup(html)
    >>> tables = soup.findChildren('table')
    >>>
    >>> # This will get the first (and only) table. Your page may have more.
    >>> my_table = tables[0]
    >>>
    >>> # You can find children with multiple tags by passing a list of strings
    >>> rows = my_table.findChildren(['th', 'tr'])
    >>>
    >>> for row in rows:
    ...     cells = row.findChildren('td')
    ...     for cell in cells:
    ...         value = cell.string
    ...         print "The value in this cell is %s" % value
    ...
    The value in this cell is column 1
    The value in this cell is column 2
    The value in this cell is value 1
    The value in this cell is value 2
    >>>
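The session above targets BeautifulSoup 3 and Python 2. As a rough sketch of the same idea on Python 3 with the bs4 package (assumed to be installed), the hypothetical helper `table_rows` below also takes a zero-based table index, so the question's "Table #11" would presumably be index 10 if OpenOffice counts from 1. The built-in `html.parser` backend and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_index=0):
    """Return the rows of one table as lists of cell strings.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    my_table = tables[table_index]  # raises IndexError if there is no such table

    rows = []
    for row in my_table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; accept both.
        cells = row.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Sample markup (an assumption): header cells are <th> inside a normal <tr>.
sample_html = """
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td>value 2</td></tr>
</table>
"""

for row in table_rows(sample_html):
    print(row)
```

Returning plain lists keeps the parsing separate from whatever is done with the values afterwards. Note that this sketch searches recursively, so it does not distinguish rows of nested tables.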

Andriy Makukha · Answer

If you ever have nested tables (as on old-school websites), the above approach might fail.

As a solution, you might want to extract the non-nested tables first:

    html = '''<table>
    <tr>
    <td>Top level table cell</td>
    <td>
        <table>
        <tr><td>Nested table cell</td></tr>
        <tr><td>...another nested cell</td></tr>
        </table>
    </td>
    </tr>
    </table>'''

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]
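To see what that filter keeps, a small sketch can print the cell text of each selected table. It uses the built-in `html.parser` rather than `lxml` so that no extra parser is required; that substitution is an assumption made for portability, not what the answer uses.

```python
from bs4 import BeautifulSoup

html = '''<table>
  <tr>
    <td>Top level table cell</td>
    <td>
      <table>
        <tr><td>Nested table cell</td></tr>
        <tr><td>...another nested cell</td></tr>
      </table>
    </td>
  </tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")

# Keep only tables that contain no other table, i.e. the innermost ones.
non_nested_tables = [t for t in soup.find_all("table") if not t.find_all("table")]

for table in non_nested_tables:
    cell_texts = [td.get_text(strip=True) for td in table.find_all("td")]
    print(cell_texts)
```

Only the inner table passes the filter here, so the outer table's own cell ("Top level table cell") is not reached this way.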

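Alternatively, if you want to extract the content of all the tables, including those that nest other tables, you can extract only the top-level tr rows and their th/td cells. To do this, you need to turn off recursion when calling the find_all method: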

    soup = BeautifulSoup(html, 'lxml')
    tables = soup.find_all('table')
    cnt = 0
    for my_table in tables:
        cnt += 1
        print('=============== TABLE {} ==============='.format(cnt))
        rows = my_table.find_all('tr', recursive=False)              # <-- HERE
        for row in rows:
            cells = row.find_all(['th', 'td'], recursive=False)      # <-- HERE
            for cell in cells:
                # DO SOMETHING
                if cell.string:
                    print(cell.string)

Output:

    =============== TABLE 1 ===============
    Top level table cell
    =============== TABLE 2 ===============
    Nested table cell
    ...another nested cell
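One caveat with `recursive=False`: if the markup wraps rows in `<tbody>` (many pages do, and the `html5lib` parser inserts it), the rows are no longer direct children of the table, and `find_all('tr', recursive=False)` on the table finds nothing. A possible workaround, sketched below with a hypothetical helper `own_rows`, is to search recursively and keep only the rows whose nearest enclosing table is the table in question. The sample markup and the use of `html.parser` are assumptions.

```python
from bs4 import BeautifulSoup


def own_rows(table):
    """Return the <tr> elements that belong to `table` itself, not to nested tables.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    return [tr for tr in table.find_all("tr") if tr.find_parent("table") is table]


html = """
<table>
  <tbody>
    <tr>
      <td>Top level table cell</td>
      <td>
        <table>
          <tbody>
            <tr><td>Nested table cell</td></tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer, inner = soup.find_all("table")  # document order: outer table first

print(len(outer.find_all("tr", recursive=False)))  # rows are hidden inside <tbody>
print(len(own_rows(outer)))
print(len(own_rows(inner)))
```

Comparing with `is` rather than `==` matters here, since the check is about the same table element and not about two tables that merely look alike.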