Extracting selected columns from a table using BeautifulSoup

Question

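I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column uses <th> tags, and the other columns I am interested in use <td> tags. In any case, all I have managed to get is a list of the column's cells with their tags still attached. I just want the text.

table is already a list, so I cannot call findAll(text=True) on it. I am not sure how to get the first column's list in another form.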

    from BeautifulSoup import BeautifulSoup
    from sys import argv
    import re

    filename = argv[1]
    # get HTML file as a string
    html_doc = ''.join(open(filename, 'r').readlines())
    soup = BeautifulSoup(html_doc)
    table = soup.findAll('table')[0].tbody.th.findAll('th')  # The relevant table is the first one
    print table

jonhkr · Accepted Answer

You can try this code:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    for row in soup.findAll('table')[0].tbody.findAll('tr'):
        first_column = row.findAll('th')[0].contents
        third_column = row.findAll('td')[2].contents
        print first_column, third_column

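As you can see, the code connects to the URL and fetches the HTML. BeautifulSoup then finds the first table, selects every 'tr' in it, and from each row takes the first column, which is the 'th', and the third column, which is a 'td'.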
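Note that `.contents` returns a list of child nodes, not a plain string, so the loop above prints lists. Since the question asks for text only, here is a minimal sketch of the same row-by-row approach that returns strings. It assumes Python 3 and the `bs4` package (the answer itself uses the older `BeautifulSoup` 3 import). The sample markup and the helper name `extract_columns` are made up for illustration, and `td_index=2` simply mirrors the answer's `findAll('td')[2]`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
    <tr><td>row without a header cell</td></tr>
  </tbody>
</table>
"""


def extract_columns(html, td_index=2):
    """Return (th text, td text) pairs from the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    pairs = []
    if table is None:
        return pairs
    # Iterate over rows directly: html.parser does not add a <tbody>
    # when the markup lacks one, so table.tbody could be None.
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        # Skip rows that lack a <th> or do not have enough <td> cells.
        if header is None or len(cells) <= td_index:
            continue
        pairs.append((header.get_text(strip=True),
                      cells[td_index].get_text(strip=True)))
    return pairs


if __name__ == "__main__":
    for first, third in extract_columns(SAMPLE_HTML):
        print(first, third)
```

Skipping short rows avoids the `IndexError` that `row.findAll('th')[0]` raises on rows made up only of 'td' cells.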

mac389 · Answer

In addition to @jonhkr's answer, I thought I would post an alternative solution I came up with.

    #!/usr/bin/python
    from BeautifulSoup import BeautifulSoup
    from sys import argv

    filename = argv[1]
    # get HTML file as a string
    html_doc = ''.join(open(filename, 'r').readlines())
    soup = BeautifulSoup(html_doc)
    table = soup.findAll('table')[0].tbody

    data = map(lambda x: (x.findAll(text=True)[1], x.findAll(text=True)[5]),
               table.findAll('tr'))
    print data

Unlike jonhkr's answer, which connects to the web page, mine assumes you have saved the page to your computer and pass its filename as a command-line argument. For example:

python file.py table.html
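The positions `[1]` and `[5]` in the snippet above index into every text node of a row, including whitespace between tags, so they depend on how the saved file happens to be formatted. As a rough sketch of a less layout-sensitive variant, the version below indexes cells instead of text nodes. It assumes Python 3, the `bs4` package, and a UTF-8 encoded file. The helper name `columns_from_file` and the default `indices=(0, 2)` (first and third cell of each row) are illustrative choices, not part of the original answer.

```python
import sys

from bs4 import BeautifulSoup


def columns_from_file(path, indices=(0, 2)):
    """Return the selected cell texts for each row of the first table."""
    # Assumption: the saved page is UTF-8; adjust encoding if it is not.
    with open(path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, "html.parser")
    table = soup.find("table")
    rows = []
    if table is None:
        return rows
    for row in table.find_all("tr"):
        # One trimmed string per <th>/<td> cell, in document order.
        cells = [cell.get_text(strip=True)
                 for cell in row.find_all(["th", "td"])]
        # Keep only rows that have every requested column.
        if len(cells) > max(indices):
            rows.append(tuple(cells[i] for i in indices))
    return rows


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for row in columns_from_file(sys.argv[1]):
            print(*row)
```

It is invoked the same way, with the saved HTML file as the only argument.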

KUSHA B K · Answer

You can also try this code:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
    soup = BeautifulSoup(page.content, 'html.parser')

    for row in soup.findAll('table')[0].tbody.findAll('tr'):
        first_column = row.findAll('th')[0].contents
        third_column = row.findAll('td')[2].contents
        print(first_column, third_column)