How to extract only the text with a Scrapy selector in Python

Question

I have this code:

    site = hxs.select("//h1[@class='state']")
    log.msg(str(site[0].extract()), level=log.ERROR)

The output is:

    [scrapy] ERROR: <h1 class="state"><strong> 1</strong> <span> job containing <strong>php</strong> in <strong>region</strong> paying <strong>$30-40k per year</strong></span> </h1>

Is it possible to get only the text, without the HTML tags?

akhter wahab · Accepted Answer

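    //h1[@class='state']

The XPath above selects the h1 tag whose class attribute is state.

That is why it returns everything the h1 element contains, markup included.

If you only want the text of the h1 tag itself, use:

    //h1[@class='state']/text()

If you want the text of the h1 tag and of its child tags, use:

    //h1[@class='state']//text()

So /text() gives the text of one specific tag, and //text() gives the text of that tag together with all the tags nested inside it.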

The following code should work for you:

    site = ''.join(hxs.select("//h1[@class='state']/text()").extract()).strip()
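
The difference between the two forms is easy to see on the markup from the question, where almost all of the visible text sits inside child tags. The sketch below is not Scrapy code: it assumes only the standard library and imitates /text() and //text() with xml.etree.ElementTree, which works here because the fragment happens to be well-formed XML. The helper names direct_text and descendant_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <h1> fragment from the question (it happens to be well-formed XML).
HTML = (
    '<h1 class="state"><strong> 1</strong> <span> job containing '
    '<strong>php</strong> in <strong>region</strong> paying '
    '<strong>$30-40k per year</strong></span> </h1>'
)


def direct_text(elem):
    """Imitate h1/text(): only text nodes that are direct children of elem."""
    parts = [elem.text or ""]
    # Text that follows a child tag is stored as that child's "tail".
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)


def descendant_text(elem):
    """Imitate h1//text(): text of elem and of every tag nested inside it."""
    return "".join(elem.itertext())


h1 = ET.fromstring(HTML)

direct = direct_text(h1).strip()
# Collapse the runs of whitespace left over from the markup.
descendant = " ".join(descendant_text(h1).split())

print(repr(direct))  # '' (the h1 itself holds only whitespace)
print(descendant)    # 1 job containing php in region paying $30-40k per year
```

For this particular h1, then, the descendant form is the one that collects the sentence the page displays; the direct form is enough only when the text is not wrapped in child tags.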

Aminah Nuraini · Answer

You can use BeautifulSoup's get_text() method.

    from bs4 import BeautifulSoup

    text = '''
    <td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
    <br/><a href="http://www.fakewebsite.com">I am waiting....</a>
    </td>
    '''
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(text, "html.parser")
    print(soup.get_text())

E.Z. · Answer

I don't have a running Scrapy instance, so I couldn't test this. However, you can also use text() inside the search expression.

For example:

    site = hxs.select("//h1[@class='state']/text()")

(Taken from the tutorial.)

pm007 · Answer

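You can use BeautifulSoup to strip the HTML tags. Here is an example:

    # BeautifulSoup 3 import; with bs4 this would be "from bs4 import BeautifulSoup".
    from BeautifulSoup import BeautifulSoup

    ''.join(BeautifulSoup(str(site[0].extract())).findAll(text=True))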

After that, you can remove any extra whitespace, newlines, and so on.
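
That whitespace cleanup can be done with plain string methods. A minimal sketch, assuming the tags are already gone and the result is an ordinary string; raw and normalize_whitespace are hypothetical names, not part of Scrapy or BeautifulSoup.

```python
def normalize_whitespace(text):
    """Collapse spaces, tabs and newlines into single spaces and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(text.split())


# Example input shaped like the text left over once the tags are removed.
raw = " 1  job containing php in region\n paying $30-40k per year \n"
cleaned = normalize_whitespace(raw)
print(cleaned)  # 1 job containing php in region paying $30-40k per year
```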

If you don't want to use additional modules, you can try a simple regular expression:

    import re

    # replace html tags with ' '
    text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
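
A pattern like this can misfire, for example on a > inside an attribute value, and it leaves entities such as &amp;amp; undecoded. Another option that still needs no third-party module is the standard library's html.parser, which tokenizes the markup and hands over the text nodes. A sketch, assuming Python 3; TextExtractor and strip_tags are illustrative names, not an existing API.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in the text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse the leftover whitespace into single spaces.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


html = (
    '<h1 class="state"><strong> 1</strong> <span> job containing '
    '<strong>php</strong> in <strong>region</strong> paying '
    '<strong>$30-40k per year</strong></span> </h1>'
)
result = strip_tags(html)
print(result)  # 1 job containing php in region paying $30-40k per year
```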

Aminah Nuraini · Answer

You can use html2text.

    import html2text

    converter = html2text.HTML2Text()
    print(converter.handle("<div>Please!!!<span>remove me</span></div>"))