BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Question

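First of all, I am a complete beginner when it comes to Python. However, I have written some code that reads an RSS feed, opens each link, and extracts the text from the article. This is what I have so far: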

    from BeautifulSoup import BeautifulSoup
    import feedparser
    import re  # needed for the re.sub calls below
    import urllib

    # Dictionaries
    links = {}
    titles = {}

    # Variables
    n = 0

    rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

    # Parse the RSS feed
    feed = feedparser.parse(rss_url)

    # View the entire feed, one entry at a time
    for post in feed.entries:
        # Create variables from posts
        link = post.link
        title = post.title

        # Add the link to the dictionary
        n += 1
        links[n] = link

    for k,v in links.items():
        # Open the linked article page
        page = urllib.urlopen(v).read()
        page = str(page)
        soup = BeautifulSoup(page)

        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()

        # Strip ampersand codes and WATCH:
        page = re.sub('&\w+;','',page)
        page = re.sub('WATCH:','',page)

        # Print Page
        print(page)
        print(" ")

        # To stop after 3rd article, just whilst testing ** to be removed **
        if (k >= 3):
            break

This produces the following output:

    >>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
    Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago. Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago. The higher figures reflected the effects both of volume and exchange rate factors.

    The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

    The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations. In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

    >>>

The problem is that this is only the first paragraph of each article, but I need to show the entire article. Any help would be gratefully received.

Amanda · Accepted Answer

You're getting close!

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

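Using find (as you've noticed) stops after finding one result. If you want all of the paragraphs, you need find_all (spelled findAll in BeautifulSoup 3, which the question imports). If the pages are consistently formatted (I only looked at one), you could also use something like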

    soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
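As a rough sketch of both suggestions, the snippet below contrasts find with find_all and then narrows the search to a single container. It assumes the newer bs4 package (Beautiful Soup 4) rather than the BeautifulSoup 3 import used in the question, and the HTML string and the "article" id are made-up stand-ins for a downloaded page, not the real site's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded article page.
html = """
<html><body>
  <div id="nav"><p>Home | News</p></div>
  <div id="article">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag...
first = soup.find("p").get_text()

# ...while find_all() returns every matching tag, in document order.
all_paragraphs = [p.get_text() for p in soup.find_all("p")]

# Scoping to the article container first skips paragraphs elsewhere on the page.
body = soup.find("div", {"id": "article"})
article_text = "\n\n".join(p.get_text() for p in body.find_all("p"))

print(article_text)
```

Joining the paragraphs with a blank line is just one choice; any separator would do.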

connorbode · Answer

This works well for specific articles where all of the text is wrapped in <p> tags. The web is an ugly place, though, so that is not always the case.

Often, websites have text scattered all over the page, wrapped in different kinds of tags (for example, a <span>, a <div>, or an <li>).

To find all of the text nodes in the DOM, you can use soup.find_all(text=True).

This will also return some undesired text, such as the contents of <script> and <style> tags, so you need to filter out the text content of the elements you don't want.

    blacklist = [
        'style',
        'script',
        # other elements,
    ]

    text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]
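Here is a small self-contained version of the blacklist filter that can be run as-is. It is only a sketch: the HTML string is invented for illustration, it assumes bs4 4.4 or later (where string=True is the newer spelling of text=True), and the extra strip() step for whitespace-only nodes is an addition of mine, not part of the filter above.

```python
from bs4 import BeautifulSoup

# Made-up page containing both wanted and unwanted text.
html = """<html><head><style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><p>Hello</p><div>World</div></body></html>"""

soup = BeautifulSoup(html, "html.parser")

blacklist = ["style", "script"]

# Keep text nodes whose direct parent is not blacklisted,
# and drop nodes that contain nothing but whitespace.
text_elements = [
    t.strip()
    for t in soup.find_all(string=True)
    if t.parent.name not in blacklist and t.strip()
]

print(text_elements)
```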

If you are working with a known set of tags, you can take the opposite approach:

    whitelist = [
        'p'
    ]

    text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
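One thing to watch with the whitelist: t.parent is only the immediate parent, so text inside a nested tag such as <b> within a <p> is skipped. The sketch below shows that behaviour on an invented snippet and, as one possible variation that is not part of the answer above, checks every ancestor with find_parent instead. It assumes bs4 4.4 or later.

```python
from bs4 import BeautifulSoup

# Made-up markup: a paragraph with nested bold text, plus a non-whitelisted div.
html = "<body><p>First <b>bold</b> paragraph.</p><div>Sidebar</div><p>Second paragraph.</p></body>"

soup = BeautifulSoup(html, "html.parser")
whitelist = ["p"]

# Direct-parent check: the text inside <b> is lost because its parent is <b>, not <p>.
direct = [str(t) for t in soup.find_all(string=True) if t.parent.name in whitelist]

# Ancestor check: keep a text node if any enclosing tag is whitelisted.
nested = [
    str(t)
    for t in soup.find_all(string=True)
    if t.find_parent(whitelist) is not None
]

print(direct)
print(nested)
```

Which check fits better depends on the page: the direct-parent version is stricter, while the ancestor version keeps inline formatting such as bold or links inside the whitelisted tags.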