How do I get the text of a Wikipedia article using Python 3 with Beautiful Soup?

Question

I have this script written in Python 3:

    response = simple_get("https://en.wikipedia.org/wiki/Mathematics")
    result = {}
    result["url"] = url
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        title = html.select("#firstHeading")[0].text

As you can see, I can get the title of the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." up to the table of contents...

chitown88 · Accepted Answer

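Select the <p> tags. There are 52 elements. I'm not sure whether you want all of them, but you can iterate through those tags and store them as needed. I simply chose to print each one to show the output.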

    import bs4
    import requests

    response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

    if response is not None:
        html = bs4.BeautifulSoup(response.text, 'html.parser')

        title = html.select("#firstHeading")[0].text
        paragraphs = html.select("p")
        for para in paragraphs:
            print(para.text)

        # just grab the text up to contents as stated in question
        intro = '\n'.join([para.text for para in paragraphs[0:5]])
        print(intro)

alecxe · Answer

There is a much easier way to get information from Wikipedia: the Wikipedia API.

There is this Python wrapper, which lets you do it in a few lines with no HTML parsing at all:

    import wikipediaapi

    wiki_wiki = wikipediaapi.Wikipedia('en')

    page = wiki_wiki.page('Mathematics')
    print(page.summary)

Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)

And, in general, try to avoid screen scraping when a direct API is available.

QHarr · Answer

Use the wikipedia library:

    import wikipedia
    #print(wikipedia.summary("Mathematics"))
    #wikipedia.search("Mathematics")
    print(wikipedia.page("Mathematics").content)

SIM · Answer

You can get the desired output using the lxml library, as follows:

    import requests
    from lxml.html import fromstring

    url = "https://en.wikipedia.org/wiki/Mathematics"

    res = requests.get(url)
    source = fromstring(res.content)
    paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
    print(paragraph)

Using BeautifulSoup:

    from bs4 import BeautifulSoup
    import requests

    res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
    soup = BeautifulSoup(res.text, 'html.parser')
    for item in soup.find_all("p"):
        if item.text.startswith("The history"):
            break
        print(item.text)

Ilmari Karonen · Answer

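What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it: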

Probably the easiest way is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This gives you just the content HTML and nothing else.
Alternatively, you can obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.
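
As a rough sketch, both URLs can be assembled with the standard library instead of by hand. The helper names render_url and parse_api_url are hypothetical and used only for illustration, and English Wikipedia is assumed as the host. The parameters are the ones shown above.

```python
from urllib.parse import quote, urlencode

WIKI_BASE = "https://en.wikipedia.org"  # assumption: English Wikipedia


def render_url(title):
    """Build a page URL with action=render (content HTML only)."""
    # Page URLs conventionally use underscores in place of spaces.
    path = quote(title.replace(" ", "_"))
    return f"{WIKI_BASE}/wiki/{path}?action=render"


def parse_api_url(title, fmt="xml"):
    """Build a MediaWiki API URL that uses action=parse for the given page."""
    params = urlencode({"format": fmt, "action": "parse", "page": title})
    return f"{WIKI_BASE}/w/api.php?{params}"


print(render_url("Mathematics"))
print(parse_api_url("Mathematics"))
```

Either URL can then be fetched with requests or urllib, as in the other answers.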

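The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you would like a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this: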

https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories

(To also get the page content in the same request, use prop=langlinks|categories|text.)
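
A minimal sketch of reading those fields, assuming the request is made with format=json rather than format=xml, and assuming the response uses the JSON layout in which each entry carries its value under the "*" key. The SAMPLE string is hand-written and abridged for illustration; it is not real API output. build_query and extract_links_and_categories are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_query(page, props=("langlinks", "categories")):
    """Build an action=parse URL asking for the given props as JSON."""
    params = {
        "format": "json",
        "action": "parse",
        "page": page,
        "prop": "|".join(props),
    }
    # safe="|" leaves the pipe separator unescaped, as in the URL above.
    return API + "?" + urlencode(params, safe="|")


def extract_links_and_categories(raw_json):
    """Return ({language code: title}, [category names]) from a response."""
    parsed = json.loads(raw_json).get("parse", {})
    langlinks = {ll["lang"]: ll["*"] for ll in parsed.get("langlinks", [])}
    categories = [cat["*"] for cat in parsed.get("categories", [])]
    return langlinks, categories


# Hand-written, abridged sample (an assumption about the response shape).
SAMPLE = json.dumps({
    "parse": {
        "title": "Mathematics",
        "langlinks": [
            {"lang": "de", "*": "Mathematik"},
            {"lang": "fr", "*": "Mathématiques"},
        ],
        "categories": [{"sortkey": "", "*": "Mathematics"}],
    }
})

print(build_query("Mathematics"))
print(extract_links_and_categories(SAMPLE))
```

Using .get() with empty defaults means a response without a "parse" key yields empty results instead of raising an error.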

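There are also several Python libraries for using the MediaWiki API, although the feature sets they support vary. That said, using the API directly from your code, without a library in between, is perfectly possible too.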

LaSul · Answer

For a clean, function-based approach, you can simply query the JSON API that Wikipedia offers:

    from urllib.request import urlopen
    from urllib.parse import urlencode
    from json import loads


    def getJSON(page):
        params = urlencode({
            'format': 'json',
            'action': 'parse',
            'prop': 'text',
            'redirects': 'true',
            'page': page})
        API = "https://en.wikipedia.org/w/api.php"
        response = urlopen(API + "?" + params)
        return response.read().decode('utf-8')


    def getRawPage(page):
        parsed = loads(getJSON(page))
        try:
            title = parsed['parse']['title']
            content = parsed['parse']['text']['*']
            return title, content
        except KeyError:
            # The page doesn't exist
            return None, None

    title, content = getRawPage("Mathematics")

You can then parse it with whichever library you like to extract what you need :)
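
One possible way to do that last step, sketched with the standard library's html.parser instead of Beautiful Soup or lxml. It assumes the introduction is every <p> that appears before the first <h2> in the content HTML, which may not hold for every page. IntroParagraphs, intro_text and SAMPLE_HTML are hypothetical names, and the sample markup is hand-written, not real Wikipedia output.

```python
from html.parser import HTMLParser


class IntroParagraphs(HTMLParser):
    """Collect the text of <p> elements that start before the first <h2>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []
        self._in_paragraph = False
        self._seen_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # Assumption: the first <h2> marks the end of the introduction.
            self._seen_heading = True
        elif tag == "p" and not self._seen_heading:
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def intro_text(content_html):
    """Return the introductory paragraphs joined by newlines."""
    parser = IntroParagraphs()
    parser.feed(content_html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Hand-written sample standing in for the `content` value obtained above.
SAMPLE_HTML = (
    '<div class="mw-parser-output">'
    "<p><b>Mathematics</b> is a field of study.</p>"
    "<p>It has many <a href='/wiki/Branch'>branches</a>.</p>"
    "<h2>History</h2>"
    "<p>The history of mathematics is long.</p>"
    "</div>"
)

print(intro_text(SAMPLE_HTML))
```

With real data you would call intro_text(content) on the second value returned by getRawPage.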