Submitting data through a web form and extracting the results

Question

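My Python level is novice. I have never written a web scraper or crawler. I have written Python code that connects to an API and extracts the data I want, but for some of the extracted data I also want to get the gender of the author. I found this web site, http://bookblog.net/gender/genie.php, but the downside is that there is no API available. I was wondering how to write Python that submits data to the form on the page and extracts the returned data. It would be a great help if I could get some guidance on this.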

This is the DOM of the form:

    <form action="analysis.php" method="POST">
      <textarea cols="75" rows="13" name="text"></textarea>
      <div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
      <p>
        <b>Genre:</b>
        <input type="radio" value="fiction" name="genre"> fiction&nbsp;&nbsp;
        <input type="radio" value="nonfiction" name="genre"> nonfiction&nbsp;&nbsp;
        <input type="radio" value="blog" name="genre"> blog entry
      </p>
      <p>
    </form>

DOM of the results page:

    <p> <b>The Gender Genie thinks the author of this passage is:</b> male! </p>

Acorn · Accepted Answer

You don't need mechanize for this. Just send the correct form data in a POST request.
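To make "the correct form data" concrete, here is a minimal standard-library sketch of what such a POST request carries. It only builds the request and never sends it. The helper name `build_post_request` is made up for illustration, and the field names (`text`, `genre`, `submit`) are the ones used elsewhere in this answer, not something verified against the live site.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

ANALYSIS_URL = 'http://bookblog.net/gender/analysis.php'


def build_post_request(text, genre):
    """Build (but do not send) a POST request for the analysis form.

    Hypothetical helper for illustration only.
    """
    form_data = {'text': text, 'genre': genre, 'submit': 'submit'}
    # Form fields travel URL-encoded in the request body.
    body = urlencode(form_data).encode('ascii')
    return Request(ANALYSIS_URL, data=body, method='POST')


request = build_post_request('I have a beard!', 'blog')
print(request.get_method())
print(request.data.decode('ascii'))
# Decoding the body again shows the fields the server would receive.
print(parse_qs(request.data.decode('ascii')))
```

Passing the same dictionary to `requests.post(url, data=...)` should produce an equivalent body, so this is mostly useful for seeing what goes over the wire.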

Also, parsing HTML with regular expressions is not a good idea. You are better off using an HTML parser such as lxml.html.

    import requests
    import lxml.html as lh


    def gender_genie(text, genre):
        url = 'http://bookblog.net/gender/analysis.php'
        caption = 'The Gender Genie thinks the author of this passage is:'

        # 'text' and 'genre' are the field names used by the form.
        form_data = {
            'text': text,
            'genre': genre,
            'submit': 'submit',
        }

        response = requests.post(url, data=form_data)
        tree = lh.document_fromstring(response.content)

        # The verdict is the text that follows the <b> caption element.
        return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()


    if __name__ == '__main__':
        print(gender_genie('I have a beard!', 'blog'))
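If installing lxml is not an option, the same "text after the `<b>` caption" idea can be sketched with the standard library's `html.parser`. This is a stand-in for the lxml approach, not a replacement for it. The class and function names are made up for illustration, and the sample HTML is the results snippet from the question.

```python
from html.parser import HTMLParser

CAPTION = 'The Gender Genie thinks the author of this passage is:'


class CaptionTailParser(HTMLParser):
    """Collect the first non-empty text that follows <b>CAPTION</b>."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.caption_seen = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold and data.strip() == CAPTION:
            self.caption_seen = True
        elif self.caption_seen and self.result is None and data.strip():
            # First non-empty text after the caption is the verdict.
            self.result = data.strip()


def extract_verdict(html):
    """Return the verdict text, or None if the caption is not found."""
    parser = CaptionTailParser()
    parser.feed(html)
    parser.close()
    return parser.result


sample = ('<p> <b>The Gender Genie thinks the author of this passage is:</b>'
          ' male! </p>')
print(extract_verdict(sample))
```

Like the lxml version, this keeps the trailing `!`, and it returns `None` instead of raising when the caption is missing.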

brandizzi · Answer

You can use mechanize to submit the form and retrieve the response, and the re module to pull out what you need. For example, the script below does this for the text of your own question:

    import re
    from mechanize import Browser

    text = """
    My python level is Novice. I have never written a web scraper
    or crawler. I have written a python code to connect to an api and
    extract the data that I want. But for some the extracted data I want to
    get the gender of the author. I found this web site
    http://bookblog.net/gender/genie.php but downside is there isn't an api
    available. I was wondering how to write a python to submit data to the
    form in the page and extract the return data. It would be a great help
    if I could get some guidance on this."""

    browser = Browser()
    browser.open("http://bookblog.net/gender/genie.php")

    browser.select_form(nr=0)

    browser['text'] = text
    browser['genre'] = ['nonfiction']

    response = browser.submit()

    content = response.read()

    result = re.findall(
        r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
        content)

    print(result[0])

What does it do? It creates a mechanize.Browser and opens the given URL:

    browser = Browser()
    browser.open("http://bookblog.net/gender/genie.php")

Then it selects the form (there is only one form to fill in, so it is the first one):

    browser.select_form(nr=0)

It also sets the form's entries...

    browser['text'] = text
    browser['genre'] = ['nonfiction']

...and submits it:

    response = browser.submit()

Now we read the result:

    content = response.read()

We know the result has the following format:

    <b>The Gender Genie thinks the author of this passage is:</b> male!

So we write a regular expression to match it and use re.findall():

    result = re.findall(
        r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
        content)

The result is now available for you to use:

    print(result[0])
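One caveat with `result[0]`: if the page comes back without the expected sentence (an error page, for example), `re.findall()` returns an empty list and indexing it raises `IndexError`. Below is a small offline sketch of a more defensive variant. It is a variation on the pattern above rather than the exact same regex: it uses `\s*` and `\w+` so it tolerates extra whitespace and never matches an empty word. The function name `parse_verdict` is made up for illustration.

```python
import re

# Variation on the answer's pattern: flexible whitespace, non-empty word.
RESULT_PATTERN = re.compile(
    r'<b>The Gender Genie thinks the author of this passage is:</b>'
    r'\s*(\w+)!')


def parse_verdict(content):
    """Return the matched word, or None if the page has no verdict."""
    match = RESULT_PATTERN.search(content)
    return match.group(1) if match else None


sample = ('<b>The Gender Genie thinks the author of this passage is:</b>'
          ' male! ')
print(parse_verdict(sample))
print(parse_verdict('<p>Service unavailable</p>'))
```

The caller can then check for `None` instead of wrapping the lookup in a `try`/`except`.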

jan zegan · Answer

You can use mechanize; see its examples for details.

    from mechanize import ParseResponse, urlopen, urljoin

    uri = "http://bookblog.net"

    # Fetch the page and parse the forms it contains.
    response = urlopen(urljoin(uri, "/gender/genie.php"))
    forms = ParseResponse(response, backwards_compat=False)
    form = forms[0]
    # print(form)

    # Fill in the fields, then submit the form and print the response.
    form['text'] = 'cheese'
    form['genre'] = ['fiction']

    print(urlopen(form.click()).read())
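To get a feel for the information a form parser works with (where the form posts to, and which named fields it offers), here is a rough standard-library sketch run against a trimmed copy of the form markup from the question. It is not how mechanize implements `ParseResponse`; the class name `FormFieldParser` and the trimmed `FORM_HTML` string are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = 'http://bookblog.net/gender/genie.php'

# Trimmed copy of the form from the question.
FORM_HTML = (
    '<form action="analysis.php" method="POST">'
    '<textarea cols="75" rows="13" name="text"></textarea>'
    '<p><input type="radio" value="fiction" name="genre"> fiction '
    '<input type="radio" value="nonfiction" name="genre"> nonfiction '
    '<input type="radio" value="blog" name="genre"> blog entry</p>'
    '</form>'
)


class FormFieldParser(HTMLParser):
    """Record the action, method, and named fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = {}  # field name -> list of values offered in the markup
        self.in_form = False
        self.done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self.done:
            self.in_form = True
            self.action = attrs.get('action')
            self.method = (attrs.get('method') or 'GET').upper()
        elif self.in_form and tag in ('input', 'textarea', 'select'):
            name = attrs.get('name')
            if name:
                values = self.fields.setdefault(name, [])
                if attrs.get('value') is not None:
                    values.append(attrs['value'])

    def handle_endtag(self, tag):
        if tag == 'form' and self.in_form:
            self.in_form = False
            self.done = True


parser = FormFieldParser()
parser.feed(FORM_HTML)
parser.close()

# A relative action is resolved against the URL of the page with the form.
target_url = urljoin(PAGE_URL, parser.action)
print(parser.method, target_url)
print(parser.fields)
```

The resolved target is the `analysis.php` URL that the accepted answer posts to directly, and the `genre` entry lists the three radio-button values the form accepts.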