UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

Question

I am learning about urllib2 and Beautiful Soup, and my first tests produce errors like this:

_UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128) _

There seem to be many posts about this type of error, and I have tried the solutions I could understand, but each one seems to come with a catch-22. For example:

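I want to print _post.text_ (where text is a Beautiful Soup attribute that returns only the text). Both str(post.text) and _post.text_ produce Unicode errors (on characters such as the right apostrophe _'_ and the ellipsis _..._).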

If I add post = unicode(post) above str(post.text), I get:

_AttributeError: 'unicode' object has no attribute 'text' _

I also tried _(post.text).encode()_ and _(post.text).renderContents()_. The latter produces the error:

_AttributeError: 'unicode' object has no attribute 'renderContents' _

Then I tried str(post.text).renderContents() and got the error:

_AttributeError: 'str' object has no attribute 'renderContents' _

It would be great if I could declare _'make this content 'interpretable''_ somewhere at the top of the document and still have access to the text function I need.

Update, after suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get:

_TypeError: unsupported operand type(s) for -: 'str' and 'int' _

If I add post = post.decode() above str(post.text), I get:

_AttributeError: 'unicode' object has no attribute 'text' _

If I add post = post.encode("utf-8") above _(post.text)_, I get:

_AttributeError: 'str' object has no attribute 'text' _
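The AttributeError results above appear to follow from rebinding post to a plain string, which has no .text attribute. The conversion has to be applied to the text being printed, not to the tag itself. A minimal sketch of the effect, using a hypothetical FakeTag class (not part of Beautiful Soup) in place of a real tag:

```python
class FakeTag(object):
    """Hypothetical stand-in for a Beautiful Soup tag with a .text attribute."""

    def __init__(self, text):
        self.text = text


post = FakeTag(u"Read more\u2026")
has_text_before = hasattr(post, "text")

# Rebinding the name to the encoded string discards the tag object,
# so any later post.text lookup raises AttributeError.
post = post.text.encode("utf-8")
has_text_after = hasattr(post, "text")

print("%s %s" % (has_text_before, has_text_after))
```

Keeping the tag in post and storing the converted text under a different name avoids the problem.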

I tried print post.text.encode('utf-8') and got:

_UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128) _
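A likely cause of this second error, though the question does not show the full print statement: in Python 2, combining a byte str with a unicode object (for example with +) makes Python decode the bytes with the ASCII codec first, and 0xe2 is the first byte of the UTF-8 encoding of characters such as U+2026. The sketch below reproduces only the decode step, written so that it also runs on Python 3:

```python
# UTF-8 bytes of U+2026 (horizontal ellipsis); the first byte is 0xe2.
encoded = u"\u2026".encode("utf-8")

try:
    # This is the step Python 2 performs implicitly when a byte str
    # is mixed with a unicode object.
    encoded.decode("ascii")
    message = ""
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```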

And, for the sake of trying something that might work, I installed lxml for Windows from here and used it like this:

_parsed_content = BeautifulSoup(original_content, "lxml") _

following http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps did not seem to make a difference.

I am using Python 2.7.4 and Beautiful Soup 4.

Solution:

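After gaining a deeper understanding of Unicode, UTF-8, and the Beautiful Soup types, I found that the problem lay in how I was printing. I removed all of my str() calls and concatenations, so str(something) + post.text + str(something_else) became _something, post.text, something_else_. It now seems to print fine, except that I have less control over the formatting at this stage (for example, a space is inserted at each _,_).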
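One way to get the formatting control back without returning to str() is to build a single unicode string and encode it once, at the point of output. This is only a sketch under assumptions: something, something_else, and the sample text are hypothetical values standing in for the real ones.

```python
something = 1                          # hypothetical non-text value
post_text = u"It\u2019s here\u2026"    # stands in for post.text
something_else = 2.5                   # hypothetical non-text value

# A unicode format string keeps the whole line as text and gives
# full control over separators and spacing.
line = u"{0}: {1} ({2})".format(something, post_text, something_else)

# Encode once, at the output boundary. On Python 3 this prints the
# bytes repr; on Python 2 it writes the raw UTF-8 bytes.
print(line.encode("utf-8"))
```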

icktoofay · Accepted Answer

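In Python 2, a unicode object can be printed only if it can be encoded for the output, and when Python cannot determine the output's encoding it falls back to ASCII. If the text cannot be encoded in ASCII, you get that error. You probably want to encode it explicitly and then print the resulting str: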

print post.text.encode('utf-8')
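To illustrate the difference between the two codecs, here is a minimal sketch written so that it also runs on Python 3 (where str is already Unicode). The sample string is an assumption standing in for post.text:

```python
# Hypothetical sample standing in for post.text; it ends with U+2026.
text = u"Read more\u2026"

# Encoding to ASCII fails because U+2026 is outside range(128).
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds and yields a byte string.
utf8_bytes = text.encode("utf-8")

print(ascii_failed)
print(utf8_bytes)
```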

Patpog · Answer

    import urllib.request
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(THE_URL).read()  # THE_URL: the page to fetch
    soup = BeautifulSoup(html)
    print("'" + str(soup.encode("ascii")) + "'")
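As a rough illustration of why encoding to ASCII sidesteps the error here: as far as I know, Beautiful Soup's encode replaces characters the target encoding cannot represent with numeric XML character references. The plain-Python sketch below mimics that with the standard xmlcharrefreplace error handler; it does not call Beautiful Soup, and the sample string is an assumption.

```python
# Hypothetical sample text containing U+2026 (horizontal ellipsis).
text = u"Read more\u2026"

# Characters that ASCII cannot represent become &#NNNN; references
# instead of raising UnicodeEncodeError.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(ascii_bytes)
```

The output is safe to print anywhere, but the non-ASCII characters show up as references rather than as the characters themselves.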

Worked for me ;-)

jeyraof · Answer

Did you try .decode() or .decode("utf-8")?

Also, I recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html