How do I open an HTML file?

Question

I have an HTML file named test.html that contains a single word: בדיקה.

I open test.html and print its content using the following code block:

    file = open("test.html", "r")
    print file.read()

But it prints ??????. Why does this happen, and how can I fix it?

By the way, when I open a plain text file it works fine.

Edit: I have tried this:

    >>> import codecs
    >>> f = codecs.open("test.html",'r')
    >>> print f.read()
    ?????

vks · Accepted Answer

    import codecs
    f = codecs.open("test.html", 'r')
    print f.read()

Try something like this.

Benjamin · Answer

You can read an HTML page using urllib.

    # python 2.x
    import urllib
    page = urllib.urlopen("your path ").read()
    print page
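For readers on Python 3, here is a minimal sketch of the same idea. In Python 3 the function lives in urllib.request, and read() returns bytes that still need decoding. The helper name read_html and the UTF-8 default are assumptions for illustration, and the temporary file only exists so the sketch runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_html(url, encoding="utf-8"):
    """Fetch a URL and decode the response bytes.

    read_html is a hypothetical helper; the UTF-8 default is an assumption
    about how the page was saved.
    """
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Demo against a temporary local file, addressed through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "test.html"
    sample.write_text("<p>בדיקה</p>", encoding="utf-8")
    page = read_html(sample.as_uri())

# page now holds the decoded text of the file.
```

The same helper should work for an http:// or https:// address, provided the page really is encoded the way the encoding argument says.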

Dibin Joseph · Answer

You can use the following code:

    from __future__ import division, unicode_literals
    import codecs
    from bs4 import BeautifulSoup

    f = codecs.open("test.html", 'r', 'utf-8')
    document = BeautifulSoup(f.read()).get_text()
    print document

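If you want to delete all the blank lines in between and get all the words as a single string (also dropping special characters and numbers), then also include: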

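    import re
    import nltk
    from nltk.tokenize import word_tokenize

    st = ""  # accumulator for the filtered words
    docwords = word_tokenize(document)
    for line in docwords:
        line = line.rstrip()
        if line:
            # keep purely alphabetic tokens only
            if re.match("^[A-Za-z]*$", line):
                # `stop` is assumed to be a collection of stop words defined elsewhere
                if line not in stop and len(line) > 1:
                    st = st + " " + line
    print st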

* Define st as a string first, i.e. st = ""
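If installing bs4 and nltk is not an option, a rough standard-library approximation of the two steps above might look like the sketch below. It swaps BeautifulSoup for html.parser.HTMLParser and nltk's tokenizer for a plain whitespace split, so results can differ on punctuation, and text inside script or style tags is not skipped. TextExtractor, STOP_WORDS and html_to_words are hypothetical names, and the stop-word set is only a placeholder.

```python
import re
from html.parser import HTMLParser

# Placeholder stop-word set; the real list is an assumption left to the reader.
STOP_WORDS = {"the", "a", "an", "is", "and"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_words(html):
    """Return the alphabetic, non-stop words of an HTML string, space-joined."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)

    words = []
    for token in text.split():  # whitespace split, not nltk tokenization
        # Keep ASCII-letter tokens longer than one character that are
        # not stop words.
        if (re.match(r"^[A-Za-z]+$", token)
                and len(token) > 1
                and token.lower() not in STOP_WORDS):
            words.append(token)
    return " ".join(words)


result = html_to_words(
    "<html><body><h1>Test page</h1>\n\n<p>This is a test 123!</p></body></html>"
)
print(result)  # Test page This test
```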

wenzul · Answer

Use codecs.open with the encoding parameter.

    import codecs
    f = codecs.open("test.html", 'r', 'utf-8')

Chen Mier · Answer

I ran into this problem today as well. I am using Windows, and the system language is Chinese by default, so others may hit this Unicode error in the same way. Simply add encoding='utf-8':

    with open("test.html", "r", encoding='utf-8') as f:
        text = f.read()
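To see why the encoding argument matters, here is a small self-contained sketch for Python 3. It assumes the file was saved as UTF-8 (the temporary file below is written that way on purpose) and compares reading it back with the right codec and with a wrong one. Without the argument, open() falls back to a platform-dependent default, which is why the result can vary between machines.

```python
import os
import tempfile

word = "בדיקה"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")

    # Write the sample file as UTF-8 so its on-disk encoding is known.
    with open(path, "w", encoding="utf-8") as f:
        f.write(word)

    # Passing encoding explicitly avoids relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Decoding the same bytes with a different codec yields different text.
    with open(path, "r", encoding="latin-1") as f:
        garbled = f.read()

print(text == word)   # True
print(len(garbled))   # 10: each Hebrew letter takes two bytes in UTF-8
```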

SHUBHAM SINGH · Answer

Code:

    import codecs
    # raw string, so the backslashes in the Windows path are not treated as escapes
    path = r"D:\Users\html\abc.html"
    file = codecs.open(path, "rb")
    file1 = file.read()
    file1 = str(file1)
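One caveat for Python 3: reading in "rb" mode returns bytes, and calling str() on bytes produces their printable representation (beginning with b') rather than the decoded text. A minimal sketch of the difference follows, assuming the file is UTF-8 encoded and using a temporary file as a stand-in for the path above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abc.html")  # stand-in for the real path

    # Create a sample file; UTF-8 is an assumption about the real file.
    with open(path, "wb") as f:
        f.write("בדיקה".encode("utf-8"))

    with open(path, "rb") as f:
        raw = f.read()              # bytes, nothing decoded yet

as_repr = str(raw)                  # Python 3: the repr, starting with b'
decoded = raw.decode("utf-8")       # the actual text

print(as_repr.startswith("b'"))     # True
print(decoded == "בדיקה")           # True
```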