Converting a string to a list of words?

Question

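I'm trying to convert a string to a list of words using Python. I want to take something like the following:

    string = 'This is a string, with words!'

Then convert it to something like this:

    list = ['This', 'is', 'a', 'string', 'with', 'words']

Note the omission of punctuation and spaces. What would be the fastest way of going about this?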

Bryan · Accepted Answer

Try this:

    import re

    mystr = 'This is a string, with words!'
    wordList = re.sub(r"[^\w]", " ", mystr).split()

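How it works:

From the docs:

    re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function.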

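So in our case:

pattern is any non-alphanumeric character.

[\w] means any alphanumeric character and is equal to the character set [a-zA-Z0-9_]

a to z, A to Z, 0 to 9 and underscore. (That holds for ASCII input; in Python 3, \w also matches Unicode letters and digits unless the re.ASCII flag is set.)

So we match any non-alphanumeric character and replace it with a space.

Then we split() the result, which splits the string on whitespace and returns a list.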

So 'hello-world'

becomes 'hello world'

with re.sub

and then ['hello', 'world']

after split().

Let me know if any doubts come up.
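
The two steps described above can be traced on a small input. This is only a minimal sketch; the sample string and the variable names `text`, `spaced` and `words` are illustrative and not part of the answer.

```python
import re

text = 'hello-world'

# Step 1: replace every character that is not a word character with a space.
spaced = re.sub(r'[^\w]', ' ', text)

# Step 2: split on runs of whitespace to get the list of words.
words = spaced.split()

print(spaced)  # hello world
print(words)   # ['hello', 'world']
```

Because split() with no argument collapses consecutive whitespace, it does not matter that "string, with" turns into two spaces in a row after the substitution.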

gilgamar · Answer

I think this is the simplest way for anyone else stumbling on this post, given the late response:

    >>> string = 'This is a string, with words!'
    >>> string.split()
    ['This', 'is', 'a', 'string,', 'with', 'words!']

Tim McNamara · Answer

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

    >>> import nltk
    >>> paragraph = u"Hi, this is my first sentence. And this is my second."
    >>> sentences = nltk.sent_tokenize(paragraph)
    >>> for sentence in sentences:
    ...     nltk.word_tokenize(sentence)
    ...
    [u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
    [u'And', u'this', u'is', u'my', u'second', u'.']

JBernardo · Answer

Simplest way:

    >>> import re
    >>> string = 'This is a string, with words!'
    >>> re.findall(r'\w+', string)
    ['This', 'is', 'a', 'string', 'with', 'words']

mtrw · Answer

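Using string.punctuation for completeness:

    import re
    import string

    # s is the input string; re.escape keeps ']', '\' and '^' literal inside the class
    s = 'This is a string, with words!'
    x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()

This handles newlines as well.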
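
One thing worth checking with this approach is what happens to punctuation inside a word. The sketch below (sample string and variable names are assumptions for illustration) compares deleting punctuation with replacing it by a space, on input that also contains a newline.

```python
import re
import string

s = 'A custom-built string,\nwith words!'

# Build a character class from string.punctuation, escaped so that
# characters such as ']' and the backslash are treated literally.
punct = '[' + re.escape(string.punctuation) + ']'

# Deleting punctuation glues the two halves of a hyphenated word together.
deleted = re.sub(punct, '', s).split()

# Replacing punctuation with a space splits the hyphenated word instead.
spaced = re.sub(punct, ' ', s).split()

print(deleted)  # ['A', 'custombuilt', 'string', 'with', 'words']
print(spaced)   # ['A', 'custom', 'built', 'string', 'with', 'words']
```

In both cases the newline is handled by split(), which treats any run of whitespace as a separator.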

Cameron · Answer

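Well, you could use

    import re

    list = re.sub(r'[.!,;?]', ' ', string).split()

Note that list is the name of a built-in type and string is the name of a standard library module, so you probably don't want to use them as your variable names.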
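
To see why reusing the name list is risky, here is a small sketch. The function name `shadowing_demo` is hypothetical; the shadowing is kept inside the function so the built-in stays usable elsewhere.

```python
def shadowing_demo():
    # Inside this function, the local name `list` hides the built-in type.
    list = 'This is a string'.split()
    try:
        list('abc')  # now calls a list object, not the built-in constructor
    except TypeError:
        return 'list is not callable here'
    return 'no error'

print(shadowing_demo())  # list is not callable here
print(list('abc'))       # ['a', 'b', 'c']  (built-in is intact outside the function)
```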

tofutim · Answer

Using a regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
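
As one possible illustration of that idea, the sketch below assumes a "word" is a run of ASCII letters or digits, optionally joined by single apostrophes or hyphens. That definition, the name `WORD_RE` and the sample sentence are assumptions; a different definition may suit your data better.

```python
import re

# Assumed definition of a word: letters/digits, with single apostrophes
# or hyphens allowed only between two alphanumeric runs.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

text = "I'm a custom-built sentence, with 'quoted' words!"
words = WORD_RE.findall(text)

print(words)
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'quoted', 'words']
```

Because an apostrophe or hyphen must be followed by another alphanumeric run, quote marks around a word and trailing dashes are dropped while "I'm" and "custom-built" stay whole.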

Akhil Cherian Verghese · Answer

Personally, I think this is slightly cleaner than the answers provided:

    import re

    def split_to_words(sentence):
        # Use sentence.lower() if needed
        return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))

Paulo Freitas · Answer

Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:

    import re
    import string

    def extract_words(s):
        # Strip punctuation only from the start and end of each whitespace-separated token
        edge = '^[{0}]+|[{0}]+$'.format(re.escape(string.punctuation))
        return [re.sub(edge, '', w) for w in s.split()]

    >>> text = 'This is a string, with words!'
    >>> extract_words(text)
    ['This', 'is', 'a', 'string', 'with', 'words']

    >>> text = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
    >>> extract_words(text)
    ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']

sanchit · Answer

    word_list = mystr.split(" ", mystr.count(" "))

BenyaR · Answer

This way you eliminate every special character outside of the alphabet:

    def wordsToList(strn):
        L = strn.split()
        cleanL = []
        abc = 'abcdefghijklmnopqrstuvwxyz'
        ABC = abc.upper()
        letters = abc + ABC
        for e in L:
            word = ''
            # Keep only the alphabetic characters of each token
            for c in e:
                if c in letters:
                    word += c
            if word != '':
                cleanL.append(word)
        return cleanL

    s = 'She loves you, yea yea yea! '
    L = wordsToList(s)
    print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure if this is fast or optimal, or even the right way to program it.
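
Whether the loop is fast can be measured rather than guessed. Below is a rough sketch using timeit that compares a character loop with a regex-based variant removing the same characters. The names `words_loop` and `words_regex` are hypothetical, and timings will vary by machine and Python version, so no figures are shown.

```python
import re
import timeit

LETTERS = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')

def words_loop(strn):
    # Keep only ASCII letters of each whitespace-separated token.
    clean = []
    for token in strn.split():
        word = ''.join(c for c in token if c in LETTERS)
        if word:
            clean.append(word)
    return clean

def words_regex(strn):
    # Same result, but non-letters are removed with a regular expression.
    stripped = (re.sub(r'[^A-Za-z]', '', token) for token in strn.split())
    return [w for w in stripped if w]

sample = 'She loves you, yea yea yea! '
loop_result = words_loop(sample)
regex_result = words_regex(sample)

# Time 1000 calls of each variant; compare the two numbers on your own machine.
loop_time = timeit.timeit(lambda: words_loop(sample), number=1000)
regex_time = timeit.timeit(lambda: words_regex(sample), number=1000)
print(loop_result, regex_result)
print('loop: %.4fs  regex: %.4fs' % (loop_time, regex_time))
```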

guest201505281433 · Answer

This is from my attempt at a coding challenge that didn't allow regex:

    outputList = "".join((c if c.isalnum() or c=="'" else ' ') for c in inputStr ).split(' ')

The role of the apostrophe seems interesting.
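
One detail to watch in this one-liner is split(' '), which keeps an empty string wherever two separators are adjacent. The sketch below uses an illustrative input and variable names (`input_str`, `spaced`) to compare it with split() called without an argument.

```python
input_str = "Don't stop, it's working!"

# Keep alphanumerics and apostrophes; turn everything else into a space.
spaced = "".join(c if c.isalnum() or c == "'" else ' ' for c in input_str)

with_empties = spaced.split(' ')  # splits on every single space
without_empties = spaced.split()  # splits on runs of whitespace

print(with_empties)     # ["Don't", 'stop', '', "it's", 'working', '']
print(without_empties)  # ["Don't", 'stop', "it's", 'working']
```

Keeping the apostrophe preserves contractions such as "Don't", at the cost of also keeping apostrophes used as quote marks.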