n-grams in Python: four, five, six grams?

Question

I'm looking for a way to split a text into n-grams. Normally I would do something like:

    import nltk
    from nltk import bigrams

    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print(list(string_bigrams))

I know that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams, or even hundred-grams?

Thanks!

alvas · Accepted Answer

Great native Python-based answers from other users. But here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module in nltk that people seldom use. It's not because n-grams are hard to read, but because training a model on n-grams with n > 3 results in a lot of data sparsity.

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'

    n = 6
    sixgrams = ngrams(sentence.split(), n)

    for grams in sixgrams:
        print(grams)
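The data sparsity remark can be made concrete with a small experiment. This is only a rough sketch on a made-up toy corpus; `distinct_ratio` is a hypothetical helper, not part of nltk, and it assumes `n` is no larger than the number of tokens.

```python
def distinct_ratio(tokens, n):
    """Fraction of n-gram occurrences that are distinct (1.0 means nothing repeats).

    Assumes 1 <= n <= len(tokens), so at least one n-gram exists.
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams)


# Toy corpus, chosen only for illustration.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in range(1, 5):
    # The ratio creeps toward 1.0 as n grows: longer n-grams rarely repeat.
    print(n, round(distinct_ratio(tokens, n), 2))
```

On this corpus the ratio rises from 7/13 for unigrams to 8/10 for four-grams, so a model sees fewer and fewer repeated observations per n-gram as n increases.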

inspectorG4dget · Answer

I'm surprised that this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()

    In [35]: N = 4

    In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

    In [37]: for gram in grams: print(gram)
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']
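The slicing approach needs the whole token list in memory. When the tokens arrive as a stream (a file, a generator), one possible variation is a sliding window over a `collections.deque`. This is a sketch of a different technique from the answer above; `iter_ngrams` is a hypothetical name, and it assumes `n >= 1`.

```python
from collections import deque


def iter_ngrams(tokens, n):
    """Yield n-grams as tuples from any iterable, not only from a list."""
    window = deque(maxlen=n)  # automatically drops the oldest token when full
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield tuple(window)


# iter() is used here only to show that no indexing or len() is required.
words = iter("I really like python, it's pretty awesome.".split())
for gram in iter_ngrams(words, 4):
    print(gram)
```

If the input has fewer than `n` tokens, the generator simply yields nothing.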

M.A.Hassan · Answer

Here is another simple way to do n-grams:

    >>> import nltk
    >>> from nltk.util import ngrams
    >>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
    >>> tokenize = nltk.word_tokenize(text)
    >>> tokenize
    ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    >>> bigrams = list(ngrams(tokenize, 2))  # ngrams() returns a generator, so materialize it
    >>> bigrams
    [('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
    >>> trigrams = list(ngrams(tokenize, 3))
    >>> trigrams
    [('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
    >>> fourgrams = list(ngrams(tokenize, 4))
    >>> fourgrams
    [('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Δημητρης Παππάς · Answer

Using only nltk tools:

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Example output:

    >>> get_ngrams('This is the simplest text i could think of', 3)
    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep the n-grams in array form, just remove the ' '.join.

tzaman · Answer

You can easily whip up your own function to do this using itertools:

    from itertools import islice, tee  # Python 3: the built-in zip replaces izip

    s = 'spam and eggs'
    N = 3
    trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
    list(trigrams)
    # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
    #  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
    #  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
    #  ('e', 'g', 'g'), ('g', 'g', 's')]
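The example above iterates over a string, so it produces character n-grams. The same `tee`/`islice` idea works on a list of words. A minimal sketch, assuming Python 3 and whitespace tokenization; `window` is a hypothetical helper name:

```python
from itertools import islice, tee


def window(iterable, n):
    """Return an iterator of n-grams built from n staggered copies of iterable."""
    copies = tee(iterable, n)
    # Copy number k skips its first k items, so zip lines up consecutive items.
    shifted = (islice(copy, k, None) for k, copy in enumerate(copies))
    return zip(*shifted)


print(list(window('spam and eggs'.split(), 2)))
# [('spam', 'and'), ('and', 'eggs')]
```

Because `zip` stops at the shortest of the shifted copies, incomplete n-grams at the end of the input are dropped automatically.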

bhatman · Answer

People have already answered pretty nicely for the scenario where you need bigrams or trigrams, but if you need every n-gram of the sentence, you can use nltk.util.everygrams:

    >>> from nltk.util import everygrams
    >>> message = "who let the dogs out"
    >>> msg_split = message.split()
    >>> list(everygrams(msg_split))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

(This output comes from the NLTK version in use when the answer was written; newer releases may yield the same tuples in a different order.)

If you have a limit, as with trigrams where the maximum length should be 3, you can specify it with the max_len parameter:

    >>> list(everygrams(msg_split, max_len=2))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

By changing max_len you can get n-grams up to any length: four-grams, five-grams, six-grams, or even hundred-grams.

The solutions in the other answers can be modified to do the same thing, but this one is much more straightforward.


And when you only need one specific size, such as bigrams or trigrams, you can use nltk.util.ngrams as mentioned in M.A.Hassan's answer.
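For readers without nltk installed, the remark that the other solutions can be adapted to produce every n-gram might look roughly like this. It is a pure-Python sketch, not nltk's implementation; `all_ngrams` and its parameters are hypothetical names chosen to mirror the `max_len` idea.

```python
def all_ngrams(tokens, min_len=1, max_len=None):
    """Yield every n-gram with min_len <= n <= max_len, shortest first."""
    tokens = list(tokens)
    if max_len is None:
        max_len = len(tokens)  # default: up to the full sequence
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


msg_split = "who let the dogs out".split()
print(list(all_ngrams(msg_split, max_len=2)))
# [('who',), ('let',), ('the',), ('dogs',), ('out',),
#  ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]
```

Setting `min_len` and `max_len` to the same value reduces this to plain n-grams of a single size.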

sel · Answer

Four-grams are already in NLTK. Here is a piece of code that can help you with this:

    import nltk
    from nltk.collocations import QuadgramCollocationFinder

    # You should tokenize your text
    text = "I do not like green eggs and ham, I do not like them Sam I am!"
    tokens = nltk.wordpunct_tokenize(text)
    fourgrams = QuadgramCollocationFinder.from_words(tokens)
    for fourgram, freq in fourgrams.ngram_fd.items():
        print(fourgram, freq)

I hope it helps.

Serendipity · Answer

A more elegant approach is to build bigrams with Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once as it is and once offset by one element.

    string = "I really like python, it's pretty awesome."

    def find_bigrams(s):
        input_list = s.split(" ")
        return list(zip(input_list, input_list[1:]))

    def find_ngrams(s, n):
        input_list = s.split(" ")
        return list(zip(*[input_list[i:] for i in range(n)]))

    find_bigrams(string)
    # [('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]

Nik · Answer

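I have never dealt with nltk, but I did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D gives you the histogram of your N-word sequences.

    N = 2  # example value; set this to the n-gram size you need
    D = dict()
    string = 'whatever string...'
    strparts = string.split()
    for i in range(len(strparts) - N + 1):  # +1 so the last N-gram is counted too
        gram = tuple(strparts[i:i+N])
        D[gram] = D.get(gram, 0) + 1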
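The same histogram can be built with `collections.Counter` from the standard library, which handles the "first time seen" case itself. A short sketch, assuming whitespace tokenization; `ngram_histogram` is a hypothetical name:

```python
from collections import Counter


def ngram_histogram(text, n):
    """Map each whitespace-delimited n-gram in text to its number of occurrences."""
    parts = text.split()
    return Counter(tuple(parts[i:i + n]) for i in range(len(parts) - n + 1))


hist = ngram_histogram("to be or not to be", 2)
print(hist.most_common(1))
# [(('to', 'be'), 2)]
```

`Counter` also offers `most_common()`, which is convenient when only the most frequent n-grams matter.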

Yann Dubois · Answer

If efficiency is an issue and you have to build n-grams of several different sizes (up to a hundred, as you say), but you want to use pure Python, I would do:

    from itertools import chain

    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a listTokens"""
        shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
        shiftedTokens = (shiftToken(i) for i in range(n))
        tupleNGrams = zip(*shiftedTokens)
        return tupleNGrams  # if join in generator : (" ".join(i) for i in tupleNGrams)

    def range_ngrams(listTokens, ngramRange=(1, 2)):
        """Returns an iterator over all n-grams for n in range(ngramRange) given a listTokens."""
        return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage (the upper bound of ngramRange is exclusive, as with range):

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngramRange=(1, 4)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

About the same speed as NLTK:

    import nltk

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list,n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=1)
    nltk.ngrams(input_list,n=2)
    nltk.ngrams(input_list,n=3)
    nltk.ngrams(input_list,n=4)
    nltk.ngrams(input_list,n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngramRange=(1,6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Reposted from my previous answer.

Franck Dernoncourt · Answer

You can use sklearn.feature_extraction.text.CountVectorizer:

    import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

    ngram_size = 4
    string = ["I really like python, it's pretty awesome."]
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vect.fit(string)
    # Recent scikit-learn versions use get_feature_names_out(); older ones used get_feature_names()
    print('{1}-grams: {0}'.format(vect.get_feature_names_out().tolist(), ngram_size))

Output:

    4-grams: ['like python it pretty', 'python it pretty awesome', 'really like python it']

You can set ngram_size to any positive integer, so you can split a text into four-grams, five-grams, or even hundred-grams.

Daniel Pérez Rada · Answer

Nltk is great, but it can be overkill for some projects:

    import re

    def tokenize(text, ngrams=1):
        # Replace punctuation and whitespace runs with single spaces, then split
        text = re.sub(r'[\b\\"\'/\s+\,\.:\?;]', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        tokens = text.split()
        return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]

Usage example:

    >>> text = "This is an example text"
    >>> tokenize(text, 2)
    [('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
    >>> tokenize(text, 3)
    [('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

Joe Zhow · Answer

You can get all 4-grams to 6-grams with the code below, without any other packages:

    from itertools import chain

    def get_m_2_ngrams(input_list, min, max):
        for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
            yield ' '.join(s)

    def get_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

    if __name__ == '__main__':
        input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
        for s in get_m_2_ngrams(input_list, 4, 6):
            print(s)

The output is as follows:

    I am aware that
    am aware that nltk
    aware that nltk only
    that nltk only offers
    nltk only offers bigrams
    only offers bigrams and
    offers bigrams and trigrams
    bigrams and trigrams ,
    and trigrams , but
    trigrams , but is
    , but is there
    but is there a
    is there a way
    there a way to
    a way to split
    way to split my
    to split my text
    split my text in
    my text in four-grams
    text in four-grams ,
    in four-grams , five-grams
    four-grams , five-grams or
    , five-grams or even
    five-grams or even hundred-grams
    I am aware that nltk
    am aware that nltk only
    aware that nltk only offers
    that nltk only offers bigrams
    nltk only offers bigrams and
    only offers bigrams and trigrams
    offers bigrams and trigrams ,
    bigrams and trigrams , but
    and trigrams , but is
    trigrams , but is there
    , but is there a
    but is there a way
    is there a way to
    there a way to split
    a way to split my
    way to split my text
    to split my text in
    split my text in four-grams
    my text in four-grams ,
    text in four-grams , five-grams
    in four-grams , five-grams or
    four-grams , five-grams or even
    , five-grams or even hundred-grams
    I am aware that nltk only
    am aware that nltk only offers
    aware that nltk only offers bigrams
    that nltk only offers bigrams and
    nltk only offers bigrams and trigrams
    only offers bigrams and trigrams ,
    offers bigrams and trigrams , but
    bigrams and trigrams , but is
    and trigrams , but is there
    trigrams , but is there a
    , but is there a way
    but is there a way to
    is there a way to split
    there a way to split my
    a way to split my text
    way to split my text in
    to split my text in four-grams
    split my text in four-grams ,
    my text in four-grams , five-grams
    text in four-grams , five-grams or
    in four-grams , five-grams or even
    four-grams , five-grams or even hundred-grams
