Forming bigrams of words in a list of sentences with Python

Question

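I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do this, I get pairs of sentences instead of pairs of words. Here is what I did: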

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here the whole sentences 'cant railway station' and 'citadel hotel' form a single bigram. What I want is

[([cant],[railway]),([railway],[station]),([citadel,hotel]), and so on...

The last word of the first sentence should not be paired with the first word of the second sentence. How can I make this work?

butch · Accepted Answer

Using a list comprehension and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
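One caveat when applying this to the list from the question: `' police stn'` starts with a space, and `split(" ")` keeps an empty string for it, so the result would contain a pair like `('', 'police')`. A minimal sketch of the same zip idea using `split()` with no argument, which collapses runs of whitespace (the function name `sentence_bigrams` is just an illustrative choice):

```python
def sentence_bigrams(sentences):
    """Return word bigrams per sentence, never pairing words across sentences."""
    pairs = []
    for sentence in sentences:
        # split() with no argument ignores leading, trailing and repeated whitespace
        words = sentence.split()
        # zip stops at the shorter sequence, so a one-word sentence yields no pairs
        pairs.extend(zip(words, words[1:]))
    return pairs


text = ['cant railway station', 'citadel hotel', ' police stn']
print(sentence_bigrams(text))
```

Each sentence is split and paired on its own, so 'station' is never paired with 'citadel'.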

Dan · Answer

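Rather than turning your text into lists of strings, start with each sentence separately as a string. I have also removed punctuation and stopwords; just remove those parts if they are not relevant to you.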

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    # Score candidate bigrams with chi-square and keep up to the 500 best
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # Append each bigram to the token list as a single "word1 word2" string
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # Stem and lowercase; drop stopwords and any entry of 8 characters or fewer
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

To use it, do something like this:

for line in sentence:
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually scores the bigrams statistically (which will come in handy when training a model).
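To give a feel for what statistically scoring a bigram means, here is a rough pure-Python sketch of a 2x2 chi-square statistic over adjacent word pairs. It is not NLTK's implementation: the function name `chi_square_scores` is made up for illustration, the counts are taken over bigrams only, and NLTK's own measure may use different marginals, so the numbers need not match.

```python
from collections import Counter


def chi_square_scores(tokens):
    """Score each adjacent word pair with a 2x2 chi-square statistic (illustrative sketch)."""
    pairs = list(zip(tokens, tokens[1:]))
    total = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    second_counts = Counter(second for _, second in pairs)

    scores = {}
    for (first, second), both in pair_counts.items():
        only_first = first_counts[first] - both      # same first word, different second word
        only_second = second_counts[second] - both   # different first word, same second word
        neither = total - both - only_first - only_second
        denominator = (
            (both + only_first) * (only_second + neither)
            * (both + only_second) * (only_first + neither)
        )
        if denominator == 0:
            # Degenerate table (for example only one distinct pair); leave it unscored
            scores[(first, second)] = 0.0
            continue
        numerator = total * (both * neither - only_first * only_second) ** 2
        scores[(first, second)] = numerator / denominator
    return scores


tokens = "a b a b c d".split()
scores = chi_square_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs whose words almost always appear together get a higher score than pairs whose words also appear with other neighbours, which is the kind of ranking that nbest relies on.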

gurinder · Answer

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    token = word_tokenize(line)
    bigram = list(ngrams(token, 2))  # the '2' means bigram; change it to get n-grams of a different size
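Note that the loop above rebinds `bigram` on every pass, so only the last sentence's pairs survive unless they are collected. A small sketch that gathers the n-grams of every sentence into one list, with `n` as a parameter. To stay dependency-free it swaps `word_tokenize` for plain `str.split()` and builds the n-grams with `zip`; the name `sentence_ngrams` is hypothetical.

```python
def sentence_ngrams(sentences, n=2):
    """Collect word n-grams from every sentence without crossing sentence boundaries."""
    grams = []
    for sentence in sentences:
        tokens = sentence.split()  # stand-in for word_tokenize; splits on whitespace only
        # Zip n staggered views of the token list: tokens[0:], tokens[1:], ...
        grams.extend(zip(*(tokens[i:] for i in range(n))))
    return grams


text = ['cant railway station', 'citadel hotel', 'police stn']
print(sentence_ngrams(text, 2))
print(sentence_ngrams(text, 3))
```

A sentence with fewer than `n` words simply contributes nothing, because the shortest staggered view is empty.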

alfasin · Answer

Without NLTK:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])
print(ans)
# prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]

Tanveer Alam · Answer

>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

This uses the enumerate and split functions.

Jay Marm · Answer

Just fixing Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

avi · Answer

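Read the dataset (the snippets in this answer presumably assume `import pandas as pd` and `import nltk as nk`):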

df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")

Collect all available months:

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month:

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month:

bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))

Count bigrams per month:

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes:

month1 = pd.DataFrame(data = count_bigrams[0], index= bigrams[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams[1], index= bigrams[1], columns= ["Count"])
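Two things may be worth checking in the steps above. As far as I can tell, the grouped `.sum()` joins the tweet strings directly, so the last word of one tweet can run into the first word of the next, and `x.count(item)` rescans the whole list for every bigram. A rough pandas-free sketch of the counting idea with `collections.Counter`, pairing words within each tweet only; the sample rows are made-up stand-ins for the "Month" and "Contents" columns, and `count_bigrams_by_month` is a hypothetical name.

```python
from collections import Counter, defaultdict


def count_bigrams_by_month(rows):
    """rows: iterable of (month, text) pairs; returns {month: Counter of word bigrams}."""
    counts = defaultdict(Counter)
    for month, text in rows:
        words = text.split()
        # Pair words within one text only, so no bigram spans two tweets
        counts[month].update(zip(words, words[1:]))
    return dict(counts)


# Hypothetical sample data, not taken from the original dataset
rows = [
    ("01", "good morning new york"),
    ("01", "new york is cold"),
    ("02", "hello new york"),
]
result = count_bigrams_by_month(rows)
print(result["01"].most_common(1))
```

Each Counter holds one entry per distinct bigram, so it can be turned into a table without duplicate rows.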

saicharan · Answer

There are a number of ways to do this, but I solved it this way:

>>> from nltk import bigrams
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> # A list comprehension would also work here
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]