How to completely remove a word from a Word2Vec model in gensim?

Question

Given a model, e.g.:

    from gensim.models.word2vec import Word2Vec

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    texts = [d.lower().split() for d in documents]

    w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It is possible to remove a word from the w2v vocabulary:

    # Originally, it's there.
    >>> print(w2v_model['graph'])
    [-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]
    >>> print(w2v_model.wv.vocab['graph'])
    Vocab(count:3, index:5, sample_int:750148289)

    # Find most similar words.
    >>> print(w2v_model.most_similar('graph'))
    [('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

    # We can delete it from the dictionary
    >>> del w2v_model.wv.vocab['graph']
    >>> print(w2v_model['graph'])
    KeyError: "word 'graph' not in vocabulary"

But when we look up the most similar words for other words after deleting graph, the word graph still pops up:

    >>> w2v_model.most_similar('binary')
    [('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]

How can I completely remove a word from a Word2Vec model in gensim?

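Updated

To answer @vumaasha's comment:

"Could you give some details as to why you want to delete a word?"

Let's say my universe of words is every word in the corpus, so that the model learns the dense relations between all words.
But when I want to generate similar words, they should come only from a subset of domain-specific words.
It is possible to generate more than enough results from .most_similar() and then filter them, but if the space of the specific domain is small, I might be looking for a word that is ranked 1000th most similar, which is inefficient.
It would be better if the word were removed from the word vectors entirely, so that .most_similar() never returns words outside the specific domain.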

zsozso · Answer

I have written a function that removes from a KeyedVectors object every word that is not in a predefined word list.

    def restrict_w2v(w2v, restricted_word_set):
        new_vectors = []
        new_vocab = {}
        new_index2entity = []
        new_vectors_norm = []

        # Note: w2v.vectors_norm is None until init_sims() (or most_similar()) has run.
        for i in range(len(w2v.vocab)):
            word = w2v.index2entity[i]
            vec = w2v.vectors[i]
            vocab = w2v.vocab[word]
            vec_norm = w2v.vectors_norm[i]
            if word in restricted_word_set:
                # Re-number the kept word so its index matches its new row.
                vocab.index = len(new_index2entity)
                new_index2entity.append(word)
                new_vocab[word] = vocab
                new_vectors.append(vec)
                new_vectors_norm.append(vec_norm)

        w2v.vocab = new_vocab
        w2v.vectors = new_vectors
        w2v.index2entity = new_index2entity
        w2v.index2word = new_index2entity
        w2v.vectors_norm = new_vectors_norm

It rewrites all of the word-related variables, based on Word2VecKeyedVectors.

Usage:

    from gensim.models import KeyedVectors

    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
    w2v.most_similar("beer")

    [('beers', 0.8409687876701355),
     ('lager', 0.7733745574951172),
     ('Beer', 0.71753990650177),
     ('drinks', 0.668931245803833),
     ('lagers', 0.6570086479187012),
     ('Yuengling_Lager', 0.655455470085144),
     ('microbrew', 0.6534324884414673),
     ('Brooklyn_Lager', 0.6501551866531372),
     ('suds', 0.6497018337249756),
     ('brewed_beer', 0.6490240097045898)]

    restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
    restrict_w2v(w2v, restricted_word_set)
    w2v.most_similar("beer")

    [('lagers', 0.6570085287094116),
     ('wine', 0.6217695474624634),
     ('bash', 0.20583480596542358),
     ('computer', 0.06677375733852386),
     ('python', 0.005948573350906372)]
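The same re-indexing idea can be sketched on plain data structures with a single NumPy indexing step instead of a per-word append. This is only an illustration under assumptions: `restrict_arrays` is a hypothetical helper, and `vectors` and `index2word` stand in for the corresponding KeyedVectors attributes rather than being a gensim object.

```python
import numpy as np


def restrict_arrays(vectors, index2word, keep):
    """Return (new_vectors, new_index2word, new_word2index) limited to `keep`.

    Hypothetical helper: works on a plain array and list, not on a gensim object.
    """
    keep = set(keep)  # set membership keeps the lookup O(1) per word
    # Row indices of the words to keep, in their original order.
    keep_ids = [i for i, word in enumerate(index2word) if word in keep]
    new_vectors = vectors[keep_ids]  # one fancy-indexing copy
    new_index2word = [index2word[i] for i in keep_ids]
    # Fresh word -> row mapping, so every index matches its new row.
    new_word2index = {word: i for i, word in enumerate(new_index2word)}
    return new_vectors, new_index2word, new_word2index


# Toy data standing in for a real embedding matrix.
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)
index2word = ["beer", "graph", "wine", "trees"]

new_vectors, new_index2word, new_word2index = restrict_arrays(
    vectors, index2word, {"beer", "wine"}
)
```

The point of the sketch is that the vector matrix, the index-to-word list and the word-to-index mapping have to be rebuilt together; if only one of them changes, lookups and similarity results fall out of sync.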

vumaasha · Answer

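There is no direct way to do what you are looking for. However, all is not lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.

The lines shown below perform the actual logic of computing the similar words. You need to replace the variable limited with the vectors corresponding to the words you are interested in. Then you are done.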

    limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
    dists = dot(limited, mean)
    if not topn:
        return dists
    best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)

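Update:

    limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]

As this line shows, if restrict_vocab is used, the search is restricted to the top n words in the vocabulary, which is meaningful only if the vocabulary is sorted by frequency. If you do not pass restrict_vocab, limited is the whole of self.vectors_norm.

The method most_similar calls another method, init_sims, which initializes the value of self.vectors_norm as shown below: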

    self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

So you can pick out the words you are interested in, prepare their normalized vectors, and use them in place of limited. This should work.
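A minimal sketch of that idea, written against plain NumPy arrays rather than gensim's internals. The function name `most_similar_restricted` and its parameters are assumptions made for illustration; `vectors` and `index2word` stand in for the model's vector matrix and word list, and zero-length vectors are not handled.

```python
import numpy as np


def most_similar_restricted(vectors, index2word, query, allowed, topn=10):
    """Rank only the words in `allowed` by cosine similarity to `query`.

    Hypothetical helper, not part of gensim.
    """
    word2index = {word: i for i, word in enumerate(index2word)}
    # Rows of the allowed words, excluding the query itself.
    ids = [word2index[w] for w in allowed if w in word2index and w != query]
    limited = vectors[ids]
    # Unit-normalise the subset, mirroring how vectors_norm is computed.
    limited = limited / np.sqrt((limited ** 2).sum(-1))[..., np.newaxis]
    q = vectors[word2index[query]]
    q = q / np.sqrt((q ** 2).sum())
    dists = limited @ q  # cosine similarity against the subset only
    best = np.argsort(-dists)[:topn]  # highest similarity first
    return [(index2word[ids[i]], float(dists[i])) for i in best]


# Toy 2-d vectors: "b" is close to "a", "c" is orthogonal, "d" is opposite.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
index2word = ["a", "b", "c", "d"]

result = most_similar_restricted(vectors, index2word, "a", {"c", "d"}, topn=1)
```

Because only the allowed rows enter the dot product, the cost scales with the size of the domain subset rather than with the full vocabulary, and "b" can never be returned here even though it is the closest word overall.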

Feng Mai · Answer

Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.

Suppose you only want to keep the top 5000 words in your model.

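    import numpy as np

    wv = w2v_model.wv
    words_to_trim = wv.index2word[5000:]
    # In op's case
    # words_to_trim = ['graph']
    ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

    for w in words_to_trim:
        del wv.vocab[w]

    wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
    wv.init_sims(replace=True)

    for i in sorted(ids_to_trim, reverse=True):
        del(wv.index2word[i])

    # Re-number the remaining entries so each index matches its new row;
    # needed when the removed words are not all at the end (e.g. op's case).
    for new_index, w in enumerate(wv.index2word):
        wv.vocab[w].index = new_index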

This works because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.

The advantage of this is that if you write the KeyedVectors out using methods such as save_word2vec_format(), the file is much smaller.

Ryan Y · Answer

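Having tried it, I feel the most straightforward way is as follows:

Get the Word2Vec embeddings in text file format.
Identify the lines corresponding to the word vectors that you would like to keep.
Write a new text file Word2Vec embedding model.
Load the model and enjoy (save to binary if you wish, etc.)...

My sample code is as follows: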

    import re

    # isLatin, txtWrite, txtAppend and the file_entVecs_* paths are the
    # author's own helpers and variables; they are not shown in this answer.

    line_no = 0  # line 0 = header
    numEntities = 0
    targetLines = []

    with open(file_entVecs_txt, 'r') as fp:
        header = fp.readline()  # header

        while True:
            line = fp.readline()
            if line == '':  # EOF
                break
            line_no += 1

            isLatinFlag = True
            for i_l, char in enumerate(line):
                if not isLatin(char):  # Care about entity that is Latin-only
                    isLatinFlag = False
                    break
                if char == ' ':  # reached separator
                    ent = line[:i_l]
                    break

            if not isLatinFlag:
                continue

            # Check for numbers in entity
            if re.search(r'\d', ent):
                continue

            # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
            if re.match(r'^ENTITY/.*#', ent):
                continue

            targetLines.append(line_no)
            numEntities += 1

    # Update header with new metadata
    header_new = re.sub(r'^\d+', str(numEntities), header, count=1)

    # Generate the file
    txtWrite('', file_entVecs_SHORT_txt)
    txtAppend(header_new, file_entVecs_SHORT_txt)

    line_no = 0
    ptr = 0
    with open(file_entVecs_txt, 'r') as fp:
        while ptr < len(targetLines):
            target_line_no = targetLines[ptr]

            # Skip forward to the next line that should be kept.
            while (line_no != target_line_no):
                fp.readline()
                line_no += 1

            line = fp.readline()
            line_no += 1
            ptr += 1
            txtAppend(line, file_entVecs_SHORT_txt)
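For readers without those helper functions, the four steps can be condensed into a self-contained sketch that uses only the standard library. It assumes the usual word2vec text layout (a "<vocab_size> <vector_size>" header, then one "word v1 v2 ..." line per entry); `filter_word2vec_text` and the `keep_word` predicate are hypothetical names, and the kept lines are buffered in memory, which may not suit very large files.

```python
import os
import tempfile


def filter_word2vec_text(src_path, dst_path, keep_word):
    """Copy a word2vec text file, keeping only rows whose word passes keep_word().

    Hypothetical helper; returns the number of rows written.
    """
    kept = []
    with open(src_path, "r", encoding="utf-8") as src:
        header = src.readline().split()  # "<vocab_size> <vector_size>"
        vector_size = header[1]
        for line in src:
            word = line.split(" ", 1)[0]  # the word is the first field
            if keep_word(word):
                kept.append(line)
    with open(dst_path, "w", encoding="utf-8") as dst:
        # The header must carry the new row count, or loading will fail.
        dst.write(f"{len(kept)} {vector_size}\n")
        dst.writelines(kept)
    return len(kept)


# Tiny demonstration: drop every entry whose word contains a digit.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "full.txt")
    dst = os.path.join(tmp, "short.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("3 2\nbeer 0.1 0.2\nbeer2 0.3 0.4\nwine 0.5 0.6\n")
    n_kept = filter_word2vec_text(
        src, dst, lambda w: not any(c.isdigit() for c in w)
    )
    with open(dst, encoding="utf-8") as f:
        result = f.read()
```

The resulting file is in the same text format, so it should load with KeyedVectors.load_word2vec_format(path, binary=False) like any other word2vec text file.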

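FYI. FAILED ATTEMPT: I tried out @zsozso's method (with the np.array modifications suggested by @Taegyung) and left it to run overnight for at least 12 hours, but it was still stuck at getting new words from the restricted set. This is perhaps because I have a lot of entities. My text-file method, however, works within an hour.

Failed code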

    import numpy as np

    # [FAILED] Stuck at Building new vocab...
    def restrict_w2v(w2v, restricted_word_set):
        new_vectors = []
        new_vocab = {}
        new_index2entity = []
        new_vectors_norm = []

        print('Building new vocab..')

        for i in range(len(w2v.vocab)):

            if (i % int(1e6) == 0) and (i != 0):
                print(f'working on {i}')

            word = w2v.index2entity[i]
            vec = np.array(w2v.vectors[i])
            vocab = w2v.vocab[word]
            vec_norm = w2v.vectors_norm[i]
            if word in restricted_word_set:
                vocab.index = len(new_index2entity)
                new_index2entity.append(word)
                new_vocab[word] = vocab
                new_vectors.append(vec)
                new_vectors_norm.append(vec_norm)

        print('Assigning new vocab')
        w2v.vocab = new_vocab
        print('Assigning new vectors')
        w2v.vectors = np.array(new_vectors)
        print('Assigning new index2entity, index2word')
        w2v.index2entity = new_index2entity
        w2v.index2word = new_index2entity
        print('Assigning new vectors_norm')
        w2v.vectors_norm = np.array(new_vectors_norm)