Embedding in PyTorch

Question

I have checked the PyTorch tutorial and questions similar to this one on Stack Overflow.

I am confused. Does the embedding in PyTorch (Embedding) make similar words closer to each other, and do I just need to give it all the sentences? Or is it just a lookup table, so that I need to code the model myself?

AveryLiu · Answer

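You can treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. Before using it, however, you need to specify the size of the lookup table and initialize the word vectors yourself. The following code example demonstrates this.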

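    import numpy as np
    import torch
    import torch.nn as nn

    # vocab_size is the number of words in your train, val and test set
    # vector_size is the dimension of the word vectors you are using
    vocab_size, vector_size = 2, 4  # placeholder values for illustration
    embed = nn.Embedding(vocab_size, vector_size)

    # Initialize the word vectors. pretrained_weights is a numpy array of
    # shape (vocab_size, vector_size), and pretrained_weights[i] is the
    # word vector of the i-th word in the vocabulary.
    # A random placeholder is used here; in practice, load real pretrained vectors.
    pretrained_weights = np.random.rand(vocab_size, vector_size)
    embed.weight.data.copy_(torch.from_numpy(pretrained_weights))

    # Then turn the word indexes into actual word vectors.
    # The layer expects an integer (long) tensor, not a Python list.
    vocab = {"some": 0, "words": 1}
    word_indexes = torch.tensor([vocab[w] for w in ["some", "words"]], dtype=torch.long)
    word_vectors = embed(word_indexes)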
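As an alternative to copying into `embed.weight` by hand, PyTorch also provides the `nn.Embedding.from_pretrained` class method, which builds the layer directly from a weight tensor. The sketch below is only illustrative: the tiny `pretrained_weights` array is a made-up stand-in for real pretrained vectors, and `freeze=False` is chosen here as an assumption that you want to keep fine-tuning them.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for real pretrained vectors (e.g. loaded from a GloVe file).
pretrained_weights = np.array(
    [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]],
    dtype=np.float32,
)

# Build the layer directly from the weight matrix.
# freeze=False keeps the vectors trainable; the default (freeze=True) fixes them.
embed = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_weights), freeze=False)

vocab = {"some": 0, "words": 1}
word_indexes = torch.tensor([vocab[w] for w in ["some", "words"]], dtype=torch.long)
word_vectors = embed(word_indexes)  # shape: (2, 3)
```

With this form the table size is inferred from the shape of the tensor, so `vocab_size` and `vector_size` do not need to be passed separately.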

Escachator · Answer

nn.Embedding holds a tensor of dimension (vocab_size, vector_size), i.e. the size of the vocabulary by the dimension of each embedding vector, together with a method that performs the lookup.
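To make the "tensor plus lookup" description concrete, here is a minimal sketch comparing the layer's lookup with plain row indexing into its stored `weight` tensor. The sizes 1000 and 128 are arbitrary illustrative values.

```python
import torch
from torch import nn

embedding = nn.Embedding(1000, 128)   # vocab_size=1000, vector_size=128
indices = torch.tensor([3, 4], dtype=torch.long)

looked_up = embedding(indices)        # the layer's lookup
indexed = embedding.weight[indices]   # plain row indexing into the stored tensor

print(embedding.weight.shape)         # torch.Size([1000, 128])
print(torch.equal(looked_up, indexed))  # True
```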

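When you create an embedding layer, the tensor is initialised randomly. Similarity between similar words only appears once the layer has been trained, unless you have overwritten the embedding values with a previously trained model such as GloVe or Word2Vec, but that is another story.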
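The following toy sketch illustrates that point: the vectors start out random, and only a training signal moves them. The objective used here (maximising the cosine similarity of one hand-picked pair of indexes) is purely an assumption for demonstration; a real model would learn from a task such as language modelling or classification.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding = nn.Embedding(10, 8)
pair = torch.tensor([1, 2], dtype=torch.long)  # two word indexes we pretend are "similar"

def similarity():
    """Cosine similarity between the two vectors of the chosen pair."""
    with torch.no_grad():
        a, b = embedding(pair)
        return F.cosine_similarity(a, b, dim=0).item()

before = similarity()

optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    a, b = embedding(pair)
    loss = 1.0 - F.cosine_similarity(a, b, dim=0)  # toy objective: pull the pair together
    loss.backward()
    optimizer.step()

after = similarity()
print(f"cosine similarity before: {before:.3f}, after: {after:.3f}")
```

The "before" value reflects nothing but the random initialisation; the pair becomes similar only because the loss rewards it.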

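So, once you have defined the embedding layer and defined and encoded the vocabulary (i.e. assigned a unique number to each word in the vocabulary), you can use the instance of the nn.Embedding class to get the corresponding embeddings.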

For example:

    import torch
    from torch import nn

    embedding = nn.Embedding(1000, 128)
    embedding(torch.LongTensor([3, 4]))

This returns the embedding vectors corresponding to words 3 and 4 in your vocabulary. Since no model has been trained, they will be random.
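The vocabulary-encoding step mentioned above (assigning a unique number to each word) could be sketched as follows. The two-sentence corpus, the whitespace tokenisation, the vector size of 5, and the helper name `embed_sentence` are all hypothetical choices made for illustration.

```python
import torch
from torch import nn

sentences = ["the cat sat", "the dog sat"]  # hypothetical toy corpus

# Assign a unique integer to each distinct word, in order of first appearance.
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)
# vocab == {"the": 0, "cat": 1, "sat": 2, "dog": 3}

embedding = nn.Embedding(len(vocab), 5)  # 5 is an arbitrary vector size

def embed_sentence(sentence):
    """Encode a whitespace-tokenised sentence and look up its word vectors."""
    indices = torch.tensor([vocab[w] for w in sentence.split()], dtype=torch.long)
    return embedding(indices)

vectors = embed_sentence("the dog sat")
print(vectors.shape)  # torch.Size([3, 5])
```

Words that are not in `vocab` would raise a `KeyError` in this sketch; real pipelines usually reserve an index for unknown words.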