NLTK: corpus-level BLEU vs. sentence-level BLEU score

Question

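I have imported nltk in Python to calculate BLEU scores on Ubuntu. I understand how the sentence-level BLEU score works, but I don't understand how the corpus-level BLEU score works.

Below is my code for the corpus-level BLEU score: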

    import nltk

    hypothesis = ['This', 'is', 'cat']
    reference = ['This', 'is', 'a', 'cat']
    BLEUscore = nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights=[1])
    print(BLEUscore)

For some reason, the BLEU score for the above code is 0. I was expecting a corpus-level BLEU score of at least 0.5.

Here is my code for the sentence-level BLEU score:

    import nltk

    hypothesis = ['This', 'is', 'cat']
    reference = ['This', 'is', 'a', 'cat']
    BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights=[1])
    print(BLEUscore)

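Here the sentence-level BLEU score is 0.71, which accounts for the brevity penalty and the missing word "a". However, I don't understand how the corpus-level BLEU score works.

Any help would be appreciated.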

alvas · Accepted Answer

TL;DR:

    >>> import nltk
    >>> hypothesis = ['This', 'is', 'cat']
    >>> reference = ['This', 'is', 'a', 'cat']
    >>> references = [reference]  # list of references for 1 sentence.
    >>> list_of_references = [references]  # list of references for all sentences in corpus.
    >>> list_of_hypotheses = [hypothesis]  # list of hypotheses that corresponds to list of references.
    >>> nltk.translate.bleu_score.corpus_bleu(list_of_references, list_of_hypotheses)
    0.6025286104785453
    >>> nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
    0.6025286104785453

(Note: you have to pull the latest version of NLTK from the develop branch to get a stable version of the BLEU score implementation.)

Long version:

Actually, if there is only one reference and one hypothesis in your whole corpus, corpus_bleu() and sentence_bleu() should both return the same value, as shown in the example above.
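To see why the two must agree on a one-sentence corpus, here is a minimal unigram-only sketch in plain Python. It is not NLTK's implementation: the helper names (`clipped_unigram_counts`, `closest_ref_length`, `unigram_corpus_bleu`, `unigram_sentence_bleu`) are made up for illustration, and it assumes the standard BLEU definition of clipped counts and brevity penalty. For the example from the question it reproduces the asker's value of about 0.71.

```python
import math
from collections import Counter


def clipped_unigram_counts(references, hypothesis):
    """Return (clipped matches, hypothesis length) for unigram precision."""
    hyp_counts = Counter(hypothesis)
    # For each token, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    matches = sum(min(count, max_ref_counts[token])
                  for token, count in hyp_counts.items())
    return matches, len(hypothesis)


def closest_ref_length(references, hyp_len):
    """Reference length closest to the hypothesis length (ties: shorter)."""
    return min((len(ref) for ref in references),
               key=lambda ref_len: (abs(ref_len - hyp_len), ref_len))


def unigram_corpus_bleu(list_of_references, hypotheses):
    """Pool counts and lengths over the whole corpus, then divide once."""
    matches = total = hyp_len = ref_len = 0
    for references, hypothesis in zip(list_of_references, hypotheses):
        m, t = clipped_unigram_counts(references, hypothesis)
        matches += m
        total += t
        hyp_len += len(hypothesis)
        ref_len += closest_ref_length(references, len(hypothesis))
    if matches == 0:
        return 0.0
    precision = matches / total
    # Brevity penalty: 1 if the hypotheses are longer than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * precision


def unigram_sentence_bleu(references, hypothesis):
    """Sentence level is just a corpus containing one sentence."""
    return unigram_corpus_bleu([references], [hypothesis])


hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']

corpus_score = unigram_corpus_bleu([[reference]], [hypothesis])
sentence_score = unigram_sentence_bleu([reference], hypothesis)
print(round(corpus_score, 4), round(sentence_score, 4))  # 0.7165 0.7165
```

All three hypothesis tokens match, so the unigram precision is 3/3, and the score is the brevity penalty alone, exp(1 - 4/3).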

In the code, we can see that sentence_bleu is actually a thin wrapper around corpus_bleu:

    def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=None):
        return corpus_bleu([references], [hypothesis], weights, smoothing_function)

And if we look at the parameters for sentence_bleu:

    def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=None):
        """
        :param references: reference sentences
        :type references: list(list(str))
        :param hypothesis: a hypothesis sentence
        :type hypothesis: list(str)
        :param weights: weights for unigrams, bigrams, trigrams and so on
        :type weights: list(float)
        :return: The sentence-level BLEU score.
        :rtype: float
        """

The input for sentence_bleu's references is a list(list(str)).

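For example, if you have a sentence string, "This is a cat", you have to tokenize it to get a list of strings, ["This", "is", "a", "cat"]. Since multiple references are allowed, the input has to be a list of lists of strings. If you have a second reference, "This is a feline", your input to sentence_bleu() would be: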

    from nltk.translate.bleu_score import sentence_bleu

    references = [["This", "is", "a", "cat"], ["This", "is", "a", "feline"]]
    hypothesis = ["This", "is", "cat"]
    sentence_bleu(references, hypothesis)

As for the list_of_references parameter of corpus_bleu(), it is basically a list of whatever sentence_bleu() takes as references:

    def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=None):
        """
        :param references: a corpus of lists of reference sentences, w.r.t. hypotheses
        :type references: list(list(list(str)))
        :param hypotheses: a list of hypothesis sentences
        :type hypotheses: list(list(str))
        :param weights: weights for unigrams, bigrams, trigrams and so on
        :type weights: list(float)
        :return: The corpus-level BLEU score.
        :rtype: float
        """
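A quick way to check the nesting before calling either function is to count the list levels above the token strings. The sketch below uses a hypothetical helper, `nesting_depth`, which is not part of NLTK. It suggests why the call in the question likely scored 0: `[reference]` is only a list(list(str)), so corpus_bleu would presumably treat each token string as a whole reference "sentence" and compare the hypothesis tokens against single characters.

```python
def nesting_depth(obj):
    """Count list levels above the strings: list(str) -> 1, list(list(str)) -> 2."""
    depth = 0
    while isinstance(obj, (list, tuple)):
        if not obj:
            raise ValueError("cannot infer the depth of an empty list")
        obj = obj[0]  # descend into the first element only
        depth += 1
    return depth


hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']

# What the question passed as the first argument of corpus_bleu().
question_refs = [reference]

# What the docstring asks for: one list of references per hypothesis.
list_of_references = [[reference]]
list_of_hypotheses = [hypothesis]

print(nesting_depth(question_refs))       # 2, one level short
print(nesting_depth(list_of_references))  # 3, list(list(list(str)))
print(nesting_depth(list_of_hypotheses))  # 2, list(list(str))
```

The helper only inspects the first element at each level, so it is a sanity check for uniformly nested input rather than a full validator.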

Besides looking at the doctests in nltk/translate/bleu_score.py, you can also look at the unit tests in nltk/test/unit/translate/test_bleu_score.py to see how to use each of the components in bleu_score.py.

By the way, since sentence_bleu is imported as bleu in nltk/translate/__init__.py (https://github.com/nltk/nltk/blob/develop/nltk/translate/__init__.py#L21), using

    from nltk.translate import bleu

would be the same as:

    from nltk.translate.bleu_score import sentence_bleu

And in code:

    >>> from nltk.translate import bleu
    >>> from nltk.translate.bleu_score import sentence_bleu
    >>> from nltk.translate.bleu_score import corpus_bleu
    >>> bleu == sentence_bleu
    True
    >>> bleu == corpus_bleu
    False

alexis · Answer

Let's take a look:

    >>> help(nltk.translate.bleu_score.corpus_bleu)
    Help on function corpus_bleu in module nltk.translate.bleu_score:

    corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None)
        Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all
        the hypotheses and their respective references.

        Instead of averaging the sentence level BLEU scores (i.e. macro-average
        precision), the original BLEU metric (Papineni et al. 2002) accounts for
        the micro-average precision (i.e. summing the numerators and denominators
        for each hypothesis-reference(s) pairs before the division).
        ...
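The micro vs. macro distinction in that docstring can be illustrated with a small standalone sketch. This is not NLTK code: it looks only at unigram precision with a single reference per hypothesis, ignores the brevity penalty, and the helper `clipped_matches` and the second sentence pair are made up for the example.

```python
from collections import Counter


def clipped_matches(reference, hypothesis):
    """Return (numerator, denominator) of the clipped unigram precision."""
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # A hypothesis token is credited at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[token])
                  for token, count in hyp_counts.items())
    return matches, len(hypothesis)


# (reference, hypothesis) pairs; the second pair is invented for illustration.
pairs = [
    (['This', 'is', 'a', 'cat'], ['This', 'is', 'cat']),              # 3/3
    (['the', 'dog', 'barked'], ['a', 'dog', 'barks', 'very', 'loudly']),  # 1/5
]

fractions = [clipped_matches(ref, hyp) for ref, hyp in pairs]

# Macro-average: divide per sentence, then average the ratios.
macro = sum(num / den for num, den in fractions) / len(fractions)

# Micro-average: sum numerators and denominators first, then divide once.
micro = sum(num for num, _ in fractions) / sum(den for _, den in fractions)

print(round(macro, 3), round(micro, 3))  # 0.6 0.5
```

Because the longer hypothesis contributes more tokens to the pooled denominator, the micro-average (4/8) comes out lower than the average of the per-sentence precisions ((1.0 + 0.2) / 2).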

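You're in a better position than I am to understand the description of the algorithm, so I won't attempt to "explain" it to you. If the docstring does not clear things up enough, take a look at the source itself. Or find it locally: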

    >>> nltk.translate.bleu_score.__file__
    '.../lib/python3.4/site-packages/nltk/translate/bleu_score.py'