Python: untokenizing a sentence

Question

There are many guides on how to tokenize a sentence, but I could not find any on how to do the opposite.

    import nltk
    words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is:

    ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there a function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() does not work for some reason.

Edit:

I know that I can do something like the following, and it probably solves the problem, but I am curious whether there is an integrated function for this:

    result = ' '.join(sentence).replace(' , ', ',').replace(' .', '.').replace(' !', '!')
    result = result.replace(' ?', '?').replace(' : ', ': ').replace(' \'', '\'')

alecxe · Accepted Answer

You can use the "treebank detokenizer", TreebankWordDetokenizer:

    from nltk.tokenize.treebank import TreebankWordDetokenizer
    TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
    # 'the quick brown'

There was also a MosesDetokenizer in nltk, but it was removed because of licensing issues. It is still available as the standalone Sacremoses package.

alvas · Answer

To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

    >>> import nltk
    >>> import string
    >>> nltk.word_tokenize("I've found a medicine for my disease.")
    ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    >>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
    >>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
    "I've found a medicine for my disease."
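The one-liner above can be wrapped in a small helper to make its rule explicit. This is only a sketch: the name `join_tokens` is hypothetical, and the token lists are written out by hand rather than produced by nltk. It also shows one case the rule does not cover.

```python
import string


def join_tokens(tokens):
    """Join tokens, adding a leading space unless the token starts with an
    apostrophe or is found in string.punctuation."""
    pieces = []
    for tok in tokens:
        if tok.startswith("'") or tok in string.punctuation:
            pieces.append(tok)        # attach directly to the previous token
        else:
            pieces.append(" " + tok)  # ordinary word: separate with a space
    return "".join(pieces).strip()


tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(join_tokens(tokens))
# I've found a medicine for my disease.

# A contraction split as "n't" does not start with an apostrophe,
# so this rule leaves a space in front of it:
print(join_tokens(['I', 'do', "n't", 'know', '.']))
# I do n't know.
```

Handling such cases needs extra rules, which is roughly what the longer detokenizers in the other answers add.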

Uri · Answer

    from nltk.tokenize.treebank import TreebankWordDetokenizer
    TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
    # 'the quick brown'

Renklauf · Answer

Use token_utils.untokenize from here:

    import re

    def untokenize(words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    untokenize(tokenized)
    # "I've found a medicine for my disease."

Sathyanarayanan Kulasekaran · Answer

For me, it worked with Python nltk 3.2.5:

    pip install -U nltk

Then:

    import nltk
    nltk.download('perluniprops')
    from nltk.tokenize.moses import MosesDetokenizer

    # Instance used in the pandas example below
    detokenizer = MosesDetokenizer()

If you are using it inside a pandas DataFrame:

    df['detoken'] = df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))

dparpyani · Answer

tokenize.untokenize does not work because it needs more information than just the words. Here is an example program that uses tokenize.untokenize:

    # Python 2 syntax (StringIO module, print statement)
    from StringIO import StringIO
    import tokenize

    sentence = "I've found a medicine for my disease.\n"
    tokens = tokenize.generate_tokens(StringIO(sentence).readline)
    print tokenize.untokenize(tokens)

Additional help: Tokenize - Python Docs | Potential Problem
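To illustrate why a plain list of words is not enough, the sketch below prints what the standard-library `tokenize` module actually produces. It targets Python 3 and, as an assumption made for this example, uses a short line of valid Python source (`x = 1 + 2`) instead of the English sentence, since `tokenize` is a tokenizer for Python code. Each token carries a type plus start and end positions, and `untokenize` relies on those positions to restore the spacing.

```python
import io
import tokenize

source = "x = 1 + 2\n"

# generate_tokens expects a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # Each item is a TokenInfo: (type, string, start, end, line).
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)

# With the full position information available, the source is restored.
restored = tokenize.untokenize(tokens)
print(restored == source)
```

A list such as the one returned by `nltk.word_tokenize` contains only the strings, with none of this positional data, so it is not the input that `tokenize.untokenize` is designed for.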

alemol · Answer

I propose keeping offsets during tokenization: (token, offset). I think this information is useful for processing the original sentence.

    import re
    from nltk.tokenize import word_tokenize

    def offset_tokenize(text):
        tail = text
        accum = 0
        tokens = word_tokenize(text)
        info_tokens = []
        for tok in tokens:
            scaped_tok = re.escape(tok)
            m = re.search(scaped_tok, tail)
            start, end = m.span()
            # global offsets
            gs = accum + start
            ge = accum + end
            accum += end
            # keep searching in the rest
            tail = tail[end:]
            info_tokens.append((tok, (gs, ge)))
        return info_tokens

    sent = '''I've found a medicine for my disease. This is line:3.'''

    toks_offsets = offset_tokenize(sent)

    for t in toks_offsets:
        (tok, offset) = t
        print(tok == sent[offset[0]:offset[1]], tok, sent[offset[0]:offset[1]])

This gives:

    True I I
    True 've 've
    True found found
    True a a
    True medicine medicine
    True for for
    True my my
    True disease disease
    True . .
    True This This
    True is is
    True line:3 line:3
    True . .
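If nltk is not at hand, the same idea can be tried with a regular expression, since match objects already expose their spans. The sketch below is a stand-in and not nltk's tokenizer: the pattern `\w+|[^\w\s]` and the helper names `tokenize_with_offsets` and `rebuild` are hypothetical, chosen only for illustration. It suggests that, once offsets are kept, the original spacing can be restored, at least when the gaps between tokens consist of spaces only.

```python
import re

# Toy tokenizer (an assumption for this example): runs of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_with_offsets(text):
    """Return a list of (token, (start, end)) pairs."""
    return [(m.group(), m.span()) for m in TOKEN_RE.finditer(text)]


def rebuild(tokens_with_offsets):
    """Reassemble the text, padding the gaps between spans with spaces.

    Tabs or newlines between tokens would come back as spaces.
    """
    parts = []
    cursor = 0
    for tok, (start, end) in tokens_with_offsets:
        parts.append(" " * (start - cursor))  # gap before this token
        parts.append(tok)
        cursor = end
    return "".join(parts)


sent = "I've found a medicine for my disease.  This is line:3."
pairs = tokenize_with_offsets(sent)
print(pairs[:3])
print(rebuild(pairs) == sent)
```

Note that this toy pattern splits differently from `word_tokenize` (for example, it separates the apostrophe from `ve`), which does not matter for the reconstruction because the spans record where each piece came from.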

shaktimaan · Answer

Use the join function.

You can run ' '.join(words) to get the string back.
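As a quick check of what this produces, the snippet below joins the token list from the question (written out by hand here, so nltk is not required). The result keeps a space before every token, so it matches the original sentence only when the tokenizer split purely on spaces.

```python
words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

joined = ' '.join(words)
print(joined)
# I 've found a medicine for my disease .

original = "I've found a medicine for my disease."
print(joined == original)
# False
```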

gss · Answer

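The reason there is no simple answer is that you actually need the span locations of the original tokens in the string. If you do not have them, and you are not reverse engineering the original tokenization, the reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer did not give you spans, you can still do this if you have three things:

1) The original string

2) The original tokens

3) The modified tokens (I am assuming you have changed the tokens in some way, because that is the only application I can think of if you already have #1)

Use the original token set to identify the spans (wouldn't it be nice if the tokenizer did that?), then modify the string from back to front so that the spans do not shift as you go.

Here I am using TweetTokenizer, but it should not matter as long as the tokenizer you use does not change the values of the tokens so that they no longer appear in the original string.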

    import nltk

    tokenizer = nltk.tokenize.casual.TweetTokenizer()
    string = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
    tokens = tokenizer.tokenize(string)
    replacement_tokens = list(tokens)
    replacement_tokens[-3] = "cute"

    def detokenize(string, tokens, replacement_tokens):
        spans = []
        cursor = 0
        # Locate each original token in the string, scanning left to right
        for token in tokens:
            while not string[cursor:cursor+len(token)] == token and cursor < len(string):
                cursor += 1
            if cursor == len(string):
                break
            newcursor = cursor + len(token)
            spans.append((cursor, newcursor))
            cursor = newcursor
        # Substitute from back to front so earlier spans stay valid
        i = len(tokens) - 1
        for start, end in spans[::-1]:
            string = string[:start] + replacement_tokens[i] + string[end:]
            i -= 1
        return string

    >>> detokenize(string, tokens, replacement_tokens)
    'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'

Asad · Answer

I use the code below for detokenization, without any major library function. I apply detokenization to some specific tokens only.

    _SPLITTER_ = r"([-.,/:!?\";)(])"

    def basic_detokenizer(sentence):
        """ This is the basic detokenizer; it helps us resolve the issues created by our tokenizer """
        detokenize_sentence = []
        words = sentence.split(' ')
        pos = 0
        while pos < len(words):
            if words[pos] in '-/.' and pos > 0 and pos < len(words) - 1:
                # Glue the separator and the next word onto the previous word
                left = detokenize_sentence.pop()
                detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
                pos += 1
            elif words[pos] in '[(' and pos < len(words) - 1:
                # Opening bracket: attach it to the following word
                detokenize_sentence.append(''.join(words[pos:pos + 2]))
                pos += 1
            elif words[pos] in ']).,:!?;' and pos > 0:
                # Closing bracket or punctuation: attach it to the previous word
                left = detokenize_sentence.pop()
                detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
            else:
                detokenize_sentence.append(words[pos])
            pos += 1
        return ' '.join(detokenize_sentence)