NLTKの実際の使用例

Question

そのドキュメント（ Book および [〜＃〜] howto [〜＃〜] ）は非常にかさばり、例は少し進んでいることがあります。

NLTKの用途/アプリケーションの良いが基本的な例はありますか？ Stream Hackerブログの NTLK記事のようなものを考えています。

Mat · Answer

これは、この質問を調べている他の人の利益のための私自身の実用的な例です（サンプルテキストを言い訳、それは私が最初に見つけた Wikipedia ）：

import nltk import pprint tokenizer = None tagger = None def init_nltk(): global tokenizer global tagger tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+') tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents()) def tag(text): global tokenizer global tagger if not tokenizer: init_nltk() tokenized = tokenizer.tokenize(text) tagged = tagger.tag(tokenized) tagged.sort(lambda x,y:cmp(x[1],y[1])) return tagged def main(): text = """Mr Blobby is a fictional character who featured on Noel Edmonds' Saturday night entertainment show Noel's House Party, which was often a ratings winner in the 1990s. Mr Blobby also appeared on the Jamie Rose show of 1997. He was designed as an outrageously over the top parody of a one-dimensional, mute novelty character, which ironically made him distinctive, absurd and popular. He was a large pink humanoid, covered with yellow spots, sporting a permanent toothy grin and jiggling eyes. He communicated by saying the Word "blobby" in an electronically-altered voice, expressing his moods through tone of voice and repetition. There was a Mrs. Blobby, seen briefly in the video, and sold as a doll. However Mr Blobby actually started out as part of the 'Gotcha' feature during the show's second series (originally called 'Gotcha Oscars' until the threat of legal action from the Academy of Motion Picture Arts and Sciences[citation needed]), in which celebrities were caught out in a Candid Camera style prank. Celebrities such as dancer Wayne Sleep and rugby union player Will Carling would be enticed to take part in a fictitious children's programme based around their profession. Mr Blobby would clumsily take part in the activity, knocking over the set, causing mayhem and saying "blobby blobby blobby", until finally when the prank was revealed, the Blobby costume would be opened - revealing Noel inside. This was all the more surprising for the "victim" as during rehearsals Blobby would be played by an actor wearing only the arms and legs of the costume and speaking in a normal manner.[citation needed]""" tagged = tag(text) l = list(set(tagged)) l.sort(lambda x,y:cmp(x[1],y[1])) pprint.pprint(l) if __== '__main__': main()

出力：

[('rugby', None), ('Oscars', None), ('1990s', None), ('",', None), ('Candid', None), ('"', None), ('blobby', None), ('Edmonds', None), ('Mr', None), ('outrageously', None), ('.[', None), ('toothy', None), ('Celebrities', None), ('Gotcha', None), (']),', None), ('Jamie', None), ('humanoid', None), ('Blobby', None), ('Carling', None), ('enticed', None), ('programme', None), ('1997', None), ('s', None), ("'", "'"), ('[', '('), ('(', '('), (']', ')'), (',', ','), ('.', '.'), ('all', 'ABN'), ('the', 'AT'), ('an', 'AT'), ('a', 'AT'), ('be', 'BE'), ('were', 'BED'), ('was', 'BEDZ'), ('is', 'BEZ'), ('and', 'CC'), ('one', 'CD'), ('until', 'CS'), ('as', 'CS'), ('This', 'DT'), ('There', 'EX'), ('of', 'IN'), ('inside', 'IN'), ('from', 'IN'), ('around', 'IN'), ('with', 'IN'), ('through', 'IN'), ('-', 'IN'), ('on', 'IN'), ('in', 'IN'), ('by', 'IN'), ('during', 'IN'), ('over', 'IN'), ('for', 'IN'), ('distinctive', 'JJ'), ('permanent', 'JJ'), ('mute', 'JJ'), ('popular', 'JJ'), ('such', 'JJ'), ('fictional', 'JJ'), ('yellow', 'JJ'), ('pink', 'JJ'), ('fictitious', 'JJ'), ('normal', 'JJ'), ('dimensional', 'JJ'), ('legal', 'JJ'), ('large', 'JJ'), ('surprising', 'JJ'), ('absurd', 'JJ'), ('Will', 'MD'), ('would', 'MD'), ('style', 'NN'), ('threat', 'NN'), ('novelty', 'NN'), ('union', 'NN'), ('prank', 'NN'), ('winner', 'NN'), ('parody', 'NN'), ('player', 'NN'), ('actor', 'NN'), ('character', 'NN'), ('victim', 'NN'), ('costume', 'NN'), ('action', 'NN'), ('activity', 'NN'), ('dancer', 'NN'), ('grin', 'NN'), ('doll', 'NN'), ('top', 'NN'), ('mayhem', 'NN'), ('citation', 'NN'), ('part', 'NN'), ('repetition', 'NN'), ('manner', 'NN'), ('tone', 'NN'), ('Picture', 'NN'), ('entertainment', 'NN'), ('night', 'NN'), ('series', 'NN'), ('voice', 'NN'), ('Mrs', 'NN'), ('video', 'NN'), ('Motion', 'NN'), ('profession', 'NN'), ('feature', 'NN'), ('Word', 'NN'), ('Academy', 'NN-TL'), ('Camera', 'NN-TL'), ('Party', 'NN-TL'), ('House', 'NN-TL'), ('eyes', 'NNS'), ('spots', 'NNS'), ('rehearsals', 'NNS'), ('ratings', 'NNS'), ('arms', 'NNS'), ('celebrities', 'NNS'), ('children', 'NNS'), ('moods', 'NNS'), ('legs', 'NNS'), ('Sciences', 'NNS-TL'), ('Arts', 'NNS-TL'), ('Wayne', 'NP'), ('Rose', 'NP'), ('Noel', 'NP'), ('Saturday', 'NR'), ('second', 'OD'), ('his', 'PP$'), ('their', 'PP$'), ('him', 'PPO'), ('He', 'PPS'), ('more', 'QL'), ('However', 'RB'), ('actually', 'RB'), ('also', 'RB'), ('clumsily', 'RB'), ('originally', 'RB'), ('only', 'RB'), ('often', 'RB'), ('ironically', 'RB'), ('briefly', 'RB'), ('finally', 'RB'), ('electronically', 'RB-HL'), ('out', 'RP'), ('to', 'TO'), ('show', 'VB'), ('Sleep', 'VB'), ('take', 'VB'), ('opened', 'VBD'), ('played', 'VBD'), ('caught', 'VBD'), ('appeared', 'VBD'), ('revealed', 'VBD'), ('started', 'VBD'), ('saying', 'VBG'), ('causing', 'VBG'), ('expressing', 'VBG'), ('knocking', 'VBG'), ('wearing', 'VBG'), ('speaking', 'VBG'), ('sporting', 'VBG'), ('revealing', 'VBG'), ('jiggling', 'VBG'), ('sold', 'VBN'), ('called', 'VBN'), ('made', 'VBN'), ('altered', 'VBN'), ('based', 'VBN'), ('designed', 'VBN'), ('covered', 'VBN'), ('communicated', 'VBN'), ('needed', 'VBN'), ('seen', 'VBN'), ('set', 'VBN'), ('featured', 'VBN'), ('which', 'WDT'), ('who', 'WPS'), ('when', 'WRB')]

Pete Mancini · Answer

NLPは一般的に非常に有用であるため、テキスト分析の一般的なアプリケーションに検索範囲を広げることができます。 NLTKを使用して、MOSS 2010概念マップを抽出することによりファイル分類を生成しました。これは非常にうまく機能しました。ファイルが有用な方法でクラスター化されるまでに時間がかかりません。

多くの場合、テキスト分析を理解するには、思考に慣れている方法に反して考える必要があります。たとえば、テキスト分析は発見に非常に役立ちます。しかし、ほとんどの人は、検索と発見の違いが何であるかさえ知りません。これらのテーマを読んだ場合、NLTKを機能させる方法を「発見」する可能性があります。

また、NLTKを使用しないテキストファイルの世界観を考慮してください。空白と句読点で区切られたランダムな長さの文字列がたくさんあります。句読点の一部は、ピリオドなどの使用方法を変更します（ピリオド（これは小数点と省略形の接尾辞マーカーでもあります）。NLTKを使用すると、品詞を取得するポイントまで単語などを取得できます。これで、コンテンツを処理できました。 NLTKを使用して、ドキュメントの概念とアクションを見つけます。 NLTKを使用して、ドキュメントの「意味」に到達します。この場合の意味は、ドキュメント内の本質的な関係を指します。

NLTKに興味があるのは良いことです。テキスト分析は、今後数年間で大々的にブレークアウトする予定です。それを理解している人は、新しい機会をよりうまく活用するのにより適しています。

Jacob · Answer

私は streamhacker.com の作成者です（言及してくれたおかげで、この特定の質問からかなりのクリックトラフィックが得られます）。具体的に何をしようとしていますか？ NLTKにはさまざまなことを行うためのツールがたくさんありますが、ツールの使用目的とその最適な使用方法に関する明確な情報がいくらか欠けています。また、学問的な問題に向けられているため、教育学の例を実用的な解決策に変換するのは大変です。