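I was going through this question.
I am wondering whether NLTK would be faster than regex for word/sentence tokenization.
The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.
Note that str.split() does not produce tokens in the linguistic sense, e.g.: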
    >>> sent = "This is a foo, bar sentence."
    >>> sent.split()
    ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
    >>> from nltk import word_tokenize
    >>> word_tokenize(sent)
    ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
str.split() is usually used to split a string on a specified delimiter: for a tab-separated file you can use str.split('\t'), and for a text file with one sentence per line you can split the string on the newline \n.
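Since the question is about regex versus NLTK, here is a minimal regex sketch that also separates punctuation from words. The pattern and the name regex_tokenize are assumptions made for illustration; this is not what word_tokenize() does internally, and it ignores contractions, abbreviations and the other cases a Treebank-style tokenizer handles.

```python
import re

# Hypothetical pattern for illustration: a run of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def regex_tokenize(text):
    """Split text into word and punctuation tokens with a single regex."""
    return TOKEN_RE.findall(text)


sent = "This is a foo, bar sentence."
print(regex_tokenize(sent))
# ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']
```

On this one sentence the output happens to match word_tokenize(), but that should not be expected in general.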
Now let's run some benchmarks in Python 3:
    import time
    from nltk import word_tokenize
    import urllib.request

    # Download the test corpus once and keep it in memory.
    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf8')

    # Time 10 passes of str.split() over every line.
    for _ in range(10):
        start = time.time()
        for line in data.split('\n'):
            line.split()
        print('str.split():\t', time.time() - start)

    # Time 10 passes of word_tokenize() over every line.
    for _ in range(10):
        start = time.time()
        for line in data.split('\n'):
            word_tokenize(line)
        print('word_tokenize():\t', time.time() - start)
[out]:
    str.split(): 0.05451083183288574
    str.split(): 0.054320573806762695
    str.split(): 0.05368804931640625
    str.split(): 0.05416440963745117
    str.split(): 0.05299568176269531
    str.split(): 0.05304527282714844
    str.split(): 0.05356955528259277
    str.split(): 0.05473494529724121
    str.split(): 0.053118228912353516
    str.split(): 0.05236077308654785
    word_tokenize(): 4.056122779846191
    word_tokenize(): 4.052812337875366
    word_tokenize(): 4.042144775390625
    word_tokenize(): 4.101543664932251
    word_tokenize(): 4.213029146194458
    word_tokenize(): 4.411528587341309
    word_tokenize(): 4.162556886672974
    word_tokenize(): 4.225975036621094
    word_tokenize(): 4.22914719581604
    word_tokenize(): 4.203172445297241
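The same loop can be wrapped in a small helper so that other tokenizers, such as a pre-compiled regex, can be timed under identical conditions. This is only a sketch: time_tokenizer, the regex pattern and the synthetic lines are assumptions made for illustration, and the numbers it prints are not comparable to the timings above.

```python
import re
import time


def time_tokenizer(tokenize, lines, repeat=3):
    """Return the wall-clock time of each pass of `tokenize` over `lines`."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        for line in lines:
            tokenize(line)
        timings.append(time.perf_counter() - start)
    return timings


# Synthetic stand-in for the downloaded test file (assumption).
lines = ["This is a foo, bar sentence."] * 10000

# Hypothetical regex tokenizer, compiled once outside the timed loop.
regex_tokenize = re.compile(r"\w+|[^\w\s]").findall

for name, func in [("str.split()", str.split), ("regex", regex_tokenize)]:
    print(name, min(time_tokenizer(func, lines)), sep="\t")
```

Any callable that takes a line and returns tokens can be passed in, so word_tokenize can be added to the list when NLTK is installed.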
If we try another tokenizer in bleeding-edge NLTK, from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl :
    import time
    from nltk.tokenize import ToktokTokenizer
    import urllib.request

    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf8')

    # Bind the method once so the loop does not re-create the tokenizer.
    toktok = ToktokTokenizer().tokenize

    for _ in range(10):
        start = time.time()
        for line in data.split('\n'):
            toktok(line)
        print('toktok:\t', time.time() - start)
[out]:
    toktok: 1.5902607440948486
    toktok: 1.5347232818603516
    toktok: 1.4993178844451904
    toktok: 1.5635688304901123
    toktok: 1.5779635906219482
    toktok: 1.8177132606506348
    toktok: 1.4538452625274658
    toktok: 1.5094449520111084
    toktok: 1.4871931076049805
    toktok: 1.4584410190582275
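(Note: the source of the text file is https://github.com/Simdiva/DSL-Task)
If we look at the native Perl implementation, the Python and Perl times for ToktokTokenizer are comparable. Note that in the Python implementation the regexes are pre-compiled, while in Perl they are not; but then, the proof is still in the pudding: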
    alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
    --2016-02-11 20:36:36-- https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2690 (2.6K) [text/plain]
    Saving to: ‘tok-tok.pl’
    100%[===============================================================================================================================>] 2,690 --.-K/s in 0s
    2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]
    alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
    --2016-02-11 20:36:38-- https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 3483550 (3.3M) [text/plain]
    Saving to: ‘test.txt’
    100%[===============================================================================================================================>] 3,483,550 363KB/s in 7.4s
    2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    real 0m1.703s
    user 0m1.693s
    sys 0m0.008s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    real 0m1.715s
    user 0m1.704s
    sys 0m0.008s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    real 0m1.700s
    user 0m1.686s
    sys 0m0.012s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    real 0m1.727s
    user 0m1.700s
    sys 0m0.024s
    alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null
    real 0m1.734s
    user 0m1.724s
    sys 0m0.008s
(Note: when timing tok-tok.pl, the output had to be redirected to a file, so the timings here include the time the machine takes to write the file, whereas the nltk.tokenize.ToktokTokenizer timings do not include the time to write to a file.)
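To put the Python side on the same footing as the Perl run, the timing can include writing the tokens to a file. A minimal sketch, assuming any callable that maps a line to a list of tokens: str.split stands in for ToktokTokenizer().tokenize so that the snippet runs without NLTK, and time_with_output is a hypothetical helper name.

```python
import os
import tempfile
import time


def time_with_output(tokenize, lines, path):
    """Time tokenizing `lines` and writing space-joined tokens to `path`."""
    start = time.perf_counter()
    with open(path, "w", encoding="utf8") as fout:
        for line in lines:
            fout.write(" ".join(tokenize(line)) + "\n")
    return time.perf_counter() - start


lines = ["This is a foo, bar sentence."] * 10000  # synthetic data (assumption)

fd, path = tempfile.mkstemp(suffix=".tok")
os.close(fd)
try:
    # Swap str.split for ToktokTokenizer().tokenize to time the NLTK tokenizer.
    print("with file output:\t", time_with_output(str.split, lines, path))
finally:
    os.remove(path)
```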
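As for sent_tokenize(), things are a little different, and comparing speed benchmarks without considering accuracy is a little quirky.
Consider this:
If a regex splits a text file/paragraph into a single sentence, the speed is almost instantaneous, i.e. zero work is done. But that would make a horrible sentence tokenizer...
If the sentences in a file are already separated by \n, then it is simply a matter of comparing str.split('\n') against re.split('\n'), and nltk has nothing to do with the sentence tokenization ;P
For details on how sent_tokenize() works in NLTK, see the following.
So, to effectively compare sent_tokenize() against other regex-based methods (not str.split('\n')), one also has to evaluate accuracy, which requires a dataset of human-evaluated sentences in tokenized form.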
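Such an evaluation could be sketched as follows: sentence-level precision, recall and F1 of a tokenizer's output against human-segmented sentences. The function name sentence_prf and the exact-match criterion are assumptions made for illustration; a real evaluation might compare boundary positions instead.

```python
from collections import Counter


def sentence_prf(predicted, gold):
    """Exact-match precision, recall and F1 between two lists of sentences."""
    pred_counts = Counter(s.strip() for s in predicted)
    gold_counts = Counter(s.strip() for s in gold)
    # Multiset intersection: a sentence counts once per matching occurrence.
    correct = sum((pred_counts & gold_counts).values())
    precision = correct / sum(pred_counts.values()) if pred_counts else 0.0
    recall = correct / sum(gold_counts.values()) if gold_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["No!", "We are very far from that."]
print(sentence_prf(["No! We are very far from that."], gold))  # (0.0, 0.0, 0.0)
print(sentence_prf(gold, gold))                                # (1.0, 1.0, 1.0)
```

A tokenizer that returns the whole paragraph as one sentence scores zero here no matter how fast it runs, which is the point made above.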
Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences
Given the text:
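In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.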
We want to get this:
    In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
    Such were Willarski and even the Grand Master of the principal lodge.
    Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
    These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
    Pierre began to feel dissatisfied with what he was doing.
    Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
    He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
    And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
    What is to be done in these circumstances?
    To favor revolutions, overthrow everything, repel force by force?
    No!
    We are very far from that.
    Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
    "But what is there in running across it like that?" said Ilagin's groom.
    "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.
So simply doing str.split('\n') gives you nothing. Even without considering the order of the sentences, you get 0 positive results:
    >>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
    >>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
    ... Such were Willarski and even the Grand Master of the principal lodge.
    ... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
    ... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
    ... Pierre began to feel dissatisfied with what he was doing.
    ... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
    ... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
    ... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
    ... What is to be done in these circumstances?
    ... To favor revolutions, overthrow everything, repel force by force?
    ... No!
    ... We are very far from that.
    ... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
    ... "But what is there in running across it like that?" said Ilagin's groom.
    ... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
    >>>
    >>> output = text.split('\n')
    >>> sum(1 for sent in text.split('\n') if sent in answer)
    0
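For contrast, a naive regex splitter that breaks after ., ! or ? followed by whitespace recovers some sentences, but still fails wherever the space after the punctuation is missing. A minimal sketch on a shortened excerpt of the text above; the pattern and the name naive_sent_split are assumptions chosen for illustration, not the method used by sent_tokenize().

```python
import re

# Hypothetical rule: split on whitespace that follows '.', '!' or '?'.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def naive_sent_split(text):
    """Split text into sentences on whitespace after end punctuation."""
    return [s for s in BOUNDARY_RE.split(text.strip()) if s]


excerpt = ("What is to be done in these circumstances? To favor revolutions, "
           "overthrow everything, repel force by force?No! "
           "We are very far from that.")
gold = [
    "What is to be done in these circumstances?",
    "To favor revolutions, overthrow everything, repel force by force?",
    "No!",
    "We are very far from that.",
]

# str.split('\n') returns the excerpt as one piece, so nothing matches.
print(sum(1 for sent in excerpt.split('\n') if sent in gold))        # 0
# The regex finds two of the four gold sentences; "force?No!" stays glued.
print(sum(1 for sent in naive_sent_split(excerpt) if sent in gold))  # 2
```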