Speed up millions of regex replacements in Python 3

Question

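I'm using Python 3.5.2.

I have two lists:

a list of about 750,000 "sentences" (long strings)
a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.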

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter:

    compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences":

    import re

    for sentence in sentences:
        for word in compiled_words:
            sentence = re.sub(word, "", sentence)
        # put sentence into a growing list

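This nested loop processes about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

Is there a way to use the str.replace method (which I believe is faster) while still requiring that replacements only happen at word boundaries?
Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub if the length of my word is greater than the length of my sentence, but it's not much of an improvement.

Thank you for any suggestions.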

Liteye · Accepted Answer

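One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are not regular expressions, Eric's answer is faster.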
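A minimal sketch of the single-pattern idea might look like the following. The helper name build_union_pattern and the sample words are made up for illustration, and the use of re.escape is an added assumption so that words containing regex metacharacters are matched literally.

```python
import re

def build_union_pattern(words):
    # Escape every word so that characters such as "." are matched literally,
    # then join them into one alternation wrapped in word boundaries.
    escaped = (re.escape(word) for word in words)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")

# Hypothetical sample data, not taken from the question.
pattern = build_union_pattern(["cat", "dog"])
cleaned = pattern.sub("", "cat concatenate dog.")
print(repr(cleaned))
```

Here "cat" inside "concatenate" is left alone because no word boundary precedes it, while the standalone "cat" and "dog" are removed in a single pass over the sentence.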

Eric Duminil · Answer

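TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it is approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check whether another word is included in that set.

Pack the logic into a function, give this function as an argument to re.sub, and you're done!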

Code

    import re

    with open('/usr/share/dict/american-english') as wordbook:
        banned_words = set(word.strip().lower() for word in wordbook)


    def delete_banned_words(matchobj):
        # Called by re.sub for every \w+ match; returns the replacement text.
        word = matchobj.group(0)
        if word.lower() in banned_words:
            return ""
        else:
            return word

    sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
                 "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

    word_pattern = re.compile(r'\w+')

    for sentence in sentences:
        sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

    ' .  !
      .
    GiraffeElephantBoat
    sfgsdg sdwerha aswertwe

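Note that:

the search is case-insensitive (thanks to lower())
replacing a word with "" might leave two spaces (as in your code)
with Python 3, \w+ also matches accented characters (e.g. "ångström")
any non-word character (tab, space, newline, punctuation marks, ...) stays untouched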
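If the leftover double spaces mentioned in the notes above are unwanted, an optional clean-up pass could be added afterwards. This is only a sketch: tidy_whitespace is a hypothetical helper that is not part of the answer, and it deliberately gives up the "non-word characters stay untouched" property for spaces and tabs.

```python
import re

def tidy_whitespace(sentence):
    # Collapse runs of spaces/tabs left behind by removed words,
    # then trim leading and trailing whitespace.
    return re.sub(r"[ \t]{2,}", " ", sentence).strip()

print(repr(tidy_whitespace("' .  !")))
```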

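Performance

There are a million sentences, banned_words has almost 100,000 words, and the script runs in less than 7 seconds.

In comparison, Liteye's answer needed 160 seconds for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, the OP's and Liteye's code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).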

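Regex union test

What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union and tests it with different words:

one is clearly not a word (it begins with #)
one is the first word in the list
one is the last word in the list
one looks like a word but isn't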

    import re
    import timeit
    import random

    with open('/usr/share/dict/american-english') as wordbook:
        english_words = [word.strip().lower() for word in wordbook]
        random.shuffle(english_words)

    print("First 10 words :")
    print(english_words[:10])

    test_words = [
        ("Surely not a word", "#surely_NöTäWord_so_regex_engine_can_return_fast"),
        ("First word", english_words[0]),
        ("Last word", english_words[-1]),
        ("Almost a word", "couldbeaword")
    ]


    def find(word):
        def fun():
            return union.match(word)
        return fun

    for exp in range(1, 6):
        print("\nUnion of %d words" % 10**exp)
        union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
        for description, test_word in test_words:
            time = timeit.timeit(find(test_word), number=1000) * 1000
            print("  %-17s : %.1fms" % (description, time))

It outputs:

    First 10 words :
    ["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

    Union of 10 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 0.7ms
      Almost a word     : 0.7ms

    Union of 100 words
      Surely not a word : 0.7ms
      First word        : 1.1ms
      Last word         : 1.2ms
      Almost a word     : 1.2ms

    Union of 1000 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 9.6ms
      Almost a word     : 10.1ms

    Union of 10000 words
      Surely not a word : 1.4ms
      First word        : 1.8ms
      Last word         : 96.3ms
      Almost a word     : 116.6ms

    Union of 100000 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 1227.1ms
      Almost a word     : 1404.1ms

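So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

O(1) best case
O(n/2) average case, which is still O(n)
O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.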

Eric Duminil · Answer

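TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it is approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.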

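Optimized regex with a trie

A simple regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.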

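Example

    ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

    {
        'f': {
            'o': {
                'o': {
                    'x': {'a': {'r': {'': 1}}},
                    'b': {'a': {'r': {'': 1}, 'h': {'': 1}}},
                    'z': {'a': {'': 1, 'p': {'': 1}}}
                }
            }
        }
    }

And then to this regex pattern:

    r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test whether zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's preprocessing overkill for 5 words, but it shows promising results for many thousand words.

Note that (?:) non-capturing groups are used: they group the alternatives without saving unneeded information to a capturing group.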
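As a quick sanity check, the hand-written trie pattern above can be compared against a plain regex union of the same five words. The extra candidate strings ('zoo', 'foo', 'foobaz', 'foozapp') are made up for this illustration.

```python
import re

words = ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

trie_regex = re.compile(r"\bfoo(?:ba[hr]|xar|zap?)\b")
union_regex = re.compile(r"\b(?:" + "|".join(words) + r")\b")

# Both patterns should accept exactly the same strings.
candidates = words + ['zoo', 'foo', 'foobaz', 'foozapp']
agreement = [bool(trie_regex.fullmatch(c)) == bool(union_regex.fullmatch(c))
             for c in candidates]

print(all(agreement))
print(trie_regex.groups)  # number of capturing groups in the trie pattern
```

Because only (?:) groups are used, the compiled trie pattern reports zero capturing groups.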

Code

Here's a slightly modified Gist, which we can use as a trie.py library:

    import re


    class Trie():
        """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
        The corresponding Regex should match much faster than a simple Regex union."""

        def __init__(self):
            self.data = {}

        def add(self, word):
            # Walk down the trie, creating nodes as needed; '' marks the end of a word.
            ref = self.data
            for char in word:
                ref[char] = char in ref and ref[char] or {}
                ref = ref[char]
            ref[''] = 1

        def dump(self):
            return self.data

        def quote(self, char):
            return re.escape(char)

        def _pattern(self, pData):
            data = pData
            # A node holding only the end marker contributes nothing to the pattern.
            if "" in data and len(data.keys()) == 1:
                return None

            alt = []
            cc = []
            q = 0
            for char in sorted(data.keys()):
                if isinstance(data[char], dict):
                    try:
                        recurse = self._pattern(data[char])
                        alt.append(self.quote(char) + recurse)
                    except TypeError:
                        # recurse is None (leaf): char goes into a character class
                        cc.append(self.quote(char))
                else:
                    q = 1
            cconly = not len(alt) > 0

            if len(cc) > 0:
                if len(cc) == 1:
                    alt.append(cc[0])
                else:
                    alt.append('[' + ''.join(cc) + ']')

            if len(alt) == 1:
                result = alt[0]
            else:
                result = "(?:" + "|".join(alt) + ")"

            if q:
                if cconly:
                    result += "?"
                else:
                    result = "(?:%s)?" % result
            return result

        def pattern(self):
            return self._pattern(self.dump())

Test

Here's a small test (the same as this one):

    # Encoding: utf-8
    import re
    import timeit
    import random
    from trie import Trie

    with open('/usr/share/dict/american-english') as wordbook:
        banned_words = [word.strip().lower() for word in wordbook]
        random.shuffle(banned_words)

    test_words = [
        ("Surely not a word", "#surely_NöTäWord_so_regex_engine_can_return_fast"),
        ("First word", banned_words[0]),
        ("Last word", banned_words[-1]),
        ("Almost a word", "couldbeaword")
    ]

    def trie_regex_from_words(words):
        trie = Trie()
        for word in words:
            trie.add(word)
        return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)


    def find(word):
        def fun():
            return union.match(word)
        return fun

    for exp in range(1, 6):
        print("\nTrieRegex of %d words" % 10**exp)
        union = trie_regex_from_words(banned_words[:10**exp])
        for description, test_word in test_words:
            time = timeit.timeit(find(test_word), number=1000) * 1000
            print("  %s : %.1fms" % (description, time))

It outputs:

    TrieRegex of 10 words
      Surely not a word : 0.3ms
      First word : 0.4ms
      Last word : 0.5ms
      Almost a word : 0.5ms

    TrieRegex of 100 words
      Surely not a word : 0.3ms
      First word : 0.5ms
      Last word : 0.9ms
      Almost a word : 0.6ms

    TrieRegex of 1000 words
      Surely not a word : 0.3ms
      First word : 0.7ms
      Last word : 0.9ms
      Almost a word : 1.1ms

    TrieRegex of 10000 words
      Surely not a word : 0.1ms
      First word : 1.0ms
      Last word : 1.2ms
      Almost a word : 1.2ms

    TrieRegex of 100000 words
      Surely not a word : 0.3ms
      First word : 1.2ms
      Last word : 0.9ms
      Almost a word : 1.6ms

For info, the regex begins like this:

    (?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...

It's really unreadable, but for a list of 100,000 banned words, this trie regex is 1000 times faster than a simple regex union!
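To show how a trie-derived pattern could be applied to the original task of deleting words from sentences, here is a compact, self-contained sketch. It is not the Gist's code: build_trie, trie_to_pattern and compile_banned are hypothetical helpers, and this simplified generator emits (?:h|r) where the Gist would emit the character class [hr]. The resulting patterns should match the same words.

```python
import re

def build_trie(words):
    # Nested dicts; the "" key marks the end of a word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def trie_to_pattern(node):
    # Returns a regex fragment matching every suffix stored below this node.
    is_end = "" in node
    alternatives = [re.escape(ch) + trie_to_pattern(child)
                    for ch, child in sorted(node.items()) if ch != ""]
    if not alternatives:
        return ""
    if len(alternatives) == 1 and not is_end:
        return alternatives[0]
    group = "(?:" + "|".join(alternatives) + ")"
    # If a word may also end here, the rest is optional.
    return group + "?" if is_end else group

def compile_banned(words):
    pattern = trie_to_pattern(build_trie(words))
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)

banned_regex = compile_banned(['foobar', 'foobah', 'fooxar', 'foozap', 'fooza'])
cleaned = banned_regex.sub("", "Foobar met fooza, not foobaz or zoo.")
print(repr(cleaned))
```

The pattern is compiled once and then reused with sub() for every sentence, which is where the speed-up over a long alternation is expected to come from.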

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi.

Denziloe · Answer

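One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically, turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check whether it's a match.

Currently the regex search has to go through the entire string again each time, looking for word boundaries, and then "discarding" the result of this work before the next pass.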
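A rough sketch of this pre-processing idea, assuming that "word boundary" can be approximated by splitting on runs of non-word characters (\W+). The names tokenize and remove_banned and the sample data are invented for the example.

```python
import re

# The capturing group makes re.split keep the separators in the result,
# so the sentence can be rebuilt exactly.
SPLITTER = re.compile(r"(\W+)")

def tokenize(sentence):
    return SPLITTER.split(sentence)

def remove_banned(tokens, banned):
    # One set lookup per token instead of one regex pass per banned word.
    return "".join(token for token in tokens if token not in banned)

banned = {"boring", "here"}
tokens = tokenize("Another boring sentence.")
print(tokens)
print(repr(remove_banned(tokens, banned)))
```

Each sentence is tokenized once, and the token list can be reused however many banned words there are.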

peufeu · Answer

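OK, here is a quick and easy solution, with a test set.

Winning strategy:

re.sub("\w+", repl, sentence) searches for words.

"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in the example code below).

Second best

The idea is to split the sentences into words using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in the example code below).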

Timings for example functions:

    replace1: 0.62 sentences/s
    replace2: 7.43 sentences/s
    replace3: 48498.03 sentences/s
    replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

...and the code:

    #! /bin/env python3
    # -*- coding: utf-8

    import time, random, re

    def replace1( sentences ):
        for n, sentence in enumerate( sentences ):
            for search, repl in patterns:
                # raw strings are needed so that \b is a word boundary, not a backspace
                sentence = re.sub( r"\b"+search+r"\b", repl, sentence )

    def replace2( sentences ):
        for n, sentence in enumerate( sentences ):
            for search, repl in patterns_comp:
                sentence = re.sub( search, repl, sentence )

    def replace3( sentences ):
        pd = patterns_dict.get
        for n, sentence in enumerate( sentences ):
            #~ print( n, sentence )
            # Split the sentence on non-word characters.
            # Note: () in split patterns ensure the non-word characters ARE kept
            # and returned in the result list, so we don't mangle the sentence.
            # If ALL separators are spaces, use string.split instead or something.
            # Example:
            #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
            #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
            words = re.split(r"([^\w]+)", sentence)

            # and... done.
            sentence = "".join( pd(w,w) for w in words )

            #~ print( n, sentence )

    def replace4( sentences ):
        pd = patterns_dict.get
        def repl(m):
            w = m.group()
            return pd(w,w)

        for n, sentence in enumerate( sentences ):
            sentence = re.sub(r"\w+", repl, sentence)



    # Build test set
    test_words = [ ("word%d" % _) for _ in range(50000) ]
    test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

    # Create search and replace patterns
    patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
    patterns_dict = dict( patterns )
    patterns_comp = [ (re.compile(r"\b"+search+r"\b"), repl) for search, repl in patterns ]


    def test( func, num ):
        t = time.time()
        func( test_sentences[:num] )
        print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

    print( "Sentences", len(test_sentences) )
    print( "Words    ", len(test_words) )

    test( replace1, 1 )
    test( replace2, 10 )
    test( replace3, 1000 )
    test( replace4, 1000 )

karakfa · Answer

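Perhaps Python is not the right tool here. Here is one with the Unix toolchain:

    sed G file         |
    tr ' ' '\n'        |
    grep -vf blacklist |
    awk -v RS= -v OFS=' ' '{$1=$1}1'

This assumes your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line):

    sed 's/.*/\\b&\\b/' words > blacklist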

Lie Ryan · Answer

How about this:

    #!/usr/bin/env python3

    from __future__ import unicode_literals, print_function
    import re
    import time
    import io

    def replace_sentences_1(sentences, banned_words):
        # faster on CPython, but does not use \b as the word separator
        # so result is slightly different than replace_sentences_2()
        def filter_sentence(sentence):
            words = WORD_SPLITTER.split(sentence)
            words_iter = iter(words)
            for word in words_iter:
                norm_word = word.lower()
                if norm_word not in banned_words:
                    yield word
                # yield the word separator; the '' default avoids a StopIteration
                # escaping the generator after the last word (an error since Python 3.7)
                yield next(words_iter, '')

        WORD_SPLITTER = re.compile(r'(\W+)')
        banned_words = set(banned_words)
        for sentence in sentences:
            yield ''.join(filter_sentence(sentence))


    def replace_sentences_2(sentences, banned_words):
        # slower on CPython, uses \b as separator
        def filter_sentence(sentence):
            boundaries = WORD_BOUNDARY.finditer(sentence)
            current_boundary = 0
            try:
                while True:
                    last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                    yield sentence[last_word_boundary:current_boundary] # yield the separators
                    last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                    word = sentence[last_word_boundary:current_boundary]
                    norm_word = word.lower()
                    if norm_word not in banned_words:
                        yield word
            except StopIteration:
                # no boundaries left: emit whatever follows the last word
                yield sentence[current_boundary:]

        WORD_BOUNDARY = re.compile(r'\b')
        banned_words = set(banned_words)
        for sentence in sentences:
            yield ''.join(filter_sentence(sentence))


    corpus = io.open('corpus2.txt').read()
    banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
    sentences = corpus.split('. ')
    output = io.open('output.txt', 'wb')
    print('number of sentences:', len(sentences))
    start = time.time()
    for sentence in replace_sentences_1(sentences, banned_words):
        output.write(sentence.encode('utf-8'))
        output.write(b' .')
    print('time:', time.time() - start)

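These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub of word alternates (Liteye's solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, whereas using regex alternates would cause the regex engine to check for word matches on every character rather than just on word boundaries. My solution takes extra care to preserve the whitespace that was used in the original text (i.e. it doesn't compress whitespace and preserves tabs, newlines, and other whitespace characters), but if you decide that you don't care about it, it should be fairly straightforward to remove them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from Project Gutenberg, and banned_words.txt is 20,000 words randomly picked from Ubuntu's wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862,462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".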

    $ # replace_sentences_1()
    $ python3 filter_words.py
    number of sentences: 862462
    time: 24.46173644065857
    $ pypy filter_words.py
    number of sentences: 862462
    time: 15.9370770454

    $ # replace_sentences_2()
    $ python3 filter_words.py
    number of sentences: 862462
    time: 40.2742919921875
    $ pypy filter_words.py
    number of sentences: 862462
    time: 13.1190629005

PyPy particularly benefits from the second approach, while CPython fared better on the first approach. The above code should work on both Python 2 and 3.

I159 · Answer

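Practical approach

The solution described below uses a lot of memory to store all the text in the same string, in order to reduce the complexity level. If RAM is an issue, think twice before using it.

With join/split tricks you can avoid loops altogether, which should speed up the algorithm.

Concatenate the sentences with a special delimiter which is not contained in the sentences:

    merged_sentences = ' * '.join(sentences)

Compile a single regex for all the words you need to get rid of from the sentences, using the | "or" regex statement:

    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

Substitute the words with the compiled regex and split the result by the special delimiter character back into separate sentences:

    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')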
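Put together, the three steps might look like the sketch below. The sample sentences and words are invented, the re.escape call is an added assumption (so that words containing regex metacharacters are treated literally), and the delimiter check is only a cheap guard, since the trick relies on the delimiter never appearing in the input.

```python
import re

# Hypothetical sample data.
sentences = ["I like foo and bar", "Foo fighters", "nothing to remove"]
words = ["foo", "bar"]

DELIMITER = " * "
# The trick only works if the delimiter never occurs inside a sentence.
assert not any(DELIMITER in sentence for sentence in sentences)

# 1. Merge everything into one big string.
merged_sentences = DELIMITER.join(sentences)

# 2. One case-insensitive pattern for all words.
regex = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, words))), re.I)

# 3. Substitute once, then split back into individual sentences.
clean_sentences = regex.sub("", merged_sentences).split(DELIMITER)
print(clean_sentences)
```

The number of sentences is preserved as long as no banned word overlaps the delimiter itself.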

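Performance

"".join complexity is O(n). This is pretty intuitive, but anyway there is a short quotation from the source:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);

Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, versus 2*O(N²) with the initial approach.

b.t.w. don't use multithreading. Your task is strictly CPU bound, so the GIL has no chance to be released and will block each operation, yet each thread will send ticks concurrently, which causes extra effort and can even make the operation take forever.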

Edi Bice · Answer

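Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all your "bad" words. Traverse the file, replacing each bad word, updating the offsets of found words that follow, etc.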
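For illustration only, here is a small pure-Python sketch of that idea. It is not the implementation linked in the answer: build_automaton and remove_words are hypothetical names, matching is case-sensitive, and a match is only removed when it is not embedded in a larger run of word characters (an assumption meant to mimic the \b requirement from the question). Instead of updating offsets in place, it collects the matched spans and rebuilds the text once.

```python
from collections import deque

def build_automaton(words):
    # goto: per-state transitions, fail: failure links,
    # out: lengths of the words that end at each state.
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(len(word))
    # Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def _is_word_char(ch):
    return ch.isalnum() or ch == "_"

def remove_words(text, automaton):
    goto, fail, out = automaton
    spans, state = [], 0
    for i, ch in enumerate(text):      # single pass over the text
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for length in out[state]:
            start, end = i - length + 1, i + 1
            left_ok = start == 0 or not _is_word_char(text[start - 1])
            right_ok = end == len(text) or not _is_word_char(text[end])
            if left_ok and right_ok:   # keep whole-word matches only
                spans.append((start, end))
    # Rebuild the text, skipping matched spans (longest first on ties).
    pieces, pos = [], 0
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if start >= pos:
            pieces.append(text[pos:start])
            pos = end
    pieces.append(text[pos:])
    return "".join(pieces)

automaton = build_automaton(["foo", "bar"])
print(repr(remove_words("foo foobar bar, baz", automaton)))
```

The automaton is built once from the banned words, and each text is then scanned in a single left-to-right pass regardless of how many words are banned.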