Find the longest repeating sequence in a string

Question

I need to find the longest sequence in a string, with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value "helloworld" to be returned.

I know of a few ways to accomplish this, but the problem I'm facing is that my actual string is absurdly large, so I'm really looking for a method that can do it in a timely fashion.

templatetypedef · Accepted Answer

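This problem is a variant of the longest repeated substring problem, and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate every node in the tree with its number of descendants (time O(n) using a DFS), and then find the deepest node in the tree with at least three descendants (time O(n) using a DFS). What matters here is the number of leaves below a node, since each leaf corresponds to one occurrence of the substring spelled out on the path to that node, and "deepest" is measured by the length of that substring. The overall algorithm takes time O(n).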
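A linear-time suffix tree is too long to show here, so the following is only a rough stand-in that uses the same underlying observation: every substring occurring at least three times is a common prefix of three suffixes that sit next to each other in sorted order. It sorts the suffixes naively, so it is not the O(n) algorithm described above and will not scale to huge inputs. The names `common_prefix_len` and `longest_repeated` are made up for this sketch.

```python
def common_prefix_len(s, i, j):
    """Length of the common prefix of the suffixes s[i:] and s[j:]."""
    n = len(s)
    k = 0
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


def longest_repeated(s, min_occ=3):
    """Longest substring occurring at least min_occ times (overlaps count)."""
    n = len(s)
    # Naive suffix sort: builds a slice per suffix, so this is for illustration only.
    order = sorted(range(n), key=lambda i: s[i:])
    best = ""
    # In sorted order, suffixes sharing a prefix are contiguous, so it is enough
    # to compare the first and last suffix of each window of min_occ neighbours.
    for j in range(n - min_occ + 1):
        first = order[j]
        last = order[j + min_occ - 1]
        length = common_prefix_len(s, first, last)
        if length > len(best):
            best = s[first:first + length]
    return best


text = "fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld"
print(longest_repeated(text))  # helloworld
```

A real suffix tree or a properly built suffix array with an LCP array replaces the naive sort and brings the running time down; the window idea stays the same.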

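That said, suffix trees are notoriously hard to construct, so you will probably want to find a Python library that implements suffix trees for you before attempting this yourself. A quick Google search turns up this library, though I'm not sure whether it is a good implementation.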

Hope this helps!

PaulMcG · Answer

Use a defaultdict to tally each substring beginning at each position in the input string. The OP did not say whether overlapping matches should be included; this brute-force method includes them.

    from collections import defaultdict

    def getsubs(loc, s):
        # yield every substring of s that starts at index loc, longest first
        substr = s[loc:]
        i = -1
        while substr:
            yield substr
            substr = s[loc:i]
            i -= 1

    def longestRepetitiveSubstring(r, minocc=3):
        occ = defaultdict(int)
        # tally all occurrences of all substrings
        for i in range(len(r)):
            for sub in getsubs(i, r):
                occ[sub] += 1

        # filter out all substrings with fewer than minocc occurrences
        occ_minocc = [k for k, v in occ.items() if v >= minocc]

        if occ_minocc:
            maxkey = max(occ_minocc, key=len)
            return maxkey, occ[maxkey]
        else:
            raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r, minocc))

    print(longestRepetitiveSubstring("fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld"))

Prints:

('helloworld', 3)
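To make the remark about overlapping matches concrete, here is a small sketch comparing an overlapping count with Python's built-in `str.count`, which counts non-overlapping occurrences only. The helper name `count_overlapping` is hypothetical and is not part of the answer's code.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing occurrences to overlap."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Move forward by one character only, so overlapping hits are found.
        start = text.find(sub, start + 1)
    return count


sample = "aaaa"
print(count_overlapping(sample, "aa"))  # 3 (starting at indexes 0, 1 and 2)
print(sample.count("aa"))               # 2 (non-overlapping only)
```

A tally over every start position, like the brute-force method above, behaves like `count_overlapping`: in "aaaa" the substring "aa" counts as occurring three times.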

Max Li · Answer

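Let's start from the end (the longest candidate length), count the frequencies, and stop as soon as the most frequent element appears 3 or more times.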

    from collections import Counter

    a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
    times = 3

    # try candidate lengths from longest to shortest
    for n in range(len(a) // times, 0, -1):
        substrings = [a[i:i+n] for i in range(len(a) - n + 1)]
        freqs = Counter(substrings)
        seq, count = freqs.most_common(1)[0]
        if count >= times:
            break

    print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))

Result:

>>> sequence 'helloworld' of length 10 occurs 3 or more times

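Edit: if you have the feeling that you may be dealing with random input and the common substring is likely to be short, you had better start with small substrings (if you need the speed) and stop when you can no longer find any that appear at least 3 times.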

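    from collections import Counter

    a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
    times = 3
    seq = ''  # stays empty if not even a single character repeats often enough

    # try candidate lengths from shortest to longest
    for n in range(1, len(a) // times + 1):
        substrings = [a[i:i+n] for i in range(len(a) - n + 1)]
        freqs = Counter(substrings)
        if freqs.most_common(1)[0][1] < times:
            n -= 1  # the previous length was the last one that worked
            break
        else:
            seq = freqs.most_common(1)[0][0]

    print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))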

Same result as above.
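Both loops rely on the same property: if a substring of length n occurs at least 3 times, so does its prefix of length n - 1. That also means the length could be binary-searched instead of scanned. The following is only a sketch of that variation, not part of the answer as posted; `most_common_of_length` and `longest_by_binary_search` are hypothetical names.

```python
from collections import Counter


def most_common_of_length(text, n):
    """Return (substring, count) for the most frequent substring of length n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(1)[0]


def longest_by_binary_search(text, times=3):
    """Longest substring occurring at least `times` times (overlaps count)."""
    lo, hi = 0, len(text)  # length lo always works (0 trivially); hi is an upper bound
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the search always makes progress
        seq, count = most_common_of_length(text, mid)
        if count >= times:
            lo, best = mid, seq   # length mid works, try longer
        else:
            hi = mid - 1          # length mid fails, so every longer length fails too
    return best


a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_by_binary_search(a))  # helloworld
```

Each probe still costs time proportional to the string length times the candidate length, so this reduces the number of lengths tried, not the cost of trying one.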

Matt Coughlin · Answer

The first idea that came to mind is searching with progressively larger regular expressions:

    import re

    text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
    largest = ''
    i = 1

    while True:
        # raw strings keep \w and the \1 backreference intact for the regex engine
        m = re.search("(" + (r"\w" * i) + r").*\1.*\1", text)
        if not m:
            break
        largest = m.group(1)
        i += 1

    print(largest)    # helloworld

The code ran successfully. The time complexity appears to be at least O(n^2).

sln · Answer

If you reverse the input string and then feed it to a regex like (.+)(?:.*\1){2}
it should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer.)

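Edit:
I have to say, cancel this approach. It depends on the first match. Unless the match is tested in an iterative loop, comparing the current length against the maximum length so far, a regex won't work for this.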
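One possible reading of that iterative loop, as a sketch only: anchor the same regex at every start position and keep the longest capture seen so far. This is not code from the answer, the name `longest_triple_repeat` is hypothetical, and the backtracking makes it far too slow for a very large string. Note that the regex only finds repeats that do not overlap.

```python
import re

# Greedy (.+) is tried longest-first, so a match anchored at a given start
# position captures the longest repeat that begins exactly there.
PATTERN = re.compile(r"(.+)(?:.*\1){2}", re.DOTALL)


def longest_triple_repeat(text):
    """Longest substring that appears three times without overlapping."""
    best = ""
    for start in range(len(text)):
        m = PATTERN.match(text, start)
        # Compare the current capture length against the maximum so far.
        if m and len(m.group(1)) > len(best):
            best = m.group(1)
    return best


text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_triple_repeat(text))  # helloworld
```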