Find the nth occurrence of a substring in a string

Question

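This seems like it should be pretty trivial, but I am new to Python and want to do it the most Pythonic way.

I want to find the nth occurrence of a substring in a string.

There must be something equivalent to what I want to do, which is

    mystring.find("substring", 2nd)

How can you achieve this in Python?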

bobince · Accepted Answer

Mark's iterative approach would be the usual way, I think.

Here's an alternative with string-splitting, which can often be useful for finding-related processes:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

    'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
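
As a quick check of the two snippets above, here is a minimal sketch. It assumes n is 0-based in findnth() (n=0 is the first occurrence), which is what the code implies, and the sample strings are made up for illustration. It also shows why the chaff in the one-liner must not be able to match the needle.

```python
def findnth(haystack, needle, n):
    """Return the index of occurrence number n (0-based), or -1 if there is none."""
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)


text = "foo bar bar bar"  # illustrative input
print([findnth(text, "bar", n) for n in range(4)])

# One-liner for the second occurrence: mask the first hit, then search again.
print(text.replace("bar", "XXX", 1).find("bar"))

# Pitfall: chaff that can match the needle gives a wrong index.
bad = "aXXbXX".replace("XX", "XXX", 1).find("XX")    # chaff contains the needle
good = "aXXbXX".replace("XX", "\0\0", 1).find("XX")  # chaff cannot match
print(bad, good)
```

In "aXXbXX" the second "XX" starts at index 4. The first call reports 1 because the replacement text itself matches the needle, while the NUL chaff gives the expected 4.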

Todd Gamblin · Answer

Here's a more Pythonic version of the straightforward iterative solution:

    def find_nth(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start + len(needle))
            n -= 1
        return start

Example:

    >>> find_nth("foofoofoofoo", "foofoo", 2)
    6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

    def find_nth_overlapping(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start + 1)
            n -= 1
        return start

Example:

    >>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
    3

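This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

Simple is better than complex.
Flat is better than nested.
Readability counts.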

Sriram Murali · Answer

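This will find the second occurrence of substring in string.

    def find_2nd(string, substring):
        return string.find(substring, string.find(substring) + 1)

Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:

    def find_nth(string, substring, n):
        if n == 1:
            return string.find(substring)
        previous = find_nth(string, substring, n - 1)
        # Stop once an earlier occurrence is missing; otherwise the search would restart at index 0.
        if previous == -1:
            return -1
        return string.find(substring, previous + 1)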

Mark Peters · Answer

Understanding that regex is not always the best solution, I'd probably use one here:

    >>> import re
    >>> s = "ababdfegtduab"
    >>> [m.start() for m in re.finditer(r"ab", s)]
    [0, 2, 11]
    >>> [m.start() for m in re.finditer(r"ab", s)][2]  # index 2 is third occurrence
    11
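
The list comprehension above collects every match before indexing. A possible variation, sketched here under the assumptions that the needle is a literal string (hence re.escape) and that n is 1-based with n >= 1, stops scanning as soon as the nth match is reached. The name find_nth_re is only illustrative.

```python
import re
from itertools import islice


def find_nth_re(haystack, needle, n):
    """Index of the n-th (1-based) non-overlapping match of the literal needle, or -1."""
    matches = re.finditer(re.escape(needle), haystack)
    # islice skips the first n-1 matches lazily; next() falls back to None when exhausted.
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match else -1


print(find_nth_re("ababdfegtduab", "ab", 3))
print(find_nth_re("ababdfegtduab", "ab", 4))
```

With the string from the session above, the third occurrence is again reported at index 11, and asking for a fourth one yields -1 instead of raising IndexError.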

Stefan · Answer

Here are some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)

    def find_nth(s, x, n=0, overlap=False):
        l = 1 if overlap else len(x)
        i = -l
        for c in xrange(n + 1):  # Python 2; use range() on Python 3
            i = s.find(x, i + l)
            if i < 0:
                break
        return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

    In [1]: import _find_nth, find_nth, mmap
    In [2]: f = open('bigfile', 'r')
    In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

    In [4]: %time s = mm[:]
    CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
    Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let's benchmark findnth():

    In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
    1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let's see how the approach based on str.find() does:

    In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 774 ms per loop

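Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes the second advantage of find_nth(): we can use it on mm directly, such that zero copies of the file are required:

    In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 1.21 s per loop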

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth's total of 47 s.
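
The snippet below is a minimal Python 3 sketch of the zero-copy idea, not the benchmark itself. It assumes a bytes needle (in Python 3, mmap.find takes bytes) and uses a small temporary file as a stand-in for 'bigfile'; the file contents are made up for illustration.

```python
import mmap
import tempfile


def find_nth(s, x, n=0, overlap=False):
    """Index of occurrence n (0-based) of x in s, or -1; s may be str, bytes or mmap."""
    step = 1 if overlap else len(x)
    i = -step
    for _ in range(n + 1):
        i = s.find(x, i + step)
        if i < 0:
            break
    return i


with tempfile.TemporaryFile() as f:
    f.write(b"line0\nline1\nline2\n")  # stand-in data, not the 1.3 GB file
    f.flush()
    # Map the file read-only and search it in place, without building a bytes copy.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        third_newline = find_nth(mm, b"\n", 2)

print(third_newline)
```

Because the function only relies on a find(sub, start) method, the same code runs unchanged on str, bytes and mmap objects.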

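I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c: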

    #include <Python.h>
    #include <string.h>

    off_t _find_nth(const char *buf, size_t l, char c, int n) {
        off_t i;
        for (i = 0; i < l; ++i) {
            if (buf[i] == c && n-- == 0) {
                return i;
            }
        }
        return -1;
    }

    off_t _find_nth2(const char *buf, size_t l, char c, int n) {
        const char *b = buf - 1;
        do {
            /* Search only the bytes that remain after the previous hit;
               passing l unchanged would let memchr() read past the buffer. */
            b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
            if (!b) return -1;
        } while (n--);
        return b - buf;
    }

    /* mmap_object is private in mmapmodule.c - replicate beginning here */
    typedef struct {
        PyObject_HEAD
        char *data;
        size_t size;
    } mmap_object;

    typedef struct {
        const char *s;
        size_t l;
        char c;
        int n;
    } params;

    int parse_args(PyObject *args, params *P) {
        PyObject *obj;
        const char *x;

        if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
            return 1;
        }
        PyTypeObject *type = Py_TYPE(obj);

        if (type == &PyString_Type) {
            P->s = PyString_AS_STRING(obj);
            P->l = PyString_GET_SIZE(obj);
        } else if (!strcmp(type->tp_name, "mmap.mmap")) {
            mmap_object *m_obj = (mmap_object*) obj;
            P->s = m_obj->data;
            P->l = m_obj->size;
        } else {
            PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
            return 1;
        }
        P->c = x[0];
        return 0;
    }

    static PyObject* py_find_nth(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyMethodDef methods[] = {
        {"find_nth", py_find_nth, METH_VARARGS, ""},
        {"find_nth2", py_find_nth2, METH_VARARGS, ""},
        {0}
    };

    PyMODINIT_FUNC init_find_nth(void) {
        Py_InitModule("_find_nth", methods);
    }

Here is the setup.py file:

    from distutils.core import setup, Extension
    module = Extension('_find_nth', sources=['_find_nthmodule.c'])
    setup(ext_modules=[module])

Install as usual with python setup.py install. The C code plays at an advantage here since it is limited to finding single characters, but let's see how fast this is:

    In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 218 ms per loop

    In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 216 ms per loop

    In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
    1 loops, best of 3: 307 ms per loop

    In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
    1 loops, best of 3: 304 ms per loop

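Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring...

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.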

Mark Byers · Answer

I'd probably do something like this, using the find function that takes an index parameter:

    def find_nth(s, x, n):
        i = -1
        start = 0  # begin at index 0 so that a match at the very start is not skipped
        for _ in range(n):
            i = s.find(x, start)
            if i == -1:
                break
            start = i + len(x)
        return i

    print(find_nth('bananabanana', 'an', 3))

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

    def find_nth(s, x, n, i=0):
        i = s.find(x, i)
        if n == 1 or i == -1:
            return i
        else:
            return find_nth(s, x, n - 1, i + len(x))

    print(find_nth('bananabanana', 'an', 3))

It's a functional way to solve it, but I don't know if that makes it more Pythonic.

forbzie · Answer

The simplest way?

    text = "This is a test from a test ok"
    firstTest = text.find('test')
    print(text.find('test', firstTest + 1))

Hank Gay · Answer

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

    import itertools
    import re

    def find_nth(haystack, needle, n = 1):
        """
        Find the starting index of the nth occurrence of ``needle`` in
        ``haystack``.

        If ``needle`` is a ``str``, this will perform an exact substring
        match; if it is a ``RegexpObject``, this will perform a regex
        search.

        If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
        ``needle`` doesn't appear in ``haystack`` ``n`` times,
        return ``-1``.

        Arguments
        ---------
        * ``needle`` the substring (or a ``RegexpObject``) to find
        * ``haystack`` is a ``str``
        * an ``int`` indicating which occurrence to find; defaults to ``1``

        >>> find_nth("foo", "o", 1)
        1
        >>> find_nth("foo", "o", 2)
        2
        >>> find_nth("foo", "o", 3)
        -1
        >>> find_nth("foo", "b")
        -1
        >>> import re
        >>> either_o = re.compile("[oO]")
        >>> find_nth("foo", either_o, 1)
        1
        >>> find_nth("FOO", either_o, 1)
        1
        """
        if (hasattr(needle, 'finditer')):
            matches = needle.finditer(haystack)
        else:
            matches = re.finditer(re.escape(needle), haystack)
        start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
        try:
            return next(start_here)[1].start()
        except StopIteration:
            return -1

Zv_oDD · Answer

Building on modle13's answer, but without the re module dependency.

    def iter_find(haystack, needle):
        return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I kind of wish this were a built-in string method.

    >>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
    [5, 6, 24, 34, 42]
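
One thing to keep in mind with iter_find() is that haystack[i:] builds a new string at every position. A possible variation, sketched here and not part of the original answer, passes the start offset to str.startswith so no slices are created. Overlapping matches are still reported.

```python
def iter_find(haystack, needle):
    """All start indices (overlapping ones included) at which needle occurs."""
    # str.startswith(prefix, start) tests the match at offset start without slicing.
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]


url = "http://stackoverflow.com/questions/1883980/"
slashes = iter_find(url, "/")
print(slashes)
print(slashes[1])  # second occurrence, since the list is 0-based
```

The nth occurrence (1-based) is then slashes[n - 1], provided the list is long enough.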

modle13 · Answer

This will give you an array of the starting indices for matches in yourstring:

    import re
    indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry would be:

    n = 2
    nth_entry = indices[n-1]

Of course you have to be careful with the index bounds. You can get the number of matches in yourstring like this:

    num_instances = len(indices)
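
To make the bounds check concrete, here is a minimal sketch. The helper name nth_index and the choice to return None for an out-of-range n are assumptions made for illustration, and re.escape is added on the assumption that the substring should be matched literally.

```python
import re


def nth_index(text, sub, n):
    """Start index of the n-th (1-based) match of the literal sub, or None if out of range."""
    indices = [m.start() for m in re.finditer(re.escape(sub), text)]
    # Guard the lookup so that n = 0 or n > len(indices) does not raise or wrap around.
    return indices[n - 1] if 1 <= n <= len(indices) else None


print(nth_index("a:b:c", ":", 2))
print(nth_index("a:b:c", ":", 3))
```

The explicit lower bound matters because indices[n - 1] with n = 0 would otherwise silently return the last match.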

Jason · Answer

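    # Return -1 if the nth substr (0-indexed) does not exist, else return its index.
    def find_nth(s, substr, n):
        i = -1  # start before index 0 so that a match at the very beginning is found
        while n >= 0:
            n -= 1
            i = s.find(substr, i + 1)
            if i == -1:  # stop here, otherwise the next search would restart at index 0
                break
        return i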

John La Rooy · Answer

Here's another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary.

    from re import finditer
    from itertools import dropwhile
    needle = 'an'
    haystack = 'bananabanana'
    n = 2  # 0-based here: selects the third match
    next(dropwhile(lambda x: x[0] < n, enumerate(finditer(needle, haystack))))[1].start()

ghostdog74 · Answer

    >>> s = "abcdefabcdefababcdef"
    >>> j = 0
    >>> for n, i in enumerate(s):
    ...     if s[n:n+2] == "ab":
    ...         print(n, i)
    ...         j = j + 1
    ...         if j == 2: print("2nd occurrence at index position:", n)
    ...
    0 a
    6 a
    2nd occurrence at index position: 6
    12 a
    14 a

GetItDone · Answer

How about:

    import os
    c = os.getcwd().split('\\')
    print('\\'.join(c[0:-2]))

Karthik · Answer

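A solution without using loops or recursion.

Pass the required pattern to the compile method and set the variable 'n' to the desired occurrence; the last statement then prints the starting index of that occurrence of the pattern in the given string. Here the result of finditer, i.e. an iterator, is converted to a list and the nth index is accessed directly.

    import re
    n = 2  # used as a list index, so 0-based: n = 2 selects the third match
    sampleString = "this is history"
    pattern = re.compile("is")
    matches = pattern.finditer(sampleString)
    print(list(matches)[n].span()[0])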

yarz-tech · Answer

This is the answer you really want:

    def Find(String, ToFind, Occurence=1):
        # Returns the index of the requested occurrence, or False if there is none.
        index = 0
        count = 0
        while index <= len(String):
            try:
                if String[index:index + len(ToFind)] == ToFind:
                    count += 1
                if count == Occurence:
                    return index
                index += 1
            except IndexError:
                return False
        return False

Ivor Zhou · Answer

I'm offering another "tricky" solution, which uses split and join.

In your example, we can use

    len("substring".join([s for s in ori.split("substring")[:2]]))
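
The expression above returns len(ori) when there are fewer than two occurrences, so a caller cannot tell a miss from a hit. A hedged generalization to any n, with an explicit not-found result, might look like the sketch below. The function name find_nth_split and the -1 convention are assumptions for illustration.

```python
def find_nth_split(text, sub, n):
    """Index of the n-th (1-based) occurrence of sub via split/join, or -1."""
    # Split at most n times: n occurrences produce at least n + 1 parts.
    parts = text.split(sub, n)
    if len(parts) <= n:
        return -1
    # Everything before the n-th occurrence, re-joined, has exactly the wanted length.
    return len(sub.join(parts[:n]))


print(find_nth_split("a--b--c", "--", 2))
print(find_nth_split("a--b--c", "--", 3))
```

Limiting the split to n pieces also avoids splitting the rest of the string needlessly.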

黄锐铭 · Answer

Here is my solution for finding the nth occurrence of b in string a:

    from functools import reduce

    def findNth(a, b, n):
        return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. If n is 0 or too large, it returns -1. It is a one-liner and can be used directly. Here is an example:

    >>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
    7
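
For readers who find the reduce call hard to follow, the sketch below unrolls it into an explicit loop. The name find_nth_loop is only illustrative. In the reduce version, the test y > x + 1 can only be true after an earlier search has already returned -1, so here it becomes a plain early return.

```python
from functools import reduce


def findNth(a, b, n):
    """The reduce-based one-liner, repeated here so the two can be compared."""
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)


def find_nth_loop(a, b, n):
    """Index of the n-th (1-based, overlapping allowed) occurrence of b in a, or -1."""
    pos = -1
    for _ in range(n):
        pos = a.find(b, pos + 1)
        if pos == -1:
            return -1  # fewer than n occurrences
    return pos  # stays -1 when n == 0


sample = "bibarbobaobaotang"
print([find_nth_loop(sample, "b", n) for n in range(8)])
print([findNth(sample, "b", n) for n in range(8)])
```

Both lines print the same list, which is a quick way to confirm that the two forms agree on this input.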

Charles Doutriaux · Answer

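The replace one-liner is great, but it only works because XXX and bar have the same length.

A good and general definition would be:

    def findN(s, sub, N, replaceString="XXX"):
        # Mask the first N-1 matches, then undo the length shift caused by the masking.
        idx = s.replace(sub, replaceString, N - 1).find(sub)
        return idx - (len(replaceString) - len(sub)) * (N - 1) if idx >= 0 else -1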