C++11 regex slower than Python

Question

Hi, I would like to understand why the following code, which splits a string using a regex

    #include <regex>
    #include <vector>
    #include <string>

    std::vector<std::string> split(const std::string &s) {
        static const std::regex rsplit(" +");
        auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
        auto rend = std::sregex_token_iterator();
        auto res = std::vector<std::string>(rit, rend);
        return res;
    }

    int main() {
        for (auto i = 0; i < 10000; ++i)
            split("a b c");  // split takes a single argument
        return 0;
    }

is slower than the following Python code

    import re

    for i in range(10000):
        re.split(' +', 'a b c')

Here are the timings:

    > python test.py  0.05s user 0.01s system 94% cpu 0.070 total
    ./test  0.26s user 0.00s system 99% cpu 0.296 total

I'm using clang++ on OS X.

Compiling with -O3 brings the C++ time down to 0.09s user 0.00s system 99% cpu 0.109 total

pepper_chico · Accepted Answer

Notice

See also this answer: https://stackoverflow.com/a/21708215, which was the basis for EDIT 2 at the bottom here.

To get better timing measurements, I increased the loop count to 1000000.
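The figures in this answer come from the shell's `time`, which also counts process start-up and regex compilation. As a rough cross-check, the loop itself can be timed in-process with `std::chrono`. This is only a minimal sketch: `split_copy` and `time_split_ms` are hypothetical names introduced for illustration, not part of the original code.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical baseline: returns a fresh vector on every call.
std::vector<std::string> split_copy(const std::string &s, const std::regex &r) {
    return std::vector<std::string>(
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator());
}

// Runs split_copy `iterations` times and returns the elapsed wall-clock
// time in milliseconds. The token count is accumulated so the compiler
// cannot discard the calls as dead code.
double time_split_ms(std::size_t iterations, std::size_t &total_tokens) {
    const std::regex r(" +");
    const std::string input = "a b c";
    total_tokens = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        total_tokens += split_copy(input, r).size();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

`std::chrono::steady_clock` is used because it is monotonic, so the measured interval is not affected by system clock adjustments.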

This is my Python timing:

    real    0m2.038s
    user    0m2.009s
    sys     0m0.024s

Here is an equivalent of your code, just a bit prettier:

    #include <regex>
    #include <vector>
    #include <string>

    std::vector<std::string> split(const std::string &s, const std::regex &r)
    {
        return {
            std::sregex_token_iterator(s.begin(), s.end(), r, -1),
            std::sregex_token_iterator()
        };
    }

    int main()
    {
        const std::regex r(" +");
        for (auto i = 0; i < 1000000; ++i)
            split("a b c", r);
        return 0;
    }

Timing:

    real    0m5.786s
    user    0m5.779s
    sys     0m0.005s

This is an optimization to avoid the construction/allocation of vector and string objects:

    #include <regex>
    #include <vector>
    #include <string>

    void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
    {
        auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
        auto rend = std::sregex_token_iterator();
        v.clear();
        while (rit != rend)
        {
            v.push_back(*rit);
            ++rit;
        }
    }

    int main()
    {
        const std::regex r(" +");
        std::vector<std::string> v;
        for (auto i = 0; i < 1000000; ++i)
            split("a b c", r, v);
        return 0;
    }

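Timing:

    real    0m3.034s
    user    0m3.029s
    sys     0m0.004s

This is close to a 100% performance improvement (5.786s down to 3.034s, about 1.9x faster).

The vector is created before the loop and can grow its memory in the first iteration. After that, `clear()` does not deallocate memory: the vector keeps its storage and constructs the strings in place.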
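The claim that `clear()` keeps the vector's storage can be checked directly. The following is a minimal sketch; `capacity_survives_clear` is a hypothetical helper, and the equality check reflects the behaviour of mainstream standard library implementations.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fills a vector, clears it, and compares the capacity before and after.
// Hypothetical helper for illustration only.
bool capacity_survives_clear() {
    std::vector<std::string> v;
    v.push_back("a");
    v.push_back("b");
    v.push_back("c");
    const std::size_t before = v.capacity();
    v.clear();  // destroys the three strings; size becomes 0
    const std::size_t after = v.capacity();
    // The element storage is still owned by the vector, so the next
    // push_back calls can reuse it without reallocating.
    return v.empty() && after == before;
}
```

Note that `clear()` still destroys the `std::string` elements themselves, so only the vector's own buffer is reused here; any heap memory owned by the individual strings is released.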

Another performance gain would come from avoiding the construction/destruction of `std::string` completely, and hence the allocation/deallocation of its objects.

This is a tentative step in that direction:

    #include <cstring>  // for std::strlen
    #include <regex>
    #include <vector>
    #include <string>

    void split(const char *s, const std::regex &r, std::vector<std::string> &v)
    {
        auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear();
        while (rit != rend)
        {
            v.push_back(*rit);
            ++rit;
        }
    }

Timing:

    real    0m2.509s
    user    0m2.503s
    sys     0m0.004s

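An ultimate improvement would be to return a `std::vector` of `const char *`, where each char pointer points to a substring inside the original `s` C string itself. The problem is that you can't do that, because each substring would not be null-terminated (for this, see the use of the C++1y `string_ref` in a later sample).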

This last improvement can also be achieved with this:

    #include <regex>
    #include <vector>
    #include <string>

    void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
    {
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear();
        while (rit != rend)
        {
            v.push_back(*rit);
            ++rit;
        }
    }

    int main()
    {
        const std::regex r(" +");
        std::vector<std::string> v;
        for (auto i = 0; i < 1000000; ++i)
            split("a b c", r, v); // the constant string("a b c") should be optimized
                                  // by the compiler. I got the same performance as
                                  // if it was an object outside the loop
        return 0;
    }

I built the samples with clang 3.3 (from trunk) using -O3. Other regex libraries may perform better, but in any case, allocations/deallocations are frequently a performance hit.

Boost.Regex

This is the `boost::regex` timing for the C string argument sample:

    real    0m1.284s
    user    0m1.278s
    sys     0m0.005s

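It is the same code: the `boost::regex` and `std::regex` interfaces used in this sample are identical, so only the namespace and the include needed to change.

Hopefully the C++ stdlib regex implementations will improve over time.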

EDIT

For the sake of completeness, I tried this (the "ultimate improvement" suggestion mentioned above), and it did not improve on the performance of the equivalent `std::vector<std::string> &v` version at all:

    #include <regex>
    #include <vector>
    #include <string>

    template<typename Iterator> class intrusive_substring
    {
    private:
        Iterator begin_, end_;

    public:
        intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}

        Iterator begin() {return begin_;}
        Iterator end() {return end_;}
    };

    using intrusive_char_substring = intrusive_substring<const char *>;

    void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
    {
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear(); // This can potentially be optimized away by the compiler because
                   // the intrusive_char_substring destructor does nothing, so
                   // resetting the internal size is the only thing to be done.
                   // Formerly allocated memory is maintained.
        while (rit != rend)
        {
            v.emplace_back(rit->first, rit->second);
            ++rit;
        }
    }

    int main()
    {
        const std::regex r(" +");
        std::vector<intrusive_char_substring> v;
        for (auto i = 0; i < 1000000; ++i)
            split("a b c", r, v);
        return 0;
    }

This has to do with the `array_ref` and `string_ref` proposal. Here is sample code using it:

    #include <regex>
    #include <vector>
    #include <string>
    #include <string_ref>  // header from the string_ref proposal (later renamed string_view)

    void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
    {
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
        auto rend = std::cregex_token_iterator();
        v.clear();
        while (rit != rend)
        {
            v.emplace_back(rit->first, rit->length());
            ++rit;
        }
    }

    int main()
    {
        const std::regex r(" +");
        std::vector<std::string_ref> v;
        for (auto i = 0; i < 1000000; ++i)
            split("a b c", r, v);
        return 0;
    }

For a `split` that returns a vector, it will also be cheaper to return a vector of `string_ref` rather than `string` copies.
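For readers on a newer toolchain, the same idea can be sketched with `std::string_view`, which is the name `string_ref` was eventually standardized under. This sketch assumes a C++17 compiler; `split_views` is a hypothetical name, and it is not the code that was benchmarked in this answer.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Assumes C++17. Each std::string_view stored in `out` points into the
// buffer of `s`, so `s` must outlive the contents of `out`.
void split_views(const std::string &s, const std::regex &r,
                 std::vector<std::string_view> &out) {
    out.clear();
    std::cregex_token_iterator it(s.data(), s.data() + s.size(), r, -1);
    const std::cregex_token_iterator end;
    for (; it != end; ++it) {
        // A (pointer, length) pair replaces the missing null terminator.
        out.emplace_back(it->first, static_cast<std::size_t>(it->length()));
    }
}
```

When the views are actually read after the call, the input should be a named `std::string` rather than a string literal passed directly: a literal is converted to a temporary `std::string` that is destroyed when the call returns, which would leave the views dangling.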

EDIT 2

This new solution is able to deliver its output by return value. I used Marshall Clow's `string_view` (`string_ref` got renamed) libc++ implementation, found at https://github.com/mclow/string_view.

    #include <string>
    #include <string_view>
    #include <boost/regex.hpp>
    #include <boost/range/iterator_range.hpp>
    #include <boost/iterator/transform_iterator.hpp>

    using namespace std;
    using namespace std::experimental;
    using namespace boost;

    string_view stringfier(const cregex_token_iterator::value_type &match) {
        return {match.first, static_cast<size_t>(match.length())};
    }

    using string_view_iterator =
        transform_iterator<decltype(&stringfier), cregex_token_iterator>;

    iterator_range<string_view_iterator> split(string_view s, const regex &r) {
        return {
            string_view_iterator(
                cregex_token_iterator(s.begin(), s.end(), r, -1),
                stringfier
            ),
            string_view_iterator()
        };
    }

    int main() {
        const regex r(" +");
        for (size_t i = 0; i < 1000000; ++i) {
            split("a b c", r);
        }
    }

Timing:

    real    0m0.385s
    user    0m0.385s
    sys     0m0.000s

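Note how much faster this is compared with the previous results. Of course, it is not filling a `vector` inside the loop (and probably not matching anything in advance either), but you get a range anyway, which you can iterate over with a range-based `for`, or even use to fill a `vector`.

Since iterating over the `iterator_range` creates `string_view`s over the original `string` (or a null-terminated string), this is very lightweight and never generates unnecessary string allocations.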
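The lazy, allocation-free iteration described here can also be approximated without Boost. The sketch below assumes C++17 and uses only the standard library; `for_each_token` is a hypothetical helper that hands each token to a callback as a `std::string_view` instead of returning a range.

```cpp
#include <cstddef>
#include <regex>
#include <string_view>

// Assumes C++17. Calls `callback` once per token, in order, and returns
// the number of tokens. No std::string or std::vector is created; each
// view refers to the characters of `s` directly.
template <typename Callback>
std::size_t for_each_token(std::string_view s, const std::regex &r,
                           Callback callback) {
    std::size_t count = 0;
    std::cregex_token_iterator it(s.data(), s.data() + s.size(), r, -1);
    const std::cregex_token_iterator end;
    for (; it != end; ++it, ++count) {
        callback(std::string_view(it->first,
                                  static_cast<std::size_t>(it->length())));
    }
    return count;
}
```

A callback keeps the sketch short. A range type such as the `iterator_range` above is the more composable design, since it can be passed to algorithms and range-based `for`.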

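Just for comparison, to use this `split` implementation while actually filling a `vector`, we could do this:

    // Also needs <vector>, <iterator> and <boost/range/algorithm/copy.hpp>
    int main() {
        const regex r(" +");
        vector<string_view> v;
        v.reserve(10);
        for (size_t i = 0; i < 1000000; ++i) {
            copy(split("a b c", r), back_inserter(v));
            v.clear();
        }
    }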

This uses the Boost range copy algorithm to fill the vector in each iteration. The timing is:

    real    0m1.002s
    user    0m0.997s
    sys     0m0.004s

As can be seen, there is not much difference compared with the optimized `string_view` output-parameter version.

Note also that there is a proposal for a `std::split` that would work like this.

Matthieu M. · Answer

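In general, when optimizing, you want to avoid two things:

burning CPU cycles on unnecessary work
waiting idly for something to happen (memory read, disk read, network read, ...)

The two can be at odds, since it is sometimes faster to compute something than to cache all of it in memory, so it is a balancing game.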

Let's analyze your code:

    std::vector<std::string> split(const std::string &s){
        static const std::regex rsplit(" +"); // only computed once

        // search for first occurrence of rsplit
        auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);

        auto rend = std::sregex_token_iterator();

        // simultaneously:
        // - parses "s" from the second to the past the last occurrence
        // - allocates one `std::string` for each match... at least! (there may be a copy)
        // - allocates space in the `std::vector`, possibly multiple times
        auto res = std::vector<std::string>(rit, rend);

        return res;
    }

Can we do better? Well, if we could reuse existing storage instead of repeatedly allocating and deallocating memory, we should see a significant improvement [1]:

    // Overwrites 'result' with the matches, returns the number of matches
    // (note: 'result' is never shrunk, but may be grown as necessary)
    size_t split(std::string const& s, std::vector<std::string>& result){
        static const std::regex rsplit(" +"); // only computed once

        // cregex_token_iterator works on const char*, hence data() rather than begin()
        auto rit = std::cregex_token_iterator(s.data(), s.data() + s.size(), rsplit, -1);
        auto rend = std::cregex_token_iterator();

        size_t pos = 0;

        // As long as possible, reuse the existing strings (in place)
        for (size_t max = result.size();
             rit != rend && pos != max;
             ++rit, ++pos)
        {
            result[pos].assign(rit->first, rit->second);
        }

        // When more matches than existing strings, extend capacity
        for (; rit != rend; ++rit, ++pos) {
            result.emplace_back(rit->first, rit->second);
        }

        return pos;
    } // split

For the test you perform, where the number of submatches is constant across iterations, this version is unlikely to be beaten: it only allocates memory on the first run (for both `rsplit` and `result`) and then keeps reusing the existing memory.
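Because `result` is never shrunk, a caller has to rely on the returned count rather than on `result.size()`. The sketch below restates the function under the hypothetical name `split_reuse` so that it compiles on its own; like the answer's version, it should be read as an illustration rather than a tuned implementation.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Overwrites existing strings in place, grows the vector only when there
// are more tokens than slots, and returns how many leading entries of
// `result` are valid for this call. Entries past that count are stale.
std::size_t split_reuse(const std::string &s, std::vector<std::string> &result) {
    static const std::regex rsplit(" +");  // compiled once
    std::cregex_token_iterator it(s.data(), s.data() + s.size(), rsplit, -1);
    const std::cregex_token_iterator end;

    std::size_t pos = 0;
    for (const std::size_t max = result.size(); it != end && pos != max; ++it, ++pos) {
        result[pos].assign(it->first, it->second);  // reuse the string's buffer
    }
    for (; it != end; ++it, ++pos) {
        result.emplace_back(it->first, it->second);  // grow only when needed
    }
    return pos;
}
```

Splitting "a b c" and then "x y" into the same vector returns 2 on the second call while the vector still holds three elements, the last one being the stale "c" from the first call.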

[1]: Disclaimer: I have only proved this code correct, I have not tested it (as Donald Knuth would say).

schorsch_76 · Answer

What about this version? It is not a regex, but it solves the split pretty fast...

    #include <vector>
    #include <string>
    #include <algorithm>

    size_t split2(const std::string& s, std::vector<std::string>& result)
    {
        size_t count = 0;
        result.clear();
        std::string::const_iterator p1 = s.cbegin();
        std::string::const_iterator p2 = p1;
        bool run = true;
        do
        {
            // the token ends at the next space (or at the end of the string)
            p2 = std::find(p1, s.cend(), ' ');
            result.push_back(std::string(p1, p2));
            ++count;

            if (p2 != s.cend())
            {
                // skip the run of spaces to reach the start of the next token
                p1 = std::find_if(p2, s.cend(), [](char c) -> bool
                {
                    return c != ' ';
                });
            }
            else run = false;
        } while (run);
        return count;
    }

    int main()
    {
        std::vector<std::string> v;
        std::string s = "a b c";
        for (auto i = 0; i < 100000; ++i)
            split2(s, v);
        return 0;
    }

    $ time splittest.exe

    real    0m0.132s
    user    0m0.000s
    sys     0m0.109s