Simple Python Challenge: Fastest Bitwise XOR on Data Buffers

Question

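Challenge:

Perform a bitwise XOR on two equal-sized buffers. The buffers are required to be the Python str type, since this is traditionally the type for data buffers in Python. Return the resulting value as a str. Do this as fast as possible.

The inputs are two 1 megabyte (2**20 byte) strings.

The challenge is to substantially beat my inefficient algorithm using Python or existing third-party Python modules (relaxed rules: or create your own module). Marginal increases are useless.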

    from os import urandom
    from numpy import frombuffer,bitwise_xor,byte

    def slow_xor(aa,bb):
        a=frombuffer(aa,dtype=byte)
        b=frombuffer(bb,dtype=byte)
        c=bitwise_xor(a,b)
        r=c.tostring()
        return r

    aa=urandom(2**20)
    bb=urandom(2**20)

    def test_it():
        for x in xrange(1000):
            slow_xor(aa,bb)
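When experimenting with faster variants, it helps to have a trivially correct reference to compare against. The following is a minimal sketch, assuming Python 3 (where the question's str buffers correspond to bytes); the name reference_xor is hypothetical and not part of the original question.

```python
import os


def reference_xor(a: bytes, b: bytes) -> bytes:
    """Slow but obviously correct XOR of two equal-sized buffers.

    Hypothetical helper for checking faster implementations; it is
    not meant to be fast itself.
    """
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    # Iterating over bytes yields integers in Python 3.
    return bytes(x ^ y for x, y in zip(a, b))


if __name__ == "__main__":
    aa = os.urandom(2**10)
    bb = os.urandom(2**10)
    result = reference_xor(aa, bb)
    # XOR is its own inverse, so applying bb again recovers aa.
    print(reference_xor(result, bb) == aa)
```

Any candidate implementation can then be compared against reference_xor on random inputs before it is timed.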

Torsten Marek · Accepted Answer

First try

Using _scipy.weave_ and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower because the code has to be loaded from disk and cached; subsequent invocations are faster.

    import numpy
    import time
    from os import urandom
    from scipy import weave

    SIZE = 2**20

    def faster_slow_xor(aa,bb):
        b = numpy.fromstring(bb, dtype=numpy.uint64)
        numpy.bitwise_xor(numpy.frombuffer(aa,dtype=numpy.uint64), b, b)
        return b.tostring()

    code = """
    const __m128i* pa = (__m128i*)a;
    const __m128i* pend = (__m128i*)(a + arr_size);
    __m128i* pb = (__m128i*)b;
    __m128i xmm1, xmm2;
    while (pa < pend) {
      xmm1 = _mm_loadu_si128(pa); // must use unaligned access
      xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
      _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
      ++pa;
      ++pb;
    }
    """

    def inline_xor(aa, bb):
        a = numpy.frombuffer(aa, dtype=numpy.uint64)
        b = numpy.fromstring(bb, dtype=numpy.uint64)
        arr_size = a.shape[0]
        weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])
        return b.tostring()

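Second try

Taking the comments into account, I revisited the code to find out whether the copying could be avoided. It turns out I had misread the documentation of the string object, so here is my second try: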

    support = """
    #define ALIGNMENT 16
    static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
        const char* end = in1 + n;
        while (in1 < end) {
           *out = *in1 ^ *in2;
           ++in1;
           ++in2;
           ++out;
        }
    }
    """

    code2 = """
    PyObject* res = PyString_FromStringAndSize(NULL, real_size);

    const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
    const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

    memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

    const __m128i* pa = (__m128i*)((char*)a + head);
    const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
    const __m128i* pb = (__m128i*)((char*)b + head);
    __m128i xmm1, xmm2;
    __m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
    while (pa < pend) {
        xmm1 = _mm_loadu_si128(pa);
        xmm2 = _mm_loadu_si128(pb);
        _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
        ++pa;
        ++pb;
        ++pc;
    }
    memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
    return_val = res;
    Py_DECREF(res);
    """

    def inline_xor_nocopy(aa, bb):
        real_size = len(aa)
        a = numpy.frombuffer(aa, dtype=numpy.uint64)
        b = numpy.frombuffer(bb, dtype=numpy.uint64)
        return weave.inline(code2, ["a", "b", "real_size"],
                            headers = ['"emmintrin.h"'],
                            support_code = support)

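The difference is that the string is allocated inside the C code. It cannot be guaranteed to sit on a 16-byte boundary, as the aligned SSE2 instructions require, so the unaligned memory regions at the beginning and the end are processed with byte-wise access.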
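The head/body/tail split can be illustrated with plain integer arithmetic. This is only a sketch under assumptions: split_for_alignment is a hypothetical helper, and it derives the tail from the remaining length, which is a generalisation rather than the exact arithmetic of the C snippet (the two agree when the buffer size is a multiple of 16, as in the 2**20-byte case).

```python
ALIGNMENT = 16  # bytes; mirrors the ALIGNMENT macro in the C support code


def split_for_alignment(address, size, alignment=ALIGNMENT):
    """Return (head, body, tail) byte counts for a buffer.

    head: bytes before the first aligned address (handled byte by byte)
    body: bytes that can be processed in aligned `alignment`-sized blocks
    tail: leftover bytes after the last full block (handled byte by byte)
    """
    head = (alignment - address % alignment) % alignment
    head = min(head, size)           # tiny buffers may end before alignment
    tail = (size - head) % alignment
    body = size - head - tail
    return head, body, tail


if __name__ == "__main__":
    # A buffer that starts 8 bytes past a 16-byte boundary.
    print(split_for_alignment(0x1008, 2**20))
```

For a 1 MiB buffer starting 8 bytes past a boundary, only 16 of the 1048576 bytes fall outside the aligned body, so the byte-wise loops cost next to nothing.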

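The input data is handed in as numpy arrays anyway, because weave insists on copying Python str objects to _std::string_s. frombuffer does not copy, so this is fine, but the memory is not aligned at 16 bytes, so we need to use `_mm_loadu_si128` instead of the faster `_mm_load_si128`.

Instead of using `_mm_store_si128`, we use `_mm_stream_si128`, which makes sure that writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.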

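Timings

As for the timings, the _slow_xor_ entry in the first edit referred to my improved version (inline bitwise xor, _uint64_); I have removed that confusion. _slow_xor_ now refers to the code from the original question. All timings are for 1000 runs.

_slow_xor_: 1.85s (1x)
_faster_slow_xor_: 1.25s (1.48x)
_inline_xor_: 0.95s (1.95x)
_inline_xor_nocopy_: 0.32s (5.78x)

The code was compiled using gcc 4.4.3, and I have verified that the compiler actually uses the SSE instructions.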
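Timings of this kind can be collected with the standard timeit module. The sketch below is an assumption about how one might measure them, not the harness used for the numbers above; time_runs and speedup are hypothetical names.

```python
import timeit


def time_runs(func, a, b, runs=1000):
    """Total wall-clock seconds for `runs` calls of func(a, b)."""
    return timeit.timeit(lambda: func(a, b), number=runs)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Reproduce the ratio column from the reported totals.
    for name, seconds in [("faster_slow_xor", 1.25),
                          ("inline_xor", 0.95),
                          ("inline_xor_nocopy", 0.32)]:
        print("%s: %.2fx" % (name, speedup(1.85, seconds)))
```

The ratios are simply the baseline total divided by each candidate's total, so they are only comparable when every function is timed over the same number of runs.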

jfs · Answer

Performance comparison: numpy vs. Cython vs. C vs. Fortran vs. Boost.Python (pyublas)

    | function               | time, usec | ratio | type         |
    |------------------------+------------+-------+--------------|
    | slow_xor               |       2020 |   1.0 | numpy        |
    | xorf_int16             |       1570 |   1.3 | fortran      |
    | xorf_int32             |       1530 |   1.3 | fortran      |
    | xorf_int64             |       1420 |   1.4 | fortran      |
    | faster_slow_xor        |       1360 |   1.5 | numpy        |
    | inline_xor             |       1280 |   1.6 | C            |
    | cython_xor             |       1290 |   1.6 | cython       |
    | xorcpp_inplace (int32) |        440 |   4.6 | pyublas      |
    | cython_xor_vectorised  |        325 |   6.2 | cython       |
    | inline_xor_nocopy      |        172 |  11.7 | C            |
    | xorcpp                 |        144 |  14.0 | boost.python |
    | xorcpp_inplace         |        122 |  16.6 | boost.python |
    #+TBLFM: $3=@2$2/$2;%.1f

To reproduce the results, download http://gist.github.com/353005 and type make (to install the dependencies, type: _sudo apt-get install build-essential python-numpy python-scipy cython gfortran_; the dependencies for _Boost.Python_ and pyublas are not included because they require manual intervention to work).

Where:

slow_xor() is from the OP's question
faster_slow_xor(), inline_xor(), inline_xor_nocopy() are from @Torsten Marek's answer
cython_xor() and cython_xor_vectorised() are from @gnibbler's answer

And xor_$type_sig() is:

    ! xorf.f90.template
    subroutine xor_$type_sig(a, b, n, out)
      implicit none
      integer, intent(in)             :: n
      $type, intent(in), dimension(n) :: a
      $type, intent(in), dimension(n) :: b
      $type, intent(out), dimension(n) :: out

      integer i
      forall(i=1:n) out(i) = ieor(a(i), b(i))

    end subroutine xor_$type_sig

It is used from Python as follows:

    import xorf # extension module generated from xorf.f90.template
    import numpy as np

    def xor_strings(a, b, type_sig='int64'):
        assert len(a) == len(b)
        a = np.frombuffer(a, dtype=np.dtype(type_sig))
        b = np.frombuffer(b, dtype=np.dtype(type_sig))
        return getattr(xorf, 'xor_'+type_sig)(a, b).tostring()

`xorcpp_inplace()` (Boost.Python, pyublas):

xor.cpp:

    #include <inttypes.h>
    #include <algorithm>
    #include <boost/lambda/lambda.hpp>
    #include <boost/python.hpp>
    #include <pyublas/numpy.hpp>

    namespace {
      namespace py = boost::python;

      template<class InputIterator, class InputIterator2, class OutputIterator>
      void
      xor_(InputIterator first, InputIterator last,
           InputIterator2 first2, OutputIterator result) {
        // `result` might be `first` but not any of the other input iterators
        namespace ll = boost::lambda;
        (void)std::transform(first, last, first2, result, ll::_1 ^ ll::_2);
      }

      template<class T>
      py::str
      xorcpp_str_inplace(const py::str& a, py::str& b) {
        const size_t alignment = std::max(sizeof(T), 16ul);
        const size_t n = py::len(b);
        const char* ai = py::extract<const char*>(a);
        char* bi = py::extract<char*>(b);
        char* end = bi + n;

        if (n < 2*alignment)
          xor_(bi, end, ai, bi);
        else {
          assert(n >= 2*alignment);

          // applying Marek's algorithm to align
          const ptrdiff_t head = (alignment - ((size_t)bi % alignment))% alignment;
          const ptrdiff_t tail = (size_t) end % alignment;
          xor_(bi, bi + head, ai, bi);
          xor_((const T*)(bi + head), (const T*)(end - tail),
               (const T*)(ai + head),
               (T*)(bi + head));
          if (tail > 0) xor_(end - tail, end, ai + (n - tail), end - tail);
        }
        return b;
      }

      template<class Int>
      pyublas::numpy_vector<Int>
      xorcpp_pyublas_inplace(pyublas::numpy_vector<Int> a,
                             pyublas::numpy_vector<Int> b) {
        xor_(b.begin(), b.end(), a.begin(), b.begin());
        return b;
      }
    }

    BOOST_PYTHON_MODULE(xorcpp)
    {
      py::def("xorcpp_inplace", xorcpp_str_inplace<int64_t>);     // for strings
      py::def("xorcpp_inplace", xorcpp_pyublas_inplace<int32_t>); // for numpy
    }

It is used from Python as follows:

    import os
    import xorcpp

    a = os.urandom(2**20)
    b = os.urandom(2**20)
    c = xorcpp.xorcpp_inplace(a, b) # it calls xorcpp_str_inplace()

John La Rooy · Answer

Here are my results for Cython:

    slow_xor   0.456888198853
    faster_xor 0.400228977203
    cython_xor 0.232881069183
    cython_xor_vectorised 0.171468019485

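Vectorising in Cython shaves about 25% off the for loop on my computer. However, more than half the time is spent building the Python string (the return statement). I don't think the extra copy can be avoided (legally), as the array may contain null bytes.

The illegal way would be to pass in a Python string and mutate it in place, which would double the speed of the function.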

xor.py

    from time import time
    from os import urandom
    from numpy import frombuffer,bitwise_xor,byte,uint64
    import pyximport; pyximport.install()
    import xor_

    def slow_xor(aa,bb):
        a=frombuffer(aa,dtype=byte)
        b=frombuffer(bb,dtype=byte)
        c=bitwise_xor(a,b)
        r=c.tostring()
        return r

    def faster_xor(aa,bb):
        a=frombuffer(aa,dtype=uint64)
        b=frombuffer(bb,dtype=uint64)
        c=bitwise_xor(a,b)
        r=c.tostring()
        return r

    aa=urandom(2**20)
    bb=urandom(2**20)

    def test_it():
        t=time()
        for x in xrange(100):
            slow_xor(aa,bb)
        print "slow_xor ",time()-t
        t=time()
        for x in xrange(100):
            faster_xor(aa,bb)
        print "faster_xor",time()-t
        t=time()
        for x in xrange(100):
            xor_.cython_xor(aa,bb)
        print "cython_xor",time()-t
        t=time()
        for x in xrange(100):
            xor_.cython_xor_vectorised(aa,bb)
        print "cython_xor_vectorised",time()-t

    if __name__=="__main__":
        test_it()

xor_.pyx

    cdef char c[1048576]

    def cython_xor(char *a,char *b):
        cdef int i
        for i in range(1048576):
            c[i]=a[i]^b[i]
        return c[:1048576]

    def cython_xor_vectorised(char *a,char *b):
        cdef int i
        # 1048576 bytes / 8 bytes per unsigned long long = 131072 words;
        # a larger bound would read and write past the end of the buffers
        for i in range(131072):
            (<unsigned long long *>c)[i]=(<unsigned long long *>a)[i]^(<unsigned long long *>b)[i]
        return c[:1048576]

Alex Martelli · Answer

An easy speedup is to use larger "chunks":

    def faster_xor(aa,bb):
        a=frombuffer(aa,dtype=uint64)
        b=frombuffer(bb,dtype=uint64)
        c=bitwise_xor(a,b)
        r=c.tostring()
        return r

with uint64 also imported from numpy, of course. I timeit this at 4 milliseconds, versus 6 milliseconds for the byte version.
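The chunking idea itself does not depend on numpy. As a rough illustration only, assuming Python 3, the standard library can reinterpret the same bytes as 8-byte words with memoryview.cast; xor_words is a hypothetical name, and the explicit Python loop makes this far slower than the numpy version above, so it shows the idea rather than the speed.

```python
import struct

WORD = struct.calcsize("Q")  # size of a native unsigned long long, usually 8


def xor_words(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized buffers one machine word at a time."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    if len(a) % WORD:
        raise ValueError("length must be a multiple of %d" % WORD)
    out = bytearray(len(a))
    # cast() reinterprets the existing memory; no bytes are copied here.
    wa = memoryview(a).cast("Q")
    wb = memoryview(b).cast("Q")
    wo = memoryview(out).cast("Q")
    for i in range(len(wa)):
        wo[i] = wa[i] ^ wb[i]  # one XOR covers WORD bytes
    return bytes(out)


if __name__ == "__main__":
    print(xor_words(b"\x01" * 16, b"\x03" * 16) == b"\x02" * 16)
```

With 8-byte words, a 2**20-byte buffer needs 131072 XOR operations instead of 1048576, which is why the uint64 dtype helps.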

Joshua · Answer

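Your problem isn't the speed of NumPy's xOr method, but rather all of the buffering and data type conversions. Personally, I suspect that the point of this post may really have been to brag about Python, because what you are doing here is processing three gigabytes of data in timeframes on par with non-interpreted languages, which are inherently faster.

The code below shows that even on my humble computer Python can xOr "aa" (1MB) and "bb" (1MB) into "c" (1MB) one thousand times (3GB in total) in under two seconds. Seriously, how much more improvement do you want? Especially from an interpreted language! 80% of the time was spent calling "frombuffer" and "tostring". The actual xOr-ing is completed in the other 20% of the time. At 3GB in 2 seconds, you would be hard-pressed to improve on that substantially even by just using memcpy in C.

In case this was a real question, and not just covert bragging about Python, the answer is to code so as to minimize the number, amount and frequency of your type conversions such as "frombuffer" and "tostring". The actual xOr'ing is lightning fast already.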

    from os import urandom
    from numpy import frombuffer,bitwise_xor,byte,uint64

    def slow_xor(aa,bb):
        a=frombuffer(aa,dtype=byte)
        b=frombuffer(bb,dtype=byte)
        c=bitwise_xor(a,b)
        r=c.tostring()
        return r

    bb=urandom(2**20)
    aa=urandom(2**20)

    def test_it():
        for x in xrange(1000):
            slow_xor(aa,bb)

    def test_it2():
        a=frombuffer(aa,dtype=uint64)
        b=frombuffer(bb,dtype=uint64)
        for x in xrange(1000):
            c=bitwise_xor(a,b);
        r=c.tostring()

    test_it()
    print 'Slow Complete.'
    #6 seconds
    test_it2()
    print 'Fast Complete.'
    #under 2 seconds

Anyway, the "test_it2" above accomplishes exactly the same amount of xOr-ing as "test_it" does, but in 1/5 the time. A 5x speed improvement should qualify as "substantial", no?

Steve314 · Answer

The fastest bitwise XOR is "^". I can type that much quicker than "bitwise_xor" ;-)

Dima Tisnek · Answer

Python 3 has int.from_bytes and int.to_bytes, thus:

    x = int.from_bytes(b"a" * (1024*1024), "big")
    y = int.from_bytes(b"b" * (1024*1024), "big")
    (x ^ y).to_bytes(1024*1024, "big")

It's faster than IO, and it's kind of hard to test just how fast it is; it looks like 0.018 .. 0.020s on my machine. Strangely, "little"-endian conversion is a little faster.
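Wrapped into a function, the same idea might look like the sketch below. The name xor_bytes and the choice of "little" as the default byte order are assumptions made for illustration; the byte order does not change the result as long as both conversions use the same one.

```python
def xor_bytes(a: bytes, b: bytes, byteorder: str = "little") -> bytes:
    """XOR two equal-sized buffers via arbitrary-precision integers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    x = int.from_bytes(a, byteorder)
    y = int.from_bytes(b, byteorder)
    # Passing the original length keeps leading zero bytes in the result.
    return (x ^ y).to_bytes(len(a), byteorder)


if __name__ == "__main__":
    print(xor_bytes(b"\x0f\xf0", b"\xff\xff"))
```

Passing len(a) to to_bytes matters: without a fixed length, a result whose high-order bytes are zero would come back shorter than the inputs.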

CPython 2.x has the underlying function _PyLong_FromByteArray; it is not exported, but it is accessible via ctypes:

    In [1]: import ctypes
    In [2]: p = ctypes.CDLL(None)
    In [3]: p["_PyLong_FromByteArray"]
    Out[3]: <_FuncPtr object at 0x2cc6e20>

The Python 2 details are left as an exercise for the reader.

David M. Cooke · Answer

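How badly do you need the answer as a string? Note that the c.tostring() method has to copy the data in c to a new string, as Python strings are immutable (and c is mutable). Python 2.6 and 3.1 have a bytearray type, which acts like str (bytes in Python 3.x) except for being mutable.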
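The difference in mutability is easy to see in isolation. The snippet below is a small illustration, assuming Python 3, of why a bytes result forces a copy while a bytearray can be written into directly; the variable names are arbitrary.

```python
data = bytes(4)            # immutable, like str in Python 2
try:
    data[0] = 0xFF         # item assignment is rejected
    immutable = False
except TypeError:
    immutable = True

buf = bytearray(4)         # mutable buffer of the same size
buf[0] = 0xFF              # writing in place works

view = memoryview(buf)     # a view shares the bytearray's memory
view[1] = 0x0F             # so writes through it show up in buf

print(immutable, bytes(buf))
```

Because a view over a bytearray is writable, a result can be produced directly in a preallocated buffer instead of being copied into a new immutable object.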

Another optimization is to use the out parameter of _bitwise_xor_ to specify where to store the result.
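A minimal sketch of the out-parameter approach, assuming Python 3 with a current numpy and buffer lengths that are a multiple of 8 bytes; xor_into_buffer is a hypothetical name, not a function from the answer.

```python
import numpy as np


def xor_into_buffer(aa: bytes, bb: bytes, out: bytearray) -> None:
    """XOR aa and bb, writing the result into the preallocated `out`."""
    a = np.frombuffer(aa, dtype=np.uint64)
    b = np.frombuffer(bb, dtype=np.uint64)
    # frombuffer on a bytearray gives a writable array sharing its memory.
    c = np.frombuffer(out, dtype=np.uint64)
    np.bitwise_xor(a, b, out=c)  # no new result array is allocated


if __name__ == "__main__":
    result = bytearray(16)
    xor_into_buffer(b"\x0f" * 16, b"\xff" * 16, result)
    print(bytes(result) == b"\xf0" * 16)
```

The same bytearray can be reused across calls, so the per-call cost is the XOR itself plus three cheap frombuffer views.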

On my machine I get

    slow_xor (int8): 5.293521 (100.0%)
    outparam_xor (int8): 4.378633 (82.7%)
    slow_xor (uint64): 2.192234 (41.4%)
    outparam_xor (uint64): 1.087392 (20.5%)

with the code at the end of this post. Notice in particular that the method using a preallocated buffer is twice as fast as creating a new object (when operating on 8-byte (_uint64_) chunks). This is consistent with the slower method doing two operations per chunk (xor + copy) to the faster method's one (just xor).

Also, FWIW, _a ^ b_ is equivalent to bitwise_xor(a,b), and _a ^= b_ is equivalent to bitwise_xor(a, b, a).

So, a 5x speedup without writing any external modules :)

    from time import time
    from os import urandom
    from numpy import frombuffer,bitwise_xor,byte,uint64

    def slow_xor(aa, bb, ignore, dtype=byte):
        a=frombuffer(aa, dtype=dtype)
        b=frombuffer(bb, dtype=dtype)
        c=bitwise_xor(a, b)
        r=c.tostring()
        return r

    def outparam_xor(aa, bb, out, dtype=byte):
        a=frombuffer(aa, dtype=dtype)
        b=frombuffer(bb, dtype=dtype)
        c=frombuffer(out, dtype=dtype)
        assert c.flags.writeable
        return bitwise_xor(a, b, c)

    aa=urandom(2**20)
    bb=urandom(2**20)
    cc=bytearray(2**20)

    def time_routine(routine, dtype, base=None, ntimes = 1000):
        t = time()
        for x in xrange(ntimes):
            routine(aa, bb, cc, dtype=dtype)
        et = time() - t
        if base is None:
            base = et
        print "%s (%s): %f (%.1f%%)" % (routine.__name__, dtype.__name__, et,
                                        (et/base)*100)
        return et

    def test_it(ntimes = 1000):
        base = time_routine(slow_xor, byte, ntimes=ntimes)
        time_routine(outparam_xor, byte, base, ntimes=ntimes)
        time_routine(slow_xor, uint64, base, ntimes=ntimes)
        time_routine(outparam_xor, uint64, base, ntimes=ntimes)

myurko · Answer

If you want to do fast operations on array data types, you should try Cython (cython.org). If you give it the right declarations, it should be able to compile down to pure C code.

Nikwin · Answer

You could try the symmetric difference of Sage's bitsets.

http://www.sagemath.org/doc/reference/sage/misc/bitset.html

Juergen · Answer

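The fastest way (speed-wise) will be to do what Max. S recommended: implement it in C.

The supporting code for this task should be rather simple to write. It is just one function in a module that creates a new string and does the xor. That's all. Once you have implemented one such module, it is simple to use the code as a template. Or you can even take a simple Python extension module written by somebody else and throw out everything that is not needed for your task.

The really complicated part is getting the RefCounter stuff right. But once you understand how it works, it is manageable, also because the task at hand is really simple (allocate some memory and return it; the params are not to be touched (Ref-wise)).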
