pandas series / dataframesでnumpy関数がとても遅いのはなぜですか？

Question

別の質問から取得した小さなMWEを考えてみましょう。

DateTime Data 2017-11-21 18:54:31 1 2017-11-22 02:26:48 2 2017-11-22 10:19:44 3 2017-11-22 15:11:28 6 2017-11-22 23:21:58 7 2017-11-28 14:28:28 28 2017-11-28 14:36:40 0 2017-11-28 14:59:48 1

目標は、すべての値を上限1でクリップすることです。私の答えはnp.clip、これは正常に動作します。

np.clip(df.Data, a_min=None, a_max=1) array([1, 1, 1, 1, 1, 1, 0, 1])

または、

np.clip(df.Data.values, a_min=None, a_max=1) array([1, 1, 1, 1, 1, 1, 0, 1])

どちらも同じ答えを返します。私の質問は、これら2つの方法の相対的なパフォーマンスについてです。検討-

df = pd.concat([df]*1000).reset_index(drop=True) %timeit np.clip(df.Data, a_min=None, a_max=1) 1000 loops, best of 3: 270 µs per loop %timeit np.clip(df.Data.values, a_min=None, a_max=1) 10000 loops, best of 3: 23.4 µs per loop

後者でvaluesを呼び出すだけで、この2つの間に大きな違いがあるのはなぜですか？言い換えると...

なぜnumpy関数はとても遅いのですかpandasオブジェクト？

MSeifert · Accepted Answer

はい、_np.clip_は_pandas.Series_よりも_numpy.ndarray_の方がずっと遅いようです。それは正しいのですが、実際には（少なくとも無症状では）それほど悪くはありません。 8000個の要素は、ランタイムで一定の要因が主要な要因である体制にまだあります。これは質問にとって非常に重要な側面だと思うので、これを視覚化しています（別の回答から借りています）：

_# Setup import pandas as pd import numpy as np def on_series(s): return np.clip(s, a_min=None, a_max=1) def on_values_of_series(s): return np.clip(s.values, a_min=None, a_max=1) # Timing setup timings = {on_series: [], on_values_of_series: []} sizes = [2**i for i in range(1, 26, 2)] # Timing for size in sizes: func_input = pd.Series(np.random.randint(0, 30, size=size)) for func in timings: res = %timeit -o func(func_input) timings[func].append(res) %matplotlib notebook import matplotlib.pyplot as plt import numpy as np fig, (ax1, ax2) = plt.subplots(1, 2) for func in timings: ax1.plot(sizes, [time.best for time in timings[func]], label=str(func.__name__)) ax1.set_xscale('log') ax1.set_yscale('log') ax1.set_xlabel('size') ax1.set_ylabel('time [seconds]') ax1.grid(which='both') ax1.legend() baseline = on_values_of_series # choose one function as baseline for func in timings: ax2.plot(sizes, [time.best / ref.best for time, ref in Zip(timings[func], timings[baseline])], label=str(func.__name__)) ax2.set_yscale('log') ax2.set_xscale('log') ax2.set_xlabel('size') ax2.set_ylabel('time relative to {}'.format(baseline.__name__)) ax2.grid(which='both') ax2.legend() plt.tight_layout() _

これは重要な機能をより明確に示していると思うので、対数プロットです。たとえば、_np.clip_上の_numpy.ndarray_は高速ですが、その場合は定数係数がはるかに小さいことも示しています。大きな配列の違いはわずか3です！それはまだ大きな違いですが、小さな配列の違いよりもはるかに小さいです。

しかし、それはまだ時差が生じる質問への答えではありません。

実際の解決策は非常に簡単です。_np.clip_は、最初の引数のclipmethodに委任します。

_>>> np.clip?? Source: def clip(a, a_min, a_max, out=None): """ ... """ return _wrapfunc(a, 'clip', a_min, a_max, out=out) >>> np.core.fromnumeric._wrapfunc?? Source: def _wrapfunc(obj, method, *args, **kwds): try: return getattr(obj, method)(*args, **kwds) # ... except (AttributeError, TypeError): return _wrapit(obj, method, *args, **kwds) _

__wrapfunc_と_np.ndarray.clip_は異なるメソッドであるため、ここでは_pd.Series.clip_関数のgetattr行が重要な行です。はい、さまざまな方法：

_>>> np.ndarray.clip <method 'clip' of 'numpy.ndarray' objects> >>> pd.Series.clip <function pandas.core.generic.NDFrame.clip> _

残念ながら_np.ndarray.clip_はC関数であるため、プロファイルを作成するのは困難ですが、_pd.Series.clip_は通常のPython関数であるため、プロファイルを作成するのは簡単です。5000のシリーズを使用しましょうここに整数：

_s = pd.Series(np.random.randint(0, 100, 5000)) _

values上の_np.clip_について、次の行プロファイリングを取得します。

_%load_ext line_profiler %lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1) Timer unit: 4.10256e-07 s Total time: 2.25641e-05 s File: numpy\core\fromnumeric.py Function: clip at line 1673 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1673 def clip(a, a_min, a_max, out=None): 1674 """ ... 1726 """ 1727 1 55 55.0 100.0 return _wrapfunc(a, 'clip', a_min, a_max, out=out) Total time: 1.51795e-05 s File: numpy\core\fromnumeric.py Function: _wrapfunc at line 55 Line # Hits Time Per Hit % Time Line Contents ============================================================== 55 def _wrapfunc(obj, method, *args, **kwds): 56 1 2 2.0 5.4 try: 57 1 35 35.0 94.6 return getattr(obj, method)(*args, **kwds) 58 59 # An AttributeError occurs if the object does not have 60 # such a method in its class. 61 62 # A TypeError occurs if the object does have such a method 63 # in its class, but its signature is not identical to that 64 # of NumPy's. This situation has occurred in the case of 65 # a downstream library like 'pandas'. 66 except (AttributeError, TypeError): 67 return _wrapit(obj, method, *args, **kwds) _

しかし、Series上の_np.clip_については、まったく異なるプロファイリング結果が得られます。

_%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1) Timer unit: 4.10256e-07 s Total time: 0.000823794 s File: numpy\core\fromnumeric.py Function: clip at line 1673 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1673 def clip(a, a_min, a_max, out=None): 1674 """ ... 1726 """ 1727 1 2008 2008.0 100.0 return _wrapfunc(a, 'clip', a_min, a_max, out=out) Total time: 0.00081846 s File: numpy\core\fromnumeric.py Function: _wrapfunc at line 55 Line # Hits Time Per Hit % Time Line Contents ============================================================== 55 def _wrapfunc(obj, method, *args, **kwds): 56 1 2 2.0 0.1 try: 57 1 1993 1993.0 99.9 return getattr(obj, method)(*args, **kwds) 58 59 # An AttributeError occurs if the object does not have 60 # such a method in its class. 61 62 # A TypeError occurs if the object does have such a method 63 # in its class, but its signature is not identical to that 64 # of NumPy's. This situation has occurred in the case of 65 # a downstream library like 'pandas'. 66 except (AttributeError, TypeError): 67 return _wrapit(obj, method, *args, **kwds) Total time: 0.000804922 s File: pandas\core\generic.py Function: clip at line 4969 Line # Hits Time Per Hit % Time Line Contents ============================================================== 4969 def clip(self, lower=None, upper=None, axis=None, inplace=False, 4970 *args, **kwargs): 4971 """ ... 5021 """ 5022 1 12 12.0 0.6 if isinstance(self, ABCPanel): 5023 raise NotImplementedError("clip is not supported yet for panels") 5024 5025 1 10 10.0 0.5 inplace = validate_bool_kwarg(inplace, 'inplace') 5026 5027 1 69 69.0 3.5 axis = nv.validate_clip_with_axis(axis, args, kwargs) 5028 5029 # GH 17276 5030 # numpy doesn't like NaN as a clip value 5031 # so ignore 5032 1 158 158.0 8.1 if np.any(pd.isnull(lower)): 5033 1 3 3.0 0.2 lower = None 5034 1 26 26.0 1.3 if np.any(pd.isnull(upper)): 5035 upper = None 5036 5037 # GH 2747 (arguments were reversed) 5038 1 1 1.0 0.1 if lower is not None and upper is not None: 5039 if is_scalar(lower) and is_scalar(upper): 5040 lower, upper = min(lower, upper), max(lower, upper) 5041 5042 # fast-path for scalars 5043 1 1 1.0 0.1 if ((lower is None or (is_scalar(lower) and is_number(lower))) and 5044 1 28 28.0 1.4 (upper is None or (is_scalar(upper) and is_number(upper)))): 5045 1 1654 1654.0 84.3 return self._clip_with_scalar(lower, upper, inplace=inplace) 5046 5047 result = self 5048 if lower is not None: 5049 result = result.clip_lower(lower, axis, inplace=inplace) 5050 if upper is not None: 5051 if inplace: 5052 result = self 5053 result = result.clip_upper(upper, axis, inplace=inplace) 5054 5055 return result Total time: 0.000662153 s File: pandas\core\generic.py Function: _clip_with_scalar at line 4920 Line # Hits Time Per Hit % Time Line Contents ============================================================== 4920 def _clip_with_scalar(self, lower, upper, inplace=False): 4921 1 2 2.0 0.1 if ((lower is not None and np.any(isna(lower))) or 4922 1 25 25.0 1.5 (upper is not None and np.any(isna(upper)))): 4923 raise ValueError("Cannot use an NA value as a clip threshold") 4924 4925 1 22 22.0 1.4 result = self.values 4926 1 571 571.0 35.4 mask = isna(result) 4927 4928 1 95 95.0 5.9 with np.errstate(all='ignore'): 4929 1 1 1.0 0.1 if upper is not None: 4930 1 141 141.0 8.7 result = np.where(result >= upper, upper, result) 4931 1 33 33.0 2.0 if lower is not None: 4932 result = np.where(result <= lower, lower, result) 4933 1 73 73.0 4.5 if np.any(mask): 4934 result[mask] = np.nan 4935 4936 1 90 90.0 5.6 axes_dict = self._construct_axes_dict() 4937 1 558 558.0 34.6 result = self._constructor(result, **axes_dict).__finalize__(self) 4938 4939 1 2 2.0 0.1 if inplace: 4940 self._update_inplace(result) 4941 else: 4942 1 1 1.0 0.1 return result _

_pd.Series.clip_が_np.ndarray.clip_よりも多くの作業を行う場所が既に強調表示されているため、その時点でサブルーチンに入ることを停止しました。 values（55タイマー単位）での_np.clip_呼び出しの合計時間を、_pandas.Series.clip_メソッドの最初のチェックの1つであるif np.any(pd.isnull(lower))（ 158タイマーユニット）。その時点でpandasメソッドはクリッピングでさえ開始されず、すでに3倍の時間がかかります。

ただし、配列が大きい場合、これらの「オーバーヘッド」のいくつかは重要ではなくなります。

_s = pd.Series(np.random.randint(0, 100, 1000000)) %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1) Timer unit: 4.10256e-07 s Total time: 0.00593476 s File: numpy\core\fromnumeric.py Function: clip at line 1673 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1673 def clip(a, a_min, a_max, out=None): 1674 """ ... 1726 """ 1727 1 14466 14466.0 100.0 return _wrapfunc(a, 'clip', a_min, a_max, out=out) Total time: 0.00592779 s File: numpy\core\fromnumeric.py Function: _wrapfunc at line 55 Line # Hits Time Per Hit % Time Line Contents ============================================================== 55 def _wrapfunc(obj, method, *args, **kwds): 56 1 1 1.0 0.0 try: 57 1 14448 14448.0 100.0 return getattr(obj, method)(*args, **kwds) 58 59 # An AttributeError occurs if the object does not have 60 # such a method in its class. 61 62 # A TypeError occurs if the object does have such a method 63 # in its class, but its signature is not identical to that 64 # of NumPy's. This situation has occurred in the case of 65 # a downstream library like 'pandas'. 66 except (AttributeError, TypeError): 67 return _wrapit(obj, method, *args, **kwds) Total time: 0.00591302 s File: pandas\core\generic.py Function: clip at line 4969 Line # Hits Time Per Hit % Time Line Contents ============================================================== 4969 def clip(self, lower=None, upper=None, axis=None, inplace=False, 4970 *args, **kwargs): 4971 """ ... 5021 """ 5022 1 17 17.0 0.1 if isinstance(self, ABCPanel): 5023 raise NotImplementedError("clip is not supported yet for panels") 5024 5025 1 14 14.0 0.1 inplace = validate_bool_kwarg(inplace, 'inplace') 5026 5027 1 97 97.0 0.7 axis = nv.validate_clip_with_axis(axis, args, kwargs) 5028 5029 # GH 17276 5030 # numpy doesn't like NaN as a clip value 5031 # so ignore 5032 1 125 125.0 0.9 if np.any(pd.isnull(lower)): 5033 1 2 2.0 0.0 lower = None 5034 1 30 30.0 0.2 if np.any(pd.isnull(upper)): 5035 upper = None 5036 5037 # GH 2747 (arguments were reversed) 5038 1 2 2.0 0.0 if lower is not None and upper is not None: 5039 if is_scalar(lower) and is_scalar(upper): 5040 lower, upper = min(lower, upper), max(lower, upper) 5041 5042 # fast-path for scalars 5043 1 2 2.0 0.0 if ((lower is None or (is_scalar(lower) and is_number(lower))) and 5044 1 32 32.0 0.2 (upper is None or (is_scalar(upper) and is_number(upper)))): 5045 1 14092 14092.0 97.8 return self._clip_with_scalar(lower, upper, inplace=inplace) 5046 5047 result = self 5048 if lower is not None: 5049 result = result.clip_lower(lower, axis, inplace=inplace) 5050 if upper is not None: 5051 if inplace: 5052 result = self 5053 result = result.clip_upper(upper, axis, inplace=inplace) 5054 5055 return result Total time: 0.00575753 s File: pandas\core\generic.py Function: _clip_with_scalar at line 4920 Line # Hits Time Per Hit % Time Line Contents ============================================================== 4920 def _clip_with_scalar(self, lower, upper, inplace=False): 4921 1 2 2.0 0.0 if ((lower is not None and np.any(isna(lower))) or 4922 1 28 28.0 0.2 (upper is not None and np.any(isna(upper)))): 4923 raise ValueError("Cannot use an NA value as a clip threshold") 4924 4925 1 120 120.0 0.9 result = self.values 4926 1 3525 3525.0 25.1 mask = isna(result) 4927 4928 1 86 86.0 0.6 with np.errstate(all='ignore'): 4929 1 2 2.0 0.0 if upper is not None: 4930 1 9314 9314.0 66.4 result = np.where(result >= upper, upper, result) 4931 1 61 61.0 0.4 if lower is not None: 4932 result = np.where(result <= lower, lower, result) 4933 1 283 283.0 2.0 if np.any(mask): 4934 result[mask] = np.nan 4935 4936 1 78 78.0 0.6 axes_dict = self._construct_axes_dict() 4937 1 532 532.0 3.8 result = self._constructor(result, **axes_dict).__finalize__(self) 4938 4939 1 2 2.0 0.0 if inplace: 4940 self._update_inplace(result) 4941 else: 4942 1 1 1.0 0.0 return result _

isnaや_np.where_などの複数の関数呼び出しがかなりの時間を要しますが、全体的には少なくとも_np.ndarray.clip_の時間に匹敵します（それはレジーム内にあります）私のコンピューターでは、タイミングの差は約3です）。

テイクアウトはおそらく次のとおりです。

多くのNumPy関数は、渡されたオブジェクトのメソッドに委任するだけなので、異なるオブジェクトを渡すと大きな違いが生じる可能性があります。
プロファイリング、特にラインプロファイリングは、パフォーマンスの違いがどこから生じるかを見つけるための優れたツールとなります。
そのような場合は、異なるサイズのオブジェクトを必ずテストしてください。多数の小さな配列を処理する場合を除いて、おそらく重要ではない定数係数を比較できます。

使用バージョン：

_Python 3.6.3 64-bit on Windows 10 Numpy 1.13.3 Pandas 0.21.1 _

obgnaw · Answer

ソースコードを読むだけで、それは明らかです。

def clip(a, a_min, a_max, out=None): """a : array_like Array containing elements to clip.""" return _wrapfunc(a, 'clip', a_min, a_max, out=out) def _wrapfunc(obj, method, *args, **kwds): try: return getattr(obj, method)(*args, **kwds) #This situation has occurred in the case of # a downstream library like 'pandas'. except (AttributeError, TypeError): return _wrapit(obj, method, *args, **kwds) def _wrapit(obj, method, *args, **kwds): try: wrap = obj.__array_wrap__ except AttributeError: wrap = None result = getattr(asarray(obj), method)(*args, **kwds) if wrap: if not isinstance(result, mu.ndarray): result = asarray(result) result = wrap(result) return result

修正：

after pandas v0.13.0_ahl1、pandasにはclipの独自の実装があります。

Simon Bowly · Answer

ここで注意すべきパフォーマンスの違いには2つの部分があります。

各ライブラリでのPythonオーバーヘッド（pandasがさらに役立つ）
数値アルゴリズムの実装の違い（pd.clipは実際にnp.whereを呼び出します）

これを非常に小さな配列で実行すると、Pythonオーバーヘッドの違いがわかります。numpyの場合、これは理解できるほど非常に小さいですが、pandasは多くのチェックを行います（ null値、より柔軟な引数処理など）を使用して、Cのコード基盤に到達する前に2つのコードが通過する段階の大まかな内訳を表示しようとしました。

data = pd.Series(np.random.random(100))

ndarrayでnp.clipを使用する場合、オーバーヘッドは単にオブジェクトのメソッドを呼び出すnumpyラッパー関数です。

>>> %timeit np.clip(data.values, 0.2, 0.8) # numpy wrapper, calls .clip() on the ndarray >>> %timeit data.values.clip(0.2, 0.8) # C function call 2.22 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 1.32 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Pandasは、アルゴリズムに到達する前にEdgeケースをチェックするのにより多くの時間を費やします。

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8) # numpy wrapper, calls .clip() on the Series >>> %timeit data.clip(lower=0.2, upper=0.8) # pandas API method >>> %timeit data._clip_with_scalar(0.2, 0.8) # lowest level python function 102 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 90.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 73.7 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

全体的な時間と比較して、Cコードをヒットする前の両方のライブラリのオーバーヘッドは非常に重要です。 numpyの場合、単一のラッピング命令は数値処理と同じくらい実行に時間がかかります。 Pandasは、関数呼び出しの最初の2層でオーバーヘッドが最大30倍になります。

アルゴリズムレベルで何が起きているのかを分離するには、より大きな配列でこれを確認し、同じ関数をベンチマークする必要があります。

>>> data = pd.Series(np.random.random(1000000)) >>> %timeit np.clip(data.values, 0.2, 0.8) >>> %timeit data.values.clip(0.2, 0.8) 2.85 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2.85 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) >>> %timeit np.clip(data, a_min=0.2, a_max=0.8) >>> %timeit data.clip(lower=0.2, upper=0.8) >>> %timeit data._clip_with_scalar(0.2, 0.8) 12.3 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 12.3 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 12.2 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

python両方の場合のオーバーヘッドは無視できるようになりました。ラッパー関数と引数チェックの時間は、100万個の値の計算時間に比べて小さくなっています。ただし、ソースコードを少し調べると、pandasのclip実装は、実際にはnp.whereではなくnp.clipを使用していることがわかります。

def clip_where(data, lower, upper): ''' Actual implementation in pd.Series._clip_with_scalar (minus NaN handling). ''' result = data.values result = np.where(result >= upper, upper, result) result = np.where(result <= lower, lower, result) return pd.Series(result) def clip_clip(data, lower, upper): ''' What would happen if we used ndarray.clip instead. ''' return pd.Series(data.values.clip(lower, upper))

条件付き置換を実行する前に各ブール条件を個別にチェックするために必要な追加の努力は、速度の違いを説明しているように思われます。 upperとlowerの両方を指定すると、numpy配列を4回通過します（2つの不等式チェックとnp.whereの2回の呼び出し）。これらの2つの関数をベンチマークすると、3〜4倍の速度比が示されます。

>>> %timeit clip_clip(data, lower=0.2, upper=0.8) >>> %timeit clip_where(data, lower=0.2, upper=0.8) 11.1 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2.97 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pandas devsがこの実装で行った理由がわかりません。np.clipは、以前は回避策を必要としていた新しいAPI関数かもしれません。 'pandasは、最終アルゴリズムを実行する前にさまざまなケースをチェックします。これは、呼び出される可能性のある実装の1つにすぎません。

Bharath M · Answer

パフォーマンスが異なる理由は、numpyは、pandasオブジェクトが渡されたときに、組み込みのnumpy関数で同じことを行うよりも、最初にgetattrを使用して関数のpandas実装を検索する傾向があるためです。

遅いpandasオブジェクトのnumpyではなく、pandasバージョンです。

するとき

np.clip(pd.Series([1,2,3,4,5]),a_min=None,amax=1)

_wrapfuncが呼び出されています：

# Code from source def _wrapfunc(obj, method, *args, **kwds): try: return getattr(obj, method)(*args, **kwds)

_wrapfuncのgetattrメソッドのため：

getattr(pd.Series([1,2,3,4,5]),'clip')(None, 1) # Equivalent to `pd.Series([1,2,3,4,5]).clip(lower=None,upper=1)` # 0 1 # 1 1 # 2 1 # 3 1 # 4 1 # dtype: int64

pandas実装を実行すると、多くの事前チェック作業が行われます。 numpyを介してpandas実装を行う関数の速度にこのような違いがあるのはそのためです。

クリップだけでなく、cumsum、cumprod、reshape、searchsorted、transposeなどの関数は、pandasオブジェクトを渡すときにnumpyよりも多くのpandasバージョンを使用します。

Numpyはこれらのオブジェクトに対して作業を行っているように見えるかもしれませんが、内部ではpandas関数です。