損失関数から独立してSoftmax導関数を実装する方法は？

Question

ニューラルネットワークライブラリでは、いくつかのアクティベーション関数と損失関数とそれらの派生関数を実装しました。それらは任意に組み合わせることができ、出力層の導関数は損失導関数と活性化導関数の積になります。

しかし、損失関数とは別にSoftmaxアクティベーション関数の導関数を実装することに失敗しました。正規化、つまり方程式の分母により、1つの入力アクティベーションを変更すると、1つだけではなくすべての出力アクティベーションが変更されます。

これが私のSoftmax実装で、導関数は勾配チェックに約1％失敗します。 Softmax導関数を実装して、損失関数と組み合わせるにはどうすればよいですか？

import numpy as np class Softmax: def compute(self, incoming): exps = np.exp(incoming) return exps / exps.sum() def delta(self, incoming, outgoing): exps = np.exp(incoming) others = exps.sum() - exps return 1 / (2 + exps / others + others / exps) activation = Softmax() cost = SquaredError() outgoing = activation.compute(incoming) delta_output_layer = activation.delta(incoming) * cost.delta(outgoing)

Aerin · Accepted Answer

数学的には、ロジットZi（たとえばWi * X）に関するSoftmaxσ（j）の導関数は、

ここで、赤いデルタはクロネッカーデルタです。

繰り返し実装する場合：

def softmax_grad(s): # input s is softmax value of the original input x. Its shape is (1,n) # i.e. s = np.array([0.3,0.7]), x = np.array([0,1]) # make the matrix whose size is n^2. jacobian_m = np.diag(s) for i in range(len(jacobian_m)): for j in range(len(jacobian_m)): if i == j: jacobian_m[i][j] = s[i] * (1 - s[i]) else: jacobian_m[i][j] = -s[i] * s[j] return jacobian_m

テスト：

In [95]: x Out[95]: array([1, 2]) In [96]: softmax(x) Out[96]: array([ 0.26894142, 0.73105858]) In [97]: softmax_grad(softmax(x)) Out[97]: array([[ 0.19661193, -0.19661193], [-0.19661193, 0.19661193]])

ベクトル化されたバージョンで実装する場合：

soft_max = softmax(x) # reshape softmax to 2d so np.dot gives matrix multiplication def softmax_grad(softmax): s = softmax.reshape(-1,1) return np.diagflat(s) - np.dot(s, s.T) softmax_grad(soft_max) #array([[ 0.19661193, -0.19661193], # [-0.19661193, 0.19661193]])

ticcky · Answer

次のようになります：（xはソフトマックスレイヤーへの入力で、dyはその上の損失からのデルタです）

_ dx = y * dy s = dx.sum(axis=dx.ndim - 1, keepdims=True) dx -= y * s return dx _

ただし、エラーを計算する方法は次のとおりです。

_ yact = activation.compute(x) ycost = cost.compute(yact) dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue)) _

説明：delta関数は逆伝搬アルゴリズムの一部であるため、その責任はベクトルdy（私のコードではoutgoing）をヤコビアンで乗算することです。 xで評価されるcompute(x)関数のこのヤコビアンがsoftmax [1]でどのように見えるかを計算し、左からベクトルdyを乗算すると、少しの代数の後、次のような結果が得られます。 my Pythonコード。

[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network

Kari · Answer

以下は、組み込みを使用するc ++ベクトル化バージョンです（非SSEバージョンよりも22倍（！）高速）。

// How many floats fit into __m256 "group". // Used by vectors and matrices, to ensure their dimensions are appropriate for // intrinsics. // Otherwise, consecutive rows of matrices will not be 16-byte aligned, and // operations on them will be incorrect. #define F_MULTIPLE_OF_M256 8 //check to quickly see if your rows are divisible by m256. //you can 'undefine' to save performance, after everything was verified to be correct. #define ASSERT_THE_M256_MULTIPLES #ifdef ASSERT_THE_M256_MULTIPLES #define assert_is_m256_multiple(x) assert( (x%F_MULTIPLE_OF_M256) == 0) #else #define assert_is_m256_multiple (q) #endif // usually used at the end of our Reduce functions, // where the final __m256 mSum needs to be collapsed into 1 scalar. static inline float slow_hAdd_ps(__m256 x){ const float *sumStart = reinterpret_cast<const float*>(&x); float sum = 0.0f; for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){ sum += sumStart[i]; } return sum; } f_vec SoftmaxGrad_fromResult(const float *softmaxResult, size_t size, const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer assert_is_m256_multiple(size); //allocate vector, where to store output: f_vec grad_v(size, true);//true: skip filling with zeros, to save performance. const __m256* end = (const __m256*)(softmaxResult + size); for(size_t i=0; i<size; ++i){// <--for every row //go through this i'th row: __m256 sum = _mm256_set1_ps(0.0f); const __m256 neg_sft_i = _mm256_set1_ps( -softmaxResult[i] ); const __m256 *s = (const __m256*)softmaxResult; const __m256 *gAbove = (__m256*)gradFromAbove; for (s; s<end; ){ __m256 mul = _mm256_mul_ps(*s, neg_sft_i); // sftmaxResult_j * (-sftmaxResult_i) mul = _mm256_mul_ps( mul, *gAbove ); sum = _mm256_add_ps( sum, mul );//adding to the total sum of this row. ++s; ++gAbove; } grad_v[i] = slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row). }//end for every row //reset back to start and subtract a vector, to account for Kronecker delta: __m256 *g = (__m256*)grad_v._contents; __m256 *s = (__m256*)softmaxResult; __m256 *gAbove = (__m256*)gradFromAbove; for(s; s<end; ){ __m256 mul = _mm256_mul_ps(*s, *gAbove); *g = _mm256_add_ps( *g, mul ); ++s; ++g; } return grad_v; }

なんらかの理由で単純な（非SSE）バージョンが必要な場合は、次のようになります：

inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult, const float *gradFromAbove, //<--gradient vector, flowing into us from the above layer float *gradOutput, size_t count ){ // every pre-softmax element in a layer contributed to the softmax of every other element // (it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem. for(size_t i=0; i<count; ++i){ //go through this i'th row: float sum = 0.0f; const float neg_sft_i = -softmaxResult[i]; for(size_t j=0; j<count; ++j){ float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i; sum += mul;//adding to the total sum of this row. } //NOTICE: equals, overwriting any old values: gradOutput[i] = sum; }//end for every row for(size_t i=0; i<count; ++i){ gradOutput[i] += softmaxResult[i] * gradFromAbove[i]; } }