Mathematics Behind Backpropagation in Neural Networks

UFLDL gives a proof of the backpropagation equations for neural networks. Let's summarize it here and extend it to include the sparsity penalty.

What is $\delta_i^{(l)}$?

Definition
$\delta_i^{(l)} = \frac{\partial}{\partial z_i^{(l)}}J(W,b)$
where $l$ denotes the $l$-th layer of the neural network and $i$ denotes the $i$-th node in that layer.
In vector form, we define $\delta^{(l)} = \frac{\partial}{\partial z^{(l)}}J(W,b)$.
We will work in vector form (dropping the subscript $i$) to simplify the notation.
First, by definition we have the following equations:
$z^{(l)} = a^{(l-1)}\cdot W^{(l-1)} + b^{(l-1)}$, where $l = 2, 3, \dots, L$
$a^{(l)} = f(z^{(l)})$, where $f$ is the activation function.
Typically we choose the sigmoid function for $f$:
$f(z)=\frac{1}{1+e^{-z}}$
The sigmoid function has the property
$f'(z)=(1-f(z))f(z)$
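As a quick sanity check, here is a minimal NumPy sketch of the sigmoid activation, its derivative, and one forward step. The helper names (`sigmoid`, `sigmoid_prime`, `forward_layer`) are my own, not part of the UFLDL code.

```python
import numpy as np

def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_layer(a_prev, W, b):
    """One forward step: z^{(l)} = a^{(l-1)} W^{(l-1)} + b^{(l-1)}, a^{(l)} = f(z^{(l)})."""
    z = a_prev @ W + b  # row-vector convention, as in the equations above
    return z, sigmoid(z)
```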

Backpropagation without sparsity

Cost function: $J(W,b) = \frac{1}{2}(y-h_{W,b}(x))^2$
Let $L$ denote the output layer.

$\delta^{(L)} = \frac{\partial}{\partial z^{(L)}}J(W,b) =
\frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y-h_{W,b}(x))^2$
$= \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y - a^{(L)})^2
= \frac{\partial}{\partial z^{(L)}}\frac{1}{2}(y - f(z^{(L)}))^2$
$= -(y - f(z^{(L)}))\cdot f'(z^{(L)})$
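In code, the output-layer error term is a one-liner. This sketch reuses the `sigmoid` helpers above and assumes `y` and `z_L` are NumPy arrays of the same shape.

```python
def delta_output(y, z_L):
    """delta^{(L)} = -(y - f(z^{(L)})) * f'(z^{(L)}), elementwise."""
    return -(y - sigmoid(z_L)) * sigmoid_prime(z_L)
```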

For $l = L-1, L-2, \dots, 2$, we calculate $\delta^{(l)} = ((W^{(l)})^T\delta^{(l+1)}) .* f'(z^{(l)})$. We prove this by induction on $l$, with the output layer as the base case.
Assume the formula holds for layer $l$.

For layer $l-1$:
$\delta^{(l-1)} = \frac{\partial}{\partial z^{(l-1)}}J(W,b) =\frac{\partial}{\partial z^{(l)}}J(W,b)\cdot \frac{\partial}{\partial z^{(l-1)}} z^{(l)} $
$= \delta^{(l)}\cdot \frac{\partial}{\partial z^{(l-1)}} z^{(l)}
= \delta^{(l)}\cdot \frac{\partial}{\partial z^{(l-1)}}(a^{(l-1)}W^{(l-1)}+b^{(l-1)})$
$=\delta^{(l)}W^{(l-1)}\cdot \frac{\partial}{\partial z^{(l-1)}}a^{(l-1)}
= \delta^{(l)}W^{(l-1)}\cdot \frac{\partial}{\partial z^{(l-1)}}f(z^{(l-1)})$
$=\delta^{(l)}W^{(l-1)}\cdot f'(z^{(l-1)})$
(Here the product with $W^{(l-1)}$ is shorthand for the componentwise sum $\sum_j \delta_j^{(l)} W_{ij}^{(l-1)}$; written as a matrix product it is the transposed form $((W^{(l-1)})^T\delta^{(l)}) .* f'(z^{(l-1)})$.)
By mathematical induction, the proof is complete.

So we have
$\delta^{(l)} = \frac{\partial}{\partial z^{(l)}}J(W,b)=((W^{(l)})^T\delta^{(l+1)}) .* f'(z^{(l)})$
Backpropagation uses dynamic programming to compute the $\delta^{(l)}$: each layer's error term reuses the cached $\delta^{(l+1)}$ from the layer above, which is what makes the algorithm efficient.
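A rough sketch of that recursion, using the helpers above; `Ws` and `zs` are assumed to be containers indexed so that `Ws[l]` is $W^{(l)}$ (shape $S_l \times S_{l+1}$) and `zs[l]` is the row vector $z^{(l)}$, matching the forward-pass convention.

```python
def backward_deltas(Ws, zs, delta_L, L):
    """Compute delta^{(l)} for l = L-1, ..., 2, reusing delta^{(l+1)} from the layer above."""
    deltas = {L: delta_L}
    for l in range(L - 1, 1, -1):
        # row-vector form of ((W^{(l)})^T delta^{(l+1)}) .* f'(z^{(l)})
        deltas[l] = (deltas[l + 1] @ Ws[l].T) * sigmoid_prime(zs[l])
    return deltas
```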
For the partial derivatives with respect to the parameters:
$\frac{\partial}{\partial W^{(l)}}J(W,b) = \frac{\partial}{\partial z^{(l+1)}}J(W,b)\cdot \frac{\partial}{\partial W^{(l)}}z^{(l+1)}$
$=\delta^{(l+1)}\cdot \frac{\partial}{\partial W^{(l)}}z^{(l+1)}
=\delta^{(l+1)}\cdot \frac{\partial}{\partial W^{(l)}}(a^{(l)}W^{(l)}+b^{(l)})$
$=\delta^{(l+1)}a^{(l)}$

$\frac{\partial}{\partial b^{(l)}}J(W,b) = \frac{\partial}{\partial z^{(l+1)}}J(W,b)\cdot \frac{\partial}{\partial b^{(l)}}z^{(l+1)}$
$=\delta^{(l+1)}\cdot \frac{\partial}{\partial b^{(l)}}z^{(l+1)}
=\delta^{(l+1)}\cdot \frac{\partial}{\partial b^{(l)}}(a^{(l)}W^{(l)}+b^{(l)})$
$=\delta^{(l+1)}$
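A per-example sketch of these two gradient formulas; the helper and its argument names (`a_l` for $a^{(l)}$, `delta_next` for $\delta^{(l+1)}$) are hypothetical.

```python
def layer_gradients(a_l, delta_next):
    """Per-example gradients of J with respect to W^{(l)} and b^{(l)}."""
    grad_W = np.outer(a_l, delta_next)   # shape (S_l, S_{l+1}), matching W^{(l)}
    grad_b = np.asarray(delta_next).copy()  # dJ/db^{(l)} = delta^{(l+1)}
    return grad_W, grad_b
```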

Backpropagation with sparsity

The following derivation only works for a neural network with a single hidden layer. I do not know how it generalizes to other architectures; I leave that as an open question.
$\rho_j^{(2)}=\frac{1}{m}\sum_{i=1}^m a_j^{(2)}(x^{(i)})$ is the average activation of hidden unit $j$ over all $m$ training examples.
$S_l$ is the number of nodes in layer $l$.
Sparsity penalty:
$KL_j^{(2)} = \rho \log\frac{\rho}{\rho_j^{(2)}}+(1-\rho) \log\frac{1-\rho}{1-\rho_j^{(2)}}$
We will use the fact that
$k'(x)=\left(a\log\frac{a}{x}+(1-a)\log\frac{1-a}{1-x}\right)'=-\frac{a}{x}+\frac{1-a}{1-x}$
The overall cost function is
$J_{sparse}(W,b) = J(W,b) + \beta\sum_{j=1}^{S_2}KL_j^{(2)}$
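Here is a small sketch of how the penalty term can be computed. It assumes `A2` is an $(m, S_2)$ NumPy array holding the hidden activations $a^{(2)}$ for all training examples, and `rho`, `beta` are the sparsity target and penalty weight; these names are mine.

```python
def kl_sparsity_penalty(A2, rho, beta):
    """beta * sum_j KL(rho || rho_j^{(2)}) over the S_2 hidden units."""
    rho_hat = A2.mean(axis=0)  # rho_j^{(2)}: average activation of each hidden unit
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum(), rho_hat
```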
Let’s calculate
$\frac{\partial}{\partial w_{i,k}^{(1)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)} $
$=\frac{\partial}{\partial \rho_{k}^{(2)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)}\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta\sum_{j=1}^{S_2}\frac{\partial}{\partial \rho_{k}^{(2)}}(\rho \log\frac{\rho}{\rho_j^{(2)}}+(1-\rho) \log\frac{1-\rho}{1-\rho_j^{(2)}})\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot\frac{\partial}{\partial w_{i,k}^{(1)}}\rho_{k}^{(2)}$
$=\beta(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot\frac{\partial}{\partial w_{i,k}^{(1)}}\frac{1}{m}\sum_{j=1}^m a_k^{(2)}$
$=\frac{\beta}{m}(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot\frac{\partial}{\partial z_{k}^{(2)}}\sum_{j=1}^m a_k^{(2)}\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}z_k^{(2)}$
$=\frac{\beta}{m}(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot \sum_{j=1}^m f'(z_k^{(2)})\cdot \frac{\partial}{\partial w_{i,k}^{(1)}}z_k^{(2)}$
$=\frac{\beta}{m}(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot \sum_{j=1}^m f'(z_k^{(2)})\cdot a_i^{(1)}$
$=\frac{\beta}{m}\sum_{j=1}^m(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot f'(z_k^{(2)})\cdot a_i^{(1)}$
where the sum over $j$ runs over the $m$ training examples, and $a_k^{(2)}$, $z_k^{(2)}$, and $a_i^{(1)}$ are evaluated on the $j$-th example.
Similarly,
$\frac{\partial}{\partial b_{k}^{(1)}}\beta\sum_{j=1}^{S_2}KL_j^{(2)} =\frac{\beta}{m}\sum_{j=1}^m(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot f'(z_k^{(2)})$
So in the backpropagation recursion we can account for the sparsity term by adding it to the hidden-layer error term:
$\delta_k^{(2)} += \beta(-\frac{\rho}{\rho_k^{(2)}}+\frac{1-\rho}{1-\rho_k^{(2)}})\cdot f'(z_k^{(2)})$
But note that the adjusted $\delta_k^{(2)} \not= \frac{\partial}{\partial z_k^{(2)}}J_{sparse}(W,b)$; it is only a bookkeeping device so that the usual gradient formulas still produce the derivatives computed above.
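In code, this adjustment is just an elementwise addition before the gradient formulas are applied. A sketch under the same assumptions as above: `delta2` is $\delta^{(2)}$ and `z2` is $z^{(2)}$ for one example, and `rho_hat` is the vector of $\rho_k^{(2)}$ returned by `kl_sparsity_penalty`.

```python
def add_sparsity_term(delta2, z2, rho_hat, rho, beta):
    """delta_k^{(2)} += beta * (-rho/rho_k^{(2)} + (1-rho)/(1-rho_k^{(2)})) * f'(z_k^{(2)})."""
    sparsity = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    return delta2 + sparsity * sigmoid_prime(z2)
```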

Open Questions for Sparsity Penalty

  • Why is the sparsity penalty important? How does it work?
  • Does the sparsity penalty only work for a neural network with a single hidden layer?
  • If not, how does the sparsity penalty work for a neural network with more than one hidden layer, and how should the backpropagation algorithm be implemented in that case?