3 Shallow Neural Networks

Neural Network Overview

Superscript (i) refers to training example i : [latex]x^{(i)}[/latex]
Superscript [m] refers to layer m : [latex]a^{[m]}[/latex]
Algorithm 3.1 : [latex]\left.\begin{matrix} x\\ w\\ b \end{matrix}\right\} \Rightarrow z = w^Tx + b[/latex]
Algorithm 3.2 : [latex]\left.\begin{matrix} x\\ w\\ b \end{matrix}\right\} \Rightarrow z = w^Tx + b \Rightarrow a = \sigma(z) \Rightarrow L(a,y) \ \text{(loss function)}[/latex]
Algorithm 3.3 : [latex]\left.\begin{matrix} x\\ W^{[1]}\\ b^{[1]} \end{matrix}\right\} \Rightarrow z^{[1]} = W^{[1]}x + b^{[1]} \Rightarrow a^{[1]} = \sigma(z^{[1]})[/latex]
Algorithm 3.4 : [latex]\left.\begin{matrix} x\\ dW^{[1]}\\ db^{[1]} \end{matrix}\right\} \Leftarrow dz^{[1]} = d(W^{[1]}x + b^{[1]}) \Leftarrow da^{[1]} = d\sigma(z^{[1]})[/latex]
Algorithm 3.6 : [latex]\left.\begin{matrix} da^{[1]} = d\sigma(z^{[1]})\\ dW^{[2]}\\ db^{[2]} \end{matrix}\right\} \Leftarrow dz^{[2]} = d(W^{[2]}a^{[1]} + b^{[2]}) \Leftarrow da^{[2]} = d\sigma(z^{[2]}) \Leftarrow dL(a^{[2]}, y)[/latex]
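The logistic-regression chain in Algorithm 3.2 can be sketched in NumPy as follows. This is only a minimal sketch: the input values and the helper names `sigmoid` and `cross_entropy_loss` are illustrative assumptions, and the cross-entropy form of L(a, y) is the loss commonly paired with a sigmoid output, not something these notes spell out.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(a, y):
    """Assumed form of L(a, y) for a single example with label y in {0, 1}."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Illustrative values (assumptions): three features, one example.
x = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1)
w = np.array([[0.1], [-0.2], [0.3]])   # shape (3, 1)
b = 0.0
y = 1

z = (w.T @ x).item() + b       # z = w^T x + b
a = sigmoid(z)                 # a = sigma(z)
loss = cross_entropy_loss(a, y)
print(z, a, loss)
```

The same three steps (linear combination, activation, loss) are repeated once per layer in the neural network versions of the chain.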

Neural Network Representation

input layer
hidden layer
output layer

[latex]a[/latex] : activations
[latex]a^{[0]}[/latex] : the activations of the input layer
[latex]a^{[0]}_1[/latex] : first node of the input layer
Algorithm 3.7 : [latex]a^{[1]}=\begin{bmatrix} a^{[1]}_1\\ a^{[1]}_2\\ a^{[1]}_3\\ a^{[1]}_4 \end{bmatrix}[/latex]

When we count the layers of a neural network, we do not count the input layer, so the hidden layer is layer 1.
In our notational convention, the input layer is layer 0.

So a neural network with one hidden layer is called a two-layer neural network.
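The counting convention can be made concrete with a short sketch. The layer sizes 3, 4 and 1 and the names `layer_sizes` and `parameter_shapes` are assumptions chosen for illustration; the shape rule, W for layer l has one row per node in layer l and one column per node in layer l-1, is the usual convention.

```python
# Layer sizes n[0], n[1], n[2]; the values 3, 4 and 1 are illustrative.
layer_sizes = [3, 4, 1]

# The input layer (layer 0) is not counted, so this is a two-layer network.
num_layers = len(layer_sizes) - 1

# W[l] has shape (n[l], n[l-1]) and b[l] has shape (n[l], 1).
parameter_shapes = {}
for l in range(1, len(layer_sizes)):
    parameter_shapes[f"W{l}"] = (layer_sizes[l], layer_sizes[l - 1])
    parameter_shapes[f"b{l}"] = (layer_sizes[l], 1)

print(num_layers)
print(parameter_shapes)
```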

Computing a Neural Network’s output

Symbols in neural networks :

𝑥 : features
𝑎 : activation (the output of a node)
𝑊 : weights
superscript : index of the layer
subscript : index of the node within the layer

How this neural network computes its outputs :

[latex]z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}[/latex]
[latex]a_1^{[1]} = \sigma(z_1^{[1]})[/latex]
[latex]a_2^{[1]}, a_3^{[1]}, a_4^{[1]}[/latex] are computed in the same way, each from its own weight vector and bias.

Vectorizing : stack the nodes of a layer vertically
Algorithm 3.10 : [latex]a^{[1]} = \begin{bmatrix} a_1^{[1]}\\ a_2^{[1]}\\ a_3^{[1]}\\ a_4^{[1]} \end{bmatrix} = \sigma(z^{[1]})[/latex]
Algorithm 3.11 : [latex]z^{[1]} = \begin{bmatrix} z_1^{[1]}\\ z_2^{[1]}\\ z_3^{[1]}\\ z_4^{[1]} \end{bmatrix} = \begin{bmatrix} \cdots w_1^{[1]T} \cdots \\ \cdots w_2^{[1]T} \cdots \\ \cdots w_3^{[1]T} \cdots \\ \cdots w_4^{[1]T} \cdots \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{[1]}\\ b_2^{[1]}\\ b_3^{[1]}\\ b_4^{[1]} \end{bmatrix} = W^{[1]}x + b^{[1]}[/latex]
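The stacking step can be checked numerically. The sketch below, which assumes random illustrative values and a sigmoid activation, computes the four hidden activations one node at a time and then in a single matrix product, and confirms that both give the same result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)           # fixed seed, illustrative values only
x = rng.standard_normal((3, 1))          # one example with 3 features
W1 = rng.standard_normal((4, 3))         # row j holds the weights of hidden node j
b1 = rng.standard_normal((4, 1))

# Node by node: z_j = w_j^T x + b_j, a_j = sigma(z_j)
a1_loop = np.zeros((4, 1))
for j in range(4):
    z_j = W1[j, :] @ x[:, 0] + b1[j, 0]
    a1_loop[j, 0] = sigmoid(z_j)

# Vectorized: the rows of W1 are stacked, so one product handles all nodes.
z1 = W1 @ x + b1
a1 = sigmoid(z1)

print(np.allclose(a1, a1_loop))
```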

Vectorizing across multiple examples

Take the equations above and, with very little modification, make the neural network compute the outputs for all m training examples at once.
[latex]a^{[2](i)}[/latex] : the activation of layer 2 for training example i
Algorithm 3.12 : [latex]X = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)}\\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}[/latex]
Algorithm 3.13 : [latex]Z^{[1]} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)}\\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}[/latex]
Algorithm 3.14 : [latex]A^{[1]} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)}\\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}[/latex]
Algorithm 3.15 : [latex]\left.\begin{matrix} z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}\\ a^{[1](i)} = \sigma(z^{[1](i)})\\ z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}\\ a^{[2](i)} = \sigma(z^{[2](i)}) \end{matrix}\right\} \Rightarrow \left\{\begin{matrix} Z^{[1]} = W^{[1]}X + b^{[1]}\\ A^{[1]} = \sigma (Z^{[1]})\\ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}\\ A^{[2]} = \sigma (Z^{[2]}) \end{matrix}\right.[/latex]
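A possible NumPy version of the vectorized forward pass is sketched below. The function name `forward`, the layer sizes (3 inputs, 4 hidden units, 1 output) and the random values are assumptions made for illustration; the four lines inside the function follow the vectorized side of Algorithm 3.15.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass for all examples at once; X has one example per column."""
    Z1 = W1 @ X + b1      # (n1, m); b1 is broadcast across the m columns
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2     # (n2, m)
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

rng = np.random.default_rng(0)           # illustrative values only
m = 5
X = rng.standard_normal((3, m))
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
print(A1.shape, A2.shape)
```

Column i of `A2` is the prediction for training example i, so no explicit loop over the examples is needed.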

Justification for vectorized implementation

Algorithm 3.16 : [latex]\begin{matrix} z^{[1](1)} = W^{[1]}x^{(1)} + b^{[1]}\\ z^{[1](2)} = W^{[1]}x^{(2)} + b^{[1]}\\ z^{[1](3)} = W^{[1]}x^{(3)} + b^{[1]} \end{matrix}[/latex]
Algorithm 3.17 : [latex]W^{[1]}X = \begin{bmatrix} \cdots \\ \cdots \\ \cdots \end{bmatrix} \begin{bmatrix} \vdots &\vdots &\vdots &\vdots \\ x^{(1)} &x^{(2)} &x^{(3)} &\cdots \\ \vdots &\vdots &\vdots &\vdots \end{bmatrix} = \begin{bmatrix} \vdots &\vdots &\vdots &\vdots \\ W^{[1]}x^{(1)} &W^{[1]}x^{(2)} &W^{[1]}x^{(3)} &\cdots \\ \vdots &\vdots &\vdots &\vdots \end{bmatrix} = \begin{bmatrix} \vdots &\vdots &\vdots &\vdots \\ z^{[1](1)} &z^{[1](2)} &z^{[1](3)} &\cdots \\ \vdots &\vdots &\vdots &\vdots \end{bmatrix} = Z^{[1]}[/latex]
The last equality takes [latex]b^{[1]} = 0[/latex] for simplicity. In general [latex]b^{[1]}[/latex] is added to every column (by broadcasting in NumPy), which gives [latex]Z^{[1]} = W^{[1]}X + b^{[1]}[/latex].
Stack the training examples in the columns of the matrix [latex]X[/latex], and the corresponding outputs are stacked in the columns of the matrix [latex]Z^{[1]}[/latex].
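The column-by-column argument can be verified directly. This sketch uses three random examples (illustrative values, not data from these notes), computes each column as in Algorithm 3.16, and compares the stacked result with the single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)       # illustrative values only
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))
X = rng.standard_normal((3, 3))      # columns are x^(1), x^(2), x^(3)

# One example at a time, as in Algorithm 3.16.
columns = [W1 @ X[:, [i]] + b1 for i in range(X.shape[1])]
Z1_loop = np.hstack(columns)

# All examples at once; b1 of shape (4, 1) is broadcast across the columns.
Z1 = W1 @ X + b1

print(np.allclose(Z1, Z1_loop))
```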

Activation functions

Algorithm 3.18 Sigmoid : [latex]a = \sigma (z) = \frac{1}{1 + e^{-z}}[/latex]
Algorithm 3.19 tanh : [latex]a = \tanh (z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}[/latex] (almost always works better than the sigmoid function)
Algorithm 3.20 hidden layer : [latex]g(z^{[1]}) = \tanh(z^{[1]})[/latex] (almost always superior to sigmoid)
Algorithm 3.21 binary classification output layer : [latex]g(z^{[2]}) = \sigma(z^{[2]})[/latex] (if y is either 0 or 1)
A drawback of both sigmoid and tanh : if z is either very large or very negative, the slope of the function ends up close to zero, and this can slow down gradient descent.
Algorithm 3.22 ReLU (Rectified Linear Unit) : [latex]a = \max(0, z)[/latex]
Algorithm 3.23 Leaky ReLU : [latex]a = \max(0.01z, z)[/latex]
Some rules of thumb for choosing activation functions :

sigmoid : output layer for binary classification
tanh : almost always better than sigmoid for hidden layers
ReLU : default choice for hidden layers

If you are not sure which of these activation functions works best, try them all, evaluate each on a holdout validation set (also called a development set, which we will talk about later), and go with the one that works better.
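The four activation functions are short enough to write out in NumPy. In the sketch below the function names and the sample values of z are illustrative assumptions; the last lines also evaluate the sigmoid slope at z = 0 and z = 10 to show how small it becomes for large z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # For z < 0 the larger of (slope * z, z) is slope * z.
    return np.maximum(slope * z, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
print(leaky_relu(z))

# Slope of the sigmoid, sigma(z) * (1 - sigma(z)), at z = 0 and z = 10.
slope_at_0 = sigmoid(0.0) * (1 - sigmoid(0.0))
slope_at_10 = sigmoid(10.0) * (1 - sigmoid(10.0))
print(slope_at_0, slope_at_10)
```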

Why do you need a nonlinear activation function?

It turns out that for your neural network to compute interesting functions, you need a nonlinear activation function. If you use a linear activation function, or equivalently no activation function at all, then no matter how many layers your neural network has, all it computes is a linear function of the input, so you might as well not have any hidden layers.
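The collapse to a single linear function can be checked numerically. The sketch below assumes an identity activation in both layers and random illustrative parameters; the names `W_eq` and `b_eq` are hypothetical and stand for the single equivalent layer obtained by expanding W2(W1 X + b1) + b2.

```python
import numpy as np

rng = np.random.default_rng(0)           # illustrative values only
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with a linear (identity) activation g(z) = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same function written as one linear layer.
W_eq = W2 @ W1            # shape (1, 3)
b_eq = W2 @ b1 + b2       # shape (1, 1)
A2_single = W_eq @ X + b_eq

print(np.allclose(A2, A2_single))
```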

Derivatives of activation functions

Algorithm 3.25 sigmoid : [latex]\frac{\mathrm{d} }{\mathrm{d} z}g(z) = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)(1 - g(z))[/latex]
Algorithm 3.26 tanh : [latex]g(z) = \tanh (z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}[/latex]
Algorithm 3.27 : [latex]\frac{\mathrm{d} }{\mathrm{d} z}g(z) = 1 - (\tanh(z))^2[/latex]
Rectified Linear Unit (ReLU) : [latex]g'(z) = \left\{\begin{matrix} 0 & \text{if}\ z < 0\\ 1 & \text{if}\ z > 0\\ \text{undefined} & \text{if}\ z = 0 \end{matrix}\right.[/latex]
Leaky ReLU : [latex]g'(z) = \left\{\begin{matrix} 0.01 & \text{if}\ z < 0\\ 1 & \text{if}\ z > 0\\ \text{undefined} & \text{if}\ z = 0 \end{matrix}\right.[/latex]
In practice the derivative at z = 0 can be set to either of the two neighbouring values.
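These derivative formulas can be compared against a finite-difference estimate. The sketch below is one way to do that; the helper names are hypothetical, and setting the ReLU and Leaky ReLU derivatives at exactly z = 0 to the left-hand value is an assumption made only so that the functions return a number there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu_prime(z):
    # Assumption: use 0 at z == 0, where the derivative is undefined.
    return (z > 0).astype(float)

def leaky_relu_prime(z, slope=0.01):
    # Assumption: use `slope` at z == 0, where the derivative is undefined.
    return np.where(z > 0, 1.0, slope)

def numerical_derivative(g, z, eps=1e-6):
    """Central finite difference (g(z + eps) - g(z - eps)) / (2 * eps)."""
    return (g(z + eps) - g(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])     # z = 0 is deliberately avoided

print(np.allclose(sigmoid_prime(z), numerical_derivative(sigmoid, z), atol=1e-6))
print(np.allclose(tanh_prime(z), numerical_derivative(np.tanh, z), atol=1e-6))
```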

Gradient descent for neural networks

**forward propagation :**
(1) : [latex]Z^{[1]} = W^{[1]}X + b^{[1]}[/latex]
(2) : [latex]A^{[1]} = g^{[1]}(Z^{[1]})[/latex]
(3) : [latex]Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}[/latex]
(4) : [latex]A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})[/latex]
**back propagation :**
Algorithm 3.32 : [latex]\mathrm{d}Z^{[2]} = A^{[2]} - Y, \ Y = [y^{(1)} \ y^{(2)} \ \cdots \ y^{(m)}][/latex]
Algorithm 3.33 : [latex]\mathrm{d}W^{[2]} = \frac{1}{m} \mathrm{d}Z^{[2]}A^{[1]T}[/latex]
Algorithm 3.34 : [latex]\mathrm{d}b^{[2]} = \frac{1}{m} np.sum(\mathrm{d}Z^{[2]}, axis = 1, keepdims = True)[/latex]
Algorithm 3.35 : [latex]\mathrm{d}Z^{[1]} = W^{[2]T}\mathrm{d}Z^{[2]} * g^{[1]'}(Z^{[1]})[/latex] (here * is the element-wise product of two matrices of the same shape)
Algorithm 3.36 : [latex]\mathrm{d}W^{[1]} = \frac{1}{m}\mathrm{d}Z^{[1]}X^T[/latex]
Algorithm 3.37 : [latex]\mathrm{d}b^{[1]} = \frac {1}{m} np.sum(\mathrm{d}Z^{[1]}, axis = 1, keepdims = True)[/latex]
(axis = 1 sums horizontally, across the training examples. keepdims = True ensures that NumPy returns, for [latex]\mathrm{d}b^{[2]}[/latex], a column vector of shape (n, 1) rather than a rank-1 array of shape (n,).)
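The forward and backward equations above can be put together into a small gradient-descent loop. What follows is a sketch under several assumptions that are not in these notes: tanh as the hidden activation, a synthetic two-feature dataset, 4 hidden units, a learning rate of 0.5, a cross-entropy cost, and the function names themselves.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                      # assumed hidden activation g1
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)
    return A1, A2

def backward(X, Y, p, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y                                      # Algorithm 3.32
    dW2 = dZ2 @ A1.T / m                              # Algorithm 3.33
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m      # Algorithm 3.34
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)           # Algorithm 3.35, tanh' = 1 - tanh^2
    dW1 = dZ1 @ X.T / m                               # Algorithm 3.36
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m      # Algorithm 3.37
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def cost(A2, Y):
    # Assumed cross-entropy cost; clipping avoids log(0) if A2 saturates.
    A2 = np.clip(A2, 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))                     # synthetic data, one example per column
Y = (X[0:1] + X[1:2] > 0).astype(float)               # synthetic labels, shape (1, 200)

params = {
    "W1": rng.standard_normal((4, 2)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}

learning_rate = 0.5
costs = []
for _ in range(1000):
    A1, A2 = forward(X, params)
    costs.append(cost(A2, Y))
    grads = backward(X, Y, params, A1, A2)
    for name in params:
        params[name] -= learning_rate * grads[name]   # gradient descent update

print(costs[0], costs[-1])
```

Because `1 - A1 ** 2` reuses the cached tanh activations, the hidden-layer derivative costs no extra calls to `np.tanh`.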

Backpropagation intuition

The derivation of backpropagation is one of the very hardest pieces of math, and one of the very hardest derivations, in all of machine learning.

//TODO, maybe never …

Random Initialization

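It is important to initialize the weights randomly. If all the weights start at zero, every hidden unit in a layer computes the same function and receives the same gradient update, so the units never become different from one another.

Gaussian random values of shape (2, 2) : [latex]W^{[1]} = np.random.randn(2, 2) * 0.01[/latex]
The Gaussian values are usually multiplied by a very small number such as 0.01, so that the weights are initialized to very small random values. Small weights keep z close to zero, where the slope of sigmoid and tanh is not close to zero, so gradient descent is not slowed down at the start.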
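A possible initialization helper is sketched below. The function name `initialize_parameters`, its arguments, the zero biases and the use of `np.random.default_rng` with a fixed seed (instead of `np.random.randn`, purely for reproducibility) are assumptions made for this illustration.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights and zero biases for a network with one hidden layer.

    n_x, n_h, n_y are the numbers of input, hidden and output units.
    """
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,   # Gaussian values times 0.01
        "b1": np.zeros((n_h, 1)),                        # zero biases (assumption)
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=2, n_h=2, n_y=1)
for name, value in params.items():
    print(name, value.shape)
```

Only the weights need to be random to make the hidden units differ from one another, which is why the biases in this sketch are left at zero.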