My first neural network

Introduction

Is it hype or a reasonable prediction? Will neural networks be able to reproduce our complex behaviour? As medical doctors, much of our work relies on pattern recognition. Will neural networks be able to do this better than humans?

To find the truth behind things you sometimes have to dig deeper into a subject. I’m not going to build a complete neural network from scratch here. Instead, to see what is going on inside the internal structure of a neural network, I will go through the basic steps of creating one in this post. There are a lot of possible substeps to discuss, so to avoid going into too much detail I will not use Python but the Octave language, as proposed by Andrew Ng.

Dataset

We will use the digit classification dataset to set up our network. The dataset (X) consists of 5000 handwritten digits, each stored as a 20×20 pixel image flattened into a row of 400 values. The classifications are stored in the vector y, which contains a number between 1 and 10, where the label 10 represents the digit 0 to avoid confusion (Octave indices start at 1).

We define:

input_layer_size  = 400;  % 20x20 Input Images of Digits
hidden_layer_size = 25; % 25 hidden units
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)

The dataset and the weights are loaded with the commands:

load('ex4data1.mat');
load('ex4weights.mat');
m = size(X, 1);

Here m is the number of digits, i.e. the number of training examples.
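A quick sanity check is to look at the matrix dimensions after loading (the sizes below assume the standard ex4data1.mat and ex4weights.mat files from the course):

size(X)       % 5000 x 400: one flattened 20x20 image per row
size(y)       % 5000 x 1: labels from 1 to 10 (10 stands for the digit 0)
size(Theta1)  % 25 x 401: weights from input layer (plus bias) to hidden layer
size(Theta2)  % 10 x 26: weights from hidden layer (plus bias) to output layer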

Step 1 one-hot encoding

Transform the classifications with one-hot encoding. Here the digits, encoded as integer variables (1, 2, 3, etc.), are replaced by a binary variable for each unique integer value. For example, if there are three classes in our classification, red, blue and yellow, then these would be encoded as the vectors [1, 0, 0], [0, 1, 0] and [0, 0, 1].

Yhot = zeros(m, num_labels);
for i = 1:m
  zeroline = zeros(1, num_labels);
  zeroline(y(i)) = 1;       % set position y(i) to 1
  Yhot(i, :) = zeroline;
end
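The loop above works, but the same encoding can also be written as a single vectorized statement by selecting rows from an identity matrix (an equivalent alternative that produces the same Yhot):

I = eye(num_labels);   % 10x10 identity matrix
Yhot = I(y, :);        % row i is the one-hot encoding of label y(i)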

Step 2 sigmoid activation function

Create a sigmoid function. This sigmoid function passes the activation pattern from layer to layer. It squashes any input into the familiar S-shaped curve between 0 and 1, so the output of function g, given the variable z, can be interpreted as a probability between 0 and 1.

function g = sigmoid(z)
g = 1.0 ./ (1.0 + exp(-z));
end
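A few example values show the typical S-shaped behaviour: large negative inputs map to values near 0, zero maps to exactly 0.5, and large positive inputs map to values near 1:

sigmoid([-10 0 10])   % roughly 0.000045, 0.5 and 0.99995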

Step 3 feedforward

In this step we feed the information forward through the neural network: from the input layer, through the hidden layer, to the output layer (the classic feedforward diagram you can find on Wikipedia).

The steps are as follows:

  1. Create a matrix a1 of “activation values”, defined by the input data X (a 5000×400 matrix, one flattened 20×20 image per row), with a vector of ones of length m (the number of training examples) prepended as the first column to account for the “bias unit.”
  2. Perform a matrix multiplication of the activation values (a1) with the matrix of weights (Theta1) and call the output z2. We have to transpose Theta1 to make the matrix multiplication possible.
  3. Apply the earlier defined sigmoid function to matrix z2.
  4. Take the output of the previous step, prepend a column of ones of length m that represents the bias units, and define the result as the matrix of activation values of the hidden or middle layer (a2).
  5. Repeat step 2 with a2 and Theta2 to get z3, the inputs to the output layer for all training examples (a 5000×10 matrix).
  6. Repeat step 3 on z3 to obtain the activation values of the output layer (no bias column is added here).
  7. Rename the matrix of the output layer to h, which refers to the outcome of steps 1 to 6 applied to X (h stands for “hypothesis”). The expected dimensions of these matrices are checked after the code below.
a1 = [ones(m, 1) X];
z2 = a1*Theta1';
a2 = [ones(m, 1) sigmoid(z2)];
z3 = a2*Theta2';
a3 = sigmoid(z3);
h = a3;
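Checking the matrix dimensions at each stage is an easy way to catch mistakes; with this dataset they should come out as:

size(a1)   % 5000 x 401  (input plus bias column)
size(z2)   % 5000 x 25
size(a2)   % 5000 x 26   (hidden activations plus bias column)
size(z3)   % 5000 x 10
size(h)    % 5000 x 10   (one output value per label per example)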

Step 4 backpropagation

Use the difference between the outputs of the hypothesis (h) and the one-hot encoded digit labels (Yhot) to propagate the error backwards through the network and correct the weights.

  1. Subtract, elementwise, the one-hot encoded true outcomes (Yhot) from the outputs of the hypothesis (h) and define the result as sigma3.
  2. Multiply sigma3 with Theta2, the weights that project the hidden layer onto the output layer.
  3. The errors of the hidden layer (sigma2) are calculated by the formula (sigma3*Theta2) .* g'(z2), in which g' is the derivative of the sigmoid function. This derivative can be shown to equal a.*(1-a), so we can simply use that expression to calculate sigma2 (an alternative with an explicit derivative is sketched after the code below).
  4. Remove the first column of sigma2, the column that corresponds to the bias unit.
sigma3 = h - Yhot;                        % error at the output layer
sigma2 = (sigma3*Theta2).*(a2.*(1-a2));   % propagate the error to the hidden layer
sigma2 = sigma2(:, 2:end);                % drop the bias-unit column
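As mentioned in step 3 of the list above, the same sigma2 can be computed with an explicit sigmoid derivative. The sketch below assumes a helper function sigmoidGradient (not part of the code above); dropping the bias column of Theta2 up front means no column has to be removed afterwards:

function g = sigmoidGradient(z)
g = sigmoid(z) .* (1 - sigmoid(z));   % g'(z) = g(z)*(1 - g(z))
end
sigma2_alt = (sigma3 * Theta2(:, 2:end)) .* sigmoidGradient(z2);
% sigma2_alt equals the sigma2 computed above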

Step 5 Gradients

The sigma values are used to calculate a matrix of gradients that show us whether the weights should be corrected upwards or downwards. 

  1. Transpose the sigma matrices from backpropagation, multiply the transposed matrices with the corresponding activation values, and perform an elementwise division by the number of training examples (m).
  2. Add regularization terms to help the network generalize. Regularization is a technique that makes slight modifications to the learning algorithm so that the model generalizes better. Do this by multiplying lambda (the regularization strength, which has to be set beforehand, e.g. lambda = 1) divided by the number of training examples (m) with the weight matrix Theta1, in which the first column is replaced by a vector of zeros so that the bias weights are not regularized. Do the same with Theta2.
delta_1 = (sigma2'*a1);
delta_2 = (sigma3'*a2);
Theta1_grad = delta_1./m;                % unregularized gradients (step 1)
Theta2_grad = delta_2./m;
reg1 = (lambda/m)*[zeros(size(Theta1, 1), 1) Theta1(:, 2:end)];
reg2 = (lambda/m)*[zeros(size(Theta2, 1), 1) Theta2(:, 2:end)];
Theta1_grad = (delta_1./m)+reg1;         % regularized gradients (step 2)
Theta2_grad = (delta_2./m)+reg2;
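In the course exercise these gradients are computed inside nnCostFunction, which returns them unrolled into one long vector so the optimizer in step 7 can use them (a sketch of that last step):

grad = [Theta1_grad(:) ; Theta2_grad(:)];   % unroll both gradient matrices
size(grad)                                  % 10285 x 1: one entry per weight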

Step 6 initialize parameters

Create a function that generates random values and use it to create random initialisation weights for the input layer->hidden layer connections and the hidden layer->output layer connections.

function W = randInitializeWeights(L_in, L_out)
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
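The random initialisation breaks the symmetry between the hidden units: if all weights started at the same value, every hidden unit would learn the same function. The unrolled parameter vector should contain one entry per weight:

numel(initial_nn_params)   % 25*(400+1) + 10*(25+1) = 10285 parameters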

Step 7 Training 

First set the number of iterations:

options = optimset('MaxIter', 50);

Create a “short hand” (an anonymous function) for the cost function that will be minimized:

costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, X, y, lambda);
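Internally, nnCostFunction combines steps 1 to 5 above: it feeds the data forward and computes both the gradients and the regularized cross-entropy cost J. The cost part looks roughly like this (a sketch, assuming Yhot and h are computed as in the steps above):

J = (1/m) * sum(sum(-Yhot .* log(h) - (1 - Yhot) .* log(1 - h)));
J = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));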

Minimise the cost function and return the trained parameters with fmincg (an optimizer provided with the course exercise that, much like gradient descent, uses the returned cost and gradients to update the parameters):

[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

Running the function yields:

Iteration    50 | Cost: 4.944190e-01

Predict

Unpack Theta1 and Theta2 from the trained parameter vector nn_params:

Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

Predict by:

  1. Run the earlier feedforward algorithm as a prediction algorithm with the newly trained Theta1 and Theta2.
  2. Get the vector of predicted values (pred) by taking, for each example, the label with the highest output activation, and compare it with y.
  3. The comparison pred == y yields a double-precision (float) 1 where the prediction is correct and 0 where it is not.
  4. Take the mean of the values of this vector and multiply by 100 to get an accuracy percentage.
a1 = [ones(m, 1), X];
z2 = a1 * Theta1';
a2 = [ones(m, 1), sigmoid(z2)];
z3 = a2 * Theta2';
a3 = sigmoid(z3);
[maxval, pred] = max(a3, [], 2);
mean(double(pred == y)) * 100

This returns a training set accuracy of 94.98%.
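The same feedforward pass can also classify a single image. Purely as an illustration, predicting the label of the first training example could look like this (the variable names here are new):

x = X(1, :);                                             % one 1x400 image
h1 = sigmoid([1, sigmoid([1, x] * Theta1')] * Theta2');  % feedforward for one example
[maxval, label] = max(h1)                                % predicted label (10 means digit 0)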