A step-by-step guide on developing a Multi-Layer Neural Network for classifying Magnetic Resonance images of Alzheimer’s disease
In December 2019, with the support of Medium and Towards Data Science, I started writing a series of articles and Python tutorials explaining how Machine Learning (ML) works and how it can be applied to biomedical data. The choice to build ML algorithms from scratch, without the support of frameworks like TensorFlow or Scikit-learn, was intentional: the goal was to explain the foundations of ML, with a particular benefit for readers interested in a deep understanding of all the concepts. Moreover, those frameworks are powerful tools that require special attention to parameter settings, and an in-depth study of their foundations helps the reader make the most of their potential.
Over time, I realized that diagnostic imaging may be the biomedical sector that benefits most from AI's help. If nothing else, it is a field that offers many opportunities for optimizing ML algorithms. Indeed, diagnostic imaging is the medical field where variability and difficulty of definition collide with the need to abstract, broadly speaking, the nature of a pathological process.
Today, increasingly efficient AI systems flank physicians and scientists in challenging tasks, such as the early diagnosis of disease based on imaging. More and more often, AI's contribution is not limited to helping with the prompt diagnosis of an illness: it can also proficiently reveal what underlies a pathological process and when a tissue or an organ starts to degenerate. Consider the case of Alzheimer's disease (AD), a chronic neurodegenerative disease causing neuron degeneration and brain tissue loss (McKhann et al., 1984): the cost of caring for AD patients keeps rising, and a prompt diagnosis is becoming a necessity.
In this article, I wish to bring to your attention the construction of a Multi-Layer Perceptron (MLP Neural Network) that we will use as an image classification system for the Alzheimer's Disease Magnetic Resonance Imaging dataset (ADMRI) from Kaggle. In my previous articles, I described other straightforward approaches to image detection and outcome prediction based on Logistic Regression (LR). In those articles, we saw how to successfully classify data from the MNIST handwritten digits or the Zalando dataset, with an accuracy of 97%. Still, those approaches are not as effective when the structure of a training set reaches a certain complexity, as in the case of the ADMRI.
For the code implemented in this article, we will use Python. I recommend using a Jupyter notebook and a Python version ≥ 3.8. The code requires some optimized numerical computation, especially for the linear algebra involved. We will use packages like Pandas, NumPy, and SciPy, part of SciPy.org, a Python-based open-source software ecosystem for mathematics, science, and engineering, as well as matplotlib, a Python data visualization library. Moreover, an object "opt" will be created from scipy.optimize to minimize the Cost Function through its Gradient. The last library is "pickle," needed for opening files in pickle format.
Suggestion:
For a better understanding of concepts such as the sigmoid function, the Cost Function, and Gradient Descent in Linear and Logistic Regression, I suggest the reader read my previous articles on these topics first:
1. Linear Regression with one or more variables;
2. Logistic Regression for malignancy prediction in cancer;
3. Detecting Melanoma with One-vs-All;
4. One-vs-All Logistic Regression for Image Recognition in Python.
These are fundamental preparatory concepts that will be taken for granted here, and many of them will only be briefly mentioned. So, before starting this article, consider reading the others in the suggested numerical order.
Datasets
The Alzheimer's Disease Magnetic Resonance Imaging dataset (ADMRI) is downloadable from Kaggle. The data contains two folders: the first contains 33,984 augmented MRI images, and the second contains the originals. The data has four classes of images, in both the training and testing sets: 1) Non-Demented, 2) Moderate Demented, 3) Mild Demented, and 4) Very Mild Demented. Each picture is about 200 x 200 pixels, corresponding to 40,000 features.
For this exercise, I have provided a transformed version of the same dataset that you can download from the following link. This "light" version contains the entire augmented dataset reduced to a 50 x 50 resolution. The pickle file you will get after unzipping it is a pandas dataframe composed of two columns, one with the 2500 features characterizing each image and the other containing the corresponding labels, defined as follows:
Label 1: Non-Demented
Label 2: Moderate Demented
Label 3: Mild Demented
Label 4: Very Mild Demented
First, we need to upload all the necessary libraries for opening the dataset and displaying images:
import pandas as pd
import numpy as np
from matplotlib.pyplot import imshow
import pickle
Then, open the pickle file and inspect the dataframe it contains by typing the following code in a Jupyter notebook cell:
with open('AugmentedAlzheimer.50X50.training.dataset.pickle', 'rb') as handle:
    ALZ = pickle.load(handle)
The dataframe, which we call "ALZ" here, contains all the pictures and their corresponding labels. Typing "ALZ" in a new cell will show the data structure:
Now, transform the dataframe's two columns into two distinct NumPy objects: X, the training dataset, and y, the labels vector. (Transforming the data into NumPy objects is necessary for the linear algebra calculations):
X = np.stack(ALZ['X'])
y = np.int64(ALZ['y'])
The X vector contains 33,984 items, each containing a vector of 2500 grayscale values. The NumPy shape attribute shows the X structure:
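A minimal check (the shape shown as a comment assumes the augmented training set loaded above):
X.shape
# (33984, 2500)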
By manipulating X, we can access the single items using their index; for example, we can access the first 250 values of the first image (index = 0) with a simple slice:
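A minimal example (one possible way to slice the array):
X[0][:250]   # prints the first 250 grayscale values of image #0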
Each pixel corresponds to a specific gray value in a 0–255 range, where 0 is black and 255 is white. Now that we can access the X vector, the matplotlib function imshow() will display a grayscale picture of 2500 pixels framed in a 2D representation of 50 x 50 pixels:
'''
Reshape a 2500-value vector extracted from one of the images
stored in the vector X (i.e., #12848).
Use the NumPy method .reshape, specifying the double argument '50',
then show the image with the function imshow, specifying the argument
cmap='gray'
'''
image = X[12848].reshape(50, 50)
print('image:', 12848)
print('label:', y[12848])
imshow(image, cmap='gray')
Running the code, image number #12848 appears as follows:
Instead of a one-by-one procedure, we would prefer a handy function that randomly displays fifty images picked from the dataset. Here is the code:
'''
plotSamplesRandomly
Function for visualizing fifty randomly picked images from the dataset
'''
def plotSamplesRandomly(X, y):
    from random import randint
    import matplotlib.pyplot as plt
    %matplotlib inline

    # create a list of randomly picked indexes.
    # the function randint creates the list, picking numbers in a
    # range 0-33983, which covers the length of X
    randomSelect = [randint(0, len(X) - 1) for i in range(0, 51)]

    # reshape all the pictures on the n X n pixels,
    # where n = sqrt(size of X), in this case 50 = sqrt(2500)
    w, h = int(np.sqrt(X.shape[1])), int(np.sqrt(X.shape[1]))
    fig = plt.figure(figsize=(int(np.sqrt(X.shape[1])), int(np.sqrt(X.shape[1]))))

    # Define a grid of 10 X 10 for the big plot.
    columns = 10
    rows = 10

    # The for loop
    for i in range(1, 51):
        # create the 2-dimensional picture
        image = X[randomSelect[i]].reshape(w, h)
        ax = fig.add_subplot(rows, columns, i)
        # create a title for each picture, containing #index and label
        title = "#" + str(randomSelect[i]) + "; " + "y:" + str(y[randomSelect[i]])
        # set the title font size
        ax.set_title(title, fontsize=int(np.sqrt(X.shape[1]) / 2))
        # don't display the axis
        ax.set_axis_off()
        # plot the image in grayscale
        plt.imshow(image, cmap='gray')

    # Show the samples
    print('\nShow samples randomly:')
    plt.show()
Run the plotSamplesRandomly() function, passing X and y as arguments:
plotSamplesRandomly(X, y)
The output is in Figure 4:
I have also produced a light version of the original Alzheimer's Disease Magnetic Resonance Imaging dataset, downloadable from this Kaggle link. It is possible to use this dataset for testing.
Important: the only aim of this exercise is to expose basic concepts on the application of an MLP Neural Network to Alzheimer's Disease Magnetic Resonance images, and it is intended ONLY for experimental purposes, not for clinical use.
The Logistic Unit model
Let us start with some historical definitions: by Neural Networks (NN), we mean a class of AI algorithms that draw inspiration from networks of biological neuronal cells. They were born with the specific aim of trying to mimic human brain functions. The assumptions behind their functioning have their foundations in studies conducted eighty years ago by eminent neuroscientists. In 1943, Warren McCulloch and Walter Pitts published a pivotal work entitled "A Logical Calculus of the Ideas Immanent in Nervous Activity," describing, for the first time, a computational model of how networks of biological neurons work together to accomplish complex tasks.
From that year on, many NN models followed, one after another. For example, in 1957, Frank Rosenblatt defined the concept of the Perceptron, a sort of artificial neuron able to classify a linear relationship between inputs and outputs. We can undoubtedly say that Artificial Intelligence was born at that moment. Still, despite the dramatically growing interest in all these models, the huge hardware limitations of the time relegated Artificial Intelligence to the world of hopes and dreams rather than the real one. We had to wait until 1986, when D. E. Rumelhart and colleagues published a paper introducing the concept of backpropagation training: an algorithm that automates the calculation of the Cost Function's derivatives and thus allows Gradient Descent to find the minimum of the Cost Function in multi-layer networks.
When we implement a Neural Network, in its modern meaning, we refer to a simplified model of what a group of biological neurons does. In this simplification, a neuron is just a logistic unit, connected through various dendrites to other similar input units, and it produces an output value through its axon.
The logistic unit model simplifies the biological neuron. The body (the blue circle in Figure 5) integrates values from the input neurons (the X vector), including the bias neuron, colored red, connected to the logistic unit through weighted wires, and produces an output value through its axon. The output corresponds to the hypothesis model hθ. A way to represent hθ is the g(z) function, a sigmoid (or logistic) function, thus non-linear, whose formula can be written like this:
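In standard notation (this is the same function implemented in Python later in this article):
$$ g(z) = \frac{1}{1 + e^{-z}} $$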
The g(z) function takes as its argument the product of the transposed θ vector with the X vector (we will call this product z), which can be defined as:
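In the same notation:
$$ z = \theta^{T}X = \theta_0 X_0 + \theta_1 X_1 + \dots + \theta_n X_n $$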
Suppose the X vector assumes 0 or 1 values; then g(z), with z = [θ₀X₀ + θ₁X₁ + θ₂X₂], calculates the probability that the output is 0 or 1. The sigmoid logistic function is also called the activation function, a definition deriving from the physiological behavior of each biological neuron, which is activated based on the type of input it receives. This input is algebraically multiplied by the θ vector, the weight vector of the logistic unit. As for all regression models (linear or logistic), the logistic unit has a Cost Function, which registers how far we are from the minimum of hθ and helps us find the best θ.
But before tackling the Neural Network's Cost Function, let us understand how a Neural Network works with some intuitions and two practical examples. The Neural Network we will describe is deliberately simple; it has no hidden layers and is composed only of a logistic unit and a couple of input neurons.
Simple Neural Network Intuition I: the AND gate
Imagine, for instance, that we want our Neural Network to learn the AND logic gate:
Figure 6a represents an implementation of the AND gate in the logistic unit, which exploits three θ values that we have deliberately assigned: the θ vector is, indeed, equal to the triad θ₀ = -3, θ₁ = +2, and θ₂ = +2 (values in green). The table portion in red represents the AND truth table. The two AND operands are X1 and X2, which can assume values equal to zero or one, while X0 is the input neuron representing the bias, and its value is equal to 1 by default. The output function Y corresponds to the hypothesis model hθ, implemented by the function g(z). Depending on z, the function g(z) returns a value approximated to 0 or 1, deriving from where z falls on the sigmoid.
For example, for X1 = 0 and X2 = 0, g(z) will equal 0.05 since the only value of the X vector ≠ 0 is X0, thus the value stored for the bias neuron (-3). Indeed, the value returned by g(-3) = 0.05 is approximately zero when mapped onto the sigmoid. Conversely, when X1 = 1 and X2 = 1 (the only condition where X1 AND X2 is true), g(z) is equal to g(-3+2+2) = g(1), which, mapped onto the sigmoid, returns 0.73, a value well beyond the 0.5 threshold and that approximates 1. In the two remaining cases, X1=0, X2=1 and X1=1, X2=0, g(z) is equal to g(-1) = 0.27, a value that is again near 0 (Figure 6b). The g(z) output in all four cases (yellow in Figure 6a) returns the expected outcome for the AND gate. The strength of the sigmoid logistic function is that it can smoothly calculate the 0/1 probability for any combination of signals coming from the input neurons, activating itself much as a biological neuron would. Moreover, the sigmoid function helps guarantee that the Gradient Descent algorithm used to seek the θ values converges to the global minimum, avoiding the problems of a non-convex Cost Function with many local optima (see Logistic Regression for malignancy prediction in cancer for a detailed discussion of the issues related to non-convex Cost Functions). The Python implementation of the sigmoid logistic function is the following:
# Sigmoid Logistic (Activation) Function
def sigmoid(z):
    g = 1.0 / (1.0 + np.exp(-z))
    return g
Please copy and paste it into a new notebook cell. You can run the function, passing various z values as an argument, and see what output it produces, for instance:
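For instance, using the z values from the AND-gate example discussed above (the outputs are shown as comments):
sigmoid(-3)   # ≈ 0.047
sigmoid(-1)   # ≈ 0.269
sigmoid(1)    # ≈ 0.731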
Alternatively, we can implement a handy function for visualizing g(z) and its output, like this:
# Plot the Sigmoid function and its output:
def plotSigmoid(z):
    import numpy as np
    from matplotlib import pyplot as plt
    plt.rcParams["figure.figsize"] = [7.50, 3.50]
    plt.rcParams["figure.autolayout"] = True
    x = np.linspace(-10, 10, 100)

    # setting the axes at the centre
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.spines['left'].set_position('center')
    ax.spines['bottom'].set_position('zero')
    ax.spines['right'].set_color('none')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')

    plt.plot(x, sigmoid(x), color='blue')
    plt.plot(0, sigmoid(0), marker="o", markersize=5,
             markeredgecolor="black", markerfacecolor="black")
    if(z != 0):
        plt.text(0, sigmoid(0), " 0.5")
    plt.plot(z, sigmoid(z), marker="o", markersize=5,
             markeredgecolor="red", markerfacecolor="red")
    value = "{:5.3f}".format(sigmoid(z))
    plt.text(z, sigmoid(z), str(" z=" + str(z) + "; value=" + str(value)))
    plt.show()
For example, running plotSigmoid(1), the output for z = 1, which is 0.73, will be displayed as follows:
Simple Neural Network Intuition II: the OR gate
As for the AND gate, we can play with another example that involves the OR gate.
By switching the θ value for the bias neuron X0 from -3 to -1 and redoing all the calculations, the output now reflects the OR gate: three pairs of X1 and X2 inputs are mapped by the g(z) sigmoid function to values above 0.5, thus close to 1, while the pair (0, 0) stays below the threshold.
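As a quick check, the short snippet below (a sketch that reuses the sigmoid function defined above) prints the outputs produced by the two θ triads for all four input combinations:
# Verify the AND (θ = [-3, 2, 2]) and OR (θ = [-1, 2, 2]) gates
import numpy as np

for name, theta in [('AND', np.array([-3, 2, 2])),
                    ('OR', np.array([-1, 2, 2]))]:
    print(name)
    for x1 in (0, 1):
        for x2 in (0, 1):
            x = np.array([1, x1, x2])   # X0 is the bias unit, always 1
            print(x1, x2, '->', round(float(sigmoid(theta.dot(x))), 2))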
The Multi-Layer Perceptron (MLP) model
A Multi-Layer Neural Network consists of one input layer, one or more hidden layers, and one output layer. All the layers except the output layer include a bias neuron, and all the units are fully connected. The number of hidden layers differentiates a classic MLP from a Deep Learning Neural Network. Thanks to the nature of their architecture, Neural Networks can help us formulate complex non-linear hypotheses.
Observing the MLP's model representation, we notice that the calculations at the base of the MLP's functioning are similar to those done for Logistic Regression. This observation confirms that an MLP is nothing else than a "super" logistic regressor. The structure (Figure 8) shows three layers of neurons. The first layer is the input layer, composed of the X vector. The second layer consists of hidden nodes, which activate based on the product of X times θ. The outcome of these activations becomes the input for the third layer, the output layer. The third layer comprises activation units, too. The only difference is that its outcome should correspond to the hypothesis model hθ. And here, we have to use the conditional "should" because the heart of the matter in Neural Networks is precisely trying to minimize the error between the hypothesis and the predicted outcome deriving from the X times θ products.
The minimization process is embodied in the so-called training phase. It consists of finding, through Gradient Descent, those θ values (weights) for which Xθ fits the hθ requirements. Each layer j with sⱼ activation nodes is mapped to the next layer by a θ matrix of dimension sⱼ₊₁ times (sⱼ + 1), where the extra column accounts for the bias unit:
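In compact form, for a layer j with sⱼ units:
$$ \dim\left(\Theta^{(j)}\right) = s_{j+1} \times (s_j + 1) $$
For the network built below (2500 inputs, 1670 hidden units, 4 outputs), this gives a Θ⁽¹⁾ of dimension 1670 x 2501 and a Θ⁽²⁾ of dimension 4 x 1671, which is exactly how the code reshapes them later.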
Implementation of an MLP Neural Network for augmented Alzheimer MRI classification
We are going to create an MLP with a single hidden layer. The number of units in the hidden layer is calculated as 2/3 of the number of input units plus the number of output units. This heuristic generally guarantees successful training, although other architectures use more or fewer activation units in the hidden layer (see the worked example below). The number of input units depends on the number of pixels composing a single image in the dataset, and the number of output units corresponds to the category labels. The entire algorithm is structured into three main functions:
1. init
2. training
3. testing
These three main functions embody other functions we will describe at the appropriate time.
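As a worked example of the sizing rule mentioned above: with 50 x 50 images (2500 input units) and four labels (4 output units), the hidden layer gets int(2500 x 2/3) + 4 = 1666 + 4 = 1670 units, which are exactly the values printed by the init function below.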
The NN presented here is a feedforward MLP; thus, the input data is fed through the network in the forward direction to generate the output. Each hidden layer processes the data and passes it on to the next layer. During forward propagation, an activation occurs at each hidden-layer and output-layer node: the node computes the weighted sum of its inputs (including the bias) and then applies the activation function to that sum, which is what makes the network's behavior non-linear.
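Concretely, the forward pass reduces to a handful of NumPy lines; the sketch below uses the same names that appear later inside nnCostFunction (Th1 and Th2 are assumed to be the two weight matrices defined by init and training):
m = X.shape[0]                                  # number of samples
a1 = np.hstack((np.ones((m, 1)), X))            # input layer plus bias column
z2 = a1.dot(Th1.T)                              # weighted sums entering the hidden layer
a2 = np.hstack((np.ones((m, 1)), sigmoid(z2)))  # hidden activations plus bias column
z3 = a2.dot(Th2.T)                              # weighted sums entering the output layer
a3 = sigmoid(z3)                                # hθ: the network's output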
Initialization
After loading the dataset, define the X and y vectors:
with open('AugmentedAlzheimer.50X50.training.dataset.pickle', 'rb') as handle:
    ALZ = pickle.load(handle)

X = np.stack(ALZ['X'])
y = np.int64(ALZ['y'])
The X vector will contain 33,984 items, each containing a vector of 2500 grayscale values. The init function uses X.shape to initialize the NN architecture based on the size of X:
'''
function "init"
Neural Network initialization and parameter setup
'''
def init(X, y):
    I = X.shape[1]                            # n x n MRI input size (2500)
    H1 = int(X.shape[1] * 2/3) + max(set(y))  # hidden units size:
    # 2/3 of the input units + the number of output units (1670)
    O = max(set(y))                           # output units size or labels (4)
    m = X.shape[0]

    ini_Th1 = thetasRandomInit(I, H1)
    ini_Th2 = thetasRandomInit(H1, O)

    # Unroll parameters
    ini_nn_params = np.concatenate((ini_Th1.T.flatten(), ini_Th2.T.flatten()))
    ini_nn_params = ini_nn_params.reshape(len(ini_nn_params), 1)

    print('\nNeural Network Parameters initialized!\n')
    print('\nNeural Network structure:\n')
    print('Input layer neurons:', I)
    print('1st hidden layer neurons:', H1)
    print('Output layer neurons:', O)

    return ini_nn_params, I, H1, O
The parameters I, H1, and O correspond to the sizes of the input, hidden, and output layers; ini_nn_params contains an "unrolled" version of the θ vectors (Th1 for the input layer and Th2 for the hidden layer). Using an unrolled, flattened vector reduces the complexity of handling several separate vectors, making the code more readable. The init function accepts the vectors X and y as arguments and relies on the helper function thetasRandomInit, which randomizes Th1 and Th2. (These must be randomly generated before the training phase.)
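As a toy illustration of what "unrolling" means (made-up small shapes, not the real network sizes):
# two small weight matrices...
Th1_toy = np.arange(6).reshape(2, 3)   # shape (2, 3)
Th2_toy = np.arange(4).reshape(1, 4)   # shape (1, 4)
# ...flattened and concatenated into a single parameter vector...
unrolled = np.concatenate((Th1_toy.flatten(), Th2_toy.flatten()))
# ...and reshaped back into the two matrices when needed
Th1_back = unrolled[:6].reshape(2, 3)
Th2_back = unrolled[6:].reshape(1, 4)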
The thetasRandomInit function can be implemented as follows:
'''
function "thetasRandomInit"
Random initialization of thetas
'''
def thetasRandomInit(L_in, L_out):
    e_init = 0.12
    W = np.random.rand(L_out, 1 + L_in) * 2 * e_init - e_init
    return W
The W matrix, which contains random values, is generated based on an e_init value, set to 0.12 by default. Please copy and paste the previous code into a new cell and run it. To call init and define the NN parameters ini_nn_params, I, H1, and O, type:
ini_nn_params, I, H1, O = init(X, y)
The init function will initialize the four variables and will produce the output:
Neural Network Parameters initialized!

Neural Network structure:
Input layer neurons: 2500
1st hidden layer neurons: 1670
Output layer neurons: 4
This output describes the NN structure, comprising 2500 input units, 1670 hidden units, and four output units.
Training
To train the NN with the dataset of Alzheimer's MRIs we loaded (the X and y vectors), we need to create a training function, which is just "wrapping" code that passes nnCostFunction to the opt object:
'''
function "training"
Neural Network training
'''
def training(ini_nn_params, I, H1, O, X, y):
    import scipy.optimize as opt

    lambd = 1
    print('\nTraining Neural Network... \n')

    result = opt.minimize(fun=nnCostFunction,
                          x0=ini_nn_params,
                          args=(I, H1, O, X, y, lambd),
                          method='TNC',
                          jac=True,
                          options={'maxiter': 1000, "disp": True})
    params_trained = result.x

    Th1_trained = np.reshape(params_trained[0:(H1 * (I + 1)), ],
                             (H1, I + 1))
    Th2_trained = np.reshape(params_trained[(H1 * (I + 1)):, ],
                             (O, H1 + 1))

    print('\nTrained.\n')
    return Th1_trained, Th2_trained
The training function uses the SciPy object opt to minimize the Neural Network's Cost Function (nnCostFunction) through its Gradient. The function establishes how many iterations are allowed for the minimization through the "maxiter" option. As specified in the argument list of opt.minimize, the method used in this case is TNC (Truncated Newton), which can minimize a scalar function, like nnCostFunction, of one or more variables. Finally, training returns the two θ vectors, Th1 and Th2, trained and ready to be used in the testing phase.
The nnCostFunction is the most complex routine presented here, the heart of the whole algorithm. We are now going to explain what it does in more detail. I suggest keeping the schema in Figure 8 and the calculations in Figure 9 at hand while you study the nnCostFunction code.
'''
nnCostFunction
Implements the Neural Network Cost Function
'''
def nnCostFunction(nn_params, I, H1, O, X, y, lambd):
    # 1. RESHAPING
    # Reshape nn_params back into the parameters Th1 and Th2,
    # the weight matrices
    Th1 = np.reshape(nn_params[0:(H1 * (I + 1)), ],
                     (H1, I + 1))
    Th2 = np.reshape(nn_params[(H1 * (I + 1)):, ],
                     (O, H1 + 1))

    # 2. SETUP OF Y
    # Setup the output (Y) layout
    m = X.shape[0]
    Y = np.zeros((m, O))
    for i in range(m):
        Y[i, y[i] - 1] = 1

    # 3. INITIALIZE J AND THE GRADIENTS
    J = 0
    Th1_grad = np.zeros(Th1.shape)
    Th2_grad = np.zeros(Th2.shape)

    # 4. PREPARE ALL THE VECTORS FOR THE FORWARD PROPAGATION
    # Six new vectors are generated here: a1, z2, a2, z3, a3, and h.
    # The vector a1 equals X (the input matrix),
    # with a column of 1's added (bias units) as the first column.
    a1 = np.hstack((np.ones((m, 1)), X))
    # z2 equals the product of a1 and Th1
    z2 = np.dot(a1, Th1.T)
    # The vector a2 is created by adding a column of bias units
    # after applying the sigmoid function to z2.
    a2 = np.hstack((np.ones((m, 1)), sigmoid(z2)))
    # z3 equals the product of a2 and Th2
    z3 = np.dot(a2, Th2.T)
    a3 = sigmoid(z3)
    # The hypothesis h is = a3
    h = a3

    # 5. COMPUTE THE REGULARIZED COST FUNCTION
    # calculate the regularization term p
    p = np.sum([
        np.sum(np.sum([np.power(Th1[:, 1:], 2)], 2)),
        np.sum(np.sum([np.power(Th2[:, 1:], 2)], 2))
    ])
    # Calculate the Cost Function
    J = np.sum([
        np.divide(np.sum(np.sum([np.subtract(np.multiply((-Y), np.log(h)),
                  np.multiply((1 - Y), np.log(1 - h)))], 2)), m),
        np.divide(np.dot(lambd, p), np.dot(2, m))
    ])

    # 6. BACKPROPAGATION
    # d3 is the difference between a3 and Y.
    d3 = np.subtract(a3, Y)
    z2 = np.hstack((np.ones((m, 1)), z2))
    d2 = d3.dot(Th2) * gradientDescent(z2)
    d2 = d2[:, 1:]

    # GRADIENTS
    # delta_1 is the product of d2 and a1.
    delta_1 = np.dot(d2.T, a1)
    # delta_2 is the product of d3 and a2.
    delta_2 = np.dot(d3.T, a2)

    # Regularized Gradients.
    P1 = (lambd / m) * np.hstack([np.zeros((Th1.shape[0], 1)), Th1[:, 1:]])
    P2 = (lambd / m) * np.hstack([np.zeros((Th2.shape[0], 1)), Th2[:, 1:]])
    Theta_1_grad = (delta_1 / m) + P1
    Theta_2_grad = (delta_2 / m) + P2
    grad = np.hstack((Theta_1_grad.ravel(), Theta_2_grad.ravel()))

    return J, grad
The function nnCostFunction() accepts as arguments: nn_params, which is the ini_nn_params we created before and which contains an unrolled version of the θ vectors; I, H1, and O (respectively the sizes of the input, hidden, and output layers); the two vectors X and y; and lambd, which corresponds to the regularization parameter lambda. The function returns J and the unrolled θ gradient vector.
1. First, nnCostFunction reshapes ini_nn_params back into the parameters Th1 and Th2, considering the original sizes of these two vectors.
2. The code needs to set up the output layer, creating a matrix Y with one column per label and one row per sample of X, initially set to zero and then filled with 1's so as to codify the output in this way:
Structure of Y (the output layer):
1,0,0,0 for label 1
0,1,0,0 for label 2
0,0,1,0 for label 3
0,0,0,1 for label 4
This new codification is then stored in the Y matrix.
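As a side note, the same one-hot layout can be obtained without the explicit loop; a NumPy sketch equivalent to the loop in the code above (with m, O, and y as defined inside nnCostFunction):
Y = np.zeros((m, O))
Y[np.arange(m), y - 1] = 1   # labels 1-4 map to columns 0-3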
3. Initialize the Cost J and the θ gradient matrices with zeros for the Gradient Descent.
4. Prepare all the vectors for the forward propagation. Six new vectors are generated here: a1, z2, a2, z3, a3, and h. The vector a1 equals X (the input matrix), with a column of 1's added (bias units) as the first column. The vector z2 equals the product of a1 and θ1 (Th1). The vector a2 is created by adding a column of bias units after applying the sigmoid function to z2. The vector z3 equals the product of a2 and θ2 (Th2). Finally, a3, the sigmoid of z3, is copied into the vector h, representing the hypothesis hθ. This step may look unimportant, but duplicating a3 in h is essential for readability.
5. This part of the function implements the Cost Function, which we want to be regularized. Recall that the formula for the Logistic Regression Cost Function is:
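In standard notation:
$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] $$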
And that the regularized version is:
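Again in standard notation, with λ weighting the penalty on the squared θ values (the bias term θ₀ is excluded):
$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n} \theta_j^2 $$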
This implementation adds the regularization by multiplying lambda by the penalty term p (written explicitly here rather than as a summation in the formula); p is calculated as the sum of the squared values of θ1 and θ2, excluding the bias columns.
6. This part of the code performs the backpropagation. To do so, the code creates a new vector, d3, containing the difference between a3 and Y; here, z2 is also manipulated by adding a column of 1's, while a new vector, d2, contains the product of d3 and Th2, multiplied element-wise by the sigmoid gradient applied to z2.
The sigmoid gradient (called gradientDescent in this code) returns the derivative of the sigmoid, g'(z) = g(z)·(1 − g(z)), and is implemented as follows:
'''
gradientDescent
returns the gradient of the sigmoid function
'''
def gradientDescent(z):
    g = np.multiply(sigmoid(z), (1 - sigmoid(z)))
    return g
To run the training function, invoke it, specifying as arguments ini_nn_params, I, H1, O, X, and y:
Th1_trained, Th2_trained = training(ini_nn_params, I, H1, O, X, y)
Note:
The training can take twelve hours of calculation. A couple of trained θ vectors are available as files at these two links: Theta1, Theta2
After the download, you can upload the two files by typing:
# Upload thetas from the two files
with open('ALZ.50.df_theta_1.pickle', 'rb') as handle:
    Th1_trained = pickle.load(handle).values

with open('ALZ.50.df_theta_2.pickle', 'rb') as handle:
    Th2_trained = pickle.load(handle).values
Testing
The testing function serves to test the NN training:
'''
testing()
'''
def testing(Th1, Th2, X, y):
    m, n = X.shape
    X = np.hstack((np.ones((m, 1)), X))
    a2 = sigmoid(X.dot(Th1.T))
    a2 = np.hstack((np.ones((m, 1)), a2))
    a3 = sigmoid(a2.dot(Th2.T))
    pred = np.argmax(a3, axis=1)
    pred += 1
    print('Accuracy:', np.mean(pred == y) * 100)
    return pred
The function is straightforward and conceptually retraces the forward propagation path. After appending the bias column of 1's to X, a2 receives the sigmoid of X times θ1, and after adding the bias column of 1's to a2, a3 receives the sigmoid of a2 times θ2. The vector pred (prediction) contains, for each sample, the index of the maximum value of a3, shifted by one to match the 1-based labels. Finally, the function prints the percentage of matches between the pred and y vectors and returns pred.
Before running the code, load the testing version of the same dataset:
# LOAD TESTING DATASET
with open('Alz.original.50X50.testing.dataset.pickle', 'rb') as handle:
    ALZ = pickle.load(handle)

X = np.stack(ALZ['X'])
y = np.int64(ALZ['y'])
To run the testing, invoke it, specifying as arguments the two trained θ vectors, X, and y:
pred = testing(Th1_trained, Th2_trained, X, y)
With this NN architecture, the accuracy for the Alzheimer MRI testing dataset is 95%.
It would be great to see the NN in action, showing some concrete results. The following function, a modified version of plotSamplesRandomly (the function we used to randomly display MRIs from the dataset), suits this purpose. It takes the vectors X and y as arguments, plus the vector pred created during the testing phase:
'''
plotSamplesRandomlyWithPrediction
Function for visualizing fifty randomly picked images with their prediction.
'''
def plotSamplesRandomlyWithPrediction(X, y, pred):
    from random import randint
    from matplotlib import pyplot as plt

    # create a list of randomly picked indexes.
    # the function randint creates the list, picking numbers in a
    # range 0-Xn, which covers the length of X
    randomSelect = [randint(0, len(X) - 1) for i in range(0, 51)]

    # reshape all the pictures on the n X n pixels
    w, h = int(np.sqrt(X.shape[1])), int(np.sqrt(X.shape[1]))
    fig = plt.figure(figsize=(int(np.sqrt(X.shape[1])), int(np.sqrt(X.shape[1]))))

    # Define a grid of 10 X 10 for the big plot.
    columns = 10
    rows = 10

    # The for loop
    for i in range(1, 51):
        # create the 2-dimensional picture
        image = X[randomSelect[i]].reshape(w, h)
        ax = fig.add_subplot(rows, columns, i)
        # create a title for each picture, containing #index, label, and prediction
        title = "#" + str(randomSelect[i]) + "; " + "y:" + str(y[randomSelect[i]]) + "; " + "p:" + str(pred[randomSelect[i]])
        # set the title font size
        ax.set_title(title, fontsize=int(np.sqrt(X.shape[1]) / 2))
        # don't display the axis
        ax.set_axis_off()
        # plot the image in grayscale
        plt.imshow(image, cmap='gray')

    plt.show()
And run the function as follows:
plotSamplesRandomlyWithPrediction(X, y, pred)
A possible output of plotSamplesRandomlyWithPrediction could be:
Conclusions
Neural Networks are an excellent solution for classifying complex images, especially when these come from neuroimaging. In other articles dedicated to similar topics, we have seen how poorly contrasted images can be difficult to classify. The Augmented Alzheimer MRI dataset provided on Kaggle has some advantages, since each image is well contrasted. However, the complexity introduced by the diversity of patterns characterizing each pathological class is high and not always identifiable at a glance.
This article aimed to present a computationally light and relatively fast technique for classification and pattern prediction, with a testing accuracy of ~95%. Tools like this offer new opportunities for quick diagnosis and a clever way to extract essential features in pathology.
The article was also an occasion to explain in detail the mechanisms underlying the construction and training of Multi-Layer Neural Networks. Some intuitions have been offered regarding the Neural Network computational process in logic gates and the strong correlation with Logistic Regression, demonstrating how Neural Networks are, to all effects, a super Logistic Regression model.
The code described in this article can be extended by increasing the number of hidden layers. For editorial reasons, I didn't share these code extensions; moreover, increasing the hidden layers in this specific application does not improve the accuracy percentage; instead, it makes it worse. Anyway, code for two and three hidden layers is available for those interested.
References
- The Augmented Alzheimer MRI Dataset from Kaggle is available under the GNU Lesser General Public License. It represents a branch of the Kaggle Alzheimer's Dataset (4 classes of images), available under the Open Data Commons Open Database License (ODbL) v1.0.
- Luca Zammataro, Linear Regression with one or more variables, Towards Data Science, 2019
- Luca Zammataro, Logistic Regression for malignancy prediction in cancer, Towards Data Science, 2019
- Luca Zammataro, Detecting Melanoma with One-vs-All, Towards Data Science, 2020
- Luca Zammataro, One-vs-All Logistic Regression for Image Recognition in Python, Towards Data Science, 2022
- Andrew Ng, Machine Learning | Coursera
- Chris Albon, Machine Learning with Python Cookbook, O’Reilly, ISBN-13: 978–1491989388.
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems 2nd Edition, O’Reilly, ISBN-10: 1492032646
- Wen, J. et al.: Convolutional Neural Networks for Classification of Alzheimer’s Disease: Overview and Reproducible Evaluation