Most of the best-performing artificial-intelligence systems, such as self-driving cars, voice search and voice-activated assistants, automatic machine translation, and image recognition, were made possible by deep learning. Deep learning is a branch of artificial intelligence powered by neural networks.
Neural networks are a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. The first trainable neural network, the Perceptron, was demonstrated by the Cornell University psychologist Frank Rosenblatt in 1957. The Perceptron’s design was much like that of the modern neural net, except that it had only one layer with adjustable weights and thresholds, sandwiched between input and output layers. It's worth understanding what a simple perceptron looks like and how it works. A perceptron takes several binary inputs, $x_1, x_2, x_3, x_4, \dots$, and produces a single binary output.
In the above example the perceptron has four inputs, $x_1, x_2, x_3, x_4$. The perceptron also stores some more numbers called weights, $w_1, w_2, w_3, w_4$: real numbers expressing the importance of the respective inputs to the output. The perceptron's output is determined by the weighted sum $\sum_i x_i w_i$ fed to an activation function. The activation function outputs 0 or 1, depending on how the weighted sum compares with the threshold.
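The algebraic notation will look like:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i \le \text{threshold} \\ 1 & \text{if } \sum_i w_i x_i > \text{threshold} \end{cases}$$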
Let's write some simple code for the above perceptron in Python. We will build everything from scratch without using any external packages. The mathematics is pretty simple: we calculate the loss for each input and update the weights accordingly.
    from random import random


    class Perceptron(object):

        def __init__(self, x_train, y_train, epochs=1000, learn_rate=0.1):
            self.accuracy = 0  # number of correct predictions so far
            self.samples = 0   # number of predictions made so far
            self.x_train = x_train
            self.y_train = y_train
            self.epochs = epochs
            self.learn_rate = learn_rate
            self.bias = 0
            # one random starting weight per input feature
            self.weights = [random() for _ in range(len(x_train[0]))]

        # returns the fraction of correct predictions so far
        def current_accuracy(self):
            return self.accuracy / self.samples

        # activation function: a step with a threshold of 2
        def activation(self, n):
            return 0 if n < 2 else 1

        def predict(self, ip):
            total = self.bias
            for i, j in enumerate(self.weights):
                total += ip[i] * j
            return self.activation(total)

        def fit(self):
            for e in range(self.epochs):
                print('#### Running for #### Epoch: ' + str(e))
                for i in range(len(self.x_train)):
                    prediction = self.predict(self.x_train[i])
                    print('Expected: ' + str(self.y_train[i]) + ' Model Output: ' + str(prediction))
                    # count only correct predictions, so accuracy / samples stays a true ratio
                    if self.y_train[i] == prediction:
                        self.accuracy += 1
                    self.samples += 1
                    # loss is 0 when correct, +1 or -1 when wrong
                    loss = self.y_train[i] - prediction
                    for w in range(len(self.weights)):
                        self.weights[w] += loss * self.x_train[i][w] * self.learn_rate
                    self.bias += loss * self.learn_rate
                print('Epoch: %s: Accuracy: %s ' % (str(e), str(self.current_accuracy())))


    x = [[1, 1, 1], [0, 0, 0], [1, 0, 1]]
    y = [1, 0, 0]
    perceptron = Perceptron(x, y, epochs=1000, learn_rate=0.1)
    perceptron.fit()
Applications
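The above perceptron model can be used to implement simple logic gates such as AND or OR, depending on how you train it (a single perceptron cannot represent XOR, which is not linearly separable). Based on our training data and an activation threshold of 2, the perceptron will behave as a logic AND gate.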
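Table: Perceptron implementing AND gate with Threshold of 2

| Input 0 | Input 1 | Input 2 | Threshold | Output |
|---------|---------|---------|-----------|--------|
| 0       | 0       | 0       | 2         | 0      |
| 1       | 0       | 1       | 2         | 0      |
| 1       | 1       | 1       | 2         | 1      |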
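As a quick check of the table, here is a minimal sketch of a three-input perceptron with the same threshold of 2. The weights (0.7 each) and the zero bias are hand-picked for illustration; they are an assumption, not values produced by the training code.

```python
# Hand-picked (hypothetical) parameters: three equal weights and no bias.
WEIGHTS = [0.7, 0.7, 0.7]
BIAS = 0.0
THRESHOLD = 2


def perceptron_output(inputs):
    """Return 1 if the weighted sum reaches the threshold, else 0."""
    total = BIAS + sum(x * w for x, w in zip(inputs, WEIGHTS))
    return 0 if total < THRESHOLD else 1


# The three rows of the table.
for row in ([0, 0, 0], [1, 0, 1], [1, 1, 1]):
    print(row, "->", perceptron_output(row))
```

With these weights, only the row with all three inputs set pushes the weighted sum past 2, which is the AND behaviour the table describes.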
Combining perceptrons
The perceptron isn't a complete model of human decision-making! Perceptrons can, however, be chained together: the output of one perceptron can be used as the input to another. A standard neural network consists of many simple, connected processors called perceptrons, each producing a sequence of real-valued activations.
Input neurons get activated through sensors perceiving the environment; other neurons (hidden-layer neurons) get activated through weighted connections from previously active neurons.
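To make the idea of chaining concrete, here is a small sketch that wires three threshold perceptrons into an XOR gate, something a single perceptron cannot compute. The helper `perceptron` and the weights and thresholds are hand-picked for this illustration and are assumptions, not part of the earlier training code.

```python
def perceptron(inputs, weights, threshold):
    """Threshold unit: 1 if the weighted sum reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


def xor(a, b):
    # Hidden layer: two perceptrons look at the same inputs.
    h_or = perceptron([a, b], [1, 1], 1)        # fires if a OR b
    h_nand = perceptron([a, b], [-1, -1], -1)   # fires unless a AND b
    # Output layer: a third perceptron takes the hidden outputs as its inputs.
    return perceptron([h_or, h_nand], [1, 1], 2)  # fires if both hidden units fire


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))
```

The output perceptron never sees the raw inputs; it only sees what the two hidden perceptrons report, which is the basic pattern behind a hidden layer.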
Let's simplify our perceptron model. The first change is to write $\sum_i x_i w_i$ as a dot product, $w \cdot x = \sum_i w_i x_i$, where $w$ and $x$ are the weight and input vectors. We will also use a bias, $b = -\text{threshold}$, instead of the threshold. The modified equation will look like:
$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$$
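The sketch below checks that moving the threshold to the other side of the inequality as a bias does not change the output. The function names, the weights and the threshold value are illustrative assumptions.

```python
from itertools import product


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def output_with_threshold(w, x, threshold):
    # Original form: compare the weighted sum with a threshold.
    return 1 if dot(w, x) > threshold else 0


def output_with_bias(w, x, b):
    # Simplified form: add a bias and compare with zero.
    return 1 if dot(w, x) + b > 0 else 0


w = [0.6, 0.6]       # hypothetical weights
threshold = 1.0      # hypothetical threshold
b = -threshold       # the bias is the negated threshold

for x in product([0, 1], repeat=2):
    print(x, output_with_threshold(w, x, threshold), output_with_bias(w, x, b))
```

Both columns agree for every binary input, so the bias form is only a change of notation.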
Learning
Perceptrons are good at separating an input space into two parts (the output). Training a perceptron amounts to adjusting the weights and biases such that it rotates and shifts a line until the input space is properly partitioned. The perceptron uses a step function to map inputs to (binary) outputs, which means a small change in a weight can result in an entirely different output. Although a real neuron either fires an action potential or not (i.e., generates a 1 or 0), its response is often described by its firing rate (i.e., the number of spikes per unit time), resulting in a graded response (Adrian and Matthews, 1927). To model these responses, we will use a different neural network model, called a linear network made of neurons. Neurons are similar to perceptrons, modified so that a small change in their weights and biases results in a small change in their output.
Neuron
A neuron is like a perceptron and has inputs $x_1, x_2, x_3, \dots$; the inputs can take any value between 0 and 1, such as 0.345. It also has weights $w_1, w_2, w_3, \dots$.
The output can be any value between 0 and 1, given by $\sigma(x \cdot w + b)$, where $\sigma$ is called the sigmoid function, i.e., the activation function for the neuron. The sigmoid function is given by
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid function is a squashing function: it limits the output to a range between 0 and 1, which makes it useful for predicting probabilities.
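A short sketch makes the difference between the two activations visible. It evaluates a step function and the sigmoid just below and just above zero; the sample values of `z` are arbitrary choices for illustration.

```python
import math


def step(z):
    """Perceptron-style activation: jumps from 0 to 1 at z = 0."""
    return 1 if z > 0 else 0


def sigmoid(z):
    """Smooth activation: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# A tiny change in z flips the step output but barely moves the sigmoid.
for z in (-0.01, 0.01):
    print("z =", z, "step =", step(z), "sigmoid =", round(sigmoid(z), 4))
```

The step output flips between 0 and 1, while the sigmoid output stays close to 0.5 on both sides, which is the graded behaviour the neuron model relies on.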
Let's try to understand the above 3-input neuron with a simple example. Assume we have the parameters below for the neuron, with the sigmoid as the activation function.
$$w^{(1)} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \end{bmatrix}$$
For input $x = \begin{bmatrix} 3 & 5 \\ 5 & 1 \\ 10 & 2 \end{bmatrix}$, the output is computed from the dot product of the input and weight matrices.
The estimates are utter trash. This is because we have not trained our network. In the next step we will train our network.
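The untrained forward pass can be sketched in plain Python as follows. The 3×1 shape of `W2`, the random Gaussian initialisation and the helper names are assumptions made for this illustration; only the input matrix and the 2×3 shape of the first weight matrix come from the example above.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def forward(X, W1, W2):
    z2 = matmul(X, W1)                              # input -> hidden pre-activation
    a2 = [[sigmoid(v) for v in row] for row in z2]  # hidden activations
    z3 = matmul(a2, W2)                             # hidden -> output pre-activation
    return [[sigmoid(v) for v in row] for row in z3]


random.seed(0)  # fixed seed so the sketch is repeatable
X = [[3, 5], [5, 1], [10, 2]]
W1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]  # 2 x 3
W2 = [[random.gauss(0, 1)] for _ in range(3)]                    # 3 x 1 (assumed)

y_hat = forward(X, W1, W2)
print(y_hat)  # one untrained estimate per input row
```

Because the weights are random, the printed estimates carry no information about the targets yet; training is what changes that.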
Training
To improve our model, we need to quantify how wrong our predictions are. We do this with a cost function.
Training a Network = Minimizing a Cost Function
Cost Function
To quantify how well we're able to find weights and biases that approximate the output, we define a cost function.
Let's use the mean squared error (MSE) as our cost function:
$$MSE = \frac{1}{2n}\sum_{i=1}^{n}\left(y_{actual} - \hat{y}\right)^2 \tag{5}$$
Substituting the values of equations (1), (2), (3), (4) into equation (5):
$$J = \frac{1}{2n}\sum_{i=1}^{n}\left(y_{actual} - \sigma\left(\sigma\left(XW^{(1)}\right)W^{(2)}\right)\right)^2 \tag{6}$$
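As a small worked example of the cost, the sketch below evaluates the mean squared error with the same $\frac{1}{2n}$ factor for a handful of predictions. The target and prediction values are made up for illustration.

```python
def cost(y_actual, y_hat):
    """Mean squared error with the 1/(2n) factor used in equation (5)."""
    n = len(y_actual)
    return sum((y - p) ** 2 for y, p in zip(y_actual, y_hat)) / (2 * n)


# Hypothetical targets and two sets of predictions.
y_actual = [0.75, 0.82, 0.93]
poor_guess = [0.5, 0.5, 0.5]
better_guess = [0.7, 0.8, 0.9]

print("poor guess cost:  ", cost(y_actual, poor_guess))
print("better guess cost:", cost(y_actual, better_guess))
```

The closer the predictions are to the targets, the smaller the cost, and a perfect prediction gives a cost of exactly zero.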
With the help of calculus we can easily work out which way is downhill: if $\frac{\partial J}{\partial w}$ is positive, the cost function is going uphill, and vice versa. The reason we choose the cost function to be the sum of squared errors is that a quadratic is convex in the error, which helps avoid the non-convex behaviour other choices can introduce.
The cost function is a function of $w$ and $b$. Since we have little control over the input data, we will minimize the cost by adjusting the weights. The cost function becomes small, i.e. $C(w,b) \simeq 0$, when $y_{actual}$ is approximately equal to $y_{predict}$. We can write it as:
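To perform gradient descent we need $\frac{\partial J}{\partial w}$. We have two weight matrices, $w^{(1)}$ and $w^{(2)}$, so we compute the gradient with respect to each.
We now have the final term, $\frac{\partial J}{\partial w^{(1)}}$, to compute.
I am skipping most of the steps, as the derivation of $\frac{\partial J}{\partial w^{(1)}}$ is the same as that of $\frac{\partial J}{\partial w^{(2)}}$. Therefore equation-7 for the derivative w.r.t. $w^{(1)}$ becomes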
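$$\frac{\partial J}{\partial w^{(1)}} = -\left(y_{actual} - \hat{y}\right)\frac{\partial \hat{y}}{\partial w^{(1)}} \tag{11}$$
$$= -\left(y_{actual} - \hat{y}\right) f'\left(z^{(3)}\right) * \frac{\partial z^{(3)}}{\partial w^{(1)}}$$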
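From the chain rule of differentiation, $\frac{\partial z^{(3)}}{\partial w^{(1)}}$ can be represented as
$$\frac{\partial z^{(3)}}{\partial w^{(1)}} = \frac{\partial z^{(3)}}{\partial a^{(2)}} * \frac{\partial a^{(2)}}{\partial w^{(1)}}$$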
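Substituting these values in equation-11, and writing $\delta^{(3)} = -\left(y_{actual} - \hat{y}\right) f'\left(z^{(3)}\right)$:
$$\frac{\partial J}{\partial w^{(1)}} = \delta^{(3)}\left(\frac{\partial z^{(3)}}{\partial a^{(2)}} * \frac{\partial a^{(2)}}{\partial w^{(1)}}\right) \tag{12}$$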
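$\frac{\partial z^{(3)}}{\partial a^{(2)}}$ is the rate of change of $z^{(3)}$ w.r.t. $a^{(2)}$; the relation between the two is linear, so it can be represented as $w^{(2)}$. Therefore equation-12 becomes
$$\frac{\partial J}{\partial w^{(1)}} = \delta^{(3)}\left(\left(w^{(2)}\right)^T * \frac{\partial a^{(2)}}{\partial w^{(1)}}\right) \tag{13}$$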
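Since $a^{(2)}$ is a function of $z^{(2)}$, we have
$$\frac{\partial a^{(2)}}{\partial w^{(1)}} = \frac{\partial a^{(2)}}{\partial z^{(2)}} * \frac{\partial z^{(2)}}{\partial w^{(1)}}$$
$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = f'\left(z^{(2)}\right)$$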
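Substituting the above value in equation-13:
$$\frac{\partial J}{\partial w^{(1)}} = \delta^{(3)}\left(w^{(2)}\right)^T * f'\left(z^{(2)}\right) * \frac{\partial z^{(2)}}{\partial w^{(1)}} \tag{14}$$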
From equation-1, $z^{(2)}$ is a function of the input $X$ and $w^{(1)}$. Therefore, equation-14 becomes
$$\frac{\partial J}{\partial w^{(1)}} = X^T \delta^{(3)}\left(w^{(2)}\right)^T * f'\left(z^{(2)}\right) \tag{15}$$
We will add one more method to our Neuron class to compute the gradients, i.e. $\frac{\partial J}{\partial w^{(2)}}$ and $\frac{\partial J}{\partial w^{(1)}}$.
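The sketch below is one possible way to compute both gradients, written as standalone functions rather than as the Neuron class method. It follows equation-15 for $w^{(1)}$ and assumes the standard result $\frac{\partial J}{\partial w^{(2)}} = \left(a^{(2)}\right)^T \delta^{(3)}$ for the second matrix. The helper names, the scaled inputs, the target values and the extra $\frac{1}{n}$ factor (added so that the gradient matches the $\frac{1}{2n}$ cost) are all assumptions for illustration. A numerical derivative is used as a sanity check.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def map2d(f, m):
    return [[f(v) for v in row] for row in m]


def hadamard(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def forward(X, W1, W2):
    z2 = matmul(X, W1)
    a2 = map2d(sigmoid, z2)
    z3 = matmul(a2, W2)
    return z2, a2, z3, map2d(sigmoid, z3)


def cost(X, y, W1, W2):
    y_hat = forward(X, W1, W2)[3]
    return sum((t[0] - p[0]) ** 2 for t, p in zip(y, y_hat)) / (2 * len(y))


def gradients(X, y, W1, W2):
    n = len(y)
    z2, a2, z3, y_hat = forward(X, W1, W2)
    # -(y - y_hat), scaled by 1/n to match the 1/(2n) cost (assumption)
    error = [[-(t[0] - p[0]) / n] for t, p in zip(y, y_hat)]
    delta3 = hadamard(error, map2d(sigmoid_prime, z3))
    dJdW2 = matmul(transpose(a2), delta3)
    # delta3 (W2)^T * f'(z2), then X^T in front, as in equation-15
    delta2 = hadamard(matmul(delta3, transpose(W2)), map2d(sigmoid_prime, z2))
    dJdW1 = matmul(transpose(X), delta2)
    return dJdW1, dJdW2


random.seed(1)
X = [[0.3, 0.5], [0.5, 0.1], [1.0, 0.2]]   # example inputs divided by 10 (assumption)
y = [[0.75], [0.82], [0.93]]               # hypothetical targets
W1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]
W2 = [[random.gauss(0, 1)] for _ in range(3)]

dJdW1, dJdW2 = gradients(X, y, W1, W2)

# Numerical check: nudge one weight and compare with the analytic gradient.
eps = 1e-5
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus[0][0] -= eps
numeric = (cost(X, y, W1_plus, W2) - cost(X, y, W1_minus, W2)) / (2 * eps)
print("analytic:", dJdW1[0][0], "numeric:", numeric)
```

If the two printed numbers agree to several decimal places, the chain-rule bookkeeping is most likely right; a mismatch usually points to a missing transpose or a wrong element-wise product.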
Gradient descent is an algorithm for finding the nearest local minimum of a function. The method of steepest descent, also called the gradient descent method, starts at a point $P_0$ and, as many times as needed, moves from $P_i$ to $P_{i+1}$ by minimizing along the line extending from $P_i$ in the direction of the negative gradient. Our goal here is to find the weights and biases so that the output from the network approximates $y(x)$ for all training inputs.
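To show the update rule in isolation, here is a minimal sketch of gradient descent on a one-dimensional function. It uses a fixed step size (a learning rate) rather than the line minimization described above, which is the simpler variant commonly used when training networks. The function, starting point, learning rate and step count are arbitrary choices for illustration.

```python
def f(w):
    """Toy cost with a single minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_f(w):
    """Derivative of the toy cost."""
    return 2.0 * (w - 3.0)


def gradient_descent(start, learn_rate=0.1, steps=100):
    w = start
    for _ in range(steps):
        # Step against the gradient: positive slope means move left, and vice versa.
        w -= learn_rate * grad_f(w)
    return w


w_min = gradient_descent(start=0.0)
print("w:", w_min, "cost:", f(w_min))
```

Training the network follows the same pattern, with the scalar `w` replaced by the weight matrices and `grad_f` replaced by the gradients of the cost with respect to them.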