Chapter 2 - The mathematical building blocks of neural networks
Notes on the book Deep Learning with Python, 2nd edition
- Required packages
- A first look at a Neural Network
- Data representation for neural networks: Tensors
- The gears of neural networks: Tensor operations
- The engine of neural networks: Gradient-based optimization
- Reimplementing our first example from scratch in TensorFlow
%config Completer.use_jedi = False
import numpy as np
import tensorflow as tf
Task: classify grayscale images of handwritten digits (28 × 28 pixels) into their 10 categories (0 through 9).
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Training data:
train_images.shape
len(train_labels)
train_labels
Test data:
test_images.shape
len(test_labels)
test_labels
Define a basic multi-layer network
model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax")
    ]
)
Compile the model by specifying the optimization algorithm, the loss function and the metrics to track:
model.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
Transform the features from an array of shape (60000, 28, 28) with integer values in the range [0, 255] into a flat array of shape (60000, 28 * 28) with float values in the range [0, 1].
train_images.shape
train_images.dtype
train_images = train_images.reshape((60000, 28*28))
train_images = train_images.astype("float32") / 255
train_images.shape
train_images.dtype
test_images.shape
test_images.dtype
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255
test_images.shape
test_images.dtype
model.fit(train_images, train_labels, epochs=20, batch_size=128)
Select the first 10 images of the test set.
test_digits = test_images[0:10]
test_digits.shape
Compute predictions for the first 10 images:
predictions = model.predict(test_digits)
Class probabilities for the first test image:
predictions[0]
Pick the class with the highest probability:
predictions[0].argmax()
Check the true label of the first test image:
test_labels[0]
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.4}\nTest loss: {test_loss:.6}")
Tensors are the basic data structures used in machine learning. Tensors are multi-dimensional arrays. In the context of tensors, a dimension is often called an axis.
In deep learning, you'll generally manipulate tensors with ranks 0 to 4, although you may go up to 5 if you process video data.
x = np.array(12)
x.ndim
x = np.array([12, 3, 6, 14, 7])
x.ndim
x = np.array(
    [
        [5, 78, 2, 34, 0],
        [6, 79, 3, 35, 1],
        [7, 80, 4, 36, 2]
    ]
)
x.ndim
If you pack such matrices in a new array, you obtain a rank-3 tensor (or 3D tensor)
x = np.array(
    [
        [
            [5, 78, 2, 34, 0],
            [6, 79, 3, 35, 1],
            [7, 80, 4, 36, 2]
        ],
        [
            [5, 78, 2, 34, 0],
            [6, 79, 3, 35, 1],
            [7, 80, 4, 36, 2]
        ],
        [
            [5, 78, 2, 34, 0],
            [6, 79, 3, 35, 1],
            [7, 80, 4, 36, 2]
        ]
    ]
)
x.ndim
- Number of axes
train_images.ndim
- Shape
train_images.shape
- Data type
train_images.dtype
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Slicing arrays. The following three statements are equivalent; each selects 90 images, at indices 10 to 99:
my_slice = train_images[10:100]
my_slice.shape
my_slice = train_images[10:100, :, :]
my_slice.shape
my_slice = train_images[10:100, 0:28, 0:28]
my_slice.shape
Select 14x14 pixels in the bottom-right corner of all images:
my_slice = train_images[:, 14:, 14:]
my_slice.shape
Select 14x14 pixels from the middle of all images:
my_slice = train_images[:, 7:-7, 7:-7]
my_slice.shape
- Vector data: 2D-tensor of shape (number_samples, features)
- Timeseries or sequence data: 3D-tensor of shape (number_samples, timesteps, features)
- Images: 4D-tensor of shape (number_samples, height, width, channels)
- Video: 5D-tensor of shape (number_samples, frames, height, width, channels)
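As a quick illustration of these shapes, here is a minimal sketch with arbitrary, made-up sizes:
vector_data = np.zeros((1000, 20))           # 1000 samples, 20 features each
timeseries_data = np.zeros((250, 90, 6))     # 250 sequences, 90 timesteps, 6 features
image_data = np.zeros((128, 28, 28, 1))      # 128 grayscale images of 28 x 28 pixels
video_data = np.zeros((4, 60, 144, 256, 3))  # 4 clips of 60 RGB frames of 144 x 256 pixels
vector_data.ndim, timeseries_data.ndim, image_data.ndim, video_data.ndim  # (2, 3, 4, 5)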
A Dense layer with a rectified linear unit (relu) activation function:
tf.keras.layers.Dense(512, activation="relu")
The layer above can be interpreted as the following function:
output = max(dot(input, W) + b, 0)
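As a sketch, the same computation can be written directly in NumPy (illustrative only, assuming inputs is a rank-2 tensor of shape (batch, features) and W, b have compatible shapes; this is not the Keras implementation):
def naive_dense_relu(inputs, W, b):
    # affine transform followed by the element-wise relu activation
    return np.maximum(np.dot(inputs, W) + b, 0.)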
Element-wise operations are carried out by optimized NumPy code.
When possible, and if there’s no ambiguity, the smaller tensor will be broadcast to match the shape of the larger tensor.
X = np.random.random((32, 10))
y = np.random.random((10,))
result = X + y
result.shape
x = np.random.random((64, 3, 32, 10))
y = np.random.random((32, 10))
z = np.maximum(x, y)
z.shape
- Dot-product between vectors
The dot-product between two vectors is a scalar.
$$x \cdot y = \sum_{i=1}^{n} x_i \, y_i$$
- Dot-product between a matrix and a vector.
The dot-product between a matrix X and a vector y is a vector whose element $i$ is the dot-product of the vector y and the $i$-th row of X.
- Higher dimension dot-products
We can take higher-dimensional dot-products, as long as the last dimension of the first tensor matches the first dimension of the second tensor (the vector and matrix-vector cases above are sketched in code after this list):
(a, b, c, d) . (d,) -> (a,b,c)
(a, b, c, d) . (d, e) -> (a, b, c, e)
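For illustration, here are naive implementations of the vector and matrix-vector dot-products described above (a sketch; in practice you would use np.dot or the @ operator):
def naive_vector_dot(x, y):
    # x and y must be vectors of the same length
    assert x.ndim == 1 and y.ndim == 1 and x.shape[0] == y.shape[0]
    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

def naive_matrix_vector_dot(X, y):
    # element i of the result is the dot-product of y with the i-th row of X
    assert X.ndim == 2 and y.ndim == 1 and X.shape[1] == y.shape[0]
    z = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        z[i] = naive_vector_dot(X[i, :], y)
    return z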
Reshaping a tensor means rearranging its rows and columns to match a target shape. Naturally, the reshaped tensor has the same total number of coefficients as the initial tensor.
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])
x.shape
x = x.reshape((6, 1))
x
x = x.reshape((2, 3))
x
- Transpose
x = np.zeros((300, 20))
x = np.transpose(x)
x.shape
In general, elementary geometric operations such as translation, rotation, scaling and so on can be expressed as tensor operations.
- Translation: Translating a 2D object can be implemented as the sum of two vectors.
- Rotation: A counterclockwise rotation of a 2D vector by an angle $\theta$ can be achieved via a dot-product with a 2 x 2 matrix.
- Scaling: scaling of a 2D object can also be accomplished by a dot-product.
- Linear transform: A dot-product with an arbitrary matrix implements a linear transform. Rotation and Scaling are examples of linear transforms.
$$W \cdot x$$
- Affine transform: It is a combination of a linear transform and a translation.
$$W \cdot x + b$$
- Importance of activation functions: a sequence of affine transforms is equivalent to a single affine transform. So a stack of `Dense` layers without activation functions would be equivalent to a single `Dense` layer.
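A small NumPy sketch of these geometric transforms applied to a 2D point (made-up values, angle in radians):
theta = np.pi / 4                                # rotate counterclockwise by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation matrix
S = np.array([[2.0, 0.0],
              [0.0, 0.5]])                       # scale x by 2, y by 0.5
b = np.array([1.0, -1.0])                        # translation vector

x = np.array([1.0, 0.0])                         # a 2D point
rotated = np.dot(R, x)                           # linear transform (rotation)
scaled = np.dot(S, x)                            # linear transform (scaling)
affine = np.dot(R, x) + b                        # affine transform: W . x + b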
We just saw that a tensor operation is equivalent to a geometric transformation. Since a neural network is a series of tensor operations, we can say that a neural network is a very complex geometric transformation in a high-dimensional space, implemented via a series of simple steps.
Assume our network is represented by `output = relu(dot(input, W) + b)`. `W` and `b` are the parameters of the model and are initialized with random values. By training the neural network, we will gradually adjust `W` and `b` with the objective of minimizing a loss function between the model prediction `y_pred` and the observed data `y_true`.
A training loop involves repeating the following steps until the loss seems sufficiently low:
1. Draw a batch of training samples, `x`, and corresponding targets, `y_true`.
2. Run the model on `x` (a step called the forward pass) to obtain predictions, `y_pred`.
3. Compute the loss of the model on the batch, a measure of the mismatch between `y_pred` and `y_true`.
4. Update all weights of the model in a way that slightly reduces the loss on this batch.
Step 4 is carried out by gradient descent, which requires that the loss function be differentiable with respect to the model's learnable parameters.
Geometric interpretation of the derivative of a continuous, smooth function: the derivative represents the local slope of the curve of the function.
Gradients are just the generalization of the concept of derivatives to functions that take tensors as inputs.
Assuming a model `y = f(W)`, where `W` is a tensor of model coefficients, `grad(loss, W0)` can be interpreted as the tensor describing the direction of steepest ascent of `loss_value = f(W)` around `W0`. We can reduce `loss_value = f(W)` by moving `W` in the opposite direction from the gradient: `W_1 = W_0 - step * grad(loss, W0)`. `step` is a small scaling factor that is needed because the gradient only approximates the curvature of the function locally, around `W0`.
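A tiny made-up numeric illustration of this update rule, minimizing the scalar function f(w) = (w - 3)^2, whose gradient is 2 * (w - 3):
w = 0.0        # initial parameter value
step = 0.1     # small scaling factor (learning rate)
for _ in range(50):
    gradient = 2 * (w - 3)   # gradient of the loss at the current w
    w = w - step * gradient  # move in the opposite direction of the gradient
w  # approaches 3, the minimum of the function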
Mini-batch stochastic gradient descent (SGD) draws random batches of data and applies one step of gradient descent to decrease the loss a little. The process is repeated until convergence.
There exist variations of SGD, such as SGD with momentum, which uses not only the current gradient but also previous gradient values.
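A naive sketch of the momentum idea (pseudocode-style Python; get_current_parameters and update_parameter are hypothetical helpers, not a real API):
past_velocity = 0.
momentum = 0.1       # constant momentum factor
learning_rate = 0.01
while loss > 0.01:   # optimization loop
    w, loss, gradient = get_current_parameters()
    velocity = past_velocity * momentum - learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    update_parameter(w)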
- Chain rule exemplified:
def fghj(x):
    x1 = j(x)
    x2 = h(x1)
    x3 = g(x2)
    y = f(x3)
    return y
grad(y, x) == grad(y, x3) * grad(x3, x2) * grad(x2, x1) * grad(x1, x)
- Automatic differentiation with computation graphs
A computation graph is a directed acyclic graph of tensor operations. The chain rule says that you can obtain the derivative of a node with respect to another node by multiplying the derivatives for each edge along the path linking the two nodes.
- Backpropagation
Backpropagation is simply the application of the chain rule to a computation graph. Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, computing the contribution that each parameter had in the loss value.
- TensorFlow gradient tape
It's a Python scope that will "record" the tensor operations that run inside it, in the form of a computation graph (sometimes called a “tape”).
import tensorflow as tf
# gradient wrt a scalar variable
x = tf.Variable(0.)
with tf.GradientTape() as tape:
    y = 2 * x + 3
grad_of_y_wrt_x = tape.gradient(y, x)
grad_of_y_wrt_x
# gradient wrt a tensor variable
x = tf.Variable(tf.random.uniform((2, 2)))
with tf.GradientTape() as tape:
    y = 2 * x + 3
grad_of_y_wrt_x = tape.gradient(y, x)
grad_of_y_wrt_x
# gradient wrt a list of variables
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x = tf.random.uniform((2, 2))
with tf.GradientTape() as tape:
    y = tf.matmul(x, W) + b
grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])
grad_of_y_wrt_W_and_b
A simple Dense layer implements the following input transformation, where W and b are model parameters:
output = activation(dot(inputs, W) + b)
import tensorflow as tf
class NaiveDense:
    def __init__(self, input_size, output_size, activation):
        self.activation = activation
        # Weight matrix W, initialized with small random values
        w_shape = (input_size, output_size)
        w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
        self.W = tf.Variable(w_initial_value)
        # Bias vector b, initialized with zeros
        b_shape = (output_size,)
        b_initial_value = tf.zeros(b_shape)
        self.b = tf.Variable(b_initial_value)

    def __call__(self, inputs):
        # Forward pass: activation(dot(inputs, W) + b)
        return self.activation(tf.matmul(inputs, self.W) + self.b)

    @property
    def weights(self):
        return [self.W, self.b]
class NaiveSequential:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, inputs):
        # Chain the layers: feed the output of each layer into the next
        x = inputs
        for layer in self.layers:
            x = layer(x)
        return x

    @property
    def weights(self):
        weights = []
        for layer in self.layers:
            weights += layer.weights
        return weights
model = NaiveSequential([
    NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
])
assert len(model.weights) == 4
import math
class BatchGenerator:
    def __init__(self, images, labels, batch_size=128):
        assert len(images) == len(labels)
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size
        self.num_batches = math.ceil(len(images) / batch_size)

    def next(self):
        # Return the next batch of images and labels
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels
def one_training_step(model, images_batch, labels_batch):
    # Forward pass and loss computation, recorded on a gradient tape
    with tf.GradientTape() as tape:
        predictions = model(images_batch)
        per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
            labels_batch, predictions
        )
        average_loss = tf.reduce_mean(per_sample_losses)
    # Backward pass: gradients of the loss with respect to the weights
    gradients = tape.gradient(average_loss, model.weights)
    update_weights(gradients, model.weights)
    return average_loss
Manual implementation of the update step:
learning_rate = 1e-3
def update_weights(gradients, weights):
    for g, w in zip(gradients, weights):
        w.assign_sub(g * learning_rate)
In practice, you would almost never implement a weight update step like this by hand. Instead, you would use an Optimizer instance from Keras, like this:
from tensorflow.keras import optimizers
optimizer = optimizers.SGD(learning_rate=1e-3)
def update_weights(gradients, weights):
    optimizer.apply_gradients(zip(gradients, weights))
def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print(f"Epoch {epoch_counter}")
        batch_generator = BatchGenerator(images, labels, batch_size=batch_size)
        for batch_counter in range(batch_generator.num_batches):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print(f"loss at batch {batch_counter}: {loss:.2f}")
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255
fit(model, train_images, train_labels, epochs=10, batch_size=128)
import numpy as np
predictions = model(test_images)
predictions = predictions.numpy()
predicted_labels = np.argmax(predictions, axis=1)
matches = predicted_labels == test_labels
print(f"accuracy: {matches.mean():.2f}")