
Required packages

import numpy as np
import tensorflow as tf

A first look at a Neural Network

MNIST dataset

Task: classify grayscale images of handwritten digits (28 × 28 pixels) into their 10 categories (0 through 9).

from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 2s 0us/step
11501568/11490434 [==============================] - 2s 0us/step

Training data:

train_images.shape
(60000, 28, 28)
len(train_labels)
60000
train_labels
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

Test data:

test_images.shape
(10000, 28, 28)
len(test_labels)
10000
test_labels
array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)

Define and compile the model

Define a basic multi-layer network

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(512, activation="relu"), 
        tf.keras.layers.Dense(10, activation="softmax")
    ]
)

Compile the model by specifying the optimization algorithm, the loss function, and the metrics to track. Because the labels are plain integers rather than one-hot vectors, we use the sparse_categorical_crossentropy loss:

model.compile(
    optimizer="rmsprop", 
    loss="sparse_categorical_crossentropy", 
    metrics=["accuracy"]
)

Pre-process the data as expected by the model

Transform the features from a (60000, 28, 28) array of integers in [0, 255] to a flat (60000, 28 * 28) array of floats in [0, 1].

train_images.shape
(60000, 28, 28)
train_images.dtype
dtype('uint8')
train_images = train_images.reshape((60000, 28*28))
train_images = train_images.astype("float32") / 255
train_images.shape
(60000, 784)
train_images.dtype
dtype('float32')
test_images.shape
(10000, 28, 28)
test_images.dtype
dtype('uint8')
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255
test_images.shape
(10000, 784)
test_images.dtype
dtype('float32')

Fit the model

model.fit(train_images, train_labels, epochs=20, batch_size=128)
Epoch 1/20
469/469 [==============================] - 2s 4ms/step - loss: 0.2423 - accuracy: 0.9297
Epoch 2/20
469/469 [==============================] - 2s 4ms/step - loss: 0.2323 - accuracy: 0.9331
Epoch 3/20
469/469 [==============================] - 2s 4ms/step - loss: 0.2227 - accuracy: 0.9359
Epoch 4/20
469/469 [==============================] - 2s 4ms/step - loss: 0.2133 - accuracy: 0.9384
Epoch 5/20
469/469 [==============================] - 2s 4ms/step - loss: 0.2044 - accuracy: 0.9410
Epoch 6/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1957 - accuracy: 0.9437
Epoch 7/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1881 - accuracy: 0.9460
Epoch 8/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1805 - accuracy: 0.9483
Epoch 9/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1732 - accuracy: 0.9505
Epoch 10/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1665 - accuracy: 0.9524
Epoch 11/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1603 - accuracy: 0.9538
Epoch 12/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1541 - accuracy: 0.9560
Epoch 13/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1487 - accuracy: 0.9576
Epoch 14/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1433 - accuracy: 0.9593
Epoch 15/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1384 - accuracy: 0.9604
Epoch 16/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1337 - accuracy: 0.9622
Epoch 17/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1290 - accuracy: 0.9634
Epoch 18/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1247 - accuracy: 0.9642
Epoch 19/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1208 - accuracy: 0.9658
Epoch 20/20
469/469 [==============================] - 2s 4ms/step - loss: 0.1173 - accuracy: 0.9669
<keras.callbacks.History at 0x14a481ee0>

Predict with the model

Select the first 10 images of the test set.

test_digits = test_images[0:10]
test_digits.shape
(10, 784)

Compute predictions for the first 10 images:

predictions = model.predict(test_digits)

Class probabilities for the first test image:

predictions[0]
array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], dtype=float32)

Pick the class with the highest probability:

predictions[0].argmax()
7

Check the true label of the first test image:

test_labels[0]
7

Evaluate the model on test data

test_loss, test_acc = model.evaluate(test_images, test_labels)
313/313 [==============================] - 0s 1ms/step - loss: 18.4538 - accuracy: 0.9514
print(f"Test accuracy: {test_acc:.4}\nTest loss: {test_loss:.6}")
Test accuracy: 0.9514
Test loss: 18.4538

Data representation for neural networks: Tensors

Tensors are the basic data structure used in machine learning. A tensor is a multi-dimensional array; in the context of tensors, a dimension is also called an axis.

In deep learning, you'll generally manipulate tensors with ranks 0 to 4, although you may go up to 5 if you process video data.

Scalars (rank-0 tensors, 0D tensor)

x = np.array(12)
x.ndim
0

Vectors (rank-1 tensors, 1D tensor)

x = np.array([12, 3, 6, 14, 7])
x.ndim
1

Matrices (rank-2 tensors, 2D tensors)

x = np.array(
    [
        [5, 78, 2, 34, 0],
        [6, 79, 3, 35, 1],
        [7, 80, 4, 36, 2]
    ]
)
x.ndim
2

Rank-3 and higher rank tensors

If you pack such matrices into a new array, you obtain a rank-3 tensor (or 3D tensor).

x = np.array(
    [
        [
            [5, 78, 2, 34, 0],
            [6, 79, 3, 35, 1],
            [7, 80, 4, 36, 2]
        ],
        [
            [5, 78, 2, 34, 0],
            [6, 79, 3, 35, 1],
            [7, 80, 4, 36, 2]
        ],
        [
            [5, 78, 2, 34, 0],
            [6, 79, 3, 35, 1],
            [7, 80, 4, 36, 2]
        ]
    ]
)
x.ndim
3

Tensor key attributes

  1. Number of axes
train_images.ndim
2
  2. Shape
train_images.shape
(60000, 784)
  3. Data type
train_images.dtype
dtype('float32')

Manipulating tensors in NumPy

from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Slicing arrays. The following three statements are equivalent; each selects the 90 images with indices 10 to 99:

my_slice = train_images[10:100]
my_slice.shape
(90, 28, 28)
my_slice = train_images[10:100, :, :]
my_slice.shape
(90, 28, 28)
my_slice = train_images[10:100, 0:28, 0:28]
my_slice.shape
(90, 28, 28)

Select 14x14 pixels in the bottom-right corner of all images:

my_slice = train_images[:, 14:, 14:]
my_slice.shape
(60000, 14, 14)

Select 14x14 pixels from the middle of all images:

my_slice = train_images[:, 7:-7, 7:-7]
my_slice.shape
(60000, 14, 14)

Real-world example of data tensors

  • Vector data: 2D-tensor (number_samples, features)
  • Timeseries or sequence data: 3D-tensor (number_samples, timesteps, features)
  • Images: 4D-tensor (number_samples, height, width, channels)
  • Video: 5D-tensor (number_samples, frames, height, width, channels)
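
As a quick illustration, we can build dummy NumPy arrays with these layouts (all sizes below are arbitrary, chosen only for the example) and check their ranks:

import numpy as np

# Arbitrary sizes, for illustration only.
vector_data = np.zeros((1000, 20))            # 1000 samples, 20 features
timeseries_data = np.zeros((250, 90, 6))      # 250 samples, 90 timesteps, 6 features
image_data = np.zeros((64, 28, 28, 1))        # 64 grayscale 28 x 28 images
video_data = np.zeros((4, 60, 64, 64, 3))     # 4 clips of 60 frames, 64 x 64 RGB

print(vector_data.ndim, timeseries_data.ndim, image_data.ndim, video_data.ndim)
# 2 3 4 5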

The gears of neural networks: Tensor operations

Layer as a function

The dense layer with a rectified linear unit activation function

tf.keras.layers.Dense(512, activation="relu")
<keras.layers.core.dense.Dense at 0x14a487310>

The layer above can be interpreted as the following function:

output = max(dot(input, W) + b, 0)
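
As a sketch of what this function computes (not how Keras actually implements the layer), here is a plain NumPy version; W, b, and the input batch below are made-up values with the shapes the MNIST model uses:

import numpy as np

def naive_dense_relu(inputs, W, b):
    # output = max(dot(input, W) + b, 0), applied to a whole batch via a matrix product
    return np.maximum(np.dot(inputs, W) + b, 0.)

# Made-up parameters with the shapes a Dense(512) layer would use on flattened MNIST inputs.
W = np.random.random((784, 512))
b = np.random.random((512,))
batch = np.random.random((128, 784))
naive_dense_relu(batch, W, b).shape
# (128, 512)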

Element-wise operations

Element-wise operations (such as relu or tensor addition) are applied independently to each entry of the tensors involved; in NumPy they are carried out by optimized low-level code rather than explicit Python loops.
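
For instance, a minimal sketch contrasting a hand-written relu with the vectorized NumPy call (the naive_relu helper below is purely illustrative):

import numpy as np

def naive_relu(x):
    # Element-wise max(x, 0) with explicit Python loops (slow, for illustration only).
    assert x.ndim == 2
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

x = np.random.random((20, 100)) - 0.5
# The vectorized equivalent delegates the loops to optimized low-level code.
assert np.allclose(naive_relu(x), np.maximum(x, 0.))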

Broadcasting

When possible, and if there’s no ambiguity, the smaller tensor will be broadcast to match the shape of the larger tensor.

X = np.random.random((32, 10))
y = np.random.random((10,))   
result = X + y
result.shape
(32, 10)
x = np.random.random((64, 3, 32, 10))
y = np.random.random((32, 10))
z = np.maximum(x, y) 
z.shape
(64, 3, 32, 10)

Tensor-product (dot-product)

  • Dot-product between vectors

The dot-product between two vectors is a scalar.

$$x \cdot y = \sum_{i=1}^{n} x_i \, y_i$$

  • Dot-product between a matrix and a vector.

The dot-product between a matrix X and a vector y is a vector whose element $i$ is the dot-product of the vector y and the $i$-th row of X.

  • Higher dimension dot-products

We can take dot-products of higher-rank tensors, as long as the last dimension of the first tensor matches the first dimension of the second tensor.

(a, b, c, d) . (d,) -> (a, b, c)

(a, b, c, d) . (d, e) -> (a, b, c, e)
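
We can verify these shape rules with NumPy (the sizes below are arbitrary):

import numpy as np

x = np.random.random((2, 3, 4, 5))   # shape (a, b, c, d)
y = np.random.random((5,))           # shape (d,)
z = np.random.random((5, 6))         # shape (d, e)

np.dot(x, y).shape
# (2, 3, 4)
np.dot(x, z).shape
# (2, 3, 4, 6)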

Tensor reshaping

Reshaping a tensor means rearranging its rows and columns to match a target shape. Naturally, the reshaped tensor has the same total number of coefficients as the initial tensor.

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])
x.shape
(3, 2)
x = x.reshape((6, 1))
x
array([[0.],
       [1.],
       [2.],
       [3.],
       [4.],
       [5.]])
x = x.reshape((2, 3))
x
array([[0., 1., 2.],
       [3., 4., 5.]])
  • Transpose
x = np.zeros((300, 20))
x = np.transpose(x)
x.shape
(20, 300)

Geometric interpretation of tensor operations

In general, elementary geometric operations such as translation, rotation, scaling and so on can be expressed as tensor operations.

  • Translation: Translating a 2D object can be implemented as the sum of two vectors.
$$ \begin{bmatrix} \text{Horizontal factor} \\ \text{Vertical factor} \\ \end{bmatrix} + \begin{bmatrix} x \\ y \\ \end{bmatrix} $$
  • Rotation: A counterclockwise rotation of a 2D vector by an angle $\theta$ can be achieved via a dot-product with a 2 x 2 matrix.
$$ \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} $$
  • Scaling: Scaling a 2D object can also be accomplished by a dot-product.
$$ \begin{bmatrix} \text{factor}_x & 0 \\ 0 & \text{factor}_y \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} $$
  • Linear transform: A dot-product with an arbitrary matrix implements a linear transform; rotation and scaling are examples of linear transforms.

$$W \cdot x$$

  • Affine transform: It is a combination of a linear transform and a translation.

$$W \cdot x + b$$

  • Importance of activation functions: A sequence of affine transforms is equivalent to a single affine transform, so a stack of Dense layers without activation functions would still be equivalent to a single Dense layer, as the quick check below illustrates.
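
A minimal numerical check of that claim, with made-up 2 x 2 matrices and bias vectors:

import numpy as np

# Two made-up affine transforms: x -> W1 @ x + b1 and x -> W2 @ x + b2.
W1, b1 = np.random.random((2, 2)), np.random.random((2,))
W2, b2 = np.random.random((2, 2)), np.random.random((2,))
x = np.random.random((2,))

# Applying them in sequence...
composed = W2 @ (W1 @ x + b1) + b2
# ...is the same as a single affine transform with W = W2 @ W1 and b = W2 @ b1 + b2.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)

assert np.allclose(composed, collapsed)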

We just saw that a tensor operation is equivalent to a geometric transformation. Since a neural network is a series of tensor operations, we can say that a neural network is a very complex geometric transformation in a high-dimensional space, implemented via a series of simple steps.

The engine of neural networks: Gradient-based optimization

Assume our network is represented by output = relu(dot(input, W) + b). W and b are the parameters of the model and are initialized with random values.

By training the neural network, we will gradually adapt W and b with the objective of minimizing a loss function between the model prediction y_pred and the observed data y_true.

A training loop involves repeating the following steps until the loss seems sufficiently low:

  1. Draw a batch of training samples, x, and corresponding targets, y_true.
  2. Run the model on x (a step called the forward pass) to obtain predictions, y_pred.
  3. Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true.
  4. Update all weights of the model in a way that slightly reduces the loss on this batch.

Step 4 is carried out by gradient descent, which requires that the loss function be differentiable with respect to the model's learnable parameters.

What is a derivative?

Geometrically, the derivative of a continuous, smooth function at a point is the local slope of the function's curve at that point.

Derivative of a tensor operation: The gradient

Gradients are just the generalization of the concept of derivatives to functions that take tensors as inputs.

Assuming a model loss_value = f(W), where W is a tensor of model coefficients, grad(loss_value, W0) can be interpreted as the tensor describing the direction of steepest ascent of f around W0. We can therefore reduce loss_value by moving W in the opposite direction from the gradient: W_1 = W_0 - step * grad(loss_value, W0). Here step is a small scaling factor, needed because the gradient only approximates the function in the close vicinity of W0.
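
As a toy illustration (unrelated to the MNIST model), here is gradient descent on the scalar function f(w) = (w - 3)^2, whose derivative is 2 * (w - 3); step plays the role of the learning rate:

w = 0.0          # initial parameter value
step = 0.1       # learning rate (the "step" factor above)

for _ in range(50):
    gradient = 2 * (w - 3)       # derivative of (w - 3) ** 2 at the current w
    w = w - step * gradient      # move against the gradient

round(w, 3)
# 3.0, i.e. w has converged to the minimum of (w - 3) ** 2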

Stochastic gradient descent

Mini-batch stochastic gradient descent draws random batches of data and applies one step of gradient descent to decrease the loss a little. The process is repeated until convergence.

There exist variations of SGD, such as SGD with momentum, which uses not only the current gradient but also previous gradient values, as sketched below.
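
Below is a rough sketch of the momentum idea on the same toy function f(w) = (w - 3)^2; the constants and the exact update form are one common formulation chosen for illustration, not the one used by the Keras optimizers:

w = 0.0
learning_rate = 0.1
momentum = 0.9
velocity = 0.0   # running combination of past gradients

for _ in range(200):
    gradient = 2 * (w - 3)                                   # gradient of (w - 3) ** 2
    velocity = momentum * velocity - learning_rate * gradient
    w = w + velocity                                         # the update carries past gradients too

assert abs(w - 3) < 1e-2   # w still converges close to the minimum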

Chaining derivatives: The backpropagation algorithm

  • Chain rule exemplified (f, g, h and j stand for arbitrary differentiable functions; grad denotes the derivative, in pseudocode):
def fghj(x):
    x1 = j(x)
    x2 = h(x1)
    x3 = g(x2)
    y = f(x3)
    return y
 
grad(y, x) == grad(y, x3) * grad(x3, x2) * grad(x2, x1) * grad(x1, x)
  • Automatic differentiation with computation graphs

A computation graph is a directed acyclic graph of tensor operations. The chain rule says that you can obtain the derivative of a node with respect to another node by multiplying the derivatives for each edge along the path linking the two nodes.

  • Backpropagation

Backpropagation is simply the application of the chain rule to a computation graph. Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, computing the contribution that each parameter had in the loss value.

  • TensorFlow gradient tape

A GradientTape is a Python scope that "records" the tensor operations run inside it, in the form of a computation graph (sometimes called a "tape").

import tensorflow as tf

# gradient wrt a scalar variable
x = tf.Variable(0.)
with tf.GradientTape() as tape:
    y = 2 * x + 3
grad_of_y_wrt_x = tape.gradient(y, x)
grad_of_y_wrt_x
<tf.Tensor: shape=(), dtype=float32, numpy=2.0>
# gradient wrt a tensor variable
x = tf.Variable(tf.random.uniform((2, 2)))
with tf.GradientTape() as tape:
    y = 2 * x + 3 
grad_of_y_wrt_x = tape.gradient(y, x)
grad_of_y_wrt_x
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 2.],
       [2., 2.]], dtype=float32)>
# gradient wrt a list of variables
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x = tf.random.uniform((2, 2)) 
with tf.GradientTape() as tape:
    y = tf.matmul(x, W) + b                       
grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])  
grad_of_y_wrt_W_and_b
[<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
 array([[1.2138343 , 1.2138343 ],
        [0.50162864, 0.50162864]], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 2.], dtype=float32)>]
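
By default the tape only tracks tf.Variable objects; a constant tensor can be tracked explicitly with tape.watch (continuing with the tf imported above, and the expected output shown as a comment):

# gradient wrt a constant tensor (must be watched explicitly)
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)      # constants are not tracked automatically, unlike tf.Variable
    y = x * x
tape.gradient(y, x)
# <tf.Tensor: shape=(), dtype=float32, numpy=6.0>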

Reimplementing our first example from scratch in TensorFlow

A simple dense class

output = activation(dot(input, W) + b)

import tensorflow as tf
  
class NaiveDense:
    def __init__(self, input_size, output_size, activation):
        self.activation = activation
 
        w_shape = (input_size, output_size)
        w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
        self.W = tf.Variable(w_initial_value)
  
        b_shape = (output_size,)
        b_initial_value = tf.zeros(b_shape)
        self.b = tf.Variable(b_initial_value)
  
    def __call__(self, inputs):
        return self.activation(tf.matmul(inputs, self.W) + self.b)
  
    @property
    def weights(self):
        return [self.W, self.b]

A simple sequential class

class NaiveSequential:
    def __init__(self, layers):
        self.layers = layers
  
    def __call__(self, inputs):
        x = inputs
        for layer in self.layers:
            x = layer(x)
        return x
  
    @property 
    def weights(self):
        weights = []
        for layer in self.layers:
            weights += layer.weights
        return weights

Instantiate the model

model = NaiveSequential([
    NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
]) 
assert len(model.weights) == 4  # two layers, each contributing a W and a b

A batch generator

import math
  
class BatchGenerator:
    def __init__(self, images, labels, batch_size=128):
        assert len(images) == len(labels)
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size
        self.num_batches = math.ceil(len(images) / batch_size)
 
    def next(self):
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels
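
A quick sanity check of the generator on dummy data (the shapes and values below are arbitrary):

import numpy as np

dummy_images = np.zeros((1000, 784), dtype="float32")
dummy_labels = np.zeros((1000,), dtype="uint8")

generator = BatchGenerator(dummy_images, dummy_labels, batch_size=128)
generator.num_batches
# 8
images_batch, labels_batch = generator.next()
images_batch.shape, labels_batch.shape
# ((128, 784), (128,))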

Running one training step

def one_training_step(model, images_batch, labels_batch):
    with tf.GradientTape() as tape:
        predictions = model(images_batch)
        per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
            labels_batch, predictions
        )
        average_loss = tf.reduce_mean(per_sample_losses)
    gradients = tape.gradient(average_loss, model.weights)
    update_weights(gradients, model.weights)
    return average_loss

Manual implementation of the update step:

learning_rate = 1e-3 
  
def update_weights(gradients, weights):
    for g, w in zip(gradients, weights):
        w.assign_sub(g * learning_rate)

In practice, you would almost never implement a weight update step like this by hand. Instead, you would use an Optimizer instance from Keras, like this:

from tensorflow.keras import optimizers
  
optimizer = optimizers.SGD(learning_rate=1e-3)
  
def update_weights(gradients, weights):
    optimizer.apply_gradients(zip(gradients, weights))

The full training loop

def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print(f"Epoch {epoch_counter}")
        batch_generator = BatchGenerator(images, labels, batch_size=batch_size)
        for batch_counter in range(batch_generator.num_batches):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print(f"loss at batch {batch_counter}: {loss:.2f}")
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
  
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255  
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255 
  
fit(model, train_images, train_labels, epochs=10, batch_size=128)
Epoch 0
loss at batch 0: 3.42
loss at batch 100: 2.25
loss at batch 200: 2.20
loss at batch 300: 2.08
loss at batch 400: 2.19
Epoch 1
loss at batch 0: 1.91
loss at batch 100: 1.89
loss at batch 200: 1.83
loss at batch 300: 1.69
loss at batch 400: 1.79
Epoch 2
loss at batch 0: 1.58
loss at batch 100: 1.59
loss at batch 200: 1.50
loss at batch 300: 1.41
loss at batch 400: 1.47
Epoch 3
loss at batch 0: 1.32
loss at batch 100: 1.34
loss at batch 200: 1.24
loss at batch 300: 1.19
loss at batch 400: 1.24
Epoch 4
loss at batch 0: 1.12
loss at batch 100: 1.16
loss at batch 200: 1.04
loss at batch 300: 1.03
loss at batch 400: 1.08
Epoch 5
loss at batch 0: 0.98
loss at batch 100: 1.02
loss at batch 200: 0.91
loss at batch 300: 0.91
loss at batch 400: 0.96
Epoch 6
loss at batch 0: 0.87
loss at batch 100: 0.91
loss at batch 200: 0.80
loss at batch 300: 0.82
loss at batch 400: 0.88
Epoch 7
loss at batch 0: 0.79
loss at batch 100: 0.83
loss at batch 200: 0.73
loss at batch 300: 0.75
loss at batch 400: 0.81
Epoch 8
loss at batch 0: 0.73
loss at batch 100: 0.76
loss at batch 200: 0.66
loss at batch 300: 0.70
loss at batch 400: 0.76
Epoch 9
loss at batch 0: 0.68
loss at batch 100: 0.70
loss at batch 200: 0.62
loss at batch 300: 0.65
loss at batch 400: 0.72

Evaluating the model

import numpy as np

predictions = model(test_images)
predictions = predictions.numpy()
predicted_labels = np.argmax(predictions, axis=1)
matches = predicted_labels == test_labels
print(f"accuracy: {matches.mean():.2f}")
accuracy: 0.82