In the unlikely event that the person reading this is not me, please make sure to check out the actual course “Deep Learning for Coders” over at course.fast.ai. These summaries are written with an audience of one - me - in mind, so proceed with caution.

Building a neural network from scratch

In line with the top-down nature of the course, the first two installments of this series (Lesson 1, Lesson 2) dealt with the use of the high-level API to train and apply pre-trained models. This section covers the inner workings of deep learning models and fundamental elements of the training process, namely stochastic gradient descent (SGD).

As such, this part of the course is the most crucial building block for understanding the more sophisticated methodologies introduced later on.

To provide a practical application for this lesson, we will use the MNIST dataset as a sandbox for our model. For now, however, we will try to make the problem as simple as possible. As such we will not go for recognizing all ten possible digits but focus only on a small subset at first. The goal will be to build some system from scratch which can successfully differentiate between zeros and ones.

The MNIST dataset

We will be working with the famous MNIST dataset (Deng, 2012), which is a collection of images depicting handwritten digits. Is is a widely used dataset for image classification tasks. As such it is also available for download from kaggle.com, which is where I obtained the files for this exercise. A straighforward approach to fetching data from Kaggle is to install the Kaggle API and execute the following in your chosen directory:

kaggle competitions download -c digit-recognizer

This will provide you with the following files:


The dataset is provided as a training and a test set, conveniently prepared to directly work with. But before we dive into modelling, let’s take a look at the contents of the provided files. To do that, I will load the data into a pandas DataFrame, split it into a training and validation set, and take a look at the first few rows:

train_df = pd.read_csv('train.csv')
valid_df = train_df.sample(axis=0, frac=0.2, random_state=1337)
train_df = train_df.drop(valid_df.index)

train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)

label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ...
0 1 0 0 0 0 0 0 0 0 0 ...
1 0 0 0 0 0 0 0 0 0 0 ...
2 1 0 0 0 0 0 0 0 0 0 ...
3 4 0 0 0 0 0 0 0 0 0 ...
4 0 0 0 0 0 0 0 0 0 0 ...

5 rows × 785 columns

As you can see in the table, data in train.csv is structured in a way that every row represents one image. Accordingy, the first column provides the label of the digit we are trying to find and the remaining 784 rows each represent one pixel of the 28x28 image.

This is an important insight to appreciate. Even if we are working with an image recognition problem, the underlying data is nothing more than a collection of pixel values. This becomes even more evident when we dive deeper into the data. So let’s go on to achieve the next milestone in our data preperation - PyTorch tensors for the independent and dependent variables of our model. Remember that we restricted our problem to only ones and zeros, which is why we will start by filtering the dataset for those two labels and then prepare the tensors.

train_ones = []
train_zeros = []

for i in train_df.index:

    if train_df.iloc[i]['label'] == 1:
        train_ones.append(tensor(train_df.iloc[i, 1:].values).reshape((28, 28)))

    elif train_df.iloc[i]['label'] == 0:
        train_zeros.append(tensor(train_df.iloc[i, 1:].values).reshape((28, 28)))

train_ones = torch.stack(train_ones).float()/255
train_zeros = troch.stack(train_zeros).float()/255

So now we have a stacked tensor of size 3764x28x28 of pixel values for all ones and zeros in the MNIST dataset. To make it more clear, what we each element represents, let’s plot one of them:

This singular one is a beautiful specimen of what we want our model to find. It also shows how pixel values in our data relate to images. While this is a rather straightforward thought, it is an important point to grasp - we should always be aware of what the numbers in our model actually represent.

So how to we approach the task of building a neural net from scratch? As Jeremy Howard proposes, we will start with a simple and intuitive baseline.

A simple baseline

Before worrying about how to build a neural network, let’s try to think of a way to solve the problem by other means. So how would we be able to differentiate between zeros and ones? One approach would be to figure out, what an “average” zero/ one looks like and compare each digit to that mean value.

Let’s take a look at what those idealized digits would look like:

mean1 = train_ones.mean(0)
mean0 = train_zeros.mean(0)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True)
show_image(mean1, ax=ax[0], cmap=orange_palette, figsize=(5,5))
show_image(mean0, ax=ax[1], cmap=orange_palette, figsize=(5,5))

Having established what we expect a one or zero to look like, we can compare a given one (a_1 = train_ones[0]). In order to do that we will need to calculate the difference between our input and the average. This is where the loss function comes in which we have covered in the previous lessons. We will start with two commonly used loss functions:

Mean Absolute Error (MAE) - a.k.a. L1 Loss

mae=1nn=1nyy^mae = \frac{1}{n}\sum_{n=1}^n|y - \hat{y}|

Root Mean Squared Error (RMSE) - a.k.a. L2 Loss

rmse=1ni=1n(yy^)2rmse = \sqrt{\frac{1}{n}\sum_{i=1}^n(y - \hat{y})^2}

Don’t worry about the math. It becomes much simpler once it is written out as code. So let’s compare our example one to each of our two averages:

(a_1 - mean1).abs().mean()
(a_1 - mean0).abs().mean()

This provides is with a value for the loss compared to each digit:

(tensor(0.0845), tensor(0.1933))

It looks like the calculated loss is lower for mean1 than for mean0. This is already a good sign. By the way, PyTorch provides us with a function for L1 Loss already, so we would have achieved the same result by calling F.l1_loss(a_1, mean1) - F is the recommended name to import torch.nn.functional.

Similarly, RMSE becomes much more straightforward when viewed as code:

((a_1 - mean1)**2).mean().sqrt()
((a_1 - mean0)**2).mean().sqrt()
(tensor(0.2080), tensor(0.3263))

Again, we can use PyTorch for this: F.mse_loss(a_1, mean1).sqrt()

To apply this to our validation set, we will turn it into a python function and pass all ones to it:

def mnist_distance(a, b):
    return (a - b).abs().mean((-1, -2))  # we only want to include the axes with pixel values

valid_ones_dist = mnist_distance(valid_ones, mean1)

This returns a list of 920 loss values, then number of ones we have in our validation set.

Equipped with a loss function, we can define our prediction function based on the comparison which yields the lower loss:

def is_one(x):
    return mnist_distance(x, mean1) < mnist_distance(x, mean0)

So, in order to figure out whether whether the image contains a zero or a one, we can simply pass it to our new prediction function:


This seems to work well based on single example. But in order to verify how well our model works, we need some performance metric (as covered in the previous lessons). For our examples, we will use accuracy as our metric, by counting the number of correctly classified digits:

accuracy_ones = is_one(valid_ones).float().mean()
accuracy_zeros = (1 - is_one(valid_zeros).float()).mean()
accuracy_ones, accuracy_zeros
(tensor(1.), tensor(0.8810))

Who would have thought? It looks like our model can already predict ones with 100% accuracy and zeros with an accuracy of 88%! The problem is only, that our model has no way of improving. What this approach is currently lacking is the key process underlying deep learning - stochastic gradient descent.

Sochastic Gradient Descent

In order to build a process that learns from our loss function, we will need some sort of weight assignment. So, instead of comparing against some mean or “ideal” representation of each digit, we calculate weights that are highest for the most important pixels for a particular digit. Accordingly, we could think of the probability of an image being a one as the sum of each pixel multiplied by some weight:

def prob_1(x, w):
    """Calculate the probability of the image being a one, given pixels and weights."""
    return (w * x).sum()

The problem now is to find the right weights so that a high probability is returned for real ones. We need to figure out an approach of updating these weights to make them incrementally better at predicting ones. This process is stochastic gradient descent.

As the image above shows, it works its magic using only a few simple steps:

  1. Initialize a set of parameters randomly
  2. Predict the label of each image using these parameters
  3. Calculate the Loss of the model based on these predictions
  4. Calculate the gradient to calculate the impact of changing parameters on loss
  5. Step (i.e. change) all weights slightly in the direction of the lower loss
  6. Repeat steps 2 though 5
  7. Stop the training process

So in order to apply this to our problem, we need to turn our images into a dataset. Generally speaking, a PyTorch dataset is expected to by structured in such a way, that it returns a tuple of (x, y). To achieve this, we will first combine all of our images (x) into a single stacked tensor:

train_x = torch.cat([train_ones, train_zeros]).view(-1, 28*28)
torch.Size([7039, 784])

Next, we need to create a tensor with our labels, i.e. the actual digit we are trying to predict. To do so, we simply insert a 1 for every entry in train_ones and a 0 for each entry in train_zeros:

train_y = tensor([1] * len(train_ones) + [0] * len(train_zeros)).unsqueeze(1)
torch.Size([7039, 1])

Finally, to turn these two tensors into the expected list of tuples, we can simply use zip:

dset = list(zip(train_x, train_y))
x, y = dset[0]
x.shape, y
(torch.Size([784]), tensor([1]))

We need to repeat these steps, accordingly, to create our validation set. Once this is done, we can go through the seven steps of the gradient descent process in turn:


In order to initialize parameters, we define a simple function that returns random numbers based on a provided shape:

def init_params(size, std=1.0):
    """Randomly intiialize parameters based on a standard deviation."""
    return (torch.randn(size)*std).requires_grad_()

We use this to initialize one parameter for each pixel in our 28x28 images and another number as the bias.

weights = init_params((28*28, 1))
bias = init_params(1)


Having randomly initialized the parameters of our model. We will use them to predict the digit of an image. We could to this simply by calculating (train_x[0]*weights.T).sum() + bias. In order to be able to pass an entire batch of data, however, we will use matrix multiplication. This is what our prediction function now looks like:

def linear1(xb):
    """Calculate the prediction for a set of images."""
    return xb @ weights + bias

Accordingly, we can now simply pass the entire dataset:

preds = linear1(train_x)
tensor([[ 1.7234],
        [ 2.2772],
        [ 5.0252]], grad_fn=<AddBackward0>)


Now that we can predict the digit depicted in an image based on our parameters, we can use those predictions to calculate the model’s loss. Since there are only two labels, 0 and 1, we can define a very simple loss function, i.e. we simply compute the difference between the actual label and the prediction:

def mnist_loss(predictions , targets):
    predictions = predictions.sigmoid()   # we use the sigmoid function to ensure predictions are between 0 and 1
    return torch.where(targets==1, 1-predictions, predictions).mean()

The below example illustrates how this works:

trgts = tensor([1, 0, 1])
preds = tensor([0.9, 0.4, 0.2])

loss = mnist_loss(preds, trgts)


The loss shows model how far off the current set of parameters is. In order to improve our parameters, we need to calculate the gradient for each of them. PyTorch makes it incredibly straightforward to calculate the gradient, given we have defined our parameters with requires_grad_() and as floating point numbers. Below is an example with the first three entries of our training dataset:

weights = init_params((28*28, 1))
bias = init_params(1)

trgts = tensor(train_y[3])
preds = linear1(tensor(train_x[3]))
loss = mnist_loss(preds, trgts)

weights.grad.shape, weights.grad.mean(), bias.grad
(torch.Size([784, 1]), tensor(-0.0052), tensor([-0.1160]))


With our gradients in hand, we can use them to change our parameters in the direction of lower loss. The size of the step is called the learning rate (lr):

lr = 1e-5
weights.data -= lr * weights.grad.data


Now that we can calculate the losses and gradients and we can use them to optimize the parameters, we can create the actual training loop. First, however, there is another important element to think about. How many elements should we use to calculate the loss? We could take the entire training set, but that would be to computationally expensive. We could also just take a single item, but that would only give us enough information to optimize for this specific image. The compromize is then to split the training set into mini-batches and calculate the loss based on these subsets.

Hint: A larger batch size leads to a more accurate and stable estimate but will take longer

PyTorch provides the DataLoader class for this. It is basically a generator, yielding one batch of the dataset at a time:

data = range(15)
dl = DataLoader(data, batch_size=5, shuffle=True)
[tensor([ 0,  7, 14,  4, 13]),
 tensor([10,  9,  6,  3,  8]),
 tensor([11,  5, 12,  2,  1])]

A generic training loop would then look something like this:

for x, y in dl:
    pred = model(x)
    loss = loss_pred(pred,y)
    parameters -= parameters.grad * lr

To implement this in our dataset, we will do the following:

  1. Create our DataLoader
  2. Create a function to repeat the previous steps
  3. Build a training loop similar to the one above
  4. Define a function to calculate the accuracy of the model
  5. Build a validation loop
# Create our DataLoader
dl = DataLoader(dset, batch_size=256)
valid_dl = DataLoader(valid_dset, batch_size=256)

# Create a function to calculate the prediction and loss
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)

# Define training loop
def train_epoch(model, lr, params):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad * lr
            p.grad.zero_()  # this resets the gradients after our step

# Measure accuracy
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds > 0.5) == yb

    return correct.float().mean()

# Validation loop
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]

    return round(torch.stack(accs).mean().item(), 4)

So let’s try measuring the accuracy and see whether it improves after training for one epoch:

lr = 1.
params = weights, bias
train_epoch(linear1, lr, params)

It works! Let’s see what happens if we train for a few more epochs:

for i in range(5):
    train_epoch(linear1, lr, params)

    print(validate_epoch(linear1), end='\n')

So now we have a simple training loop which allows us to find zeros and ones with an impressive accuracy.

Using PyTorch and fastai

PyTorch and fastai provide helpful simplifications for the above steps. Firstly, our function linear1 is already implemented as nn.Linear and can be initialized by providing the input and output dimenions:

linear_model = nn.Linear(28*28, 1)
w, b = linear_model.parameters()
w.shape, b.shape
(torch.Size([1, 784]), torch.Size([1]))

Secondly, fastai provides the handy learner class which takes virtually all of the effort out of creating a training process. For instance, the implementation of stochastic gradient descent we have been working on so far, can simply be added to the class as SGD:

learn = Learner(dls, nn.Linear(28*28, 1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(5, lr=lr)
epoch train_loss valid_loss batch_accuracy time
0 0.169611 0.029667 0.997749 00:00
1 0.068044 0.009250 0.997749 00:00
2 0.035621 0.005843 0.998312 00:00
3 0.020956 0.004643 0.998312 00:00
4 0.013350 0.003998 0.998312 00:00

Building a Neural Net

So far, our function is limited, in that it is only a linear function. To move to a neural network, we need to add a non-linear element between two linear functions. That is basically the entire definition of a neural net - two linear classifiers with a max function between them:

def simple_net(xb):
    res = xb @ w1 + b1
    res = res.max(tensor(0.0))  # identical to F.relu
    res = xb @ w2 + b2

    return res

Just like before, w1, b1, w2, b2 are randomnly initialized parameters:

w1 = init_params((28*28, 30))
b1 = init_params(30)
w2 = init_params((30, 1))
b2 = init_params(1)

Note that 30 is randomly chosen. It means here that w1 has 30 output activations and can therefore construct 30 different features, i.e. different constellations of pixels. w2 accordingly, needs to have the same number of input activations. Increasing this number makes the model more capable to model complex functions.

The very simple relu function adds as an if statement and thereby decouples layers to remove linearity.

The lines of code in the simple_net are referred to as layers. In this case, two linear layers with a nonlinearity, or activation function.

Just as before, PyTorch provides a simple implementation of this already:

simple_net = nn.Sequential(nn.Linear(28*28, 30),
                           nn.Linear(30, 1))

nn.Sequential creates a module that will call each of the listed layers in turn and we can use it just like our original model.

learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(5, 0.1)

This is theoretically everything that is needed to solve any complex function - SGD and a single nonlinearity with two linear layers. Anything that comes from here is optimization!

What is our network actually learning?

I have mentioned the paper of Matthew D. Zeiler and R. Fergus before (Zeiler & Fergus, 2014) in a previous post. In it, the authors provide a fascinating insight into how a convolutional neural network learns to recognize increasingly complex visual features throughout its layers. Let’s see what our simple neural net is learning about recognizing ones.

To do so, we will simply extract the parameters of our first linear layer and plot them as a 28x28 pixel image:

m = learn.model
w, b = m[0].parameters()  # the parameters of the first linear layer
w[0].view(28, 28)
show_image(w[3].view(28, 28), figsize=(5,5), cmap=full_palette)

It looks like this set of weights is optimized for finding a vertical bar in the center of the image, typical for the digit one. Isn’t this fascinating?

One more thing: Moving to Multi-Class Classification

The chapter in the book (Howard & Gugger, 2020) ends with a homework assignment, asking the reader to apply these principles to the entire MNIST dataset.

So far, we have only built a classifier to distinguish between zeros and ones, i.e. binary classification. To solve the MNIST problem, however, we need to be able to classify all 10 handwritten digits.

Let’s start by preparing out data for the full set just as before:

# Turn the pixel values of the DataFrames into tensors:
train_x = tensor(train_df.iloc[:,1:]).float()/255
train_y = tensor(train_df.iloc[:,0])

valid_x = tensor(valid_df.iloc[:,1:]).float()/255
valid_y = tensor(valid_df.iloc[:,0])

# Combine the independent and dependent variables into a dataset:
dset = list(zip(train_x, train_y))
valid_dset = list(zip(valid_x, valid_y))

# Create a DataLoader for training and validation:
dl = DataLoader(dset, batch_size=5, shuffle=True)
valid_dl = DataLoader(dset, batch_size=5)

# Combine both into a single DataLoaders instance:
dls = DataLoaders(dl, valid_dl)
(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([7, 2, 1, 8, 7]))

So as you can see, this looks just like before, just that we now have more than two possible labels.

The first key change we need to make is to the architecture of the model. Our simple_net returned a single number for each set of pixels and was able to provide the probability of the input being a one. If we want to account for 10 numbers, we need to increase the output dimension to 10 accordingly:

mnist_net = nn.Sequential(nn.Linear(28*28, 30),
                          nn.Linear(30, 10))

The second key change we need to make to our learner is the loss function so as to be able to deal with more than one class. Remember our mnist_loss function used sigmoid to turn activations into numbers between zero and one, denoting the probability of each of the two classes. Therefore, we were able to calculate the loss by computing the difference to the actual number.

In this case, this will not work, because our dependent variable takes more shapes than just zero and one.

Let’s start with the same learner we used before and taking a look at the activations our loss function will see:

x, y = dls.one_batch()
preds,_ = learn.get_preds(dl=[(x, y)])
tensor([ 0.0820,  0.0791, -0.2281, -0.1365, -0.2006,  0.0334, -0.2755, -0.0951,
        -0.0061, -0.0384])
len(preds[0]), preds[0].sum()
(10, tensor(-0.7858))

As you can see, these probabilities do not add up to one, as they did before. Similar to what we did with the sigmoid function in mnist_loss, we want to turn our activations into numbers between zero and one, adding up to 100%. But rather than the probability of finding a one and its inverse, we want to achieve this for 10 numbers.

Softmax achieves this:


Let’s look at the impact on our data:

sm_acts = torch.softmax(preds, dim=1)
tensor([[0.1165, 0.1162, 0.0855, 0.0937, 0.0879, 0.1110, 0.0815, 0.0976, 0.1067,
        [0.1219, 0.1070, 0.0861, 0.0888, 0.0924, 0.1109, 0.1011, 0.0850, 0.1033,
        [0.1157, 0.1120, 0.0830, 0.0971, 0.0995, 0.1103, 0.0911, 0.1046, 0.0996,
        [0.1152, 0.1115, 0.0910, 0.0909, 0.0924, 0.1042, 0.0918, 0.0926, 0.1071,
        [0.1192, 0.1122, 0.0928, 0.1003, 0.0917, 0.1072, 0.0849, 0.0889, 0.1001,
len(sm_acts[0]), sm_acts[0].sum()
(10, tensor(1.0000))

Now all that is left to do is adapt the learner, accordingly, and train!

learn = Learner(dls, mnist_net, opt_func=SGD, loss_func=nn.CrossEntropyLoss(), metrics=accuracy)
learn.fine_tune(10, 0.1)
epoch train_loss valid_loss accuracy time
0 0.180775 0.170384 0.946488 00:11
epoch train_loss valid_loss accuracy time
0 0.064815 0.054088 0.983512 00:11
1 0.067600 0.047392 0.984970 00:11
2 0.069272 0.052175 0.983065 00:10
3 0.080398 0.060341 0.980387 00:11
4 0.042787 0.045250 0.985030 00:11
5 0.031226 0.030384 0.990595 00:10
6 0.019606 0.019908 0.995238 00:11
7 0.015953 0.016687 0.996548 00:10
8 0.016785 0.015338 0.997470 00:11
9 0.016400 0.015170 0.997470 00:11

Looks great! We are actually predicting the full set of digits with almost perfect accuracy! Now that we understand the inner workings of neural networks, we can focus on more advanced approaches in the upcoming lessons.


  1. Deng, L. (2012). The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), 141–142.
  2. Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. ECCV.
  3. Howard, J., & Gugger, S. (2020). Deep Learning for Coders with fastai and PyTorch. O’Reilly Media.