What is Loss in Deep Learning?


Loss is one of the most important functions in Deep Learning; therefore, it deserves an in-depth explanation.

Hello, I am back! I am sorry for the long absence, but I got covid (yes, after two years of quarantine and three vaccine shots, I managed to get covid). But I am so glad to be back after two weeks spent doom-scrolling on the couch! I’ll try to post twice this week to make it up to you folks, and I thought the first post could be a deep dive into the loss function. What is Loss in Deep Learning? Let’s find out!

The Loss Function, the very Backbone of Deep Learning

So while I was couch-ridden waiting for death, I started re-reading the fastai bible: Deep Learning for Coders with fastai and PyTorch. I talked about this book in this blog post, and frankly, I can’t say enough good things about it. It does a great job of introducing pivotal Machine Learning and Deep Learning concepts in a high-level, easy-to-understand way before going into further detail.

And it so happens that the book has a great explanation of the role of Loss in Machine Learning. In fact, there would be no Machine Learning without Loss. So, since it is such a fundamental concept, I thought I might try to sum it all up for you!

What exactly is the Function of the Loss Function?

So it turns out, the concept of Loss is as old as Machine Learning itself. Arthur Samuel, an IBM researcher, started looking for different ways of programming computers all the way back in 1949. In 1962, he wrote an essay that became a classic in the field: “Artificial Intelligence: A Frontier of Automation”; in this essay, he basically described Machine Learning as we have come to know it.

The idea was to show the computer examples of the problem we want solved, and let it figure out a solution for itself. To do so, Samuel said, we need to

[…] arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.

So when Samuel talks about “testing the effectiveness of any current weight assignment”, he is describing what we now call a loss function: a function that returns a small number when the performance of the model is good.

The purpose of the loss function is to measure the difference between the values the model predicts and the actual values – the targets or labels.

How does the Loss Function Work?

So what is the loss function, concretely? To make sure everything is clear up to this point, let’s set up an example.

Let’s say we are training a model to do some sentiment analysis. We want our model to be able to tell us if a sentence is positive or negative. We will say that a positive sentence is labeled as a 1 and a negative sentence is labeled as a 0.

So let’s say we have three sentences, and we know the first one is positive, the second one is negative, and the third one is positive. We can then make a target vector with these targets. We can also create a vector containing the predictions our model makes on whether these sentences are positive or negative. Such predictions must be a number between 0 and 1.

targets = tensor([1, 0, 1])
predictions = tensor([0.9, 0.2, 0.3])

These two vectors will be the inputs of our loss function, which will measure the distance between the predictions and the targets.


Writing our First Loss Function

Now that we know all of this, let us try and write our first loss function. As we said, it must measure the distance between the targets and the predictions, so we can just write:

def loss_function(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()

If we pass our predictions and targets vectors from before into the torch.where call, we get back one distance per sentence:

tensor([0.1000, 0.2000, 0.7000])

The final .mean() then averages these distances, so the function actually returns a single number: tensor(0.3333).

As you can see, the distances are lower when our model’s predictions are more accurate: accurate predictions that are more confident, and inaccurate predictions that are less confident, both bring the loss down.
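To see this in action, here is a minimal, self-contained sketch. The second set of predictions (and its values) is made up for illustration; it is simply more accurate and more confident than the first:

```python
import torch
from torch import tensor

def loss_function(predictions, targets):
    # 1 - p when the target is positive, p when it is negative
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

targets = tensor([1, 0, 1])
okay_predictions = tensor([0.9, 0.2, 0.3])
better_predictions = tensor([0.9, 0.1, 0.8])  # more accurate and more confident

print(loss_function(okay_predictions, targets))    # tensor(0.3333)
print(loss_function(better_predictions, targets))  # tensor(0.1333)
```

Better predictions mean a lower loss, which is exactly the signal the training loop needs.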

Great, that is exactly what we wanted! The only problem is that, as you recall, we assumed all our predictions would be numbers between 0 and 1. To ensure this is actually the case, we are going to use another function, the Sigmoid function.

The Sigmoid Function, our Best Friend

The Sigmoid function takes any real number as input and always outputs a number between 0 and 1.

We can define it as follows:

def sigmoid(x): return 1/(1+torch.exp(-x))
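A quick sanity check (the input values here are chosen arbitrarily): no matter how large or how negative the input, the output stays between 0 and 1, and an input of 0 maps exactly to 0.5:

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-10.0, -4.0, 0.0, 4.0, 10.0])
out = sigmoid(x)
print(out)      # all values strictly between 0 and 1
print(out[2])   # tensor(0.5000)
```

This squashing behavior is exactly what makes it safe to feed raw model outputs into our loss function.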

So let’s adjust our loss function by applying the sigmoid to our predictions first, to make sure they are values between 0 and 1:

def loss_function(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
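With that change, the function happily accepts raw, unbounded model activations. A small sketch (the activation values below are made up for illustration):

```python
import torch
from torch import tensor

def loss_function(predictions, targets):
    predictions = predictions.sigmoid()  # squash raw activations into (0, 1)
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

targets = tensor([1, 0, 1])
raw_activations = tensor([2.2, -1.4, -0.8])  # any real numbers are fine now
print(loss_function(raw_activations, targets))
```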

The Many Flavors of Loss

Now we have a fully functioning loss function that we can use with Stochastic Gradient Descent to optimize our model automatically, just as Samuel envisioned.
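As a rough illustration of how the pieces fit together, here is a toy training loop. The dataset, the learning rate, and the one-weight-per-feature linear “model” are all made up for this sketch; it is not the book’s code, just the simplest setup where gradient descent on our loss can be watched working:

```python
import torch

def loss_function(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

# a made-up toy dataset: 3 examples, 2 features each
features = torch.tensor([[1.0, 0.5], [-1.0, 0.3], [0.8, -0.2]])
targets = torch.tensor([1, 0, 1])

# the simplest possible "model": one weight per feature
weights = torch.zeros(2, requires_grad=True)

for step in range(100):
    predictions = features @ weights            # raw activations
    loss = loss_function(predictions, targets)  # how badly are we doing?
    loss.backward()                             # gradients of the loss w.r.t. the weights
    with torch.no_grad():
        weights -= 0.1 * weights.grad           # one SGD step
        weights.grad.zero_()

print(loss_function(features @ weights, targets))  # much lower than the initial 0.5
```

With all weights at zero, every prediction is sigmoid(0) = 0.5, so the starting loss is exactly 0.5; the loop then nudges the weights downhill until the loss is a fraction of that.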

However, if you have been paying attention, you might have noticed that our loss function only works for outputs that can be labeled as either 1 or 0. This means that if we want outputs that can take multiple values (think, for example, of Multi-label Classification or Image Classification), our loss function will not work.

What do we do now? Well, it turns out, there are multiple types of loss functions that we can use! So stay tuned for the next post, where we will talk more about what is loss in Deep Learning!