Different Types of Loss Functions in Machine Learning


We know that the loss function is a fundamental building block of training a model. In this article, we will talk about different types of loss functions in Machine Learning.

Hello! As promised, a new post to talk about different types of Loss Functions in Machine Learning! In my previous article, I talked about what a loss function is in Machine Learning, and I left you on a bit of a cliffhanger, so without any further ado, let’s talk Machine Learning!

The Basic Loss Function

Just as a quick refresher, last time we talked about a simple loss function that would tell us the difference between our target values and our predicted values, given that our predicted values were a number between 0 and 1. To ensure our values were between 0 and 1, we used a sigmoid function on our predictions.

We defined it as follows:

def loss_function(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
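
Just to make this concrete, here is a tiny, hypothetical usage example (the tensor values are made up purely for illustration):

import torch

# raw model outputs for three items, and their 0/1 targets
predictions = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1, 0, 1])

loss = loss_function(predictions, targets)
print(loss)  # a single number between 0 and 1; lower is better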

Now, as we have seen, this function works well for target values that are either 1 or 0. But we might have targets that look very different from that.

What do we do then? It turns out we can use different types of loss functions depending on the task at hand. Let’s have a look at a few examples.

Different Types of Loss Functions

So let’s set the scene. Let’s say we want to train a model to categorize a book as either adventure, romance, sci-fi, biography, thriller, or historical – in that order (meaning the first column in our vector corresponds to adventure and so forth). We could use a summary of each book as input, and the category label as output. 

To make our life easier for the purpose of this example, we will not talk about how the model pre-processes and represents the text input. Let’s assume it’s just magic. For the same reason, for the time being, let’s also assume our books can belong to only one category – so a book can be either sci-fi or adventure, but not both.

So let’s focus on our output. What would our output look like in this case? If we assume that each one of these categories corresponds to a number between 0 and 5, for each single input x, we would have an output y that looks something like this:

y = 2

In this case, for example, our book belongs to the third category (since we start counting from 0), meaning sci-fi. Let’s have a look at our predictions vector as well:

predictions = tensor([0.05, 0.00, 0.90, 0.00, 0.05, 0.00])

Each one of the values in our predictions vector corresponds to a probability between 0 and 1. This is the probability of the book belonging to the corresponding category. As you can see, all the probabilities add up to one. Pretty neat, right?

Softmax

To achieve this amazing feat, we need to use a little function called the softmax activation function. Softmax is the equivalent of the sigmoid function for multi-category problems. We use it every time we have more than two categories, and we want the probabilities to add up to one. 

We can define it as follows:

def softmax(x):
    return x.exp() / x.exp().sum(dim=1, keepdim=True)

That’s all nice and fun, but what does it mean? Taking the exponential of our input ensures all numbers are positive, and dividing by the sum of exponentials ensures all our numbers will add up to one. Now, this function has two very nice properties:

  • Taking the exponential amplifies differences: an activation that is only slightly larger than the others ends up with a much larger share of the probability after normalization.
  • In turn, this means that this function is really set on picking one category, which makes it perfect for those cases where we know our output is a single, well-defined label (the short example after this list shows both points).
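
To see this in action, here is a small, made-up example using the softmax we just defined (the activation values are arbitrary):

import torch

# raw activations for one book across our six categories (made-up numbers)
activations = torch.tensor([[1.0, -2.0, 4.0, -2.0, 1.0, -2.0]])

probabilities = softmax(activations)
print(probabilities)        # the largest activation (4.0) grabs roughly 0.90 of the probability
print(probabilities.sum())  # tensor(1.) – the probabilities add up to one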

Log Likelihood

Now, once we have applied softmax, we can consider the loss for the correct label only, since all the other values add up to 1 minus the prediction for the correct label. So maximizing the value of the correct prediction means we decrease all the other values.

This works quite well as a loss function, but we can do better. We are dealing with probabilities between 0 and 1, which means that our model will barely distinguish a correct prediction of 0.99 from one of 0.999. Yet 0.999 leaves ten times less room for error than 0.99.

What we can do is take the logarithm of our values, which maps probabilities between 0 and 1 onto values between negative infinity and zero, and then take the mean. This way, those small differences in probabilities that were otherwise ignored by our model are stretched out and accounted for. In practice we negate this mean log – the negative log likelihood – so that we get a positive number to minimize.
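
Here is a minimal sketch of that idea, assuming PyTorch and using made-up numbers: we pick out the probability the model assigned to the correct category and take the negative of its log.

import torch

# softmax outputs for two books (rows) over six categories, and the correct class indices
probabilities = torch.tensor([[0.05, 0.00, 0.90, 0.00, 0.05, 0.00],
                              [0.10, 0.70, 0.05, 0.05, 0.05, 0.05]])
targets = torch.tensor([2, 1])

# probability assigned to the correct category for each book
correct = probabilities[torch.arange(len(targets)), targets]  # tensor([0.9000, 0.7000])

loss = -correct.log().mean()  # negative log likelihood
print(loss)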

Cross Entropy Loss

When we apply softmax and then the log likelihood to our predictions, we are applying a loss function called Cross Entropy Loss. 

The main advantages of this type of loss function in Machine Learning are:

  • it works even when we have more than two possible categories as output
  • training is faster and more reliable
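
In practice, if you are using PyTorch, you don’t have to apply softmax and the log yourself: torch.nn.functional.cross_entropy (or the nn.CrossEntropyLoss module) combines both steps and takes the raw activations directly. A minimal sketch, with made-up numbers:

import torch
import torch.nn.functional as F

# raw activations for two books over our six categories, and the correct class indices
activations = torch.tensor([[1.0, -2.0, 4.0, -2.0, 1.0, -2.0],
                            [0.5,  3.0, 0.0,  0.0, -1.0,  0.0]])
targets = torch.tensor([2, 1])

loss = F.cross_entropy(activations, targets)  # softmax, log and negative mean in one call
print(loss)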

Yet Another Type of Loss Function in Machine Learning

Another common problem we might have to solve is multi-label classification. In our previous example, we wanted our books to be associated with just one category: either sci-fi or adventure, but not both. However, literature doesn’t really work that way, and we might want to be able to classify our books as belonging to multiple categories.

In this case, our targets would be a vector with a 0 or 1 for each category (a multi-hot encoding rather than a one-hot one), such as:

targets = tensor([1, 0, 1, 0, 0, 0])

In our targets vector, each number (each column) represents a possible category our book can belong to. If that position holds a 1, the book belongs to that category; if it holds a 0, it doesn’t. In this case, our book is considered an adventure sci-fi novel. Our Cross Entropy Loss would not work here, since it assumes exactly one correct category. So what type of loss function can we use for this machine learning problem? We can use Binary Cross Entropy Loss.

How does it work? It is actually pretty straightforward: we start from the very first loss function we talked about (our basic loss function), but instead of averaging the distances from the targets, we take, for each column, the probability the model assigned to the correct answer, take its log and negate it – the same log-likelihood trick as before, applied to every category independently:

def binary_cross_entropy(predictions, targets):
    predictions = predictions.sigmoid()
    return -torch.where(targets==1, predictions, 1-predictions).log().mean()
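
PyTorch also ships this one: torch.nn.functional.binary_cross_entropy_with_logits (or the nn.BCEWithLogitsLoss module) applies the sigmoid and the binary cross entropy in a single, numerically stable step. A quick, hypothetical sketch:

import torch
import torch.nn.functional as F

# raw activations for one book over our six categories, and its multi-label targets
activations = torch.tensor([[2.0, -3.0, 1.5, -2.0, -1.0, -3.0]])
targets = torch.tensor([[1., 0., 1., 0., 0., 0.]])

loss = F.binary_cross_entropy_with_logits(activations, targets)
print(loss)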

Each prediction is then compared to the target in the corresponding column of our vector. To turn the probabilities into final labels, we pick a threshold: each value above that threshold will be considered a one, and each value below the threshold a zero.
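
For instance, here is a minimal sketch of that last step, assuming a threshold of 0.5 (the threshold itself is a choice you would tune on validation data):

import torch

activations = torch.tensor([[2.0, -3.0, 1.5, -2.0, -1.0, -3.0]])
probabilities = activations.sigmoid()

predicted_labels = (probabilities > 0.5)  # one True/False per category
print(predicted_labels)  # tensor([[ True, False,  True, False, False, False]])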

Different Types of Loss Functions in Machine Learning

So we have now seen three different types of loss functions in machine learning. These are widely used and fascinating, so I invite you to look further into loss functions! If you won’t, well… your loss (: