Machine Learning Fundamentals: Cost Function and Gradient Descent

ML concepts every Data Scientist or ML engineer should know

Anmol Tomar
Geek Culture


Machine Learning Fundamentals (Pic by Author)

Introduction

The cost function and gradient descent are two of the most important concepts you need to understand in order to learn how machine learning algorithms work. In this blog, we will look at the intuition behind both.

We will use the simplest machine learning algorithm, linear regression, to understand these concepts.

We will cover the following topics in sequence:

  1. Understanding the linear regression model,
  2. Cost function, and
  3. Gradient Descent.

Linear regression

A machine learning model tries to capture the relationship between the input features and the output feature. Imagine we have data containing the heights and weights of thousands of people. We want to use this data to create a machine learning model that takes a person's height as input and predicts their weight.

If we look at the relationship between weight (in pounds) and height (in inches), we can see that it looks linear, i.e. as height increases, weight also increases.

This kind of relationship between the input feature (height) and the output feature (weight) can be captured by a linear regression model, which tries to fit a straight line to the data.

The following is the equation of the line fitted by a simple linear regression model:

Y = mx + c

Y is the output feature (weight), m is the slope of the line, x is the input feature (height), and c is the intercept (the predicted weight when height is 0, as shown below).

Y = m(0) + c = c

For different values of the slope ‘m’ and the constant ‘c’, we get different lines, as shown in the graph below.
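To make this concrete, here is a minimal sketch (the heights and the (m, c) pairs are made-up values, not fitted to any real data) showing how different choices of ‘m’ and ‘c’ produce different lines, i.e. different predicted weights for the same heights:

```python
# Predict weight (in pounds) from height (in inches) using y = m*x + c
def predict_weight(height, m, c):
    return m * height + c

heights = [60, 65, 70, 75]  # hypothetical heights in inches

# Two hypothetical (slope, intercept) pairs give two different lines
for m, c in [(3.0, -50.0), (4.5, -130.0)]:
    predictions = [predict_weight(h, m, c) for h in heights]
    print(f"m={m}, c={c} -> {predictions}")
```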

How does linear regression find the best-fit line?

Now, the question is: how does a linear regression model find the line that best fits our data?

The answer is the Cost function and Gradient Descent!

First, let’s look at the cost function.

Cost Function

A cost function is a mathematical function that is minimized to obtain the optimal values of the slope ‘m’ and the constant ‘c’. The cost function associated with linear regression is called the mean squared error (MSE) and can be written as follows:

MSE = (1/n) * Σ (yᵢ - ŷᵢ)²

where n is the number of data points, yᵢ is the actual weight, and ŷᵢ = m*xᵢ + c is the predicted weight. The error for each data point is the difference between the actual and predicted values, yᵢ - ŷᵢ.

Suppose the actual and predicted weights are as follows:

Actual vs Predicted Weights

We can calculate the MSE as follows:
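As a small sketch of this calculation in code (the actual and predicted weights below are hypothetical, not the values from the table above):

```python
# Hypothetical actual vs predicted weights (in pounds)
actual    = [140.0, 155.0, 165.0, 180.0]
predicted = [145.0, 150.0, 170.0, 175.0]

n = len(actual)
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
mse = sum(squared_errors) / n
print(mse)  # (25 + 25 + 25 + 25) / 4 = 25.0
```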

We can adjust the equation a little to make the calculations easier down the line: dividing by 2n instead of n does not change where the minimum is, but it cancels the factor of 2 that appears when we take the derivatives later.

The updated cost function (the sum of squared errors divided by 2n):

J(m, c) = (1/(2n)) * Σ (yᵢ - ŷᵢ)²

The value of this cost function changes as the values of the slope ‘m’ and the constant ‘c’ change.

Now, we need to find the values of ‘m’ and ‘c’ for which the cost function is at its minimum. Intuitively, we want the predicted weights to be as close as possible to the actual weights.

But how do we do that? There is a huge number of possible combinations of ‘m’ and ‘c’; we cannot test them all.

Gradient Descent is the solution!

Gradient Descent

Gradient descent is a technique for minimizing the output of a function, which in the case of linear regression is the mean squared error.

Let’s understand how gradient descent works in the context of linear regression.

The cost function of linear regression (MSE) is a convex function, i.e. it has only one minimum across the range of values of the slope ‘m’ and the constant ‘c’, as shown in the figure below (the cost function is represented by J(m, c)).

Mathematically, gradient descent works by calculating the partial derivative, or slope, of the cost function with respect to the current values of ‘m’ and ‘c’, and then moving both values in the direction opposite to that slope:

m = m - α * ∂J/∂m
c = c - α * ∂J/∂c

At each step, the values of ‘m’ and ‘c’ are updated simultaneously. The updates continue until we reach the values of ‘m’ and ‘c’ for which the cost function is at its minimum.

The learning rate (alpha, the α in the update rule above) controls how big each update to ‘m’ and ‘c’ is: the larger the value of alpha, the bigger the update.

Because the update subtracts learning rate * slope, the sign of the slope determines the direction of the step: if the slope is negative, the value of ‘m’ increases, and if the slope is positive, the value of ‘m’ decreases. The same is true for the value of ‘c’. Either way, we move toward the minimum.

Gradient descent illustration (pic credits: jeremyjordan)
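As a tiny worked example of the update rule (the numbers below are made up purely for illustration):

```python
alpha = 0.1  # learning rate
m = 2.0      # current value of the slope

# Negative slope of the cost function: m increases
m_after_negative = m - alpha * (-5.0)  # 2.0 - 0.1 * (-5.0) = 2.5

# Positive slope of the cost function: m decreases
m_after_positive = m - alpha * (5.0)   # 2.0 - 0.1 * 5.0 = 1.5

print(m_after_negative, m_after_positive)  # 2.5 1.5
```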

Now, coming back to the partial derivative terms, this is how they are obtained from the cost function (you can skip this if it feels like too much math). With ŷᵢ = m*xᵢ + c and J(m, c) = (1/(2n)) * Σ (yᵢ - ŷᵢ)², applying the chain rule gives:

∂J/∂m = -(1/n) * Σ xᵢ * (yᵢ - ŷᵢ)
∂J/∂c = -(1/n) * Σ (yᵢ - ŷᵢ)
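Putting all of this together, here is a minimal sketch of gradient descent for simple linear regression. The data, learning rate, and number of iterations are made-up values for illustration, and the input is standardized first (a common practical step, not covered above) so that gradient descent converges quickly:

```python
# Fit weight = slope * height + intercept with gradient descent
heights = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0, 75.0]         # inches (hypothetical)
weights = [120.0, 130.0, 145.0, 155.0, 165.0, 175.0, 190.0]  # pounds (hypothetical)
n = len(heights)

# Standardize the input feature so gradient descent converges quickly
mean_h = sum(heights) / n
std_h = (sum((h - mean_h) ** 2 for h in heights) / n) ** 0.5
x = [(h - mean_h) / std_h for h in heights]

m, c = 0.0, 0.0  # initial guesses
alpha = 0.1      # learning rate

for _ in range(1000):
    errors = [y - (m * xi + c) for xi, y in zip(x, weights)]
    # Partial derivatives of J(m, c) = (1/(2n)) * sum(error^2)
    dJ_dm = -sum(xi * e for xi, e in zip(x, errors)) / n
    dJ_dc = -sum(errors) / n
    # Simultaneous update of m and c
    m = m - alpha * dJ_dm
    c = c - alpha * dJ_dc

# Convert back to the original height scale
slope = m / std_h
intercept = c - m * mean_h / std_h
print(f"slope = {slope:.2f} lbs per inch, intercept = {intercept:.1f} lbs")
# For this toy data the fit settles around slope ≈ 4.6 and intercept ≈ -153.5
```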

Summary

Every machine learning model has a cost function (the mean squared error in the case of linear regression) that measures how close the predicted values are to the actual values.

Gradient descent is used to get to the minimum value of the cost function. Intuitively, gradient descent finds the slope of the cost function at every step and travels down the valley to reach the lowest point (minimum of the cost function).

Thank You!

