The one equation you need in order to understand logistic regression
A binary classification model that's technically a probability regression model under the hood.
Hello fellow machine learners,
Now that we’ve spoken about the kNN algorithm, we’re switching tack slightly to cover a rather interesting ML model which builds on the workings of linear regression. Namely, the logistic regression model.
Let’s dive right in!
Regression vs classification
A machine learning model is called a classifier if it takes data and assigns it a discrete label.
An example of a classifier is the kNN model we simulated two weeks ago, where we took data points and assigned each a colour, either red or green. We could only choose between one of two colours, with nothing in between.
Conversely, a regressor takes data and assigns it a value from a continuous spectrum. Do you recall the linear regression model? Once the model’s parameters have been found, we take the corresponding linear combination of the data features and we get a number in return.
So if you want to assign each data point a label from a discrete set, you’ll want to build a classification model. And if you want to map each data point to a value in a continuous range, then you’ll need a regression model.
The kNN is not the only example of a classification model. There are plenty of others, and we’ll now discuss one such example. Introducing the somewhat awkwardly named…
Logistic regression model
The logistic regression model takes a continuous input and maps it to a continuous range. This is usually done with the sigmoid function, given by

σ(x) = 1 / (1 + e^(-x))
This is a useful function, because no matter the input, the resultant value always lies between 0 and 1. The sigmoid function has some other useful properties which we may revisit down the line.
Why is this the case? Let’s have a think. The output of the exponential function is always greater than zero. If we add one, our result will always be greater than one. Finally, taking the reciprocal of this leaves us with a positive value in the range (0,1).
The larger the input value, the closer the sigmoid output will be to 1; and the smaller the input value, the closer the output is to 0.
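As a quick sanity check, here’s a minimal Python sketch (standard library only; the sample inputs are just illustrative) that evaluates the sigmoid at a few points:

import math

def sigmoid(x):
    # 1 / (1 + e^(-x)) always lands strictly between 0 and 1
    return 1 / (1 + math.exp(-x))

for x in [-10, -1, 0, 1, 10]:
    print(x, round(sigmoid(x), 5))
# -10 -> 5e-05, 0 -> 0.5, 10 -> 0.99995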
Our 1D linear regression model is of the form

y = wx + b,

where x is the feature, y is the target, w is the gradient and b is the y-intercept of the line. We can plug our linear regression output into the sigmoid function to get the following:

σ(wx + b) = 1 / (1 + e^(-(wx + b)))
The video below depicts how the regression line input affects the sigmoid output:
The shape of the sigmoid function resembles that of the line, but it’s a bit curvier and, more importantly, the input is always mapped to the interval (0,1). Nice!
The next step is to choose a threshold between 0 and 1. Let us call it threshold for now. So if our sigmoid output is greater than or equal to threshold, we return the label 1, otherwise we return the label 0:
def classify(x, w, b, threshold=0.5):
    # sigmoid as defined earlier: 1 / (1 + math.exp(-z))
    if sigmoid(w * x + b) >= threshold:
        return 1
    else:
        return 0
The default value for threshold is 0.5. A higher threshold biases outputs towards the ‘0’ label, and vice versa for a lower threshold. Modifications made to the values of w and b will alter the shape of the sigmoid output, which can then be compared to threshold for the resultant classification.
The above animation depicts how our classifications change depending on the value of threshold, which is indicated by the yellow dot. The dots with y-coordinates above the threshold are coloured green, and those below are coloured red (i.e. we are using colours instead of the 1/0 labels here).
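To make the threshold’s effect concrete, here is a small usage sketch of the classify function from above (the values of x, w and b are made up for illustration):

x, w, b = 0.2, 3.0, -0.5                  # w*x + b = 0.1, so sigmoid gives ~0.525
print(classify(x, w, b, threshold=0.5))   # 1, since 0.525 >= 0.5
print(classify(x, w, b, threshold=0.6))   # 0, since 0.525 < 0.6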
Extension to n dimensions?
Yup, I’m back to talking about n dimensions again, get used to it 😤
But indeed, we can plug in the ‘line of best fit’ formula y=wx+b, so why not extend this to account for n features?
That wasn’t a rhetorical question. Let’s try it! Given a multiple linear regression model

y = w_1x_1 + w_2x_2 + … + w_nx_n + b,
our sigmoid output will be given by

σ(w_1x_1 + … + w_nx_n + b) = 1 / (1 + e^(-(w_1x_1 + … + w_nx_n + b)))
As before, the result is mapped to the interval (0,1), and we then need to learn the optimal weights w_1,…,w_n and the intercept b.
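And here’s a minimal sketch of the n-dimensional case in Python, using NumPy for the linear combination (the weights, intercept and features below are made-up numbers):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.4, -1.2, 2.0])   # hypothetical learned weights, n = 3
b = 0.1                          # hypothetical intercept
x = np.array([1.0, 0.5, -0.3])   # one data point with 3 features

p = sigmoid(np.dot(w, x) + b)    # w_1x_1 + ... + w_nx_n + b, squashed into (0, 1)
print(p)                         # ~0.332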
Why is it called ‘logistic regression’ rather than ‘logistic classification’?
Our end result is a label from the discrete set {0,1}, which sounds like a classification task. But in order to get there, we leveraged the sigmoid function. Since the sigmoid output is restricted to the interval (0,1), we can think of it as the probability that the input has the label 1. Concretely,

P(y = 1 | x) = σ(w_1x_1 + … + w_nx_n + b)
The greater the linear combination inside the sigmoid, the closer this probability gets to 1, and so the probabilistic formulation makes sense. The key thing to notice is that our output, a probability, lies in a continuous range. This is why the model is called a logistic regressor, rather than a classifier. But just to be clear, the logistic regression model is used for classification. Don’t forget this!
Logistic regressor: linear or non-linear?
Despite how non-linear the sigmoid function looks, I claim that the logistic regressor is itself a linear model!
How is this the case? Well, if we rearrange the probability formula from the last section, we arrive at

log( P(y = 1 | x) / (1 - P(y = 1 | x)) ) = w_1x_1 + … + w_nx_n + b.
In words: the log-odds of the model’s output is exactly a linear combination of the x features, fully controlled by the weight coefficients w_i and the value of b.
One more thing: note that either y=1 or y=0, so these two probabilities sum up to 1; that is, P(y = 0 | x) = 1 - P(y = 1 | x). This means that in fact

log( P(y = 1 | x) / P(y = 0 | x) ) = w_1x_1 + … + w_nx_n + b.

In other words, we are directly modelling the logarithm of the ratio of the probabilities of the two values that y can take, as a linear function of the features.
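A tiny numerical check of this rearrangement (the value of z here is arbitrary):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z = 0.7                       # stands in for w_1x_1 + ... + w_nx_n + b
p = sigmoid(z)                # P(y = 1 | x)
print(math.log(p / (1 - p)))  # 0.7 again: the log-odds recover the linear input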
Closed form solution?
You may recall a few weeks ago that we derived a closed form solution for the linear regression model. This means that, given the input data, we could deterministically compute the gradient(s) and y-intercept of the ‘line of best fit’.
Sadly, there does not exist a closed form solution for logistic regression. However, there are methods we can leverage to approximate a solution. We don’t have time in this article, but stay tuned, I’m sure I’ll get to it at some point.
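In the meantime, if you’d like to experiment, libraries such as scikit-learn fit the weights with exactly this kind of iterative solver. Here’s a quick sketch using its LogisticRegression class on some made-up toy data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up toy data: one feature, two classes
X = np.array([[0.1], [0.4], [0.9], [1.3], [1.8], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()          # fits w and b with an iterative solver (lbfgs by default)
model.fit(X, y)
print(model.coef_, model.intercept_)  # the learned w and b
print(model.predict_proba([[1.0]]))   # [P(y=0|x), P(y=1|x)] for x = 1.0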
Packing it all up
Another week, another ML algorithm 😎 let’s summarise what we’ve learned:
Classification models aim to assign discrete labels to data, whereas regression models aim to allocate a value from a continuous range.
The logistic regression model applies a regression technique under the hood, but acts as a classifier on the output data.
The sigmoid function takes any real value and maps it to the interval (0,1). This isn’t the last we’ll see of this function, so keep it in mind!
Despite the use of the non-linear sigmoid function, the logistic regression model is indeed a linear model.
Training complete!
I really hope you enjoyed reading!
I would be happy to write another logistic regression article which focusses more on the implementation; I just wanted to outline the abstract concepts in this week’s article. If this is of interest though, then do leave a comment and I can dedicate a future article to this.
Aside from that, do leave a comment if you’re unsure about anything, if you think I’ve made a mistake somewhere, or if you have a suggestion for what we should learn about next 😎
Until next Sunday,
Ameer
PS… like what you read? If so, feel free to subscribe so that you’re notified about future newsletter releases:
Sources
Lecture notes from Carnegie Mellon University on logistic regression.