The one equation you need in order to understand logistic regression
A binary classification model that's technically a probability regression model under the hood.
Hello fellow machine learners,
Now that we’ve spoken about the kNN algorithm, we’re switching tack slightly to cover a rather interesting ML model which builds on the workings of linear regression. Namely, the logistic regression model.
Let’s dive right in!
Regression vs classification
A machine learning model is called a classifier if it takes data and assigns it a discrete label.
An example of a classifier is the kNN model we simulated two weeks ago, where we took data points and assigned each a colour, either red or green. We could only choose between one of two colours, with nothing in between.
Conversely, a regressor takes data and assigns it a value from a continuous spectrum. Do you recall the linear regression model? Once the model’s parameters have been found, we take the corresponding linear combination of the data features and we get a number in return.
So if you want to assign each data point a label from a discrete set, you’ll want to build a classification model. And if you want to map each data point to a value in a continuous range, then you’ll need a regression model.
The kNN is not the only example of a classification model. There are plenty of others, and we’ll now discuss one such example. Introducing the somewhat awkwardly named…
Logistic regression model
The logistic regression model takes a continuous input and maps it to a continuous range. This is usually done with the sigmoid function, given by

σ(x) = 1 / (1 + e^(-x))
This is a useful function, because no matter the input, the resultant value always lies between 0 and 1. The sigmoid function has some other useful properties which we may revisit down the line.
Why is this the case? Let’s have a think. The output of the exponential function is always greater than zero. If we add one, our result will always be greater than one. Finally, taking the reciprocal of this leaves us with a positive value in the range (0,1).
The larger the input value, the closer the sigmoid output will be to 1; and the smaller the input value, the closer the output is to 0.
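As a quick sanity check, here’s a minimal Python sketch (standard library only; the sample inputs are just illustrative) that evaluates the sigmoid at a few points:

import math

def sigmoid(x):
    # 1 / (1 + e^(-x)) always lands strictly between 0 and 1
    return 1 / (1 + math.exp(-x))

for x in [-10, -1, 0, 1, 10]:
    print(x, round(sigmoid(x), 5))
# -10 -> 5e-05, 0 -> 0.5, 10 -> 0.99995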
Our 1D linear regression model is of the form

y = wx + b,

where x is the feature, y is the target, w is the gradient and b is the y-intercept of the line. We can plug our linear regression output into the sigmoid function to get the following:

σ(wx + b) = 1 / (1 + e^(-(wx + b)))
The video below depicts how the regression line input affects the sigmoid output:
The shape of the sigmoid function resembles that of the line, but it’s a bit curvier and, more importantly, the input is always mapped to the interval (0,1). Nice!
The next step is to choose a threshold between 0 and 1. Let us call it threshold for now. So if our sigmoid output is greater than or equal to threshold, we return the label 1, otherwise we return the label 0:
def classify(x, w, b, threshold=0.5):
    # sigmoid as defined earlier: 1 / (1 + math.exp(-z))
    if sigmoid(w * x + b) >= threshold:
        return 1
    else:
        return 0
The default value for threshold is 0.5. A higher threshold biases outputs towards the ‘0’ label, and vice versa for a lower threshold. Modifications made to the values of w and b will alter the shape of the sigmoid output, which can then be compared to threshold for the resultant classification.
The above animation depicts how our classifications change depending on the value of threshold, which is indicated by the yellow dot. The dots with y-coordinates above the threshold are coloured green, and those below are coloured red (i.e. we are using colours instead of the 1/0 labels here).
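To make the threshold’s effect concrete, here is a small usage sketch of the classify function from above (the values of x, w and b are made up for illustration):

x, w, b = 0.2, 3.0, -0.5                  # w*x + b = 0.1, so sigmoid gives ~0.525
print(classify(x, w, b, threshold=0.5))   # 1, since 0.525 >= 0.5
print(classify(x, w, b, threshold=0.6))   # 0, since 0.525 < 0.6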
Extension to n dimensions?
Yup, I’m back to talking about n dimensions again, get used to it 😤
But indeed, we can plug in the ‘line of best fit’ formula y=wx+b, so why not extend this to account for n features?
That wasn’t a rhetorical question. Let’s try it! Given a multiple linear regression model

y = w_1x_1 + w_2x_2 + … + w_nx_n + b,
our sigmoid output will be given by

σ(w_1x_1 + … + w_nx_n + b) = 1 / (1 + e^(-(w_1x_1 + … + w_nx_n + b)))
As before, the result is mapped to the interval (0,1), and we then need to learn the optimal weights w_1,…,w_n and the intercept b.
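And here’s a minimal sketch of the n-dimensional case in Python, using NumPy for the linear combination (the weights, intercept and features below are made-up numbers):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.4, -1.2, 2.0])   # hypothetical learned weights, n = 3
b = 0.1                          # hypothetical intercept
x = np.array([1.0, 0.5, -0.3])   # one data point with 3 features

p = sigmoid(np.dot(w, x) + b)    # w_1x_1 + ... + w_nx_n + b, squashed into (0, 1)
print(p)                         # ~0.332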
Why is it called ‘logistic regression’ rather than ‘logistic classification’?
Our end result is a label from the discrete set {0,1}, which sounds like a classification task. But in order to get there, we leveraged the sigmoid function. Since the sigmoid output is restricted to the interval (0,1), we can think of it as the probability that the input has the label 1. Concretely,

P(y = 1 | x) = σ(w_1x_1 + … + w_nx_n + b)
The greater the linear combination inside the sigmoid, the closer this probability gets to 1, and so the probabilistic formulation makes sense. The key thing to notice is that our output, a probability, lies in a continuous range. This is why the model is called a logistic regressor, rather than a classifier. But just to be clear, the logistic regression model is used for classification. Don’t forget this!
Logistic regressor: linear or non-linear?
Despite how non-linear the sigmoid function looks, I claim that the logistic regressor is itself a linear model!
How is this the case? Well, if we rearrange the probability formula from the last section, we arrive at

log( P(y = 1 | x) / (1 - P(y = 1 | x)) ) = w_1x_1 + … + w_nx_n + b.
In words: the log-odds of the model’s output is exactly a linear combination of the x features, fully controlled by the weight coefficients w_i and the value of b.
One more thing: note that either y=1 or y=0, so these two probabilities sum up to 1; that is, P(y = 0 | x) = 1 - P(y = 1 | x). This means that in fact

log( P(y = 1 | x) / P(y = 0 | x) ) = w_1x_1 + … + w_nx_n + b.

In other words, we are directly modelling the logarithm of the ratio of the probabilities of the two values that y can take, as a linear function of the features.
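A tiny numerical check of this rearrangement (the value of z here is arbitrary):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z = 0.7                       # stands in for w_1x_1 + ... + w_nx_n + b
p = sigmoid(z)                # P(y = 1 | x)
print(math.log(p / (1 - p)))  # 0.7 again: the log-odds recover the linear input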
Closed form solution?
You may recall a few weeks ago that we derived a closed form solution for the linear regression model. This means that, given the input data, we could deterministically compute the gradient(s) and y-intercept of the ‘line of best fit’.
Sadly, there does not exist a closed form solution for logistic regression. However, there are methods we can leverage to approximate a solution. We don’t have time in this article, but stay tuned, I’m sure I’ll get to it at some point.
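In the meantime, if you’d like to experiment, libraries such as scikit-learn fit the weights with exactly this kind of iterative solver. Here’s a quick sketch using its LogisticRegression class on some made-up toy data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up toy data: one feature, two classes
X = np.array([[0.1], [0.4], [0.9], [1.3], [1.8], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()          # fits w and b with an iterative solver (lbfgs by default)
model.fit(X, y)
print(model.coef_, model.intercept_)  # the learned w and b
print(model.predict_proba([[1.0]]))   # [P(y=0|x), P(y=1|x)] for x = 1.0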
Packing it all up
Another week, another ML algorithm 😎 let’s summarise what we’ve learned:
Classification models aim to assign discrete labels to data, whereas regression models aim to allocate a value from a continuous range.
The logistic regression model applies a regression technique under the hood, but acts as a classifier on the output data.
The sigmoid function takes any real value and maps it to the interval (0,1). This isn’t the last we’ll see of this function, so keep it in mind!
Despite the use of the non-linear sigmoid function, the logistic regression model is indeed a linear model.
Training complete!
I really hope you enjoyed reading!
I would be happy to write another logistic regression article which focusses more on the implementation; I just wanted to outline the abstract concepts in this week’s article. If this is of interest though, then do leave a comment and I can dedicate a future article to this.
Aside from that, do leave a comment if you’re unsure about anything, if you think I’ve made a mistake somewhere, or if you have a suggestion for what we should learn about next 😎
Until next Sunday,
Ameer
PS… like what you read? If so, feel free to subscribe so that you’re notified about future newsletter releases:
Sources
Lecture notes from Carnegie Mellon University on logistic regression.