Visualising the k-Nearest Neighbours algorithm
What the UK's train services have taught me about supervised machine learning.
Hello fellow machine learners,
Last week, we discussed how to extend one-dimensional linear regression to account for multiple variables. And I went off on a bit of a rant about higher dimensional space.
(And believe me, there’ll be plenty more where that came from…)
This week, we will unpack the k-Nearest Neighbours (kNN) algorithm and consider its relationship to, and utility within, real-world data.
Sound good? Then vamos, amigos!
Intro
Picture yourself in the following situation:
You’re on a solo rail trip awaiting your train at the station, positively buzzing with excitement at the wonders that may lie in store at your destination. Your anticipation is palpable!
But all of a sudden, the plan goes off the rails.
Erm, not literally though.
Rather, the train is suffering a mechanical fault and so will be delayed. The staff on the tannoy announce that they don’t know how long the delay will last.
In the meantime, they provide you with the following options:
Stay on the train until it’s back up and running again.
Get off this train and find an alternative service, whether it be a bus, tram, another train, etc.
Swear under your breath at the UK’s public rail services. (Ok maybe they don’t suggest this one, but you can do it if it’d make you feel better.)
What choice do you make?
Perhaps you, a member of the audience, are a maverick who makes their own decisions and doesn’t care what anyone else thinks. A real trailblazer, a pioneer, a genuine trendsetter.
Unfortunately, I am not that kinda guy.
I would most likely look around and see what the people near me are doing 😅
So if I saw a few people get up and leave to look for other transport means, I’d do the same. Conversely, if my peers were to stay put in such a scenario, so would I.
I might just look at what the three people closest to me are doing and make my choice accordingly.
And believe it or not, this basically explains how the k-nearest neighbours algorithm works when k=3.
Instead of people, we will consider data points on a 2D grid; and instead of leaving vs staying on the train, we consider each data point as having been assigned a colour, either red or green.
Here’s what this may look like with data points on a 2D grid:
A new data point pops up a few seconds in, which has yet to be assigned a colour. Out of its three closest neighbours, two of them are green and one is red. So by majority decision, this new point takes on the colour green.
We repeat this process for two other new points, which are classified by the algorithm as red and green respectively.
The value of k represents the number of neighbours we scan across to make our classification. You can make k equal to whatever positive integer you like. We could look at just our nearest neighbour (k=1), or the entire dataset. If k is an even number, then we would have to determine how to break ties as well. But more on the value of k later. In the meantime, let’s go through…
How we define the ‘neighbours’ of a point
To evaluate the distance between two points in 2D, we take the square root of the sum of the squared horizontal and vertical differences, as illustrated below:
This measure of distance is called the Euclidean distance. There are many other ways we can measure distance, but we’ll come back to this next week.
One thing to bear in mind is that the formula for the Euclidean distance extends naturally to data points in n dimensions. Concretely, if x and y are n-dimensional data points, e.g.

x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n),

then the Euclidean distance between them is given by

d(x, y) = √((x_1 − y_1)² + (x_2 − y_2)² + … + (x_n − y_n)²).
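If you’d like to see this in code, here’s a minimal sketch in Python (the function name euclidean_distance is just my choice for illustration):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two n-dimensional points (sequences of numbers)."""
    if len(x) != len(y):
        raise ValueError("Both points must have the same number of dimensions")
    # Square the difference along each dimension, sum them up, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Example: two 3-dimensional points that are exactly 5 units apart.
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```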
How does the algorithm work?
Suppose you have a labelled dataset of points, and you wish to allocate a label to a new point. Let’s call this point new_point for clarity. Here is how this could be done using the kNN algorithm:

1. Calculate the ‘distance’ from new_point to every other point in the dataset and store these values. These could be stored as tuples in an array, for example, where the first element of each tuple is the distance and the second element is the point itself.

2. Order the array in ascending order of distance, so that the first tuple in the list represents the point closest to new_point, and the last element represents the point furthest away from new_point.

3. Check the labels of the first k elements in the list.

4. Allocate a label to new_point depending on the majority label from its k nearest neighbours. You may have to choose how to break ties when the vote is split evenly, e.g. if k=4 and you only have two labels for your data points.
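To make these steps concrete, here’s a rough sketch of them in Python. The helper name knn_classify and the toy red/green points are purely illustrative, and note that this sketch breaks ties by taking the label of the nearer neighbour rather than always choosing one colour:

```python
import math
from collections import Counter

def knn_classify(dataset, new_point, k=3):
    """Label new_point by majority vote among its k nearest neighbours.

    dataset is a list of (point, label) pairs, where each point is a tuple of numbers.
    """
    # Step 1: compute the distance from new_point to every labelled point.
    distances = [(math.dist(point, new_point), label) for point, label in dataset]
    # Step 2: sort the (distance, label) tuples in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: take the labels of the k closest points.
    k_labels = [label for _, label in distances[:k]]
    # Step 4: majority vote (a tie goes to the label of the closer neighbour).
    return Counter(k_labels).most_common(1)[0][0]

# Toy red/green points on a 2D grid.
points = [((1, 1), "red"), ((2, 1), "red"),
          ((5, 5), "green"), ((6, 5), "green"), ((5, 6), "green")]
print(knn_classify(points, (4, 4), k=3))  # green
```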
Relationship to raw data
Recall from our previous discussions of linear regression1 that each feature of a dataset can be used to represent a dimension of a data point. So for a dataset with n feature fields and one label per row, we can think of each row of data as a data point in n dimensions.
If the features are all numbers, then we can classify new points using kNN with ease.
However, if our data consists of categorical features (i.e. not numerical), then we will have to encode them before applying the algorithm2.
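As a taster (more on this in a future issue), one common option is one-hot encoding, where each category becomes its own 0/1 column. The ‘transport’ feature below is made up purely for illustration:

```python
# A made-up categorical feature: the transport each stranded passenger switched to.
transport = ["train", "bus", "train", "tram"]

# One-hot encoding: one 0/1 column per category, so distances can be computed.
categories = sorted(set(transport))  # ['bus', 'train', 'tram']
encoded = [[1 if value == category else 0 for category in categories]
           for value in transport]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```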
How do we choose the value for k?
Ah, the age-old question.
Essentially, you may have to experiment with the value of k.
Conduct a train-test split and build some kNN models, each using a different value of k. Then compare their performance and see what’s best.
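If you’re happy to lean on scikit-learn, that experiment could look something like the sketch below. The dataset here is randomly generated, purely as a stand-in for your own numeric features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Randomly generated data standing in for your own numeric features and labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Build one kNN model per candidate k and compare accuracy on the held-out test set.
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"k={k}: test accuracy = {model.score(X_test, y_test):.2f}")
```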
But don’t forget…
…the higher the dimensionality of the data (and the more data points involved), the longer it’ll take the algorithm to compute the distance of every point from the new point. And attempting to classify the entire testing set using kNN will only multiply this runtime.
In the meantime, I have simulated what happens for a few different k values in the case of our initial demonstration. Note that, in my implementation, the colour green is chosen in the case of any ties:
The point is classified as green for lower values of k, but is then classified as red as the k value increases up to 6.
Packing it all up
Here is what we’ve learned this week:
The kNN algorithm is an example of a supervised learning algorithm which classifies new data points by scanning the labels of the k closest data points.
Computing distances from one point to all others in a dataset is computationally expensive, which can make it difficult to experiment with different k values for large datasets/datasets with many features.
The formula for the Euclidean distance extends naturally for higher dimensions. All we need to do is extend the summation to account for the extra dimensions of the input data.
UK trains are crazy expensive and sometimes unreliable.
Training complete!
I really hope you enjoyed reading!
Do leave a comment if you’re unsure about anything, if you think I’ve made a mistake somewhere, or if you have a suggestion for what we should learn about next 😎
Until next Sunday,
Ameer
PS… like what you read? If so, feel free to subscribe so that you’re notified about future newsletter releases:
I’d be down to write about how encoding works. If this would be of interest, then do let me know 😎