Why data scientists should care about writing clean code
Trying to follow coding best practices as a data scientist is like training legs at the gym. Nobody enjoys leg day.
Hello fellow machine learners,
After a brief[1] hiatus, we are back 😎
If you want to be a good machine learning practitioner, you should have a good grounding in maths and statistics. This is because ML models rely heavily on theoretical concepts, the likes of which we’ve explored plenty over the past year or so[2].
As a fresh-faced maths undergraduate a few years back, I really tried to drill down into the theory of these algorithms, and I didn’t care much at all for software development best practice.
And why should I? My goal was to become a data scientist, not a software engineer. So who cares about all the boring software dev admin? Wouldn’t it just be a waste of time?
Well, now that I’ve worked as a data scientist in industry for a while, I can see why those exact practices are so important for anyone who interacts with code, whether they’re the ones writing it or not.
With all that said, let’s get to unpacking!
Modular code
When working on machine learning projects at university, I wrote pretty much all my code in long Python notebooks.
This is certainly useful when conducting exploratory data analysis and when one wants to experiment in a scrappy way. However, if you want to build an end-to-end ML pipeline for a project, I would strongly recommend a more modular format for your code.
For example, you could design separate Python files for pre-processing your data and for building models on the cleaned data. These files could then be run via the terminal, with the pre-processing outputs saved in new CSV files and the trained models saved separately. I think that Python notebooks still have their part to play, but they should be signposted separately to the core ML pipeline.
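To make the idea concrete, here is a minimal sketch of what a standalone pre-processing script could look like. The file name preprocess.py, the cleaning rule and the --input/--output flags are all hypothetical, and I’ve used Python’s built-in csv module rather than pandas purely to keep the sketch dependency-free.

```python
# preprocess.py (hypothetical file name)
import argparse
import csv


def drop_incomplete_rows(rows):
    """Keep only the rows in which every field is present and non-empty."""
    return [
        row for row in rows
        if all(isinstance(value, str) and value.strip() for value in row.values())
    ]


def preprocess(input_path, output_path):
    """Read the raw CSV, drop incomplete rows and save the cleaned CSV."""
    with open(input_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    cleaned = drop_incomplete_rows(rows)

    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)

    return len(cleaned)


def main(argv=None):
    """Parse the command-line arguments and run the pre-processing step."""
    parser = argparse.ArgumentParser(description="Clean a raw CSV file.")
    parser.add_argument("--input", required=True, help="path to the raw CSV")
    parser.add_argument("--output", required=True, help="where to save the cleaned CSV")
    args = parser.parse_args(argv)

    n_rows = preprocess(args.input, args.output)
    print(f"Saved {n_rows} cleaned rows to {args.output}")
```

With the usual `if __name__ == "__main__": main()` line added at the bottom of the file, this could be run with something like `python preprocess.py --input raw.csv --output clean.csv`, and a separate training script could then pick up the cleaned file.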
Building out this sort of file structure can make it easier to locate each part of the project workflow. For example, if someone wanted to see how you’ve tuned your models, they could just look for something like model_tuning.py. A solid file structure can also speed up the way in which you run your code. This was never something I had to worry much about when developing code for university projects. But in recent cases where I have needed to train models multiple times, it’s been far easier for me to execute code through the terminal than to run all the code blocks in a Python notebook, especially when not all of the notebook’s code blocks are needed and just drag out the runtime.
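One way to get this kind of selective execution is a small entry-point script that lets you choose which stages to run. The sketch below is only an illustration: the stage names, the placeholder functions and the --stages flag are all hypothetical.

```python
# run_pipeline.py (hypothetical file name)
import argparse


def preprocess():
    print("Pre-processing data...")  # placeholder for the real step


def train():
    print("Training model...")  # placeholder for the real step


def evaluate():
    print("Evaluating model...")  # placeholder for the real step


# Stages listed in the order they should run (dicts preserve insertion order)
STAGES = {"preprocess": preprocess, "train": train, "evaluate": evaluate}


def main(argv=None):
    """Run the requested stages and return their names in execution order."""
    parser = argparse.ArgumentParser(description="Run selected pipeline stages.")
    parser.add_argument(
        "--stages",
        nargs="+",
        choices=list(STAGES),
        default=list(STAGES),
        help="stages to run (default: all of them)",
    )
    args = parser.parse_args(argv)

    # Always run in pipeline order, whatever order the stages were typed in
    selected = [name for name in STAGES if name in args.stages]
    for name in selected:
        STAGES[name]()
    return selected
```

With `if __name__ == "__main__": main()` added at the bottom of the file, running `python run_pipeline.py --stages train evaluate` would skip the pre-processing step entirely, which is much harder to do cleanly in a notebook.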
I would recommend that you consider your file structure before even beginning any sort of project. A bit of early planning can save you a lot of headaches down the line.
(So if any students are reading this article, do yourself a future favour and start improving your code structure ASAP!)
What about modularity within the code itself? Well, if you want to carry out a certain operation on a DataFrame, you could write a callable function to handle that particular task. The key to this is that a function should handle exactly one operation. You shouldn’t try to group together a bunch of operations under the same function.
There are a few reasons I can think of as to why functions are so useful:
✅ Debugging: if each desired operation is written in a separate function, then if/when errors arise in your code, you’ll be able to locate them more easily than if you were to write everything in a large, sequential code block. Trust me, this has helped more times than I can count when writing code for animations in previous articles!
✅ Reusability: if you contain a set of code within a function, then you can apply the same code multiple times using the function name rather than having to write all the code out again and again. Functions can also call other functions, so you could write a ‘foundational’ function that provides some basic utility and then reuse it in slightly different ways across a bunch of other functions.
✅ Readability: the same way that a book is broken down into chapters, the code you write will just look more presentable with a bit of compartmentalisation. (More on this in the next section.)
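Here is a small sketch of these ideas in practice. The function names and the toy data are made up for illustration, and I’ve used a plain Python list in place of a DataFrame column so that the example runs without any extra libraries.

```python
import statistics


def column_mean(values):
    """Foundational helper: return the mean of a list of numbers."""
    return statistics.mean(values)


def fill_missing(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = column_mean(observed)
    return [mean if v is None else v for v in values]


def centre(values):
    """Shift the values so that their mean is zero."""
    mean = column_mean(values)
    return [v - mean for v in values]


heights = [150.0, None, 170.0, 160.0]
filled = fill_missing(heights)  # [150.0, 160.0, 170.0, 160.0]
centred = centre(filled)        # [-10.0, 0.0, 10.0, 0.0]
```

Each function handles exactly one operation, and both fill_missing and centre reuse the foundational column_mean helper. If one of the results looks wrong, there is only one short function to inspect at a time.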
You can bolster the modularity of your code with the power of object-oriented programming. I’d highly recommend that you read into this if you don’t know much about it, but leave a comment down below if you’d like me to address object-oriented programming in a future article.
Documenting your code
Code is read more often than it is written. Where one person or group is responsible for writing the code for a particular package, there may be many others who want to read it to try and understand how it works. There are two main tools that one can use to make their written code easier to understand: docstrings and comments.
A docstring is basically a chunk of plain English text that helps explain the purpose of functions and code blocks. If docstrings are written well, then the corresponding code will be easier to understand. This is especially important for large projects where there are multiple programmers involved. Here is an example of a docstring I wrote for one of the methods in my KMeans class from a few weeks ago:
def update_centroids(self, data, labels):
    """
    Updates the centroids' coordinates based on the current labels.

    Parameters
    ----------
    data : list
        List of NumPy ndarrays. Each array represents the coordinates of a data point.
    labels : list
        List of integers. The i-th integer represents the cluster centroid that the i-th data point is closest to.

    Returns
    ----------
    new_centroids : list
        List of NumPy ndarrays. Each array represents the updated coordinates of a centroid.
    """
In Python, the start and end points of the docstring are indicated by a set of triple quotation marks. I started with a brief description of the method’s purpose, then added sections for ‘Parameters’ and ‘Returns’. I follow the format of NumPy’s documentation, but there’s more than one way to write good docstrings.
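As an illustration of another common convention, here is a sketch of a Google-style docstring on a made-up helper function (euclidean_distance is purely hypothetical). Whichever format you choose, Python stores the docstring on the function itself, so readers can retrieve it with the built-in help() or with inspect.getdoc.

```python
import inspect
import math


def euclidean_distance(point_a, point_b):
    """Computes the Euclidean distance between two points.

    Args:
        point_a: Sequence of floats. The coordinates of the first point.
        point_b: Sequence of floats. The coordinates of the second point.

    Returns:
        The straight-line distance between the two points, as a float.
    """
    return math.dist(point_a, point_b)


# inspect.getdoc returns the docstring with its indentation cleaned up
print(inspect.getdoc(euclidean_distance))
```

The content is the same as in the NumPy format (a summary line, the inputs and the output); only the layout of the sections differs. What matters most is picking one style and sticking to it across a project.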
Similarly, comments can be used to explain code. You can think of a comment as a super-short docstring (a few lines at most), used to explain smaller sections of code. Here is the code for the aforementioned update_centroids method:
new_centroids = []

for i in range(self.k):
    # calculate the mean of the data points that belong to the current cluster
    data_in_cluster = [data[j] for j in range(len(data)) if labels[j] == i]
    data_stack = np.stack(data_in_cluster, axis=1)
    new_centroid = np.mean(data_stack, axis=1)
    new_centroids.append(new_centroid)

return new_centroids
In the above, I have used a one-line comment to explain the purpose of the code in the loop.
One thing to be careful about: more code documentation isn’t always better. There is no point commenting code line-by-line if it’s already obvious what the code is doing. For example, the following comment is pretty redundant, because it’s obvious what the print line is doing:
# prints 'Hello world' to the terminal
print('Hello world')
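By contrast, a comment tends to earn its place when it explains why the code does something rather than what it does. A minimal sketch, using a made-up normalise function that assumes a non-empty list of non-negative weights:

```python
def normalise(weights):
    """Rescale a non-empty list of weights so that they sum to one."""
    total = sum(weights)

    # If every weight is zero, dividing by the total would raise a
    # ZeroDivisionError, so fall back to equal weights instead.
    if total == 0:
        return [1 / len(weights) for _ in weights]

    return [w / total for w in weights]
```

Here the comment records a decision that a reader couldn’t easily work out from the code alone, which is exactly the sort of thing I’d forget when revisiting the function months later.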
It’s up to you as a programmer to determine what you need to explain and where. This might depend on who you expect to read your code. But consider what you might forget if you were to revisit your code repository down the line; if someone needs your help understanding what you wrote, it’d be pretty embarrassing if it looked like gibberish to you too!
As well as this, code breaks. In my personal experience, it breaks a lot. In these instances, it’s always handy to have left yourself a breadcrumb trail of comments and docstrings to help you check whether your code is doing what you intended.
Packing it all up
Everything I’ve written this week hasn’t just been hot air: I do try to practise what I preach at times. While I admit that not all of my code for this newsletter has been amazingly documented thus far, I invite you to check out some examples of where I have implemented the tips above:
My KMeans Python class from scratch: https://github.com/AmeerAliSaleem/machine-learning-algorithms-unpacked/blob/main/k_means.py
My DBSCAN Python class from scratch: https://github.com/AmeerAliSaleem/machine-learning-algorithms-unpacked/blob/main/dbscan.py
My code for both the Multi-Armed Bandit class and the corresponding animations. The docstrings weren’t as detailed, but the code was still written in a modular manner: https://github.com/AmeerAliSaleem/machine-learning-algorithms-unpacked/blob/main/bandits.py
Training complete!
There is plenty more to learn about writing clean, maintainable code. If you’ve been inspired by this article, I encourage you to do some of your own research and comment any tips you have down below! I am pretty early in my career and so could undoubtedly take inspiration from you all. Let’s help each other become better programmers, regardless of our disciplines 😃
Do leave a comment if you’re unsure about anything, if you think I’ve made a mistake somewhere, or if you have a suggestion for what we should learn about next 😎
Until next Sunday,
Ameer
Sources
My GitHub repo where you can find the code for the entire newsletter series: https://github.com/AmeerAliSaleem/machine-learning-algorithms-unpacked
[1] I’m not sure if a 5-week break can be classed as ‘brief’, but hey ho.
[2] And will continue to!