
 

Me, a data scientist, and Jupyter notebooks. Well, our relationship started back when I began to learn Python. Jupyter notebooks were my refuge when I wanted to make sure my code worked. Nowadays, I teach coding and work on several data science projects, and notebooks are still the best tools for interactive coding and experimentation. Unfortunately, when you try to use notebooks in data science projects, things can get out of control quickly. Experimentation produces monolithic notebooks that are hard to maintain and modify. And yes, it is very time-consuming to do the work twice: experiment first, then transform your code into Python scripts. Not to mention that such code is painful to test, and version control is also a problem. This is the point where you have to think: there must be a better way! Luckily for me, the answer does not lie in avoiding my beloved Jupyter notebooks.

 

Follow along and get to know some awesome ideas from Eduardo Blancas and his project, Ploomber, on how to do better data science projects and how to create and use Jupyter notebooks wisely, even in production.

Popular Jupyter notebooks

Jupyter is a free, open-source web tool in which you write code in cells; each cell is sent to the back-end 'kernel', and you immediately get the results. One of my colleagues says it is like an old-school messenger application with code. Jupyter notebooks' popularity has exploded in the past few years, thanks to the ability to combine software code, computational output, explanatory text, and multimedia resources in a single document [1]. Among other things, notebooks are used for scientific computing, data exploration, tutorials, and interactive manuals. What is more, notebooks can speak dozens of languages (Jupyter got its name from Julia, Python, and R). One analysis of the code-sharing site GitHub counted more than 7.5 million public Jupyter notebooks in January 2022. As a data scientist, I mainly use Jupyter notebooks for data wrangling with Python and R, and I also teach students Python basics via Jupyter notebooks.

What’s the problem with notebooks?

Despite their popularity, many data scientists (including me) run into problems with Jupyter notebooks [2]. I could not summarize them better myself, so I quote Joel Grus, who explained some of the problems with notebooks [1].

 

“I have seen programmers get frustrated when notebooks don’t behave as expected, usually because they inadvertently run code cells out of order. Jupyter notebooks also encourage poor coding practice by making it difficult to organize code logically, break it into reusable modules and develop tests to ensure the code is working properly.”

 

Notebooks are hard to debug and test, and I have also spent a lot of time in my career refactoring notebook code into scripts and functions that can be used in production. There are also problems with version control: notebooks are JSON files, so git outputs an unreadable comparison between versions, making it hard to follow the changes [2]. Here you can find a more detailed summary and explanation of the problems with Jupyter notebooks.

 

The quest for modularization

The problems listed above would have been enough to lead me to Ploomber, but I actually discovered this awesome project through my quest for modularization. What I needed was a tool to easily create and run tasks or code snippets in a defined order, without asking my data engineer colleagues for help. What I needed is called a pipeline. With a pipeline, you can split work into smaller components and automate them. Pipelines come in many shapes and sizes, and you can build them even in sklearn and pandas [3], as the sketch below shows.
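To make the idea concrete, here is a minimal sketch of an in-memory scikit-learn pipeline; the data and steps are just placeholders, but the chaining is the point: preprocessing and model become one object you can fit, score, and reuse.

```python
# A minimal sketch of an in-memory scikit-learn pipeline: scaling chained with a
# classifier, so preprocessing and modeling run as a single, reusable unit.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),   # model step
])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

This kind of pipeline lives inside a single process, though; what I was after was the same idea at the project level, with whole notebooks and scripts as the steps.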

 

Ploomber is an open-source project initiated by Eduardo Blancas to create Python pipelines. I found it an easy-to-use tool with which I could quickly define my tasks and their execution order and break my analysis into modular parts. Ploomber comes with several sample projects where you can find great examples of the tool, and I share my own experiments with Ploomber in this repo. What I especially like about Ploomber is the blog and the community on Slack, where I could ask anything about the project.
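To give a feel for how tasks are declared, here is a rough sketch of the kind of pipeline.yaml spec Ploomber uses; the file names are made up, and the sample projects show the full range of options (execution order is inferred from the upstream dependencies each script declares).

```yaml
# pipeline.yaml -- a hypothetical two-task Ploomber pipeline
tasks:
  # Each task's source is a script or notebook; the product is what it creates,
  # always including an executed copy of the notebook.
  - source: tasks/clean.py
    product:
      nb: output/clean.ipynb
      data: output/clean.csv

  - source: tasks/plot.py
    product: output/plot.ipynb
```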

 

Life-hacks from Eduardo Blancas

Okay, I found a great project to modularize my data science projects, but how did it help with my constant struggle with notebooks? 

 

Well, Ploomber comes with Jupytext, a package that lets us save notebooks as .py files while still interacting with them as notebooks. Since plain-text scripts produce readable git diffs, the version-control problem was solved.
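To show what that means in practice, here is a rough sketch of a notebook stored on disk in Jupytext's "percent" format: a plain Python script with cell markers that Jupyter opens as an ordinary notebook and that git can diff line by line (paths and content are just an illustration).

```python
# clean.py -- a notebook saved as a plain script in Jupytext's "percent" format.
# Each "# %%" marker starts a cell; "# %% [markdown]" marks a text cell.

# %% [markdown]
# # Clean the raw data
# Load the raw file, drop duplicates, and save the result.

# %%
import pandas as pd

# %%
df = pd.read_csv("data/raw.csv")   # hypothetical input path
df = df.drop_duplicates()

# %%
df.to_csv("output/clean.csv", index=False)
```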

 

Then comes the refactoring and modularization problem. You do not have to get rid of notebooks, because Ploomber can handle notebooks as pipeline units. This way, I only have to clean up my notebooks and can skip converting them to a completely different code structure and architecture. It is also possible to mix notebooks and scripts as pipeline tasks, and there is a blog post series on how to break down monolithic notebooks into smaller parts.

What I always tell my students, and what Eduardo also suggests, is to write your notebook so that you can always restart the kernel and run all of the code from top to bottom. If a notebook takes a long time to run on the full dataset, set a sample parameter that switches to a subset of the data, so you can still verify that your code runs end to end, as in the sketch below.
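The sample trick can be as simple as a flag in the notebook's parameters cell, which Ploomber (via papermill) can override at run time; the names and paths below are made up for illustration.

```python
# %% tags=["parameters"]
# Parameters cell: Ploomber/papermill can inject different values here at run time.
sample = True          # work on a small subset while developing
sample_size = 1_000

# %%
import pandas as pd

df = pd.read_csv("data/raw.csv")   # hypothetical dataset

# %%
if sample:
    # Take a random subset so the whole notebook runs top to bottom quickly.
    df = df.sample(n=min(sample_size, len(df)), random_state=42)

# %%
# ...the rest of the analysis is identical for the sample and the full run
```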

 

Besides the modularization life-hacks, another very important takeaway I read on Ploomber's blog and apply at work myself is to lock the project's dependencies and package the project so that code can be imported from other notebooks. I have run into package-version problems in a few projects so far, so I can assure you that this can save you a few hours.
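Packaging the project can be as lightweight as the sketch below; the package name and layout are hypothetical, and after `pip install -e .` any notebook in the pipeline can simply `import my_project`. Locking the exact versions afterwards can be as simple as `pip freeze > requirements.lock.txt`, committed next to the code.

```python
# setup.py -- a minimal packaging sketch so notebooks can import shared code
from setuptools import find_packages, setup

setup(
    name="my_project",                    # hypothetical package name
    version="0.1.0",
    packages=find_packages(where="src"),  # assumes shared code lives in src/my_project
    package_dir={"": "src"},
    install_requires=["pandas", "scikit-learn"],  # direct dependencies only
)
```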

 

A project of multiple shorter, cleaner notebooks instead of a few monolithic ones makes the code easier to reproduce, understand, and modify. It also makes it possible to design a testing strategy for ML code. Several posts about why machine learning projects fail mention the difficulty of updating code and the time-consuming maintenance that follows. With shorter, cleaner code, locked dependencies, and proper version control, maintenance and collaboration become easier and faster.
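With each task writing its output to a file, even a tiny smoke test adds a lot of safety; the paths and column names below are hypothetical.

```python
# test_clean.py -- a minimal pytest-style smoke test for one pipeline output
from pathlib import Path

import pandas as pd


def test_clean_output_has_expected_shape():
    # Assumes the cleaning task has already written this file (hypothetical path).
    out = Path("output/clean.csv")
    assert out.exists()

    df = pd.read_csv(out)
    assert not df.empty
    assert {"id", "value"}.issubset(df.columns)   # hypothetical required columns
    assert df["id"].is_unique
```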

Summary

The ideas above are just some of the main thoughts I found useful on Ploomber's blog. Since then, I have had a toolbox for splitting notebooks into modular parts and for using and converting them into pipelines, even in smaller projects. I like to share and teach ideas on how to write better notebooks and code, and these coding practices are worth considering.

 

If you're interested in further details about Ploomber and how to work more efficiently with notebooks, make sure to check out Eduardo Blancas' talk about his project at the Reinforce AI Conference powered by Ericsson this March! Who could tell us more than the CEO and co-founder of Ploomber himself?

 

References

 

[1] Jeffrey M. Perkel (2018). Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145-146. 

 

[2] Eduardo Blancas (2021). Why (and how) to put notebooks in production. Ploomber.io blog.

 

[3] Anouk Dutrée (2021). Data pipelines: What, why and which ones. Towards Data Science blog.