Enhancing Data Science Workflows: Mastering Version Control for Jupyter Notebooks

A hands-on guide to facilitate collaboration and reproducibility with Jupytext, nbstripout, and nbconvert

Alessandro Tomassini

Towards Data Science

January 11, 2024 · ~8 min read

In the work of a data scientist, effectively managing Jupyter notebooks with version control systems is crucial. This is not just for maintaining an organised workflow, but also for ensuring reproducibility and facilitating collaboration among team members. In this guide, we'll explore three key tools — Jupytext, nbstripout, and nbconvert — each with its unique features. I'll provide thorough descriptions, practical examples, and a (hopefully) balanced view of their advantages and disadvantages, helping you determine the best tool for your specific needs in notebook version control.

Understanding the challenge with Jupyter notebooks and version control

Jupyter notebooks, while excellent for exploratory data analysis and visualisation, present challenges when it comes to version control. This is primarily because notebooks are not plain text files but structured JSON documents. The format is adept at maintaining the interweaving of code, text, and output data, but it is hard to manage in version control systems like Git. To illustrate the problem, let's create a notebook demo_notebook.ipynb which outputs a simple plot:

import matplotlib.pyplot as plt
import numpy as np

time = np.linspace(0, 10, 1000)
plt.plot(time, np.sin(np.pi*time**2))
plt.xlabel("Time")
plt.ylabel("Amplitude")
plt.title("signal with time-varying frequency")
plt.show()

Image by author

We then initialise a git repo in the folder containing the notebook and use fold to peek into the notebook.

cd my_folder
git init
fold -s -w100 demo_notebook.ipynb

Below is the content of the notebook in JSON format (truncated for readability). It immediately stands out that the JSON structure encapsulates not only the input code but also metadata and output information, including binary data for images or graphs. This amalgamation results in bulky files that can bloat a repository and make the process of diffing (comparing different versions) cumbersome and counterintuitive.

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAksAAAHFCAYAAADi7703AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMCw
gaHR0cHM6Ly9tYXRwbG90bGliLm9yZy81sbWrAAAACXBIWXMAAA9hAAAPYQGoP6dpAADfTElEQVR4nOx9d5hdVdn9OrdOn/QGIQG
EhCZVQuADhCACIlgQBAkgRRAboJ8aBVEs2IAoTREwgnwQfh+ifFJDE5BQAgGkdxKSCSlk+syt5/fHvfucd++z9ym3zGQy73qePJk
595x99jlzy7rrXXu9lm3bNhgMBoPBYDAYWsSGewIMBoPBYDAYmzKYLDEYDAaDwWD4gMkSg8FgMBgMhg+YLDEYDAaDwWD4gMkSg8F
gMBgMhg+YLDEYDAaDwWD4gMkSg8FgMBgMhg+YLDEYDAaDwWD4gMkSg8FgMBgMhg+YLDEYGixatAiWZWHZsmXDPRUHDz/8MCzLwsM
PP+xs+/GPfwzLsmp2jlNOOQUzZ86s2XjDgZF4DZZl4cc//nHgfuJ5+e6770Y+Ry2fKx//+Mfx8Y9/vCZj1QL1ns8DDzyAlpYWrFq
1qm7nYGzaSAz3BBgMBqOWuOCCC/Ctb31ruKcRCUuXLsWWW2453NMIjauuumq4pzCkmDdvHvbee2/84Ac/wF/+8pfhng5jGMDKEoN
RYwwMDIBbLg4ftt12W+y+++7DPY1I2GeffUYUWdpxxx2x4447Dvc0hhRf+9rXcNNNN2HlypXDPRXGMIDJEoNRBURZ5L777sOpp56
KiRMnoqmpCZlMBgCwePFizJ07F83NzWhpacEnP/lJLF++XBpj2bJl+OIXv4iZM2eisbERM2fOxPHHH4/33nsv8nxOO+00jBs3Dv3
9/Z7HDj74YOy0006RxxwcHMSCBQuw9dZbI5VKYYsttsDXvvY1dHZ2Ovt85jOfwYwZM1AsFj3Hz5kzB3vssYfzu23buOqqq7Dbbru
hsbERY8eOxTHHHIO33347cC7r1q3DV77yFUyfPh3pdBoTJ07Efvvth/vvv9/ZR1eGsywLX//613HjjTdihx12QFNTE3bddVf885/
/lPYzlfB0Jaz/9//+H+bMmYP29nY0NTVhm222wamnnirts2LFCpx44omYNGkS0uk0dthhB1xyySWe+6Qrwz3xxBPYb7/90NDQgGn

                            (truncated for readability)

AAAAASUVORK5CYII=",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "time = np.linspace(0, 10, 1000)\n",
    "plt.plot(time, np.sin(np.pi * time**2))\n",
    "plt.xlabel(\"Time\")\n",
    "plt.ylabel(\"Amplitude\")\n",
    "plt.title(\"signal with time-varying frequency\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
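To get a feel for how much of a notebook file is output rather than code, one can measure the serialised size of the outputs fields. The following is a minimal sketch using only the standard library; the function name output_share and the toy notebook dictionary (with a fake image payload) are illustrative assumptions, not part of any Jupyter tool.

```python
import json


def output_share(notebook: dict) -> float:
    """Return the fraction of the serialised notebook taken up by cell outputs."""
    total = len(json.dumps(notebook))
    outputs = sum(
        len(json.dumps(cell.get("outputs", [])))
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    )
    return outputs / total


# Toy notebook: one code cell whose output is a fake base64 image payload.
toy_notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "data": {"image/png": "A" * 20_000},
                    "metadata": {},
                    "output_type": "display_data",
                }
            ],
            "source": ["import numpy as np\n", "print(np.pi)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

share = output_share(toy_notebook)
print(f"Outputs account for {share:.0%} of the file")
```

To try this on a real file, load it with json.load and pass the resulting dictionary to the same function.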

When it comes to peer-reviewing and collaboration, these aspects of Jupyter notebooks present even greater difficulties. The usual workflow in Git, which is well-suited for managing simple text files, struggles to navigate the complex and intricate format of Jupyter notebooks.

To illustrate this point, let's change the lines plt.plot(time, np.sin(np.pi*time**2)) and plt.title("signal with time-varying frequency") in the notebook…

import matplotlib.pyplot as plt
import numpy as np

time = np.linspace(0, 10, 1000)
plt.plot(time, np.sin(np.pi*time**2)*np.exp(-time))
plt.xlabel("Time")
plt.ylabel("Amplitude")
plt.title("signal with time-varying frequency & amplitude")
plt.show()

Image by author

…and compare the two versions of the notebook using git diff --cached (this command shows the differences between the staged changes and the last commit, specifically for files that have been staged with git add but not yet committed):

git add demo_notebook.ipynb
git diff --cached

The problem becomes immediately evident as even minor alterations in the notebook can manifest as substantial changes when viewed through a Git diff (output truncated for readability):

Image by author

Due to these issues, reviewing code changes and merging branches become extremely challenging. It's like trying to find a needle in a haystack when you're attempting to distinguish significant differences amidst a tangle of JSON formatting and output data. This complexity slows down reviews and increases the likelihood of merge conflicts, which can disrupt collaborative efforts and undermine the clear oversight that version control systems are designed to provide.
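The effect can be reproduced without Git. The sketch below builds two toy one-cell notebooks in memory and compares the size of a line diff over the full JSON with a diff over the source lines alone. The helper names make_notebook and changed_chars, and the fake image payloads, are assumptions made for illustration.

```python
import difflib
import json


def make_notebook(source, png_payload, execution_count):
    """Build a minimal one-cell notebook dict (hypothetical helper)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "data": {"image/png": png_payload},
                        "metadata": {},
                        "output_type": "display_data",
                    }
                ],
                "source": source,
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }


def changed_chars(old_lines, new_lines):
    """Count characters on the added and removed lines of a unified diff."""
    diff = list(difflib.unified_diff(old_lines, new_lines, lineterm=""))
    body = diff[2:]  # skip the two file header lines
    return sum(len(line) - 1 for line in body if line.startswith(("+", "-")))


old_source = [
    "plt.plot(time, np.sin(np.pi*time**2))\n",
    'plt.title("signal with time-varying frequency")\n',
]
new_source = [
    "plt.plot(time, np.sin(np.pi*time**2)*np.exp(-time))\n",
    'plt.title("signal with time-varying frequency & amplitude")\n',
]

# Re-running the cell also changes the embedded image and the execution counter.
old_nb = make_notebook(old_source, "A" * 5_000, execution_count=4)
new_nb = make_notebook(new_source, "B" * 5_000, execution_count=5)

old_json = json.dumps(old_nb, indent=1).splitlines()
new_json = json.dumps(new_nb, indent=1).splitlines()

source_only = changed_chars(old_source, new_source)
full_file = changed_chars(old_json, new_json)
print(f"source-only diff: {source_only} chars, full .ipynb diff: {full_file} chars")
```

Two edited lines of code dominate the source-only diff, while the full-file diff is swamped by the regenerated image payload.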

Another critical concern is when dealing with sensitive data. It's risky to commit a Jupyter notebook containing confidential information to GitHub or any public repository. A workaround could be manually clearing your notebook's output before each Git commit, but this method is far from foolproof. It's time-consuming and fraught with the risk of human error — just one oversight in clearing the output can lead to accidental exposure of sensitive data. This potential for errors adds another layer of complication to managing notebooks in version control, particularly in environments where data privacy and security are paramount.
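Rather than relying on memory, the check for leftover outputs can be scripted. The following is a minimal sketch, assuming the notebook is available as JSON text; the function names cells_with_outputs and check_notebook_text are hypothetical, and the sample notebooks are toy data.

```python
import json


def cells_with_outputs(notebook: dict) -> list[int]:
    """Return the indices of code cells that still carry outputs."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


def check_notebook_text(text: str) -> bool:
    """Return True if the notebook JSON in `text` is free of outputs."""
    return not cells_with_outputs(json.loads(text))


# Toy examples: one cleared notebook and one that still contains printed data.
clean = json.dumps({"cells": [{"cell_type": "code", "outputs": [], "source": []}]})
dirty = json.dumps(
    {
        "cells": [
            {
                "cell_type": "code",
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": ["secret\n"]}
                ],
                "source": [],
            }
        ]
    }
)

print(check_notebook_text(clean), check_notebook_text(dirty))
```

A check like this could run in a review script or continuous integration job, although it only detects outputs and does nothing about sensitive values written directly in the code cells.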

Below, we'll explore a selection of tools that tackle these challenges, complete with examples. They are ranked in my personal order of preference. We'll discuss their strengths, weaknesses, and best use cases.

1. Jupytext

Photo by Patrick Tomasso on Unsplash

How it works: Jupytext converts Jupyter notebooks into plain text formats (such as .md or .Rmd) or script files (like .py) by parsing the notebook's JSON structure and extracting the code and Markdown cells. These formats are more amenable to version control, allowing for better tracking of changes and easier collaboration. It can also pair a notebook with a text-based version and keep the two synchronised, so that changes in one format are reflected in the other.
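The idea of extracting cells from the JSON structure can be sketched in a few lines. This is a simplified illustration and not Jupytext's actual implementation: it ignores metadata, the YAML header, and cell options, and the function names are made up for this example.

```python
def cell_source(cell: dict) -> str:
    """Notebook JSON stores source as a string or a list of lines; return a string."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def notebook_to_markdown(notebook: dict, language: str = "python") -> str:
    """Render markdown cells as text and code cells as fenced blocks, dropping outputs."""
    fence = "`" * 3  # built at runtime to avoid writing a literal fence here
    parts = []
    for cell in notebook.get("cells", []):
        text = cell_source(cell)
        if cell.get("cell_type") == "markdown":
            parts.append(text)
        elif cell.get("cell_type") == "code":
            parts.append(f"{fence}{language}\n{text}\n{fence}")
    return "\n\n".join(parts) + "\n"


# Toy notebook with one markdown cell and one code cell that has an output.
demo = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [{"data": {"image/png": "AAAA"}, "output_type": "display_data"}],
            "source": ["import numpy as np\n", "print(np.pi)"],
        },
    ]
}

markdown_text = notebook_to_markdown(demo)
print(markdown_text)
```

Because only the cell sources are written out, the resulting text contains no image payloads or execution counters, which is what makes it diff-friendly.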

Example:

Install Jupytext

pip install jupytext --upgrade

Pair a notebook with a Markdown format

# Pair a notebook with a Markdown file and keep them in sync
jupytext --set-formats ipynb,md demo_notebook.ipynb

The notebook now has a paired Markdown representation:

---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.15.2
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

```python
import matplotlib.pyplot as plt
import numpy as np

time = np.linspace(0, 10, 1000)
plt.plot(time, np.sin(np.pi*time**2))
plt.xlabel("Time")
plt.ylabel("Amplitude")
plt.title("signal with time-varying frequency")
plt.show()
```

Synchronise the paired markdown file with the modified notebook

jupytext --sync demo_notebook.md

And the modified parts appear clear and uncluttered.

diff --git a/demo_notebook.md b/demo_notebook.md
index 65d7d55..a9b5565 100644
--- a/demo_notebook.md
+++ b/demo_notebook.md
@@ -18,9 +18,9 @@ import matplotlib.pyplot as plt
 import numpy as np
 
 time = np.linspace(0, 10, 1000)
-plt.plot(time, np.sin(np.pi*time**2))
+plt.plot(time, np.sin(np.pi*time**2)*np.exp(-time))
 plt.xlabel("Time")
 plt.ylabel("Amplitude")
-plt.title("signal with time-varying frequency")
+plt.title("signal with time-varying frequency & amplitude")
 plt.show()
 ```

Convert back to notebook format

jupytext --to ipynb demo_notebook.md

Primary Purpose: Jupytext is designed to convert Jupyter notebooks into lighter, text-based formats like Markdown or Python scripts. It can sync these text files with the original notebooks, reflecting changes in both directions.
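Keeping two files aligned requires a rule for deciding which one holds the latest edits. One plausible rule, assumed here purely for illustration and not claimed to be how Jupytext is implemented, is to compare modification times. The function name most_recent is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def most_recent(*paths: Path) -> Path:
    """Return the existing path with the latest modification time."""
    existing = [p for p in paths if p.exists()]
    if not existing:
        raise FileNotFoundError("none of the paired files exist")
    return max(existing, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as folder:
    notebook = Path(folder) / "demo_notebook.ipynb"
    text_file = Path(folder) / "demo_notebook.md"
    notebook.write_text("{}")
    text_file.write_text("# edited in a text editor\n")
    # Set explicit timestamps so the example does not depend on clock resolution.
    os.utime(notebook, (1_000, 1_000))
    os.utime(text_file, (2_000, 2_000))
    winner = most_recent(notebook, text_file).name

print(winner)
```

In this toy setup the Markdown file was touched last, so its cells would be treated as the inputs to propagate to the notebook.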

Advantages:

Facilitates better version control by converting notebooks to text formats.
Supports bidirectional synchronisation, keeping the text file and notebook aligned.
Enhances collaboration and code review processes.

Limitations:

Loses the output data and interactive elements in the converted text format.
Managing two file formats (notebook and text) adds complexity to the workflow.

2. nbstripout

Photo by Enrico Mantegazza on Unsplash

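How it works: nbstripout integrates with Git, typically as a filter installed by nbstripout --install (it can also run as a pre-commit hook), to automatically strip cell outputs from notebooks when they are staged. It rewrites the notebook's JSON content without the output fields, which reduces the file size and simplifies diffs, while the working copy on disk keeps its outputs.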
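The core transformation is easy to picture. The sketch below is a simplified stand-in for what such a filter does, not nbstripout's own code: it clears outputs and execution counters on code cells and leaves everything else alone. The function name strip_outputs and the toy notebook are assumptions.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Toy notebook with a fake image payload standing in for a real plot.
executed = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [{"data": {"image/png": "A" * 5_000}, "output_type": "display_data"}],
            "source": ["plt.plot(time, np.sin(np.pi*time**2))"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

cleaned = strip_outputs(executed)
before, after = len(json.dumps(executed)), len(json.dumps(cleaned))
print(f"{before} -> {after} characters")
```

The function works on a copy, mirroring the way a Git filter changes what is staged without touching the file you are editing.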

Example:

Install nbstripout

pip install nbstripout
cd my_folder
nbstripout --install

The Git filters are now automatically applied

When we diff the modified notebook the changes stand out clearly.

diff --git a/demo_notebook.ipynb b/demo_notebook.ipynb
index 13ce09f..4be3364 100644
--- a/demo_notebook.ipynb
+++ b/demo_notebook.ipynb
@@ -10,10 +10,10 @@
     "import numpy as np\n",
     "\n",
     "time = np.linspace(0, 10, 1000)\n",
-    "plt.plot(time, np.sin(np.pi*time**2))\n",
+    "plt.plot(time, np.sin(np.pi*time**2)*np.exp(-time))\n",
     "plt.xlabel(\"Time\")\n",
     "plt.ylabel(\"Amplitude\")\n",
-    "plt.title(\"signal with time-varying frequency\")\n",
+    "plt.title(\"signal with time-varying frequency & amplitude\")\n",
     "plt.show()"
    ]
   }

Primary Purpose: nbstripout is a tool for cleaning output data from notebooks before they are committed to version control. It helps in maintaining a cleaner Git history by only tracking changes in the code and markdown cells.

Advantages:

Reduces notebook file size, making it easier to handle in version control systems.
Simplifies diffs and merge conflicts by excluding output data.
Can be automated with Git filters or pre-commit hooks for ease of use.

Limitations:

Output data, including graphs and visualisations, are lost in the version control process.
Does not offer format conversion like Jupytext does; diffs are cleaner but still expressed in the notebook's JSON format.

3. nbconvert

Photo by Martin Martz on Unsplash

How it works: nbconvert is a utility tool provided by the Jupyter ecosystem that allows you to convert Jupyter notebooks into various other formats, including HTML, PDF, Markdown, and script files. It processes the notebook's JSON structure and renders the cells (code, markdown, outputs) into the desired format. This is especially useful for generating a clean version of the notebook without execution outputs, which is more suitable for version control.
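As a rough picture of what a notebook-to-script conversion involves, the sketch below writes code cells verbatim and turns Markdown cells into comments. It is a hand-rolled illustration using only the standard library, not nbconvert's exporter, and its output layout is an assumption rather than the format nbconvert produces.

```python
def notebook_to_script(notebook: dict) -> str:
    """Render code cells verbatim and markdown cells as comments, dropping outputs."""
    blocks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        if cell.get("cell_type") == "code":
            blocks.append(text)
        elif cell.get("cell_type") == "markdown":
            commented = "\n".join(f"# {line}".rstrip() for line in text.splitlines())
            blocks.append(commented)
    return "\n\n".join(blocks) + "\n"


# Toy notebook: a heading followed by a code cell with a stored output.
demo = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["## Signal plot"]},
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
            "source": ["x = 1 + 1\n", "print(x)"],
        },
    ]
}

script = notebook_to_script(demo)
compile(script, "<notebook>", "exec")  # raises SyntaxError if the script is invalid
print(script)
```

Unlike the paired workflow described for Jupytext, a conversion like this is one-way: edits made to the script do not flow back into the notebook.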

Example:

Install nbconvert

pip install nbconvert

Convert a notebook to a Python script

jupyter nbconvert --to script demo_notebook.ipynb

Convert a notebook to Markdown

jupyter nbconvert --to markdown demo_notebook.ipynb

Primary Purpose: nbconvert is used for converting notebooks to various static formats like HTML, Markdown, or scripts. It's part of the Jupyter ecosystem and is ideal for creating static reports or documents from notebooks.

Advantages:

Offers a wide range of format conversions, including to PDF and HTML.
Useful for generating static versions of notebooks for sharing or reporting.
Integrated into the Jupyter ecosystem, providing a seamless experience.

Limitations:

Converted documents are static and do not offer interactivity like notebooks.
Does not specifically address version control issues like diffing or merging.

An opinionated guide to choosing the right tool

Photo by JESHOOTS.COM on Unsplash

For enhanced version control: Jupytext is most suitable when the primary need is version control, offering text-based formats and readable diffs.
For clean Git history: nbstripout is ideal if the main goal is to maintain a clean Git history without the clutter of output data.
For document conversion: nbconvert is best when the need is to convert notebooks into different static formats for purposes like documentation or presentation.

In conclusion, each tool serves a specific purpose in the ecosystem of Jupyter notebook management. Your choice depends on whether your priority is enhancing version control, maintaining clean Git history, or converting notebooks for reporting and sharing.

I hope this guide has provided you with a clear and practical understanding of how to use different tools for effective version control of Jupyter notebooks. Happy coding!

References

Jupytext: https://jupytext.readthedocs.io/en/latest/index.html
nbstripout: https://github.com/kynan/nbstripout
nbconvert: https://nbconvert.readthedocs.io/en/latest/index.html

Disclaimer

I am not affiliated with, nor do I have any connections to, any of the tools described in this guide. My insights and evaluations are based on independent research and personal experience.

