Using IPython / Jupyter Notebooks Under Version Control

Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.

Although this can probably be adapted to other VCSs, I know it doesn’t satisfy your requirements (at least the VCS-agnostic one). Still, it works perfectly for me, and although it’s nothing particularly brilliant and many people probably already use it, I couldn’t find clear instructions on how to implement it by googling around. So it may be useful to other people.

  1. Save a file with this content somewhere (for the following, let us assume ~/bin/ipynb_output_filter.py)

  2. Make it executable (chmod +x ~/bin/ipynb_output_filter.py)

  3. Create the file ~/.gitattributes with the following content:

    *.ipynb filter=dropoutput_ipynb

  4. Run the following commands:

    git config --global core.attributesfile ~/.gitattributes
    git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.py
    git config --global filter.dropoutput_ipynb.smudge cat

Done!
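
For readers who want to see what a clean filter of this kind does, here is a minimal sketch. It is not the script referred to in step 1: the function names strip_output, clean and main are my own, and it assumes the nbformat 4 JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count").

```python
import json
import sys


def strip_output(notebook):
    """Blank the outputs and execution counts of every code cell (nbformat 4 layout assumed)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clean(text):
    """Return the notebook JSON in `text` with its output stripped."""
    notebook = strip_output(json.loads(text))
    # Stable key order and indentation keep later diffs small.
    return json.dumps(notebook, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # git feeds the working-tree file on stdin and stages whatever appears on stdout.
    stdout.write(clean(stdin.read()))
```

A script used as the clean filter would call main() as its last line. Only the staged content is rewritten; the notebook in the working tree keeps its output, which is why the smudge side can simply be cat.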

Limitations:

  • it works only with git
  • in git, if you are on branch somebranch and you run git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here, instead, you will have lost the output and cell numbering of notebooks whose source differs between the two branches.
  • more generally, the output is not versioned at all, as with Gregory’s solution. To avoid throwing it away every time you do anything involving a checkout, the approach could be changed to store it in separate files (but note that at the time the above code is run, the commit id is not known!), and possibly to version them (but note that this would require something more than a git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).
  • that said, if you pull code (i.e. committed by someone else not using this approach) that contains some output, the output is checked out normally. Only the locally produced output is lost.

My solution reflects the fact that I personally don’t like to keep generated stuff versioned – notice that doing merges involving the output is almost guaranteed to invalidate the output or your productivity or both.

EDIT:

  • if you adopt the solution as I suggested it (that is, globally), you will have trouble if you want to version output in some git repo. To disable the output filtering for a specific git repository, simply create inside it a file .git/info/attributes, with

    **.ipynb filter=

as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.
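
If you need this override in several repositories, it can be scripted. The following is only a sketch: the helper name disable_ipynb_filter is hypothetical, and it assumes .git is a regular directory (not the file that worktrees and submodules use).

```python
from pathlib import Path

# Same override line as shown above.
OVERRIDE = "**.ipynb filter="


def disable_ipynb_filter(repo_root):
    """Add the override to <repo_root>/.git/info/attributes unless it is already there."""
    attributes = Path(repo_root) / ".git" / "info" / "attributes"
    attributes.parent.mkdir(parents=True, exist_ok=True)
    lines = attributes.read_text().splitlines() if attributes.exists() else []
    if OVERRIDE not in lines:
        lines.append(OVERRIDE)
        attributes.write_text("\n".join(lines) + "\n")
    return attributes
```

Calling disable_ipynb_filter("path/to/repo") twice leaves a single override line, and any other attributes already in the file are kept.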

  • the code is now maintained in its own git repo

  • if the instructions above result in ImportErrors, try adding “ipython” before the path of the script:

      git config --global filter.dropoutput_ipynb.clean ipython ~/bin/ipynb_output_filter.py
    

EDIT: May 2016 (updated February 2017): there are several alternatives to my script – for completeness, here is a list of those I know: nbstripout (other variants), nbstrip, jq.
