Talk Supplements for PyCon India 2020

A meta-post on my talk at PyCon India 2020

Background

I am due to present at PyCon India 2020. Some of the motivating reasons for having this post are:

I would like to preserve questions
I would like to collect the video, slides and other miscellaneous stuff in one location ¹
It would be nice to have my own thoughts here afterwards

Details of this happy circumstance are reproduced below from the CFP here.

Details

Title: Reproducible Scalable Workflows with Nix, Papermill and Renku

Abstract

The prominence of Jupyter notebook interfaces can no longer be denied in the data-science and analysis community. In particular, fledgling and “fresh out of school” researchers and practitioners are used to relying on Jupyter notebooks for their initial analysis. As might be expected, these workflows are difficult to reproduce and also to store. Caching efficiency and dependency re-use are almost always sub-optimal with virtual environments compared to native installations, and the same issues (along with additional security concerns) plague Docker setups as well. A set of Jupyter tools, like Jupytext, has evolved to close this gap. However, the fundamental aspect of reproducing workflows on high performance computing clusters, that is, of being able to programmatically compose compilation rules which efficiently use the underlying hardware with minimal user intervention, is still not a solved problem. In this talk, I will discuss packaging Python applications and workflows in an end-to-end composable manner using the Nix ecosystem, which leverages a functional programming paradigm, and then show how this allows for user-friendly low-compute analysis while remaining scalable on large clusters. To that end, the tools introduced will be:

The Nix programming language (emphasis on developer environments for Python with mkShell)
Jupyter Python kernels (the Xeus kernel for Python debugging) and Jupytext
Papermill for parameterizing notebooks
Renku for tracing provenance

The goal is to have the audience familiarized with the best practices for reproducibility and analysis. The focus will be on scientific HPC applications, though any managed cluster can and will benefit from the practices described.
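
To make "parameterizing notebooks" a little more concrete, here is a minimal sketch of the idea in plain Python, using only the standard library. It is not Papermill's implementation: the helper `inject_parameters` and the toy `template` notebook are hypothetical names made up for illustration. The one thing it borrows from Papermill is the convention of tagging a cell with `parameters` and adding the overrides in a new cell tagged `injected-parameters` right after it.

```python
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of `notebook` with a cell of parameter overrides added.

    Illustrative sketch only (hypothetical helper, not Papermill's code):
    the overrides go right after the first cell tagged "parameters", or at
    the top of the notebook when no such cell exists.
    """
    nb = copy.deepcopy(notebook)  # leave the template untouched

    # One assignment per parameter; repr() keeps strings quoted.
    source = "".join(f"{name} = {value!r}\n" for name, value in parameters.items())
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": source,
    }

    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            nb["cells"].insert(index + 1, injected)
            break
    else:
        # No tagged cell found: put the overrides first.
        nb["cells"].insert(0, injected)
    return nb


# A toy notebook in the nbformat 4 JSON layout, with made-up defaults.
template = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"tags": ["parameters"]},
            "outputs": [],
            "source": "n_steps = 10\nlabel = 'default'\n",
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print(label, n_steps)\n",
        },
    ],
}

run = inject_parameters(template, {"n_steps": 500, "label": "cluster"})
print(run["cells"][1]["source"])
```

Because the injected cell runs after the defaults, the same notebook can serve both a quick local run and a larger cluster run without being edited by hand. With Papermill itself the corresponding step is `papermill.execute_notebook(input_path, output_path, parameters=...)`, which also executes the notebook and writes the result to a new file.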

Slides

The slides are embedded below. The orgmode source is here on the site’s GH repo.

Video

¹ One location I am going to be able to keep track of