Helping Data Science teams easily move to a Docker-first development workflow to iterate and deliver projects faster and more reliably.
New to Docker? Check out this writeup on containers vs. virtual machines and how Docker fits in.
Cookiecutter is a command-line utility that automatically scaffolds new projects for you based on templates (referred to as cookiecutters):
http://cookiecutter.readthedocs.io/en/latest/readme.html
This cookiecutter is used in conjunction with a base development image available on Docker Hub to provide an environment that is ready out of the box for many Data Science and Machine Learning project use cases. After running this cookiecutter and the provided start script, a developer will have a complete local development setup.
By scaffolding your data science projects using this cookiecutter you will get:
- Project Docker image built with your own Dockerfile for project-specific requirements
- Docker Compose configuration that dynamically binds a free host port and forwards it to the Jupyter server's listening port inside the container (see the sketch after this list)
- Shared volume configuration for accessing and executing all your project code inside the controlled container environment
- Ability to edit code using your favorite IDE on your host machine and see the changes reflected in real time in the container runtime environment
- Jupyter notebook fully configured with nbextensions, ready for development and feature engineering
- Common data science and plotting libraries pre-installed in the container environment so you can start working immediately
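For illustration, a minimal sketch of a Compose configuration with these properties might look like the following (the service name, mount point, and Compose version are assumptions; the generated docker-compose.yml in your project is the source of truth):

```yaml
version: "3"
services:
  jupyter:                # hypothetical service name
    build: .              # build the image from the project's own Dockerfile
    ports:
      - "8888"            # no host port given: Docker binds a free host port and forwards it to 8888
    volumes:
      - .:/mnt/project    # hypothetical mount point: share project code with the container in real time
```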
There are several downstream benefits to moving to a container-first workflow in terms of model and inference engine deployment and delivery. By using containers early in the development cycle you can eliminate many of the configuration management issues that waste developer time and ultimately lower the quality of deliverables.
- Install Docker:
- For Mac: https://store.docker.com/editions/community/docker-ce-desktop-mac
- For Windows: https://store.docker.com/editions/community/docker-ce-desktop-windows
- For Linux: Go to this page and choose the appropriate install for your Linux distro: https://www.docker.com/community-edition
- Install Docker Compose (https://docs.docker.com/compose/install/#install-compose):
$ sudo curl -L https://github.com/docker/compose/releases/download/1.21.0/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose

Test the installation:

$ docker-compose --version
docker-compose version 1.21.0, build 1719ceb
- Install the Python Cookiecutter package (version >= 1.4.0): http://cookiecutter.readthedocs.org/en/latest/installation.html
It is recommended to set up a central virtualenv or conda env for Cookiecutter and any other "system-wide" Python packages you may need.
$ pip install cookiecutter
- Run Cookiecutter with the Docker data science template to scaffold your new project:
$ cookiecutter https://github.com/Fabien-DS/docker-cookiecutter-data-science.git
- Answer all of the cookiecutter prompts for project name, description, license, etc.
- Run the start script from the root of your new project directory:
$ ./start.sh
- After the project image builds, check which host port is being forwarded to the Jupyter notebook server inside the running container (see the example after this list):
$ docker ps
- Using any browser, access your notebook at localhost:{port}
- Start working!
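For example, the forwarded port can be read from the PORTS column of `docker ps` (the container name and host port below are illustrative; your values will differ):

```bash
$ docker ps --format "table {{.Names}}\t{{.Ports}}"
NAMES                  PORTS
myproject_jupyter_1    0.0.0.0:32768->8888/tcp
```

In this example, the Jupyter server listening on port 8888 inside the container is reachable at localhost:32768 on the host.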
For more details on what packages are available pre-installed in the base image see the manifoldai/orbyter-ml-dev repository page on Docker Hub.
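If you want to verify the exact package versions yourself, one option (assuming you have pulled the image; `--entrypoint` is used in case the image defines its own entrypoint) is:

```bash
$ docker run --rm --entrypoint pip manifoldai/orbyter-ml-dev freeze
```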
The directory structure of your new project looks like this:
├── LICENSE
├── Dockerfile             <- New project Dockerfile that sources from base ML dev image
├── docker-compose.yml     <- Docker Compose configuration file
├── docker_clean_all.sh    <- Helper script to remove all containers and images from your system
├── start.sh               <- Script to run docker compose and any other project specific initialization steps
├── Makefile               <- Makefile with commands like `make data` or `make train`
├── README.md              <- The top-level README for developers using this project.
├── data
│   ├── external           <- Data from third party sources.
│   ├── interim            <- Intermediate data that has been transformed.
│   ├── processed          <- The final, canonical data sets for modeling.
│   └── raw                <- The original, immutable data dump.
│
├── docs                   <- A default Sphinx project; see sphinx-doc.org for details
│
├── models                 <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks              <- Jupyter notebooks. Naming convention is a number (for ordering),
│                             the creator's initials, and a short `-` delimited description, e.g.
│                             `1.0-jqp-initial-data-exploration`.
│
├── references             <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports                <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures            <- Generated graphics and figures to be used in reporting
│
├── requirements.txt       <- The requirements file for reproducing the analysis environment, e.g.
│                             generated with `pip freeze > requirements.txt`
│
├── src                    <- Source code for use in this project.
│   ├── __init__.py        <- Makes src a Python module
│   │
│   ├── data               <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features           <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models             <- Scripts to train models and then use trained models to make
│   │   │                     predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization      <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini                <- tox file with settings for running tox; see tox.testrun.org
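As a sketch, the generated project Dockerfile shown in the tree above typically just extends the base image and layers on project-specific requirements; the tag and paths below are assumptions, so check the generated file for the actual contents:

```dockerfile
# Source from the shared ML dev base image (tag is an assumption)
FROM manifoldai/orbyter-ml-dev:latest

# Layer on project-specific Python dependencies
COPY requirements.txt /build/requirements.txt
RUN pip install -r /build/requirements.txt
```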
- Docker command cheatsheet: https://www.docker.com/sites/default/files/Docker_CheatSheet_08.09.2016_0.pdf
- Dockerfile reference: https://docs.docker.com/engine/reference/builder/
- Docker Compose reference: https://docs.docker.com/compose/compose-file/
- Kitematic (GUI interface to work with Docker. Highly recommended if you are new to Docker!): https://kitematic.com/
We are trying to bridge the gap that exists today between data science and dev/operations teams. We wrote about it here: https://medium.com/manifold-ai/torus-a-toolkit-for-docker-first-data-science-bddcb4c97b52
PRs and feature requests are very welcome!