A modern template for data science projects with all the necessary tools for experimentation, development, testing, and deployment. From notebooks to production.
✨📚✨ Documentation: https://joserzapata.github.io/data-science-project-template/
Source Code: https://github.com/JoseRZapata/data-science-project-template
It is highly recommended to use a Python version manager like Pyenv, and this project is set up to use Poetry >= 1.8 to manage the dependencies and the environment.
Note: Poetry >= 1.8 should always be installed in a dedicated virtual environment to isolate it from the rest of your system - why? I recommend using uv to install Poetry in an isolated environment.
🌟 Check how to setup your environment: https://joserzapata.github.io/data-science-project-template/local_setup/
🍪🥇 Via Cruft - (recommended)
# Install cruft in an isolated environment using uv
uv tool install cruft
# Or install with pip
pip install --user cruft # Install `cruft` on your PATH for easy access
cruft create https://github.com/JoseRZapata/data-science-project-template
🍪 Via Cookiecutter
uv tool install cookiecutter # Install cookiecutter in an isolated environment
# Or install with pip
pip install --user cookiecutter # Install `cookiecutter` on your PATH for easy access
cookiecutter gh:JoseRZapata/data-science-project-template
Note: Cookiecutter uses `gh:` as shorthand for `https://github.com/`
If the project was originally created with Cookiecutter, you must first use Cruft to link the project to the original template:
cruft link https://github.com/JoseRZapata/data-science-project-template
Then (or if the project is already linked), run:
cruft update
Folder structure for data science projects - why?
.
├── .code_quality
│ ├── mypy.ini # mypy configuration
│ └── ruff.toml # ruff configuration
├── .github # github configuration
│ ├── actions
│ │ └── python-poetry-env
│ │ └── action.yml # github action to setup python environment
│ ├── dependabot.md # github action to update dependencies
│ ├── pull_request_template.md # template for pull requests
│ └── workflows # github actions workflows
│ ├── ci.yml # run continuous integration (tests, pre-commit, etc.)
│ ├── dependency_review.yml # review dependencies
│ ├── docs.yml # build documentation (mkdocs)
│ └── pre-commit_autoupdate.yml # update pre-commit hooks
├── .vscode # vscode configuration
│ ├── extensions.json # list of recommended extensions
│ ├── launch.json # vscode launch configuration
│ └── settings.json # vscode settings
├── conf # configuration files folder
│ └── config.yaml # main configuration file
├── data
│ ├── 01_raw # raw immutable data
│ ├── 02_intermediate # typed data
│ ├── 03_primary # domain model data
│ ├── 04_feature # model features
│ ├── 05_model_input # often called 'master tables'
│ ├── 06_models # serialized models
│ ├── 07_model_output # data generated by model runs
│ ├── 08_reporting # reports, results, etc
│ └── README.md # description of the data structure
├── docs # documentation for your project
│ └── index.md # documentation homepage
├── models # store final models
├── notebooks
│ ├── 1-data # data extraction and cleaning
│ ├── 2-exploration # exploratory data analysis (EDA)
│ ├── 3-analysis # statistical analysis, hypothesis testing
│ ├── 4-feat_eng # feature engineering (creation, selection, and transformation)
│ ├── 5-models # model training, experimentation, and hyperparameter tuning
│ ├── 6-evaluation # evaluation metrics, performance assessment
│ ├── 7-deploy # model packaging, deployment strategies
│ ├── 8-reports # storytelling, summaries, and analysis conclusions
│ ├── notebook_template.ipynb # template for notebooks
│ └── README.md # information about the notebooks
├── src # source code for use in this project
│ ├── libs # custom python scripts
│ │ ├── data_etl # data extraction, transformation, and loading
│ │ ├── data_validation # data validation
│ │ ├── feat_cleaning # feature engineering data cleaning
│ │ ├── feat_encoding # feature engineering encoding
│ │ ├── feat_imputation # feature engineering imputation
│ │ ├── feat_new_features # feature engineering new features
│ │ ├── feat_pipelines # feature engineering pipelines
│ │ ├── feat_preprocess_strings # feature engineering pre process strings
│ │ ├── feat_scaling # feature engineering scaling data
│ │ ├── feat_selection # feature engineering feature selection
│ │ ├── feat_strings # feature engineering strings
│ │ ├── metrics # evaluation metrics
│ │ ├── model # model training and prediction
│ │ ├── model_evaluation # model evaluation
│ │ ├── model_selection # model selection
│ │ ├── model_validation # model validation
│ │ └── reports # reports
│ └── pipelines
│   ├── data_etl # data extraction, transformation, and loading
│   ├── feature_engineering # prepare data for modeling
│   ├── model_evaluation # evaluate model performance
│   ├── model_prediction # model predictions
│   └── model_train # train models
├── tests # test code for your project
│ └── test_mock.py # example test file
├── .editorconfig # editor configuration
├── .gitignore # files to ignore in git
├── .pre-commit-config.yaml # configuration for pre-commit hooks
├── codecov.yml # configuration for codecov
├── Makefile # useful commands to setup environment, run tests, etc.
├── mkdocs.yml # configuration for mkdocs documentation
├── poetry.toml # poetry virtual environment configuration
├── pyproject.toml # dependencies for poetry
└── README.md # description of your project
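As an illustration of the kind of helper that could live under `src/libs/data_etl`, here is a minimal sketch of a function that takes raw CSV text (as stored in `data/01_raw`) and produces typed rows (as stored in `data/02_intermediate`). The function name, schema format, and columns are hypothetical, not part of the template:

```python
import csv
import io


def load_typed_csv(raw_csv: str, schema: dict[str, type]) -> list[dict]:
    """Read raw CSV text and cast each column according to `schema`.

    Raw data is treated as immutable; the typed result is what would be
    written to the intermediate layer.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    typed_rows = []
    for row in reader:
        # Apply the caster (e.g. int, float) declared for each column
        typed_rows.append({col: caster(row[col]) for col, caster in schema.items()})
    return typed_rows


raw = "age,income\n31,52000.5\n45,61000.0\n"
rows = load_typed_csv(raw, {"age": int, "income": float})
```

Keeping small, pure functions like this in `src/libs` (rather than inside notebooks) is what makes them reusable from the `src/pipelines` steps and testable from `tests/`.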
- Python packaging, dependency management, and environment management with Poetry - why?
- Project workflow orchestration with Make as an interface shim
- Self-documenting Makefile: just type `make` on the command line to display auto-generated documentation of the available targets
- Automated Cookiecutter template synchronization with Cruft - why?
- Code quality tooling automation and management with pre-commit
- Continuous integration and deployment with GitHub Actions
- Project configuration files with Hydra - why?
- Optional: Jupyter support
- Static type-checking with Mypy
- Testing with Pytest
- Code coverage with Coverage.py
- Coverage reporting with Codecov
- Ruff: an extremely fast (10x-100x faster) Python linter and code formatter, written in Rust
- ShellCheck
- Checks for unsanitary commits:
  - Secrets, with detect-secrets
  - Large files, with check-added-large-files
  - Files that contain merge conflict strings, with check-merge-conflict
- General file formatting
- Dependency updates with Dependabot, and automated Dependabot PR merging with the Dependabot Auto Merge GitHub Action
- Dependency Review with dependency-review-action: this action scans your pull requests for dependency changes and raises an error if any vulnerabilities or invalid licenses are being introduced. It replaces pip-audit in CI; in your local environment, you can still use pip-audit to check your dependencies for vulnerabilities.
- Automatic pre-commit hook updates with the GitHub Actions workflow .github/workflows/pre-commit_autoupdate.yml
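To sketch what testing with Pytest looks like in this layout: the template ships an example file at `tests/test_mock.py`, and Pytest discovers any `test_*` function and runs its plain `assert` statements (both CI and `make test` invoke it). The function below is a made-up example, not the template's actual code:

```python
# tests/test_example.py - a hypothetical test; run with `make test` or `pytest`


def add(a: int, b: int) -> int:
    """Toy function standing in for code from src/libs."""
    return a + b


def test_add() -> None:
    # Pytest collects this function because its name starts with test_
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```

Coverage.py measures which lines these tests execute, and Codecov reports that coverage on pull requests.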
A Makefile automates setting up your environment, installing dependencies, running tests, etc. In a terminal, type `make` to see the available commands:
Target Description
------------------- ----------------------------------------------------
check Run code quality tools with pre-commit hooks.
docs_test Test if documentation can be built without warnings or errors
docs_view Build and serve the documentation
init_env Install dependencies with poetry and activate env
init_git Initialize git repository
install_data_libs Install pandas, scikit-learn, Jupyter, seaborn
install_mlops_libs Install dvc, mlflow
pre-commit_update Update pre-commit hooks
test Test the code with pytest and coverage
- Documentation building with MkDocs - Tutorial
- Powered by mkdocs-material
- Rich automatic documentation from type annotations and docstrings (NumPy, Google, etc.) with mkdocstrings
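For example, a function documented with a NumPy-style docstring like the hypothetical one below can be rendered into the MkDocs site by mkdocstrings (the function and its module are illustrative, not part of the template):

```python
def scale(values: list[float], factor: float = 1.0) -> list[float]:
    """Scale each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Input values to scale.
    factor : float, optional
        Multiplier applied to each value, by default 1.0.

    Returns
    -------
    list of float
        The scaled values.
    """
    return [v * factor for v in values]
```

mkdocstrings reads the type annotations and the docstring sections (Parameters, Returns) to build the API page, so the documentation stays next to the code it describes.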
- https://drivendata.github.io/cookiecutter-data-science/
- https://github.com/crmne/cookiecutter-modern-datascience
- https://github.com/fpgmaas/cookiecutter-poetry
- https://github.com/khuyentran1401/data-science-template
- https://github.com/woltapp/wolt-python-package-cookiecutter
- https://khuyentran1401.github.io/reproducible-data-science/structure_project/introduction.html
- https://github.com/TeoZosa/cookiecutter-cruft-poetry-tox-pre-commit-ci-cd
- https://github.com/cjolowicz/cookiecutter-hypermodern-python
- https://github.com/gotofritz/cookiecutter-gotofritz-poetry
- https://github.com/kedro-org/kedro-starters