A modern template for data science projects with all the necessary tools for experimentation, development, testing, and deployment. From notebooks to production.
✨📚✨ Documentation: https://joserzapata.github.io/data-science-project-template/
Source Code: https://github.com/JoseRZapata/data-science-project-template
It is highly recommended to use a Python version manager like Pyenv, and this project is set up to use Poetry >= 1.8 to manage the dependencies and the environment.
Note: Poetry >= 1.8 should always be installed in a dedicated virtual environment to isolate it from the rest of your system - why? I recommend using uv to install Poetry in an isolated environment.
🌟 Check how to setup your environment: https://joserzapata.github.io/data-science-project-template/local_setup/
🍪🥇 Via Cruft - (recommended)
# Install cruft in an isolated environment using uv
uv tool install cruft
# Or install with pip
pip install --user cruft # Install `cruft` on your PATH for easy access
cruft create https://github.com/JoseRZapata/data-science-project-template
🍪 Via Cookiecutter
uv tool install cookiecutter # Install cookiecutter in an isolated environment
# Or install with pip
pip install --user cookiecutter # Install `cookiecutter` on your PATH for easy access
cookiecutter gh:JoseRZapata/data-science-project-template
Note: Cookiecutter uses `gh:` as shorthand for `https://github.com/`
If the project was originally created with Cookiecutter, you must first use Cruft to link the project to the original template:
cruft link https://github.com/JoseRZapata/data-science-project-template
Then (or if the project is already linked), run:
cruft update
Folder structure for data science projects - why?
.
├── .code_quality
│ ├── mypy.ini # mypy configuration
│ └── ruff.toml # ruff configuration
├── .github # github configuration
│ ├── actions
│ │ └── python-poetry-env
│ │ └── action.yml # github action to setup python environment
│ ├── dependabot.md # github action to update dependencies
│ ├── pull_request_template.md # template for pull requests
│ └── workflows # github actions workflows
│ ├── ci.yml # run continuous integration (tests, pre-commit, etc.)
│ ├── dependency_review.yml # review dependencies
│ ├── docs.yml # build documentation (mkdocs)
│ └── pre-commit_autoupdate.yml # update pre-commit hooks
├── .vscode # vscode configuration
│ ├── extensions.json # list of recommended extensions
│ ├── launch.json # vscode launch configuration
│ └── settings.json # vscode settings
├── conf # configuration files folder
│ └── config.yaml # main configuration file
├── data
│ ├── 01_raw # raw immutable data
│ ├── 02_intermediate # typed data
│ ├── 03_primary # domain model data
│ ├── 04_feature # model features
│ ├── 05_model_input # often called 'master tables'
│ ├── 06_models # serialized models
│ ├── 07_model_output # data generated by model runs
│ ├── 08_reporting # reports, results, etc
│ └── README.md # description of the data structure
├── docs # documentation for your project
│ └── index.md # documentation homepage
├── models # store final models
├── notebooks
│ ├── 1-data # data extraction and cleaning
│ ├── 2-exploration # exploratory data analysis (EDA)
│ ├── 3-analysis # statistical analysis, hypothesis testing
│ ├── 4-feat_eng # feature engineering (creation, selection, and transformation)
│ ├── 5-models # model training, experimentation, and hyperparameter tuning
│ ├── 6-evaluation # evaluation metrics, performance assessment
│ ├── 7-deploy # model packaging, deployment strategies
│ ├── 8-reports # storytelling, summaries, and analysis conclusions
│ ├── notebook_template.ipynb # template for notebooks
│ └── README.md # information about the notebooks
├── src # source code for use in this project
│ ├── libs # custom python scripts
│ │ ├── data_etl # data extraction, transformation, and loading
│ │ ├── data_validation # data validation
│ │ ├── feat_cleaning # feature engineering data cleaning
│ │ ├── feat_encoding # feature engineering encoding
│ │ ├── feat_imputation # feature engineering imputation
│ │ ├── feat_new_features # feature engineering new features
│ │ ├── feat_pipelines # feature engineering pipelines
│ │ ├── feat_preprocess_strings # feature engineering pre process strings
│ │ ├── feat_scaling # feature engineering scaling data
│ │ ├── feat_selection # feature engineering feature selection
│ │ ├── feat_strings # feature engineering strings
│ │ ├── metrics # evaluation metrics
│ │ ├── model # model training and prediction
│ │ ├── model_evaluation # model evaluation
│ │ ├── model_selection # model selection
│ │ ├── model_validation # model validation
│ │ └── reports # reports
│ └── pipelines
│   ├── data_etl # data extraction, transformation, and loading
│   ├── feature_engineering # prepare data for modeling
│   ├── model_evaluation # evaluate model performance
│   ├── model_prediction # model predictions
│   └── model_train # train models
├── tests # test code for your project
│ └── test_mock.py # example test file
├── .editorconfig # editor configuration
├── .gitignore # files to ignore in git
├── .pre-commit-config.yaml # configuration for pre-commit hooks
├── codecov.yml # configuration for codecov
├── Makefile # useful commands to setup environment, run tests, etc.
├── mkdocs.yml # configuration for mkdocs documentation
├── poetry.toml # poetry virtual environment configuration
├── pyproject.toml # dependencies for poetry
└── README.md # description of your project
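As an illustration of the kind of helper that could live under `src/libs/data_etl`, here is a minimal sketch of a function that takes raw CSV text (as stored in `data/01_raw`) and produces typed rows (as stored in `data/02_intermediate`). The function name, schema format, and columns are hypothetical, not part of the template:

```python
import csv
import io


def load_typed_csv(raw_csv: str, schema: dict[str, type]) -> list[dict]:
    """Read raw CSV text and cast each column according to `schema`.

    Raw data is treated as immutable; the typed result is what would be
    written to the intermediate layer.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    typed_rows = []
    for row in reader:
        # Apply the caster (e.g. int, float) declared for each column
        typed_rows.append({col: caster(row[col]) for col, caster in schema.items()})
    return typed_rows


raw = "age,income\n31,52000.5\n45,61000.0\n"
rows = load_typed_csv(raw, {"age": int, "income": float})
```

Keeping small, pure functions like this in `src/libs` (rather than inside notebooks) is what makes them reusable from the `src/pipelines` steps and testable from `tests/`.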
- Python packaging, dependency management, and environment management with Poetry - why?
- Project workflow orchestration with Make as an interface shim
- Self-documenting Makefile: just type `make` on the command line to display auto-generated documentation of the available targets
- Automated Cookiecutter template synchronization with Cruft - why?
- Code quality tooling automation and management with pre-commit
- Continuous integration and deployment with GitHub Actions
- Project configuration files with Hydra - why?
- Optional: Jupyter support
- Static type-checking with Mypy
- Testing with Pytest
- Code coverage with Coverage.py
- Coverage reporting with Codecov
- Ruff: an extremely fast (10x-100x faster) Python linter and code formatter, written in Rust
- ShellCheck
- Checks for unsanitary commits:
  - Secrets, with detect-secrets
  - Large files, with check-added-large-files
  - Files that contain merge conflict strings, with check-merge-conflict
- General file formatting
- Dependency updates with Dependabot, and automated Dependabot PR merging with the Dependabot Auto Merge GitHub Action
- Dependency Review with dependency-review-action: this action scans your pull requests for dependency changes and raises an error if any vulnerabilities or invalid licenses are being introduced. It replaces pip-audit in CI; in your local environment, you can still use pip-audit to check your dependencies for vulnerabilities.
- Automatic pre-commit hook updates with the GitHub Actions workflow .github/workflows/pre-commit_autoupdate.yml
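To sketch what testing with Pytest looks like in this layout: the template ships an example file at `tests/test_mock.py`, and Pytest discovers any `test_*` function and runs its plain `assert` statements (both CI and `make test` invoke it). The function below is a made-up example, not the template's actual code:

```python
# tests/test_example.py - a hypothetical test; run with `make test` or `pytest`


def add(a: int, b: int) -> int:
    """Toy function standing in for code from src/libs."""
    return a + b


def test_add() -> None:
    # Pytest collects this function because its name starts with test_
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```

Coverage.py measures which lines these tests execute, and Codecov reports that coverage on pull requests.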
A Makefile automates setting up your environment, installing dependencies, running tests, etc. In a terminal, type `make` to see the available commands:
Target Description
------------------- ----------------------------------------------------
check Run code quality tools with pre-commit hooks.
docs_test Test if documentation can be built without warnings or errors
docs_view Build and serve the documentation
init_env Install dependencies with poetry and activate env
init_git Initialize git repository
install_data_libs Install pandas, scikit-learn, Jupyter, seaborn
install_mlops_libs Install dvc, mlflow
pre-commit_update Update pre-commit hooks
test Test the code with pytest and coverage
- Documentation building with MkDocs - Tutorial
- Powered by mkdocs-material
- Rich automatic documentation from type annotations and docstrings (NumPy, Google, etc.) with mkdocstrings
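For example, a function documented with a NumPy-style docstring like the hypothetical one below can be rendered into the MkDocs site by mkdocstrings (the function and its module are illustrative, not part of the template):

```python
def scale(values: list[float], factor: float = 1.0) -> list[float]:
    """Scale each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Input values to scale.
    factor : float, optional
        Multiplier applied to each value, by default 1.0.

    Returns
    -------
    list of float
        The scaled values.
    """
    return [v * factor for v in values]
```

mkdocstrings reads the type annotations and the docstring sections (Parameters, Returns) to build the API page, so the documentation stays next to the code it describes.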
- https://drivendata.github.io/cookiecutter-data-science/
- https://github.com/crmne/cookiecutter-modern-datascience
- https://github.com/fpgmaas/cookiecutter-poetry
- https://github.com/khuyentran1401/data-science-template
- https://github.com/woltapp/wolt-python-package-cookiecutter
- https://khuyentran1401.github.io/reproducible-data-science/structure_project/introduction.html
- https://github.com/TeoZosa/cookiecutter-cruft-poetry-tox-pre-commit-ci-cd
- https://github.com/cjolowicz/cookiecutter-hypermodern-python
- https://github.com/gotofritz/cookiecutter-gotofritz-poetry
- https://github.com/kedro-org/kedro-starters