A simple way to distribute and refer to data files that cannot be included directly in the DISPATCHES repository.
- Provide a reliable way to access the location of data files from DISPATCHES client code, regardless of the specifics of how DISPATCHES is installed (editable vs non-editable installation, local development vs CI, etc)
- Leverage as much as possible the built-in Python package distribution infrastructure to distribute collections of related non-code files of small to moderate size (< 100 MB compressed)
- Allow multiple repositories/package distributions to be used seamlessly, so that the size limits apply to each data package independently

The following are explicitly out of scope:
- Managing the packaged data in any way beyond the file-system level
  - i.e. the data package infrastructure only provides paths, which the client code uses to load the data into memory according to its specific needs
- Managing and/or exposing metadata beyond the name of the package and the Python package distribution used to install it
- Automatically enforcing data distribution compliance requirements (LICENSE, COPYRIGHT, etc)
  - This MUST still be done, but the process shall be manual rather than automatic
- DISPATCHES data packages SHALL be available on GitHub as repositories owned by the https://github.com/gmlc-dispatches organization
- DISPATCHES data packages MAY be available on PyPI
- The naming scheme SHOULD be consistent and follow this convention (using `my-example` as a placeholder):
  - Repository URL: https://github.com/gmlc-dispatches/my-example-data
  - Python package distribution name: `dispatches-my-example-data`
- The repository SHOULD register itself by adding the `dispatches-data-package` topic so that all data package repositories can be browsed at the URL https://github.com/topics/dispatches-data-package
- The repository MUST follow this directory structure:

      my-example-data/
      `- .git/
      `- pyproject.toml
      `- src/
         `- dispatches_data/
            `- packages/
               `- my_example/
                  `- __init__.py
                  `- README.md
- Once installed, the data files SHALL be stored within the Python environment's `site-packages` directory as `.../lib/python3.8/site-packages/dispatches_data/packages/my_example`, i.e. the data package directory
- The name of the data package directory (`my_example`) SHALL be used to refer to the data package
- Users should access the data package and its contents using the functions available in the `dispatches_data.api` module
- The Python package directory (i.e. `.../lib/python3.8/site-packages/dispatches_data/packages/my_example`) MUST contain ALL information required for distribution of the data
  - This includes, but is not limited to:
    - License
    - Copyright
- The same information MAY be repeated at the top level of the repository, but it MUST be in the package directory
  - This is to ensure that all required information is always present when the data files are installed (which might not be the case if the information is stored at the top level of the repository)
- More than one data package MAY be distributed together (i.e. as part of the same repository and/or Python package distribution)
- In this case, all of the above requirements apply to each data package individually (i.e. each separate data package directory MUST contain the appropriate required information)
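
Because compliance is checked manually rather than enforced automatically, a small helper script can make the manual check less error-prone. The following is a minimal sketch and not part of DISPATCHES: the function name `find_missing_files` and the file names `LICENSE.md` and `COPYRIGHT.md` are assumptions chosen for illustration (only `__init__.py` and `README.md` appear in the directory structure above), so adapt them to the files your data package actually ships.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Hypothetical file names: the requirements only state that license and
# copyright information must live inside each data package directory.
REQUIRED_FILES = ("__init__.py", "README.md", "LICENSE.md", "COPYRIGHT.md")


def find_missing_files(
    packages_root: Path,
    required: Iterable[str] = REQUIRED_FILES,
) -> Dict[str, List[str]]:
    """Map each data package directory name to the required files it lacks.

    `packages_root` is the `src/dispatches_data/packages` directory of a
    data package repository. Packages with nothing missing are omitted.
    """
    required = tuple(required)
    missing: Dict[str, List[str]] = {}
    for package_dir in sorted(packages_root.iterdir()):
        # Skip plain files and helper directories such as __pycache__ or .git.
        if not package_dir.is_dir() or package_dir.name.startswith(("_", ".")):
            continue
        absent = [name for name in required if not (package_dir / name).is_file()]
        if absent:
            missing[package_dir.name] = absent
    return missing
```

Run from the repository root, `find_missing_files(Path("src/dispatches_data/packages"))` returns an empty dictionary when every data package directory contains all of the listed files.
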
Locate the data package(s) required by your application. In general, unless otherwise indicated, the naming conventions described above apply.
Using the same `my_example` placeholder as above, the data package repository will be located at https://github.com/gmlc-dispatches/my-example-data.git
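
When several data packages are involved, it can help to derive the conventional names programmatically. The sketch below simply encodes the naming convention described above; `names_for` and `DataPackageNames` are hypothetical names introduced for illustration, and the result is only valid for data packages that actually follow the convention.

```python
from typing import NamedTuple

GITHUB_ORG_URL = "https://github.com/gmlc-dispatches"


class DataPackageNames(NamedTuple):
    repository_url: str
    distribution_name: str
    package_name: str


def names_for(key: str) -> DataPackageNames:
    """Derive the conventional names from a hyphenated key such as "my-example"."""
    return DataPackageNames(
        repository_url=f"{GITHUB_ORG_URL}/{key}-data",
        distribution_name=f"dispatches-{key}-data",
        # Python package names cannot contain hyphens, hence the underscores.
        package_name=key.replace("-", "_"),
    )
```

For instance, `names_for("my-example").repository_url + ".git"` yields the URL to pass to `pip install git+...`, and `names_for("my-example").package_name` is the name used to refer to the data package from client code.
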
Install the data package(s) required by your application, using `pip`.

    pip install git+https://github.com/gmlc-dispatches/my-example-data.git

Verify that the data packages were installed correctly, e.g.:

    pip show dispatches-my-example-data
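
The same check can be made from Python, which is convenient in CI or at the top of a notebook. This is a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is an assumption made for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("dispatches-my-example-data")
    if version is None:
        print("dispatches-my-example-data is not installed in this environment")
    else:
        print(f"dispatches-my-example-data {version} is installed")
```

Note that the lookup uses the Python package distribution name (`dispatches-my-example-data`), not the data package name (`my_example`).
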
It should now be possible to access the data package from the client code, i.e. the DISPATCHES code that will load and use the data files, using the functions exposed in the `dispatches_data.api` module. These are simple functions that typically take the data package name (`my_example`) as a `str` argument.

Let's assume we want to create a dataframe from a file named `mydata.csv` in the `my_example` data package.
In a Python file or Jupyter notebook:

    import pandas as pd
    from dispatches_data.api import path


    def load_data() -> pd.DataFrame:
        path_to_csv_file = path("my_example") / "mydata.csv"
        df = pd.read_csv(path_to_csv_file)
        # process df as needed
        return df


    def main():
        df = load_data()
        ...  # rest of the code
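
To build some intuition for why this kind of lookup can work regardless of how a data package was installed, the sketch below shows one way to derive a directory path from an importable package using only the standard library. It is an illustration under assumptions, not the actual implementation of `dispatches_data.api.path`; the helper name `package_directory` is hypothetical.

```python
import importlib
from pathlib import Path


def package_directory(dotted_name: str) -> Path:
    """Return the directory containing the `__init__.py` of an importable package."""
    module = importlib.import_module(dotted_name)
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        # Namespace packages and built-in modules have no file on disk.
        raise ValueError(f"{dotted_name!r} has no __init__.py on disk")
    return Path(module_file).resolve().parent
```

With a data package installed, `package_directory("dispatches_data.packages.my_example") / "mydata.csv"` would be expected to point at the same file as in the example above, because the import system resolves the package location for both editable and non-editable installations.
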
See the documentation for the `dispatches_data.api` module on ReadTheDocs.