Helping ecologists generalize models for predicting plant traits to better understand the health of ecosystems.
Code for this Kaggle competition: https://www.kaggle.com/competitions/planttraits2024/overview
For additional notes see this doc: https://docs.google.com/document/d/1YLDUVcI2sjkkCSk9zewKPOpY5vFpeMFmkDEsXBXCNCU/edit
To get up and running with the code, follow these steps. Make sure you have Docker installed.

- Run `bin/build.sh`
- Download `planttraits2024.zip` from the Kaggle competition linked above. Unzip it in `data/raw`
- Run `bin/preprocess_data.sh` to prepare the data and precompute embeddings
- For W&B logging, contact Nathan to be added to https://wandb.ai/nathan-mandi/PlantTraits2024. Then, create a `.env` file with your `WANDB_API_KEY`

You are set up! You should now be able to run `bin/train.sh` and other scripts to train models, interact with the code in the repo, etc.
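A minimal `.env` for the W&B step above, assuming only the API key is required (the value is a placeholder for your own key):

```
WANDB_API_KEY=<your-wandb-api-key>
```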
Most of the modules have tests. Run them with `bin/test.sh`, passing in any args you would pass to `pytest`.

Ex: `bin/test.sh -s tests/data/datasets/test_baseline_dataset.py`
This repo uses Potluck, a machine learning repo template from Kung Fu AI. Here are instructions for use.

Prerequisites:

- [Docker][docker-url]
- [Docker Compose][docker-compose-url]
- [NVIDIA Docker Container Runtime][nvidia-url]
- Run `bin/build.sh`
- Run `bin/preprocess_data.sh`
To train the model, all you need to do is run this command:
bin/train.sh
Once the Docker image is built, we can run the project's unit tests to verify everything is working. The command below will start a Docker container and execute all unit tests using the pytest framework.
bin/test.sh
If you want to run the tests in a specific file or directory (rather than running all the tests in
the `tests/` directory), pass the path directly. To filter by test name instead, use pytest's `-k`
flag with a keyword expression.

For example, to run the tests in `tests/test_api.py`, we can run:

`bin/test.sh tests/test_api.py`

or, to run only tests whose names contain "test_api":

`bin/test.sh -k test_api`
The `bin/` directory contains basic shell scripts that allow us access to common commands on most
environments. We're not guaranteed much functionality on any generic machine, so keeping these
scripts basic is important.

The most commonly used scripts are:
- `bin/build.sh` - build docker container(s) defined in `Dockerfile` and `compose.yaml`
- `bin/test.sh` - run unit tests defined in `tests/`
- `bin/notebook.sh` - instantiate a new jupyter notebook server
- `bin/shell.sh` - instantiate a new bash terminal inside the container
- `bin/train.sh` - train a model
Additional scripts:

- `bin/lint.sh` - check code formatting for the project
- `bin/setup_environment.sh` - set any build arguments or settings for all containers brought up with docker compose
- `bin/up.sh` - bring up all containers defined in `compose.yaml`
- `bin/down.sh` - stop all containers defined in `compose.yaml` and remove associated volumes, networks, and images
Data organization philosophy from cookiecutter data science:

```
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
```
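Code that follows this layout can refer to each stage with path constants. A minimal sketch (the constant names and helper below are illustrative, not taken from this repo's actual code):

```python
from pathlib import Path

# Stage directories from the tree above; these constant names are
# illustrative, not taken from the repo's code.
DATA_DIR = Path("data")
EXTERNAL_DIR = DATA_DIR / "external"    # third-party sources
INTERIM_DIR = DATA_DIR / "interim"      # intermediate, transformed data
PROCESSED_DIR = DATA_DIR / "processed"  # final, canonical modeling data
RAW_DIR = DATA_DIR / "raw"              # original, immutable data dump


def raw_path(name: str) -> Path:
    """Path to a file in the raw dump, e.g. the unzipped Kaggle files."""
    return RAW_DIR / name
```

Keeping `raw/` immutable means preprocessing (e.g. `bin/preprocess_data.sh`) should only ever write to `interim/` or `processed/`, never back into `raw/`.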