Youth Mental Health: Automated Abstraction

Python 3.10

Welcome to the runtime repository for the Youth Mental Health: Automated Abstraction challenge on DrivenData! This repository contains a few things to help you create your code submission for this code execution competition:

  1. Example submission (example_submission/) — a simple demonstration solution, which runs successfully in the code execution runtime and outputs a valid submission. This provides the function signatures that you should implement in your solution.
  2. Runtime environment specification (runtime/) — the definition of the environment in which your code will run.

You can use this repository to:

🔧 Test your submission: Test your submission using a locally running version of the competition runtime to discover errors before submitting to the competition website.

📦 Request new packages in the official runtime: Since your submission will not have general access to the internet, all dependencies must be pre-installed. If you want to use a package that is not already in the runtime environment, make a pull request to this repository. Make sure to test out adding the new package to both official environments (CPU and GPU).

Changes to the repository are documented in CHANGELOG.md.



Quickstart

This quickstart guide will show you how to get started using this repository.

Prerequisites

When you make a submission on the DrivenData competition site, we run your submission inside a Docker container, a virtual operating system that allows for a consistent software environment across machines. The best way to make sure your submission will run successfully is to test it in a container on your local machine first. For that, you'll need:

  • A clone of this repository
  • Docker
  • At least 5 GB of free space for the CPU version of the Docker image or at least 15 GB of free space for the GPU version
  • GNU make (optional, but useful for running the commands in the Makefile)

Additional requirements to run with GPU:

Setting up the data directory

In the official code execution platform, code_execution/data will contain features for the test set. See the code submission page for details of the code_execution/data/test_features.csv file.

To test your submission in a local container, save a file under data/test_features.csv that matches the format of the actual test features file. For example, you could use a set of training examples. When you run your submission in a Docker container locally, the file you provide will be included in the container.
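
One way to build a small local data/test_features.csv from training examples is sketched below. It is only an illustration: the function name `make_local_test_features`, the default of 100 rows, and the input file name used in the usage note are assumptions, not part of this repository.

```python
import csv
from pathlib import Path


def make_local_test_features(train_path, out_path="data/test_features.csv", n_rows=100):
    """Copy the header and the first n_rows data rows of a CSV file to out_path.

    Hypothetical helper: it assumes the training features file already has the
    same columns as the real test features file.
    """
    train_path, out_path = Path(train_path), Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with train_path.open(newline="", encoding="utf-8") as src, \
            out_path.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)  # csv.reader copes with quoted multi-line text fields
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{train_path} is empty")
        writer.writerow(header)
        for row in reader:
            if written >= n_rows:
                break
            writer.writerow(row)
            written += 1
    return written  # number of data rows copied
```

For example, `make_local_test_features("data/train_features.csv")` would write the first 100 training rows to data/test_features.csv, assuming your training features file is saved under that name.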

Evaluating your predictions

We also provide a script for you to evaluate your generated predictions using known training set labels. src/scoring.py takes the path to your predictions and the path to the corresponding labels, and calculates the variable-averaged F1 score per the competition performance metric.

$ python src/scoring.py submission/submission.csv data/train_labels.csv
Variable-averaged F1 score: 0.0061
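
To give a feel for what "variable-averaged" means, here is a simplified sketch that computes an F1 score per variable and then takes the unweighted mean. It assumes every variable is binary (0/1) and that an undefined F1 counts as 0.0; src/scoring.py is the authoritative implementation and may treat some variables differently. The function names are made up for this example.

```python
def binary_f1(y_true, y_pred):
    """F1 score for one binary (0/1) variable."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumption: with no positives in either labels or predictions, score 0.0
    return 2 * tp / denom if denom else 0.0


def variable_averaged_f1(labels, preds):
    """Unweighted mean of per-variable F1 scores.

    labels and preds map each variable name to a list of 0/1 values,
    with rows in the same order in both.
    """
    scores = [binary_f1(labels[name], preds[name]) for name in labels]
    return sum(scores) / len(scores)
```

Because each variable contributes equally to the mean, a variable with few positive cases carries as much weight as a common one in this simplified version.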

Testing your submission

As you develop your own submission, you'll need to know a little bit more about how your submission will be unpacked for running inference. This section contains more complete documentation for developing and testing your own submission.

Code submission format

Your final submission should be a zip archive named with the extension .zip (for example, submission.zip).

A template for main.py is included at example_submission/main.py. For more detail, see the "what to submit" section of the code submission page.
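
Before uploading, it can help to confirm that the archive looks the way the runtime expects. The sketch below is a hypothetical helper (`check_submission_zip` is not part of this repository) that checks only two things stated in this README: the archive has a .zip extension and main.py sits at the root of the archive.

```python
import zipfile
from pathlib import Path


def check_submission_zip(zip_path="submission/submission.zip"):
    """Return a list of problems found; an empty list means both checks passed."""
    zip_path = Path(zip_path)
    problems = []
    if zip_path.suffix != ".zip":
        problems.append("archive name should end in .zip")
    if not zip_path.exists():
        problems.append(f"{zip_path} does not exist")
        return problems
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # main.py must be at the top level, not inside a subfolder of the archive
    if "main.py" not in names:
        problems.append("main.py not found at the root of the archive")
    return problems
```

A common mistake this catches is zipping the containing folder itself, which places main.py one level down inside the archive.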

Running your submission locally

This section provides instructions on how to run your submission in the code execution container from your local machine. To simplify the steps, key processes have been defined in the Makefile. Commands from the Makefile are then run with make {command_name}. The basic steps are:

make pull
make pack-submission
make test-submission

Run make help for more information about the available commands as well as information on the official and built images that are available locally.
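
If you prefer to drive the three basic steps from a script, a minimal sketch could look like the following. The helper names and the `dry_run` option are assumptions made for this example; SUBMISSION_IMAGE is the environment variable this README describes for selecting an image, and any value you pass for it is your own.

```python
import os
import subprocess

# The three basic steps, in the order this README lists them
STEPS = ("pull", "pack-submission", "test-submission")


def build_commands(steps=STEPS):
    """Return the make invocations for the given steps."""
    return [["make", step] for step in steps]


def run_steps(steps=STEPS, image=None, dry_run=False):
    """Run each make target in order, stopping at the first failure.

    image: optional value for the SUBMISSION_IMAGE environment variable.
    dry_run: print the commands instead of executing them.
    """
    env = dict(os.environ)
    if image is not None:
        env["SUBMISSION_IMAGE"] = image
    commands = build_commands(steps)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if make exits with an error
            subprocess.run(cmd, check=True, env=env)
    return commands
```

Calling `run_steps(dry_run=True)` prints the commands without touching Docker, which is a quick way to confirm the order before a real run.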

Here's the process in a bit more detail:

  1. First, make sure you have set up the prerequisites.

  2. Run make pull to download the official competition Docker image.

Note

If you have built a local version of the runtime image with make build, that image will take precedence over the pulled image when using any make commands that run a container. You can explicitly use the pulled image by setting the SUBMISSION_IMAGE shell/environment variable to the pulled image or by deleting all locally built images.

  3. Save all of your submission files, including the required main.py script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.

  4. Run make pack-submission to create a submission/submission.zip file containing your code and model assets. This submission.zip file is what you will ultimately submit on the competition website.

    make pack-submission
    #> mkdir -p submission/
    #> cd submission_src; zip -r ../submission/submission.zip ./*
    #>   adding: main.py (deflated 73%)

  5. Run make test-submission to simulate what happens during code execution on your local machine. This command launches an instance of the competition Docker image and runs the container entrypoint script. First, it unzips submission/submission.zip into /code_execution/ in the container. Then, it runs your submitted main.py. In the local testing setting, the final submission is saved out to the submission/ folder on your local machine. This is the same inference process that will take place in the official runtime.

    make test-submission

Note

Remember that /code_execution/data is just a mounted version of what you have saved locally in data, so local testing only ever uses the files you placed there (for example, training data). In the official code execution platform, /code_execution/data will contain the actual test data.

🎉 Congratulations! You've just completed your first test run for the Youth Mental Health: Automated Abstraction challenge. If everything worked as expected, you should see that a new file submission/submission.csv has been generated.

When you run make test-submission, the logs will be printed to the terminal and written out to submission/log.txt. If you run into errors, use the container logs written to log.txt to determine what changes you need to make for your code to execute successfully.
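
After a local run, a quick sanity check of the outputs can save a wasted submission. The sketch below is a hypothetical helper that confirms submission/submission.csv exists and has as many data rows as data/test_features.csv. The one-prediction-row-per-feature-row rule is an assumption here; check the submission format on the competition site for the actual requirements.

```python
import csv
from pathlib import Path


def count_data_rows(path):
    """Count CSV rows after the header, handling quoted multi-line fields."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def check_outputs(submission="submission/submission.csv",
                  features="data/test_features.csv"):
    """Return "ok" or a short description of the first problem found."""
    submission = Path(submission)
    if not submission.exists():
        return f"{submission} was not generated; check submission/log.txt"
    n_pred = count_data_rows(submission)
    n_feat = count_data_rows(features)
    # Assumption: one prediction row is expected per test feature row
    if n_pred != n_feat:
        return f"row count mismatch: {n_pred} predictions for {n_feat} feature rows"
    return "ok"
```

Rows are counted with the csv module instead of by counting lines because free-text fields may contain line breaks inside quotes.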

Running the example submission locally

Before you test your own submission, you can test the process above with the provided example submission first. This will follow the same process as running your submission, but will use the code in example_submission instead of the code in submission_src.

To run the example submission using make commands, make sure that Docker is running and then run the following in the terminal:

  1. make pull pulls the latest official Docker image from the container registry (Azure). You'll need an internet connection for this.
  2. make pack-example packages all files saved in the example_submission directory to submission/submission.zip
  3. make test-submission simulates a code execution submission with submission/submission.zip. This will run example_submission/main.py from within a Docker container to generate submission.csv.

Smoke tests

In order to prevent leakage of the test features, all logging is prohibited when running inference on the test features as part of an official submission. When submitting on the platform, you will have the ability to submit "smoke tests". Smoke tests run with logging enabled on a reduced version of the training set notes in order to run more quickly. They will not be considered for prize evaluation and are intended to let you test your code for correctness. In this competition, smoke tests will be the only place you can view logs or output from your code to debug. You should test your code locally as thoroughly as possible before submitting your code for smoke tests or for full evaluation.

During a smoke test, you will still have access to data/submission_format.csv and data/test_features.csv. These files will be samples from the training set instead of test data. The data used in smoke tests is available on the data download page. To replicate the smoke test environment locally:

  1. Save smoke_test_features.csv from the data download page to data/test_features.csv.
  2. Save smoke_test_labels.csv from the data download page to data/smoke_test_labels.csv. If your code references a submission format file, copy the labels to data/submission_format.csv as well.
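
The two steps above can also be scripted. This is a minimal sketch, assuming you have already downloaded both smoke test files into a single folder; the function name `stage_smoke_test_files` and its parameters are made up for illustration.

```python
import shutil
from pathlib import Path


def stage_smoke_test_files(download_dir, data_dir="data", copy_submission_format=True):
    """Copy the downloaded smoke test files to the paths used for local runs.

    download_dir: folder containing smoke_test_features.csv and
    smoke_test_labels.csv (wherever you saved them).
    """
    download_dir, data_dir = Path(download_dir), Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    copies = [
        (download_dir / "smoke_test_features.csv", data_dir / "test_features.csv"),
        (download_dir / "smoke_test_labels.csv", data_dir / "smoke_test_labels.csv"),
    ]
    if copy_submission_format:
        # Only needed if your code reads a submission format file
        copies.append(
            (download_dir / "smoke_test_labels.csv", data_dir / "submission_format.csv")
        )
    for src, dst in copies:
        shutil.copyfile(src, dst)
    return [dst for _, dst in copies]
```

Note that this overwrites any existing data/test_features.csv, so keep a copy if you were testing with a different sample.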

After you generate predictions on the smoke test data using make test-submission, you can score them by running:

python src/scoring.py submission/submission.csv data/smoke_test_labels.csv

If you've followed the above instructions, this score should match the one you receive from the smoke test environment on the platform.

Updating runtime packages

If you want to use a package that is not in the environment, you are welcome to make a pull request to this repository. Remember, your submission will only have access to packages in this runtime repository. If you're new to the GitHub contribution workflow, check out this guide by GitHub.

The runtime manages dependencies using Pixi. Here is a good tutorial to get started with Pixi. The official runtime uses Python 3.10.13.

  1. Fork this repository.

  2. Install pixi. See here for installation options.

  3. Edit the runtime/pixi.toml file to add your new packages in the dependencies section. You'll need to determine which environment(s) your new package is required for, and whether the package will be installed with conda (preferred) or pip. We recommend starting without a specific pinned version, and then pinning to the version in the resolved pixi.lock file that is generated.

    • CPU, GPU, or base: The pixi.toml file includes different sections for dependencies that apply to both the CPU and GPU environments (feature.base), the CPU environment only (feature.cpu), and the GPU environment only (feature.gpu).

    • Conda or pip: Packages installed using conda are specified by the header dependencies. These install from the conda-forge channel using conda install. Packages installed with pip are specified by the header pypi-dependencies. These install from PyPI using pip. Installing packages with conda is strongly preferred. Packages should only be installed using pip if they are not available in a conda channel. Conda dependencies are much faster to resolve than PyPI dependencies.

    • For example, to add version 0.0.1 of package1 to both the CPU and GPU environments using conda, you would add the line package1 = "0.0.1" under [feature.base.dependencies]. To add version 0.2 of package2 to the CPU environment only using pip, you would add the line package2 = { version = "0.2.*" } under the header [feature.cpu.pypi-dependencies].

      [feature.base.dependencies]
      package1 = "0.0.1"
      
      [feature.cpu.pypi-dependencies]
      package2 = { version = "0.2.*" }
      
  4. With Docker open and running, run make update-lockfile. This will generate an updated runtime/pixi.lock from runtime/pixi.toml within a Docker container.

  5. Locally test that the Docker image builds successfully for both the CPU and GPU environment:

    CPU_OR_GPU=cpu make build
    CPU_OR_GPU=gpu make build
  6. Commit the changes to your forked repository. Ensure that your branch includes updated versions of both runtime/pixi.toml and runtime/pixi.lock.

  7. Open a pull request from your branch to the main branch of this repository. Navigate to the Pull requests tab in this repository, and click the "New pull request" button. For more detailed instructions, check out GitHub's help page.

  8. Once you open the pull request, we will use GitHub Actions to build the Docker images with your changes and run the tests in runtime/tests. For security reasons, administrators may need to approve the workflow run before it happens. Once it starts, the process can take up to 30 minutes, and may take longer if your build is queued behind others. You will see a section on the pull request page that shows the status of the tests and links to the logs ("Details").

  9. You may be asked to submit revisions to your pull request if the tests fail or if a DrivenData staff member has feedback. Pull requests won't be merged until all tests pass and the team has reviewed and approved the changes.
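
Step 6 asks that your branch update both runtime/pixi.toml and runtime/pixi.lock. A small pre-flight check along the following lines can catch a forgotten lockfile before you open the pull request. The helper names are hypothetical, and comparing against a base branch named main with `git diff --name-only` is just one way to list changed files.

```python
import subprocess

# Both files named in step 6 of the list above
REQUIRED = ("runtime/pixi.toml", "runtime/pixi.lock")


def missing_runtime_files(changed_paths):
    """Return the required runtime files that are absent from changed_paths."""
    changed = {path.strip() for path in changed_paths}
    return [path for path in REQUIRED if path not in changed]


def changed_paths_from_git(base="main"):
    """List files changed on the current branch relative to the base branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

Running `missing_runtime_files(changed_paths_from_git())` from the repository root returns an empty list when both files have been changed on your branch.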

Makefile commands

A Makefile with several helpful shell recipes is included in the repository. The runtime documentation above uses it extensively. Running make by itself in your shell will list relevant Docker images and provide you the following list of available commands:

Available commands:

build               Builds the container locally 
clean               Delete temporary Python cache and bytecode files 
interact-container  Open an interactive bash shell within the running container
pack-example        Creates a submission/submission.zip file from the source code in 
                    example_submission 
pack-submission     Creates a submission/submission.zip file from the source code in 
                    submission_src 
pull                Pulls the official container from Azure Container Registry 
test-container      Ensures that your locally built image can import all the Python packages 
                    successfully when it runs 
test-submission     Runs container using code from `submission/submission.zip` and data from 
                    `/code_execution/data/` 
update-lockfile     Updates runtime environment lockfile using Docker