Expose infra as a package, publish dev builds (#696)
    Publish levanter dev build automatically
    move stuff from infra/**/*.py into src/levanter/infra/{cli_helpers,tpu,docker}.py
    make the extra context stuff be optional (I actually have another way of dealing with it in Marin now)
dlwh authored Aug 26, 2024
1 parent 7ec7bb5 commit 20faff3
Showing 14 changed files with 770 additions and 595 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docker-base-image.yaml
@@ -1,4 +1,4 @@
-name: Build and Push Docker TPU Base Image
+name: Build and Push Docker TPU Images

 on:
   push:
67 changes: 67 additions & 0 deletions .github/workflows/publish_dev.yaml
@@ -0,0 +1,67 @@
+name: Publish Dev Build
+
+on:
+  workflow_run:
+    workflows: ["Run Tests"]
+    types:
+      - completed
+    branches: [main]
+  workflow_dispatch:
+
+jobs:
+  build-package:
+    runs-on: ubuntu-latest
+    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success'}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: '3.x'
+
+      - name: Calculate Version and Build Number
+        run: |
+          PROJECT_VERSION=$(sed -n 's/__version__ = "\(.*\)"/\1/p' src/levanter/__init__.py)
+          BUILD_NUMBER=$(git rev-list --count HEAD)
+          FULL_VERSION="${PROJECT_VERSION}.dev${BUILD_NUMBER}"
+          echo "FULL_VERSION=${FULL_VERSION}" >> $GITHUB_ENV
+          echo "Calculated version with build number: $FULL_VERSION"
+      - name: Update pyproject.toml version
+        run: |
+          # replace the version in pyproject.toml
+          sed -i "s/version = \".*\"/version = \"$FULL_VERSION\"/g" pyproject.toml
+      - name: Build package
+        run: |
+          python -m pip install --upgrade pip
+          pip install build
+          python -m build
+      - name: Upload package
+        uses: actions/upload-artifact@v4
+        with:
+          name: package
+          path: dist/
+
+
+  # cf https://test.pypi.org/manage/project/levanter/settings/publishing/
+  publish-dev:
+    runs-on: ubuntu-latest
+    needs:
+      - build-package
+    permissions:
+      id-token: write
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: package
+          path: dist/
+
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
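
Note on the version steps in publish_dev.yaml: the "Calculate Version and Build Number" and "Update pyproject.toml version" steps can be sketched in Python roughly as follows. This is an illustration only and not part of the commit. The function names `dev_version` and `set_pyproject_version` and the sample version `1.2.3` are made up for the example, and the regular expressions only approximate the two `sed` expressions in the workflow.

```python
import re


def dev_version(init_source: str, commit_count: int) -> str:
    """Build '<base>.dev<commit_count>' from the __version__ found in init_source.

    Hypothetical helper: approximates the sed extraction plus the FULL_VERSION
    assembly, with commit_count standing in for `git rev-list --count HEAD`.
    """
    match = re.search(r'__version__ = "(.*)"', init_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return f"{match.group(1)}.dev{commit_count}"


def set_pyproject_version(pyproject_source: str, full_version: str) -> str:
    """Rewrite every `version = "..."` occurrence, like the sed `g` substitution.

    Hypothetical helper: the pattern is unanchored, so any line containing
    `version = "..."` is rewritten, not only the project version line.
    """
    return re.sub(
        r'version = ".*"',
        lambda _match: f'version = "{full_version}"',
        pyproject_source,
    )


print(dev_version('__version__ = "1.2.3"\n', 456))  # prints 1.2.3.dev456
```

The `.devN` suffix is the developmental-release form defined by PEP 440, so a build versioned this way sorts before the corresponding final release. Because the substitution pattern is not anchored to the start of a line, it would also rewrite any other `version = "..."` entry that appears in the file being edited.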


10 changes: 4 additions & 6 deletions docker/tpu/Dockerfile.base
@@ -5,14 +5,12 @@ RUN pip install virtualenv
 # venv binaries encode their directory, so we need to setup the venv in the final location
 RUN virtualenv -p python3.10 /opt/levanter/.venv
 ENV PATH /opt/levanter/.venv/bin:$PATH
-RUN /opt/levanter/.venv/bin/pip install -U "jax[tpu]==0.4.30" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
+RUN /opt/levanter/.venv/bin/pip install -U uv "jax[tpu]==0.4.30" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

 # Install package dependencies to make incremental builds faster.
-WORKDIR /tmp/
-ADD pyproject.toml README.md /tmp/
-# work around setuptools bug
-RUN mkdir -p /tmp/src
-RUN pip install .[test]
+WORKDIR /opt/levanter
+ADD pyproject.toml README.md /opt/levanter/
+RUN uv sync --no-install-project

 FROM python:3.10

4 changes: 0 additions & 4 deletions docker/tpu/Dockerfile.incremental
@@ -15,14 +15,10 @@ ENV TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60\

 WORKDIR /opt/levanter

-# We have to mkdir src/ to avoid setuptools error
-RUN mkdir -p /opt/levanter/src
-ADD pyproject.toml README.md /opt/levanter/
-RUN pip install -e '.[test]'
 ADD . /opt/levanter

 # Add $EXTRA_CTX to the same location as in local machine.
 # so that the same (config) path(s) specified in train_lm.py argument still works
 #COPY .mnt $EXTRA_CTX
 # it's already in the image, so we don't need to copy it. just move it if we set EXTRA_CTX
 RUN if [ -f ".mnt" ]; then mkdir -p $(dirname $EXTRA_CTX) && mv .mnt $EXTRA_CTX; fi
9 changes: 6 additions & 3 deletions docs/Getting-Started-TPU-VM.md
@@ -138,7 +138,7 @@ To run in the foreground, use `--foreground` with the `launch.py` script. You sh
 python infra/launch.py -- python src/levanter/main/train_lm.py --config_path config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
 ```
-### Using external directory/file
+### Using an external directory or file
 In case that you want to reference some external directory/file outside of the levanter repo, you can do it by adding the external directory/file to the docker image so that it becomes accessible in TPU instances. You can specify the path you want to add as extra buildl context by `--extra_context` with the `launch.py` script. Then, you should be able to use the external files in arguments in `train_lm.py` etc.
 ```bash
@@ -147,8 +147,10 @@ python infra/launch.py --extra_context <external path> -- python src/levanter/ma

 ### Babysitting Script

-If you are using a preemptible TPU VM, you probably want to use the "babysitting" script that automatically re-creates
-the VM. This is because preemptible instances can be preempted and will always be killed every 24 hours. You can run `launch.py` with the `--retries` and `--foreground` parameter to accomplish this. If `--retries` is greater than 1, `launch.py` will automatically attempt to re-create the VM and re-run the command if it fails. (`--foreground` is necessary to keep the script from returning immediately.)
+If you are using a preemptible TPU VM, you probably want to use the "babysitting" version of the script to keep an eye on
+the VM. This is because preemptible instances can be preempted and will always be killed every 24 hours.
+You can run `launch.py` with the `--retries` and `--foreground` parameter to accomplish this.
+If `--retries` is greater than 1, `launch.py` will automatically attempt to re-create the VM and re-run the command if it fails. (`--foreground` is necessary to keep the script from returning immediately.)

 ```bash
 python infra/launch.py --retries=100 --foreground --tpu_name=my_tpu -- python src/levanter/main/train_lm.py --config_path config/my_config.yaml \
@@ -185,6 +187,7 @@ Tokenizers and configuration files are loaded via `fsspec` which supports remote
 filesystems , so you can also copy your tokenizer or config file to GCS and use
 a `gs://` path to access it.
 ## Common Issues
 ### (CRFM) Permission denied on `/files`
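
Note on the Babysitting Script section in docs/Getting-Started-TPU-VM.md: the retry behaviour it describes for `--retries` can be pictured with the small Python sketch below. It is not taken from `launch.py`. The names `babysit`, `start_vm` and `run_command` are hypothetical, and the sketch assumes that the retry count is the total number of attempts, which the documentation does not spell out.

```python
from typing import Callable


def babysit(start_vm: Callable[[], None], run_command: Callable[[], int], retries: int) -> bool:
    """Re-create the VM and re-run the command until it succeeds or attempts run out.

    Assumption: `retries` is the total number of attempts, not the number of
    extra attempts after the first failure.
    """
    for attempt in range(1, retries + 1):
        start_vm()  # hypothetical callback: (re-)create the preemptible TPU VM
        exit_code = run_command()  # hypothetical callback: run the job and block until it exits
        if exit_code == 0:
            return True
        print(f"attempt {attempt}/{retries} failed with exit code {exit_code}")
    return False
```

The loop only works if `run_command` blocks until the job exits, which matches the documentation's remark that `--foreground` is needed to keep the script from returning immediately.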
112 changes: 0 additions & 112 deletions infra/helpers/cli.py

This file was deleted.
