Create pkgci.yml and pkgci_build_packages.yml. #589

Draft
wants to merge 3 commits into base: main

ScottTodd
Member

Progress on #584.

Summary

  • Added .github/workflows/pkgci_build_packages.yml, which builds sharktank, shortfin, and shark-ai dev packages and uploads them as GitHub artifacts.
    • With ccache configured and tracing disabled, this job takes around 2 minutes to run.
  • Added .github/workflows/pkgci.yml that just runs pkgci_build_packages.yml for now. Other jobs can be migrated to depend on that job and use/test the packages.

Other details

  • Experimented a bit with ccache. I'm seeing a ~30% cache hit rate but that still gets the shortfin build from 2m30s (test logs here) down to 1m30s on standard GitHub-hosted ubuntu-24.04 runners (test logs here).
  • Plumbed the SHORTFIN_ENABLE_TRACING setting through scripts / Docker. For dev packages we can keep tracing disabled (unless there is a clear reason to add it). If the cache hit rate improves then we might be able to enable tracing at low cost.
  • Dropped cache: "pip" from build_packages.yml since it is counterproductive for a job that only installs packaging. Multiple workflows seem to be writing to the same cache and I see no way to customize the cache key. That, or the cache is unnecessarily large and we just need to prune it manually.
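
For illustration, forwarding the tracing setting into the build container could look something like the sketch below. Only `SHORTFIN_ENABLE_TRACING` comes from this PR; the helper name `tracing_docker_args` and the `OFF` default are assumptions, not the actual script contents.

```shell
# Sketch only: build the `docker run` arguments that forward the tracing
# setting into the container. The helper name and the OFF default are
# assumptions; `-e NAME=value` is the standard docker flag for setting an
# environment variable inside the container.
tracing_docker_args() {
  local tracing="${SHORTFIN_ENABLE_TRACING:-OFF}"
  printf '%s\n' "-e SHORTFIN_ENABLE_TRACING=${tracing}"
}

# Example: docker run --rm $(tracing_docker_args) <image> <build command>
tracing_docker_args
```

Resolving the default on the host keeps the value visible in the job log before the container starts.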

@ScottTodd ScottTodd requested review from marbre and renxida November 22, 2024 00:56
Collaborator

@marbre marbre left a comment

Nice! Just two minor things.

.github/workflows/pkgci_build_packages.yml (outdated, resolved)
Comment on lines +144 to +150
elif [[ "${ARCH}" == "aarch64" ]]; then
# Latest version of ccache is not released for arm64, built it
git clone --depth 1 --branch "v${CCACHE_VERSION}" https://github.com/ccache/ccache.git
mkdir -p ccache/build && cd "$_"
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release ..
ninja
cp ccache /usr/bin/
Collaborator

Do we need aarch64 support? We probably can just install ccache for x86_64.

Member Author

We might at some point. I've written this code a few times before, most recently at https://github.com/iree-org/base-docker-images/blob/main/build_tools/install_ccache.sh, and supporting both architectures isn't much extra code. Better to write cross-platform/architecture code whenever possible instead of artificially limiting ourselves.

I plan on updating https://github.com/nod-ai/base-docker-images/blob/main/dockerfiles/manylinux_x86_64.Dockerfile to more closely match https://github.com/iree-org/base-docker-images/blob/main/dockerfiles/manylinux_x86_64.Dockerfile, as part of upgrading from manylinux2014 to manylinux_2_28:

# TODO(#130): Update to manylinux_2_28, upstream or a fork
# * upstream uses a version of gcc that has build warnings/errors
# * https://github.com/nod-ai/base-docker-images is a bit out of date but can include a recent clang
# MANYLINUX_DOCKER_IMAGE="${MANYLINUX_DOCKER_IMAGE:-quay.io/pypa/manylinux_2_28_${ARCH}:latest}"
MANYLINUX_DOCKER_IMAGE="${MANYLINUX_DOCKER_IMAGE:-quay.io/pypa/manylinux2014_${ARCH}:latest}"
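
The two-architecture install discussed here can be sketched as a small dispatch. The `aarch64` source build comes from the quoted diff; the `x86_64` prebuilt path and the function name `ccache_install_method` are assumptions for illustration.

```shell
# Sketch only: decide how to obtain ccache for a given architecture.
# "source" mirrors the aarch64 branch in the quoted diff (clone + cmake +
# ninja); "prebuilt" for x86_64 is an assumption, since that branch is not
# shown in the review excerpt.
ccache_install_method() {
  local arch="${1:-$(uname -m)}"
  case "${arch}" in
    x86_64)  echo "prebuilt" ;;
    aarch64) echo "source" ;;
    *)
      echo "unsupported architecture: ${arch}" >&2
      return 1
      ;;
  esac
}

ccache_install_method x86_64
ccache_install_method aarch64
```

Failing loudly on an unknown architecture keeps a new runner type from silently building without a cache.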

@ScottTodd
Member Author

Next steps with this:

  • Refactor at least one workflow to use the package artifacts
  • Put some more thought into how workflows can get different versions of the IREE / iree-turbine / etc. dependencies. Should they keep doing this?
    # Install latest iree-turbine.
    pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
    -e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
    # Try with the latest IREE nightly releases, not what iree-turbine pins.
    # We could also pin to a known working or stable version.
    # This should eventually stabilize. Do the best we can for now.
    pip install -f https://iree.dev/pip-release-links.html --upgrade --pre \
    iree-base-compiler \
    iree-base-runtime
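
One way to frame the two options above (latest nightlies vs. a pinned version) is a small helper that produces the `pip install` arguments for each. This is only a sketch: the function name `iree_pip_args`, the mode names, and the idea that both packages share one version string are assumptions.

```shell
# Sketch only: emit `pip install` arguments for the IREE packages.
# "nightly" reuses the flags from the snippet above; "pinned" takes a
# caller-supplied version. Function and mode names are hypothetical.
iree_pip_args() {
  local mode="$1" version="${2:-}"
  case "${mode}" in
    nightly)
      printf '%s\n' "-f https://iree.dev/pip-release-links.html --upgrade --pre iree-base-compiler iree-base-runtime"
      ;;
    pinned)
      if [ -z "${version}" ]; then
        echo "pinned mode requires a version" >&2
        return 1
      fi
      printf '%s\n' "iree-base-compiler==${version} iree-base-runtime==${version}"
      ;;
    *)
      echo "unknown mode: ${mode}" >&2
      return 1
      ;;
  esac
}

# Example: pip install $(iree_pip_args nightly)
iree_pip_args nightly
```

A workflow input could then select the mode, so the same job can test against either nightlies or a known-good version.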

@ScottTodd
Member Author

Picking this back up now.

Contributor

@renxida renxida left a comment

Extremely excited for every bit of CI time reduction.

@@ -75,6 +89,23 @@ function run_in_docker() {
echo "Using python versions: ${PYTHON_VERSIONS}"
local orig_path="${PATH}"

# Configure caching.
Contributor

YES!

ScottTodd added a commit that referenced this pull request Dec 5, 2024
#646)

Splitting this off from #589 to
make progress on #584.

Tested with
```
CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
Cacheable calls:   626 / 636 (98.43%)
  Hits:              2 / 626 ( 0.32%)
    Direct:          2 /   2 (100.0%)
    Preprocessed:    0 /   2 ( 0.00%)
  Misses:          624 / 626 (99.68%)
Uncacheable calls:  10 / 636 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 3.10%)
  Hits:              2 / 626 ( 0.32%)
  Misses:          624 / 626 (99.68%)

+ ccache --show-stats
ccache stats:
Cacheable calls:   1252 / 1272 (98.43%)
  Hits:             550 / 1252 (43.93%)
    Direct:         550 /  550 (100.0%)
    Preprocessed:     0 /  550 ( 0.00%)
  Misses:           702 / 1252 (56.07%)
Uncacheable calls:   20 / 1272 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 4.11%)
  Hits:             550 / 1252 (43.93%)
  Misses:           702 / 1252 (56.07%)

+ ccache --show-stats
Cacheable calls:   1878 / 1908 (98.43%)
  Hits:            1098 / 1878 (58.47%)
    Direct:        1098 / 1098 (100.0%)
    Preprocessed:     0 / 1098 ( 0.00%)
  Misses:           780 / 1878 (41.53%)
Uncacheable calls:   30 / 1908 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.12%)
  Hits:            1098 / 1878 (58.47%)
  Misses:           780 / 1878 (41.53%)

CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
ccache stats:
Cacheable calls:   3756 / 3816 (98.43%)
  Hits:            2820 / 3756 (75.08%)
    Direct:        2820 / 2820 (100.0%)
    Preprocessed:     0 / 2820 ( 0.00%)
  Misses:           936 / 3756 (24.92%)
Uncacheable calls:   60 / 3816 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.19%)
  Hits:            2820 / 3756 (75.08%)
  Misses:           936 / 3756 (24.92%)
```

So we have multiple configurations getting built (Python versions,
tracing enabled/disabled), but we still get a reasonable number of cache
hits. Definitely room to improve there, but better than nothing.
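
The percentages printed by `ccache --show-stats` above look cumulative (the call totals grow in multiples of 626), so the hit rate of each individual build is higher than the running figure suggests. A rough sketch of the arithmetic, assuming the counters are indeed cumulative; the helper name `incremental_hit_rate` is made up for this example:

```shell
# Sketch only: hit rate between two cumulative ccache snapshots, printed as
# a percentage with one decimal place (integer arithmetic, truncated).
# Arguments: previous hits, previous cacheable calls, current hits,
# current cacheable calls.
incremental_hit_rate() {
  local prev_hits="$1" prev_calls="$2" hits="$3" calls="$4"
  local delta_hits=$(( hits - prev_hits ))
  local delta_calls=$(( calls - prev_calls ))
  if [ "${delta_calls}" -le 0 ]; then
    echo "no new cacheable calls between snapshots" >&2
    return 1
  fi
  local tenths=$(( delta_hits * 1000 / delta_calls ))
  printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

incremental_hit_rate 2 626 550 1252       # second snapshot:   87.5
incremental_hit_rate 550 1252 1098 1878   # third snapshot:    87.5
incremental_hit_rate 1098 1878 2820 3756  # second script run: 91.6
```

Read this way, once the cache is warm each later build hits on roughly 87% of its cacheable calls, and the second script invocation as a whole hits on about 91%.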
ScottTodd added a commit that referenced this pull request Dec 10, 2024
Progress on #584. ~~Depends on
#666 (the first commit).~~

This refactors the `build_packages.yml` workflow so it can be used
via `workflow_call` as part of a "pkgci" setup, as an alternative to
creating a new `pkgci_build_packages.yml` workflow as originally
proposed in #589. This lets us
reuse the same workflow for building stable, nightly, and dev packages,
all across the same matrix of Python versions and operating systems.
Package builds take about 2 minutes (wall time) across the full matrix,
so we might as well build them all, instead of artificially constraining
ourselves to a subset like only Linux on Python 3.11.

Triggers for the workflow are now this:

Trigger | Scenario | Build type(s)
-- | -- | --
`schedule` | Nightly pre-release build | `rc`
`workflow_dispatch` | Workflow testing, manual releasing | `rc` default, `stable` and `dev` possible
`workflow_call` | Pull request or push "pkgci" dev builds | `dev` default, `stable` and `rc` possible

With this workflow behavior:

Build type | Version suffix | Cache enabled? | Tracing enabled? | Pushes to release?
-- | -- | -- | -- | --
`stable` | None | No | Yes | No
`rc` | `rcYYYYMMDD` | No | Yes | Yes
`dev` | `.dev0+${{ github.sha }}` | Yes | No | No

Tested over at
https://github.com/ScottTodd/shark-ai/actions/workflows/build_packages.yml.
Example run:
https://github.com/ScottTodd/shark-ai/actions/runs/12245900071 (warm
cache)
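
The version suffix column in the table above can be sketched as a small shell helper. This is not the workflow's actual implementation: the function name `version_suffix`, the argument order, and the use of a UTC date for `rc` builds are assumptions.

```shell
# Sketch only: map a build type to the version suffix listed in the table.
# Arguments: build type, commit SHA (used for dev builds), and an optional
# YYYYMMDD date stamp (defaults to today's UTC date, which is an assumption).
version_suffix() {
  local build_type="$1" sha="${2:-}" date_stamp="${3:-$(date -u +%Y%m%d)}"
  case "${build_type}" in
    stable) printf '\n' ;;
    rc)     printf 'rc%s\n' "${date_stamp}" ;;
    dev)    printf '.dev0+%s\n' "${sha}" ;;
    *)
      echo "unknown build type: ${build_type}" >&2
      return 1
      ;;
  esac
}

version_suffix rc "" 20241210   # rc20241210
version_suffix dev abc123       # .dev0+abc123
```

In a workflow the SHA argument would come from `${{ github.sha }}`, as the table indicates for `dev` builds.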
monorimet pushed a commit that referenced this pull request Dec 13, 2024
#646)

IanNod pushed a commit to IanNod/SHARK-Platform that referenced this pull request Dec 17, 2024
monorimet pushed a commit that referenced this pull request Jan 8, 2025
renxida added a commit that referenced this pull request Jan 10, 2025
… build packages once (#780)

This builds on #625 and #589 to make progress on issue #584.

This adds a pkgci.yml to run multiple package-based CI tasks after
building packages using Scott's changes in #667. This gives us the
following benefits:

* Integration test workflows are faster because they now use dev
packages, without needing to build them from source or use editable
installs. Also, if more integration tests are added, they can reuse the
built packages.
* Users and developers can access the same dev packages to reproduce CI
results
* Only one runner needs the build requirements (potentially including
clang, ninja, CMake, Rust, etc.), other runners only need Python.

This also switches to using uv to create venvs, which is faster.

This PR brings shortfin CPU LLM CI time from roughly half an hour on the
mi250 runner to a few seconds of package build (fast due to caching) and
around 5 minutes of testing.

---------

Co-authored-by: Scott Todd <[email protected]>