Multinode-HA Vespa Setup for Local Testing #1071

Open
wants to merge 41 commits into
base: mainline

vicilliar
Contributor

@vicilliar vicilliar commented Dec 16, 2024

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Testing improvement

  • What is the current behavior? (You can also link to an open issue here)
    The current Vespa setup uses only a single node.

  • What is the new behavior (if this is a feature change)?
    We implement a multinode setup for local Vespa so that we can simulate cloud shards and replicas.
    The start function in vespa_local.py now accepts --Shards and --Replicas as parameters. If Shards > 1 or Replicas > 0, the multinode Vespa setup is used. The multinode setup has 3 config server nodes, max(2, total_content_nodes / 4) API nodes, and shards * (1 + replicas) content nodes.
    The unit test GitHub workflow now accepts shards and replicas as parameters.
    An orchestrator workflow was created, which runs 4 unit test setups:
    (1) 0 replicas, 1 shard
    (2) 1 replica, 1 shard
    (3) 0 replicas, 2 shards
    (4) 1 replica, 2 shards

Unit tests on multinode Vespa ignore the following directories: tests/core/inference, tests/processing, tests/s2_inference

Multinode Vespa tests use m6i.2xlarge instead of m6i.xlarge because running many Vespa nodes uses more memory. Config and API nodes use ~1 GB each and content nodes use ~500 MB each. A 9-node system (3 config, 2 API, 4 content) needs roughly 7 GB for Vespa alone.
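
For reviewers who want to sanity-check the sizing, here is a minimal sketch of the node-count and memory arithmetic described above. It is not the code in vespa_local.py: the names `Topology`, `is_multinode`, and `plan_multinode_topology` are hypothetical, and the description does not say how total_content_nodes / 4 is rounded, so floor division is an assumption.

```python
from dataclasses import dataclass

CONFIG_NODES = 3            # fixed number of config server nodes, per the PR description
CONFIG_OR_API_NODE_GB = 1.0  # "~1 GB" per config or API node (approximate)
CONTENT_NODE_GB = 0.5        # "~500 MB" per content node (approximate)


@dataclass(frozen=True)
class Topology:
    """Hypothetical container for the node counts of a multinode setup."""
    config_nodes: int
    api_nodes: int
    content_nodes: int

    @property
    def total_nodes(self) -> int:
        return self.config_nodes + self.api_nodes + self.content_nodes

    @property
    def approx_memory_gb(self) -> float:
        # Rough estimate for Vespa alone, using the per-node figures above.
        heavy_nodes = self.config_nodes + self.api_nodes
        return heavy_nodes * CONFIG_OR_API_NODE_GB + self.content_nodes * CONTENT_NODE_GB


def is_multinode(shards: int, replicas: int) -> bool:
    """The multinode setup is used if Shards > 1 or Replicas > 0."""
    return shards > 1 or replicas > 0


def plan_multinode_topology(shards: int, replicas: int) -> Topology:
    """Apply the node-count formulas from the PR description."""
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas must be >= 0")
    content_nodes = shards * (1 + replicas)
    # Assumption: the division is rounded down; the PR text does not specify.
    api_nodes = max(2, content_nodes // 4)
    return Topology(CONFIG_NODES, api_nodes, content_nodes)


if __name__ == "__main__":
    # The four setups run by the orchestrator workflow: (replicas, shards).
    for replicas, shards in [(0, 1), (1, 1), (0, 2), (1, 2)]:
        if not is_multinode(shards, replicas):
            print(f"shards={shards} replicas={replicas}: single-node setup")
            continue
        topology = plan_multinode_topology(shards, replicas)
        print(f"shards={shards} replicas={replicas}: {topology}, "
              f"~{topology.approx_memory_gb} GB")
```

With 2 shards and 1 replica this gives 3 config, 2 API, and 4 content nodes, which matches the 9-node, roughly 7 GB example above.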

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    In progress

  • Related Python client changes (link commit/PR here)

  • Related documentation changes (link commit/PR here)

  • Other information:

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

papa99do
papa99do previously approved these changes Jan 30, 2025
Collaborator

@papa99do papa99do left a comment

Looks great. Thanks for adding the unit test to make sure the compose file and configs are generated correctly.

  cancel-in-progress: true

permissions:
  contents: read

jobs:
  Determine-Vespa-Setup:
Collaborator

This step should run after Check-Changes, and only if Check-Changes reports non-documentation changes:
if: ${{ needs.Check-Changes.outputs.doc_only == 'false' }} # Run only if there are non-documentation changes

@@ -224,7 +282,7 @@ jobs:
 cd marqo
 export PYTHONPATH="./tests:./src:."
 set -o pipefail
-pytest --ignore=tests/test_documentation.py --ignore=tests/compatibility_tests \
+pytest ${{ env.MULTINODE_TEST_ARGS }} --ignore=tests/test_documentation.py --ignore=tests/compatibility_tests \
Collaborator

@papa99do papa99do Jan 30, 2025

It seems MULTINODE_TEST_ARGS is not passed in correctly (or maybe it is not populated correctly in the first place?).

Also, the next line fails the build if coverage falls below 69 (--cov-fail-under=69), which does not make sense for these tests since they skip a lot of test cases. We should skip the coverage check in multi-shard/replica tests.

Contributor Author

It is passed in correctly for multinode runs. Please check this 2-shard, 1-replica run: https://github.com/marqo-ai/marqo/actions/runs/13106217962/job/36561470973#step:9:15

MULTINODE_TEST_ARGS will be an empty string for 1 shard and 0 replicas. Maybe that's the run you saw.
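
To make the expected values concrete, here is a rough sketch of how the extra pytest arguments could be derived from the shard and replica counts. It is only an illustration: the helper names `multinode_test_args` and `coverage_args` are hypothetical, the exact string the workflow puts in MULTINODE_TEST_ARGS is an assumption based on the ignored directories listed in the description, and the conditional coverage flag reflects the reviewer's suggestion rather than anything this PR is confirmed to do.

```python
# Directories the PR description says multinode unit tests ignore.
MULTINODE_IGNORED_DIRS = (
    "tests/core/inference",
    "tests/processing",
    "tests/s2_inference",
)


def multinode_test_args(shards: int, replicas: int) -> str:
    """Return extra pytest arguments; empty for 1 shard and 0 replicas."""
    if shards > 1 or replicas > 0:
        # Assumed format: one --ignore flag per skipped directory.
        return " ".join(f"--ignore={path}" for path in MULTINODE_IGNORED_DIRS)
    return ""


def coverage_args(shards: int, replicas: int, threshold: int = 69) -> str:
    """Enforce the coverage threshold only on the single-node run.

    This follows the review suggestion: multinode runs skip many tests,
    so a fixed --cov-fail-under threshold would not be meaningful there.
    """
    if shards > 1 or replicas > 0:
        return ""
    return f"--cov-fail-under={threshold}"


if __name__ == "__main__":
    for shards, replicas in [(1, 0), (1, 1), (2, 0), (2, 1)]:
        print(f"shards={shards} replicas={replicas}: "
              f"args={multinode_test_args(shards, replicas)!r} "
              f"coverage={coverage_args(shards, replicas)!r}")
```

An empty result for 1 shard and 0 replicas is expected under this sketch, so an empty MULTINODE_TEST_ARGS in that job's log would not by itself indicate a bug.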
