Feeding does not respect resource limits, crashes node #32288

Open · buinauskas opened this issue Aug 28, 2024 · 3 comments

@buinauskas (Contributor) commented Aug 28, 2024

Describe the bug

  • Resource limits were not respected.
  • The single-node test deployment used 100% of its disk space and bricked the whole deployment.

We have a dedicated single-node Vespa deployment for testing features and optimizations; it helps us predict how changes will scale to larger deployments.

This test deployment has more usable memory than usable disk. Could this be an issue for the resource limiter?

To Reproduce
Steps to reproduce the behavior:

  1. Seed ~80M documents into Vespa.
  2. Update these documents using Vespa's partial update feature to attach embeddings (~3 photo embeddings per document); see the payload sketch after the schema below.
  3. After some time, disk usage starts spiking and reaches 100%.
  4. The machine reports that too many inodes are used.
  5. The node goes down.

This is the relevant embedding schema: we attach mapped photo CLIP embeddings, where each label of the mapped dimension is a unique photo ID associated with that document.

field photo_embeddings type tensor<bfloat16>(photo_id{}, embedding[512]) {
    indexing: attribute | index
    attribute {
        fast-rank
        distance-metric: angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 96
        }
    }
}
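For reference, a partial update in step 2 might use Vespa's tensor "add" update to attach one photo's vector at a time. The sketch below is illustrative only: the namespace, document ID, photo ID, and values are made-up placeholders, only the first two of the 512 cells are shown, and an "assign" of the full tensor would be an alternative.

{
    "update": "id:items:items::example-doc",
    "fields": {
        "photo_embeddings": {
            "add": {
                "cells": [
                    { "address": { "photo_id": "123456789", "embedding": "0" }, "value": 0.0134 },
                    { "address": { "photo_id": "123456789", "embedding": "1" }, "value": -0.0521 }
                ]
            }
        }
    }
}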

Expected behavior

  • Vespa's resource limits kick in.
  • Feed requests are rejected with 429 status codes.
  • The Vespa content node remains reachable and responds to search requests.

Screenshots

  • The content_proton_resource_usage_disk_usage_total_max metric was used.
  • 2024-08-26 08:00: seeding starts.
  • 2024-08-26 19:00: seeding is over and embeddings start being attached.
[Screenshot: content_proton_resource_usage_disk_usage_total_max over the feeding period]

Environment (please complete the following information):

  • OS: Docker
  • Infrastructure: self-hosted
  • Memory: 512G
  • Disk: 893G RAID1, 446G usable

Vespa version
8.363.17

Additional context
These are the relevant log entries, in sequence:

Aug 27, 2024 @ 17:16:48.000 what():  Fatal: Writing 2097152 bytes to '/opt/vespa/var/db/vespa/search/cluster.vinted/n2/documents/items/0.ready/attribute/photo_embeddings/snapshot-230031437/photo_embeddings.dat' failed (wrote -1): No space left on device

Aug 27, 2024 @ 17:16:48.000 PC: @     0x7faea85ef52f  (unknown)  raise

Aug 27, 2024 @ 17:16:48.000 terminate called after throwing an instance of 'std::runtime_error'

Aug 27, 2024 @ 17:16:48.000 *** SIGABRT received at time=1724768208 on cpu 64 ***

Aug 27, 2024 @ 17:17:04.000 Write operations are now blocked: 'diskLimitReached: { action: "add more content nodes", reason: "disk used (0.999999) > disk limit (0.9)", stats: { capacity: 475877605376, used: 475877257216, diskUsed: 0.999999, diskLimit: 0.9}}'

Aug 27, 2024 @ 17:17:21.000 Unable to get response from service 'searchnode:2193:RUNNING:vinted/search/cluster.vinted/2': Connect to http://localhost:19107 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused

This is our services.xml file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<services version="1.0">
  <admin version="2.0">
    <slobroks>
      <slobrok hostalias="vespa-scale-readiness-cfg1.infra"/>
    </slobroks>
    <configservers>
      <configserver hostalias="vespa-scale-readiness-cfg1.infra"/>
    </configservers>
    <cluster-controllers>
      <cluster-controller hostalias="vespa-scale-readiness-cfg1.infra"/>
    </cluster-controllers>
    <adminserver hostalias="vespa-scale-readiness-cfg1.infra"/>
    <metrics>
      <consumer id="custom-metrics">
        <metric-set id="default"/>
        <metric id="update_with_create.count"/>
      </consumer>
    </metrics>
  </admin>
  <container id="default" version="1.0">
    <nodes>
      <jvm options="-Xms24g -Xmx24g -XX:+PrintCommandLineFlags -Xlog:disable"/>
      <node hostalias="vespa-scale-readiness-container1.infra"/>
    </nodes>
    <components>
      <include dir="ext/linguistics"/>
      <include dir="ext/clip"/>
    </components>
    <search>
      <include dir="searchers"/>
    </search>
    <document-processing>
      <chain id="default">
        <documentprocessor id="com.search.items.ItemsRankingProcessor" bundle="vespa"/>
        <documentprocessor id="com.search.items.CreatingUpdateTrackingProcessor" bundle="vespa"/>
      </chain>
    </document-processing>
    <model-evaluation/>
    <document-api/>
    <accesslog type="disabled"/>
  </container>
  <content id="vinted" version="1.0">
    <search>
      <coverage>
        <minimum>0.8</minimum>
        <min-wait-after-coverage-factor>0.2</min-wait-after-coverage-factor>
        <max-wait-after-coverage-factor>0.3</max-wait-after-coverage-factor>
      </coverage>
    </search>
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="items" mode="index"/>
      <document type="items_7d" mode="index" selection="items_7d.created_at &gt; now() - 604800"/>
    </documents>
    <engine>
      <proton>
        <searchable-copies>1</searchable-copies>
        <tuning>
          <searchnode>
            <requestthreads>
              <persearch>8</persearch>
              <search>256</search>
              <summary>64</summary>
            </requestthreads>
            <removed-db>
              <prune>
                <age>86400</age>
              </prune>
            </removed-db>
          </searchnode>
        </tuning>
      </proton>
    </engine>
    <group>
      <distribution partitions="1|*"/>
      <group distribution-key="1" name="group1">
        <node distribution-key="2" hostalias="vespa-scale-readiness-data1.infra"/>
      </group>
    </group>
  </content>
</services>
hmusum changed the title from "Feeding does not respect resource limits, crashes no" to "Feeding does not respect resource limits, crashes node" on Aug 28, 2024
geirst added this to the soon milestone on Aug 28, 2024
@vekterli (Member) commented:

The spikes you are observing are almost certainly caused by flushing of in-memory data structures to disk, which requires temporary disk usage that is proportional to the memory used by that data structure (in this case presumably a large tensor attribute).

As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.
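Applied to the environment described above, that guideline would call for roughly 3 × 512G ≈ 1.5T of disk; put the other way, with only 446G of usable disk, memory-resident data should be kept to roughly 446G / 3 ≈ 150G.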

The automatic feed blocking mechanisms are not currently clever enough to anticipate the impact that future flushes will have based on the already fed data. We should ideally look at the ratio of host memory to disk and automatically derive a reasonable default block threshold based on this—it is clear that the default limits are not appropriate for high memory + low disk setups.
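For reference, the feed-block thresholds can also be lowered per content cluster through the resource-limits tuning element. A sketch is below; the 0.75/0.80 values are arbitrary examples and may still leave too little headroom for this memory-to-disk ratio:

<content id="vinted" version="1.0">
  <tuning>
    <resource-limits>
      <disk>0.75</disk>
      <memory>0.80</memory>
    </resource-limits>
  </tuning>
  <!-- existing search, redundancy, documents, engine and group config unchanged -->
</content>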

@buinauskas (Contributor, Author) commented:

I have to admit that our test hardware is quite unusual, but we have to work with what we have. It's good that we discovered this under these circumstances.

> As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.

We'll keep that in mind.

We have now reduced our test dataset size and are glad to know what caused the problem. Should the issue be left open? It seems like a bug in a rare edge case, and not of huge importance given how unlikely it is to occur.

geirst modified the milestone from soon to later on Sep 4, 2024
@vekterli (Member) commented Sep 4, 2024

> Should the issue be left open? It seems like a bug in a rare edge case, and not of huge importance given how unlikely it is to occur.

I'm leaving the issue open for now, as it'd be a good thing to detect and at least warn about.
