Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_neon_cli_basics is flaky due to unexpected warning #10381

Open
alexanderlaw opened this issue Jan 14, 2025 · 3 comments
Open

test_neon_cli_basics is flaky due to unexpected warning #10381

alexanderlaw opened this issue Jan 14, 2025 · 3 comments
Labels
t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link

When trying to run tests with sanitizers enabled on my laptop, I see test_neon_cli_basics failed on each run (on a more performant workstation it fails from time to time) with:

2025-01-14 14:18:58.959 INFO [utils.py:523] not allowed pageserver_1 error: 2025-01-14T12:18:43.768048Z  WARN init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 3): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)

More from pageserver.log:

...
2025-01-14T12:18:41.450655Z  INFO init_tenant_mgr: Calling http://127.0.0.1:15002/upcall/v1/ API to re-attach tenants
2025-01-14T12:18:41.451258Z  INFO init_tenant_mgr: Loaded node metadata: postgres localhost:15004, http localhost:15005, other fields: {"availability_zone_id": String("us-east-2a")}
2025-01-14T12:18:41.451922Z  INFO init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 0): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T12:18:41.452221Z  INFO init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 1): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T12:18:41.452263Z  INFO init_tenant_mgr: Backoff: waiting 1.1 seconds before processing with the task
2025-01-14T12:18:42.554922Z  INFO init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 2): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T12:18:42.555060Z  INFO init_tenant_mgr: Backoff: waiting 1.2100000000000002 seconds before processing with the task
2025-01-14T12:18:43.768048Z  WARN init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 3): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T12:18:43.768179Z  INFO init_tenant_mgr: Backoff: waiting 1.3310000000000004 seconds before processing with the task
2025-01-14T12:18:45.101318Z  WARN init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 4): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T12:18:45.101420Z  INFO init_tenant_mgr: Backoff: waiting 1.4641000000000004 seconds before processing with the task
2025-01-14T12:18:46.578343Z  INFO init_tenant_mgr: calling control plane generation validation API succeeded after 5 retries
...

The same can be seen at CI:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10104/12286480456/index.html#/testresult/387bce70aea2aae3
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9950/12089114170/index.html#/testresult/a79150a64b0cfba
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9878/12011001635/index.html#suites/48b4046d39093f7675bf477e070db277/efbdabcef0dc3dee/history
...
(the latter run was considered failed)

I can see a similar message added to DEFAULT_PAGESERVER_ALLOWED_ERRORS:

    # During teardown, we stop the storage controller before the pageservers, so pageservers
    # can experience connection errors doing background deletion queue work.
    ".*WARN deletion backend: calling control plane generation validation API failed.*error sending request.*",

The message from test_neon_cli_basics differs in only one component: "init_tenant_mgr" vs "deletion backend".

@alexanderlaw alexanderlaw added the t/bug Issue Type: Bug label Jan 14, 2025
@jcsp
Copy link
Collaborator

jcsp commented Jan 14, 2025

I'd suggest checking the timing of storage controller startup vs. pageserver startup -- I think neon_local is meant to do the controller first and wait for it to be ready, maybe that isn't happening.

Or, maybe the test is directly controlling individual services and restarting things in the wrong order.

@alexanderlaw
Copy link
Author

Thank you for looking at this!

I've reduced the test to the following:

def test_neon_cli_basics(neon_env_builder: NeonEnvBuilder, port_distributor: PortDistributor):
    env = neon_env_builder.init_configs()
    # Skipping the init step that creates a local tenant in Pytest tests
    try:
        env.neon_cli.start()
        env.create_tenant(tenant_id=env.initial_tenant, set_default=True)

    finally:
        env.neon_cli.stop()

and still getting those warnings. It really looks like pageserver starting first.

A failed test log contains:

...
2025-01-14 15:11:15.761 INFO [neon_cli.py:68] Running command "/src/neon-t/target/debug/neon_local init --config=/tmp/tmpgfklu1xv"
2025-01-14 15:11:15.785 INFO [neon_cli.py:68] Running command "/src/neon-t/target/debug/neon_local start"
2025-01-14 15:11:21.726 INFO [neon_cli.py:68] Running command "/src/neon-t/target/debug/neon_local tenant create --tenant-id b73c8c1132b7303f7990b1085f431a24 --timeline-id 5d5c7148b2c0dc307fb3f4c60fa122e1 --pg-version 17 --set-default"
2025-01-14 15:11:29.629 INFO [neon_cli.py:68] Running command "/src/neon-t/target/debug/neon_local stop"
...

pageserver.log:

...
2025-01-14T13:11:15.891037Z  INFO init_tenant_mgr: Initialized ephemeral layer size limit to 0, for 33070247936 bytes of memory
2025-01-14T13:11:15.896493Z  INFO init_tenant_mgr: Calling http://127.0.0.1:15002/upcall/v1/ API to re-attach tenants
2025-01-14T13:11:15.896877Z  INFO init_tenant_mgr: Loaded node metadata: postgres localhost:15004, http localhost:15005, other fields: {"availability_zone_id": String("us-east-2a")}
2025-01-14T13:11:15.897452Z  INFO init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 0): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T13:11:15.897628Z  INFO init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 1): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T13:11:15.897654Z  INFO init_tenant_mgr: Backoff: waiting 1.1 seconds before processing with the task
2025-01-14T13:11:16.999259Z  INFO init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 2): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T13:11:16.999351Z  INFO init_tenant_mgr: Backoff: waiting 1.2100000000000002 seconds before processing with the task
2025-01-14T13:11:18.211383Z  WARN init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 3): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T13:11:18.211460Z  INFO init_tenant_mgr: Backoff: waiting 1.3310000000000004 seconds before processing with the task
2025-01-14T13:11:19.544769Z  WARN init_tenant_mgr: calling control plane generation validation API failed, will retry (attempt 4): error sending request for url (http://127.0.0.1:15002/upcall/v1/re-attach)
2025-01-14T13:11:19.544922Z  INFO init_tenant_mgr: Backoff: waiting 1.4641000000000004 seconds before processing with the task
2025-01-14T13:11:21.021452Z  INFO init_tenant_mgr: calling control plane generation validation API succeeded after 5 retries
2025-01-14T13:11:21.021531Z  INFO init_tenant_mgr: Received re-attach response with 0 tenants ...
2025-01-14T13:11:21.021697Z  INFO init_tenant_mgr: Attaching 0 tenants at startup, warming up 8 at a time
...

storage_controller.log:

2025-01-14T13:11:19.490107Z  INFO version: git:b7d110c9ffd65a558b917379f4936a88761ff3b4-modified, launch_timestamp: 2025-01-14 13:11:19.488214043 UTC, build_tag build_tag:, listening on 127.0.0.1:15002
2025-01-14T13:11:19.490196Z  WARN Starting in dev mode: this may be an insecure configuration.
2025-01-14T13:11:19.504055Z  INFO Connected to database.
...

@alexanderlaw
Copy link
Author

In fact, the warning can be easily produced just by neon_local start (with slow-starting storage controller). I suppose this is explained by JoinSet used in handle_start_all/handle_start_all_impl (starting from 035a49a):

while let Some(result) = js.join_next().await {

As JoinSet doc says: The set is not ordered, and the tasks will be returned in the order they complete.

alexanderlaw added a commit that referenced this issue Jan 15, 2025
alexanderlaw added a commit that referenced this issue Jan 15, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 16, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 17, 2025
alexanderlaw added a commit that referenced this issue Jan 18, 2025
alexanderlaw added a commit that referenced this issue Jan 18, 2025
alexanderlaw added a commit that referenced this issue Jan 19, 2025
alexanderlaw added a commit that referenced this issue Jan 20, 2025
alexanderlaw added a commit that referenced this issue Jan 20, 2025
alexanderlaw added a commit that referenced this issue Jan 20, 2025
alexanderlaw added a commit that referenced this issue Jan 20, 2025
alexanderlaw added a commit that referenced this issue Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants