Error on Grouped Tab in Kubeflow Pipelines UI - Failed to Get Executions #11319

Open
afrozsh19 opened this issue Oct 21, 2024 · 1 comment
afrozsh19 commented Oct 21, 2024

Environment

Steps to reproduce

Description:
When opening the Executions tab in Kubeflow Pipelines, the default Main tab loads fine. Switching to the Grouped tab, however, makes the UI load for a long time and then fail with the error shown under Actual Result.

Steps

  1. Navigate to the Executions tab in the Kubeflow Pipelines UI.
  2. Switch from the default tab to the Grouped tab.
  3. The page attempts to load and eventually fails with the error shown under Actual Result.

Expected result

  • The Grouped tab should load executions specific to the selected profile and not fail with the Gateway Time-out error.
  • The system should only fetch executions for the current profile, rather than fetching executions for all profiles (namespaces).
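A minimal sketch of the scoping requested above, assuming each execution record carries its namespace in a `namespace` field. The `Execution` class, its fields, and the namespace names are hypothetical stand-ins for illustration, not the ML Metadata or Kubeflow Pipelines data model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Execution:
    # Hypothetical record; the real ML Metadata Execution message differs.
    id: int
    namespace: str
    name: str


def executions_for_profile(
    executions: Iterable[Execution], namespace: str
) -> List[Execution]:
    """Keep only the executions that belong to the selected profile's namespace."""
    return [e for e in executions if e.namespace == namespace]


# Made-up sample data covering two profiles.
sample = [
    Execution(1, "team-a", "train"),
    Execution(2, "team-b", "train"),
    Execution(3, "team-a", "evaluate"),
]

scoped = executions_for_profile(sample, "team-a")
print([e.id for e in scoped])
```

Filtering after the fetch, as this sketch does, would not shorten the slow call by itself. The scoping would need to be part of the query sent to the metadata service so that fewer rows are read in the first place.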

Actual Result

The page fails with the following error message:
Error: Failed getting executions: Unknown Content-type received. Code: 2

Materials and reference

Debugging Findings:

  1. Network Call Failure:

    • One of the network calls in the browser fails:
      • Resource Path: /ml_metadata.MetadataStoreService/GetExecutions
      • Response: Gateway Time-out
  2. Pod Logs (metadata-grpc-deployment):

    • Logs from the metadata-grpc-deployment pod show the following error:
      W1021 10:05:33.342247 210 metadata_store_service_impl.cc:417] PutExecution failed: mysql_query aborted: errno: Lock wait timeout exceeded; try restarting transaction, error: Lock wait timeout exceeded; try restarting transaction
  3. Executions Fetched Across All Profiles:

    • The system appears to fetch executions from all Kubeflow profiles (i.e., namespaces) regardless of the currently selected profile in the UI. This results in fetching executions across multiple namespaces, which might be contributing to the slowness.
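To check how often this happens and for which RPCs, the pod logs can be scanned for lines like the one quoted in finding 2. The sketch below is only an illustration: the regular expression is derived from that single quoted line, so other log lines may not match it, and the helper names are made up for this example.

```python
import re
from typing import Dict, Optional

# Pattern for the glog-style line quoted above:
# severity + month/day, time, thread id, source file:line, "<RPC> failed: <message>".
LOG_PATTERN = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<thread>\d+) "
    r"(?P<source>[\w.]+:\d+)\] (?P<rpc>\w+) failed: (?P<message>.*)$"
)


def parse_failure(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of a '<RPC> failed' log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def is_lock_wait_timeout(line: str) -> bool:
    """True when the line is a failed RPC caused by a MySQL lock wait timeout."""
    parsed = parse_failure(line)
    return parsed is not None and "Lock wait timeout exceeded" in parsed["message"]


# The log line from the metadata-grpc-deployment pod, copied from this report.
SAMPLE = (
    "W1021 10:05:33.342247 210 metadata_store_service_impl.cc:417] "
    "PutExecution failed: mysql_query aborted: errno: Lock wait timeout exceeded; "
    "try restarting transaction, error: Lock wait timeout exceeded; "
    "try restarting transaction"
)

parsed = parse_failure(SAMPLE)
print(parsed["rpc"], is_lock_wait_timeout(SAMPLE))
```

Note that the quoted line reports a failed PutExecution, whereas the request that times out in the browser is GetExecutions. Counting failures per RPC in this way may help show whether the two are related.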

Additional Context:

  • The system holds approximately 100k pipeline runs. This volume may be contributing to slow queries or query timeouts in MySQL, resulting in the error.
  • The Lock wait timeout error from the MySQL database, logged in the metadata-grpc-deployment pod, could indicate a need for query optimisation or database tuning to handle the load more efficiently.
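One way to avoid a single long-running call over roughly 100k records is to request executions in bounded pages. The sketch below shows the general page-token pattern only. `Page`, `iter_pages`, `make_fake_backend`, and the token format are hypothetical stand-ins for illustration, not the ML Metadata or Kubeflow Pipelines API, and the backend here is an in-memory fake.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    # Hypothetical response shape: one batch of items plus a continuation token.
    items: List[int]
    next_token: Optional[str]


FetchPage = Callable[[int, Optional[str]], Page]


def iter_pages(fetch_page: FetchPage, page_size: int) -> Iterator[List[int]]:
    """Yield one bounded batch at a time until the backend reports no more pages."""
    token: Optional[str] = None
    while True:
        page = fetch_page(page_size, token)
        yield page.items
        if page.next_token is None:
            return
        token = page.next_token


def make_fake_backend(total: int) -> FetchPage:
    """Build an in-memory stand-in for a paginated service holding `total` items."""
    data = list(range(total))

    def fetch_page(page_size: int, token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = min(start + page_size, len(data))
        next_token = str(end) if end < len(data) else None
        return Page(data[start:end], next_token)

    return fetch_page


pages = list(iter_pages(make_fake_backend(25), page_size=10))
print([len(p) for p in pages])
```

With this pattern each request stays small, so a UI could render the first page while later pages are still loading instead of waiting on one call that may hit the gateway timeout.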

Impacted by this bug? Give it a 👍.

@saijalgupta2
Could you please let us know when this issue will be resolved?
Thank you.
