
fix(api): optimise archived list query. Fixes #13295 #13566

Merged
merged 3 commits into argoproj:main from fix/archive-list-query-optimization
Sep 12, 2024

Conversation

bom-d-van
Contributor

Fixes #13295

Motivation

The current implementation appears to cause Postgres to unmarshal the workflow JSON payload for every record in the table. By adopting a subquery approach, we are able to optimise the query from a runtime of 11495.734 ms to 44.713 ms. The data size is about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels.

This change is backward compatible and has been tested on our production environment (using PostgreSQL).

Modifications

Rewrite the generated query to use a subquery, so the JSON payload is only unmarshalled for the selected rows (effectively reducing the workload required from Postgres, although arguably Postgres should be smart enough to defer the operation until after filtering ;)):

previous:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE (("clustername" = 'default'
        AND "instanceid" = '')
       AND "namespace" = 'argo-map'
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/phase'
            AND value = 'Succeeded')
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/workflow-template'
            AND value = 'mapping1-pipeline-template-with-nfs'))
ORDER BY "startedat" DESC
LIMIT 1;
```

now:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND UID IN
    (SELECT UID
     FROM "argo_archived_workflows"
     WHERE (("clustername" = 'default'
             AND "instanceid" = '')
            AND "namespace" = 'argo-map'
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/phase'
                 AND value = 'Succeeded')
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/workflow-template'
                 AND value = 'mapping1-pipeline-template-with-nfs'))
     ORDER BY "startedat" DESC
     LIMIT 1);
```

Verification

The query runtime on our production environment was reduced from 11495.734 ms to 44.713 ms, with about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels.

(Two screenshots, taken 2024-08-29, showing the before/after measurements.)
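For anyone who wants to reproduce the measurement, a minimal sketch (assuming direct psql access to the archive database; the table and filter values match the queries above):

```sql
-- EXPLAIN (ANALYZE, BUFFERS) executes the statement and reports actual
-- runtime, so it can be used to compare the old and new query shapes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND uid IN
    (SELECT uid
     FROM argo_archived_workflows
     WHERE clustername = 'default'
       AND instanceid = ''
       AND namespace = 'argo-map'
     ORDER BY startedat DESC
     LIMIT 1);
```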

@agilgur5 agilgur5 changed the title from "fix: optimise list query in persist/sqldb/workflow_archive.go Fixes #13295" to "fix(api): optimise archived list query. Fixes #13295" Sep 6, 2024
@agilgur5
Member

agilgur5 commented Sep 6, 2024

The current implementation appears to cause Postgres to unmarshal the workflow JSON payload for every record in the table.

interesting analysis, a bit different than in the issue.

although arguably Postgres should be smart enough to defer the operation until after filtering ;)

also agreed, I would think the DB would know to avoid this 😕

It looks like some E2E tests are failing, and one of them is getting a 500 on this API call -- can you check them?

@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch 2 times, most recently from 95c4116 to b766093 September 8, 2024 08:45
@bom-d-van
Contributor Author

bom-d-van commented Sep 8, 2024

hi @agilgur5, thanks for the feedback. I have pushed another fix for MySQL; it requires a little bit more handling.

There is a test failure (artgc-artifact-not-written-failed-zn9dv), but it seems unrelated to my changes. Do you have any suggestions?

@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch from b766093 to 057aca6 September 8, 2024 09:35
@bom-d-van
Contributor Author

hmm, it seems that retry helps. :P

@agilgur5
Member

agilgur5 commented Sep 8, 2024

There is a test failure (artgc-artifact-not-written-failed-zn9dv), but it seems unrelated to my changes. Do you have any suggestions?

hmm, it seems that retry helps. :P

Sounds like a test flake then. Although it looks like it was with TestStoppedWorkflow, and not the test you mentioned above (which was added recently, so would have been easier to investigate):

```
artifacts_test.go:218: timeout after 1m30s waiting for condition
    artifacts_test.go:241:
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/artifacts_test.go:241
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:251
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/artifacts_test.go:240
        	Error:      	Received unexpected error:
        	            	The specified key does not exist.
        	Test:       	TestArtifactsSuite/TestStoppedWorkflow
=== FAIL: ArtifactsSuite/TestStoppedWorkflow
```

This one I believe has flaked before IIRC

Member

@agilgur5 agilgur5 left a comment


This mostly looks good to me but would like some folks to double-check cc @jiachengxu @terrytangyuan @Joibel @jessesuen

Also might be worthwhile to push up an image for users on the issue to test and confirm results

persist/sqldb/workflow_archive.go
@agilgur5 agilgur5 added the prioritized-review For members of the Sustainability Effort label Sep 8, 2024
@agilgur5
Member

agilgur5 commented Sep 8, 2024

The query runtime on our production environment was reduced from 11495.734 ms to 44.713 ms, with about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels.

While this requires a good bit of rows to test, I'm also thinking that we could potentially write a simplified test that checks if the query returns under a certain amount of time on a test seed. For example, under 100ms.
That way it could act as a regression test for anything severely off, but not a full performance test to catch everything. We could potentially apply that to all API calls that make queries actually 🤔
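A minimal sketch of that idea (assuming a Postgres test database seeded with enough archived workflows; `statement_timeout` is Postgres-specific and aborts the statement if it runs past the budget, which a test harness can assert on):

```sql
-- Fail outright if the archived-list query regresses past a 100ms budget.
SET statement_timeout = '100ms';
SELECT uid
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND instanceid = ''
  AND namespace = 'argo-map'
ORDER BY startedat DESC
LIMIT 1;
RESET statement_timeout;
```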

@ryancurrah
Contributor

I can test it out. We have a large non-production deployment we can test it in.

@bom-d-van
Contributor Author

While this requires a good bit of rows to test, I'm also thinking that we could potentially write a simplified test that checks if the query returns under a certain amount of time on a test seed. For example, under 100ms.

@agilgur5 yeah, that sounds quite reasonable and would be helpful in the long run. But unfortunately, I didn't spot a simple way to add a new regression test quickly, and my recent work schedule is quite tight.

Could we add a new issue and link it to this PR for now? I might be able to make the contribution a few weeks later.

The optimized query is introduced in 9856df7

Signed-off-by: Xiaofan Hu <[email protected]>
@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch 3 times, most recently from 14d07ed to 91bf3a6 September 9, 2024 14:07
@agilgur5
Member

Could we add a new issue and link it to this PR for now? I might be able to make the contribution a few weeks later.

Yea that'd be fine so long as we get some more manual tests of this. I'm loading up my dev env right now to push up an image

@ryancurrah
Contributor

Provide me an image to use and I will test it.

@agilgur5
Member

Pushed up agilgur5/argo-cli:fix-archive-list-query-optimization for testing

Member

@terrytangyuan terrytangyuan left a comment


The change looks good.

@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch 2 times, most recently from 66717c8 to 367f21c September 10, 2024 13:01
@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch from 367f21c to f6db948 September 11, 2024 01:32
@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch from f6db948 to 9f39a95 September 11, 2024 03:00
@ryancurrah
Contributor

ryancurrah commented Sep 12, 2024

I can confirm this change reduces query times. Loading time in the UI when filtering on labels was reduced by 94.44%.

Query time before: 18 seconds
Query time after: 1 second
DB type: PostgreSQL
Version: 14
Number of rows (SELECT COUNT(*) FROM public.argo_archived_workflows;): 93135
CPUs: 24
Memory: 16 GB

Member

@agilgur5 agilgur5 left a comment


Thanks for the testing, @ryancurrah!

With that, some of the individual query tests folks made across both MySQL and Postgres, and 2 other approvals, I think this is good to go 🚀

Note that this query seems to be further optimizable per #13295 (comment), so we may want to open a new issue to add further optimizations, but this at least fixes the performance issues and gets us to a much more reasonable baseline
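Purely as a hypothetical illustration of the kind of follow-up that could be explored (not necessarily what the linked comment proposes), the outer `UID IN (...)` could be rewritten as a join against the LIMITed subquery:

```sql
-- Hypothetical sketch: join on the one-row subquery instead of using IN.
SELECT aw.name, aw.namespace, aw.uid, aw.phase, aw.startedat, aw.finishedat
FROM argo_archived_workflows aw
JOIN (SELECT clustername, uid
      FROM argo_archived_workflows
      WHERE clustername = 'default'
        AND instanceid = ''
        AND namespace = 'argo-map'
      ORDER BY startedat DESC
      LIMIT 1) latest
  ON latest.clustername = aw.clustername
 AND latest.uid = aw.uid;
```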

@agilgur5 agilgur5 merged commit 729ac17 into argoproj:main Sep 12, 2024
27 checks passed
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Sep 12, 2024
@ryancurrah
Contributor

ryancurrah commented Sep 12, 2024

Note that this query seems to be further optimizable per #13295 (comment), so we may want to open a new issue to add further optimizations, but this at least fixes the performance issues and gets us to a much more reasonable baseline

If you need someone to test any more query optimizations feel free to ping me.

@agilgur5
Member

agilgur5 commented Sep 14, 2024

Added a follow-up issue for further optimizations to this query in #13601

Joibel pushed a commit to pipekit/argo-workflows that referenced this pull request Sep 19, 2024
Joibel pushed a commit that referenced this pull request Sep 20, 2024