
fix(api): optimise archived list query. Fixes #13295 #13566

Merged
merged 3 commits into argoproj:main from fix/archive-list-query-optimization
Sep 12, 2024

Conversation

bom-d-van
Contributor

Fixes #13295

Motivation

The current implementation appears to cause Postgres to unmarshal the workflow JSON payload for every record in the table. By adopting a subquery approach, we are able to optimise the query from a runtime of 11495.734 ms to 44.713 ms. The data size is about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels.

This change is backward compatible and has been tested on our production environment (using PostgreSQL).

Modifications

Rewrite the generated query to use a subquery, so the JSON payload is only unmarshalled for the selected rows (effectively reducing the workload required from Postgres, although arguably Postgres should be smart enough to defer the operation until after filtering ;)):

previous:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE (("clustername" = 'default'
        AND "instanceid" = '')
       AND "namespace" = 'argo-map'
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/phase'
            AND value = 'Succeeded')
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/workflow-template'
            AND value = 'mapping1-pipeline-template-with-nfs'))
ORDER BY "startedat" DESC
LIMIT 1;
```

now:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND UID IN
    (SELECT UID
     FROM "argo_archived_workflows"
     WHERE (("clustername" = 'default'
             AND "instanceid" = '')
            AND "namespace" = 'argo-map'
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/phase'
                 AND value = 'Succeeded')
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/workflow-template'
                 AND value = 'mapping1-pipeline-template-with-nfs'))
     ORDER BY "startedat" DESC
     LIMIT 1);
```

Verification

The query runtime on our production environment was reduced from 11495.734 ms to 44.713 ms, with about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels.

(Two screenshots, taken 2024-08-29, showing the before/after measurements.)
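For anyone who wants to reproduce the measurement, a minimal sketch (assuming direct psql access to the archive database; the table and filter values match the queries above):

```sql
-- EXPLAIN (ANALYZE, BUFFERS) executes the statement and reports actual
-- runtime, so it can be used to compare the old and new query shapes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND uid IN
    (SELECT uid
     FROM argo_archived_workflows
     WHERE clustername = 'default'
       AND instanceid = ''
       AND namespace = 'argo-map'
     ORDER BY startedat DESC
     LIMIT 1);
```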

@agilgur5 agilgur5 changed the title from "fix: optimise list query in persist/sqldb/workflow_archive.go Fixes #13295" to "fix(api): optimise archived list query. Fixes #13295" Sep 6, 2024
@agilgur5
Member

agilgur5 commented Sep 6, 2024

The current implementation appears to cause Postgres to unmarshal the workflow JSON payload for every record in the table.

interesting analysis, a bit different than in the issue.

although arguably Postgres should be smart enough to defer the operation until after filtering ;)

also agreed, I would think the DB would know to avoid this 😕

It looks like some E2E tests are failing, and one of them is getting a 500 on this API call -- can you check them?

@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch 2 times, most recently from 95c4116 to b766093 September 8, 2024 08:45
@bom-d-van
Contributor Author

bom-d-van commented Sep 8, 2024

hi @agilgur5, thanks for the feedback. I have pushed another fix for MySQL; it requires a little bit more handling.

There is a test failure (artgc-artifact-not-written-failed-zn9dv), but it seems unrelated to my changes. Do you have any suggestions?

@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch from b766093 to 057aca6 September 8, 2024 09:35
@bom-d-van
Contributor Author

hmm, it seems that retry helps. :P

@agilgur5
Member

agilgur5 commented Sep 8, 2024

There is a test failure (artgc-artifact-not-written-failed-zn9dv), but it seems unrelated to my changes. Do you have any suggestions?

hmm, it seems that retry helps. :P

Sounds like a test flake then. Although it looks like it was with TestStoppedWorkflow, and not the test you mentioned above (which was added recently, so would have been easier to investigate):

```
artifacts_test.go:218: timeout after 1m30s waiting for condition
    artifacts_test.go:241:
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/artifacts_test.go:241
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:251
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/artifacts_test.go:240
        	Error:      	Received unexpected error:
        	            	The specified key does not exist.
        	Test:       	TestArtifactsSuite/TestStoppedWorkflow
=== FAIL: ArtifactsSuite/TestStoppedWorkflow
```

This one I believe has flaked before IIRC

Member

@agilgur5 agilgur5 left a comment


This mostly looks good to me but would like some folks to double-check cc @jiachengxu @terrytangyuan @Joibel @jessesuen

Also might be worthwhile to push up an image for users on the issue to test and confirm results

persist/sqldb/workflow_archive.go
@agilgur5 agilgur5 added the prioritized-review For members of the Sustainability Effort label Sep 8, 2024
@agilgur5
Member

agilgur5 commented Sep 8, 2024

The query runtime on our production environment was reduced from 11495.734 ms to 44.713 ms, with about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels.

While this requires a good bit of rows to test, I'm also thinking that we could potentially write a simplified test that checks if the query returns under a certain amount of time on a test seed. For example, under 100ms.
That way it could act as a regression test for anything severely off, but not a full performance test to catch everything. We could potentially apply that to all API calls that make queries actually 🤔
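A minimal sketch of that idea (assuming a Postgres test database seeded with enough archived workflows; `statement_timeout` is Postgres-specific and aborts the statement if it runs past the budget, which a test harness can assert on):

```sql
-- Fail outright if the archived-list query regresses past a 100ms budget.
SET statement_timeout = '100ms';
SELECT uid
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND instanceid = ''
  AND namespace = 'argo-map'
ORDER BY startedat DESC
LIMIT 1;
RESET statement_timeout;
```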

@ryancurrah
Contributor

I can test it out. We have a large non-production deployment we can test it in.

@bom-d-van
Contributor Author

While this requires a good bit of rows to test, I'm also thinking that we could potentially write a simplified test that checks if the query returns under a certain amount of time on a test seed. For example, under 100ms.

@agilgur5 yeah, that sounds quite reasonable and would be helpful in the long run. But unfortunately, I didn't spot a simple way to add a new regression test quickly, and my recent work schedule is quite tight.

Could we add a new issue and link it to this PR for now? I might be able to make the contribution a few weeks later.

The optimized query is introduced in 9856df7

Signed-off-by: Xiaofan Hu <[email protected]>
@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch 3 times, most recently from 14d07ed to 91bf3a6 September 9, 2024 14:07
@agilgur5
Member

Could we add a new issue and link it to this PR for now? I might be able to make the contribution a few weeks later.

Yea that'd be fine so long as we get some more manual tests of this. I'm loading up my dev env right now to push up an image

@ryancurrah
Contributor

Provide me an image to use and I will test it.

@agilgur5
Member

Pushed up agilgur5/argo-cli:fix-archive-list-query-optimization for testing

Member

@terrytangyuan terrytangyuan left a comment


The change looks good.

@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch 2 times, most recently from 66717c8 to 367f21c September 10, 2024 13:01
@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch from 367f21c to f6db948 September 11, 2024 01:32
@bom-d-van bom-d-van force-pushed the fix/archive-list-query-optimization branch from f6db948 to 9f39a95 September 11, 2024 03:00
@ryancurrah
Contributor

ryancurrah commented Sep 12, 2024

I can confirm this change reduces query times. Loading time in the UI when filtering on labels was reduced by 94.44%.

Query time before: 18 seconds
Query time after: 1 second
DB type: PostgreSQL
Version: 14
Number of rows (SELECT COUNT(*) FROM public.argo_archived_workflows;): 93135
CPUs: 24
Memory: 16 GB

Member

@agilgur5 agilgur5 left a comment


Thanks for the testing, @ryancurrah!

With that, some of the individual query tests folks made across both MySQL and Postgres, and 2 other approvals, I think this is good to go 🚀

Note that this query seems to be further optimizable per #13295 (comment), so we may want to open a new issue to add further optimizations, but this at least fixes the performance issues and gets us to a much more reasonable baseline
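Purely as a hypothetical illustration of the kind of follow-up that could be explored (not necessarily what the linked comment proposes), the outer `UID IN (...)` could be rewritten as a join against the LIMITed subquery:

```sql
-- Hypothetical sketch: join on the one-row subquery instead of using IN.
SELECT aw.name, aw.namespace, aw.uid, aw.phase, aw.startedat, aw.finishedat
FROM argo_archived_workflows aw
JOIN (SELECT clustername, uid
      FROM argo_archived_workflows
      WHERE clustername = 'default'
        AND instanceid = ''
        AND namespace = 'argo-map'
      ORDER BY startedat DESC
      LIMIT 1) latest
  ON latest.clustername = aw.clustername
 AND latest.uid = aw.uid;
```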

@agilgur5 agilgur5 merged commit 729ac17 into argoproj:main Sep 12, 2024
27 checks passed
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Sep 12, 2024
@ryancurrah
Contributor

ryancurrah commented Sep 12, 2024

Note that this query seems to be further optimizable per #13295 (comment), so we may want to open a new issue to add further optimizations, but this at least fixes the performance issues and gets us to a much more reasonable baseline

If you need someone to test any more query optimizations feel free to ping me.

@agilgur5
Member

agilgur5 commented Sep 14, 2024

Added a follow-up issue for further optimizations to this query in #13601

Joibel pushed a commit to pipekit/argo-workflows that referenced this pull request Sep 19, 2024
Joibel pushed a commit that referenced this pull request Sep 20, 2024