Agentless on serverless FTR agent deployment fix #201783

jeniawhite · 2024-11-26T13:30:57Z

Summary

Noticed that FTR tests are failing:
https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8

Following PR:
#199567

Looking at the code, I saw that we look at the status and expect a Pending status, yet we get a Healthy status.
It looks like this should be an async flow.

Checklist

Check the PR satisfies following conditions.

Reviewers should verify this PR satisfies this list as well.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
Flaky Test Runner was used on any tests changed
The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

See some risk examples
...

acorretti

Agreed to go for 5s and see if the async process is stable enough. This may flake out, but we'll see.

seanrathier

Thanks for taking this on for us.

seanrathier

Sorry for the reverse approval, I just realized this could be flaky with just healthy.

seanrathier · 2024-11-26T14:10:57Z

x-pack/test/cloud_security_posture_functional/agentless/create_agent.ts

+
+      // wait for eventually healthy status
+      await retry.tryForTime(5000, async () => {
+        expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy');


Suggested change

expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy');

expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy' || 'Pending' );

Maybe we should wait for a Healthy state? Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?

Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?

The state doesn't matter for this test, but we should be concerned about the state. We want to test that the UX of creating an agentless agent is working.

It could take longer than 5 seconds for the healthy state to reach ES from Fleet and the agentless agent, but it all depends on the performance of the test environment, so we could experience instances of this test failing when in fact it is not.

Perhaps we need a separate test in the Quality Gates here for the Agentless agents? A synthetic test that all the agents are eventually healthy for these QG.

Changed code to verify Pending or Healthy.
Added comment with explanation.

x-pack/test/cloud_security_posture_functional/agentless/create_agent.ts

…-fix'

…e/kibana into evgb-AgentlessServerlessFTRFix

…-fix'

elasticmachine · 2024-11-26T16:13:27Z

💔 Build Failed

Buildkite Build
Commit: 33d2361

Failed CI Steps

Linting

History

…e/kibana into evgb-AgentlessServerlessFTRFix

opauloh · 2024-11-26T16:30:05Z

x-pack/test/cloud_security_posture_functional/agentless/create_agent.ts

-      expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Pending');
+
+      // wait for eventually Pending or Healthy status
+      // purpose of this retry is to wait for the agent to be created and the status to be updated
+      // not to wait for the agent to be healthy
+      await retry.tryForTime(agentCreationTimeout, async () => {
+        const resStatus = await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus();
+        expect(resStatus == 'Healthy' || resStatus == 'Pending').to.be(true);
+      });


I honestly don't think an assertion that can have two possible states is going to be reliable in the long-term, having resolve as successful with either 'Pending' or 'Healthy' will lead to a false sensation of coverage when some bad configuration makes it stuck on Pending and we end up not catching it.

Initially implemented with a Healthy assertion, and I agree with what you've said.
But @seanrathier raised that we do not care about the actual agent state, and we just want to make sure that we have a value that makes sense, which indicates that the agent was initiated (which means I assume that the request trigger worked).
I think that it would be great if you both could sync and decide on something, and I will make the relevant changes (I've been implementing back and forth).

One thing that I would like us to do is to verify that the agent got deployed successfully and it works, I won't go into if this test should verify that or some other test (will let you both decide @opauloh @seanrathier).

…-fix'

jen-huang

The prior version of these tests before #199567 simply asserted on the successful creation of an agent policy that contains these agentless integrations.

But after this work, we hide this complexity of "agent policy" for agentless integrations from the user, so I changed it to assert on the existence of an agent status instead (I did not realized that in some environments, a healthy agent could spin up in time?).

I suggest we simply assert on the existence of the status badge (see getFirstCspmIntegrationPageAgentlessStatus) rather than what the actual status reports as, and leave E2E testing elsewhere.

seanrathier · 2024-11-26T17:38:04Z

@jeniawhite, thanks for taking on this initial work. I can take over this task right now. More changes for this are needed to fix the test and remove some confusion about what is running in Serverless Quality Gates.

@jen-huang I will change some of the test so that it is clear what is running in MKI by duplicating the test with a directory called something like mki_tests

https://github.com/elastic/security-team/issues/11274

jen-huang · 2024-11-26T20:09:34Z

FYI once #199567 (comment) is backported, this will likely need to be backported as well.

## Summary Noticed that FTR tests are failing: https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8 Following PR: elastic#199567 Looking at the code, I saw that we look at the status and expect a `Pending` status, yet we get a `Healthy` status. It looks like this should be an async flow. ### Checklist Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss. Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging. - [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) - [ ] ... --------- Co-authored-by: kibanamachine <[email protected]>

Agentless on serverless FTR fix

aa06874

jeniawhite requested a review from a team as a code owner November 26, 2024 13:30

jeniawhite requested a review from jen-huang November 26, 2024 13:31

jeniawhite marked this pull request as draft November 26, 2024 13:40

Eventually wait for Healthy

29703f7

acorretti approved these changes Nov 26, 2024

View reviewed changes

seanrathier approved these changes Nov 26, 2024

View reviewed changes

jeniawhite marked this pull request as ready for review November 26, 2024 14:05

jeniawhite added backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes labels Nov 26, 2024

jeniawhite enabled auto-merge (squash) November 26, 2024 14:07

seanrathier reviewed Nov 26, 2024

View reviewed changes

jeniawhite disabled auto-merge November 26, 2024 14:13

pheyos reviewed Nov 26, 2024

View reviewed changes

x-pack/test/cloud_security_posture_functional/agentless/create_agent.ts Outdated Show resolved Hide resolved

jeniawhite and others added 2 commits November 26, 2024 16:26

Increase timeout

b88c424

[CI] Auto-commit changed files from 'node scripts/eslint --no-cache -…

e37f85c

…-fix'

wayneseymour mentioned this pull request Nov 26, 2024

[SKIP ON MKI] x-pack/test_serverless/functional/test_suites/security/ftr/cloud_security_posture/agentless_api/create_agent.ts #201777

Closed

jeniawhite added 2 commits November 26, 2024 17:21

Wait for any status and lower timeout

c2e5736

Merge branch 'evgb-AgentlessServerlessFTRFix' of github.com:jeniawhit…

bf35a6b

…e/kibana into evgb-AgentlessServerlessFTRFix

jeniawhite requested a review from pheyos November 26, 2024 15:26

[CI] Auto-commit changed files from 'node scripts/eslint --no-cache -…

33d2361

…-fix'

jeniawhite enabled auto-merge (squash) November 26, 2024 16:07

jeniawhite disabled auto-merge November 26, 2024 16:07

jeniawhite added 2 commits November 26, 2024 18:27

Linter fix

c6ca426

Merge branch 'evgb-AgentlessServerlessFTRFix' of github.com:jeniawhit…

797a885

…e/kibana into evgb-AgentlessServerlessFTRFix

opauloh reviewed Nov 26, 2024

View reviewed changes

jeniawhite requested a review from opauloh November 26, 2024 16:38

[CI] Auto-commit changed files from 'node scripts/eslint --no-cache -…

be589b5

…-fix'

jen-huang reviewed Nov 26, 2024

View reviewed changes

jeniawhite enabled auto-merge (squash) November 26, 2024 18:38

jeniawhite merged commit 9addb53 into elastic:main Nov 26, 2024
22 checks passed

kibanamachine added the v9.0.0 label Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agentless on serverless FTR agent deployment fix #201783

Agentless on serverless FTR agent deployment fix #201783

jeniawhite commented Nov 26, 2024 •

edited

Loading

acorretti left a comment

seanrathier left a comment

seanrathier left a comment

seanrathier Nov 26, 2024

jeniawhite Nov 26, 2024

seanrathier Nov 26, 2024 •

edited

Loading

jeniawhite Nov 26, 2024

elasticmachine commented Nov 26, 2024 •

edited

Loading

opauloh Nov 26, 2024

jeniawhite Nov 26, 2024 •

edited

Loading

jen-huang left a comment

seanrathier commented Nov 26, 2024

jen-huang commented Nov 26, 2024

	expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy');
	expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy' \|\| 'Pending' );

Agentless on serverless FTR agent deployment fix #201783

Agentless on serverless FTR agent deployment fix #201783

Conversation

jeniawhite commented Nov 26, 2024 • edited Loading

Summary

Checklist

Identify risks

acorretti left a comment

Choose a reason for hiding this comment

seanrathier left a comment

Choose a reason for hiding this comment

seanrathier left a comment

Choose a reason for hiding this comment

seanrathier Nov 26, 2024

Choose a reason for hiding this comment

jeniawhite Nov 26, 2024

Choose a reason for hiding this comment

seanrathier Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

jeniawhite Nov 26, 2024

Choose a reason for hiding this comment

elasticmachine commented Nov 26, 2024 • edited Loading

💔 Build Failed

Failed CI Steps

History

opauloh Nov 26, 2024

Choose a reason for hiding this comment

jeniawhite Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

jen-huang left a comment

Choose a reason for hiding this comment

seanrathier commented Nov 26, 2024

jen-huang commented Nov 26, 2024

jeniawhite commented Nov 26, 2024 •

edited

Loading

seanrathier Nov 26, 2024 •

edited

Loading

elasticmachine commented Nov 26, 2024 •

edited

Loading

jeniawhite Nov 26, 2024 •

edited

Loading