-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Agentless on serverless FTR agent deployment fix #201783
Agentless on serverless FTR agent deployment fix #201783
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed to go for 5s and see if the async process is stable enough. This may flake out, but we'll see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the reverse approval, I just realized this could be flaky with just healthy.
|
||
// wait for eventually healthy status | ||
await retry.tryForTime(5000, async () => { | ||
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy'); | |
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy' || 'Pending' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should wait for a Healthy
state? Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?
The state doesn't matter for this test, but we should be concerned about the state. We want to test that the UX of creating an agentless agent is working.
It could take longer than 5 seconds for the healthy state to reach ES from Fleet and the agentless agent, but it all depends on the performance of the test environment, so we could experience instances of this test failing when in fact it is not.
Perhaps we need a separate test in the Quality Gates here for the Agentless agents? A synthetic test that all the agents are eventually healthy for these QG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed code to verify Pending
or Healthy
.
Added comment with explanation.
x-pack/test/cloud_security_posture_functional/agentless/create_agent.ts
Outdated
Show resolved
Hide resolved
…e/kibana into evgb-AgentlessServerlessFTRFix
💔 Build Failed
Failed CI StepsHistory |
…e/kibana into evgb-AgentlessServerlessFTRFix
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Pending'); | ||
|
||
// wait for eventually Pending or Healthy status | ||
// purpose of this retry is to wait for the agent to be created and the status to be updated | ||
// not to wait for the agent to be healthy | ||
await retry.tryForTime(agentCreationTimeout, async () => { | ||
const resStatus = await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus(); | ||
expect(resStatus == 'Healthy' || resStatus == 'Pending').to.be(true); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly don't think an assertion that can have two possible states is going to be reliable in the long-term, having resolve as successful with either 'Pending' or 'Healthy' will lead to a false sensation of coverage when some bad configuration makes it stuck on Pending and we end up not catching it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially implemented with a Healthy
assertion, and I agree with what you've said.
But @seanrathier raised that we do not care about the actual agent state, and we just want to make sure that we have a value that makes sense, which indicates that the agent was initiated (which means I assume that the request trigger worked).
I think that it would be great if you both could sync and decide on something, and I will make the relevant changes (I've been implementing back and forth).
One thing that I would like us to do is to verify that the agent got deployed successfully and it works, I won't go into if this test should verify that or some other test (will let you both decide @opauloh @seanrathier).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prior version of these tests before #199567 simply asserted on the successful creation of an agent policy that contains these agentless integrations.
But after this work, we hide this complexity of "agent policy" for agentless integrations from the user, so I changed it to assert on the existence of an agent status instead (I did not realized that in some environments, a healthy agent could spin up in time?).
I suggest we simply assert on the existence of the status badge (see getFirstCspmIntegrationPageAgentlessStatus
) rather than what the actual status reports as, and leave E2E testing elsewhere.
@jeniawhite, thanks for taking on this initial work. I can take over this task right now. More changes for this are needed to fix the test and remove some confusion about what is running in Serverless Quality Gates. @jen-huang I will change some of the test so that it is clear what is running in MKI by duplicating the test with a directory called something like |
FYI once #199567 (comment) is backported, this will likely need to be backported as well. |
## Summary Noticed that FTR tests are failing: https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8 Following PR: elastic#199567 Looking at the code, I saw that we look at the status and expect a `Pending` status, yet we get a `Healthy` status. It looks like this should be an async flow. ### Checklist Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss. Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging. - [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) - [ ] ... --------- Co-authored-by: kibanamachine <[email protected]>
## Summary Noticed that FTR tests are failing: https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8 Following PR: elastic#199567 Looking at the code, I saw that we look at the status and expect a `Pending` status, yet we get a `Healthy` status. It looks like this should be an async flow. ### Checklist Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss. Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging. - [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) - [ ] ... --------- Co-authored-by: kibanamachine <[email protected]>
Summary
Noticed that FTR tests are failing:
https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8
Following PR:
#199567
Looking at the code, I saw that we look at the status and expect a
Pending
status, yet we get aHealthy
status.It looks like this should be an async flow.
Checklist
Check the PR satisfies following conditions.
Reviewers should verify this PR satisfies this list as well.
release_note:breaking
label should be applied in these situations.release_note:*
label is applied per the guidelinesIdentify risks
Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.
Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.