Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agentless on serverless FTR agent deployment fix #201783

Merged
merged 10 commits into from
Nov 26, 2024

Conversation

jeniawhite
Copy link
Contributor

@jeniawhite jeniawhite commented Nov 26, 2024

Summary

Noticed that FTR tests are failing:
https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8

Following PR:
#199567

Looking at the code, I saw that we look at the status and expect a Pending status, yet we get a Healthy status.
It looks like this should be an async flow.

Checklist

Check the PR satisfies following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • Flaky Test Runner was used on any tests changed
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

@jeniawhite jeniawhite requested a review from a team as a code owner November 26, 2024 13:30
@jeniawhite jeniawhite requested a review from jen-huang November 26, 2024 13:31
@jeniawhite jeniawhite marked this pull request as draft November 26, 2024 13:40
Copy link

@acorretti acorretti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed to go for 5s and see if the async process is stable enough. This may flake out, but we'll see.

Copy link
Contributor

@seanrathier seanrathier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on for us.

@jeniawhite jeniawhite marked this pull request as ready for review November 26, 2024 14:05
@jeniawhite jeniawhite added backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes labels Nov 26, 2024
@jeniawhite jeniawhite enabled auto-merge (squash) November 26, 2024 14:07
Copy link
Contributor

@seanrathier seanrathier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the reverse approval, I just realized this could be flaky with just healthy.


// wait for eventually healthy status
await retry.tryForTime(5000, async () => {
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy');
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy' || 'Pending' );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should wait for a Healthy state? Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?

Copy link
Contributor

@seanrathier seanrathier Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?

The state doesn't matter for this test, but we should be concerned about the state. We want to test that the UX of creating an agentless agent is working.

It could take longer than 5 seconds for the healthy state to reach ES from Fleet and the agentless agent, but it all depends on the performance of the test environment, so we could experience instances of this test failing when in fact it is not.

Perhaps we need a separate test in the Quality Gates here for the Agentless agents? A synthetic test that all the agents are eventually healthy for these QG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed code to verify Pending or Healthy.
Added comment with explanation.

@jeniawhite jeniawhite disabled auto-merge November 26, 2024 14:13
@jeniawhite jeniawhite requested a review from pheyos November 26, 2024 15:26
@jeniawhite jeniawhite enabled auto-merge (squash) November 26, 2024 16:07
@jeniawhite jeniawhite disabled auto-merge November 26, 2024 16:07
@elasticmachine
Copy link
Contributor

elasticmachine commented Nov 26, 2024

💔 Build Failed

Failed CI Steps

History

Comment on lines 71 to 80
expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Pending');

// wait for eventually Pending or Healthy status
// purpose of this retry is to wait for the agent to be created and the status to be updated
// not to wait for the agent to be healthy
await retry.tryForTime(agentCreationTimeout, async () => {
const resStatus = await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus();
expect(resStatus == 'Healthy' || resStatus == 'Pending').to.be(true);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't think an assertion that can have two possible states is going to be reliable in the long-term, having resolve as successful with either 'Pending' or 'Healthy' will lead to a false sensation of coverage when some bad configuration makes it stuck on Pending and we end up not catching it.

Copy link
Contributor Author

@jeniawhite jeniawhite Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially implemented with a Healthy assertion, and I agree with what you've said.
But @seanrathier raised that we do not care about the actual agent state, and we just want to make sure that we have a value that makes sense, which indicates that the agent was initiated (which means I assume that the request trigger worked).
I think that it would be great if you both could sync and decide on something, and I will make the relevant changes (I've been implementing back and forth).

One thing that I would like us to do is to verify that the agent got deployed successfully and it works, I won't go into if this test should verify that or some other test (will let you both decide @opauloh @seanrathier).

@jeniawhite jeniawhite requested a review from opauloh November 26, 2024 16:38
Copy link
Contributor

@jen-huang jen-huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prior version of these tests before #199567 simply asserted on the successful creation of an agent policy that contains these agentless integrations.

But after this work, we hide this complexity of "agent policy" for agentless integrations from the user, so I changed it to assert on the existence of an agent status instead (I did not realized that in some environments, a healthy agent could spin up in time?).

I suggest we simply assert on the existence of the status badge (see getFirstCspmIntegrationPageAgentlessStatus) rather than what the actual status reports as, and leave E2E testing elsewhere.

@seanrathier
Copy link
Contributor

@jeniawhite, thanks for taking on this initial work. I can take over this task right now. More changes for this are needed to fix the test and remove some confusion about what is running in Serverless Quality Gates.

@jen-huang I will change some of the test so that it is clear what is running in MKI by duplicating the test with a directory called something like mki_tests

@jeniawhite jeniawhite enabled auto-merge (squash) November 26, 2024 18:38
@jeniawhite jeniawhite merged commit 9addb53 into elastic:main Nov 26, 2024
22 checks passed
@jen-huang
Copy link
Contributor

FYI once #199567 (comment) is backported, this will likely need to be backported as well.

paulinashakirova pushed a commit to paulinashakirova/kibana that referenced this pull request Nov 26, 2024
## Summary
Noticed that FTR tests are failing:

https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8

Following PR:
elastic#199567

Looking at the code, I saw that we look at the status and expect a
`Pending` status, yet we get a `Healthy` status.
It looks like this should be an async flow.

### Checklist

Check the PR satisfies following conditions. 

Reviewers should verify this PR satisfies this list as well.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

### Identify risks

Does this PR introduce any risks? For example, consider risks like hard
to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified
risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...

---------

Co-authored-by: kibanamachine <[email protected]>
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
## Summary
Noticed that FTR tests are failing:

https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8

Following PR:
elastic#199567

Looking at the code, I saw that we look at the status and expect a
`Pending` status, yet we get a `Healthy` status.
It looks like this should be an async flow.

### Checklist

Check the PR satisfies following conditions. 

Reviewers should verify this PR satisfies this list as well.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

### Identify risks

Does this PR introduce any risks? For example, consider risks like hard
to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified
risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...

---------

Co-authored-by: kibanamachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants