Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47064

Merged
merged 4 commits into from
Oct 16, 2024

Conversation

marcoandredinis
Copy link
Contributor

@marcoandredinis marcoandredinis commented Oct 1, 2024

This PR changes the DiscoveryService to start creating and updating
Discover EC2 User Tasks.

So, what are Discover EC2 User Tasks?
When users set up Auto Discover for EC2 Instances, they don't have a
good way of checking for issues on their configured matchers.

We created User Tasks as a way to warn Users that something's wrong.
Each User Task should describe an issue that happened and a way to fix
it.
This has potential to be used to report unexpected events trough the
whole system, which are not errors per se, but something the user should
take action in order to improve the situation.
In this case, we are creating a sub type of those tasks: DiscoverEC2.

From now on, when the DiscoveryService fails to auto-enroll an instance,
it will create a DiscoverEC2 User Task grouping all the failed instances
by the following props:

  • integration
  • issue type
  • account id
  • region

A follow up PR will also create notifications so that the user can
actually be notified on those User Tasks and take action.

Demo:

$ tctl get user_tasks
kind: user_task
metadata:
  expires:
    nanos: 542005000
    seconds: 1727976384
  name: 52c33845-a413-5b2e-bfa3-1afc5d79236a
  revision: 663ed34b-6c65-4976-9052-af5898b295e1
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-123:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-123
        invocation_url: https://eu-west-2.console.aws.amazon.com/systems-manager/run-command/5041ad35-deab-4be7-8bfb-cd64f1cc2a24/i-123
        sync_time:
          nanos: 525000000
          seconds: 1727976024
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-script-failure
  state: OPEN
  task_type: discover-ec2
version: v1
---
kind: user_task
metadata:
  expires:
    nanos: 540832000
    seconds: 1727976378
  name: 7da201b0-57d1-503b-856e-de8ca5acd1e1
  revision: 9f803bb5-7ef6-4bd6-a2e5-fbbf8119248a
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-1234:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-1234
        sync_time:
          nanos: 379000000
          seconds: 1727976018
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-unsupported-os
  state: OPEN
  task_type: discover-ec2
version: v1
---
kind: user_task
metadata:
  expires:
    nanos: 782998000
    seconds: 1727976378
  name: b64397ac-764d-5d9d-af09-30d4775f7a78
  revision: 30ac5a71-a314-43de-a190-ee7db6cfa36a
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-12345:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-12345
        sync_time:
          nanos: 631000000
          seconds: 1727976018
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-agent-connection-lost
  state: OPEN
  task_type: discover-ec2
version: v1
---
kind: user_task
metadata:
  expires:
    nanos: 291505000
    seconds: 1727976378
  name: dcdca1c5-d489-58bd-8f01-6477ad9756d1
  revision: 176e6527-5cf7-43a2-b1b5-39cdff720f96
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-123456:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-123456
        sync_time:
          nanos: 887000000
          seconds: 1727976017
      i-1234567:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-1234567
        sync_time:
          nanos: 134000000
          seconds: 1727976018
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-agent-not-registered
  state: OPEN
  task_type: discover-ec2
version: v1

@marcoandredinis marcoandredinis added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v16 labels Oct 1, 2024
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from ccff419 to 7567361 Compare October 2, 2024 08:08
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from 7567361 to 0c450ad Compare October 2, 2024 13:42
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from 17cf648 to c10e937 Compare October 3, 2024 17:22
@marcoandredinis marcoandredinis marked this pull request as ready for review October 3, 2024 17:24
@marcoandredinis
Copy link
Contributor Author

(i'm working on the flaky test but this should be reviewable already 👍 )

@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from bdb1723 to 3eb2db8 Compare October 4, 2024 10:30
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from 3eb2db8 to a5b14d9 Compare October 4, 2024 10:45
Base automatically changed from marco/discover_ec2_tasks to master October 4, 2024 11:22
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from c1e1f45 to 6c641a2 Compare October 4, 2024 14:48
@marcoandredinis
Copy link
Contributor Author

@fheinecke @timothyb89 Can you please take a look when you get a chance?

Copy link
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions and suggestions.

lib/srv/discovery/discovery.go Outdated Show resolved Hide resolved
lib/srv/discovery/discovery.go Outdated Show resolved Hide resolved
lib/srv/discovery/status.go Outdated Show resolved Hide resolved
lib/srv/discovery/status.go Outdated Show resolved Hide resolved
Comment on lines 439 to 457
for existingIntanceID, existingInstance := range currentUserTask.Spec.DiscoverEc2.Instances {
// Each DiscoveryService works on all the DiscoveryConfigs assigned to a given DiscoveryGroup.
// So, it's safe to say that current DiscoveryService has the last state for a given DiscoveryGroup.
// If other instances exist for this DiscoveryGroup, they can be discarded because, as said before, the current DiscoveryService has the last state for a given DiscoveryGroup.
if existingInstance.DiscoveryGroup == s.DiscoveryGroup {
continue
}

// For existing instances whose sync time is too far in the past, just drop them.
// This ensures that if an instance is removed from AWS, it will eventually disappear from the User Tasks' instance list.
// It might also be the case that the DiscoveryConfig was changed and the instance is no longer matched (because of labels/regions or other matchers).
instanceIssueExpiration := s.clock.Now().Add(-2 * s.PollInterval)
if existingInstance.SyncTime.AsTime().Before(instanceIssueExpiration) {
continue
}

// Merge existing backend state into in-memory object.
failedInstances[existingIntanceID] = existingInstance
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a bit of a hard time parsing this logic. Are we trying to merge instances from this discovery service into existing tasks object? Can it be simplified or refactored somehow to make this more clear? Right now it's a big method that does several things so it's a bit hard to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct.
It brings the existing task from the cluster's backend and merges it into the existing in-memory state.
And then updates it.

I've moved the merge part into its own method.
Let me know what you think.

lib/srv/discovery/status.go Outdated Show resolved Hide resolved
// groupPending is used to register which groups were changed in memory but were not yet sent upstream.
// When upserting User Tasks, if the group is not present in groupPending,
// then the cluster already has the latest version of this particular group.
groupPending map[awsEC2FailedEnrollmentGroup]struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: groupPending is a pretty confusing name tbh. What is the "group" and why is it "pending"? Can we pick a more descriptive name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 4 fields that we use to group instances into a task: integration, issue type, region and account id.
So, the group is the unique key for this resource.
What do you think about groupsToSync or taskKeysToSync?
Changing to the latter, but please let me know if that's clearer?

lib/srv/discovery/status.go Outdated Show resolved Hide resolved
lib/srv/discovery/status.go Show resolved Hide resolved
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from d2d1849 to bdc10e0 Compare October 9, 2024 10:28
lib/srv/discovery/discovery.go Outdated Show resolved Hide resolved
@@ -293,5 +300,232 @@ func (s *Server) ReportEC2SSMInstallationResult(ctx context.Context, result *ser

s.updateDiscoveryConfigStatus(result.DiscoveryConfig)

s.awsEC2Tasks.addFailedEnrollment(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this invocation and the one in discovery.go above? Just trying to understand what errors will be captured by this function vs the other one cause they seem to be both processing results of SSM runs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 3 places where we generate tasks:
A) we describe the Instance's SSM Agent state and if there's an error (eg, ssm not installed, ssm not online, os not supported)
B) sending the command (ssm:SendCommand) fails (wrong permissions, document params issues, ...)
C) after starting the execution, we check the execution result for each instance

A and C are reported using the ReportEC2SSMInstallationResult
B is reported in discovery.go

if err := s.ec2Installer.Run(s.ctx, req); err != nil {

lib/srv/discovery/status.go Outdated Show resolved Hide resolved
lib/srv/discovery/status.go Outdated Show resolved Hide resolved
lib/srv/discovery/status.go Outdated Show resolved Hide resolved
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 5 times, most recently from 7868348 to 1ee8889 Compare October 15, 2024 16:13
This PR changes the DiscoveryService to start creating and updating
Discover EC2 User Tasks.

So, what are Discover EC2 User Tasks?
When users set up Auto Discover for EC2 Instances, they don't have a
good way of checking for issues on their configured matchers.

We created User Tasks as a way to warn Users that something's wrong.
Each User Task should describe an issue that happened and a way to fix
it.
This has potential to be used to report unexpected events trough the
whole system, which are not errors per se, but something the user should
take action in order to improve the situation.
In this case, we are creating a sub type of those tasks: DiscoverEC2.

From now on, when the DiscoveryService fails to auto-enroll an instance,
it will create a DiscoverEC2 User Task grouping all the failed instances
by the following props:
- integration
- issue type
- account id
- region

A follow up PR will also create notifications so that the user can
actually be notified on those User Tasks and take action.
@marcoandredinis marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from 1ee8889 to c2bae80 Compare October 15, 2024 17:29
@marcoandredinis marcoandredinis added this pull request to the merge queue Oct 16, 2024
Merged via the queue into master with commit 34f9bb7 Oct 16, 2024
40 checks passed
@marcoandredinis marcoandredinis deleted the marco/discovery_emit_discoverec2_tasks branch October 16, 2024 10:33
@public-teleport-github-review-bot

@marcoandredinis See the table below for backport results.

Branch Result
branch/v16 Create PR

mvbrock pushed a commit that referenced this pull request Oct 16, 2024
…es (#47064)

* Create DiscoverEC2 User Tasks for failed EC2 enrollments

This PR changes the DiscoveryService to start creating and updating
Discover EC2 User Tasks.

So, what are Discover EC2 User Tasks?
When users set up Auto Discover for EC2 Instances, they don't have a
good way of checking for issues on their configured matchers.

We created User Tasks as a way to warn Users that something's wrong.
Each User Task should describe an issue that happened and a way to fix
it.
This has potential to be used to report unexpected events trough the
whole system, which are not errors per se, but something the user should
take action in order to improve the situation.
In this case, we are creating a sub type of those tasks: DiscoverEC2.

From now on, when the DiscoveryService fails to auto-enroll an instance,
it will create a DiscoverEC2 User Task grouping all the failed instances
by the following props:
- integration
- issue type
- account id
- region

A follow up PR will also create notifications so that the user can
actually be notified on those User Tasks and take action.

* improve naming and split function into multiple methods

* rename task issues variable name

* fix context early cancelation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/branch/v16 discovery no-changelog Indicates that a PR does not require a changelog entry size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants