Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47064

marcoandredinis · 2024-10-01T20:52:25Z

This PR changes the DiscoveryService to start creating and updating
Discover EC2 User Tasks.

So, what are Discover EC2 User Tasks?
When users set up Auto Discover for EC2 Instances, they don't have a
good way of checking for issues on their configured matchers.

We created User Tasks as a way to warn Users that something's wrong.
Each User Task should describe an issue that happened and a way to fix
it.
This has potential to be used to report unexpected events trough the
whole system, which are not errors per se, but something the user should
take action in order to improve the situation.
In this case, we are creating a sub type of those tasks: DiscoverEC2.

From now on, when the DiscoveryService fails to auto-enroll an instance,
it will create a DiscoverEC2 User Task grouping all the failed instances
by the following props:

integration
issue type
account id
region

A follow up PR will also create notifications so that the user can
actually be notified on those User Tasks and take action.

Demo:

$ tctl get user_tasks
kind: user_task
metadata:
  expires:
    nanos: 542005000
    seconds: 1727976384
  name: 52c33845-a413-5b2e-bfa3-1afc5d79236a
  revision: 663ed34b-6c65-4976-9052-af5898b295e1
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-123:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-123
        invocation_url: https://eu-west-2.console.aws.amazon.com/systems-manager/run-command/5041ad35-deab-4be7-8bfb-cd64f1cc2a24/i-123
        sync_time:
          nanos: 525000000
          seconds: 1727976024
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-script-failure
  state: OPEN
  task_type: discover-ec2
version: v1
---
kind: user_task
metadata:
  expires:
    nanos: 540832000
    seconds: 1727976378
  name: 7da201b0-57d1-503b-856e-de8ca5acd1e1
  revision: 9f803bb5-7ef6-4bd6-a2e5-fbbf8119248a
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-1234:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-1234
        sync_time:
          nanos: 379000000
          seconds: 1727976018
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-unsupported-os
  state: OPEN
  task_type: discover-ec2
version: v1
---
kind: user_task
metadata:
  expires:
    nanos: 782998000
    seconds: 1727976378
  name: b64397ac-764d-5d9d-af09-30d4775f7a78
  revision: 30ac5a71-a314-43de-a190-ee7db6cfa36a
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-12345:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-12345
        sync_time:
          nanos: 631000000
          seconds: 1727976018
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-agent-connection-lost
  state: OPEN
  task_type: discover-ec2
version: v1
---
kind: user_task
metadata:
  expires:
    nanos: 291505000
    seconds: 1727976378
  name: dcdca1c5-d489-58bd-8f01-6477ad9756d1
  revision: 176e6527-5cf7-43a2-b1b5-39cdff720f96
spec:
  discover_ec2:
    account_id: "123456789012"
    instances:
      i-123456:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-123456
        sync_time:
          nanos: 887000000
          seconds: 1727976017
      i-1234567:
        discovery_config: dc001
        discovery_group: aws-prod
        instance_id: i-1234567
        sync_time:
          nanos: 134000000
          seconds: 1727976018
    region: eu-west-2
  integration: teleportdev
  issue_type: ec2-ssm-agent-not-registered
  state: OPEN
  task_type: discover-ec2
version: v1

marcoandredinis · 2024-10-03T17:40:04Z

(i'm working on the flaky test but this should be reviewable already 👍 )

marcoandredinis · 2024-10-08T13:42:44Z

@fheinecke @timothyb89 Can you please take a look when you get a chance?

r0mant

Left some questions and suggestions.

lib/srv/discovery/discovery.go

lib/srv/discovery/status.go

r0mant · 2024-10-08T18:31:23Z

lib/srv/discovery/status.go

+		for existingIntanceID, existingInstance := range currentUserTask.Spec.DiscoverEc2.Instances {
+			// Each DiscoveryService works on all the DiscoveryConfigs assigned to a given DiscoveryGroup.
+			// So, it's safe to say that current DiscoveryService has the last state for a given DiscoveryGroup.
+			// If other instances exist for this DiscoveryGroup, they can be discarded because, as said before, the current DiscoveryService has the last state for a given DiscoveryGroup.
+			if existingInstance.DiscoveryGroup == s.DiscoveryGroup {
+				continue
+			}
+
+			// For existing instances whose sync time is too far in the past, just drop them.
+			// This ensures that if an instance is removed from AWS, it will eventually disappear from the User Tasks' instance list.
+			// It might also be the case that the DiscoveryConfig was changed and the instance is no longer matched (because of labels/regions or other matchers).
+			instanceIssueExpiration := s.clock.Now().Add(-2 * s.PollInterval)
+			if existingInstance.SyncTime.AsTime().Before(instanceIssueExpiration) {
+				continue
+			}
+
+			// Merge existing backend state into in-memory object.
+			failedInstances[existingIntanceID] = existingInstance
+		}


I'm having a bit of a hard time parsing this logic. Are we trying to merge instances from this discovery service into existing tasks object? Can it be simplified or refactored somehow to make this more clear? Right now it's a big method that does several things so it's a bit hard to follow.

Yes, that's correct.
It brings the existing task from the cluster's backend and merges it into the existing in-memory state.
And then updates it.

I've moved the merge part into its own method.
Let me know what you think.

lib/srv/discovery/status.go

r0mant · 2024-10-08T18:46:26Z

lib/srv/discovery/status.go

+	// groupPending is used to register which groups were changed in memory but were not yet sent upstream.
+	// When upserting User Tasks, if the group is not present in groupPending,
+	// then the cluster already has the latest version of this particular group.
+	groupPending map[awsEC2FailedEnrollmentGroup]struct{}


Nit: groupPending is a pretty confusing name tbh. What is the "group" and why is it "pending"? Can we pick a more descriptive name?

We have 4 fields that we use to group instances into a task: integration, issue type, region and account id.
So, the group is the unique key for this resource.
What do you think about groupsToSync or taskKeysToSync?
Changing to the latter, but please let me know if that's clearer?

lib/srv/discovery/status.go

lib/srv/discovery/discovery.go

r0mant · 2024-10-09T18:07:30Z

lib/srv/discovery/status.go

@@ -293,5 +300,232 @@ func (s *Server) ReportEC2SSMInstallationResult(ctx context.Context, result *ser

 	s.updateDiscoveryConfigStatus(result.DiscoveryConfig)

+	s.awsEC2Tasks.addFailedEnrollment(


What's the difference between this invocation and the one in discovery.go above? Just trying to understand what errors will be captured by this function vs the other one cause they seem to be both processing results of SSM runs?

We have 3 places where we generate tasks:
A) we describe the Instance's SSM Agent state and if there's an error (eg, ssm not installed, ssm not online, os not supported)
B) sending the command (ssm:SendCommand) fails (wrong permissions, document params issues, ...)
C) after starting the execution, we check the execution result for each instance

A and C are reported using the ReportEC2SSMInstallationResult
B is reported in discovery.go

teleport/lib/srv/discovery/discovery.go

Line 963 in aeefe42

if err := s.ec2Installer.Run(s.ctx, req); err != nil {

lib/srv/discovery/status.go

This PR changes the DiscoveryService to start creating and updating Discover EC2 User Tasks. So, what are Discover EC2 User Tasks? When users set up Auto Discover for EC2 Instances, they don't have a good way of checking for issues on their configured matchers. We created User Tasks as a way to warn Users that something's wrong. Each User Task should describe an issue that happened and a way to fix it. This has potential to be used to report unexpected events trough the whole system, which are not errors per se, but something the user should take action in order to improve the situation. In this case, we are creating a sub type of those tasks: DiscoverEC2. From now on, when the DiscoveryService fails to auto-enroll an instance, it will create a DiscoverEC2 User Task grouping all the failed instances by the following props: - integration - issue type - account id - region A follow up PR will also create notifications so that the user can actually be notified on those User Tasks and take action.

public-teleport-github-review-bot · 2024-10-16T10:35:00Z

@marcoandredinis See the table below for backport results.

Branch	Result
branch/v16	Create PR

…es (#47064) * Create DiscoverEC2 User Tasks for failed EC2 enrollments This PR changes the DiscoveryService to start creating and updating Discover EC2 User Tasks. So, what are Discover EC2 User Tasks? When users set up Auto Discover for EC2 Instances, they don't have a good way of checking for issues on their configured matchers. We created User Tasks as a way to warn Users that something's wrong. Each User Task should describe an issue that happened and a way to fix it. This has potential to be used to report unexpected events trough the whole system, which are not errors per se, but something the user should take action in order to improve the situation. In this case, we are creating a sub type of those tasks: DiscoverEC2. From now on, when the DiscoveryService fails to auto-enroll an instance, it will create a DiscoverEC2 User Task grouping all the failed instances by the following props: - integration - issue type - account id - region A follow up PR will also create notifications so that the user can actually be notified on those User Tasks and take action. * improve naming and split function into multiple methods * rename task issues variable name * fix context early cancelation

marcoandredinis added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v16 labels Oct 1, 2024

marcoandredinis mentioned this pull request Oct 1, 2024

AWS OIDC Integration Dashboard #41909

Open

marcoandredinis force-pushed the marco/discover_ec2_tasks branch from 4e32211 to 873c321 Compare October 2, 2024 08:08

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from ccff419 to 7567361 Compare October 2, 2024 08:08

marcoandredinis force-pushed the marco/discover_ec2_tasks branch from 873c321 to 5e20588 Compare October 2, 2024 13:23

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from 7567361 to 0c450ad Compare October 2, 2024 13:42

marcoandredinis force-pushed the marco/discover_ec2_tasks branch from 5e20588 to 497f0aa Compare October 3, 2024 16:37

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from 17cf648 to c10e937 Compare October 3, 2024 17:22

marcoandredinis marked this pull request as ready for review October 3, 2024 17:24

github-actions bot requested a review from fheinecke October 3, 2024 17:25

github-actions bot added discovery size/md labels Oct 3, 2024

github-actions bot requested a review from timothyb89 October 3, 2024 17:25

marcoandredinis force-pushed the marco/discover_ec2_tasks branch from 497f0aa to 649c8be Compare October 4, 2024 10:11

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from bdb1723 to 3eb2db8 Compare October 4, 2024 10:30

marcoandredinis force-pushed the marco/discover_ec2_tasks branch from 649c8be to 43cbfc1 Compare October 4, 2024 10:31

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from 3eb2db8 to a5b14d9 Compare October 4, 2024 10:45

Base automatically changed from marco/discover_ec2_tasks to master October 4, 2024 11:22

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from c1e1f45 to 6c641a2 Compare October 4, 2024 14:48

r0mant reviewed Oct 8, 2024

View reviewed changes

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 2 times, most recently from d2d1849 to bdc10e0 Compare October 9, 2024 10:28

marcoandredinis requested a review from r0mant October 9, 2024 10:45

r0mant approved these changes Oct 9, 2024

View reviewed changes

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch 5 times, most recently from 7868348 to 1ee8889 Compare October 15, 2024 16:13

marcoandredinis added 4 commits October 15, 2024 18:29

improve naming and split function into multiple methods

5ae1833

rename task issues variable name

77ed708

fix context early cancelation

c2bae80

marcoandredinis force-pushed the marco/discovery_emit_discoverec2_tasks branch from 1ee8889 to c2bae80 Compare October 15, 2024 17:29

strideynet approved these changes Oct 16, 2024

View reviewed changes

public-teleport-github-review-bot bot removed request for timothyb89 and fheinecke October 16, 2024 10:12

marcoandredinis added this pull request to the merge queue Oct 16, 2024

Merged via the queue into master with commit 34f9bb7 Oct 16, 2024
40 checks passed

marcoandredinis deleted the marco/discovery_emit_discoverec2_tasks branch October 16, 2024 10:33

marcoandredinis mentioned this pull request Oct 16, 2024

[v16] Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47618

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47064

Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47064

marcoandredinis commented Oct 1, 2024 •

edited

Loading

marcoandredinis commented Oct 3, 2024

marcoandredinis commented Oct 8, 2024

r0mant left a comment

r0mant Oct 8, 2024

marcoandredinis Oct 9, 2024

r0mant Oct 8, 2024

marcoandredinis Oct 9, 2024

r0mant Oct 9, 2024

marcoandredinis Oct 10, 2024

public-teleport-github-review-bot bot commented Oct 16, 2024

		@@ -293,5 +300,232 @@ func (s Server) ReportEC2SSMInstallationResult(ctx context.Context, result ser

		s.updateDiscoveryConfigStatus(result.DiscoveryConfig)

		s.awsEC2Tasks.addFailedEnrollment(

Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47064

Create DiscoverEC2 User Tasks when Auto Discover fails on EC2 instances #47064

Conversation

marcoandredinis commented Oct 1, 2024 • edited Loading

marcoandredinis commented Oct 3, 2024

marcoandredinis commented Oct 8, 2024

r0mant left a comment

Choose a reason for hiding this comment

r0mant Oct 8, 2024

Choose a reason for hiding this comment

marcoandredinis Oct 9, 2024

Choose a reason for hiding this comment

r0mant Oct 8, 2024

Choose a reason for hiding this comment

marcoandredinis Oct 9, 2024

Choose a reason for hiding this comment

r0mant Oct 9, 2024

Choose a reason for hiding this comment

marcoandredinis Oct 10, 2024

Choose a reason for hiding this comment

public-teleport-github-review-bot bot commented Oct 16, 2024

marcoandredinis commented Oct 1, 2024 •

edited

Loading