Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure VM MSI support #584

Merged
merged 7 commits into from
Mar 7, 2025
Merged

Azure VM MSI support #584

merged 7 commits into from
Mar 7, 2025

Conversation

jepio
Copy link
Member

@jepio jepio commented Feb 28, 2025

Support for assigning a managed identity to the VM. The MSI is looked up by name. A scenario this unlocks is for example running kola tests from within a vm spawned with kola spawn. Or interacting with a storage account.

Switch to managed boot diagnostics (does not require creating a storage acount).

Fix --keys support: keys need to be deduplicated in ignition v3 and some tests don't tolerate ssh keys in userdata (which we must follow).

@jepio jepio requested review from t-lo, tormath1 and Copilot February 28, 2025 10:14

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@t-lo
Copy link
Member

t-lo commented Feb 28, 2025

@jepio I tested this with a VM set up with user managed identity with Contributor level access to the whole subscription. kola run picks up the identity just fine and can create resources and launch VMs, but kola is unable to ssh into the test VMs and to access their boot diagnostics.

In my test I used a pre-defined resource group, availability set, and virtual network. kola creates its own resource group for public IPs and storage accounts, which are then used for the VMs. The issue I am encountering appears to be caused by combining custom availability sets / virtual networks with using managed identities.

Are there recommended settings for user managed identities we should document?

@t-lo
Copy link
Member

t-lo commented Mar 4, 2025

I'm making progress with testing, might have a suitable config soon to document the Azure environment this feature can be used in.

Meanwhile I ran into this; might be caused by the Azure bindings update?

2025-03-04T04:01:14Z platform/api/azure: Creating ResourceGroup kola-cluster-751da5660a
2025-03-04T04:01:16Z platform/api/azure: Creating StorageAccount kolasa25f997b58b
panic: unimplemented case in CopyKeys (conf.Empty not supported on purpose

goroutine 16304 [running]:
github.com/flatcar/mantle/platform/conf.(*Conf).CopyKeys(0x47207d?, {0xc000597330?, 0x1?, 0xc000b470b0?})
        /usr/src/mantle/platform/conf/conf.go:1393 +0x107
github.com/flatcar/mantle/platform/conf.(*UserData).Render(0xc0014af440, {0x2fd9d26, 0x5})
        /usr/src/mantle/platform/conf/conf.go:402 +0x39a
github.com/flatcar/mantle/platform.(*BaseCluster).RenderUserData(0xc00075d140, 0xc000071b40?, 0xc00078ef20?)
        /usr/src/mantle/platform/cluster.go:169 +0x213
github.com/flatcar/mantle/platform/machine/azure.(*cluster).NewMachine(0xc000634690, 0xc000071b40)
        /usr/src/mantle/platform/machine/azure/cluster.go:44 +0xf7
github.com/flatcar/mantle/platform.NewMachines.func1()
        /usr/src/mantle/platform/platform.go:308 +0x72
created by github.com/flatcar/mantle/platform.NewMachines in goroutine 338
        /usr/src/mantle/platform/platform.go:306 +0xb5

I was running with --keys.

@t-lo
Copy link
Member

t-lo commented Mar 5, 2025

Rebased this branch to latest flatcar-master to include #585.

Copy link
Member

@t-lo t-lo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this can be merged. Using the machine identity works, and I think the earlier issues I encountered when testing this are related to Azure instead of this PR.

LGTM, thank you Jeremi!

Copy link
Member

@t-lo t-lo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to revert my approval, a new run of this PR again triggered a stack trace:

panic: unimplemented case in CopyKeys (conf.Empty not supported on purpose

goroutine 10631 [running]:
github.com/flatcar/mantle/platform/conf.(*Conf).CopyKeys(0x7faf5ac3a9a8?, {0xc00008def8?, 0x47207d?, 0x0?})
        /usr/src/mantle/platform/conf/conf.go:1393 +0x107
github.com/flatcar/mantle/platform/conf.(*UserData).Render(0xc000943c80, {0x2efa182, 0x5})
        /usr/src/mantle/platform/conf/conf.go:402 +0x39a
github.com/flatcar/mantle/platform.(*BaseCluster).RenderUserData(0xc0001c7f00, 0xc0003f9980?, 0xc000a82f20?)
        /usr/src/mantle/platform/cluster.go:169 +0x213
github.com/flatcar/mantle/platform/machine/azure.(*cluster).NewMachine(0xc000749500, 0xc0003f9980)
        /usr/src/mantle/platform/machine/azure/cluster.go:45 +0xf7
github.com/flatcar/mantle/platform.NewMachines.func1()
        /usr/src/mantle/platform/platform.go:308 +0x72
created by github.com/flatcar/mantle/platform.NewMachines in goroutine 337
        /usr/src/mantle/platform/platform.go:306 +0xb5

@jepio
Copy link
Member Author

jepio commented Mar 5, 2025

Need to revert my approval, a new run of this PR again triggered a stack trace:

panic: unimplemented case in CopyKeys (conf.Empty not supported on purpose

goroutine 10631 [running]:
github.com/flatcar/mantle/platform/conf.(*Conf).CopyKeys(0x7faf5ac3a9a8?, {0xc00008def8?, 0x47207d?, 0x0?})
        /usr/src/mantle/platform/conf/conf.go:1393 +0x107
github.com/flatcar/mantle/platform/conf.(*UserData).Render(0xc000943c80, {0x2efa182, 0x5})
        /usr/src/mantle/platform/conf/conf.go:402 +0x39a
github.com/flatcar/mantle/platform.(*BaseCluster).RenderUserData(0xc0001c7f00, 0xc0003f9980?, 0xc000a82f20?)
        /usr/src/mantle/platform/cluster.go:169 +0x213
github.com/flatcar/mantle/platform/machine/azure.(*cluster).NewMachine(0xc000749500, 0xc0003f9980)
        /usr/src/mantle/platform/machine/azure/cluster.go:45 +0xf7
github.com/flatcar/mantle/platform.NewMachines.func1()
        /usr/src/mantle/platform/platform.go:308 +0x72
created by github.com/flatcar/mantle/platform.NewMachines in goroutine 337
        /usr/src/mantle/platform/platform.go:306 +0xb5

What arguments do you pass to trigger that?

@jepio
Copy link
Member Author

jepio commented Mar 5, 2025

and can you confirm that your ignition config is v3.4?

@t-lo
Copy link
Member

t-lo commented Mar 5, 2025

Need to revert my approval, a new run of this PR again triggered a stack trace:

[...]

What arguments do you pass to trigger that?

I'm running the full set of kola tests on x86-64 targets (test VMs) in a pre-defined availability set / subnet / vnet. Kola is run from an Azure VM and uses a user managed identity with subscription-level Contributor and Storage Blob Data Contributor roles assigned. It eventually runs into the above issue, it doesn't reproduce right away.

The kola command (sensitive parts redacted):

kola run --platform azure \
         --board amd64-usr \
         --basename XXXXX  \
         --azure-availability-set XXXXX \
         --azure-resource-group XXXXX \
         --azure-vnet-subnet-name XXXXX\
         --azure-use-private-ips \
         --azure-location XXXXX \
         --azure-hyper-v-generation "V2" \
         --azure-publisher kinvolk \
         --azure-offer flatcar-container-linux-corevm-amd64 \
         --azure-sku alpha-gen2 \
         --azure-version latest \
         --azure-size Standard_D2s_v6 \
         --parallel 1 \
         --log-level DEBUG \
         --keys \
         --tapfile run.tap \
         '*'

@jepio jepio requested a review from Copilot March 5, 2025 17:40
@jepio
Copy link
Member Author

jepio commented Mar 5, 2025

Ok @t-lo, try again. The --keys option should now work and I switched to managed boot diagnostics so you shouldn't need that "Storage Blob Data Contributor" when testing marketplace images (though it will be needed when uploading images).

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR adds support for assigning a managed identity to an Azure VM and switches the boot diagnostics to use managed storage, along with deduplication for SSH keys in ignition v3. Key changes include:

  • Integrating support for managed identities by adding a new flag and API lookup for a user-assigned identity.
  • Adjusting VM creation parameters to include the managed identity.
  • Updating boot diagnostics and SSH key deduplication logic.

Reviewed Changes

File Description
platform/api/azure/api.go Added msiClient setup and a new FindManagedIdentityID method.
platform/machine/azure/flight.go Integrated managed identity lookup into flight initialization.
cmd/kola/options.go Introduced new CLI flag for VM identity and SSH key deduplication.
util/retry.go Moved the sleep call in the retry loop to alter retry timings.
platform/api/azure/instance.go Updated getVMParameters, CreateInstance, and GetConsoleOutput to pass and use the managed identity and simplify console log fetching.
platform/cluster.go Updated SSH key handling in RenderUserData to account for new configuration.
platform/machine/azure/cluster.go Updated cluster struct and instance creation to include managed identity.
platform/machine/azure/machine.go Adjusted GetConsoleOutput usage based on updated API.

Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.

@t-lo
Copy link
Member

t-lo commented Mar 5, 2025

Ok @t-lo, try again. The --keys option should now work and I switched to managed boot diagnostics so you shouldn't need that "Storage Blob Data Contributor" when testing marketplace images (though it will be needed when uploading images).

Thanks for the quick turnaround! I'll run a new bunch of tests overnight and will check back tomorrow.

Copy link
Member

@t-lo t-lo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things look good with the last change; I don't see the ssh key error stacktrace anymore.

@jepio
Copy link
Member Author

jepio commented Mar 6, 2025

@tormath1 would you give this a review?

Copy link
Contributor

@tormath1 tormath1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good that we can remove the storage account dependency. LGTM only small nits.

@t-lo
Copy link
Member

t-lo commented Mar 6, 2025

Good that we can remove the storage account dependency. LGTM only small nits.

Unfortunately I need it on the orchestrator VM anyway, for other reasons. It would be nice to only operate on resource group scope, i.e. make kola create and delete NIC, storage account, and public IP resources in the existing RG. But I guess that's a larger change.

I'm fine with the general state of things for now, this is super useful!

jepio added 6 commits March 7, 2025 11:59
For access to Azure resources.

Signed-off-by: Jeremi Piotrowski <[email protected]>
Signed-off-by: Jeremi Piotrowski <[email protected]>
This does not require that the user have RBAC permissions to a storage account
to fetch, because it uses SAS keys behind the scenes. The previous approach
used a kola created storage account has Shared Key Access disabled for security
reasons.

Signed-off-by: Jeremi Piotrowski <[email protected]>
We switched to managed boot diagnostics, so we no longer require a separate
storage account for that purpose.  We still need a storage account when
uploading images.

Signed-off-by: Jeremi Piotrowski <[email protected]>
Ignition v3 does not tolerate duplicate ssh keys, which we will often get if we
fetch them from ssh-agent and from the filesystem. Implement key deduplication
in GetSSHKeys(), which is called when `-t` is passed.

Signed-off-by: Jeremi Piotrowski <[email protected]>
This way if the condition is true in the first iteration we exit immediately.

Signed-off-by: Jeremi Piotrowski <[email protected]>
@jepio jepio requested a review from Copilot March 7, 2025 11:03

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR adds support for assigning a managed identity to Azure VMs along with related improvements in instance creation and boot diagnostics. Key changes include:

  • Introducing a new VM identity option and updating API clients to use managed identity lookups.
  • Propagating the managed identity through the flight, cluster, and instance creation workflows.
  • Adjusting SSH key deduplication logic and revising the boot diagnostics retrieval via HTTP.

Reviewed Changes

File Description
platform/api/azure/api.go Added managed identity client setup and a lookup function to find identity by name.
cmd/kola/options.go Introduced new CLI option for VM Identity and updated SSH key deduplication logic.
platform/machine/azure/flight.go Updated flight setup to resolve and pass the managed identity to the instance.
util/retry.go Revised the placement of the sleep call in WaitUntilReady.
platform/api/azure/instance.go Updated VM parameter and instance creation functions to handle the managed identity.
platform/cluster.go Modified user-data SSH key addition to respect the NoSSHKeyInUserData flag.
platform/machine/azure/cluster.go Added a ManagedIdentityID field to the cluster struct and passed it during instance creation.
platform/machine/azure/machine.go Updated the GetConsoleOutput invocation to remove the storage account parameter.

Copilot reviewed 29 out of 29 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

cmd/kola/options.go:126

  • There is a typo in the option description: "availibity" should be corrected to "availability".
sv(&kola.AzureOptions.AvailabilitySet, "azure-availability-set", "", "Deploy resources in an existing availibity set")

util/retry.go:69

  • The repositioning of the time.Sleep call within the WaitUntilReady loop changes the timing behavior (sleeping after checkFunction instead of before). Verify that this revised delay logic meets the intended retry semantics.
time.Sleep(delay)
@jepio jepio merged commit b0583ce into flatcar-master Mar 7, 2025
2 checks passed
@jepio
Copy link
Member Author

jepio commented Mar 7, 2025

Thanks @tormath1 :)

@jepio jepio deleted the azure-vm-msi branch March 7, 2025 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants