
Commit

Merge branch 'cameissner/gc' of https://github.com/Azure/AgentBaker into cameissner/gc
Cameron Meissner committed Dec 3, 2024
2 parents 2b346e6 + e68d55d commit fb13274
Showing 50 changed files with 2,251 additions and 1,543 deletions.
7 changes: 4 additions & 3 deletions CODEOWNERS
@@ -1,4 +1,5 @@
* @juan-lee @cameronmeissner @UtheMan @ganeshkumarashok @anujmaheshwari1 @AlisonB319 @Devinwong @lilypan26 @ShiqianTao @AbelHu @junjiezhang1997 @jason1028kr @djsly @phealy @r2k1 @timmy-wright
* @juan-lee @cameronmeissner @UtheMan @ganeshkumarashok @anujmaheshwari1 @AlisonB319 @Devinwong @lilypan26 @AbelHu @junjiezhang1997 @jason1028kr @djsly @phealy @r2k1 @timmy-wright

# Code owners for cse_cmd.sh. This is to ensure that the scriptless v-team is aware of the changes in order to sync with AKSNodeConfig.
cse_cmd.sh @Devinwong @lilypan26 @r2k1 @timmy-wright
# Code owners for cse_cmd.sh and nodecustomdata.yml. This is to ensure that the scriptless v-team is aware of the changes in order to sync with AKSNodeConfig.
cse_cmd.sh @Devinwong @lilypan26 @r2k1 @timmy-wright
nodecustomdata.yml @Devinwong @lilypan26 @r2k1 @timmy-wright
7 changes: 5 additions & 2 deletions aks-node-controller/proto/README.md
@@ -109,8 +109,11 @@ Nevertheless, it’s not a big harm to use `optional` even though it’s not needed.
1. Update corresponding .proto files to the data contract. Usually we start with `config.proto`.
1. At `AgentBaker` root level, run command `make compile-proto-files` to compile the .proto files to .go files. At this point, we updated the data contract.
1. Tell the VHD how to react to this new variable by updating the bootstrap scripts as you did before. Basically you will be modifying shell scripts like `install-dependencies.sh`, `cse_install.sh`, `cse_helpers.sh`, etc. You may also want to add unit tests to spec files like `cse_install_spec.sh`, `cse_helpers.sh` to catch bugs earlier.
1. On the VHD side, we are still invoking the bootstrap scripts under the hood. So we will need to update cse_cmd.sh.gtpl to set the environment variables of the CSE trigger command. Note: Node SIG is working on migrating all scripts to manageable Go binary. Before it's done, the bootstrap scripts will still be used.
1. Set default values for your variables, if the existing defaulting provided by `proto3` doesn't fit your purpose. For example, if a bool variable is not set, `proto` will default it to `false`. However, if you want to default it to `true`, then you can set your own default function. `getDisableSSH` in `cse_cmd.sh.gtpl` is 1 example.
1. On the VHD side, we still invoke the bootstrap scripts under the hood. To set the environment variables of the CSE trigger command, add the desired variable to `getCSEEnv()` in [parser.go](https://github.com/Azure/AgentBaker/blob/dev/aks-node-controller/parser/parser.go) (see the first sketch after this list). If you need to add a corresponding file to the VHD, generate the file in the bootstrap scripts rather than adding it to [`nodecustomdata.yml`](https://github.com/Azure/AgentBaker/blob/dev/parts/linux/cloud-init/nodecustomdata.yml), as that file will eventually be deprecated. Here is an [example](https://github.com/Azure/AgentBaker/commit/81ce18fb7f53acab3c7fe8f3a70b635baf1f2f52) of generating the kube CA cert.

Note: Node SIG is working on migrating all scripts to a manageable Go binary. Until that migration is done, the bootstrap scripts will still be used.

1. Set default values for your variables if the defaulting provided by `proto3` doesn't fit your purpose. For example, if a bool variable is not set, `proto3` defaults it to `false`; if you want it to default to `true`, you can write your own defaulting function. `getDisableSSH` in [helper.go](https://github.com/Azure/AgentBaker/blob/dev/aks-node-controller/parser/helper.go) is one example (see the second sketch after this list).
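
As a rough illustration of the `getCSEEnv()` step above, here is a minimal, hypothetical Go sketch. The `Configuration` stand-in type and the `EnableFeatureX` field are assumptions for illustration only; the real generated types and the actual `getCSEEnv()` signature live in the aks-node-controller code.

```go
package main

import "fmt"

// Stand-in for the generated aksnodeconfig Configuration type; the real
// fields come from the compiled .proto output.
type Configuration struct {
	EnableFeatureX bool // hypothetical new field added via config.proto
}

// getCSEEnv-style mapping (sketch): each config field is exposed to the
// bootstrap scripts as an environment variable on the CSE trigger command.
func getCSEEnv(cfg *Configuration) map[string]string {
	env := map[string]string{
		// ...existing variables would be populated here...
	}
	env["ENABLE_FEATURE_X"] = fmt.Sprintf("%t", cfg.EnableFeatureX)
	return env
}

func main() {
	env := getCSEEnv(&Configuration{EnableFeatureX: true})
	fmt.Println(env["ENABLE_FEATURE_X"]) // prints "true"
}
```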

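For the defaulting step, here is a similarly hedged sketch of a helper in the spirit of `getDisableSSH`; the field name and the pointer-based handling of `optional` are assumptions, not the real contract:

```go
package main

import "fmt"

// Stand-in config: a proto3 `optional bool` compiles to a pointer, so nil
// means "not set by the caller".
type Configuration struct {
	DisableSsh *bool
}

// getDisableSSH-style helper (sketch): proto3 would default an unset bool to
// false, but the desired default here is true, so nil maps to true.
func getDisableSSH(cfg *Configuration) bool {
	if cfg == nil || cfg.DisableSsh == nil {
		return true // assumed default when the field is not set
	}
	return *cfg.DisableSsh
}

func main() {
	fmt.Println(getDisableSSH(&Configuration{})) // true: field unset
	off := false
	fmt.Println(getDisableSSH(&Configuration{DisableSsh: &off})) // false: explicitly set
}
```
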
## Detailed steps with example
Example: IMDSRestrictionConfig [Example PR](https://github.com/Azure/AgentBaker/pull/5154)
8 changes: 3 additions & 5 deletions e2e/.env.sample
@@ -3,9 +3,7 @@
# To use this file for e2e tests, copy it and rename it to .env:
# cp .env.sample .env
# The .env file is included in .gitignore to prevent it from being committed to version control.
TAGS_TO_RUN="os=ubuntu,arch=amd64,wasm=false,gpu=false,imagename=1804gen2containerd"
# check config/config.go for all available environment variables
TAGS_TO_SKIP="gpu=true,os=windows"
SIG_VERSION_TAG_NAME=buildId
SIG_VERSION_TAG_VALUE=123456
BUILD_ID=local
KEEP_VMSS=false
LOCATION=westus3
SIG_VERSION_TAG_VALUE=123456
2 changes: 1 addition & 1 deletion e2e/aks_model.go
@@ -255,7 +255,7 @@ func addCacheRuelsToPrivateAzureContainerRegistry(ctx context.Context, t *testin
cacheParams := armcontainerregistry.CacheRule{
Properties: &armcontainerregistry.CacheRuleProperties{
SourceRepository: to.Ptr("mcr.microsoft.com/*"),
TargetRepository: to.Ptr("aks/*"),
TargetRepository: to.Ptr("*"),
},
}
cacheCreateResp, err := config.Azure.CacheRulesClient.BeginCreate(
94 changes: 35 additions & 59 deletions e2e/cluster.go
@@ -11,9 +11,7 @@ import (
"testing"
"time"

aksnodeconfigv1 "github.com/Azure/agentbaker/aks-node-controller/pkg/gen/aksnodeconfig/v1"
"github.com/Azure/agentbaker/e2e/config"
"github.com/Azure/agentbaker/pkg/agent/datamodel"
"github.com/Azure/azure-sdk-for-go/sdk/azcore"
"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v6"
@@ -36,12 +34,11 @@
)

type Cluster struct {
Model *armcontainerservice.ManagedCluster
Kube *Kubeclient
SubnetID string
NodeBootstrappingConfiguration *datamodel.NodeBootstrappingConfiguration
AKSNodeConfig *aksnodeconfigv1.Configuration
Maintenance *armcontainerservice.MaintenanceConfiguration
Model *armcontainerservice.ManagedCluster
Kube *Kubeclient
SubnetID string
ClusterParams *ClusterParams
Maintenance *armcontainerservice.MaintenanceConfiguration
}

// Returns true if the cluster is configured with Azure CNI
@@ -129,15 +126,12 @@ func prepareCluster(ctx context.Context, t *testing.T, cluster *armcontainerserv
return nil, fmt.Errorf("ensure debug daemonsets for %q: %w", *cluster.Name, err)
}

nbc := getBaseNodeBootstrappingConfiguration(ctx, t, kube)

return &Cluster{
Model: cluster,
Kube: kube,
SubnetID: subnetID,
NodeBootstrappingConfiguration: nbc,
Maintenance: maintenance,
AKSNodeConfig: nbcToAKSNodeConfigV1(nbc), // TODO: replace with base template
Model: cluster,
Kube: kube,
SubnetID: subnetID,
Maintenance: maintenance,
ClusterParams: extractClusterParameters(ctx, t, kube),
}, nil
}

@@ -159,7 +153,7 @@ func getOrCreateCluster(ctx context.Context, t *testing.T, cluster *armcontainer
existingCluster, err := config.Azure.AKS.Get(ctx, config.ResourceGroupName, *cluster.Name, nil)
var azErr *azcore.ResponseError
if errors.As(err, &azErr) && azErr.StatusCode == 404 {
return createNewAKSClusterWithRetry(ctx, t, cluster, config.ResourceGroupName)
return createNewAKSClusterWithRetry(ctx, t, cluster)
}
if err != nil {
return nil, fmt.Errorf("failed to get cluster %q: %w", *cluster.Name, err)
@@ -175,17 +169,26 @@ func getOrCreateCluster(ctx context.Context, t *testing.T, cluster *armcontainer
// we need to recreate in the case where the cluster is in the "Succeeded" provisioning state,
// though its corresponding node resource group has been garbage collected
t.Logf("node resource group of cluster %s does not exist, will attempt to recreate", *cluster.Name)
return createNewAKSClusterWithRetry(ctx, t, cluster, config.ResourceGroupName)
return createNewAKSClusterWithRetry(ctx, t, cluster)
}
return &existingCluster.ManagedCluster, nil
case "Creating", "Updating":
return waitUntilClusterReady(ctx, config.ResourceGroupName, *cluster.Name)
return waitUntilClusterReady(ctx, *cluster.Name)
default:
// this operation will try to update the cluster if it's in a failed state
return createNewAKSClusterWithRetry(ctx, t, cluster, config.ResourceGroupName)
return createNewAKSClusterWithRetry(ctx, t, cluster)
}
}

func isExistingResourceGroup(ctx context.Context, resourceGroupName string) (bool, error) {
rgExistence, err := config.Azure.ResourceGroup.CheckExistence(ctx, resourceGroupName, nil)
if err != nil {
return false, fmt.Errorf("failed to get RG %q: %w", resourceGroupName, err)
}

return rgExistence.Success, nil
}

func createNewAKSCluster(ctx context.Context, t *testing.T, cluster *armcontainerservice.ManagedCluster) (*armcontainerservice.ManagedCluster, error) {
t.Logf("creating or updating cluster %s in rg %s\n", *cluster.Name, *cluster.Location)
// Note: it seems the operation can still trigger a new operation even if nothing has changed
@@ -212,12 +215,12 @@ func createNewAKSCluster(ctx context.Context, t *testing.T, cluster *armcontaine
// that retries creating a cluster if it fails with a 409 Conflict error
// clusters are reused, and sometimes a cluster can be in UPDATING or DELETING state
// simple retry should be sufficient to avoid such conflicts
func createNewAKSClusterWithRetry(ctx context.Context, t *testing.T, cluster *armcontainerservice.ManagedCluster, resourceGroup string) (*armcontainerservice.ManagedCluster, error) {
func createNewAKSClusterWithRetry(ctx context.Context, t *testing.T, cluster *armcontainerservice.ManagedCluster) (*armcontainerservice.ManagedCluster, error) {
maxRetries := 10
retryInterval := 30 * time.Second
var lastErr error
for attempt := 0; attempt < maxRetries; attempt++ {
t.Logf("Attempt %d: creating or updating cluster %s in region %s and rg %s\n", attempt+1, *cluster.Name, *cluster.Location, resourceGroup)
t.Logf("Attempt %d: creating or updating cluster %s in region %s and rg %s\n", attempt+1, *cluster.Name, *cluster.Location, config.ResourceGroupName)

createdCluster, err := createNewAKSCluster(ctx, t, cluster)
if err == nil {
@@ -348,45 +351,18 @@ func collectGarbageVMSS(ctx context.Context, t *testing.T, cluster *armcontainer
return nil
}

func isExistingResourceGroup(ctx context.Context, resourceGroupName string) (bool, error) {
rgExistence, err := config.Azure.ResourceGroup.CheckExistence(ctx, resourceGroupName, nil)
if err != nil {
return false, fmt.Errorf("failed to get RG %q: %w", resourceGroupName, err)
}

return rgExistence.Success, nil
}

var rgOnce sync.Once

func ensureResourceGroupOnce(ctx context.Context) {
rgOnce.Do(func() {
err := ensureResourceGroup(ctx)
if err != nil {
panic(err)
}
})
}
func ensureResourceGroup(ctx context.Context) error {
rgExists, err := isExistingResourceGroup(ctx, config.ResourceGroupName)
if err != nil {
return err
}

if !rgExists {
_, err = config.Azure.ResourceGroup.CreateOrUpdate(
ctx,
config.ResourceGroupName,
armresources.ResourceGroup{
Location: to.Ptr(config.Config.Location),
Name: to.Ptr(config.ResourceGroupName),
},
nil)
_, err := config.Azure.ResourceGroup.CreateOrUpdate(
ctx,
config.ResourceGroupName,
armresources.ResourceGroup{
Location: to.Ptr(config.Config.Location),
Name: to.Ptr(config.ResourceGroupName),
},
nil)

if err != nil {
return fmt.Errorf("failed to create RG %q: %w", config.ResourceGroupName, err)
}
if err != nil {
return fmt.Errorf("failed to create RG %q: %w", config.ResourceGroupName, err)
}

return nil
}
