feat: Update cloud-init customization #11

Merged 2 commits on Oct 18, 2023
129 changes: 68 additions & 61 deletions controllers/cluster_scripts/cloud_init.tmpl
@@ -3,80 +3,87 @@ users:
- name: root
lock_passwd: false
write_files:
- path: /root/ {{- if .ControlPlane -}} control_plane {{- else -}} node {{- end -}} .sh
# On first boot, cloud-init writes all files defined in userdata. At the same time,
# VMware Guest Customization configures networking, and reboots the machine when it is done.
# Any files in /run are not preserved. We need cloud-init to fetch userdata and write the
# files again. We clear the cloud-init cache, and reboot. Cloud-init thinks it is the
# first boot, and fetches the userdata, and writes the files.
- path: /root/replace-userdata-files.sh
owner: root
content: |
#!/usr/bin/env bash
catch() {
vmtoolsd --cmd "info-set guestinfo.post_customization_script_execution_status $?"
ERROR_MESSAGE="$(date) $(caller): $BASH_COMMAND"
echo "$ERROR_MESSAGE" &>> /var/log/capvcd/customization/error.log
if [[ -s /root/kubeadm.err ]]
then
KUBEADM_FAILURE=$(cat /root/kubeadm.err)
ERROR_MESSAGE="$ERROR_MESSAGE $KUBEADM_FAILURE"
fi
vmtoolsd --cmd "info-set guestinfo.post_customization_script_execution_failure_reason $ERROR_MESSAGE"
function _log() {
echo "$(date -u +"%Y-%m-%d %H:%M:%S") $@" >> /var/log/capvcd/replace-userdata-files.log
}
mkdir -p /var/log/capvcd/customization
trap 'catch $? $LINENO' ERR EXIT
set -eEx

echo "$(date) Post Customization script execution in progress" &>> /var/log/capvcd/customization/status.log {{- if .ControlPlane }}
mkdir -p /var/log/capvcd

VCLOUD_BASIC_AUTH_PATH=/root/vcloud-basic-auth.yaml
VCLOUD_CONFIGMAP_PATH=/root/vcloud-configmap.yaml
VCLOUD_CCM_PATH=/root/cloud-director-ccm.yaml
VCLOUD_CSI_CONFIGMAP_PATH=/root/vcloud-csi-configmap.yaml
CSI_DRIVER_PATH=/root/csi-driver.yaml
CSI_CONTROLLER_PATH=/root/csi-controller.yaml
CSI_NODE_PATH=/root/csi-node.yaml {{- end }}
_log "Checking for kubeadm configuration file"
if [ -f /run/kubeadm/kubeadm.yaml ] || [ -f /run/kubeadm/kubeadm-join-config.yaml ]; then
_log "kubeadm configuration file found, exiting"
exit 0
fi
_log "kubeadm configuration file not found, cleaning cloud-init cache and rebooting"
cloud-init clean
reboot
- path: /root/bootstrap.sh
owner: root
content: |
#!/usr/bin/env bash

vmtoolsd --cmd "info-set guestinfo.postcustomization.networkconfiguration.status in_progress"
hostname "{{ .MachineName }}"
echo "::1 ipv6-localhost ipv6-loopback" >/etc/hosts
echo "127.0.0.1 localhost" >>/etc/hosts
echo "{{ .MachineName }}" >/etc/hostname
echo "127.0.0.1" `hostname` >>/etc/hosts
vmtoolsd --cmd "info-set guestinfo.postcustomization.networkconfiguration.status successful"
mkdir -p /var/log/capvcd
(
# Prefix timestamp to commands in trace output.
PS4='$(date -u +"%Y-%m-%d %H:%M:%S")\011'
set -o xtrace

vmtoolsd --cmd "info-set guestinfo.metering.status in_progress"
vmtoolsd --cmd "info-set guestinfo.metering.status successful"
# Exit on the first error. Does not apply to command lists, or pipelines.
set -o errexit

vmtoolsd --cmd "info-set guestinfo.postcustomization.proxy.setting.status in_progress"
vmtoolsd --cmd "info-set guestinfo.postcustomization.proxy.setting.status successful"
# Our images do not require any network customization,
# but CAPVCD requires a successful status to finish bootstrapping.
vmtoolsd --cmd "info-set guestinfo.postcustomization.networkconfiguration.status successful"

vmtoolsd --cmd "info-set {{ if .ControlPlane -}} guestinfo.postcustomization.kubeinit.status {{- else -}} guestinfo.postcustomization.kubeadm.node.join.status {{- end }} in_progress"
{{ .BootstrapRunCmd }}
if [[ ! -f /run/cluster-api/bootstrap-success.complete ]]
then
echo "file /run/cluster-api/bootstrap-success.complete not found" &>> /var/log/capvcd/customization/error.log
exit 1
fi
vmtoolsd --cmd "info-set {{ if .ControlPlane -}} guestinfo.postcustomization.kubeinit.status {{- else -}} guestinfo.postcustomization.kubeadm.node.join.status {{- end }} successful"
# Our images do not ship the VCD metering service,
# but CAPVCD requires a successful status to finish bootstrapping.
vmtoolsd --cmd "info-set guestinfo.metering.status successful"

vmtoolsd --cmd "info-set {{ if .ControlPlane -}} guestinfo.postcustomization.kubeinit.status {{- else -}} guestinfo.postcustomization.kubeadm.node.join.status {{- end }} in_progress"

# Run the preKubeadmCommands, and then kubeadm itself.
{{ .BootstrapRunCmd }}

# Kubeadm is the first command in a bash "list of commands," and its failure
# does not cause this subshell to exit. Therefore, we check the "sentinel" also created
# in the "list of commands," and exit if it is missing.
if [[ ! -f /run/cluster-api/bootstrap-success.complete ]]; then
echo "file /run/cluster-api/bootstrap-success.complete not found"
exit 1
fi

vmtoolsd --cmd "info-set {{ if .ControlPlane -}} guestinfo.postcustomization.kubeinit.status {{- else -}} guestinfo.postcustomization.kubeadm.node.join.status {{- end }} successful"

exit 0
) &>> /var/log/capvcd/bootstrap.log
bootstrap_exit_code=$?

# Write the exit code to the VM metadata.
vmtoolsd --cmd "info-set guestinfo.post_customization_script_execution_status $bootstrap_exit_code"

# Use the last lines of the bootstrap log to give context about any failure.
TAIL_LOG="$(tail --lines=10 /var/log/capvcd/bootstrap.log)"
vmtoolsd --cmd "info-set guestinfo.post_customization_script_execution_failure_reason $TAIL_LOG"

echo "$(date) post customization script execution completed" &>> /var/log/capvcd/customization/status.log
exit 0
# Write cloud-init output for additional context.
vmtoolsd --cmd "info-set guestinfo.post_customization_cloud_init_output $(</var/log/cloud-init-output.log)"
runcmd:
- 'sudo cloud-init clean --seed --logs'
- 'sudo cat /dev/null > /var/log/cloud-init-output.log'
{{ if .ControlPlane }}
- '[ ! -f /run/kubeadm/konvoy-set-kube-proxy-configuration.sh] && sudo reboot'
Collaborator:
Were you able to boot the VM and create a cluster after removing this file? I remember that pre-kubeadm commands were failing if I removed them. I will test it out later to confirm.

Collaborator (Author):
Good question!

You're right that any preKubeadmCommand that requires an ordinary file in /run will fail after a reboot, because ordinary files in /run do not persist across reboots. We recently (in https://github.com/mesosphere/konvoy2/pull/2337) moved all patch scripts from /run to /etc for this reason.
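As an aside, the reason files in /run do not survive a reboot is that /run is mounted as tmpfs, a RAM-backed filesystem. A quick way to check where a path really lives, sketched with a hypothetical `fs_type` helper (GNU stat; not part of the template):

```shell
# fs_type is a hypothetical helper; stat -f reports filesystem (not file)
# status, and %T prints the filesystem type name, e.g. "tmpfs" for /run
# on most Linux distributions.
fs_type() {
  stat -f -c %T "$1" 2>/dev/null || echo unknown
}

fs_type /run   # usually "tmpfs": contents are lost on reboot
fs_type /etc   # usually a disk-backed filesystem: contents persist
```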

Your question made me wonder about the two reboot calls left in this template:

{{ if .ControlPlane }}
- '[ ! -f /root/control_plane.sh ] && sudo reboot'
- '[ ! -f /run/kubeadm/kubeadm.yaml ] && sudo reboot'
- bash /root/control_plane.sh
{{ else }}
- '[ ! -f /root/node.sh ] && sudo reboot'
- '[ ! -f /run/kubeadm/kubeadm-join-config.yaml ] && sudo reboot'
- bash /root/node.sh
{{ end }}

It seems harmless to reboot if the kubeadm config (/run/kubeadm/kubeadm.yaml or /run/kubeadm/kubeadm-join-config.yaml) is not there (yet?).

But if we reboot because the bootstrap script (/root/control_plane.sh or /root/node.sh) is missing, and the kubeadm config happens to already be present, we will lose the kubeadm config after the reboot, leading to further reboots, without end.
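To make the ordering hazard concrete, here is a sketch of the problematic state (paths as in the template above; purely illustrative, not part of the change):

```shell
# Illustrative only: if the bootstrap script is missing while the kubeadm
# config is present, a reboot discards the config (it lives on /run, a
# tmpfs), so after the reboot the config check also fails and triggers
# another reboot, and so on without end.
bootstrap_script=/root/control_plane.sh    # /root/node.sh on workers
kubeadm_config=/run/kubeadm/kubeadm.yaml   # kubeadm-join-config.yaml on workers

if [ ! -f "$bootstrap_script" ] && [ -f "$kubeadm_config" ]; then
  echo "a reboot here would lose $kubeadm_config and start an endless loop"
fi
```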

At this time (66ecf82), I can successfully reboot either a control plane or a worker machine.

I think it may be better to remove the remaining reboot calls. I will experiment.

Collaborator (Author):

I've added a comment that explains why the reboot call is necessary. I've also moved these checks out to their own script, and use a separate log file to keep track.

This is what the log looks like:

# cat /var/log/capvcd/replace-userdata-files.log
2023-10-17 22:07:37 Checking for kubeadm configuration file
2023-10-17 22:07:37 kubeadm configuration file not found, cleaning cloud-init cache and rebooting
2023-10-17 22:08:12 Checking for kubeadm configuration file
2023-10-17 22:08:12 kubeadm configuration file found, exiting

Collaborator (@supershal), Oct 17, 2023:

Pretty clever. Just reiterating the logic: until the user-data is available, the VM will reboot, and cloud-init will run as if it's the first boot. Once the user-data is available, bootstrap.sh will run kubeadm init/join.

Collaborator (Author):

> until the user-data is available, the VM will reboot and cloud-init will run as if it's the first boot. Once the user-data is available, bootstrap.sh will run kubeadm init/join.

Correct. The upstream cloud-init template also did this, but used inline commands instead of a script, and the reasoning behind it was not documented.

- '[ ! -f /run/konvoy/containerd-apply-patches.sh] && sudo reboot'
- '[ ! -f /run/konvoy/restart-containerd-and-wait.sh] && sudo reboot'
- '[ ! -f /root/control_plane.sh ] && sudo reboot'
- '[ ! -f /run/kubeadm/kubeadm.yaml ] && sudo reboot'
- bash /root/control_plane.sh
{{ else }}
- '[ ! -f /root/node.sh ] && sudo reboot'
- '[ ! -f /run/kubeadm/kubeadm-join-config.yaml ] && sudo reboot'
- bash /root/node.sh
{{ end }}
- bash /root/replace-userdata-files.sh
- bash /root/bootstrap.sh
timezone: UTC
disable_root: false
disable_vmware_customization: true
network:
config: disabled
# Ensure we have an IPv4 address for localhost
manage_etc_hosts: localhost
# Ensure that cloud-init can override the hostname.
preserve_hostname: false
hostname: "{{ .MachineName }}"
final_message: "The system is ready after $UPTIME seconds"