COS-3013: overlay node image before bootstrapping if necessary #8742


Merged 3 commits on Jan 18, 2025
@@ -0,0 +1,9 @@
#!/bin/bash
set -euo pipefail

UNIT_DIR="${1:-/tmp}"

if ! rpm -q openshift-clients &>/dev/null; then
Member:

@eranco74 @carbonin this is the proxy test for whether the current filesystem needs all of the "node layer" stuff to be layered on. Any concerns for the assisted-installer live ISO? We don't install this, right?

Member:

No, we don't install anything additional on the rootfs

ln -sf "/etc/systemd/system/node-image-overlay.target" \
"${UNIT_DIR}/default.target"
fi
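For illustration, the generator's branch-and-symlink logic can be exercised against a temporary directory. This is a sketch, not the PR's script: the `have_openshift_clients` stub stands in for the `rpm -q openshift-clients` check, and we assume the package is absent, as on a plain bootimage.

```shell
set -euo pipefail

UNIT_DIR="$(mktemp -d)"

# Stub for `rpm -q openshift-clients`; pretend the package is not installed.
have_openshift_clients() { return 1; }

if ! have_openshift_clients; then
  # Same effect as the generator: make node-image-overlay.target the boot
  # default by symlinking it over default.target in the early-boot unit dir.
  ln -sf "/etc/systemd/system/node-image-overlay.target" \
         "${UNIT_DIR}/default.target"
fi

readlink "${UNIT_DIR}/default.target"
```

At boot, systemd runs generators before loading units, so a `default.target` symlink dropped into the generator output directory wins over the image's default.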
@@ -0,0 +1,13 @@
# This is a separate unit because in the assisted-installer flow, we only want
# `node-image-overlay.service`, without isolating back to `multi-user.target`.

[Unit]
Description=Node Image Finish
Requires=node-image-overlay.service
After=node-image-overlay.service

[Service]
Type=oneshot
# and now, back to our regularly scheduled programming...
ExecStart=/usr/bin/echo "Node image overlay complete; switching back to multi-user.target"
ExecStart=/usr/bin/systemctl --no-block isolate multi-user.target
@@ -0,0 +1,9 @@
[Unit]
Description=Node Image Overlay
Requires=node-image-pull.service
After=node-image-pull.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/node-image-overlay.sh
RemainAfterExit=yes
@@ -0,0 +1,9 @@
[Unit]
Description=Node Image Overlay Target
Requires=basic.target

# for easier debugging
Requires=sshd.service getty.target systemd-user-sessions.service

Requires=node-image-overlay.service
Requires=node-image-finish.service
@@ -0,0 +1,14 @@
[Unit]
Description=Node Image Pull
Requires=network.target NetworkManager.service
After=network.target

[Service]
Type=oneshot
# we need to call ostree container (i.e. rpm-ostree), which has install_exec_t,
# but by default we'd run as unconfined_service_t, which is not allowed to make
# that transition. Relabel the script itself.
ExecStartPre=chcon --reference=/usr/bin/ostree /usr/local/bin/node-image-pull.sh
ExecStart=/usr/local/bin/node-image-pull.sh
MountFlags=slave
RemainAfterExit=yes
119 changes: 0 additions & 119 deletions data/data/bootstrap/files/usr/local/bin/bootstrap-pivot.sh.template

This file was deleted.

18 changes: 18 additions & 0 deletions data/data/bootstrap/files/usr/local/bin/node-image-overlay.sh
@@ -0,0 +1,18 @@
#!/bin/bash
set -euo pipefail

ostree_checkout=/ostree/repo/tmp/node-image
if [ ! -d "${ostree_checkout}" ]; then
ostree_checkout=/var/ostree-container/checkout
fi

echo "Overlaying node image content"

# keep /usr/lib/modules from the booted deployment for kernel modules
mount -o bind,ro "/usr/lib/modules" "${ostree_checkout}/usr/lib/modules"
mount -o rbind,ro "${ostree_checkout}/usr" /usr
rsync -a "${ostree_checkout}/usr/etc/" /etc

# reload the new policy
echo "Reloading SELinux policy"
semodule -R
Member:

Should we systemctl daemon-reexec here to be safe?

Member Author:

We could, but I'd rather wait and see what kinds of issues we hit before pre-emptively doing that. That may inform other things in this approach.

@@ -0,0 +1,87 @@
#!/bin/bash
set -euo pipefail

# shellcheck source=release-image.sh.template
. /usr/local/bin/release-image.sh

# yuck... this is a good argument for renaming the node image to just `node` in both OCP and OKD
coreos_img=rhel-coreos
{{ if .IsOKD }}
coreos_img=stream-coreos
{{ end }}
until COREOS_IMAGE=$(image_for ${coreos_img}); do
echo 'Failed to query release image; retrying...'
sleep 10
done
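The `until ... sleep 10` loop is the retry pattern these bootstrap scripts use for flaky network calls. A generic version might look like this (hypothetical `retry_until` helper, not part of the PR; `SLEEP_SECS` is overridable so the demo doesn't actually wait):

```shell
set -euo pipefail

# retry_until CMD...: re-run CMD until it succeeds, sleeping between attempts.
retry_until() {
  local sleep_secs="${SLEEP_SECS:-10}"
  until "$@"; do
    echo "Command failed; retrying in ${sleep_secs}s..." >&2
    sleep "$sleep_secs"
  done
}

# Demo: a command that fails twice before succeeding, tracked in a temp file.
attempts_file=$(mktemp)
echo 0 > "$attempts_file"
flaky() {
  local n
  n=$(<"$attempts_file")
  echo $((n + 1)) > "$attempts_file"
  [ "$n" -ge 2 ]
}

SLEEP_SECS=0 retry_until flaky
echo "succeeded after $(<"$attempts_file") attempts"
```

Because the command runs as an `until` condition, a nonzero exit doesn't trip `set -e`; the loop simply tries again.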

# need to use rpm-ostree here since `bootc status` doesn't work in the live ISO currently
# https://github.com/containers/bootc/issues/1043
booted_version=$(rpm-ostree status --json | jq -r .deployments[0].version)

echo "Currently on CoreOS version $booted_version"
echo "Target node image is $COREOS_IMAGE"

# try to do this in the system repo so we get hardlinks and the checkout is
# read-only, but fall back to using /var if we're in the live environment since
# there the system repo is truly read-only
ostree_repo=/ostree/repo
ostree_checkout="${ostree_repo}/tmp/node-image"
hardlink='-H'
if grep -q coreos.liveiso= /proc/cmdline; then
ostree_repo=/var/ostree-container/repo
ostree_checkout=/var/ostree-container/checkout
mkdir -p "${ostree_repo}"
echo "In live ISO; creating temporary repo to pull node image"
ostree init --mode=bare --repo="${ostree_repo}"
# if there are layers, import all the content in the system repo for
# layer-level deduping
if [ -d /ostree/repo/refs/heads/ostree/container ]; then
echo "Importing base content from system repo for deduplication"
ostree pull-local --repo="${ostree_repo}" /ostree/repo
fi
# but we won't be able to force hardlinks cross-device
hardlink=''
else
# (remember, we're MountFlags=slave)
mount -o rw,remount /sysroot
fi
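The branch above keys off `coreos.liveiso=` appearing in the kernel command line. A minimal, testable sketch of that selection, reading the cmdline from an argument rather than /proc/cmdline:

```shell
set -euo pipefail

# choose_repo CMDLINE: mirror of the script's live-ISO branch, parameterized
# on the kernel command line so it can be exercised without booting a live ISO.
choose_repo() {
  local cmdline="$1"
  if grep -q 'coreos.liveiso=' <<<"$cmdline"; then
    # live ISO: the system repo is read-only squashfs, use a repo under /var
    echo /var/ostree-container/repo
  else
    # installed system: pull into the system repo for hardlink dedup
    echo /ostree/repo
  fi
}

echo "live: $(choose_repo 'ignition.firstboot coreos.liveiso=rhcos-live')"
echo "disk: $(choose_repo 'root=UUID=abc rw')"
```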

# Use ostree stack to pull the container here. This gives us efficient
# downloading with layers we already have, and also handles SELinux.
echo "Pulling ${COREOS_IMAGE}"
while ! ostree container image pull --authfile "/root/.docker/config.json" \
Member:

My assumptions / understanding just to confirm:

  • the $COREOS_IMAGE is an entire "bootc image" with a node layer, a coreos layer, and a rhel layer, as described in the enhancement proposal.
  • at this point we're running on a basic RHEL image mode system, and we're using this command to retrieve $COREOS_IMAGE because we want to overlay its coreos layer and node layer on top of whatever RHEL we happen to be running right now.

Question: even though we only want the "node" and "coreos" layers, is it correct to assume that this command will pull the entire image including its rhel base image, unless that rhel base image happens to be the same as the one we're running at the moment?

Member Author:

> the $COREOS_IMAGE is an entire "bootc image" with a node layer, a coreos layer, and a rhel layer, as described in the enhancement proposal.

Yes, exactly!

> at this point we're running on a basic RHEL image mode system, and we're using this command to retrieve $COREOS_IMAGE because we want to overlay its coreos layer and node layer on top of whatever RHEL we happen to be running right now.

Nomenclature here gets messy, but basically I've been using:

  • RHEL CoreOS image: RHEL layer + CoreOS layer
  • node image: RHEL layer + CoreOS layer + node layer

The really confusing thing is that the node image in the payload is still called rhel-coreos, and I didn't have the stamina to also bundle as part of this enhancement renaming it to node, but ideally we eventually should rename so it's clearer.

Here, we're running on a RHEL CoreOS system (e.g. we still need Ignition for example). And the COREOS_IMAGE is the rhel-coreos component of the release payload (i.e. the node image).

> Question: even though we only want the "node" and "coreos" layers, is it correct to assume that this command will pull the entire image including its rhel base image, unless that rhel base image happens to be the same as the one we're running at the moment?

Think of the "layers" described in the enhancement as more conceptual than literal. In practice, the base RHCOS OCI images are chunked into many layers, grouping together files based on heuristics to try to maximize likelihood of sharing content. The node layer I guess is both conceptual and literal, since it all happens in one RUN invocation. So in the ideal case, we only pull that one node layer, which is currently around ~200M. The rest of the RHEL+CoreOS layers depend on how stale the bootimage is vs what we're installing (which I think more or less correlates with "how many z-stream releases have happened between the one containing this bootimage to the one we're installing").

Though I will note it'll be more inefficient than that at the start because the node image will still be built the old way while the bootimages have moved to the new model.

And of course, one can't not mention zstd:chunked here which should make it much easier to get efficiency than when playing with layers where we need to make sure everything is lined up just right (and notably that layer digests haven't been recalculated somewhere along the way in the pipelines).

Member Author:

> In practice, the base RHCOS OCI images are chunked into many layers, grouping together files based on heuristics to try to maximize likelihood of sharing content.

If you want an example of this BTW, you can look at the layers for any of the images in https://quay.io/repository/fedora/fedora-coreos?tab=tags.

"${ostree_repo}" ostree-unverified-image:docker://"${COREOS_IMAGE}"; do
echo 'Failed to fetch release image; retrying...'
sleep 10
done

# ideally, `ostree container image pull` would support `--write-ref` or a
# command to escape a pullspec, but for now it's pretty easy to tell which ref
# it is since it's the only docker one
ref=$(ostree refs --repo "${ostree_repo}" | grep ^ostree/container/image/docker)
if [ "$(echo "$ref" | wc -l)" != 1 ]; then
echo "Expected single docker ref, found:"
echo "$ref"
exit 1
fi
ostree refs --repo "${ostree_repo}" "$ref" --create coreos/node-image
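The uniqueness check can be exercised against a mocked `ostree refs` listing. This sketch adds an empty-result guard the original script doesn't need, since its preceding pull guarantees at least one docker ref exists (an `echo` of an empty string still counts as one line to `wc -l`):

```shell
set -euo pipefail

# pick_docker_ref: find exactly one ostree/container/image/docker ref in the
# ref listing supplied on stdin, failing loudly otherwise.
pick_docker_ref() {
  local ref
  ref=$(grep '^ostree/container/image/docker' || true)
  if [ -z "$ref" ] || [ "$(echo "$ref" | wc -l)" != 1 ]; then
    echo "Expected single docker ref, found:" >&2
    echo "$ref" >&2
    return 1
  fi
  echo "$ref"
}

# Mocked `ostree refs` output with one docker ref among other refs.
refs=$'coreos/node-image\nostree/container/image/docker/quay.io/example'
pick_docker_ref <<<"$refs"
```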

# massive hack to make ostree admin config-diff work in live ISO where /etc
# is actually on a separate mount and not the deployment root proper... should
# enhance libostree for this (remember, we're MountFlags=slave)
if grep -q coreos.liveiso= /proc/cmdline; then
mount -o bind,ro /etc /ostree/deploy/*/deploy/*/etc
fi

# get all state files in /etc; this is a cheap way to get "3-way /etc merge" semantics
etc_keep=$(ostree admin config-diff | cut -f5 -d' ' | sed -e 's,^,/usr/etc/,')
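The `cut -f5 -d' '` relies on `ostree admin config-diff` printing a one-letter status, four spaces, then the path relative to /etc; that format is an assumption of this sketch, inferred from the field index. Over mocked output:

```shell
set -euo pipefail

# Mocked `ostree admin config-diff` output: status letter, four spaces, path
# relative to /etc (format assumed from the cut -f5 usage in the script).
diff_output=$'M    chrony.conf\nA    hostname'

# Same pipeline as the script: take the path (5th space-separated field) and
# prefix /usr/etc/ so it can feed `ostree checkout --skip-list`.
etc_keep=$(cut -f5 -d' ' <<<"$diff_output" | sed -e 's,^,/usr/etc/,')
echo "$etc_keep"
```

Skipping these paths in the checkout means locally modified /etc files survive the later `rsync` of `usr/etc` over /etc, which is what gives the cheap 3-way-merge behavior.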

# check out the commit
echo "Checking out node image content"
ostree checkout --repo "${ostree_repo}" ${hardlink} coreos/node-image "${ostree_checkout}" --skip-list=<(cat <<< "$etc_keep")

# in the assisted-installer case, nuke the temporary repo to save RAM
Member:

How much RAM do we need to budget for this?

Contributor:

What is special about the assisted installer here? Does the assisted installer treat the bootstrap node separately? Perhaps this is mixed up with agent installer?

Member:

I assume it's because assisted-installer runs the "bootstrap in place" process from a live ISO, so the filesystem is in RAM.

Member Author:

> How much RAM do we need to budget for this?

Pretty much the size of the node image itself. I said ~1G when we discussed, but looking now, that's the compressed size. Uncompressed it's ~2.2G. I would add a healthy percent margin on top.

Worth mentioning I did test a SNO install with this in a 16G VM which is the minimum requirement. I'm not sure though how much headroom was left. I was actually pleasantly surprised when working on this to learn that bootstrapping needs less than 16G of container images. :)

Do you need more precise information?

> I assume it's because assisted-installer runs the "bootstrap in place" process from a live ISO, so the filesystem is in RAM.

Right, it's because we're booted from the ISO. We can't import into the system OSTree repo since it's read-only as part of the squashfs, so we have to create a separate OSTree repo and import there. That separate repo is in /var, which is in RAM.

if grep -q coreos.liveiso= /proc/cmdline; then
echo "Deleting temporary repo"
rm -rf "${ostree_repo}"
fi
@@ -2,9 +2,6 @@
 Description=Configure CRI-O to use the pause image
 After=release-image.service
 Requires=release-image.service
-{{if .IsOKD -}}
-Requires=release-image-pivot.service
-{{end -}}
 Before=crio.service

 [Service]
@@ -1,13 +1,7 @@
 [Unit]
 Description=Kubernetes Kubelet
 Wants=rpc-statd.service crio.service release-image.service
-{{if .IsOKD -}}
-Wants=release-image-pivot.service
-{{end -}}
 After=crio.service release-image.service
-{{if .IsOKD -}}
-After=release-image-pivot.service
-{{end -}}

 [Service]
 Type=notify

This file was deleted.

9 changes: 5 additions & 4 deletions pkg/asset/ignition/bootstrap/common.go
@@ -438,16 +438,17 @@ func AddStorageFiles(config *igntypes.Config, base string, uri string, templateD

 	var mode int
 	appendToFile := false
-	if parentDir == "bin" || parentDir == "dispatcher.d" {
+	switch {
+	case parentDir == "bin", parentDir == "dispatcher.d", parentDir == "system-generators":
 		mode = 0555
-	} else if filename == "motd" || filename == "containers.conf" {
+	case filename == "motd", filename == "containers.conf":
 		mode = 0644
 		appendToFile = true
-	} else if filename == "registries.conf" {
+	case filename == "registries.conf":
 		// Having the mode be private breaks rpm-ostree, xref
 		// https://github.com/openshift/installer/pull/6789
 		mode = 0644
-	} else {
+	default:
 		mode = 0600
 	}
 	ign := ignition.FileFromBytes(strings.TrimSuffix(base, ".template"), "root", mode, data)
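The new `switch` picks an Ignition file mode from the parent directory or filename; `system-generators` is the case added by this PR so the new generator script is executable. The same dispatch, rendered illustratively in shell (not installer code):

```shell
set -euo pipefail

# mode_for PARENT_DIR FILENAME: shell rendering of the Go switch in
# AddStorageFiles, mapping bootstrap asset paths to Ignition file modes.
mode_for() {
  local parent_dir="$1" filename="$2"
  case "$parent_dir" in
    bin|dispatcher.d|system-generators) echo 0555; return ;;  # executables
  esac
  case "$filename" in
    motd|containers.conf) echo 0644 ;;  # also appended to, not overwritten
    registries.conf)      echo 0644 ;;  # private mode breaks rpm-ostree
    *)                    echo 0600 ;;  # everything else stays private
  esac
}

echo "system-generators/foo -> $(mode_for system-generators foo)"
echo "misc/registries.conf  -> $(mode_for misc registries.conf)"
echo "misc/secret           -> $(mode_for misc secret)"
```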