Skip to content
This repository has been archived by the owner on Oct 8, 2024. It is now read-only.

remove stale opentelemetry_sdk packages #151

Merged
merged 8 commits into from
Jul 30, 2024
Merged

remove stale opentelemetry_sdk packages #151

merged 8 commits into from
Jul 30, 2024

Conversation

ca-scribner
Copy link
Contributor

@ca-scribner ca-scribner commented Jul 24, 2024

Issue

This hack mitigate an issue where the otel library's mechanism to discover resource detectors raised an Exception after a charm refresh, as described in https://github.com/canonical/grafana-agent-operator/issues/146. That was caused by a juju/charmcraft issue (https://bugs.launchpad.net/juju/+bug/2058335) where, when upgrading packages, Juju leaves behind package metadata from the old versions. This caused otel's discovery mechanism to incorrectly use the old resource detector, leading to an exception.

Solution

The fix here is to, before the otel discovery mechanism fires, detect if we have multiple "opentelemetry_sdk" distributions (packages) and if we do, delete any that appear to be stale (where "stale" is determined by whether they present entry_points - this assumption worked in testing, but not sure if it is universally valid)

Context

https://github.com/canonical/grafana-agent-operator/issues/146

Testing Instructions

(in a machine model)

to repro the original issue

charmcraft pack
juju deploy grafana-agent --revision=134 --channel=latest/stable
juju deploy ubuntu
juju relate grafana-agent ubuntu
# wait for deploy to finish and charm to go to blocked (because it has no target for its metrics)

juju refresh grafana-agent --channel latest/edge
# charm should refresh and go to error

to demo the fix:

juju deploy grafana-agent --revision=134 --channel=latest/stable
juju deploy ubuntu
juju relate grafana-agent ubuntu
# wait for deploy to finish and charm to go to blocked (because it has no target for its metrics)

# rebuild grafana-agent latest/edge with this PR's version of the library
juju refresh grafana-agent --path ./grafana-agent_ubuntu-22.04-amd64.charm
# charm should refresh and go back to blocked.  It should not go to error

Release Notes

Adds a workaround for juju bug 2058335, which results in parts of old python packages being left in the virtual environment after upgrades during charm refresh. This patch detects and removes any duplicate opentelemetry-exporter-otlp-proto-http package found during upgrade.

This hack mitigate an issue where the otel library's mechanism to discover resource detectors raised an Exception after a charm refresh, as described in https://github.com/canonical/grafana-agent-operator/issues/146.  That was caused by a juju/charmcraft issue (https://bugs.launchpad.net/juju/+bug/2058335) where, when upgrading packages, Juju leaves behind package metadata from the old versions.  This caused otel's discovery mechanism to incorrectly use the old resource detector, leading to an exception.

The fix here is to, before the otel discovery mechanism fires, detect if we have multiple "opentelemetry_sdk" distributions (packages) and if we do, delete any that appear to be stale (where "stale" is determined by whether they present `entry_points` - this assumption worked in testing, but not sure if it is universally valid)
@ca-scribner ca-scribner requested a review from a team as a code owner July 24, 2024 20:56
@PietroPasotti
Copy link
Contributor

I'm not really happy to do this on each trace_charm call (i.e. on every hook).

How about we only run that conditionally on:

if os.getenv("JUJU_HOOK_PATH") == "hooks/upgrade-charm":
    ...

Also please let's add a test for it.

A better solution IMHO would be do this for each library from ops as a temporary workaround until juju fixes theirs. I can pursue that while we go on with this as a hotfix.

@ca-scribner
Copy link
Contributor Author

+1 to putting behind a guard to only run on upgrade-charm.

I'm +1 to adding tests too, but wasn't sure the right balance of correctness and destructiveness :D We could do a full-fidelity unit test, but the test would have to insert stuff into its venv which, if it goes wrong, could be messy.

Maybe an integration test is possible, but the test would have to do something like:

  • deploy a broken charm from charmhub (must be from charmhub, otherwise the bug doesn't occur)
  • build a new version of that charm using the current lib
  • refresh to the new version (this might have to be deployed from charmhub. I'd need to confirm that)

@ca-scribner
Copy link
Contributor Author

Added a guard. Also ignored the type errors as I think the issue is a lack of typing in the source objects? But I'm a typing noob so might be wrong

@ca-scribner
Copy link
Contributor Author

If we implement this, I'll need to test the guard some more. Did a quick test on the grafana-agent charm and I don't think this guard worked (or rather, it was too aggressive - the charm still went into error so I dont think the directory was purged)

@PietroPasotti
Copy link
Contributor

envvar name: JUJU_DISPATCH_PATH

@ca-scribner
Copy link
Contributor Author

I've updated the PR to use the correct environment variable, and tested it locally for grafana-agent. Everything works using the reproduction procedure given in this PR description. Automated tests have not been added here because of how awkward the testing would be.

Given the odd nature of this bug and lack of automatic testing, it would be great if someone else can independently try the grafana-agent fix in case I've missed something

lib/charms/tempo_k8s/v1/charm_tracing.py Outdated Show resolved Hide resolved
lib/charms/tempo_k8s/v1/charm_tracing.py Outdated Show resolved Hide resolved
lib/charms/tempo_k8s/v1/charm_tracing.py Outdated Show resolved Hide resolved
@ca-scribner
Copy link
Contributor Author

ty for the edits - lgtm!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants