remove stale opentelemetry_sdk packages #151
Conversation
I'm not really happy to do this on each event. How about we only run it conditionally, for example only on upgrade-charm?
Also, please let's add a test for it. A better solution IMHO would be to do this for each library from ops, as a temporary workaround until Juju fixes this on their side. I can pursue that while we go on with this as a hotfix.
+1 to putting this behind a guard so it only runs on upgrade-charm. I'm +1 to adding tests too, but I wasn't sure about the right balance of correctness and destructiveness :D We could do a full-fidelity unit test, but the test would have to insert stuff into its own venv which, if it goes wrong, could be messy. Maybe an integration test is possible, but the test would have to do something like:
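On the cheaper end of that trade-off, a unit test could exercise a purge helper against a throwaway directory instead of the charm's real venv. This is only a hedged sketch: the helper name, its import location, and the dist-info layout are assumptions (the helper itself is sketched under Solution below).

```python
# Hypothetical pytest sketch: run the purge against a temporary directory,
# not the real venv, to avoid the destructiveness concern raised above.
import sys
from pathlib import Path

from charm import _purge_stale_otel_sdk_dists  # assumed import location


def test_purge_removes_dist_info_without_entry_points(tmp_path: Path, monkeypatch):
    stale = tmp_path / "opentelemetry_sdk-1.20.0.dist-info"
    current = tmp_path / "opentelemetry_sdk-1.21.0.dist-info"
    stale.mkdir()
    current.mkdir()
    # Only the "current" copy presents entry points.
    (current / "entry_points.txt").write_text("[opentelemetry_resource_detector]\n")

    # Point the helper's sys.path scan at the temporary directory only.
    monkeypatch.setattr(sys, "path", [str(tmp_path)])
    _purge_stale_otel_sdk_dists()

    assert not stale.exists()
    assert current.exists()
```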
Added a guard. Also ignored the type errors, as I think the issue is a lack of typing in the source objects? But I'm a typing noob so I might be wrong.
If we implement this, I'll need to test the guard some more. I did a quick test on the grafana-agent charm and I don't think this guard worked (or rather, it was too aggressive: the charm still went into error, so I don't think the directory was purged).
envvar name: JUJU_DISPATCH_PATH
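A minimal sketch of a guard built on that variable, assuming the dispatch path for the upgrade hook ends in `upgrade-charm` (the function name and exact check are illustrative, not necessarily what the PR does):

```python
import os


def _is_upgrade_charm_dispatch() -> bool:
    """Best-effort check for whether the current dispatch is the upgrade-charm hook.

    JUJU_DISPATCH_PATH typically looks like "hooks/upgrade-charm".
    """
    dispatch_path = os.environ.get("JUJU_DISPATCH_PATH", "")
    return os.path.basename(dispatch_path) == "upgrade-charm"
```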
I've updated the PR to use the correct environment variable, and tested it locally for grafana-agent. Everything works using the reproduction procedure given in this PR description. Automated tests have not been added here because of how awkward the testing would be. Given the odd nature of this bug and the lack of automated testing, it would be great if someone else could independently try the grafana-agent fix in case I've missed something.
ty for the edits - lgtm!
Issue
This hack mitigates an issue where the otel library's mechanism for discovering resource detectors raised an exception after a charm refresh, as described in https://github.com/canonical/grafana-agent-operator/issues/146. That was caused by a juju/charmcraft issue (https://bugs.launchpad.net/juju/+bug/2058335) where, when upgrading packages, Juju leaves behind package metadata from the old versions. This caused otel's discovery mechanism to incorrectly use the old resource detector, leading to an exception.
Solution
The fix here is to detect, before the otel discovery mechanism fires, whether we have multiple `opentelemetry_sdk` distributions (packages) and, if we do, delete any that appear to be stale (where "stale" is determined by whether they present `entry_points`; this assumption worked in testing, but I'm not sure whether it is universally valid).
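A rough sketch of that approach, as an illustration rather than the PR's actual code (the function name, the sys.path scan, and using `entry_points.txt` as the staleness signal are assumptions):

```python
import shutil
import sys
from pathlib import Path


def _purge_stale_otel_sdk_dists() -> None:
    """Delete opentelemetry_sdk dist-info directories that expose no entry points.

    Juju's upgrade bug can leave old *.dist-info metadata behind, which breaks
    otel's entry-point-based discovery of resource detectors.
    """
    for path_entry in sys.path:
        directory = Path(path_entry or ".")
        if not directory.is_dir():
            continue
        dist_infos = sorted(directory.glob("opentelemetry_sdk-*.dist-info"))
        if len(dist_infos) <= 1:
            continue  # no duplicate distributions here, nothing to clean up
        for dist_info in dist_infos:
            # Heuristic from the description above: a copy that presents no
            # entry points is assumed to be the stale leftover.
            if not (dist_info / "entry_points.txt").exists():
                shutil.rmtree(dist_info, ignore_errors=True)
```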
Context
https://github.com/canonical/grafana-agent-operator/issues/146
Testing Instructions
(in a machine model)
To repro the original issue and then demo the fix, follow something like the sketch below.
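A hedged sketch of the flow, assuming grafana-agent is the charm used for testing; the revisions and exact commands are assumptions, not the PR's original instructions:

- Deploy a pre-fix revision of grafana-agent in a machine model.
- `juju refresh grafana-agent` to a newer revision: the unit should go into error when otel's resource-detector discovery runs against the stale package metadata (this reproduces the original issue).
- Build a charm from this branch and `juju refresh grafana-agent --path ./<built charm>`: the unit should settle back to active/idle (this demos the fix).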
Release Notes
Adds a workaround for Juju bug 2058335, which results in parts of old Python packages being left behind in the virtual environment when it is upgraded during a charm refresh. This patch detects and removes any duplicate `opentelemetry-exporter-otlp-proto-http` package found during upgrade.