Re-enable Storage Metrics emitter. #5761
Conversation
Force-pushed from 56d73b9 to a644253
Force-pushed from a644253 to c9d0177
pulpcore/app/models/content.py
Outdated
from pulpcore.plugin.tasking import dispatch
from pulpcore.app.tasks import telemetry

dispatch(telemetry.emmit_disk_space_usage_telemetry, args=(self.pulp_domain.pk,))
Seeing this again, wouldn't it make much more sense to (1) store the value of used disk space in a dedicated table and (2) schedule a task that periodically updates this value? All the burden of running the complex query would be deferred to the task, and API workers would just emit the value stored in the table as-is.
gauge.set(DomainSize.objects.get(domain=domain["pk"]), {"domain_name": domain["pulp_domain__name"]})
Does it make sense to you?
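To illustrate the proposal, here is a minimal sketch of the split; the DomainSize model and the update_domain_sizes task are hypothetical names, not existing pulpcore code, and the "core.Domain" app label is an assumption.

# Sketch only: DomainSize and update_domain_sizes are hypothetical, not part of pulpcore.
from django.db import models
from django.db.models import Sum


class DomainSize(models.Model):
    """Cached storage usage per domain, refreshed by a periodic task."""

    domain = models.OneToOneField("core.Domain", on_delete=models.CASCADE)  # app label assumed
    total_size = models.BigIntegerField(default=0)
    updated_at = models.DateTimeField(auto_now=True)


def update_domain_sizes():
    """Periodic task: run the expensive aggregation once and cache the result."""
    from pulpcore.app.models import Artifact

    usage = Artifact.objects.values("pulp_domain").annotate(total_size=Sum("size"))
    for row in usage:
        DomainSize.objects.update_or_create(
            domain_id=row["pulp_domain"], defaults={"total_size": row["total_size"]}
        )

API workers would then only read the cached DomainSize rows when emitting the gauge, as in the gauge.set() line above.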
I don't think so @lubosmj. I'm dispatching a task from here to avoid any sort of lock or slowed response when creating an Artifact.
Also, I'm only sending the measured value when a new Artifact hits the database. I can't see a reason to store it, since I'm not using that value all the time.
Your plan is to dispatch a task each time an artifact is saved or deleted, correct? During syncing or recycling (orphan cleanup), this might result in hundreds of tasks being dispatched within a very small time frame, potentially blocking other useful tasks from being executed. I doubt this has any performance benefit.
Nope. It'll only schedule a new task if there's no other one scheduled for the next 5 minutes.
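A sketch of what such a guard could look like; the task-name string, the helper name, and the 5-minute window are illustrative, and querying pulpcore's Task model by name and pulp_created is an assumption, not the PR's actual code.

# Hypothetical de-duplication guard, not the actual PR code.
from datetime import timedelta

from django.utils import timezone

from pulpcore.app.models import Task
from pulpcore.app.tasks import telemetry
from pulpcore.plugin.tasking import dispatch


def dispatch_storage_metrics_once(domain_pk):
    """Only dispatch if no equivalent task was created in the last 5 minutes."""
    task_name = "pulpcore.app.tasks.telemetry.emmit_disk_space_usage_telemetry"
    cutoff = timezone.now() - timedelta(minutes=5)
    if not Task.objects.filter(name=task_name, pulp_created__gte=cutoff).exists():
        dispatch(telemetry.emmit_disk_space_usage_telemetry, args=(domain_pk,))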
Change of plans @lubosmj
The current idea is to run this metric just like we run the unblocked-tasks metric: within the heartbeat of the worker.
Force-pushed from af5f201 to cd834a5
Force-pushed from 96f1912 to 61d99a6
pulpcore/tasking/worker.py
Outdated
if os.getenv("PULP_OTEL_ENABLED", "").lower() != "true" and not settings.DOMAIN_ENABLED:
    return

with contextlib.suppress(AdvisoryLockError), PGAdvisoryLock(STORAGE_METRICS_LOCK):
With the way this reads to me (maybe I'm wrong here), every worker could acquire the lock on every beat, causing this code to run a lot more often than expected. I imagined this working as follows: the advisory lock would be acquired here (called from beat()), but then held until the process exits, with the fact that the lock is acquired kept in memory as a variable and checked on later beats (no need to re-acquire if I already got it last time; I still have it).
To acquire the lock in a long-lived way, I think you'll need to not use the context manager.
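One way to express that suggestion, sketched with a raw session-level PostgreSQL advisory lock instead of pulpcore's PGAdvisoryLock context manager; the lock id and helper name are illustrative, not pulpcore's actual constants.

# Illustrative sketch: acquire once, remember the result, hold until the process exits.
from django.db import connection

STORAGE_METRICS_LOCK = 1234  # illustrative lock id, not pulpcore's actual constant

_have_lock = False


def ensure_storage_metrics_lock():
    """Try to take the advisory lock once; reuse the in-memory result on later beats."""
    global _have_lock
    if _have_lock:
        return True
    with connection.cursor() as cursor:
        cursor.execute("SELECT pg_try_advisory_lock(%s)", [STORAGE_METRICS_LOCK])
        _have_lock = cursor.fetchone()[0]
    return _have_lock

Note that a session-level lock is released if the database connection is ever recycled, which is part of why holding it for the whole process lifetime is debatable (see the next comment).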
Holding a lock forever is probably a bad idea. (Holding a lock for the duration of a several-hour sync task has not yet been demonstrated to be a real problem, but it gives me a lot to think about.) Also, we agreed not to do any version of leader election in the workers.
So, as we can now establish, this whole business of collecting telemetry and posting meters (I'm not talking about recording events, just scraping measures from the database here) turns out to be more than a one-off. It is also sufficiently separate from all the other Pulp operations to be moved into its own process. This should reduce the risk of the whole endeavour by an order of magnitude.
@bmbouter Nope, man. Only one worker can acquire the lock, and it will only run the query if the current time is later than last_heartbeat + heartbeat_period.
I like the idea of having a different process dealing with it, @mdellweg. Yet I believe we need to discuss two points:
1- We're still going to need telemetry in some parts of Pulp. Are we OK with telemetry code living in both places, a "telemetry worker" and the Pulp code itself?
2- We're currently running short of resources on the services team, and I can't see this new Pulp process happening in the short term.
How can we figure out a solution for this?
Force-pushed from 61d99a6 to 7aecca7
Force-pushed from a7e7121 to 906c17e
provider = MeterProvider(metric_readers=[metric_reader], resource=resource)


def emit_domain_space_usage_metric():
The only need for more than one such task would be the typical "Hourly, Daily, Weekly" schedules.
Suggested change:
- def emit_domain_space_usage_metric():
+ def otel_metrics():
I don't get it. The name makes it clear what the function does. The telemetry module could even receive the code of the task metrics we have in worker.py.
I see no reason to change the name here.
This name appears in the database; there is no way we can change it later. The current name is specific to its first batch of work, yes, but it's not future-proof. It also suffers from homeopathic naming: "emit" and "usage" don't actually add reasonable information. And still, this is the one recurring task to collect and send all the measures that need frequent updates (currently every five minutes, but I'm not sold on that interval, so don't put it in the name).
We may think about adding an "otel_metrics_daily" later. You get the idea?
pulpcore/app/tasks/telemetry.py
Outdated
metric_reader.collect()
metric_reader.force_flush()
Calling force_flush appears redundant to me.
def force_flush(self, timeout_millis: float = 10_000) -> bool:
    super().force_flush(timeout_millis=timeout_millis)  # ABC class calls self.collect()
    self._exporter.force_flush(timeout_millis=timeout_millis)  # noop for PeriodicExportingMetricReader
    return True
https://opentelemetry-python.readthedocs.io/en/latest/sdk/metrics.export.html#opentelemetry.sdk.metrics.export.MetricReader.force_flush
https://opentelemetry-python.readthedocs.io/en/latest/sdk/metrics.export.html#opentelemetry.sdk.metrics.export.PeriodicExportingMetricReader.force_flush
https://opentelemetry-python.readthedocs.io/en/latest/exporter/otlp/otlp.html#opentelemetry.exporter.otlp.proto.http.metric_exporter.OTLPMetricExporter.force_flush
Thanks man. I've removed it.
Force-pushed from 906c17e to 687e7ee
pulpcore/app/tasks/telemetry.py
Outdated
# This configuration is needed since the worker thread is not using the opentelemetry
# instrumentation agent to run the task code.

OTLP_EXPORTER_ENDPOINT = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318/")
This can be removed. It is not used anywhere.
Yeah. I removed the usage of this in the Exporter but forgot to remove it here. Thanks for the catch.
pulpcore/app/tasks/telemetry.py
Outdated
metric_reader = PeriodicExportingMetricReader(
    exporter, export_interval_millis=3000, export_timeout_millis=3000
)
The timeout and interval values are retrieved from environment variables by default. Is it worth setting them explicitly if we force the export anyway?
Nope. I need to remove it from here as well.
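If the hardcoded values are dropped, this is a sketch of relying on the SDK's environment-driven defaults instead; OTEL_METRIC_EXPORT_INTERVAL and OTEL_METRIC_EXPORT_TIMEOUT (in milliseconds) are the standard OpenTelemetry variables read when the arguments are omitted.

# Sketch: let the OpenTelemetry SDK pick up interval/timeout from the environment.
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

exporter = OTLPMetricExporter()  # endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT
metric_reader = PeriodicExportingMetricReader(exporter)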
CHANGES/5762.feature
Outdated
Re-enable the Domain Storage metric emmiter and adds a feature flag to it.
This is an experimental feature and can change without prior notice.
This changelog does not reflect reality. We do not have a feature flag for the storage metrics, do we?
Sorry. My first implementation had it.
# We're using the same gauge with different attributes for each domain space usage
for domain in space_utilization_per_domain:
    space_usage_gauge.set(domain["total_size"], {"domain_name": domain["pulp_domain__name"]})
Is it worth incorporating the pulp-href/PRN here? It was part of the metric's attributes in the past:
Lines 643 to 649 in f5c439c
metrics.Observation(
    total_size,
    {
        "pulp_href": get_url(self.domain),
        "domain_name": self.domain.name,
    },
)
I don't know, @lubosmj. I can't see the usefulness of it in the metric itself. Also, to use the get_prn function we need a Domain instance or the Domain pulp_href, and at that point in our code we don't have either.
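For reference, the per-domain rows iterated over in the diff above could come from an aggregation of this shape; this is an assumption about the query, not necessarily what the PR implements.

# Assumed shape of space_utilization_per_domain; not necessarily the PR's exact query.
from django.db.models import Sum

from pulpcore.app.models import Artifact

space_utilization_per_domain = Artifact.objects.values("pulp_domain__name").annotate(
    total_size=Sum("size")
)

Each row would then only carry the domain name and the aggregated size, which is why attaching a pulp_href/PRN would require fetching the Domain objects as well.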
Force-pushed from d481f04 to de0cfdd
From a functional perspective, this is working great!
otel-collector | InstrumentationScope pulpcore.app.tasks.telemetry
otel-collector | Metric #0
otel-collector | Descriptor:
otel-collector | -> Name: space_usage
otel-collector | -> Description: The total space usage per domain.
otel-collector | -> Unit: bytes
otel-collector | -> DataType: Gauge
otel-collector | NumberDataPoints #0
otel-collector | Data point attributes:
otel-collector | -> domain_name: Str(default)
otel-collector | StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
otel-collector | Timestamp: 2024-10-08 15:56:56.717505952 +0000 UTC
otel-collector | Value: 368237
The feature is resilient to worker restarts and the value is correctly reported. Please address this comment from @mdellweg: https://github.com/pulp/pulpcore/pull/5761/files#r1790281943. I am going to approve this PR then.
CHANGES/5762.feature
Outdated
Re-enable and refactor the Domain Storage metric emiter.
This is an experimental feature and can change without prior notice.
Suggested change:
- Re-enable and refactor the Domain Storage metric emiter.
- This is an experimental feature and can change without prior notice.
+ Re-enabled and refactord the Domain Storage metric emiter.
Force-pushed from de0cfdd to 45a405c
Revert "Disable the Storage Metrics emmiter for now."
This reverts commit 7029ee4.
Closes #5762