
Re-enable Storage Metrics emitter. #5761

Closed · wants to merge 1 commit

Conversation

decko (Member) commented on Sep 2, 2024:

Revert "Disable the Storage Metrics emmiter for now."
This reverts commit 7029ee4.

Closes #5762

decko force-pushed the reactivate_space_usage_metric branch from 56d73b9 to a644253 on September 2, 2024 13:49
decko marked this pull request as a draft on September 5, 2024 12:02
decko force-pushed the reactivate_space_usage_metric branch from a644253 to c9d0177 on September 6, 2024 11:58
from pulpcore.plugin.tasking import dispatch
from pulpcore.app.tasks import telemetry

# Dispatch the space-usage telemetry as a background task so saving the Artifact
# is not slowed down by the aggregation query.
dispatch(telemetry.emmit_disk_space_usage_telemetry, args=(self.pulp_domain.pk,))

lubosmj (Member) commented:
Seeing this again, wouldn't it make much more sense to (1) store the used disk space in a dedicated table and (2) schedule a task that periodically updates this value? All the burden of running the complex query would be deferred to the task, and API workers would just emit the value stored in the table as is.

gauge.set(DomainSize.objects.get(domain=domain["pk"]).total_size, {"domain_name": domain["pulp_domain__name"]})

Does it make sense to you?
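
To make the suggestion concrete, here is a minimal sketch of that approach; DomainSize (echoing the snippet above), its fields, and update_domain_sizes() are hypothetical names, not existing pulpcore code:

# Hypothetical sketch of the "dedicated table + periodic refresh" idea.
# DomainSize and update_domain_sizes() are illustrative, not pulpcore APIs.
from django.db import models
from django.db.models import Sum


class DomainSize(models.Model):
    """Cached total artifact size per domain, refreshed by a periodic task."""

    domain = models.OneToOneField("core.Domain", on_delete=models.CASCADE)
    total_size = models.BigIntegerField(default=0)
    updated_at = models.DateTimeField(auto_now=True)


def update_domain_sizes():
    """Run the expensive aggregation once and cache the result per domain."""
    from pulpcore.app.models import Artifact

    totals = Artifact.objects.values("pulp_domain").annotate(total=Sum("size"))
    for row in totals:
        DomainSize.objects.update_or_create(
            domain_id=row["pulp_domain"], defaults={"total_size": row["total"] or 0}
        )

API workers would then only read the cached row when emitting the gauge, as in the gauge.set() line above.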

decko (Member, Author) replied:

I don't think so, @lubosmj. I'm dispatching a task from here to avoid any sort of lock or slowed response when creating an Artifact.

Also, I'm only sending the metered value when a new Artifact hits the database. I don't see a reason to store it, since the value isn't used all the time.

lubosmj (Member) replied:

Your plan is to dispatch a task each time an artifact is saved or deleted, correct? During syncing or recycling (orphan cleanup), this might result in hundreds of tasks being dispatched within a very small time frame, potentially blocking other useful tasks from being executed. I doubt this has any performance benefit.

decko (Member, Author) replied:

Nope. It will only schedule a new task if there is no other one scheduled for the next 5 minutes.
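
Roughly, a guard like the following could implement that; it is illustrative only, not the PR's actual code, and the real deduplication may differ:

# Illustrative guard: only dispatch if an equivalent task is not already pending.
from datetime import timedelta

from django.utils import timezone

from pulpcore.app.models import Task
from pulpcore.app.tasks import telemetry
from pulpcore.plugin.tasking import dispatch

TASK_NAME = "pulpcore.app.tasks.telemetry.emmit_disk_space_usage_telemetry"


def maybe_dispatch_space_usage(domain_pk):
    cutoff = timezone.now() - timedelta(minutes=5)
    recently_scheduled = Task.objects.filter(
        name=TASK_NAME, state__in=["waiting", "running"], pulp_created__gte=cutoff
    ).exists()
    if not recently_scheduled:
        dispatch(telemetry.emmit_disk_space_usage_telemetry, args=(domain_pk,))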

decko (Member, Author) commented:

Change of plans, @lubosmj.

decko (Member, Author) commented:

The current idea is to emit these metrics the same way we emit the unblocked-tasks metrics: within the worker's heartbeat.

decko force-pushed the reactivate_space_usage_metric branch from af5f201 to cd834a5 on September 16, 2024 16:38
decko force-pushed the reactivate_space_usage_metric branch 7 times, most recently from 96f1912 to 61d99a6 on September 16, 2024 20:36
decko marked this pull request as ready for review on September 16, 2024 21:03
decko requested a review from lubosmj on September 16, 2024 21:04
if os.getenv("PULP_OTEL_ENABLED", "").lower() != "true" and not settings.DOMAIN_ENABLED:
    return

# Only one process at a time holds the advisory lock; the others silently skip this beat.
with contextlib.suppress(AdvisoryLockError), PGAdvisoryLock(STORAGE_METRICS_LOCK):

bmbouter (Member) commented:

The way this reads to me (maybe I'm wrong here), every worker could acquire the lock on every beat, causing this code to run a lot more often than expected. I imagined this working where the advisory lock would be acquired here (called from beat()) but then held until the process exits, with the fact that the lock is acquired kept in memory as a variable and then checked (no need to re-acquire if I already got it last time; I still have it).

To acquire the lock in a long-lived way, I think you'll need to not use the context manager.
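
A rough sketch of that long-lived acquisition, assuming the lock can be entered once via contextlib.ExitStack and remembered on the worker; the class, attribute, and beat() wiring are illustrative, while PGAdvisoryLock, AdvisoryLockError, STORAGE_METRICS_LOCK, and emit_domain_space_usage_metric are the names already used in this PR:

# Illustrative only: acquire the advisory lock once and hold it for the process lifetime.
import contextlib


class MetricsBeatMixin:
    _metrics_lock_stack = None  # remembers whether this process already owns the lock

    def beat(self):
        if self._metrics_lock_stack is None:
            stack = contextlib.ExitStack()
            try:
                stack.enter_context(PGAdvisoryLock(STORAGE_METRICS_LOCK))
            except AdvisoryLockError:
                return  # another worker owns the lock; try again on a later beat
            self._metrics_lock_stack = stack  # the lock stays held until process exit
        emit_domain_space_usage_metric()  # only the lock holder reaches this point

Keeping the ExitStack around also lets the lock be released cleanly if the worker shuts down gracefully.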

mdellweg (Member) commented:

Holding a lock forever is probably a bad idea. (Holding a lock for the duration of a several-hours sync task has not yet proven to be a real problem, but it gives me a lot to think about.) Also, we agreed not to do any version of leader election in the workers.
So, as we can now establish, this whole business of collecting telemetry and posting meters (I'm not talking about recording events, just scraping measures from the database) turns out to be more than a one-off. It is also sufficiently separate from all other Pulp operations to be moved into its own process. That should reduce the risk of this whole endeavour by an order of magnitude.
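
Purely to illustrate the separate-process idea (nothing in this PR implements it), such a process could look roughly like the following; query_space_usage() is an assumed helper wrapping the same database aggregation used by the task:

# Hypothetical standalone telemetry process; not part of this PR.
import time

from django.db.models import Sum
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


def query_space_usage():
    # Assumed helper: the same aggregation the task runs, kept out of the API workers.
    from pulpcore.app.models import Artifact

    return Artifact.objects.values("pulp_domain__name").annotate(total_size=Sum("size"))


def main():
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())
    meter = MeterProvider(metric_readers=[reader]).get_meter("pulpcore.telemetry")
    gauge = meter.create_gauge("space_usage", unit="bytes")

    while True:
        for domain in query_space_usage():
            gauge.set(domain["total_size"], {"domain_name": domain["pulp_domain__name"]})
        time.sleep(300)  # scrape interval; independent of the worker heartbeat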

decko (Member, Author) replied:

@bmbouter Nope, man. Only one worker can acquire the lock, and it will only run the query if the current time is later than last_heartbeat + heartbeat_period.
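
In other words, roughly (illustrative; the names mirror the wording above rather than the PR's actual variables):

from django.utils import timezone

# Run the expensive query only when a full heartbeat period has elapsed.
if timezone.now() > last_heartbeat + heartbeat_period:
    emit_domain_space_usage_metric()
    last_heartbeat = timezone.now()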

decko (Member, Author) replied:

I like the idea of having a different process dealing with it, @mdellweg. Still, I believe we need to discuss two points:
1. We are still going to need telemetry in some parts of Pulp. Are we OK with telemetry code living in both places, a "telemetry worker" and the Pulp code itself?
2. The services team is currently short on resources, and I can't see this new Pulp process happening in the short term.

How can we figure out a solution for this?

dralley marked this pull request as a draft on September 24, 2024 13:13
decko force-pushed the reactivate_space_usage_metric branch from 61d99a6 to 7aecca7 on October 3, 2024 15:11
decko closed this on Oct 3, 2024
decko reopened this on Oct 3, 2024
decko force-pushed the reactivate_space_usage_metric branch 2 times, most recently from a7e7121 to 906c17e on October 3, 2024 18:19
decko marked this pull request as ready for review on October 4, 2024 12:16
decko requested a review from mdellweg on October 4, 2024 12:16
provider = MeterProvider(metric_readers=[metric_reader], resource=resource)


def emit_domain_space_usage_metric():

Member commented:

The only need for more than one such task would be the typical "Hourly, Daily, Weekly" schedules.

Suggested change:
-def emit_domain_space_usage_metric():
+def otel_metrics():

decko (Member, Author) replied:

I don't get it. The name makes it clear what the function does. The telemetry module could even receive the code of the task metrics we have in worker.py.
I see no reason to change the name here.

Member replied:

This name appears in the database. There is no way we can change it later. The current name is specific to its first batch of work, yes, but it's not future-proof. It also suffers from homeopathic naming: "emit" and "usage" don't actually add meaningful information. And still, this is the one recurring task to collect and send all the measures that need frequent updates (currently every five minutes, but I'm not sold on that interval, so don't put it in the name).
We may think about adding an "otel_metrics_daily" later. You get the idea?

Comment on lines 44 to 45
metric_reader.collect()
metric_reader.force_flush()

Member commented:

Calling force_flush appears redundant to me.

    def force_flush(self, timeout_millis: float = 10_000) -> bool:
        super().force_flush(timeout_millis=timeout_millis)  # ABC class calls self.collect()
        self._exporter.force_flush(timeout_millis=timeout_millis)  # noop for PeriodicExportingMetricReader
        return True

https://opentelemetry-python.readthedocs.io/en/latest/sdk/metrics.export.html#opentelemetry.sdk.metrics.export.MetricReader.force_flush
https://opentelemetry-python.readthedocs.io/en/latest/sdk/metrics.export.html#opentelemetry.sdk.metrics.export.PeriodicExportingMetricReader.force_flush
https://opentelemetry-python.readthedocs.io/en/latest/exporter/otlp/otlp.html#opentelemetry.exporter.otlp.proto.http.metric_exporter.OTLPMetricExporter.force_flush

decko (Member, Author) replied:

Thanks man. I've removed it.

decko force-pushed the reactivate_space_usage_metric branch from 906c17e to 687e7ee on October 7, 2024 13:25
# This configuration is needed since the worker thread is not using the opentelemetry
# instrumentation agent to run the task code.

OTLP_EXPORTER_ENDPOINT = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318/")

Member commented:

This can be removed. It is not used anywhere.

decko (Member, Author) replied:

Yeah, I removed the use of this in the Exporter but forgot to remove it here. Thanks for the catch.

Comment on lines 21 to 23
metric_reader = PeriodicExportingMetricReader(
    exporter, export_interval_millis=3000, export_timeout_millis=3000
)

Member commented:

The timeout and interval values are retrieved from environment variables by default. Is it worth setting them if we enforce the exporting?
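
For reference, a minimal sketch of relying on the SDK defaults instead; when the arguments are omitted, PeriodicExportingMetricReader falls back to the OTEL_METRIC_EXPORT_INTERVAL and OTEL_METRIC_EXPORT_TIMEOUT environment variables (60000 ms and 30000 ms by default):

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT; interval and timeout come from
# OTEL_METRIC_EXPORT_INTERVAL / OTEL_METRIC_EXPORT_TIMEOUT when not passed explicitly.
exporter = OTLPMetricExporter()
metric_reader = PeriodicExportingMetricReader(exporter)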

decko (Member, Author) replied:

Nope, I need to remove it from here as well.

Comment on lines 1 to 3
Re-enable the Domain Storage metric emmiter and adds a feature flag to it.
This is an experimental feature and can change without prior notice.

Member commented:

This changelog does not reflect reality. We do not have a feature flag for the storage metrics, do we?

decko (Member, Author) replied:

Sorry, my first implementation had it.


# We're using the same gauge with different attributes for each domain space usage
for domain in space_utilization_per_domain:
    space_usage_gauge.set(domain["total_size"], {"domain_name": domain["pulp_domain__name"]})

lubosmj (Member) commented on Oct 7, 2024:

Is it worth incorporating the pulp-href/PRN here? It was part of the metric's attributes in the past:

metrics.Observation(
    total_size,
    {
        "pulp_href": get_url(self.domain),
        "domain_name": self.domain.name,
    },
)

decko (Member, Author) replied:

I don't know, @lubosmj. I can't see the usefulness of it on the metric itself. Also, to use the get_prn function we need a Domain instance or the Domain's pulp_href, and at that point in the code we don't have either.

decko force-pushed the reactivate_space_usage_metric branch 2 times, most recently from d481f04 to de0cfdd on October 8, 2024 13:41
decko requested a review from mdellweg on October 8, 2024 14:21

lubosmj (Member) left a review:

From a functional perspective, this is working great!


otel-collector    | InstrumentationScope pulpcore.app.tasks.telemetry 
otel-collector    | Metric #0
otel-collector    | Descriptor:
otel-collector    |      -> Name: space_usage
otel-collector    |      -> Description: The total space usage per domain.
otel-collector    |      -> Unit: bytes
otel-collector    |      -> DataType: Gauge
otel-collector    | NumberDataPoints #0
otel-collector    | Data point attributes:
otel-collector    |      -> domain_name: Str(default)
otel-collector    | StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
otel-collector    | Timestamp: 2024-10-08 15:56:56.717505952 +0000 UTC
otel-collector    | Value: 368237

The feature is resilient to worker restarts and the value is correctly reported. Please address this comment from @mdellweg: https://github.com/pulp/pulpcore/pull/5761/files#r1790281943. I will then approve this PR.

Comment on lines 1 to 3
Re-enable and refactor the Domain Storage metric emiter.
This is an experimental feature and can change without prior notice.

Member commented:

Suggested change:
-Re-enable and refactor the Domain Storage metric emiter.
-This is an experimental feature and can change without prior notice.
+Re-enabled and refactored the Domain Storage metric emitter.

decko force-pushed the reactivate_space_usage_metric branch from de0cfdd to 45a405c on October 9, 2024 12:31
decko closed this by deleting the head repository on Oct 9, 2024
Successfully merging this pull request may close these issues: Re-enable Storage Metrics.

4 participants