
Re-enable Storage Metrics emitter. #5761

Closed · wants to merge 1 commit

Conversation

decko (Member) commented on Sep 2, 2024:

Revert "Disable the Storage Metrics emmiter for now."
This reverts commit 7029ee4.

Closes #5762

decko force-pushed the reactivate_space_usage_metric branch from 56d73b9 to a644253 on September 2, 2024 13:49
decko marked this pull request as a draft on September 5, 2024 12:02
decko force-pushed the reactivate_space_usage_metric branch from a644253 to c9d0177 on September 6, 2024 11:58
from pulpcore.plugin.tasking import dispatch
from pulpcore.app.tasks import telemetry

# Dispatch the space-usage telemetry as a background task so saving the Artifact
# is not slowed down by the aggregation query.
dispatch(telemetry.emmit_disk_space_usage_telemetry, args=(self.pulp_domain.pk,))

lubosmj (Member) commented:
Seeing this again, wouldn't it make much more sense to (1) store the used disk space in a dedicated table and (2) schedule a task that periodically updates this value? All the burden of running the complex query would be deferred to the task, and API workers would just emit the value stored in the table as is.

gauge.set(DomainSize.objects.get(domain=domain["pk"]).total_size, {"domain_name": domain["pulp_domain__name"]})

Does it make sense to you?
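
To make the suggestion concrete, here is a minimal sketch of that approach; DomainSize (echoing the snippet above), its fields, and update_domain_sizes() are hypothetical names, not existing pulpcore code:

# Hypothetical sketch of the "dedicated table + periodic refresh" idea.
# DomainSize and update_domain_sizes() are illustrative, not pulpcore APIs.
from django.db import models
from django.db.models import Sum


class DomainSize(models.Model):
    """Cached total artifact size per domain, refreshed by a periodic task."""

    domain = models.OneToOneField("core.Domain", on_delete=models.CASCADE)
    total_size = models.BigIntegerField(default=0)
    updated_at = models.DateTimeField(auto_now=True)


def update_domain_sizes():
    """Run the expensive aggregation once and cache the result per domain."""
    from pulpcore.app.models import Artifact

    totals = Artifact.objects.values("pulp_domain").annotate(total=Sum("size"))
    for row in totals:
        DomainSize.objects.update_or_create(
            domain_id=row["pulp_domain"], defaults={"total_size": row["total"] or 0}
        )

API workers would then only read the cached row when emitting the gauge, as in the gauge.set() line above.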

decko (Member, Author) replied:

I don't think so, @lubosmj. I'm dispatching a task from here to avoid any sort of lock or slowed response when creating an Artifact.

Also, I'm only sending the metered value when a new Artifact hits the database. I don't see a reason to store it, since the value isn't used all the time.

lubosmj (Member) replied:

Your plan is to dispatch a task each time an artifact is saved or deleted, correct? During syncing or recycling (orphan cleanup), this might result in hundreds of tasks being dispatched within a very small time frame, potentially blocking other useful tasks from being executed. I doubt this has any performance benefit.

decko (Member, Author) replied:

Nope. It will only schedule a new task if there is no other one scheduled for the next 5 minutes.
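
Roughly, a guard like the following could implement that; it is illustrative only, not the PR's actual code, and the real deduplication may differ:

# Illustrative guard: only dispatch if an equivalent task is not already pending.
from datetime import timedelta

from django.utils import timezone

from pulpcore.app.models import Task
from pulpcore.app.tasks import telemetry
from pulpcore.plugin.tasking import dispatch

TASK_NAME = "pulpcore.app.tasks.telemetry.emmit_disk_space_usage_telemetry"


def maybe_dispatch_space_usage(domain_pk):
    cutoff = timezone.now() - timedelta(minutes=5)
    recently_scheduled = Task.objects.filter(
        name=TASK_NAME, state__in=["waiting", "running"], pulp_created__gte=cutoff
    ).exists()
    if not recently_scheduled:
        dispatch(telemetry.emmit_disk_space_usage_telemetry, args=(domain_pk,))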

decko (Member, Author) commented:

Change of plans, @lubosmj.

decko (Member, Author) commented:

The current idea is to emit these metrics the same way we emit the unblocked-tasks metrics: within the worker's heartbeat.

decko force-pushed the reactivate_space_usage_metric branch from af5f201 to cd834a5 on September 16, 2024 16:38
decko force-pushed the reactivate_space_usage_metric branch 7 times, most recently from 96f1912 to 61d99a6 on September 16, 2024 20:36
decko marked this pull request as ready for review on September 16, 2024 21:03
decko requested a review from lubosmj on September 16, 2024 21:04
if os.getenv("PULP_OTEL_ENABLED", "").lower() != "true" and not settings.DOMAIN_ENABLED:
    return

# Only one process at a time holds the advisory lock; the others silently skip this beat.
with contextlib.suppress(AdvisoryLockError), PGAdvisoryLock(STORAGE_METRICS_LOCK):

bmbouter (Member) commented:

The way this reads to me (maybe I'm wrong here), every worker could acquire the lock on every beat, causing this code to run a lot more often than expected. I imagined this working where the advisory lock would be acquired here (called from beat()) but then held until the process exits, with the fact that the lock is acquired kept in memory as a variable and then checked (no need to re-acquire if I already got it last time; I still have it).

To acquire the lock in a long-lived way, I think you'll need to not use the context manager.
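
A rough sketch of that long-lived acquisition, assuming the lock can be entered once via contextlib.ExitStack and remembered on the worker; the class, attribute, and beat() wiring are illustrative, while PGAdvisoryLock, AdvisoryLockError, STORAGE_METRICS_LOCK, and emit_domain_space_usage_metric are the names already used in this PR:

# Illustrative only: acquire the advisory lock once and hold it for the process lifetime.
import contextlib


class MetricsBeatMixin:
    _metrics_lock_stack = None  # remembers whether this process already owns the lock

    def beat(self):
        if self._metrics_lock_stack is None:
            stack = contextlib.ExitStack()
            try:
                stack.enter_context(PGAdvisoryLock(STORAGE_METRICS_LOCK))
            except AdvisoryLockError:
                return  # another worker owns the lock; try again on a later beat
            self._metrics_lock_stack = stack  # the lock stays held until process exit
        emit_domain_space_usage_metric()  # only the lock holder reaches this point

Keeping the ExitStack around also lets the lock be released cleanly if the worker shuts down gracefully.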

mdellweg (Member) commented:

Holding a lock forever is probably a bad idea. (Holding a lock for the duration of a several-hours sync task has not yet proven to be a real problem, but it gives me a lot to think about.) Also, we agreed not to do any version of leader election in the workers.
So, as we can now establish, this whole business of collecting telemetry and posting meters (I'm not talking about recording events, just scraping measures from the database) turns out to be more than a one-off. It is also sufficiently separate from all other Pulp operations to be moved into its own process. That should reduce the risk of this whole endeavour by an order of magnitude.
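
Purely to illustrate the separate-process idea (nothing in this PR implements it), such a process could look roughly like the following; query_space_usage() is an assumed helper wrapping the same database aggregation used by the task:

# Hypothetical standalone telemetry process; not part of this PR.
import time

from django.db.models import Sum
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


def query_space_usage():
    # Assumed helper: the same aggregation the task runs, kept out of the API workers.
    from pulpcore.app.models import Artifact

    return Artifact.objects.values("pulp_domain__name").annotate(total_size=Sum("size"))


def main():
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())
    meter = MeterProvider(metric_readers=[reader]).get_meter("pulpcore.telemetry")
    gauge = meter.create_gauge("space_usage", unit="bytes")

    while True:
        for domain in query_space_usage():
            gauge.set(domain["total_size"], {"domain_name": domain["pulp_domain__name"]})
        time.sleep(300)  # scrape interval; independent of the worker heartbeat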

decko (Member, Author) replied:

@bmbouter Nope, man. Only one worker can acquire the lock, and it will only run the query if the current time is later than last_heartbeat + heartbeat_period.
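
In other words, roughly (illustrative; the names mirror the wording above rather than the PR's actual variables):

from django.utils import timezone

# Run the expensive query only when a full heartbeat period has elapsed.
if timezone.now() > last_heartbeat + heartbeat_period:
    emit_domain_space_usage_metric()
    last_heartbeat = timezone.now()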

decko (Member, Author) replied:

I like the idea of having a different process dealing with it, @mdellweg. Still, I believe we need to discuss two points:
1. We are still going to need telemetry in some parts of Pulp. Are we OK with telemetry code living in both places, a "telemetry worker" and the Pulp code itself?
2. The services team is currently short on resources, and I can't see this new Pulp process happening in the short term.

How can we figure out a solution for this?

dralley marked this pull request as a draft on September 24, 2024 13:13
decko force-pushed the reactivate_space_usage_metric branch from 61d99a6 to 7aecca7 on October 3, 2024 15:11
decko closed this on Oct 3, 2024
decko reopened this on Oct 3, 2024
decko force-pushed the reactivate_space_usage_metric branch 2 times, most recently from a7e7121 to 906c17e on October 3, 2024 18:19
decko marked this pull request as ready for review on October 4, 2024 12:16
decko requested a review from mdellweg on October 4, 2024 12:16
provider = MeterProvider(metric_readers=[metric_reader], resource=resource)


def emit_domain_space_usage_metric():

Member commented:

The only need for more than one such task would be the typical "Hourly, Daily, Weekly" schedules.

Suggested change:
-def emit_domain_space_usage_metric():
+def otel_metrics():

decko (Member, Author) replied:

I don't get it. The name makes it clear what the function does. The telemetry module could even receive the code of the task metrics we have in worker.py.
I see no reason to change the name here.

Member replied:

This name appears in the database. There is no way we can change it later. The current name is specific to its first batch of work, yes, but it's not future-proof. It also suffers from homeopathic naming: "emit" and "usage" don't actually add meaningful information. And still, this is the one recurring task to collect and send all the measures that need frequent updates (currently every five minutes, but I'm not sold on that interval, so don't put it in the name).
We may think about adding an "otel_metrics_daily" later. You get the idea?

Comment on lines 44 to 45
metric_reader.collect()
metric_reader.force_flush()

Member commented:

Calling force_flush appears redundant to me.

    def force_flush(self, timeout_millis: float = 10_000) -> bool:
        super().force_flush(timeout_millis=timeout_millis)  # ABC class calls self.collect()
        self._exporter.force_flush(timeout_millis=timeout_millis)  # noop for PeriodicExportingMetricReader
        return True

https://opentelemetry-python.readthedocs.io/en/latest/sdk/metrics.export.html#opentelemetry.sdk.metrics.export.MetricReader.force_flush
https://opentelemetry-python.readthedocs.io/en/latest/sdk/metrics.export.html#opentelemetry.sdk.metrics.export.PeriodicExportingMetricReader.force_flush
https://opentelemetry-python.readthedocs.io/en/latest/exporter/otlp/otlp.html#opentelemetry.exporter.otlp.proto.http.metric_exporter.OTLPMetricExporter.force_flush

decko (Member, Author) replied:

Thanks man. I've removed it.

decko force-pushed the reactivate_space_usage_metric branch from 906c17e to 687e7ee on October 7, 2024 13:25
# This configuration is needed since the worker thread is not using the opentelemetry
# instrumentation agent to run the task code.

OTLP_EXPORTER_ENDPOINT = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318/")

Member commented:

This can be removed. It is not used anywhere.

decko (Member, Author) replied:

Yeah, I removed the use of this in the Exporter but forgot to remove it here. Thanks for the catch.

Comment on lines 21 to 23
metric_reader = PeriodicExportingMetricReader(
    exporter, export_interval_millis=3000, export_timeout_millis=3000
)

Member commented:

The timeout and interval values are retrieved from environment variables by default. Is it worth setting them if we enforce the exporting?
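
For reference, a minimal sketch of relying on the SDK defaults instead; when the arguments are omitted, PeriodicExportingMetricReader falls back to the OTEL_METRIC_EXPORT_INTERVAL and OTEL_METRIC_EXPORT_TIMEOUT environment variables (60000 ms and 30000 ms by default):

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT; interval and timeout come from
# OTEL_METRIC_EXPORT_INTERVAL / OTEL_METRIC_EXPORT_TIMEOUT when not passed explicitly.
exporter = OTLPMetricExporter()
metric_reader = PeriodicExportingMetricReader(exporter)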

decko (Member, Author) replied:

Nope, I need to remove it from here as well.

Comment on lines 1 to 3
Re-enable the Domain Storage metric emmiter and adds a feature flag to it.
This is an experimental feature and can change without prior notice.

Member commented:

This changelog does not reflect reality. We do not have a feature flag for the storage metrics, do we?

decko (Member, Author) replied:

Sorry, my first implementation had it.


# We're using the same gauge with different attributes for each domain space usage
for domain in space_utilization_per_domain:
    space_usage_gauge.set(domain["total_size"], {"domain_name": domain["pulp_domain__name"]})

lubosmj (Member) commented on Oct 7, 2024:

Is it worth incorporating the pulp-href/PRN here? It was part of the metric's attributes in the past:

metrics.Observation(
    total_size,
    {
        "pulp_href": get_url(self.domain),
        "domain_name": self.domain.name,
    },
)

decko (Member, Author) replied:

I don't know, @lubosmj. I can't see the usefulness of it on the metric itself. Also, to use the get_prn function we need a Domain instance or the Domain's pulp_href, and at that point in the code we don't have either.

decko force-pushed the reactivate_space_usage_metric branch 2 times, most recently from d481f04 to de0cfdd on October 8, 2024 13:41
decko requested a review from mdellweg on October 8, 2024 14:21

lubosmj (Member) left a review:

From a functional perspective, this is working great!


otel-collector    | InstrumentationScope pulpcore.app.tasks.telemetry 
otel-collector    | Metric #0
otel-collector    | Descriptor:
otel-collector    |      -> Name: space_usage
otel-collector    |      -> Description: The total space usage per domain.
otel-collector    |      -> Unit: bytes
otel-collector    |      -> DataType: Gauge
otel-collector    | NumberDataPoints #0
otel-collector    | Data point attributes:
otel-collector    |      -> domain_name: Str(default)
otel-collector    | StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
otel-collector    | Timestamp: 2024-10-08 15:56:56.717505952 +0000 UTC
otel-collector    | Value: 368237

The feature is resilient to worker restarts and the value is correctly reported. Please address this comment from @mdellweg: https://github.com/pulp/pulpcore/pull/5761/files#r1790281943. I will then approve this PR.

Comment on lines 1 to 3
Re-enable and refactor the Domain Storage metric emiter.
This is an experimental feature and can change without prior notice.

Member commented:

Suggested change:
-Re-enable and refactor the Domain Storage metric emiter.
-This is an experimental feature and can change without prior notice.
+Re-enabled and refactored the Domain Storage metric emitter.

decko force-pushed the reactivate_space_usage_metric branch from de0cfdd to 45a405c on October 9, 2024 12:31
decko closed this by deleting the head repository on Oct 9, 2024
Successfully merging this pull request may close these issues: Re-enable Storage Metrics.

4 participants