
Amend -off-cpu-threshold value #316

Merged
merged 9 commits into from
Jan 22, 2025

Conversation

rockdaboot
Contributor

This amends the -off-cpu-threshold values to:

  • 0 means off-cpu profiling is disabled (default)
  • [1..1000] is the per-mille (‰) probability for an off-cpu event to be reported: 1 means one out of a thousand; 1000 means all events.

Before, a value of 1000 disabled off-cpu profiling, so reporting all events wasn't possible, which isn't what users would expect. A value of 0 was accepted as "valid" but caused a division-by-zero panic.

Additionally, when introducing new configuration variables, a non-zero (or non-empty-string) default can cause issues in other code bases. For example, our experimental OTel collector/receiver stopped working because the newly introduced off-cpu threshold value wasn't initialized with support.OffCPUThresholdMax (not easy to know without closely following the PRs in this repo). A default of 0 for the off-cpu threshold naturally mitigates this.
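The new semantics can be sketched in Go (a hypothetical illustration; `shouldReportOffCPU` and the constant name are made up here — only the value 1000 for support.OffCPUThresholdMax comes from this PR):

```go
package main

import (
	"fmt"
	"math/rand"
)

// offCPUThresholdMax mirrors support.OffCPUThresholdMax (1000).
const offCPUThresholdMax = 1000

// shouldReportOffCPU returns true with probability threshold/1000.
// 0 disables off-cpu profiling entirely (the new default); 1000
// reports every event, since any value mod 1000 is below 1000.
func shouldReportOffCPU(threshold uint32) bool {
	if threshold == 0 {
		return false // off-cpu profiling disabled
	}
	return rand.Uint32()%offCPUThresholdMax < threshold
}

func main() {
	reported := 0
	const events = 100000
	for i := 0; i < events; i++ {
		if shouldReportOffCPU(10) { // 10/1000 = 1% of events
			reported++
		}
	}
	fmt.Printf("reported %d of %d events\n", reported, events)
}
```

With the old semantics, the "report everything" end of the range was unreachable; here the boundary values 0 and 1000 behave as the list above describes.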

@rockdaboot rockdaboot self-assigned this Jan 21, 2025
@rockdaboot rockdaboot requested review from a team as code owners January 21, 2025 08:43
@@ -554,23 +554,22 @@ func loadAllMaps(coll *cebpf.CollectionSpec, cfg *Config,
// On modern systems /proc/sys/kernel/pid_max defaults to 4194304.
// Try to fit this PID space scaled down with cfg.OffCPUThreshold into
// this map.
-	adaption["sched_times"] = (4194304 / support.OffCPUThresholdMax) * cfg.OffCPUThreshold
+	adaption["sched_times"] = (4194304 * cfg.OffCPUThreshold) / support.OffCPUThresholdMax
Contributor Author


Multiply first to prevent rounding inaccuracies.
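The effect of reordering the operations can be shown with the constants from the diff (a standalone sketch; the function names are made up for illustration):

```go
package main

import "fmt"

// Constants from the diff: the default /proc/sys/kernel/pid_max on
// modern systems, and support.OffCPUThresholdMax.
const (
	pidMax       = 4194304
	thresholdMax = 1000
)

// divideFirst is the old formula: integer division truncates
// 4194304/1000 to 4194 before multiplying, so the error grows
// with the threshold.
func divideFirst(threshold int) int {
	return (pidMax / thresholdMax) * threshold
}

// multiplyFirst is the fixed formula: truncation happens only
// once, after the multiplication.
func multiplyFirst(threshold int) int {
	return (pidMax * threshold) / thresholdMax
}

func main() {
	// For a threshold of 500 the old order loses 152 map entries.
	fmt.Println(divideFirst(500), multiplyFirst(500)) // 2097000 2097152
}
```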

@dmathieu
Member

We should have better unit testing to avoid this kind of regression.

@rockdaboot
Contributor Author

> We should have better unit testing to avoid this kind of regression.

We should definitely have better unit testing. The current test code coverage for this project is not impressive.

But in the case mentioned in the PR description, the error came from a Cilium eBPF function, which is hard to trigger/catch in a unit test. Btw, this PR adds a check for the off-cpu threshold value in controller/config/Validate(), which simplifies writing unit tests for corner cases of configuration values.
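A minimal sketch of such a validation check, assuming a trimmed-down Config struct (the real controller/config carries many more fields, and the exact error message here is invented):

```go
package main

import "fmt"

// offCPUThresholdMax mirrors support.OffCPUThresholdMax from the PR.
const offCPUThresholdMax = 1000

// Config is a minimal stand-in for the controller configuration.
type Config struct {
	OffCPUThreshold uint
}

// Validate rejects out-of-range threshold values, roughly as the
// comment above describes for controller/config/Validate(). A value
// of 0 is valid: it simply disables off-cpu profiling.
func (c *Config) Validate() error {
	if c.OffCPUThreshold > offCPUThresholdMax {
		return fmt.Errorf("off-cpu threshold %d exceeds maximum %d",
			c.OffCPUThreshold, offCPUThresholdMax)
	}
	return nil
}

func main() {
	for _, cfg := range []Config{{0}, {1000}, {1001}} {
		fmt.Println(cfg.OffCPUThreshold, cfg.Validate())
	}
}
```

Corner cases such as 0, 1000, and 1001 can then be table-tested against Validate() without touching any eBPF machinery.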

@rockdaboot rockdaboot merged commit a7b433f into open-telemetry:main Jan 22, 2025
24 checks passed
@rockdaboot rockdaboot deleted the off-cpu-threshold branch January 22, 2025 13:57
@felixge
Member

felixge commented Feb 6, 2025

Sorry for the late comment on this. Wouldn't it be nicer to define this chance as a rate X where the probability is 1/X?

See https://pkg.go.dev/runtime#SetMutexProfileFraction for prior art that uses this approach.

@rockdaboot
Contributor Author

> Sorry for the late comment on this. Wouldn't it be nicer to define this chance as a rate X where the probability is 1/X?
>
> See https://pkg.go.dev/runtime#SetMutexProfileFraction for prior art that uses this approach.

I agree that the currently accepted values limit the user.

Why not directly use a probability 1/X (for example 0.004)? That avoids having to explain to the user that X refers to a probability of 1/X.
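The two conventions under discussion can be contrasted in a small sketch (`rateToProbability` is a hypothetical helper, not part of any API mentioned here):

```go
package main

import "fmt"

// rateToProbability converts a SetMutexProfileFraction-style rate X
// ("report 1 out of X events", with 0 meaning disabled) into the
// direct probability the follow-up comment suggests exposing instead.
func rateToProbability(rate int) float64 {
	if rate <= 0 {
		return 0 // disabled
	}
	return 1 / float64(rate)
}

func main() {
	// A rate of 250 corresponds to a direct probability of 1/250.
	fmt.Println(rateToProbability(250))
	fmt.Println(rateToProbability(0))
}
```

Either encoding covers the same space; the difference is only whether the user or the implementation performs the 1/X conversion.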
