Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[profiling] Fix 3 flaky tests #3554

Merged
merged 1 commit into from
Mar 26, 2024
Merged

Conversation

AlexJF
Copy link
Contributor

@AlexJF AlexJF commented Mar 22, 2024

What does this PR do?
Fix 3 flaky tests that have surfaced this week:

  • Datadog::Profiling::Collectors::CpuAndWallTimeWorker ._native_allocation_count when CpuAndWallTimeWorker has been started when allocation profiling is enabled returns the number of allocations between two calls of the method

    • https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/14119/workflows/cbefa69c-c16e-41fd-a601-2e951a5264ab/jobs/525211

    • expect(after_allocations - before_allocations).to be 100 | expected #<Integer:201> => 100 | got #<Integer:215> => 107

    • TL;DR: Rare GC occurrence in the middle of the test would count extra array allocations due to execution of finalizer functions with complex arguments (*rest) which are expected to come from usage of Tempfile in previous tests in the run.

      C backtrace of Array-typed allocation samples during allocation of 100 Objects
      -- C level backtrace information -------------------------------------------
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_vm_bugreport+0xb4c) [0x105013440]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_bug_without_die+0xfc) [0x104e561b8]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_bug+0x1c) [0x105108b90]
      /Users/alexandre.fonseca/go/src/github.com/DataDog/dd-trace-rb/lib/datadog_profiling_native_extension.3.3.0_arm64-darwin23.bundle(_sample_thread.cold.1+0x14) [0x12d3178a8]
      /Users/alexandre.fonseca/go/src/github.com/DataDog/dd-trace-rb/lib/datadog_profiling_native_extension.3.3.0_arm64-darwin23.bundle(sample_thread+0x604) [0x12d30f65c]
      /Users/alexandre.fonseca/go/src/github.com/DataDog/dd-trace-rb/lib/datadog_profiling_native_extension.3.3.0_arm64-darwin23.bundle(trigger_sample_for_thread+0x574) [0x12d310bbc]
      /Users/alexandre.fonseca/go/src/github.com/DataDog/dd-trace-rb/lib/datadog_profiling_native_extension.3.3.0_arm64-darwin23.bundle(thread_context_collector_sample_allocation+0x298) [0x12d3113d4]
      /Users/alexandre.fonseca/go/src/github.com/DataDog/dd-trace-rb/lib/datadog_profiling_native_extension.3.3.0_arm64-darwin23.bundle(rescued_sample_allocation+0x58) [0x12d30c81c]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_vrescue2+0xc8) [0x104e62910]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_rescue2+0x2c) [0x104e62820]
      /Users/alexandre.fonseca/go/src/github.com/DataDog/dd-trace-rb/lib/datadog_profiling_native_extension.3.3.0_arm64-darwin23.bundle(on_newobj_event+0xb8) [0x12d30c6ac]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(exec_hooks_unprotected+0x54) [0x105014814]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_exec_event_hooks+0xc8) [0x105014764]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(gc_event_hook_body+0x94) [0x104e83954]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(newobj_slowpath_wb_protected+0x164) [0x104e804a0]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_wb_protected_newobj_of+0x90) [0x104e7243c]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(ary_new+0x6c) [0x104dbb7c4]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_ary_new_from_values+0x28) [0x104dbba4c]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(setup_parameters_complex+0xad8) [0x105003df8]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(vm_call_iseq_setup+0x60) [0x1050014a0]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(vm_call0_body+0x260) [0x10500b438]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_vm_call0+0xd8) [0x104ff5390]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(run_finalizer+0x120) [0x104e74cf4]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(finalize_deferred_heap_pages+0x88) [0x104e74e78]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(gc_finalize_deferred+0x50) [0x104e71e6c]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_postponed_job_flush+0x170) [0x105018358]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_threadptr_execute_interrupts+0x118) [0x104fb5090]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(vm_call0_body+0x9bc) [0x10500bb94]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_call0+0x3c8) [0x10500c7f4]
      /Users/alexandre.fonseca/.asdf/installs/ruby/3.3.0/lib/libruby.3.3.dylib(rb_class_new_instance_pass_kw+0x3c) [0x104ee9188]
      
  • Datadog::Profiling::Collectors::CpuAndWallTimeWorker #stats_reset_not_thread_safe returns accumulated stats and resets them back to 0

  • Datadog::Profiling::Collectors::CpuAndWallTimeWorker #start when heap profiling is enabled records live heap objects

    • Not hit in CI but surfaced during the thousands of runs we had to run to reproduce the above 2.
    • expect(relevant_sample.values[:'heap-live-samples']).to eq test_num_allocated_object | expected: 123 | got: 120
    • TL;DR: An unlucky GC during the test execution could lead to some of the recorded objects having been allocated before the GC and others after, leading to different values of gc gen age and thus being aggregated in multiple samples rather than the expected individual one.

Motivation:
Less flakes!

Additional Notes:

How to test the change?
These bugs were reproduced and fixed through thousands of local executions of the cpu_and_wall_time_worker_spec.rb. We haven't hit failures after fixing them but you're welcome to give it a try!

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@AlexJF AlexJF requested a review from a team as a code owner March 22, 2024 18:42
@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Mar 22, 2024
@AlexJF AlexJF force-pushed the alexjf/handle-allocation-flaky-test branch from 9445ae8 to 5d14f23 Compare March 22, 2024 18:43
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM Amazing work!

I guess I'll sum up this PR as

Note to self: ask 'what happens if GC runs in the middle of this spec?'

when writing and reviewing allocation/heap profiling code.

@ivoanjo ivoanjo added this to the 1.22.0 milestone Mar 25, 2024
@AlexJF AlexJF merged commit c097f90 into master Mar 26, 2024
219 checks passed
@AlexJF AlexJF deleted the alexjf/handle-allocation-flaky-test branch March 26, 2024 15:53
ivoanjo added a commit that referenced this pull request Feb 10, 2025
**What does this PR do?**

This PR fixes the following flaky profiler spec
(seen in
https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/19126/workflows/00361a70-d216-4fa9-9ce9-0f87e5f033f1/jobs/669239):

```
Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker._native_allocation_count when CpuAndWallTimeWorker has been started when allocation profiling and allocation counting is enabled returns different numbers of allocations for different threads
     Failure/Error: expect(after_t1 - before_t1).to be 100

       expected #<Integer:201> => 100
            got #<Integer:217> => 108

       Compared using equal?, which compares object identity,
       but expected and actual are not the same object. Use
       `expect(actual).to eq(expected)` if you don't care about
       object identity in this example.
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:1377:in `block (5 levels) in <top (required)>'
     # ./spec/spec_helper.rb:238:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:123:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.23.0/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'
```

We had previously debugged the same issue on another spec in
#3554 (and left a nice
comment explaining what the issue is).

I just propagated the same fix (disabling GC for the duration
of the test) to this spec as well.

**Motivation:**

Zero known flaky profiler tests!

**Additional Notes:**

In practice what happens is that we're counting how many
Ruby objects get created during the spec, but if GC gets
triggered at the exact wrong time, it may trigger other
Ruby code running, and thus more objects than expected
being allocated.

By disabling GC for this test, hopefully we're able
to keep the Ruby VM from not allocating any incidental
things in the background.

This GC counting feature is also deprecated, so in the
future we'll remove it which is even nicer :)

**How to test the change?**

Validate that CI is still green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev/testing Involves testing processes (e.g. RSpec)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants