Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NO-TICKET] Relax profiler expectation to reduce flakiness #3548

Merged
merged 1 commit into from
Mar 21, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Mar 21, 2024

What does this PR do?

We ran into a flaky spec in CI
(https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/14137/workflows/a609e08c-87bf-48a7-8eb4-4e9a45e01478/jobs/526030):

$ bundle exec appraisal ruby-3.1-opentelemetry rake spec:profiling:main
...

Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker #start when all threads are sleeping (no thread holds the Global VM Lock) is able to sample even when all threads are sleeping
     Failure/Error: expect(trigger_sample_attempts).to be >= sample_count

       expected: >= 21
            got:    20
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:397:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'

Finished in 14.61 seconds (files took 1.39 seconds to load)
662 examples, 1 failure, 19 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:369 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker #start when all threads are sleeping (no thread holds the Global VM Lock) is able to sample even when all threads are sleeping

Randomized with seed 32941

This is unexpected since (at time of writing) we always increment the counter before triggering a sample, so this should not be possible.

After some head-scratching, I believe this is related to the rb_postponed_job_register_one clobbering bug we've had reports before. (See comments for my full reasoning).

Thus... I tried to add a small margin to hopefully make sure this flakiness becomes even rarer on the affected Rubies (< 3.3).

Motivation:

Reduce as much as possible flakiness in the profiler test suite.

Additional Notes:

For reference, here's the code for queuing:

https://github.com/ruby/ruby/blob/v3_2_3/vm_trace.c#L1655-L1664

    if (ATOMIC_CAS(vm->postponed_job_index, expected_index, expected_index+1) == expected_index) {
        pjob = &vm->postponed_job_buffer[expected_index];
    }
    else {
        return PJRR_INTERRUPTED;
    }

    /* unused: pjob->flags = flags; */
    pjob->func = func;
    pjob->data = data;

and this is dequeing:

https://github.com/ruby/ruby/blob/v3_2_3/vm_trace.c#L1772-L1780

            rb_atomic_t index;
            struct rb_workqueue_job *wq_job;

            while ((index = vm->postponed_job_index) > 0) {
                if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) {
                    // < -- here Ruby has claimed an index to execute AND
                    // at the same time marked it as available
                    // so if our interrupt and queue happens here, we'll
                    // clobber the previous job
                    rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1];
                    (*pjob->func)(pjob->data);
                }
            }

How to test the change?

It's really hard to trigger this without actually changing Ruby to e.g. trigger the enqueue at the exact time. I didn't spend time on reproducing this exact issue as I'm pretty convinced from looking at the code.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

**What does this PR do?**

We ran into a flaky spec in CI
(https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/14137/workflows/a609e08c-87bf-48a7-8eb4-4e9a45e01478/jobs/526030):

```
$ bundle exec appraisal ruby-3.1-opentelemetry rake spec:profiling:main
...

Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker #start when all threads are sleeping (no thread holds the Global VM Lock) is able to sample even when all threads are sleeping
     Failure/Error: expect(trigger_sample_attempts).to be >= sample_count

       expected: >= 21
            got:    20
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:397:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'

Finished in 14.61 seconds (files took 1.39 seconds to load)
662 examples, 1 failure, 19 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:369 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker #start when all threads are sleeping (no thread holds the Global VM Lock) is able to sample even when all threads are sleeping

Randomized with seed 32941
```

This is unexpected since (at time of writing) we always increment the
counter before triggering a sample, so this should not be possible.

After some head-scratching, I believe this is related to the
`rb_postponed_job_register_one` clobbering bug we've had reports before.
(See comments for my full reasoning).

Thus... I tried to add a small margin to hopefully make sure this
flakiness becomes even rarer on the affected Rubies (< 3.3).

**Motivation:**

Reduce as much as possible flakiness in the profiler test suite.

**Additional Notes:**

For reference, here's the code for queuing:

https://github.com/ruby/ruby/blob/v3_2_3/vm_trace.c#L1655-L1664

```c
    if (ATOMIC_CAS(vm->postponed_job_index, expected_index, expected_index+1) == expected_index) {
        pjob = &vm->postponed_job_buffer[expected_index];
    }
    else {
        return PJRR_INTERRUPTED;
    }

    /* unused: pjob->flags = flags; */
    pjob->func = func;
    pjob->data = data;
```

and this is dequeing:

https://github.com/ruby/ruby/blob/v3_2_3/vm_trace.c#L1772-L1780

```c
            rb_atomic_t index;
            struct rb_workqueue_job *wq_job;

            while ((index = vm->postponed_job_index) > 0) {
                if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) {
                    // < -- here Ruby has claimed an index to execute AND
                    // at the same time marked it as available
                    // so if our interrupt and queue happens here, we'll
                    // clobber the previous job
                    rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1];
                    (*pjob->func)(pjob->data);
                }
            }
```

**How to test the change?**

It's really hard to trigger this without actually changing Ruby to
e.g. trigger the enqueue at the exact time. I didn't spend time on
reproducing this exact issue as I'm pretty convinced from looking at
the code.
@ivoanjo ivoanjo requested a review from a team as a code owner March 21, 2024 11:17
@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Mar 21, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.26%. Comparing base (210ed96) to head (ddd175f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3548      +/-   ##
==========================================
- Coverage   98.27%   98.26%   -0.01%     
==========================================
  Files        1274     1274              
  Lines       75114    75117       +3     
  Branches     3535     3537       +2     
==========================================
+ Hits        73815    73817       +2     
- Misses       1299     1300       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ivoanjo ivoanjo merged commit 0ccdf78 into master Mar 21, 2024
220 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/relax-expectation-profiler-flakiness branch March 21, 2024 12:18
@github-actions github-actions bot added this to the 1.22.0 milestone Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev/testing Involves testing processes (e.g. RSpec)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants