8338534: GenShen: Handle alloc failure differently when immediate garbage is pending #479

kdnilsen · 2024-08-21T14:48:55Z

Several changes are implemented here:

1. Re-order the phases that execute immediately after final-mark so that we do concurrent-cleanup quicker (but still after concurrent weak references).
2. After immediate garbage has been reclaimed by concurrent cleanup, notify waiting allocators.
3. If an allocation failure occurs while immediate garbage recycling is pending, stall the allocation but do not cancel the concurrent gc.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

JDK-8338534: GenShen: Handle alloc failure differently when immediate garbage is pending (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/shenandoah.git pull/479/head:pull/479
$ git checkout pull/479

Update a local copy of the PR:
$ git checkout pull/479
$ git pull https://git.openjdk.org/shenandoah.git pull/479/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 479

View PR using the GUI difftool:
$ git pr show -t 479

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/shenandoah/pull/479.diff

Webrev

Link to Webrev Comment

We just reported freeset status after rebuilding the freeset in final mark. There will be heavy contention on the heap lock at the start of evacuation as many mutator and worker threads prep their GCLAB and PLAB buffers for evacuation. We avoid some lock contention by not reporting freeset status here.

bridgekeeper · 2024-08-21T14:49:44Z

👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-08-21T14:50:57Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

kdnilsen · 2024-08-21T17:07:53Z

BTW, I can't figure out what jcheck whitespace is complaining about. I think it is reporting the wrong line number. jcheck on my local copy does not report a whitespace problem.

ysramakrishna · 2024-08-29T22:14:40Z

> BTW, I can't figure out what jcheck whitespace is complaining about. I think it is reporting the wrong line number. jcheck on my local copy does not report a whitespace problem.

Yes, for some reason git jcheck -s doesn't find it.

There is indeed trailing whitespace (two spaces, as in the error message) on an otherwise blank line in that file. The reported line number is incorrect, as you stated; it should be line 2467.

diff --git a/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp b/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
index cbcd22c053f..342c023d180 100644
--- a/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
+++ b/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
@@ -2464,7 +2464,7 @@ void ShenandoahHeap::rebuild_free_set(bool concurrent) {
   _free_set->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old_region, last_old_region, old_region_count);
   size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();
   control_thread()->anticipate_immediate_garbage(anticipated_immediate_garbage);
-  
+
   // If there are no old regions, first_old_region will be greater than last_old_region
   assert((first_old_region > last_old_region) ||
          ((last_old_region + 1 - first_old_region >= old_region_count) &&

ysramakrishna · 2024-08-29T22:26:13Z

% find . -type f -name "*pp" | xargs grep -n " $" 
./src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:2467:

mlbridge · 2024-08-30T15:03:19Z

Webrevs

00: Full (f11a3c81)

ysramakrishna

> Several changes are implemented here:
>
> 1. Re-order the phases that execute immediately after final-mark so that we do concurrent-cleanup quicker (but still after concurrent weak references)
> 2. After immediate garbage has been reclaimed by concurrent cleanup, notify waiting allocators
> 3. If an allocation failure occurs while immediate garbage recycling is pending, stall the allocation but do not cancel the concurrent gc.

As you suggested offline, I agree that it might make sense to do (1) separately, and then do (2+3).

Left a few comments mainly on (2+3), but I'm still missing the solution to the problem described in the ticket. I'll chat with you offline to get clarity.

ysramakrishna · 2024-08-30T00:36:13Z

src/hotspot/share/gc/shenandoah/shenandoahController.hpp

@@ -71,6 +73,8 @@ class ShenandoahController: public ConcurrentGCThread {
  // until another cycle runs and clears the alloc failure gc flag.
  void handle_alloc_failure(ShenandoahAllocRequest& req, bool block);

+  void anticipate_immediate_garbage(size_t anticipated_immediate_garbage_words);


Please add a one-line documentation comment on the role of the field, noting that this method sets it. Why not simply call it set_foo(value) for field _foo? I realize you want readers to get the most recently written value, hence the atomic store and load.
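A minimal sketch of what the suggested naming and documentation could look like. It uses `std::atomic` as a stand-in for HotSpot's `Atomic::store`/`Atomic::load`, and `ControllerSketch` and its member names are hypothetical, not the actual `ShenandoahController` API:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for the controller; not the HotSpot class.
class ControllerSketch {
  // Words of immediate garbage that final mark expects concurrent cleanup
  // to reclaim. Written by the GC side, read by allocating threads.
  std::atomic<std::size_t> _anticipated_immediate_garbage{0};

public:
  // Setter named after the field it writes, as the review suggests.
  // The atomic store lets readers observe the most recently written value.
  void set_anticipated_immediate_garbage(std::size_t words) {
    _anticipated_immediate_garbage.store(words);
  }

  // Matching getter that performs the atomic load.
  std::size_t anticipated_immediate_garbage() const {
    return _anticipated_immediate_garbage.load();
  }
};
```

With a `set_<field>` name, call sites such as the one in `rebuild_free_set` would read as a plain assignment of the estimate rather than as an action taken by the control thread.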

ysramakrishna · 2024-08-30T01:03:32Z

src/hotspot/share/gc/shenandoah/shenandoahController.cpp

+  if (clear_alloc_failure) {
+    _alloc_failure_gc.unset();
+    _humongous_alloc_failure_gc.unset();
+  }


For good hygiene, I'd move the variable value changes into the monitor which is held when waiting or notifying. I realize this doesn't matter for correctness, but makes debugging easier.

Further, if you protect the updates and reads of the variables with the lock, you don't need to do the extra atomic ops.

You'd need to examine all sets/gets and waits/notifies to make sure this works, but I am guessing it will, and it'll also improve performance.

However, that can be done in a separate effort if you prefer; I'm happy to file a separate ticket for that investigation/change.

I realize now that this idiom is quite pervasive in Shenandoah code, so just fixing this instance of it doesn't accomplish much at this time. I am not convinced it's a good idiom. I'll investigate this separately. I vaguely recall a discussion along these lines in an older PR.

I'll file a separate ticket for this; you can ignore this remark for the purposes of this PR.
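For reference, the idiom described in this comment might look roughly like the following. This is a standalone sketch using `std::mutex` and `std::condition_variable` rather than HotSpot's `Monitor`; the class and method names are made up for illustration:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch: the flag is read and written only while holding the
// same lock that is used for wait/notify, so a plain bool suffices.
class AllocFailureMonitorSketch {
  std::mutex _lock;
  std::condition_variable _cv;
  bool _alloc_failure_gc = false;  // guarded by _lock

public:
  void set_alloc_failure() {
    std::lock_guard<std::mutex> guard(_lock);
    _alloc_failure_gc = true;
  }

  // Clear the flag and wake all waiters under the lock.
  void clear_and_notify() {
    std::lock_guard<std::mutex> guard(_lock);
    _alloc_failure_gc = false;
    _cv.notify_all();
  }

  // Block until the flag is clear; returns immediately if it already is.
  void wait_while_alloc_failure() {
    std::unique_lock<std::mutex> guard(_lock);
    _cv.wait(guard, [this] { return !_alloc_failure_gc; });
  }

  bool is_alloc_failure() {
    std::lock_guard<std::mutex> guard(_lock);
    return _alloc_failure_gc;
  }
};
```

Because every access to the flag happens under `_lock`, no separate atomic operations are needed, which is the simplification the comment proposes to investigate in a later ticket.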

ysramakrishna · 2024-08-30T01:11:41Z

src/hotspot/share/gc/shenandoah/shenandoahConcurrentGC.cpp

-  ShenandoahHeap::heap()->free_set()->recycle_trash();
+  ShenandoahHeap* heap = ShenandoahHeap::heap();
+  if (heap->free_set()->recycle_trash()) {
+    heap->control_thread()->notify_alloc_failure_waiters(false);


Can you motivate this notification? As far as I can tell, all waiters will react to the notification by waking up, finding that the variables are still set, and consequently go back to wait.

I am sure I am missing something here. Or did you intend, but not yet make, a change that lets waiters retry the allocation after waking up and go back to sleep only if the retry fails?

A documentation comment would definitely help cross the t's and dot the i's for the reader.
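One possible reading of the "retry after waking up" behavior asked about here, sketched as standalone C++. This is an assumption about intent, not code from the PR; `RetryingWaiterSketch`, `recycled`, `cycle_finished`, and `wait_and_retry` are hypothetical names:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical model: a waiter re-attempts its allocation on every wake-up
// and goes back to sleep only if the attempt fails while the flag is set.
class RetryingWaiterSketch {
  std::mutex _lock;
  std::condition_variable _cv;
  bool _alloc_failure_gc = true;   // stays set until the cycle finishes
  std::size_t _free_words = 0;     // memory made available by cleanup

public:
  // Concurrent cleanup recycled trash: publish memory, keep the flag set.
  void recycled(std::size_t words) {
    std::lock_guard<std::mutex> guard(_lock);
    _free_words += words;
    _cv.notify_all();
  }

  // End of cycle: clear the flag and wake everyone.
  void cycle_finished() {
    std::lock_guard<std::mutex> guard(_lock);
    _alloc_failure_gc = false;
    _cv.notify_all();
  }

  // Returns true if the request was satisfied, false if the cycle ended
  // without enough memory for it.
  bool wait_and_retry(std::size_t req_words) {
    std::unique_lock<std::mutex> guard(_lock);
    while (true) {
      if (_free_words >= req_words) {
        _free_words -= req_words;
        return true;
      }
      if (!_alloc_failure_gc) {
        return false;
      }
      _cv.wait(guard);  // woken by recycled() or cycle_finished()
    }
  }
};
```

In this model a notification that leaves the flag set is still useful, because the waiter checks for available memory before it checks the flag. Without the retry step, a waiter that only tests the flag would go straight back to waiting, which is the concern raised above.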

ysramakrishna · 2024-08-30T01:23:24Z

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp

@@ -2462,6 +2462,9 @@ void ShenandoahHeap::rebuild_free_set(bool concurrent) {
  size_t young_cset_regions, old_cset_regions;
  size_t first_old_region, last_old_region, old_region_count;
  _free_set->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old_region, last_old_region, old_region_count);
+  size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();
+  control_thread()->anticipate_immediate_garbage(anticipated_immediate_garbage);
+


This is the line that has two trailing spaces; see the jcheck whitespace error above.

ysramakrishna · 2024-08-30T01:24:21Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp

 }

-void ShenandoahFreeSet::recycle_trash() {
+bool ShenandoahFreeSet::recycle_trash() {
+  bool result = false;


I'd take the opportunity to do some counting verification here.

int n_trash_regions = 0;

Read on ...

ysramakrishna · 2024-08-30T01:32:25Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp

+  _heap->control_thread()->anticipate_immediate_garbage((size_t) 0);
+  return result;


...

  clear_anticipated_immediate_garbage(n_trash_regions * HeapRegionSize);
  return n_trash_regions > 0;

with

void ...::clear_anticipated_immediate_garbage(size_t aig) {
  assert(_anticipated_immediate_garbage == aig, "Mismatch?");
  _anticipated_immediate_garbage = 0;
}

Is this an intended invariant? I think it is, but don't understand enough of the details to be certain.
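To make the proposed check concrete, here is a self-contained sketch under the assumption that the invariant does hold (the comment above leaves that open). `RegionSketch` and `FreeSetSketch` are hypothetical stand-ins, and the standard `assert` replaces HotSpot's two-argument form:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical region: only tracks whether it is trash.
struct RegionSketch {
  bool trash;
};

class FreeSetSketch {
  std::size_t _region_size_words;
  std::size_t _anticipated_immediate_garbage = 0;

public:
  explicit FreeSetSketch(std::size_t region_size_words)
      : _region_size_words(region_size_words) {}

  void set_anticipated_immediate_garbage(std::size_t words) {
    _anticipated_immediate_garbage = words;
  }

  std::size_t anticipated_immediate_garbage() const {
    return _anticipated_immediate_garbage;
  }

  // Assumed invariant: what was recycled equals what was anticipated.
  void clear_anticipated_immediate_garbage(std::size_t reclaimed_words) {
    assert(_anticipated_immediate_garbage == reclaimed_words && "Mismatch?");
    _anticipated_immediate_garbage = 0;
  }

  // Recycle every trash region, count them, and verify the tally.
  bool recycle_trash(std::vector<RegionSketch>& regions) {
    std::size_t n_trash_regions = 0;
    for (RegionSketch& r : regions) {
      if (r.trash) {
        r.trash = false;
        ++n_trash_regions;
      }
    }
    clear_anticipated_immediate_garbage(n_trash_regions * _region_size_words);
    return n_trash_regions > 0;
  }
};
```

If regions can be recycled by other paths between final mark and concurrent cleanup, the equality would not hold and the assert would need to be relaxed or dropped.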

ysramakrishna · 2024-08-30T01:36:00Z

src/hotspot/share/gc/shenandoah/shenandoahController.cpp

@@ -53,6 +53,10 @@ size_t ShenandoahController::get_gc_id() {
  return Atomic::load(&_gc_id);
 }

+void ShenandoahController::anticipate_immediate_garbage(size_t anticipated_immediate_garbage) {


Suggested rename: set_<field>, as noted further above.

ysramakrishna · 2024-08-30T01:42:11Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp

+bool ShenandoahFreeSet::try_recycle_trashed(ShenandoahHeapRegion* r) {
+  bool result = false;
  if (r->is_trash()) {
    r->recycle();
+    result = true;
  }
+  return true;
 }


If I understood your intent, I think this has a bug because it always returns true here. I believe you just wanted:

if (r->is_trash()) {
  r->recycle();
  return true;
}
return false;
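The corrected control flow can be exercised in isolation. The sketch below assumes a minimal `TrashableRegion` type in place of `ShenandoahHeapRegion` and a free function in place of the `ShenandoahFreeSet` member:

```cpp
// Hypothetical stand-in for ShenandoahHeapRegion.
struct TrashableRegion {
  bool trash = false;
  bool is_trash() const { return trash; }
  void recycle() { trash = false; }
};

// Returns true only when the region was trash and has now been recycled.
bool try_recycle_trashed(TrashableRegion* r) {
  if (r->is_trash()) {
    r->recycle();
    return true;
  }
  return false;
}
```

With the version in the diff, a caller that ORs these results together would always see true and would notify waiters even when nothing was reclaimed.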

ysramakrishna · 2024-08-30T23:03:10Z

src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp

@@ -753,6 +753,9 @@ void ShenandoahGeneration::prepare_regions_and_collection_set(bool concurrent) {
    // We are preparing for evacuation.  At this time, we ignore cset region tallies.
    size_t first_old, last_old, num_old;
    heap->free_set()->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old, last_old, num_old);
+    size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();


This makes it sound like old_cset_regions & young_cset_regions hold all regions that will be part of immediate garbage -- which indeed is the case. The name prepare_to_rebuild() does not give one much clue as to the fact that that's happening when we return from the call, and the API spec of the method does not explicitly specify it. One needs to read into the code of the method find_regions_with_alloc_capacity() to realize that this is what is happening.

In summary, firstly, I feel some of these methods are in need of tighter header file documentation of post-conditions that callers are relying on and, secondly, I feel given the extremely fat APIs (lots of reference variables that are modified by these methods) that some amount of refactoring is needed in the longer term. The refactoring should be a separate effort, but in the shorter term I think the API/spec documentation of prepare_to_rebuild and find_regions_with_alloc_capacity could be improved.

The names old_cset_regions and young_cset_regions are very confusing as well. These regions are already trash before the collection set is chosen during final mark (and so will not themselves be part of the collection set). Suggest calling them old_trashed_regions and young_trashed_regions here.
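As an illustration only, a narrower API with a documented post-condition and the suggested names might look like this. Everything here (`RegionKind`, `RebuildTallies`, `tally_trashed_regions`) is hypothetical and is not how `prepare_to_rebuild` is declared:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classification of a region for this sketch.
enum class RegionKind { YoungTrash, OldTrash, Other };

// Result object replacing several by-reference out-parameters.
struct RebuildTallies {
  std::size_t young_trashed_regions = 0;
  std::size_t old_trashed_regions = 0;
};

// Post-condition: the returned tallies count every region that is already
// trash, that is, all regions that will be reclaimed as immediate garbage.
RebuildTallies tally_trashed_regions(const std::vector<RegionKind>& regions) {
  RebuildTallies t;
  for (RegionKind k : regions) {
    if (k == RegionKind::YoungTrash) {
      ++t.young_trashed_regions;
    } else if (k == RegionKind::OldTrash) {
      ++t.old_trashed_regions;
    }
  }
  return t;
}

// Same arithmetic as in the diff, expressed with the suggested names.
std::size_t anticipated_immediate_garbage_words(const RebuildTallies& t,
                                                std::size_t region_size_words) {
  return (t.old_trashed_regions + t.young_trashed_regions) * region_size_words;
}
```

Returning a small struct keeps the post-condition in one documented place, which is the short-term documentation improvement asked for; the larger refactoring remains a separate effort.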

ysramakrishna · 2024-08-30T23:19:57Z

src/hotspot/share/gc/shenandoah/shenandoahOldGeneration.cpp

@@ -474,6 +474,9 @@ void ShenandoahOldGeneration::prepare_regions_and_collection_set(bool concurrent
    size_t cset_young_regions, cset_old_regions;
    size_t first_old, last_old, num_old;
    heap->free_set()->prepare_to_rebuild(cset_young_regions, cset_old_regions, first_old, last_old, num_old);
+    size_t anticipated_immediate_garbage = (cset_young_regions + cset_old_regions) * ShenandoahHeapRegion::region_size_words();
+    heap->control_thread()->anticipate_immediate_garbage(anticipated_immediate_garbage);


Suggest set_<foo> for a method that changes the value of field <foo>.

earthling-amzn · 2024-08-30T23:57:22Z

src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp

@@ -753,6 +753,9 @@ void ShenandoahGeneration::prepare_regions_and_collection_set(bool concurrent) {
    // We are preparing for evacuation.  At this time, we ignore cset region tallies.
    size_t first_old, last_old, num_old;
    heap->free_set()->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old, last_old, num_old);
+    size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();


The names old_cset_regions and young_cset_regions are very confusing as well. These regions are already trash before the collection set is chosen during final mark (and so will not themselves be part of the collection set). Suggest calling them old_trashed_regions and young_trashed_regions here.

earthling-amzn · 2024-08-31T00:02:31Z

src/hotspot/share/gc/shenandoah/shenandoahController.cpp

@@ -65,11 +69,12 @@ void ShenandoahController::handle_alloc_failure(ShenandoahAllocRequest& req, boo
                 req.type_string(),
                 byte_size_in_proper_unit(req.size() * HeapWordSize), proper_unit_for_byte_size(req.size() * HeapWordSize));

-    // Now that alloc failure GC is scheduled, we can abort everything else
-    heap->cancel_gc(GCCause::_allocation_failure);
+    if (Atomic::load(&_anticipated_immediate_garbage) < req.size()) {


To make sure I understand... here we are saying that if final mark anticipates this much immediate garbage (computed when it rebuilt the freeset after choosing the collection set), then we aren't going to cancel the GC if this particular request could be satisfied. Instead we will block as though the gc has already been cancelled. This thread will be notified when concurrent cleanup completes.
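The decision described in this comment can be written down as a small pure function. This is a sketch of the reviewer's reading of the diff, with hypothetical names (`AllocFailureAction`, `decide_on_alloc_failure`); the real code works on `ShenandoahAllocRequest` and calls `cancel_gc`:

```cpp
#include <cstddef>

// Hypothetical outcome of handling an allocation failure.
enum class AllocFailureAction {
  CancelGC,   // anticipated immediate garbage cannot satisfy the request
  StallOnly   // block the allocator and let concurrent cleanup finish
};

// Mirrors the condition in the diff: cancel only when the anticipated
// immediate garbage is smaller than the request size (both in words).
AllocFailureAction decide_on_alloc_failure(std::size_t anticipated_immediate_garbage_words,
                                           std::size_t req_words) {
  if (anticipated_immediate_garbage_words < req_words) {
    return AllocFailureAction::CancelGC;
  }
  return AllocFailureAction::StallOnly;
}
```

Note that the comparison considers one request at a time, so several stalled allocators could each pass the check against the same pool of anticipated garbage.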

earthling-amzn · 2024-08-31T00:07:59Z

src/hotspot/share/gc/shenandoah/shenandoahController.cpp

-  _alloc_failure_gc.unset();
-  _humongous_alloc_failure_gc.unset();
+void ShenandoahController::notify_alloc_failure_waiters(bool clear_alloc_failure) {
+  if (clear_alloc_failure) {


Why would we not clear the alloc failure? This seems like it would confuse the control thread. Isn't this going to have the control thread attempt to notify alloc failure waiters again when the cycle is finished?

bridgekeeper · 2024-09-28T00:24:52Z

@kdnilsen This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

ysramakrishna · 2024-10-16T20:06:41Z

@kdnilsen : Should this become a draft for now?

bridgekeeper · 2024-11-14T00:42:50Z

@kdnilsen This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

kdnilsen added 26 commits February 26, 2024 00:13

Remove dead code for inelastic plabs

7bb1d38

Revert "Remove dead code for inelastic plabs"

8bc4367

This reverts commit 7bb1d38.

Round LAB sizes down rather than up to force alignment

99cce53

When we round up, we introduce the risk that the new size exceeds the maximum LAB size, resulting in an assertion error.

Revert "Round LAB sizes down rather than up to force alignment"

11b26bb

This reverts commit 99cce53.

Merge branch 'openjdk:master' into master

941d8aa

Merge branch 'openjdk:master' into master

39c5885

Make satb-mode Info logging less verbose

28a382b

Merge branch 'openjdk:master' into master

a43675a

Change behavior of max_old and min_old

d881300

Revert "Change behavior of max_old and min_old"

c2cb1b7

This reverts commit d881300.

Merge branch 'openjdk:master' into master

141fec1

Merge branch 'openjdk:master' into master

bac08f0

Merge branch 'openjdk:master' into master

84f27d7

Merge branch 'openjdk:master' into master

118f5b1

Merge branch 'openjdk:master' into master

5312029

Merge branch 'openjdk:master' into master

56567b0

Merge branch 'openjdk:master' into master

25ee3f5

Merge branch 'openjdk:master' into master

c076aa3

Merge branch 'openjdk:master' into master

ff99de7

Merge branch 'openjdk:master' into master

b8b4e42

Reorder concurrent cleanup following final mark

1c26ae0

This may allow us to reclaim immediate garbage more quickly.

Notify waiting allocators after cleanup of immediate garbage

4198b75

Do not clear alloc-failure flag after cleanup early

408f789

If we notify waiting mutators after immediate garbage is reclaimed, do not clear the alloc-failure flag. Otherwise, this hides the fact that alloc failure occurred during this GC.

Improve comments regarding when we can recycle trashed regions

9e9c54c

Do not cancel GC on allocation failure if immediate garbage is adequate

8461939

kdnilsen added 2 commits August 21, 2024 15:05

Whitespace

02fd5c1

Experiment with white space jcheck

5c0bb7d

Fix whitespace

f11a3c8

openjdk bot added the rfr Pull request is ready for review label Aug 30, 2024

ysramakrishna reviewed Aug 30, 2024

earthling-amzn suggested changes Aug 31, 2024
