
The env var NODE_V8_COVERAGE intermittently causes the process to hang #49344

Closed
jaydenseric opened this issue Aug 27, 2023 · 25 comments · Fixed by #51290
Labels
confirmed-bug (Issues with confirmed bugs) · v8 engine (Issues and PRs related to the V8 dependency)

Comments

@jaydenseric
Contributor

Version

20.5.1

Platform

Darwin [redacted] 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:52:24 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T6000 arm64

Subsystem

No response

What steps will reproduce the bug?

Setting the environment variable NODE_V8_COVERAGE when running a Node.js script with the CLI option --test can intermittently cause the process to never exit. On such occasions (when using --test-reporter=spec), the final report of modules that were run as tests but contained no tests, as well as the final summary of the test/suite/etc. counts, never prints to the console. Running the same script without NODE_V8_COVERAGE always succeeds as expected.

For several months I tried to put together a minimal reproduction for a bug report, but only encountered the issue in large private codebases at work. You can see discussion about it here:

https://node-js.slack.com/archives/C3910A78T/p1686119658586369

Finally I have encountered the issue in one of my public open source repos and can share it here. Clone this branch:

https://github.com/jaydenseric/graphql-upload/tree/jaydenseric/node-test-runner

Specifically this commit introduces the issue:

jaydenseric/graphql-upload@3b7f034

After cloning, run npm install and then run the tests script with npm run tests.

How often does it reproduce? Is there a required condition?

It reproduces perhaps 80% of the time, but for reasons I can't determine this can vary wildly, up to as much as 100% of the time, depending on when you run it.

You can prevent the environment variable NODE_V8_COVERAGE from being set by changing the tests script:

https://github.com/jaydenseric/graphql-upload/blob/3b7f034e1c2d084f314734377c4351c171b87a42/package.json#L89

-    "tests": "coverage-node --test-reporter=spec --test",
+    "tests": "node --test-reporter=spec --test",

After doing this, the script should always succeed, although no final code coverage report is printed.

What is the expected behavior? Why is that the expected behavior?

When the script intermittently succeeds, it is supposed to look like this:

[Screenshot: terminal output of a successful run]

What do you see instead?

Notice that most of the time, the script permanently hangs at this point:

[Screenshot: terminal output at the point where the script hangs]

At this time the node processes in macOS Activity Monitor look like this:

[Screenshot: node processes in macOS Activity Monitor]

If you manually quit the node process with the higher process ID and high CPU utilisation, the Node.js script exits like this:

[Screenshot: terminal output after force-quitting the node process]

Additional information

Past discussion about this issue:

https://node-js.slack.com/archives/C3910A78T/p1686119658586369

@MoLow added the test_runner label Aug 30, 2023
@MoLow
Member

MoLow commented Aug 30, 2023

unfortunately, I am not able to reproduce this

@jaydenseric
Contributor Author

jaydenseric commented Aug 30, 2023

You can see that it's not just my local machine; the GitHub Actions CI also timed out:

https://github.com/jaydenseric/graphql-upload/actions/runs/5989368434/job/16245479185#step:4:105

So I've experienced this issue on three separate machines now: work laptop, personal laptop, and CI.

@cjihrig
Contributor

cjihrig commented Sep 30, 2023

I am able to reproduce this with jaydenseric/graphql-upload@3b7f034 on macOS. It does not reproduce 100% of the time (possibly a race condition?) so I used a small bash script:

while :
do
  NODE_V8_COVERAGE=./coverage ~/iojs/node/node processRequest.test.mjs
done

I believe this is only related to NODE_V8_COVERAGE and not the test runner. Also note that the bash script does not use --test. I've also heard from other folks that this can be reproduced without the test runner (i.e., using node-tap on a different codebase).

When NODE_V8_COVERAGE is present, the inspector is enabled at startup. When the process is getting ready to exit, we call EndStartedProfilers().

I'm not yet sure if we are missing some cleanup step, but when the process hangs, we enter an infinite loop of flushing and adding new tasks:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018886a1e0 libsystem_malloc.dylib`nanov2_pointer_size + 312
    frame #1: 0x000000018886a06c libsystem_malloc.dylib`nanov2_size + 32
    frame #2: 0x0000000188869e48 libsystem_malloc.dylib`free + 148
    frame #3: 0x000000010295f52c node`node::PerIsolatePlatformData::FlushForegroundTasksInternal() [inlined] std::__1::default_delete<v8::Task>::operator()(this=<unavailable>, __ptr=0x00006000013c8720) const at memory:1428:5 [opt]
    frame #4: 0x000000010295f51c node`node::PerIsolatePlatformData::FlushForegroundTasksInternal() [inlined] std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>::reset(this=<unavailable>, __p=0x0000000000000000) at memory:1689:7 [opt]
    frame #5: 0x000000010295f514 node`node::PerIsolatePlatformData::FlushForegroundTasksInternal() [inlined] std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>::~unique_ptr(this=<unavailable>) at memory:1643:19 [opt]
    frame #6: 0x000000010295f514 node`node::PerIsolatePlatformData::FlushForegroundTasksInternal() [inlined] std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>::~unique_ptr(this=<unavailable>) at memory:1643:17 [opt]
    frame #7: 0x000000010295f514 node`node::PerIsolatePlatformData::FlushForegroundTasksInternal(this=0x000000014ae073f8) at node_platform.cc:494:5 [opt]
    frame #8: 0x0000000102960b34 node`node::NodePlatform::DrainTasks(this=0x0000600002ce4000, isolate=<unavailable>) at node_platform.cc:457:25 [opt]
    frame #9: 0x00000001028311b0 node`node::FreeEnvironment(env=<unavailable>) at environment.cc:513:15 [opt]
    frame #10: 0x000000010293ead8 node`node::NodeMainInstance::Run() [inlined] node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>::operator()(this=<unavailable>, pointer=0x000000014b033800) const at util.h:675:39 [opt]
    frame #11: 0x000000010293ead0 node`node::NodeMainInstance::Run() [inlined] std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>>::reset(this=<unavailable>, __p=0x0000000000000000) at memory:1689:7 [opt]
    frame #12: 0x000000010293ead0 node`node::NodeMainInstance::Run() [inlined] std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>>::~unique_ptr(this=<unavailable>) at memory:1643:19 [opt]
    frame #13: 0x000000010293ead0 node`node::NodeMainInstance::Run() [inlined] std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>>::~unique_ptr(this=<unavailable>) at memory:1643:17 [opt]
    frame #14: 0x000000010293ead0 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:90:1 [opt]
    frame #15: 0x00000001028c76c8 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1374:24 [opt]
    frame #16: 0x00000001028c762c node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1381:27 [opt]
    frame #17: 0x00000001083f50f4 dyld`start + 520

At least some of the new tasks are coming from v8::internal::Heap::PostFinalizationRegistryCleanupTaskIfNeeded() (note that I added a printf() if you're curious about the top couple frames):

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000188a1d8d4 libsystem_kernel.dylib`__write_nocancel + 8
    frame #1: 0x000000018894ce14 libsystem_c.dylib`__swrite + 24
    frame #2: 0x000000018892c1ac libsystem_c.dylib`_swrite + 108
    frame #3: 0x000000018891e564 libsystem_c.dylib`__sfvwrite + 492
    frame #4: 0x0000000188948df8 libsystem_c.dylib`puts + 152
    frame #5: 0x000000010256389c node`non-virtual thunk to node::PerIsolatePlatformData::PostNonNestableTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>) [inlined] node::PerIsolatePlatformData::PostNonNestableTask(this=<unavailable>, task=v8::Task @ 0x00006000010abb90) at node_platform.cc:281:3 [opt]
    frame #6: 0x0000000102563890 node`non-virtual thunk to node::PerIsolatePlatformData::PostNonNestableTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>) at node_platform.cc:0 [opt]
    frame #7: 0x0000000102884418 node`v8::internal::Heap::PostFinalizationRegistryCleanupTaskIfNeeded(this=0x00000001280152b8) at heap.cc:6559:15 [opt]
    frame #8: 0x0000000102863874 node`v8::internal::FinalizationRegistryCleanupTask::RunInternal(this=0x00006000010abbd0) at finalization-registry-cleanup-task.cc:98:10 [opt]
    frame #9: 0x0000000102564880 node`node::PerIsolatePlatformData::RunForegroundTask(this=0x00000001216051e8, task=v8::Task @ 0x00006000010abbf0) at node_platform.cc:435:11 [opt]
    frame #10: 0x0000000102563504 node`node::PerIsolatePlatformData::FlushForegroundTasksInternal(this=0x00000001216051e8) at node_platform.cc:504:5 [opt]
    frame #11: 0x0000000102564b88 node`node::NodePlatform::DrainTasks(this=0x0000600002f98000, isolate=<unavailable>) at node_platform.cc:462:25 [opt]
    frame #12: 0x00000001024351b0 node`node::FreeEnvironment(env=<unavailable>) at environment.cc:513:15 [opt]
    frame #13: 0x0000000102542ad8 node`node::NodeMainInstance::Run() [inlined] node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>::operator()(this=<unavailable>, pointer=0x0000000121822c00) const at util.h:675:39 [opt]
    frame #14: 0x0000000102542ad0 node`node::NodeMainInstance::Run() [inlined] std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>>::reset(this=<unavailable>, __p=0x0000000000000000) at memory:1689:7 [opt]
    frame #15: 0x0000000102542ad0 node`node::NodeMainInstance::Run() [inlined] std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>>::~unique_ptr(this=<unavailable>) at memory:1643:19 [opt]
    frame #16: 0x0000000102542ad0 node`node::NodeMainInstance::Run() [inlined] std::__1::unique_ptr<node::Environment, node::FunctionDeleter<node::Environment, &node::FreeEnvironment(node::Environment*)>>::~unique_ptr(this=<unavailable>) at memory:1643:17 [opt]
    frame #17: 0x0000000102542ad0 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:90:1 [opt]
    frame #18: 0x00000001024cb6c8 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1374:24 [opt]
    frame #19: 0x00000001024cb62c node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1381:27 [opt]
    frame #20: 0x0000000107e090f4 dyld`start + 520

@nodejs/v8 does anything immediately stand out to you?

@cjihrig
Contributor

cjihrig commented Sep 30, 2023

Both codebases where I've observed this use FinalizationRegistry. Adding the following code to my reproduction seems to fix the hang (although it's obviously not a proper solution):

globalThis.FinalizationRegistry = function(){};
globalThis.FinalizationRegistry.prototype.register = function(){};

There seems to be an issue with Node's use of the inspector interacting with FinalizationRegistry.

@cjihrig
Contributor

cjihrig commented Oct 3, 2023

Possibly related issue with D8: https://bugs.chromium.org/p/v8/issues/detail?id=14281.

@tniessen added the v8 engine label Oct 4, 2023
@joyeecheung added the confirmed-bug label Oct 4, 2023
@joyeecheung
Member

Possibly related: #47452. Maybe we can check whether the bug goes away with that PR?

@cjihrig
Contributor

cjihrig commented Oct 4, 2023

I just tried that patch. Unfortunately, I still observed the process hang.

@joyeecheung
Member

Can this be bisected to a particular commit? (I would expect it to be one of the V8 upgrades; perhaps we can just test the versions around V8 upgrades with the repro.)

@targos
Member

targos commented Oct 4, 2023

Was it introduced in 20.0.0 or a later update?

@targos
Member

targos commented Oct 4, 2023

There has (sadly) been no V8 upgrade since v20.0.0.

@cjihrig
Contributor

cjihrig commented Oct 4, 2023

It has been tricky to bisect because it doesn't reproduce all the time. For the reproduction in this issue, I'm only seeing it on Node 20 (which would be convenient because of the V8 update in 20.0.0). However, in an unrelated codebase, I first witnessed what I believe to be the same bug on Node 18 back in February; at the time I thought it was just a bug in c8, as it went away with Node 19.

I believe @mcollina has observed the bug on versions other than 20. @isaacs has seen an issue in node-tap related to source maps that seems very similar to this one as well. Unfortunately, we don't know if we are all seeing the same bug, or if there are multiple bugs that all result in a hang.

@mhdawson
Member

mhdawson commented Oct 4, 2023

From the discussions at the TSC meeting, it seemed there was consensus to document this in the release notes and the release announcement as a known limitation, rather than delaying the LTS promotion. The exception might be if we had a confirmed fix but needed a few more days to get it landed. As @richardlau mentioned, the decision is up to the Release team to make one way or the other.

@joyeecheung
Member

joyeecheung commented Oct 5, 2023

If we are not even sure this is a regression in v20.x, is this still relevant for the v20.x LTS promotion? It seems there are reports that this could go back as far as v18.x: #49344 (comment)

@mhofman

mhofman commented Oct 6, 2023

We have been experiencing this issue with ava as far back as Node 16 I believe: Agoric/agoric-sdk#7619

Others have been experiencing the same issue with other environments: pact-foundation/pact-js#1074

At some point we had a semi-consistent repro, at least in some environments. Since then I've successfully worked around the issue by having ava call takeCoverage() before exiting, and we have not experienced a hang since.
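
For reference, the workaround boils down to flushing the coverage that NODE_V8_COVERAGE has been collecting before the process starts tearing down. A minimal sketch, assuming you can run something shortly before exit (the beforeExit handler here is illustrative, not ava's actual integration point):

const v8 = require("node:v8");

process.on("beforeExit", () => {
  // Write the V8 coverage collected so far to the directory named by
  // NODE_V8_COVERAGE, so nothing is left for the at-exit profiler teardown.
  if (process.env.NODE_V8_COVERAGE) {
    v8.takeCoverage();
  }
});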

@mcollina
Member

@bcoe do you think you can take a look at this?

@bcoe
Contributor

bcoe commented Nov 20, 2023

@bcoe do you think you can take a look at this?

@mcollina I might have some time over the holidays to look, but I don't have any time before then.

@cjihrig how convinced are you that this is related to https://bugs.chromium.org/p/v8/issues/detail?id=14281? I could try to find someone from Chromium to take a look.

@isaacs
Contributor

isaacs commented Nov 21, 2023

@bcoe https://bugs.chromium.org/p/v8/issues/detail?id=14281 does seem extremely close to what's going on in node. The generation of the source-map-cache (which node does when NODE_V8_COVERAGE is set) makes use of the FinalizationRegistry. If you look up a lot of source maps (150-200 or so) right as the isolate is about to quit, it'll trigger a hang pretty reliably, especially if some of those source map files can't be found. I was able to avoid it in node-tap by looking them up proactively, so that I only ever have 0 or 1 at process exit.
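
For anyone who wants to see the shape of this without a large test suite, here is a rough illustrative sketch, not the node-tap code or Node's internal source-map cache, of the pattern described above: many FinalizationRegistry registrations whose targets become garbage right before exit.

const registry = new FinalizationRegistry(() => {
  // In the real source-map cache the cleanup callback evicts a cache entry;
  // here it is a no-op placeholder.
});

for (let i = 0; i < 200; i++) {
  // Each target object is unreachable immediately after registration, so
  // finalization cleanup tasks can still be queued while the environment is
  // being torn down.
  registry.register({ index: i }, `entry-${i}`);
}
// The process exits here; pending finalization work may race with teardown.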

@mcollina removed the tsc-agenda label Nov 29, 2023
@richardlau
Member

Linking #47748 here as they look similar.

@mcollina
Member

@bcoe did you have any luck with the V8 team?

@bcoe
Contributor

bcoe commented Dec 13, 2023

@cjihrig out of curiosity, what happens if you add fg.cleanupSome(); after the call to unregister in the iterable weakmap?

My guess is this helper cleans up resources immediately, rather than waiting until exit? I wonder if it happens to break the deadlock behavior. I haven't done any real digging; I was just reading the V8 bug, which says all the test262 test cases use cleanupSome.
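
To make the suggestion concrete, here is a generic sketch of an iterable-weak-map-style helper with cleanupSome() called after unregister. This is not Node's internal IterableWeakMap, and cleanupSome() is a proposal-stage API that is not exposed by default (in V8 it has typically sat behind a harmony flag), hence the feature check:

class IterableWeakMapSketch {
  #refs = new Map(); // key -> WeakRef(value)
  #registry = new FinalizationRegistry((key) => this.#refs.delete(key));

  set(key, value) {
    this.#refs.set(key, new WeakRef(value));
    this.#registry.register(value, key, value);
  }

  delete(key) {
    const value = this.#refs.get(key)?.deref();
    if (value !== undefined) this.#registry.unregister(value);
    this.#refs.delete(key);
    // The suggestion above: run pending finalization callbacks now instead
    // of leaving them queued until process teardown.
    if (typeof this.#registry.cleanupSome === "function") {
      this.#registry.cleanupSome();
    }
  }
}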

@bcoe
Contributor

bcoe commented Dec 18, 2023

@bcoe did you have any luck with the V8 team?

@mcollina yes, and I have an offer of help in the new year...

However, looking at @paulrutter's analysis with gdb in #47748, I think there's a chance this isn't a bug with v8, but an issue with JavaScript callbacks being registered with FinalizationRegistry, and then causing an exception when they execute late.

Thoughts on this line of thinking?

@mcollina
Member

mcollina commented Dec 18, 2023

I think the analysis in #47748 (comment) might actually be correct. It's something we should fix on our end before digging deep on the V8 side.

@legendecas
Member

A minimal script (#47748 (comment)) reliably reproduces the problem and generates the same call stacks mentioned above.

@mcollina added the tsc-agenda label Dec 20, 2023
@legendecas removed the tsc-agenda label Jan 10, 2024
RafaelGSS pushed a commit that referenced this issue Jan 11, 2024
At the point of `FreeEnvironment` and onwards, no JavaScript execution
associated with the Environment should be triggered.

Avoid draining platform tasks that can trigger JavaScript execution in
`FreeEnvironment`. The holder of `node::Environment` should immediately
call `node::MultiIsolatePlatform::UnregisterIsolate` and
`v8::Isolate::Dispose` to cancel pending foreground tasks and join
concurrent tasks after the environment was freed.

`NodePlatform` can properly handle the case in `RunForegroundTask` when
an Isolate out-lives its associated `node::Environment`.

PR-URL: #51290
Fixes: #47748
Fixes: #49344
Reviewed-By: Vinícius Lourenço Claro Cardoso <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
marco-ippolito pushed a commit to marco-ippolito/node that referenced this issue Jan 12, 2024
Medhansh404 pushed a commit to Medhansh404/node that referenced this issue Jan 19, 2024
richardlau pushed a commit that referenced this issue Mar 25, 2024
richardlau pushed a commit that referenced this issue Mar 25, 2024
richardlau pushed a commit that referenced this issue May 16, 2024