Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential startup regression since on Node.js >= 22 #180

Closed
RafaelGSS opened this issue Oct 26, 2024 · 57 comments · Fixed by nodejs/node#56275
Closed

Potential startup regression since on Node.js >= 22 #180

RafaelGSS opened this issue Oct 26, 2024 · 57 comments · Fixed by nodejs/node#56275

Comments

@RafaelGSS
Copy link
Member

Hi folks,

I've been investigating some areas of regression on Node.js 22 and apparently, the Node.js startup has degraded. I've run a pretty simple benchmark

➜  nodejs-internal-benchmark git:(main) ✗ hyperfine --warmup 20 -- './node18 semicolon.js'
Benchmark 1: ./node18 semicolon.js
  Time (mean ± σ):      19.5 ms ±   0.3 ms    [User: 16.1 ms, System: 2.2 ms]
  Range (min … max):    18.6 ms …  21.0 ms    142 runs

➜  nodejs-internal-benchmark git:(main) ✗ hyperfine --warmup 20 -- './node20 semicolon.js'
Benchmark 1: ./node20 semicolon.js
  Time (mean ± σ):      19.6 ms ±   0.3 ms    [User: 16.3 ms, System: 2.0 ms]
  Range (min … max):    18.9 ms …  20.6 ms    142 runs

➜  nodejs-internal-benchmark git:(main) ✗ hyperfine --warmup 20 -- './node22 semicolon.js'
Benchmark 1: ./node22 semicolon.js
  Time (mean ± σ):      27.5 ms ±   0.4 ms    [User: 23.9 ms, System: 2.1 ms]
  Range (min … max):    26.7 ms …  28.5 ms    102 runs

➜  nodejs-internal-benchmark git:(main) ✗ hyperfine --warmup 20 -- './node23 semicolon.js'
Benchmark 1: ./node23 semicolon.js
  Time (mean ± σ):      30.5 ms ±   0.4 ms    [User: 26.8 ms, System: 2.4 ms]
  Range (min … max):    29.8 ms …  31.6 ms    92 runs

Where the semicolon is just a js file with ;

Although this regression isn't confirmed by our benchmarks:

misc/startup-core.js n=30 mode='process' script='test/fixtures/semicolon'                    ***      5.82 %       ±1.01% ±1.34% ±1.74%

Please note, the hyperfine benchmarks were run on a local macOS machine -- I couldn't run it in a dedicated one as I'm travelling and I don't have access via SSH to my machines and the nodejs benchmark (misc/startup-core.js) was run in a dedicated aws machine

Does someone has an idea? cc @joyeecheung

@joyeecheung
Copy link
Member

I cannot reproduce on a Linux server though can reproduce on an arm64 macbook. There has been a regression in the range of 0522ac086c...ed49722a8a, most likely due to the v8 11.8 upgrade nodejs/node#49639

@joyeecheung
Copy link
Member

joyeecheung commented Oct 28, 2024

Some regressions can also be observed from d8.

❯ hyperfine --warmup 3 "v8-mac-arm64-rel-11.3.244/d8 semicolon.js" "v8-mac-arm64-rel-11.7.174/d8 semicolon.js"
Benchmark 1: v8-mac-arm64-rel-11.3.244/d8 semicolon.js
  Time (mean ± σ):       6.9 ms ±   0.4 ms    [User: 2.2 ms, System: 4.5 ms]
  Range (min … max):     6.0 ms …   8.0 ms    313 runs

Benchmark 2: v8-mac-arm64-rel-11.7.174/d8 semicolon.js
  Time (mean ± σ):       8.2 ms ±   0.5 ms    [User: 2.9 ms, System: 5.6 ms]
  Range (min … max):     7.4 ms …   9.6 ms    297 runs

Summary
  v8-mac-arm64-rel-11.3.244/d8 semicolon.js ran
    1.20 ± 0.11 times faster than v8-mac-arm64-rel-11.7.174/d8 semicolon.js

@joyeecheung
Copy link
Member

joyeecheung commented Oct 28, 2024

Whatever it is it seems to be happening at the global level, because --version regressed, even after I simply added

  for (int i = 1; i < argc; i++) {  // Start from 1 to skip program name
    if (strcmp(argv[i], "--version") == 0) {
      printf("%s\n", NODE_VERSION);
      return 0;
    }
  }

to the main function to skip everything of substance later.

❯ hyperfine --warmup 5  "~/.nvs/node/20.18.0/arm64/bin/node --version" "~/.nvs/node/23.1.0/arm64/bin/node --version" "out/Release/node --version"
Benchmark 1: ~/.nvs/node/20.18.0/arm64/bin/node --version
  Time (mean ± σ):      13.1 ms ±   0.2 ms    [User: 11.2 ms, System: 1.3 ms]
  Range (min … max):    12.5 ms …  13.7 ms    198 runs

Benchmark 2: ~/.nvs/node/23.1.0/arm64/bin/node --version
  Time (mean ± σ):      27.7 ms ±   0.3 ms    [User: 25.3 ms, System: 1.7 ms]
  Range (min … max):    26.7 ms …  28.4 ms    100 runs

Benchmark 3: out/Release/node --version
  Time (mean ± σ):      27.8 ms ±   0.5 ms    [User: 24.6 ms, System: 1.3 ms]
  Range (min … max):    26.6 ms …  29.1 ms    101 runs

Summary
  ~/.nvs/node/20.18.0/arm64/bin/node --version ran
    2.11 ± 0.04 times faster than ~/.nvs/node/23.1.0/arm64/bin/node --version
    2.12 ± 0.05 times faster than out/Release/node --version

@lemire
Copy link
Member

lemire commented Oct 28, 2024

I checked on that on a Linux server with x64, there is a measurable but tiny regression.

node 20

$ hyperfine --warmup 5 "node --version"
Benchmark 1: node --version
  Time (mean ± σ):       4.1 ms ±   2.1 ms    [User: 2.2 ms, System: 1.9 ms]
  Range (min … max):     1.8 ms …   8.8 ms    304 runs

node 23

$ hyperfine --warmup 5 "node --version"
Benchmark 1: node --version
  Time (mean ± σ):       4.6 ms ±   2.2 ms    [User: 2.5 ms, System: 2.1 ms]
  Range (min … max):     2.6 ms …   9.3 ms    277 runs

@lemire
Copy link
Member

lemire commented Oct 28, 2024

In my previous comment, regarding linux/x64, the regression is small and possibly irrelevant.

However, the regression under ARM-based macOS seems significant.

Let us create a file with the following content console.log("hello").

Node 20 (linux x64)

$ hyperfine --warmup 5 "node hello.js"
Benchmark 1: node hello.js
  Time (mean ± σ):      36.6 ms ±   2.0 ms    [User: 27.3 ms, System: 9.8 ms]
  Range (min … max):    31.0 ms …  44.9 ms    72 runs

Node 23 (linux x64)

$ hyperfine --warmup 5 "node hello.js"
Benchmark 1: node hello.js
  Time (mean ± σ):      33.7 ms ±   1.8 ms    [User: 22.6 ms, System: 11.5 ms]
  Range (min … max):    27.9 ms …  41.4 ms    67 runs

Node 20 (macos ARM)

$  hyperfine --warmup 5 "node hello.js"
Benchmark 1: node hello.js
  Time (mean ± σ):      20.6 ms ±   1.7 ms    [User: 16.7 ms, System: 2.2 ms]
  Range (min … max):    19.4 ms …  34.0 ms    134 runs

Node 23 (macos ARM)

$  hyperfine --warmup 5 "node hello.js"
Benchmark 1: node hello.js
  Time (mean ± σ):      29.8 ms ±   1.2 ms    [User: 25.0 ms, System: 2.5 ms]
  Range (min … max):    28.6 ms …  37.0 ms    88 runs

For reference, bun is under 5 ms in this test on my mac.

@joyeecheung
Copy link
Member

joyeecheung commented Oct 28, 2024

With this diff applied to skip most things in the main function:

diff --git a/src/node_main.cc b/src/node_main.cc
index 3295121b87..8dfa54b61f 100644
--- a/src/node_main.cc
+++ b/src/node_main.cc
@@ -19,13 +19,14 @@
 // OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
 // USE OR OTHER DEALINGS IN THE SOFTWARE.

-#include "node.h"
 #include <cstdio>
+#include <cstring>
+#include "node.h"

 #ifdef _WIN32
-#include <windows.h>
 #include <VersionHelpers.h>
 #include <WinError.h>
+#include <windows.h>

 #define SKIP_CHECK_VAR "NODE_SKIP_PLATFORM_CHECK"
 #define SKIP_CHECK_VALUE "1"
@@ -94,6 +95,12 @@ int wmain(int argc, wchar_t* wargv[]) {
 // UNIX

 int main(int argc, char* argv[]) {
+  for (int i = 1; i < argc; i++) {  // Start from 1 to skip program name
+    if (strcmp(argv[i], "--version") == 0) {
+      printf("%s\n", NODE_VERSION);
+      return 0;
+    }
+  }
   return node::Start(argc, argv);
 }
 #endif

The regression does seem to primarily come from the V8 11.3 -> 11.8 upgrade, and specifically from global initializations as the main function isn't doing much with the patch applied:

❯  hyperfine --warmup 5 "./node-0522ac08 --version" "./node-6b76b778 --version"
Benchmark 1: ./node-0522ac08 --version
  Time (mean ± σ):      14.6 ms ±   0.5 ms    [User: 11.7 ms, System: 1.2 ms]
  Range (min … max):    13.6 ms …  16.8 ms    178 runs

Benchmark 2: ./node-6b76b778 --version
  Time (mean ± σ):      23.3 ms ±   0.5 ms    [User: 20.1 ms, System: 1.4 ms]
  Range (min … max):    22.0 ms …  24.4 ms    117 runs

Summary
  ./node-0522ac08 --version ran
    1.60 ± 0.07 times faster than ./node-6b76b778 --version

@lemire
Copy link
Member

lemire commented Oct 28, 2024

I started a Neoverse N1 node on AWS to check if the problem might be with ARM.

I think it is not. It looks like a macOS issue OR an LLVM issue OR an LLVM on macOS issue.

Node 20

Benchmark 1: node hello.js
  Time (mean ± σ):      27.3 ms ±   0.4 ms    [User: 20.4 ms, System: 6.9 ms]
  Range (min … max):    26.5 ms …  28.4 ms    103 runs

Node 23

Benchmark 1: node hello.js
  Time (mean ± σ):      23.6 ms ±   0.4 ms    [User: 15.7 ms, System: 7.9 ms]
  Range (min … max):    22.8 ms …  24.8 ms    118 runs

@joyeecheung
Copy link
Member

joyeecheung commented Oct 28, 2024

By the way testing console.log() would throw in quite a few other factors than startup alone - console.log is implemented in JS with some complex streams and a custom formatter, so you will also be checking how fast V8 execute the JS code behind console.log (there's also the difference from enabling maglev between 20 and 23 that would play a part in JS execution) and whether the JS code gets modified to regress.

@lemire
Copy link
Member

lemire commented Oct 28, 2024

@joyeecheung Sure. But the effect is large (~ 10 ms).

10 ms is quite a large effect. Especially if you account for the fact that bun can print "hello" in 5 ms.

@joyeecheung
Copy link
Member

joyeecheung commented Oct 28, 2024

That should be a separate issue about the performance of console.log(), which is kind of a beast of its own due to the compatibility burdens of the underlying (monkey-patchable) stream implementation. This issue is specifically about startup regression, and it probably comes down to figuring out what happened during the V8 11.3 -> 11.8 upgrade to make the program load significantly slower even with a no-op main function. Looking at console.log won't really help us chase down what happened during the V8 upgrade that caused the regression, I am afraid.

@joyeecheung
Copy link
Member

joyeecheung commented Oct 28, 2024

Since we are checking global initializations here...I changed the patch to make it actually an no-op

diff --git a/src/node_main.cc b/src/node_main.cc
index f66099a557..86cc30600c 100644
--- a/src/node_main.cc
+++ b/src/node_main.cc
@@ -94,6 +94,7 @@ int wmain(int argc, wchar_t* wargv[]) {
 // UNIX

 int main(int argc, char* argv[]) {
-  return node::Start(argc, argv);
+  return 0;
+  //return node::Start(argc, argv);
 }
 #endif

With the patch applied to make the main function a no-op, the regression continues (86cb23d0 is the last commit updating the gypfiles for 11.8 upgrade, 5633c622 is where main is today).

❯ hyperfine --warmup 5 "./node-0522ac08" "./node-86cb23d0" "./node-5633c622"
Benchmark 1: ./node-0522ac08
  Time (mean ± σ):      13.5 ms ±   0.4 ms    [User: 11.8 ms, System: 1.2 ms]
  Range (min … max):    12.7 ms …  15.4 ms    187 runs

Benchmark 2: ./node-86cb23d0
  Time (mean ± σ):      22.1 ms ±   0.4 ms    [User: 20.2 ms, System: 1.3 ms]
  Range (min … max):    21.3 ms …  23.4 ms    120 runs

Benchmark 3: ./node-5633c622
  Time (mean ± σ):      27.1 ms ±   0.4 ms    [User: 25.0 ms, System: 1.4 ms]
  Range (min … max):    26.1 ms …  28.1 ms    106 runs

Summary
  ./node-0522ac08 ran
    1.64 ± 0.06 times faster than ./node-86cb23d0
    2.00 ± 0.07 times faster than ./node-5633c622

@lemire
Copy link
Member

lemire commented Oct 28, 2024

Let us use node --version as a reference and let us stretch out to a wide range of versions.

Node 16

  Time (mean ± σ):       7.7 ms ±   1.2 ms    [User: 5.8 ms, System: 0.8 ms]
  Range (min … max):     7.0 ms …  25.6 ms    303 runs

Node 17

  Time (mean ± σ):       8.8 ms ±   1.6 ms    [User: 6.9 ms, System: 0.8 ms]
  Range (min … max):     8.1 ms …  28.7 ms    280 runs

Node 18

  Time (mean ± σ):       8.7 ms ±   2.5 ms    [User: 7.0 ms, System: 0.9 ms]
  Range (min … max):     7.7 ms …  39.0 ms    274 runs

Node 19

  Time (mean ± σ):       9.3 ms ±   1.7 ms    [User: 7.6 ms, System: 1.1 ms]
  Range (min … max):     8.5 ms …  32.4 ms    250 runs

Node 20

  Time (mean ± σ):       9.8 ms ±   1.1 ms    [User: 8.1 ms, System: 1.1 ms]
  Range (min … max):     9.1 ms …  21.5 ms    253 runs

Node 21

  Time (mean ± σ):      13.9 ms ±   1.9 ms    [User: 12.8 ms, System: 0.9 ms]
  Range (min … max):    13.2 ms …  38.9 ms    181 runs

Node 22

  Time (mean ± σ):      16.4 ms ±   1.5 ms    [User: 14.5 ms, System: 1.0 ms]
  Range (min … max):    15.6 ms …  33.2 ms    156 runs

Node 23

  Time (mean ± σ):      20.2 ms ±   2.1 ms    [User: 18.3 ms, System: 1.1 ms]
  Range (min … max):    18.9 ms …  42.1 ms    128 runs

Analysis

So it seems that the last few versions of Node.js have added well over 12 ms to the startup time on some macOS systems.

The regression does not appear to have happened all at once.

@joyeecheung
Copy link
Member

@just-js mentioned that it might be a binary size threshold, just to comment that I previously tried to build with --without-intl to shrink the size by ~30MB and it didn't seem to make a difference (though of course others can try to see if that reproduce to make sure that's not a false elimination too)

@billywhizz
Copy link

i've tried running on my 8th generation core i5 laptop on Ubuntu 22.04 (kernel 6.8.0.48-generic) and i can see a regression on linux using both hyperfine and andrew kelley's poop in a couple of node 22 releases before the latest one, which seems to be working normally again.

we can see using poop for those releases the number of cpu instructions executed was far higher than before or after.

i will see if i can run the same benchmarks on macos over the weekend and will report back. i'll also try to capture more detail on the instructions executed if i can repro the issue there.

the script i am running is basically this

./node --version
ls -la ./node
hyperfine --warmup 20 "./node semicolon.js" 2>/dev/null
sudo poop "./node semicolon.js"
v20.11.0
-rwxr-xr-x 1 andrew andrew 96331384 Nov  7 16:04 ./node-20
Benchmark 1: ./node-20 semicolon.js
  Time (mean ± σ):      24.3 ms ±   0.2 ms    [User: 16.2 ms, System: 8.3 ms]
  Range (min … max):    24.0 ms …  25.5 ms    120 runs
 
Benchmark 1 (198 runs): ./node-20 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          25.2ms ±  959us    24.1ms … 35.4ms          1 ( 1%)        
  peak_rss           39.1MB ±  143KB    38.3MB … 39.2MB         38 (19%)        
  cpu_cycles         55.3M  ± 2.75M     52.7M  … 84.7M           4 ( 2%)        
  instructions       80.4M  ±  130K     80.1M  … 80.8M           4 ( 2%)        
  cache_references   2.13M  ± 43.7K     2.06M  … 2.38M          14 ( 7%)        
  cache_misses        380K  ± 54.9K      304K  …  604K           1 ( 1%)        
  branch_misses       388K  ± 24.8K      366K  …  582K          11 ( 6%)        

v21.7.3
-rwxr-xr-x 1 andrew andrew 102202728 Nov  7 16:05 ./node-21
Benchmark 1: ./node-21 semicolon.js
  Time (mean ± σ):      22.7 ms ±   0.1 ms    [User: 15.4 ms, System: 7.0 ms]
  Range (min … max):    22.3 ms …  23.0 ms    128 runs
 
Benchmark 1 (213 runs): ./node-21 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          23.4ms ±  700us    22.5ms … 27.3ms         13 ( 6%)        
  peak_rss           40.2MB ± 41.8KB    40.0MB … 40.4MB         13 ( 6%)        
  cpu_cycles         51.4M  ± 1.89M     49.6M  … 63.6M          11 ( 5%)        
  instructions       77.4M  ±  130K     77.0M  … 77.8M           2 ( 1%)        
  cache_references   1.83M  ± 44.2K     1.70M  … 2.03M          23 (11%)        
  cache_misses        322K  ± 56.5K      257K  …  448K           0 ( 0%)        
  branch_misses       357K  ± 21.1K      342K  …  539K          23 (11%)        

v22.6.0
-rwxr-xr-x 1 andrew andrew 119322416 Nov  7 16:05 ./node-22
Benchmark 1: ./node-22 semicolon.js
  Time (mean ± σ):      32.7 ms ±   0.5 ms    [User: 25.2 ms, System: 7.1 ms]
  Range (min … max):    32.3 ms …  36.7 ms    89 runs
 
Benchmark 1 (147 runs): ./node-22 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          34.0ms ± 1.58ms    32.6ms … 49.1ms          2 ( 1%)        
  peak_rss           46.0MB ± 52.9KB    45.7MB … 46.0MB         14 (10%)        
  cpu_cycles         85.0M  ± 4.77M     81.8M  …  133M           6 ( 4%)        
  instructions        142M  ±  146K      142M  …  143M           0 ( 0%)        
  cache_references   2.02M  ± 53.2K     1.89M  … 2.32M           7 ( 5%)        
  cache_misses        396K  ± 71.8K      300K  …  587K           0 ( 0%)        
  branch_misses       559K  ± 58.8K      527K  … 1.15M          13 ( 9%)        

v22.7.0
-rwxr-xr-x 1 andrew andrew 116859992 Nov  7 16:05 ./node-22-7
Benchmark 1: ./node-22-7 semicolon.js
  Time (mean ± σ):      32.5 ms ±   0.2 ms    [User: 24.9 ms, System: 7.3 ms]
  Range (min … max):    32.1 ms …  32.9 ms    89 runs
 
Benchmark 1 (148 runs): ./node-22-7 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          33.7ms ±  980us    32.5ms … 37.1ms          6 ( 4%)        
  peak_rss           45.7MB ± 47.2KB    45.5MB … 45.7MB         15 (10%)        
  cpu_cycles         84.2M  ± 2.66M     81.4M  … 94.8M           5 ( 3%)        
  instructions        143M  ±  187K      142M  …  143M           4 ( 3%)        
  cache_references   2.06M  ± 38.9K     1.94M  … 2.21M           4 ( 3%)        
  cache_misses        404K  ± 77.8K      303K  …  539K           0 ( 0%)        
  branch_misses       551K  ± 30.3K      524K  …  701K          10 ( 7%)        

v22.11.0
-rwxr-xr-x 1 andrew andrew 117383456 Nov  7 16:05 ./node-22-11
Benchmark 1: ./node-22-11 semicolon.js
  Time (mean ± σ):      21.2 ms ±   0.1 ms    [User: 13.9 ms, System: 6.9 ms]
  Range (min … max):    20.7 ms …  21.7 ms    137 runs
 
Benchmark 1 (225 runs): ./node-22-11 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          22.2ms ± 1.22ms    21.1ms … 33.5ms         22 (10%)        
  peak_rss           42.6MB ± 63.1KB    42.1MB … 42.7MB         32 (14%)        
  cpu_cycles         45.8M  ± 2.91M     43.5M  … 69.6M          17 ( 8%)        
  instructions       75.2M  ±  142K     74.8M  … 75.7M           6 ( 3%)        
  cache_references   1.16M  ± 38.2K     1.06M  … 1.35M          23 (10%)        
  cache_misses        355K  ± 65.7K      277K  …  499K           0 ( 0%)        
  branch_misses       273K  ± 25.8K      256K  …  446K          30 (13%)        

v23.0.0
-rwxr-xr-x 1 andrew andrew 120543848 Nov  7 16:05 ./node-23
Benchmark 1: ./node-23 semicolon.js
  Time (mean ± σ):      20.7 ms ±   0.1 ms    [User: 12.9 ms, System: 7.4 ms]
  Range (min … max):    20.5 ms …  21.2 ms    138 runs
 
Benchmark 1 (232 runs): ./node-23 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          21.5ms ±  777us    20.6ms … 24.6ms         23 (10%)        
  peak_rss           43.5MB ±  135KB    42.7MB … 43.5MB         36 (16%)        
  cpu_cycles         43.6M  ± 2.10M     41.6M  … 51.6M          21 ( 9%)        
  instructions       69.2M  ± 100.0K    69.0M  … 69.6M           4 ( 2%)        
  cache_references   1.15M  ± 34.7K     1.05M  … 1.34M          22 ( 9%)        
  cache_misses        355K  ± 57.5K      281K  …  491K           0 ( 0%)        
  branch_misses       269K  ± 24.1K      251K  …  433K          24 (10%)        

v23.1.0
-rwxr-xr-x 1 andrew andrew 121761992 Oct 24 13:22 ./node-23-1
Benchmark 1: ./node-23-1 semicolon.js
  Time (mean ± σ):      21.7 ms ±   1.0 ms    [User: 13.3 ms, System: 8.1 ms]
  Range (min … max):    21.1 ms …  29.0 ms    134 runs
 
Benchmark 1 (223 runs): ./node-23-1 semicolon.js
  measurement          mean ± σ            min … max           outliers
  wall_time          22.3ms ± 1.15ms    21.2ms … 33.5ms         22 (10%)        
  peak_rss           43.9MB ±  123KB    43.5MB … 43.9MB         42 (19%)        
  cpu_cycles         44.2M  ± 2.84M     41.9M  … 69.7M          21 ( 9%)        
  instructions       69.3M  ± 96.9K     69.1M  … 69.8M           4 ( 2%)        
  cache_references   1.15M  ± 37.0K     1.07M  … 1.39M          21 ( 9%)        
  cache_misses        357K  ± 58.6K      285K  …  528K           0 ( 0%)        
  branch_misses       272K  ± 26.3K      252K  …  471K          20 ( 9%)     

@billywhizz
Copy link

billywhizz commented Nov 9, 2024

this is what i am seeing on Apple M3 Max across node releases from v16 to v23 with a selection of v22 releases for good measure. it's a bit of a mixed picture but overall we can see almost 2x difference between the best (node 18.20) and the worst (node 23.1). i would imagine the binary size will have some effect here given it's 1.5x bigger between v23 and v18 🥴 . will have a look tomorrow to see if i can capture some more detail on what the differences are using other tools.

andrew@Andrews-MacBook-Pro node-test % ls -la ./node-*
-rwxr-xr-x  1 andrew  wheel   78614896  8 Aug  2023 ./node-16-20
-rwxr-xr-x  1 andrew  wheel   79202736  1 Jun  2022 ./node-17-9
-rwxr-xr-x  1 andrew  wheel   87121776  8 Jul 13:26 ./node-18-20
-rwxr-xr-x  1 andrew  wheel   88683840 10 Apr  2023 ./node-19-9
-rwxr-xr-x  1 andrew  wheel   94556304  3 Oct 07:09 ./node-20-18
-rwxr-xr-x  1 andrew  wheel  100725664 10 Apr  2024 ./node-21-7
-rwxr-xr-x  1 andrew  wheel  111818288  8 Nov 23:54 ./node-22-1
-rwxr-xr-x  1 andrew  wheel  119121584  8 Nov 23:50 ./node-22-11
-rwxr-xr-x  1 andrew  wheel  113470352  8 Nov 23:50 ./node-22-4
-rwxr-xr-x  1 andrew  wheel  125356400  8 Nov 23:52 ./node-23-1
andrew@Andrews-MacBook-Pro node-test % hyperfine --warmup 100 --runs 200 "./node-16-20 semicolon.js" "./node-17-9 semicolon.js" "./node-18-20 semicolon.js" "./node-19-9 semicolon.js" "./node-20-18 semicolon.js" "./node-21-7 semicolon.js" "./node-22-1 semicolon.js" "./node-22-4 semicolon.js" "./node-22-11 semicolon.js" "./node-23-1 semicolon.js" 2>/dev/null
Benchmark 1: ./node-16-20 semicolon.js
  Time (mean ± σ):      24.1 ms ±   0.4 ms    [User: 20.0 ms, System: 2.7 ms]
  Range (min … max):    23.8 ms …  28.2 ms    200 runs
 
Benchmark 2: ./node-17-9 semicolon.js
  Time (mean ± σ):      28.9 ms ±   0.4 ms    [User: 20.4 ms, System: 2.9 ms]
  Range (min … max):    28.0 ms …  32.1 ms    200 runs
 
Benchmark 3: ./node-18-20 semicolon.js
  Time (mean ± σ):      20.4 ms ±   0.4 ms    [User: 16.0 ms, System: 2.9 ms]
  Range (min … max):    20.0 ms …  23.8 ms    200 runs
 
Benchmark 4: ./node-19-9 semicolon.js
  Time (mean ± σ):      26.1 ms ±   0.6 ms    [User: 17.0 ms, System: 3.2 ms]
  Range (min … max):    24.6 ms …  29.9 ms    200 runs
 
Benchmark 5: ./node-20-18 semicolon.js
  Time (mean ± σ):      20.7 ms ±   0.3 ms    [User: 16.4 ms, System: 2.7 ms]
  Range (min … max):    20.3 ms …  23.5 ms    200 runs
 
Benchmark 6: ./node-21-7 semicolon.js
  Time (mean ± σ):      32.3 ms ±   0.6 ms    [User: 23.1 ms, System: 3.0 ms]
  Range (min … max):    30.6 ms …  36.5 ms    200 runs
 
Benchmark 7: ./node-22-1 semicolon.js
  Time (mean ± σ):      34.3 ms ±   0.6 ms    [User: 24.6 ms, System: 3.0 ms]
  Range (min … max):    32.5 ms …  39.8 ms    200 runs
 
Benchmark 8: ./node-22-4 semicolon.js
  Time (mean ± σ):      34.5 ms ±   0.6 ms    [User: 24.8 ms, System: 3.1 ms]
  Range (min … max):    32.8 ms …  37.9 ms    200 runs
 
Benchmark 9: ./node-22-11 semicolon.js
  Time (mean ± σ):      36.5 ms ±   0.7 ms    [User: 26.5 ms, System: 3.2 ms]
  Range (min … max):    34.7 ms …  41.0 ms    200 runs
 
Benchmark 10: ./node-23-1 semicolon.js
  Time (mean ± σ):      38.1 ms ±   0.7 ms    [User: 27.8 ms, System: 3.3 ms]
  Range (min … max):    36.1 ms …  42.8 ms    200 runs
 
Summary
  ./node-18-20 semicolon.js ran
    1.01 ± 0.02 times faster than ./node-20-18 semicolon.js
    1.18 ± 0.03 times faster than ./node-16-20 semicolon.js
    1.28 ± 0.04 times faster than ./node-19-9 semicolon.js
    1.41 ± 0.03 times faster than ./node-17-9 semicolon.js
    1.58 ± 0.04 times faster than ./node-21-7 semicolon.js
    1.68 ± 0.04 times faster than ./node-22-1 semicolon.js
    1.69 ± 0.04 times faster than ./node-22-4 semicolon.js
    1.78 ± 0.05 times faster than ./node-22-11 semicolon.js
    1.86 ± 0.05 times faster than ./node-23-1 semicolon.js

@billywhizz
Copy link

also, afaik there is no way to pin tasks to specific cores on macos so i ran multiple times to ensure this was nothing to do with running on efficiency or performance cores and i see consistent results across all runs, so we can rule that out i think.

Screenshot from 2024-11-09 23-38-00

@lemire
Copy link
Member

lemire commented Nov 10, 2024

@billywhizz

Intriguing.

@Uzlopak Uzlopak changed the title Potential statup regression since on Node.js >= 22 Potential startup regression since on Node.js >= 22 Nov 10, 2024
@billywhizz
Copy link

billywhizz commented Nov 10, 2024

another couple of things:

  • if i run the same benchmark on the same machine using my lo runtime, which does very little work at startup and doesn't use v8 snapshots or code caching, we can see 10% overhead between v8 version 12.0 and 12.3 / 12.4 and then another bump of ~6% overhead between v8 version 12.4 and 12.9. would need to investigate this further and do a more rigorous test across v8 versions but this maybe indicates there is a slowly increasing overhead to v8 engine startup over time. we could do the same test using d8 across different v8 releases as @joyeecheung did above.

Screenshot from 2024-11-10 12-55-33

  • and... one thing i notice in the node.js benchmarks i posted above is, if you look at the User and System cpu time compared to the wall clock time, we can see the node versions exhibiting the serious slowness have ~6-7ms more wall clock time compared to user + system time, which would seem to indicate they are spending significant time blocking/waiting for async events and not actually doing cpu bound work? 🤔

i.e.
Screenshot from 2024-11-10 13-13-55

@billywhizz
Copy link

i am no expert in tracing processes on macos but i have managed to capture flamegraphs for node 20.18 (which is fast) and node 22.11 (which is slow). am having a look now to see if there is anything useful here that might shed some light on the issue.

these are the commands i ran in order to capture these flamegraphs, assuming the two node processes are named node-20-18 and node-22-11 in the current directory:

curl -L -o stackcollapse.pl https://raw.githubusercontent.com/brendangregg/FlameGraph/refs/heads/master/stackcollapse.pl
chmod +x stackcollapse.pl
curl -L -o flamegraph.pl https://raw.githubusercontent.com/brendangregg/FlameGraph/refs/heads/master/flamegraph.pl
chmod +x flamegraph.pl
sudo dtrace -c "./node-20-18 semicolon.js" -o node-20-18.stacks -n 'profile-5000 /execname == "node-20-18"/ { @[ustack(100)] = count(); }'
./stackcollapse.pl node-20-18.stacks | ./flamegraph.pl > node-20-18.svg
sudo dtrace -c "./node-22-11 semicolon.js" -o node-22-11.stacks -n 'profile-5000 /execname == "node-22-11"/ { @[ustack(100)] = count(); }'
./stackcollapse.pl node-22-11.stacks | ./flamegraph.pl > node-22-11.svg

node-22-11 (slow)

node-22-11

node-20-18 (fast)

node-20-18

@billywhizz
Copy link

billywhizz commented Nov 10, 2024

A few things that jump out immediately:

  • node::InitializeOncePerProcessInternal takes 22% of time in the fast instance and 30% of the time in the slow instance
  • node::NodeMainInstance::NodeMainInstance constructor takes ~26% of time in both instances
  • node::NodeMainInstance::Run takes ~28% of the time in the slow instance, and ~40% of time in the fast instance. this seems to be where the useful work is being done so am assuming we want this to be as "wide" as possible when running optimally.
  • the small traces on left and right of the graph are system code being run when tearing down the process and take ~6% of time in the fast instance and ~12% of time in the slow instance

i tried a few different things in order to capture these flamegraphs, most of which were super painful and involved using xcode gui and Instruments tool etc. but then i came across this article which outlined the approach i ended up using.

i'll keep digging into this over next few days when i can find the time, but if anyone has any insight or suggestions for better ways of capturing the required detail on macos, please let me know. 🙏

@lemire
Copy link
Member

lemire commented Nov 10, 2024

Look at InitializeOncePerProcess:

Screenshot 2024-11-10 at 6 29 54 PM Screenshot 2024-11-10 at 6 29 44 PM

@billywhizz
Copy link

billywhizz commented Nov 11, 2024

Look at InitializeOncePerProcess:

interesting. i'm not sure i understand what you are suggesting though - can you explain further? if we click on node::InitializeOncePerProcessInternal in the flamegraph we can see:

  • in the slow version (22.11) it takes 30% of the total time, in the fast version (20.18) it is 22%, so a ~8% difference
  • libsystem_c.dylib`atexit accounts for 2.7% (10.87% v 8.16%) of this 8% difference. this is, afaik, a function that is called when the process exits. would have to dig into node source to see exactly what is being done here but from the flamegraph the cost is all coming from findClosestSymbol
  • OPENSSL_initcrypto acounts for 0.4% (6.52% v 6.12%)
  • ncrypto::CSPRNG accounts for another 2.5% (6.5% v 4%)
  • cppgc::InitializeProcess accounts for another 2.5% (this only exists on slow instance - version 22.11 as i think cppgc was not in use in 20.18?)

those differences add up to 2.7 + 0.4 + 2.5 + 2.5 = ~8% which is the total difference between the two versions for node::InitializeOncePerProcessInternal. obv these percentages are not super accurate as dtrace is just sampling at a 5KHz frequency. i will see if i can run some more times and if results are consistent.

so, maybe we can assume, without digging into code that cppgc being added and some differences between openssl versions in setup and teardown are the source of the differences in this particular function?

as an aside, hyperfine is probably not the best tool to investigate this problem as it will also account for the time taken tearing the process down, which we are not particularly interested in, unless we are actually concerned about the time to cycle a very small program.

and, possibly related, i have read about openssl v3 introducing some pretty serious overhead compared to 1.1. maybe this is something to do with it? but then, why would node major versions 17 and 19 also be affected? or does node use system libssl on macos? again, i would have to have a dig into the node.js source to understand exactly what is going on here. if anyone else knows already, please feel free to comment.

also, these are obv. percentages of different overall wall clock times - ~36ms versus ~20ms, so for this function that would be 30% of 36ms or 12ms versus 22% of 20ms or 4.4ms - a ~7.6ms difference in total - roughly half of the total difference we are seeing.

@billywhizz
Copy link

it seems all the node versions in question here - 17, 18, 19, 20, 22 are on openssl v3. node seems to use a fork of openssl even on macos?

@billywhizz
Copy link

billywhizz commented Nov 11, 2024

if we break down the overhead between versions 22.11 and 20.18 and multiply by the percentage of total time shown in the flamegraphs above we can see the following:

total overhead: 16 ms (36 ms for 22.11 versus 20 ms for 20.18)

Function Overhead
InitializeOncePerProcessInternal 6.5 ms
NodeMainInstance::NodeMainInstance 4 ms
NodeMainInstance::Run 1.7 ms
__exit 1.9ms
Isolate::Delete, TearDownOncePerProcess, dyld4::start, libsystem_c.dylib::exit 1.3ms

so, we are not so interested in the teardown overhead and should probably focus on overhead in InitializeOncePerProcessInternal and the NodeMainInstance::NodeMainInstance constructor.

afaict, most of the overhead in InitializeOncePerProcessInternal seems to be coming from OpenSSL initialization. Most of the overhead in NodeMainInstance::NodeMainInstance constructor seems to be coming from v8 snapshot initialization. It might be useful to find out how big those snapshot blobs are. 🤔

@lemire
Copy link
Member

lemire commented Nov 11, 2024

@billywhizz

Is there a quick way to build node with OpenSSL disabled? (Evidently, such a node version might be dysfunctional.)

i'm not sure i understand what you are suggesting though - can you explain further?

I think that your analysis leads you to the conclusion that there is significant overhead in InitializeOncePerProcessInternal. Your numbers indicate it is of significant interest.

most of the overhead in InitializeOncePerProcessInternal seems to be coming from OpenSSL initialization. Most of the overhead in NodeMainInstance::NodeMainInstance constructor seems to be coming from v8 snapshot initialization.

Though there is contradictory evidence in this thread, on a Linux box (as opposed to macOS), using various node versions, I do not see much of a regression as we advance through the versions, see #180 (comment)

You get somewhat complicated results under Linux: #180 (comment)

If the problem is OpenSSL initialization, can we link the performance regressions to changes in OpenSSL?

@richardlau
Copy link
Member

Is there a quick way to build node with OpenSSL disabled? (Evidently, such a node version might be dysfunctional.)

configure --without-ssl

@billywhizz
Copy link

@lemire

Though there is contradictory evidence in this thread, on a Linux box (as opposed to macOS), using various node versions, I do not see much of a regression as we advance through the versions

yes, i do not see any regressions on linux/x64 on my pretty old i5 machine when i use the final releases for each major node version. as i said in comment above i only saw the regressions in intermediate releases, so agree it is confusing exactly what is going on there.

but if we stick to the final release for each version then i think we can say clearly on linux/x64 on my i5 laptop performance is steadily improving, while on macos/arm64, as shown above, it seems to be getting worse. 🤷

hyperfine --warmup 100 --runs 200 "node-v17.9.1-linux-x64/bin/node semicolon.js" "node-v18.20.4-linux-x64/bin/node semicolon.js" "node-v19.9.0-linux-x64/bin/node semicolon.js" "node-v20.18.0-linux-x64/bin/node semicolon.js" "node-v21.7.3-linux-x64/bin/node semicolon.js" "node-v22.11.0-linux-x64/bin/node semicolon.js" "node-v23.2.0-linux-x64/bin/node semicolon.js" 2>/dev/null
Benchmark 1: node-v17.9.1-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      36.2 ms ±   0.8 ms    [User: 27.5 ms, System: 8.4 ms]
  Range (min … max):    35.6 ms …  45.1 ms    200 runs
 
Benchmark 2: node-v18.20.4-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      27.6 ms ±   1.3 ms    [User: 18.3 ms, System: 9.0 ms]
  Range (min … max):    27.0 ms …  43.0 ms    200 runs
 
Benchmark 3: node-v19.9.0-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      28.1 ms ±   0.2 ms    [User: 18.9 ms, System: 8.9 ms]
  Range (min … max):    27.6 ms …  28.9 ms    200 runs
 
Benchmark 4: node-v20.18.0-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      23.7 ms ±   0.6 ms    [User: 16.1 ms, System: 7.3 ms]
  Range (min … max):    23.2 ms …  29.7 ms    200 runs
 
Benchmark 5: node-v21.7.3-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      23.5 ms ±   0.5 ms    [User: 15.5 ms, System: 7.6 ms]
  Range (min … max):    23.1 ms …  29.9 ms    200 runs
 
Benchmark 6: node-v22.11.0-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      22.2 ms ±   0.4 ms    [User: 14.0 ms, System: 7.9 ms]
  Range (min … max):    21.8 ms …  26.4 ms    200 runs
 
Benchmark 7: node-v23.2.0-linux-x64/bin/node semicolon.js
  Time (mean ± σ):      21.8 ms ±   0.2 ms    [User: 13.7 ms, System: 8.1 ms]
  Range (min … max):    21.4 ms …  23.5 ms    200 runs
 
Summary
  'node-v23.2.0-linux-x64/bin/node semicolon.js' ran
    1.02 ± 0.02 times faster than 'node-v22.11.0-linux-x64/bin/node semicolon.js'
    1.08 ± 0.03 times faster than 'node-v21.7.3-linux-x64/bin/node semicolon.js'
    1.09 ± 0.03 times faster than 'node-v20.18.0-linux-x64/bin/node semicolon.js'
    1.27 ± 0.06 times faster than 'node-v18.20.4-linux-x64/bin/node semicolon.js'
    1.29 ± 0.02 times faster than 'node-v19.9.0-linux-x64/bin/node semicolon.js'
    1.66 ± 0.04 times faster than 'node-v17.9.1-linux-x64/bin/node semicolon.js'

to summarise the findings from various people above i think we can say the regression only affects arm64 and is likely down to an issue with the ssl library being used and/or v8 snapshot initialisation on arm64 linux and macos?

@lemire
Copy link
Member

lemire commented Nov 12, 2024

@richardlau

That's very helpful. We can use this to try to see if openssl is to blame.

@billywhizz

but if we stick to the final release for each version then i think we can say clearly on linux/x64 on my i5 laptop performance is steadily improving, while on macos/arm64, as shown above, it seems to be getting worse.

That's consistent with my guess that this is a macOS problem. (Meaning a problem that occurs on macOS with node.)

@billywhizz
Copy link

billywhizz commented Nov 12, 2024

ok, so i did the following on my Apple M3 Max macbook:

git clone --depth 1 --single-branch -b v22.11.0 [email protected]:nodejs/node.git
./configure
make -j 16

and to build the non-ssl version of node i did

./configure --without-ssl
make -j 16

interestingly, the binary for the default build with ssl is 104 MB whereas the officially released node.js version 22.11.0 is 114 MB, and the default version with ssl support i built locally runs ~15% faster than the official release i downloaded from the node.js downloads page. 🤔

these are my compiler/linker versions:

Xcode 16.1
Build version 16B40
Homebrew LLD 16.0.6
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix

and these are the results of hyperfine test comparing official 22.11.0 release against locally built 22.11.0 with and without ssl support:

node-startup % hyperfine --warmup 100 --runs 200 "./node-22-11 semicolon.js" "./node-22-11-new semicolon.js" "./node-22-11-nossl semicolon.js" "./node-20-18 semicolon.js" 2>/dev/null
Benchmark 1: ./node-22-11 semicolon.js
  Time (mean ± σ):      35.3 ms ±   0.6 ms    [User: 26.5 ms, System: 3.1 ms]
  Range (min … max):    33.4 ms …  40.0 ms    200 runs
 
Benchmark 2: ./node-22-11-new semicolon.js
  Time (mean ± σ):      30.4 ms ±   0.7 ms    [User: 25.5 ms, System: 3.2 ms]
  Range (min … max):    29.8 ms …  35.1 ms    200 runs
 
Benchmark 3: ./node-22-11-nossl semicolon.js
  Time (mean ± σ):      26.9 ms ±   0.4 ms    [User: 22.5 ms, System: 2.9 ms]
  Range (min … max):    26.5 ms …  31.1 ms    200 runs
 
Benchmark 4: ./node-20-18 semicolon.js
  Time (mean ± σ):      19.9 ms ±   0.4 ms    [User: 16.4 ms, System: 2.7 ms]
  Range (min … max):    19.6 ms …  23.6 ms    200 runs
 
Summary
  ./node-20-18 semicolon.js ran
    1.35 ± 0.03 times faster than ./node-22-11-nossl semicolon.js
    1.53 ± 0.04 times faster than ./node-22-11-new semicolon.js
    1.77 ± 0.05 times faster than ./node-22-11 semicolon.js

so, a local build is ~12% faster without ssl support than local build with ssl support. not a huge difference and obv. not as fast as the v20 official release with ssl support.

i can/will run some more tests now i have everything building locally. if anyone has any suggestions for further tests i can do please let me know. also, @richardlau, @RafaelGSS, @joyeecheung - are there specific flags that are set when building a release version of node in CI that i should enable locally? right now i am just using the defaults.

PS - something interesting to note is fact that user + sys time for local ssl build is only 1ms less than the officially released version but wall clock time is 5ms less. this would i think indicate there is some blocking/async work happening in the official release that isn't happening with my local build. 🤷‍♂️

@lemire
Copy link
Member

lemire commented Nov 12, 2024

@billywhizz

i'm not clear on your reasoning for this. can you explain a bit more? if i look at what libraries the binary is dynamically linking it's only the following. and why would this be getting worse over time if we are not linking against more external libs?

It is a theory, an hypothesis. It could be wrong.

I don't know at this point what the issue is.

But...

You can see visually dyld taking more time on your flamegraphs.

So dyld definitively takes more time. It is just a fact.

I do expect that the libraries used are the same, but I do not think that it implies that time used by dyld will be the same. The only way to check my theory would be to statically link or prebind the shared libraries, but I don't think you can generally statically link under macOS, and I am not sure prebinding is currently supported.

@billywhizz
Copy link

billywhizz commented Nov 12, 2024

@lemire

I do expect that the libraries used are the same, but I do not think that it implies that time used by dyld will be the same. The only way to check my theory would be to statically link or prebind the shared libraries, but I don't think you can generally statically link under macOS, and I am not sure prebinding is currently supported.

thanks for the explanation. yes, makes sense from what we see on the flamegraph. i'll see if i can capture any more useful info but this seems to be very painful on macos. yes - no way to do a fully static link on macos as far as i am aware. 😞

if we drill down in the flamegraph we can see the issue is related to atexit and threads somehow, but i cannot figure out why. it's quite possible these flamegraphs are not very accurate too?
Screenshot from 2024-11-12 22-13-56

@billywhizz
Copy link

i tried another little experiment, compiling node with shared sqlite and zlib and without ssl and some other options and we see a slight improvement. so, i'm not sure in general dynamic linking is the issue here. i'll see if i can get a flamegraph of the non-ssl version next. the binary is only 53 MB when compiled with these flags. (node-22-11-min below)

./configure --without-ssl --without-intl --enable-lto --shared-sqlite --shared-zlib --without-npm --without-node-options --without-amaro --without-corepack --without-node-code-cache --disable-single-executable-application
hyperfine --warmup 100 --runs 200 "./node-22-11 semicolon.js" "./node-22-11-new semicolon.js" "./node-22-11-nossl semicolon.js" "./node-20-18 semicolon.js" "./node-22-11-min semicolon.js" 2>/dev/null
Benchmark 1: ./node-22-11 semicolon.js
  Time (mean ± σ):      35.4 ms ±   0.7 ms    [User: 26.6 ms, System: 3.0 ms]
  Range (min … max):    33.4 ms …  39.7 ms    200 runs
 
Benchmark 2: ./node-22-11-new semicolon.js
  Time (mean ± σ):      29.4 ms ±   0.3 ms    [User: 25.7 ms, System: 2.9 ms]
  Range (min … max):    28.9 ms …  31.8 ms    200 runs
 
Benchmark 3: ./node-22-11-nossl semicolon.js
  Time (mean ± σ):      26.1 ms ±   0.4 ms    [User: 22.7 ms, System: 2.7 ms]
  Range (min … max):    25.5 ms …  28.8 ms    200 runs
 
Benchmark 4: ./node-20-18 semicolon.js
  Time (mean ± σ):      20.0 ms ±   0.4 ms    [User: 16.6 ms, System: 2.6 ms]
  Range (min … max):    19.5 ms …  23.5 ms    200 runs
 
Benchmark 5: ./node-22-11-min semicolon.js
  Time (mean ± σ):      24.9 ms ±   0.4 ms    [User: 21.3 ms, System: 2.4 ms]
  Range (min … max):    24.1 ms …  28.6 ms    200 runs
 
Summary
  ./node-20-18 semicolon.js ran
    1.24 ± 0.03 times faster than ./node-22-11-min semicolon.js
    1.30 ± 0.03 times faster than ./node-22-11-nossl semicolon.js
    1.47 ± 0.03 times faster than ./node-22-11-new semicolon.js
    1.77 ± 0.05 times faster than ./node-22-11 semicolon.js

@RafaelGSS
Copy link
Member Author

Hi, I'll try to catch up when I get some time. I don't have much bandwidth right now.

@billywhizz
Copy link

Hi, I'll try to catch up when I get some time. I don't have much bandwidth right now.

no worries. i will dump as much info as i can find here and we can hopefully make some sense of it all. 🙏

@lemire
Copy link
Member

lemire commented Nov 12, 2024

@billywhizz

i tried another little experiment, compiling node with shared sqlite and zlib and without ssl and some other options and we see a slight improvement. so, i'm not sure in general dynamic linking is the issue here. i'll see if i can get a flamegraph of the non-ssl version next. the binary is only 53 MB when compiled with these flags.

Bun is 58MB and it runs in about 7 ms on my mac with the same benchmark.

One maybe significant difference between bun and node is that node depends on Core Foundation whereas bun does not appear to do so...

@billywhizz
Copy link

billywhizz commented Nov 12, 2024

here is the flamegraph for the non-ssl locally built node version 22.11.0.
node-22-11

also uploaded here as github seems to screw with the svg files i upload somehow.

@RafaelGSS
Copy link
Member Author

RafaelGSS commented Nov 12, 2024

Maybe we should change the communication to the slack channel (#nodejs-core-performance) and send only assertive thoughts on the issue? It should make it easier for others (like me) to contribute to something.

@billywhizz
Copy link

@lemire

Bun is 58MB and it runs in about 7 ms on my mac with the same benchmark.

One maybe significant difference between bun and node is that node depends on Core Foundation whereas bun does not appear to do so...

yes bun is crazy fast for a micro bench like this. i think is more to do with JSC than anything. i have tried building a minimal runtime on v8 for macos and best i can do is ~15ms, which is roughly same as deno. while bun on same hardware is ~7ms. on linux the situation is the opposite ime - bun/JSC is almost 2x slower than a minimal v8 runtime. is possible v8 snapshots are problematic here, but i would need to investigate further.

it also could be down to fact we are also measuring isolate/process teardown time when using hyperfine because in experiments i have done on v8 we can bootstrap a v8 isolate in ~6ms and i am pretty sure last time i checked i can startup and teardown v8 isolates faster (on linux) than bun/JSC can. so many things to test! 😭

@billywhizz
Copy link

billywhizz commented Nov 12, 2024

It seems like Snapshot::Initialize is expensive.

yes - that is my feeling from looking at those flamegraphs. i will see if i can build without snapshot but i imagine that will be slower again. hard to tell without benchmarking.

@billywhizz
Copy link

Maybe we should change the communication to the slack channel (#nodejs-core-performance) and send only assertive thoughts on the issue? It should make it easier for others (like me) to contribute to something.

sure. sorry for spamming the issue with lots of results. i am just trying to go through as many scenarios as i can rn. @lemire are you good to continue discussion on slack?

@billywhizz
Copy link

continuing the convo here. hopefully that works for you @lemire.

@joyeecheung
Copy link
Member

joyeecheung commented Nov 12, 2024

Have you tried making the main function a no-op and benchmark again, like what I did in #180 (comment)? My testing showed that whatever that regressed the startup mostly happened before/after the main function is run. After the V8 11.3 -> 11.8 upgrade the Node.js binary with a no-op main function regressed for about 8ms (or before the V8 upgrade it was 1.6x faster). I am a bit surprised that we are still focusing on what happens inside the main function after 15 days of that comment...

Also on Snapshot::Initialize: if you look at the flame graph you will notice that in recent versions of Node.js no internal JS code is compiled at all when executing a CJS script, that's because we moved the internal JS code compilation into build time and serialized the bytecode into the snapshot. You may see a percentage increase in snapshot initialization but that was just because the code are deserialized instead of being compiled now, which is what makes the startup overall faster on other platforms, so it's compensated with all the V8 compile/parse frames disappearing from the flamegraph. You could check if snapshot played a difference at all by using --without-node-snapshot runtime flag and my previous showed that it didn't matter. But then again, my testing also showed that whatever happened in the main function didn't matter because the regression persists even if you make the main function a no-op...

@billywhizz
Copy link

I am a bit surprised that we are still focusing on what happens inside the main function after 15 days of that comment...

that's my fault. sorry if i sidetracked the investigation. at least i learned a lot more about how to debug/trace these issues on macos. i can spend some time over next few days following your advice. will post in slack if i find anything useful. 🙏

@joyeecheung
Copy link
Member

joyeecheung commented Nov 13, 2024

No worries, maybe I wasn't clear enough about my findings. Also, I am not 100% sure about the theory unless someone reproduce on their machine (I was able to reproduce this on two different generation of MacBooks, though). Thanks for helping out and looking into this!

@joyeecheung
Copy link
Member

I tried to profile the empty main function binary on macOS - the binary with an empty main function already takes ~25ms on my laptop, meanwhile v23.x running an empty script takes 32ms, so I was trying to figure out what it's trying to do in the 25ms outside of the main function. Unfortunately neither Instruments, dtrace nor samply gave me adequate information because it seems neither of these work well when you are trying to profile what happens outside of the main function...

I do have a theory, though, since it isn't actually doing much outside of the main function (not enough for a sampling profiler to pick up, anyway), I suspect it could be some compilation option changed in the V8 11.3 -> 11.8 upgrade that turned off some optimization or introduced something that would make the loading of the executable more costly. That could explain why this is macOS only because the compiler commands are typically generated in a platform-specific manner. I haven't looked into the actual compiler commands or how they differed along the years, though.

@billywhizz
Copy link

@joyeecheung i am wondering about best approaches for investigating this. tracing in any serious way on macos seems to require booting into recovery mode afaict? i can see if that sheds any light, but probably not til next weekend.

apart from that i have been looking at the build.gn options defined for v8 in the relevant builds of node.js, but it's going to be incredibly tedious work trying to bisect the differences between them and seeing if they make a difference by trial and error. wondering if it would be possible to embed these options in the node runtime going forward so they could easily be dumped with a cli flag at least? 🤔

@joyeecheung
Copy link
Member

joyeecheung commented Nov 26, 2024

wondering if it would be possible to embed these options in the node runtime going forward so they could easily be dumped with a cli flag at least?

You could be talking about process.config but that is pretty high-level, the compiler options are not embedded because er, well they can be pretty verbose and big ;)

tracing in any serious way on macos seems to require booting into recovery mode afaict? i can see if that sheds any light

Yes, that's dtrace/dtruss, I tried to use them but they didn't really reveal anything interesting to me (for something as short-lived as ~20ms, I could sample it at 5000Hz at most and didn't get much samples anyways, and didn't find anything particuarly interesting in dtruss output, but I might not be using it correctly/using a buggy version of it).

@targos
Copy link
Member

targos commented Nov 26, 2024

We don't build Node.js with gn, but with gyp. I don't really see any change to compiler options in nodejs/node#49639
Relevant diff: nodejs/node@162a065...86cb23d

@joyeecheung
Copy link
Member

joyeecheung commented Dec 4, 2024

Following up my theory in #180 (comment) I built V8's Node.js fork and the benchmark result is....amusing:

From Node.js main branch:

❯ hyperfine -i "out/Release/node --version"
Benchmark 1: out/Release/node --version
  Time (mean ± σ):      26.3 ms ±   1.4 ms    [User: 23.7 ms, System: 1.2 ms]
  Range (min … max):    25.3 ms …  35.0 ms    80 runs

From V8's fork

❯ hyperfine -i "out/Release/node --version"
Benchmark 1: out/Release/node --version
  Time (mean ± σ):       8.4 ms ±   0.4 ms    [User: 7.3 ms, System: 0.7 ms]
  Range (min … max):     7.9 ms …  11.3 ms    270 runs

I feel that we should probably take a look at their configuration (granted, it's gn, but something might have been missing in the gyp -> gn porting)

@joyeecheung
Copy link
Member

joyeecheung commented Dec 4, 2024

Dumping the v8_features.json:

{
  "v8_deprecation_warnings": true,
  "v8_enable_31bit_smis_on_64bit_arch": false,
  "v8_enable_conservative_stack_scanning": false,
  "v8_enable_direct_handle": false,
  "v8_enable_extensible_ro_snapshot": true,
  "v8_enable_gdbjit": false,
  "v8_enable_hugepage": false,
  "v8_enable_i18n_support": true,
  "v8_enable_javascript_promise_hooks": true,
  "v8_enable_lite_mode": false,
  "v8_enable_map_packing": false,
  "v8_enable_object_print": false,
  "v8_enable_pointer_compression": true,
  "v8_enable_pointer_compression_shared_cage": true,
  "v8_enable_sandbox": false,
  "v8_enable_shared_ro_heap": true,
  "v8_enable_short_builtin_calls": true,
  "v8_enable_v8_checks": false,
  "v8_enable_webassembly": true,
  "v8_enable_zone_compression": false,
  "v8_imminent_deprecation_warnings": false,
  "v8_optimized_debug": true,
  "v8_random_seed": "314159265",
  "v8_use_perfetto": false,
  "v8_use_siphash": false
}

TIL they enabled pointer compression in their CI (I thought they were consistent with the main branch). Maybe that's the key because I did notice a particularly pricy mprotect call in the dtruss output. But I don't know the right way to disable it yet (or it might not build without pointer compression and need to be rolled back a bunch of commits).

Also V8's fork uses their own llvm/clang/libc++ and all that (chromium, you know), so that might be important as well..

@joyeecheung
Copy link
Member

There were some attempts in nodejs/node#56138 to give LTO a try though I think that's a false positive, see nodejs/node#56138 (comment) and nodejs/node#56138 (comment) - also AFAICT, V8's fork doesn't use LTO, judging by the fact that it takes normal time to build and there aren any flags hinting real LTO in the linker commands (--icf=all --no-call-graph-profile-sort might make a bit of difference, but probably not a lot?)

[1/14] TOOL_VERSION=1733348695 ../../build/toolchain/apple/linker_driver.py -Wcrl,driver,../../third_party/llvm-build/Release+Asserts/bin/clang++ -Wcrl,strippath,/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/strip -Wcrl,installnametoolpath,../../third_party/llvm-build/Release+Asserts/bin/llvm-install-name-tool  -Wl,--no-warn-duplicate-rpath -Wl,-no_warn_duplicate_libraries -Werror -fuse-ld=lld -Wl,-fatal_warnings -Wl,--icf=all -Wl,--color-diagnostics -Wl,--strict-auto-link -Wl,--no-call-graph-profile-sort --target=arm64-apple-macos -Wl,-mllvm,-enable-machine-outliner=never -no-canonical-prefixes -nostdlib++ -isysroot ../../../../../../../Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.0.sdk -mmacos-version-min=11.0 -Wl,-ObjC -Wl,-dead_strip -Wl,--icf=none -o "./mksnapshot" "@./mksnapshot.rsp"
[14/14] TOOL_VERSION=1733348695 ../../build/toolchain/apple/linker_driver.py -Wcrl,driver,../../third_party/llvm-build/Release+Asserts/bin/clang++ -Wcrl,strippath,/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/strip -Wcrl,installnametoolpath,../../third_party/llvm-build/Release+Asserts/bin/llvm-install-name-tool  -Wl,--no-warn-duplicate-rpath -Wl,-no_warn_duplicate_libraries -Werror -fuse-ld=lld -Wl,-fatal_warnings -Wl,--icf=all -Wl,--color-diagnostics -Wl,--strict-auto-link -Wl,--no-call-graph-profile-sort --target=arm64-apple-macos -Wl,-mllvm,-enable-machine-outliner=never -no-canonical-prefixes -nostdlib++ -isysroot ../../../../../../../Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.0.sdk -mmacos-version-min=11.0 -Wl,-ObjC -o "./node" "@./node.rsp"

@joyeecheung
Copy link
Member

joyeecheung commented Dec 8, 2024

Tried switching to newer version of clang + lld (build from this commit) and --icf=all --no-call-graph-profile-sort though it doesn't seem to make anything faster, or actually it seems to be a tiny bit slower, so we can eliminate those (compared to the V8 build, this doesn't use a custom build of libc++, though I doubt that is the key, because the regression happened between two builds that both dynamically link to system libc++)

❯ hyperfine "./node-c4aa34aa --version" "./node-clang-20-lld --version" --warmup=10
Benchmark 1: ./node-c4aa34aa --version
  Time (mean ± σ):      30.4 ms ±   2.2 ms    [User: 26.5 ms, System: 2.2 ms]
  Range (min … max):    29.2 ms …  39.9 ms    85 runs

Benchmark 2: ./node-clang-20-lld --version
  Time (mean ± σ):      32.0 ms ±   1.9 ms    [User: 28.4 ms, System: 2.1 ms]
  Range (min … max):    31.2 ms …  43.2 ms    80 runs

Summary
  './node-c4aa34aa --version' ran
    1.05 ± 0.10 times faster than './node-clang-20-lld --version'

@billywhizz
Copy link

FWIW:

i see these results using the head of node-ci and building with default args versus latest node23 downloaded from node.js download page.

root@inspiron:/tmp/foo# ./node23 --version
v23.3.0
root@inspiron:/tmp/foo# ./node-ci --version
v24.0.0-pre
root@inspiron:/tmp/foo# cat /proc/cpuinfo | head -n 5 | grep "model name" | cut -d " " -f 3,4,5
Intel(R) Core(TM) i5-8250U
root@inspiron:/tmp/foo# nice -n 20 taskset --cpu-list 0  hyperfine --warmup 50 --runs 100 "./node-ci semi.js" "./node23 semi.js" 2>/dev/null
Benchmark 1: ./node-ci semi.js
  Time (mean ± σ):      18.0 ms ±   0.2 ms    [User: 11.6 ms, System: 6.0 ms]
  Range (min … max):    17.8 ms …  19.1 ms    100 runs
 
Benchmark 2: ./node23 semi.js
  Time (mean ± σ):      21.5 ms ±   0.3 ms    [User: 13.1 ms, System: 8.1 ms]
  Range (min … max):    21.3 ms …  23.3 ms    100 runs
 
Summary
  './node-ci semi.js' ran
    1.19 ± 0.02 times faster than './node23 semi.js'

root@inspiron:/tmp/foo# ldd ./node-ci
	linux-vdso.so.1 (0x00007ffce95e6000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000777165d19000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000777169c96000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000777165a00000)
	/lib64/ld-linux-x86-64.so.2 (0x0000777169ccd000)
root@inspiron:/tmp/foo# ldd ./node23
	linux-vdso.so.1 (0x00007ffe405f7000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007d402d374000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007d402d000000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007d402d28d000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007d402d26d000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007d402d268000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007d402cc00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007d402d390000)
andrew@Mac node-ci % ./node23 --version                                                    
v23.3.0
andrew@Mac node-ci % ./node-ci --version                                                    
v24.0.0-pre
andrew@Mac node-ci % model 2>/dev/null | head -n 5 | tail -n 1                             
Apple M3 Max
andrew@Mac node-ci % hyperfine --warmup 50 --runs 100 "./node-ci semi.js" "./node23 semi.js" 2>/dev/null
Benchmark 1: ./node-ci semi.js
  Time (mean ± σ):      15.6 ms ±   0.1 ms    [User: 12.1 ms, System: 2.4 ms]
  Range (min … max):    15.4 ms …  15.9 ms    100 runs
 
Benchmark 2: ./node23 semi.js
  Time (mean ± σ):      37.2 ms ±   0.2 ms    [User: 27.5 ms, System: 3.0 ms]
  Range (min … max):    36.4 ms …  37.7 ms    100 runs
 
Summary
  ./node-ci semi.js ran
    2.39 ± 0.02 times faster than ./node23 semi.js

andrew@Mac node-ci % otool -L ./node-ci
./node-ci:
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 3107.0.0)
	/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 3107.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
andrew@Mac node-ci % otool -L ./node23
./node23:
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1971.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1500.65.0)

so... huge difference on M3 Max but very small difference on Ubuntu 22.04/8th gen i5. on linux, we can see libc++ is statically linked for node-ci but not for v23. seems to be same on macos but node-ci is linking against an extra foundation library.

@billywhizz
Copy link

billywhizz commented Dec 9, 2024

if i just do --version on same setup:

linux/i5

nice -n 20 taskset --cpu-list 0  hyperfine --warmup 100 --runs 200 "./node-ci --version" "./node23 --version" 2>/dev/null
Benchmark 1: ./node-ci --version
  Time (mean ± σ):       3.1 ms ±   0.1 ms    [User: 1.2 ms, System: 2.0 ms]
  Range (min … max):     3.0 ms …   3.9 ms    200 runs
 
Benchmark 2: ./node23 --version
  Time (mean ± σ):       3.8 ms ±   0.1 ms    [User: 1.7 ms, System: 2.1 ms]
  Range (min … max):     3.6 ms …   4.5 ms    200 runs
 
Summary
  './node-ci --version' ran
    1.21 ± 0.06 times faster than './node23 --version'

macos/m3max

hyperfine --warmup 100 --runs 200 "./node23 --version" "./node-ci --version" 2>/dev/null
Benchmark 1: ./node23 --version
  Time (mean ± σ):      30.9 ms ±   0.6 ms    [User: 23.0 ms, System: 1.9 ms]
  Range (min … max):    29.4 ms …  35.7 ms    200 runs
 
Benchmark 2: ./node-ci --version
  Time (mean ± σ):       9.3 ms ±   0.1 ms    [User: 7.2 ms, System: 1.2 ms]
  Range (min … max):     9.1 ms …   9.9 ms    200 runs
 
Summary
  ./node-ci --version ran
    3.31 ± 0.07 times faster than ./node23 --version

you can use this shell script to build node-ci on macos or linux:

mkdir google
cd google
git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
export PATH=$(pwd)/depot_tools:$PATH
git clone https://chromium.googlesource.com/v8/node-ci
gclient config https://chromium.googlesource.com/v8/node-ci --unmanaged
gclient sync
CPUS=$(nproc --all)
cd node-ci
make clean
make deps
make -j $CPUS node

i did a clean build on the machines i have here. macos M3Max is >10x quicker to do a clean build on 16 threads versus ubuntu 22.04 on 8th gen core i5 / 8 threads.

machine time threads cpuseconds
M3Pro 5m 45s 16 5520
i5 60m 8 28800

if there are any other tests i can run over coming days let me know @joyeecheung.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 16, 2024

I think I found out what's happening, which was along the lines of my theory in #180 (comment) - in the v8 upgrade, there were a lot of templated instantiations added, leading to an explosion of weak symbols that are resolved at run time (I think we weren't excluding them very well in the first place, but the explosion of weak symbols from V8's use of templates made the problem more obvious), and then it continue to regress as V8 adds, er, more templated functions..

It's also easy to understand why V8's fork is free from this regression - their builds applies -fvisibility=hidden properly.

Fix in: nodejs/node#56275 - locally the startup time is slightly faster than 20.x (as we are properly excluding the symbols now).

$ hyperfine "./node-c4aa34aa --version" "out/Release/node --version" "/Users/joyee/.nvs/node/20.18.0/arm64/bin/node --version" --warmup 10
Benchmark 1: ./node-c4aa34aa --version
  Time (mean ± σ):      28.5 ms ±   1.5 ms    [User: 25.8 ms, System: 1.9 ms]
  Range (min … max):    27.5 ms …  34.9 ms    89 runs

Benchmark 2: out/Release/node --version
  Time (mean ± σ):      12.1 ms ±   1.3 ms    [User: 8.8 ms, System: 1.9 ms]
  Range (min … max):    11.3 ms …  18.2 ms    179 runs

Benchmark 3: /Users/joyee/.nvs/node/20.18.0/arm64/bin/node --version
  Time (mean ± σ):      14.2 ms ±   1.0 ms    [User: 12.1 ms, System: 1.7 ms]
  Range (min … max):    13.5 ms …  20.0 ms    158 runs

Summary
  'out/Release/node --version' ran
    1.17 ± 0.15 times faster than '/Users/joyee/.nvs/node/20.18.0/arm64/bin/node --version'
    2.35 ± 0.27 times faster than './node-c4aa34aa --version'

Also, it produces a smaller binary

$ ls -lah out/Release/node
-rwxr-xr-x  1 joyee  staff   108M Dec 16 22:39 out/Release/node
$ ls -lah ./node-c4aa34aa
-rwxr-xr-x  1 joyee  staff   118M Dec  5 12:39 ./node-c4aa34aa

aduh95 pushed a commit to nodejs/node that referenced this issue Jan 2, 2025
V8 should be built with -fvisibility=hidden, otherwise
the resulting binary would contain unnecessary symbols. In
particular, on macOS, this leads to 5000+ weak symbols
resolved at runtime, leading to a startup regression.

On macOS this also reduces the binary size about ~10MB.

It's only enabled on macOS in this patch as gcc can time out
or run out of memory on some machines in the CI with
-fvisibility=hidden.

PR-URL: #56275
Fixes: nodejs/performance#180
Reviewed-By: Juan José Arboleda <[email protected]>
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Rafael Gonzaga <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Geoffrey Booth <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Michaël Zasso <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Chengzhong Wu <[email protected]>
Reviewed-By: Marco Ippolito <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants