Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 wrangler dev error: Received signal #11: Segmentation fault: 11 #1422

Closed
ds300 opened this issue Nov 21, 2023 · 30 comments
Closed

🐛 wrangler dev error: Received signal #11: Segmentation fault: 11 #1422

ds300 opened this issue Nov 21, 2023 · 30 comments

Comments

@ds300
Copy link

ds300 commented Nov 21, 2023

Hello!

Recently we've been seeing this issue with one of our worker scripts (we have three, the other two seem fine) where after some time being used in dev mode it will segfault and become unresponsive. This seems to happen anywhere from 10 seconds to 10 minutes after booting. I have not been able to find a way to reproduce it on demand, but it will happen every time, at some point.

This is on MacOS, using M1 chips.

The full error that is printed when this happens is (--log-level debug)

*** Received signal #11: Segmentation fault: 11
stack: 

There are always other errors logged out above, but they seem to change every time and are maybe unrelated since they seem to happen all the time regardless of how long it takes to hit the segfault. Here's an example (again --log-level debug)

workerd/io/worker-entrypoint.c++:218: info: exception = kj/compat/http.c++:2673: disconnected: WebSocket disconnected between frames without sending `Close`.
stack: 105822c40 10689dd88 105824bd8 10529a004 10506a108 10506a517 104f40d78
workerd/io/worker-entrypoint.c++:218: info: exception = kj/compat/http.c++:2673: disconnected: WebSocket disconnected between frames without sending `Close`.
stack: 105822c40 10689dd88 105824bd8 105824d97 10689dd88 105842d8c 105824bd8 10529a004 10506a108 10506a517 104f40d78
workerd/io/worker-entrypoint.c++:218: info: exception = kj/compat/http.c++:2673: disconnected: WebSocket disconnected between frames without sending `Close`.
stack: 105822c40 10689dd88 105824bd8 105824d97 10689dd88 105842d8c 105824bd8 105824d97 10689dd88 105842d8c 10529a004 10506a108 10506a517 104f40d78
workerd/server/server.c++:2838: error: Uncaught exception: kj/compat/http.c++:2673: disconnected: worker_do_not_log; Request failed due to internal error
stack: 105822c40 10689dd88 105824bd8 105824d97 10689dd88 105842d8c 105824bd8 105824d97 10689dd88 105842d8c 10529a004 10506a108 10506a517 104f40d78 104a85c38 10584ebdc 10584fb84 10

workerd/io/worker-entrypoint.c++:218: info: exception = kj/async.c++:220: disconnected: other end of WebSocketPipe was destroyed

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = P��-p��-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = `�-(��@��-$(���|�.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = �

stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component

stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = 0�- -.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = (o� L2
P�-�0��-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:306: failed: expected part.findFirst('/') == kj::none [(can't stringify) == (can't stringify)]; '/' character in path component; did you mean to use Path::parse()?; part = http://fake-host/w/27a79965-642c-44eb-8bd4-0e857bb0554e/presence.sqlite
stack: 1068e91a7 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = (o� L2
P�-�0��-��-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = P��-`�-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [0 == 71]; NUL character in path component; part = �ٙ-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = @\�-�ٙ-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [0 == 71]; NUL character in path component; part = .sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104
*** Received signal #11: Segmentation fault: 11
stack:

Let me know if any other info I can provide would be helpful.

For now I'm going to try to wrap the wrangler dev process in a retry loop.

@thecatanon
Copy link

thecatanon commented Nov 23, 2023

Also seeing this (M1), but I don't use alarms at all, so I don't get any of the additional logs you provided.

Maybe it's just something with Durable Objects in general? I'm also using Hibernatable Websockets if that helps narrow it down.

I thought I was going crazy 😅

@ds300
Copy link
Author

ds300 commented Nov 23, 2023

We are also using the socket hibernation api.

@ds300
Copy link
Author

ds300 commented Nov 23, 2023

in case anybody else needs it, here is the script I wrote to wrap wrangler and restart it after a segfault.

dev.ts

import { ChildProcessWithoutNullStreams, spawn } from 'child_process'
import stripAnsi from 'strip-ansi'

class WranglerMonitor {
	private process: ChildProcessWithoutNullStreams | null = null

	public start(): void {
		this.stop() // Ensure any existing process is stopped
		console.log(`Starting wrangler...`)
		this.process = spawn('wrangler', ['dev', '--env', 'dev'], {
			env: {
				NODE_ENV: 'development',
				...process.env,
			},
		})

		this.process.stdout.on('data', (data: Buffer) => {
			this.handleOutput(stripAnsi(data.toString().replace('\r', '').trim()))
		})

		this.process.stderr.on('data', (data: Buffer) => {
			this.handleOutput(stripAnsi(data.toString().replace('\r', '').trim()), true)
		})
	}

	private handleOutput(output: string, err = false): void {
		if (!output) return
		if (output.includes('Segmentation fault')) {
			console.error('Segfault detected. Restarting Wrangler...')
			this.restart()
		} else if (!err) {
			console.log(output.replace('[mf:inf]', ''))
		}
	}

	private restart(): void {
		console.log('Restarting wrangler...')
		this.stop()
		setTimeout(() => this.start(), 100) // Restart after a short delay
	}

	private stop(): void {
		if (this.process) {
			this.process.kill()
			this.process = null
		}
	}
}

new WranglerMonitor().start()

Run it with npx tsx <filename>. bun might work too.

There's probably a bash one-liner that can do the same thing but i am not bashfully gifted.

@kentonv
Copy link
Member

kentonv commented Nov 27, 2023

This looks like the same bug as #1386.

@thecatanon
Copy link

This looks like the same bug as #1386.

Unfortunately still seeing this on v1.20231206.0 which closes that issue, but I'm also not using alarms. Once I have some spare time I'll get a minimum reproduction and more detailed system stats, as it happens extremely often for me.

@thecatanon
Copy link

I've spent way too much time trying to get a reproduction that is consistent - there isn't one.

Here's a log from one of the crashes, hopefully I left enough unredacted:
crash.log

This appears to only be affecting M1 chips, and one of the contributing factors is this.state.acceptWebSocket in a Durable Object. Removing it from our durable objects seem to stop the segfaults (but with this, who knows, honestly)

However, we have other projects that do not segfault. This leads us to believe it could be because of

  • Script size
  • The amount of bindings we use (Available in the crash log)
  • The code of the Durable Object that powers the WebSocket (we call D1 before accepting the WebSocket, for example)

Here's the code of the Durable Object I'm primarily testing with:

export class MyDurableObject {
	env: CloudflareEnv;
	state: DurableObjectState;

	constructor(state: DurableObjectState, env: CloudflareEnv) {
		const e = envSchema.safeParse(env);
		if (!e.success) throw new Error(...);
		const parsedEnv = {
			...env,
			...e.data
		};
		this.env = parsedEnv;
		this.state = state;
	}

	async fetch(request: Request) {
		const ctx = getContext(request) // D1 calls, etc
		if (isWebSocket(request)) {
			console.log('GOT WEBSOCKET REQUEST');
			if (!ctx.user) return new Response('NOUSER', { status: 404 });

			if (this.state.id.toString() !== ctx.live.idFromName(ctx.user.id).toString()) {
				return new Response('', { status: 404 });
			}

			const webSocketPair = new WebSocketPair();
			const [client, server] = Object.values(webSocketPair);
			this.state.acceptWebSocket(server);

			return new Response(null, {
				status: 101,
				webSocket: client
			});
		}

		console.log('SENDING USER MESSAGE');

		try {
			const data = await request.json();
			for (const socket of this.state.getWebSockets()) {
				socket.send(JSON.stringify(data));
			}
			return new Response('Hello World');
		} catch (e) {
			console.error(e);
			return new Response('Unknown Error');
		}
	}

	async webSocketClose(ws: WebSocket, code: number, reason: string, wasClean: boolean) {
		console.log('CLOSED', { ws, code, reason, wasClean });
	}

	async webSocketError(ws: WebSocket, error: unknown) {
		console.error('ERROR', error);
	}
}

This has become such a frustrating issue for my team that we've sunk way too many hours into diagnosing and are desperately looking for a solution. We hope this issue gets the attention it deserves.

@mrbbot
Copy link
Contributor

mrbbot commented Jan 5, 2024

Hey @thecatanon! 👋 Thanks for trying to put together a minimal reproduction. Unfortunately, the stacks from the release builds aren't very useful for debugging these kinds of issues. I've compiled an M1 debug build with ASAN enabled from 8c47f13 for you to try: https://drive.google.com/file/d/1OVMhhzU9HltFLyde9T1sg74UNaNn9w1S/view?usp=sharing. To use this build, unzip the file, then run wrangler dev with the MINIFLARE_WORKERD_PATH environment variable set to the absolute path of the unzipped workerd-asan-8c47f13 binary. This build will be significantly slower, but hopefully if the segfault occurs, it will log exactly where the problem is. 🤞

@thecatanon
Copy link

thecatanon commented Jan 5, 2024

Thanks @mrbbot!

Unfortunately I have just a few (unhelpful) findings:

  • With that build, I receive a Received signal #10: Bus error: 10 stack: (Nothing after the "stack:") when writing an HTML response.

  • If I write a HTML response from the regular build (without the custom MINIFLARE_WORKERD_PATH), load the page, exit wrangler, and then restart with the debug build so the websocket connects to the debug build, it seems to never segfault.

Maybe that bus error is another manifestation of the same issue?

buserror.log

CleanShot 2024-01-05 at 13 09 48@2x

It actually will load the HTML, but crash immediately after. Could it be a caching thing?

EDIT: Just created my own ASAN build and received the same bus error.

@DylanRJohnston
Copy link

I seem to be able to reliably trigger the error when using the Hibernating Sockets API. When a socket disconnects and then the Durable Object is evicted it triggers the error. Although I'm only assuming it's the Durable Object eviction as it occurs 10 seconds after the last web socket message which I believe is the durable object eviction timer.

@MellowYarker
Copy link
Contributor

MellowYarker commented Jan 13, 2024

@DylanRJohnston

Can you provide a repro? When do you see the segfault? Also are you on M1 mac as well?

@DylanRJohnston
Copy link

Yeah that might be a bit tricky given I'm using Nix, Rust, and wasm-bindgen to build the project, I'm on an M2 Macbook Pro, but still Apple Silicon.

@thecatanon
Copy link

Could this be related to #1535 / #1299?

@MellowYarker
Copy link
Contributor

@thecatanon I thought it might be which is why I tried closing out that bug, but I suspect it's unrelated. I don't see how that would have caused a segfault.

@thecatanon
Copy link

@MellowYarker Yeah still seeing it in the latest release

Will attempt creating an ASAN build again (got a bus error whenever HTML responses were delivered last time)

@MellowYarker
Copy link
Contributor

MellowYarker commented Jan 18, 2024

still seeing it in the latest release

@thecatanon are you referring to the latest release of Workerd or Miniflare (which I think is still pointing to a prev version of Workerd)? I'm not familiar with how Miniflare works but it seems like earlier in this thread you were manually changing the dep to point to a specific version, so I figure you may have already changed it to point to the latest Workerd?

@thecatanon
Copy link

@MellowYarker Yeah, using the MINIFLARE_WORKERD_PATH environment variable (I assume wrangler uses miniflare somehow under the hood, which in-turn looks at that variable to figure out how to launch workerd)

@thecatanon
Copy link

thecatanon commented Jan 18, 2024

Additionally, I might be seeing this in production now (saw someone get a disconnected screen, went to reproduce with Real-Time logs enabled, saw it happen, and found "outcome": "canceled" as well as "outcome":"exception" on a different one)

Could be unrelated, but thought it was worth adding.

"Cancelled" CF-Ray: 8472bc7d8a5236c0
"Exception" CF-Ray: 8472d052a8b736fd

(Hope that helps - happy to share any other details privately on Discord, I'm in the CF Developers server as the.feline)

@aboodman
Copy link

For what it's worth, we're seeing this a lot in reflect.net too. I have not been able to get a reproduction but will keep trying.

@codefrau
Copy link

I started to see this only after switching to Hibernatable WebSockets. Pretty much no other changes, so this see4ms very much related. FWIW we're using both regular and hibernatable sockets in the same DO:

workerd/io/worker-entrypoint.c++:340: info: exception = kj/compat/http.c++:4001: disconnected: other end of WebSocketPipe was destroyed
stack: 1028788cc 1016432d0 1010a8577 100e6d658 100e6da67 10287433f 100d3fee4
✘ [ERROR] *** Received signal #11: Segmentation fault: 11

  stack:


[InspectorProxyWorker] RUNTIME WEBSOCKET CLOSED 1006 WebSocket disconnected without sending Close frame.
workerd/util/symbolizer.c++:96: warning: Not symbolizing stack traces because $LLVM_SYMBOLIZER is not set. To symbolize stack traces, set $LLVM_SYMBOLIZER to the location of the llvm-symbolizer binary. When running tests under bazel, use `--test_env=LLVM_SYMBOLIZER=<path>`.

Every subsequent WebSocket connection is rejected with

workerd/jsg/util.c++:283: info: exception = kj/async-io-unix.c++:1524: disconnected: connect(): Connection refused
stack: 1021c5cfb 1021c60d7 10218833f 1021bebc4 1021bf4d0 1021b2bb8 1021b3000 100f7ba9b 100f7c38b 100f7da87 100f5c520 100f62490 100f784f0 10014f8cc 1007d0b7c

and for GET requests I see

workerd/io/worker.c++:1764: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
workerd/io/io-context.c++:359: info: uncaught exception; exception = workerd/jsg/_virtual_includes/jsg/workerd/jsg/value.h:1334: failed: jsg.Error: Network connection lost.
stack: 10078145c 10078223b 10218833f 100651f1c 1006524dc
workerd/io/worker.c++:1764: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
workerd/io/io-context.c++:359: info: uncaught exception; exception = workerd/jsg/_virtual_includes/jsg/workerd/jsg/value.h:1334: failed: jsg.Error: Network connection lost.
stack: 10078145c 10078223b 10218833f 100651f1c 1006524dc
✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

   {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

I have to restart wrangler dev to get it to work again.

@MellowYarker
Copy link
Contributor

@codefrau do you have a repro we could take a look at? Are you also on mac?

@thecatanon
Copy link

Just for tracking purposes (in case Cloudflare can't reproduce this on their end), still seeing this on v1.20240208.0.

Additionally, when creating ASAN builds (could absolutely be doing this wrong, I'm no C++ dev), when HTML is delivered I receive Error connecting. restart in the logs (as well as a crash). This is similar to the bus error described here: #1422 (comment)

@codefrau
Copy link

@codefrau do you have a repro we could take a look at? Are you also on mac?

The repo is not public unfortunately.

But yes, I'm on a Mac (2021 MacBook Pro with M1 Max, running Sonoma 14.2.1)

wrangler 3.28.1, node v20.11.0

@codefrau
Copy link

codefrau commented Feb 14, 2024

So this is curious: if it happens, then the segfault always happens at about 5 min after the last regular web socket disconnected and we sent the last message through the single hibernateble socket remaining. I looked through more logs than this but it's always ~5 minutes, never less than 4, never more than 6. There is no 5 min timeout in our code, and there was no other request. We only have an auto-response, the client pings every 30s.

Feb 13 17:52:14 last non-hibernateable websocket closed / last message sent on hibernatable
          +5:02
Feb 13 17:57:16 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
...
Feb 13 18:28:17 last non-hibernateable websocket closed / last message sent on hibernatable
          +5:13   
Feb 13 18:33:30 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
...
Feb 13 19:28:06 last non-hibernateable websocket closed / last message sent on hibernatable
          +4:56
Feb 13 19:33:02 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
...
Feb 13 20:41:13 last non-hibernateable websocket closed / last message sent on hibernatable
          +4:12
Feb 13 20:45:25 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11

@nvie
Copy link

nvie commented Apr 5, 2024

Upvoted, +1! This one is particularly annoying, as there doesn't seem to be a workaround, and it's making unit testing waking up from hibernation impossible for us.

I've been able to make a minimal reliable replication of this crash:
https://github.com/nvie/cloudflare-repros/tree/segfault#readme

It uses a Durable Object with the WebSocket Hibernation API, and pretty much nothing else.

I'm on an M3 Max, on Sonoma 14.4.1.

Update from April 15, 2024: Still an issue, after using all the latest versions of all Cloudflare packages.
Update from April 17, 2024: Still an issue, after using all the latest versions of all Cloudflare packages.

@matt384
Copy link

matt384 commented Apr 10, 2024

+1 on this issue, extremely irritating and difficult to do development with wrangler on hibernateable sockets

@MellowYarker
Copy link
Contributor

@nvie thanks for the repro! I see a segfault on linux too (very surprising tbh).

It looks like it's segfaulting in the ActorContainerRef destructor. Not sure why and will need to look into it further. I'll give this a proper look over the next week or two.

@MellowYarker
Copy link
Contributor

MellowYarker commented Apr 19, 2024

Let me know if this is still a problem after the next release goes out. It looks like we aren't putting one out this week, so I suspect it should go out next week.

Edit: Looks like a release has gone out now. Just need to wait for Miniflare to catch up.

@MellowYarker
Copy link
Contributor

@nvie I'm running your repro with [email protected] and [email protected] and I'm not seeing a crash or segfault, can you confirm it's fixed for you too?

@nvie
Copy link

nvie commented Apr 24, 2024

Yes! This is amazing — thanks so much, @MellowYarker! 🙌

@MellowYarker
Copy link
Contributor

Closing for now as this seems to have been resolved. Feel free to reopen if still experiencing a segfault after [email protected] and [email protected]. Thanks all for the helpful comments + @nvie for the reproduction!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants