🐛 Bug Report — Runtime APIs: alarm-scheduler fails for inscrutable reason #1386

Closed
huw opened this issue Nov 4, 2023 · 16 comments · Fixed by #1474

Comments

@huw

huw commented Nov 4, 2023

Sometimes when I schedule alarms in a Durable Object, I get the following error:

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [6 == 71]; NUL character in path component; part = ��zUdc5238b9�$�v$7559429.sqlite

I tried looking through the source code, but it’s very hard to determine why this is occurring, especially because I don’t think KJ is open source. The behaviour appears inconsistently (restarting usually fixes it), but if it helps, the kinds of calls I’m making are (sketched below):

  • Schedule around 500ms in the future
  • Schedule at the current date (via Temporal.Now.instant(), which uses Date.now() under the hood)

For both, I sometimes check for the presence of an existing alarm first, but not always. Any hints?
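
To make those two patterns concrete, here is a rough sketch (placeholder class name, not my real worker); it assumes the standard Durable Object types and that Temporal is available, e.g. via a polyfill:

// Rough illustrative sketch of the two scheduling patterns described above.
// Placeholder names; assumes Temporal is available (e.g. via a polyfill).
export class AlarmExample implements DurableObject {
  constructor(private readonly state: DurableObjectState) {}

  async fetch() {
    // Pattern 1: schedule roughly 500 ms in the future
    await this.state.storage.setAlarm(Date.now() + 500);

    // Pattern 2: schedule at the current instant
    // (Temporal.Now.instant() is backed by Date.now()).
    // Note: a Durable Object has a single alarm, so this call replaces the one above.
    await this.state.storage.setAlarm(Temporal.Now.instant().epochMilliseconds);

    return new Response("scheduled");
  }

  async alarm() {}
}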

@dom96
Collaborator

dom96 commented Nov 4, 2023

> especially because I don’t think KJ is open source

it is open source, it's in the capnp repo: https://github.com/capnproto/capnproto/tree/v2/c%2B%2B/src/kj

@huw
Author

huw commented Nov 6, 2023

Thanks for that—sorry! I tried looking a bit further but knowing the KJ code didn’t enlighten me much—it is hard to find where workerd/wrangler is calling that deep into the stack without a useful stack trace, which I’m not getting (mine just spits out an LLVM_SYMBOLIZER error a la cloudflare/workers-sdk#3631). I would love to debug this myself but it’s just too hard. Should I move this to workers-sdk?

@kentonv
Member

kentonv commented Nov 6, 2023

It's unlikely to be an issue in KJ itself. It looks like workerd is probably calling the filesystem APIs and passing a Path object that has already been freed.

@gregory

gregory commented Nov 17, 2023

> Thanks for that—sorry! I tried looking a bit further but knowing the KJ code didn’t enlighten me much—it is hard to find where workerd/wrangler is calling that deep into the stack without a useful stack trace, which I’m not getting (mine just spits out an LLVM_SYMBOLIZER error a la cloudflare/workers-sdk#3631). I would love to debug this myself but it’s just too hard. Should I move this to workers-sdk?

Having the same issue; it's really hard to debug.
I'm hitting the same usage pattern, scheduling alarms at Date.now() or in the very near future.
Adding 1000 ms on top of Date.now() seems to avoid the exception (sketched below).
This seems to point to some kind of race condition in workerd.
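
A minimal sketch of that workaround, assuming a DurableObjectStorage instance; the 1000 ms padding is just the value that happened to work here, not a documented minimum:

// Workaround sketch: pad the alarm time so it is never at or before "now".
// The 1000 ms offset is the value reported in this thread, not a documented requirement.
async function scheduleAlarmSoon(storage: DurableObjectStorage): Promise<void> {
  await storage.setAlarm(Date.now() + 1000);
}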

@bkniffler

bkniffler commented Nov 24, 2023

Can confirm the bug appears with setAlarm(Date.now()), but not with setAlarm(Date.now() + 5000)

@jake-danton

I get this after setting the 3rd alarm, regardless of how far in the future it is.

@huw
Author

huw commented Nov 26, 2023

I’ve tried the following MWE, but I can’t reliably reproduce the bug on Wrangler 3.16.0 or ^3.17.0:

export class Test implements DurableObject {
  constructor(private readonly state: DurableObjectState) {}

  async fetch() {
    // Schedule an alarm slightly in the past so it fires immediately
    await this.state.storage.setAlarm(Date.now() - 1000);
    return new Response("OK");
  }

  async alarm() {
    console.log("Alarm execution succeeded");
  }
}

const index: ExportedHandler<{ TEST: DurableObjectNamespace }> = {
  fetch: async (_, { TEST }) => await TEST.get(TEST.idFromName("test")).fetch("https://do")
};

export default index;

I also tried subtracting time from Date.now() and scheduling multiple alarms at once. For now, I’m waiting to catch it again in Wrangler ^3.17.0, which includes the stack trace symbolizer fixes and might (?) produce a better stack trace. But any tips from the core team on how I can see a proper stack trace from KJ & workerd would be hugely appreciated :)

@jasnell
Member

jasnell commented Nov 27, 2023

I just looked quickly at alarm-scheduler.c++, and there's currently only one place where we're passing along a kj::PathPtr that could potentially be a problem here. I just opened #1442, which ensures the kj::Path is attached and kept alive. I haven't repro'd the actual segfault locally yet, so it would be helpful if someone who has repro'd it could give the potential fix in #1442 a try.

@jasnell
Member

jasnell commented Nov 29, 2023

For those of you who are reliably seeing this, can I ask someone to put together a simple repro worker that always fails for you? I'm unable to reproduce the issue on main.

@huw
Author

huw commented Dec 4, 2023

Okay, I have a reproduction!

Use the code from above:

index.ts:

export class Test implements DurableObject {
  constructor(private readonly state: DurableObjectState) {}

  async fetch() {
    // Schedule an alarm slightly in the past so it fires immediately
    await this.state.storage.setAlarm(Date.now() - 1000);
    return new Response("OK");
  }

  async alarm() {
    console.log("Alarm execution succeeded");
  }
}

const index: ExportedHandler<{ TEST: DurableObjectNamespace }> = {
  fetch: async (_, { TEST }) => await TEST.get(TEST.idFromName("test")).fetch("https://do")
};

export default index;

wrangler.toml:

compatibility_date = "2023-11-23"
name = "debugging"
workers_dev = true
main = "./index.ts"
minify = true

[[durable_objects.bindings]]
name = "TEST"
class_name = "Test"

.vscode/launch.json:

{
	"configurations": [
		{
			"name": "Wrangler",
			"type": "node",
			"request": "attach",
			"port": 9229,
			"attachExistingChildren": false,
			"autoAttachChildProcesses": false
		}
	]
}

I’ve only reproduced in VS Code. The steps are:

  1. Bind a Logpoint to whatever line you’d like (I bound to the console.log line)
  2. wrangler dev
  3. Attach a debugger via VS Code and wait for it to bind
  4. curl localhost:8787
  5. If you don’t see the error, disconnect the debugger, wait a second, and re-attach it. Repeat steps 3–5 until you see it (should show up frequently enough)
  6. If you still don’t see the error, restart Wrangler and repeat steps 2–5.

I don’t know how many of these steps are required, but I am fairly confident this only shows up when a debugger is attached. I was able to get the repro working within 1–3 tries most of the time; occasionally it wouldn’t reproduce at all and I’d have to try step 6. I assume it doesn’t reliably reproduce because it’s memory-related?

I’m not super confident about my ability to build a workerd branch and host it inside miniflare inside wrangler, but if you need me to do it I could give it a crack.

@mrbbot
Contributor

mrbbot commented Dec 4, 2023

If it helps, you can use the MINIFLARE_WORKERD_PATH environment variable to point miniflare and wrangler to a custom build of workerd.
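
For example, a hypothetical Node-side sketch; the binary path is a placeholder for wherever your workerd build actually lives:

// Sketch: launch wrangler dev with MINIFLARE_WORKERD_PATH pointing at a custom workerd build.
// The path below is a placeholder, not a real location.
import { spawn } from "node:child_process";

spawn("wrangler", ["dev"], {
  env: {
    ...process.env,
    MINIFLARE_WORKERD_PATH: "/path/to/workerd/bazel-bin/src/workerd/server/workerd",
  },
  stdio: "inherit",
});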

@jasnell
Member

jasnell commented Dec 5, 2023

OK, I think I've identified the issue and the fix (#1474).

@jasnell
Member

jasnell commented Dec 5, 2023

I've landed #1474, which should fix this issue (and is why the issue auto-closed). If you still see the problem once the workerd package is updated, let us know here and we can reopen the issue and keep investigating.

@johnspurlock

By any chance, did this change land in Workers prod around Wednesday of this week? I saw a few alarms start re-firing (all around 2023-12-13T20:30:00Z) in 3 colos; I hadn't noticed they had stopped firing for months (!)

@kentonv
Member

kentonv commented Dec 18, 2023

@johnspurlock James' change is a fix to code that only runs in local testing; our production implementation of alarms is very different. Separately, though, I think I did hear someone say something about fixing a bug which had caused some alarms to fail to run, and that old alarms might start running as a result. I don't personally know the details of this. Maybe ask in the Durable Objects room on Discord?

@johnspurlock

> @johnspurlock James' change is a fix to code that only runs in local testing; our production implementation of alarms is very different. Separately, though, I think I did hear someone say something about fixing a bug which had caused some alarms to fail to run, and that old alarms might start running as a result. I don't personally know the details of this. Maybe ask in the Durable Objects room on Discord?

Ah ok thanks, got a response a few days ago there:

> A bug with Durable Objects Alarms was fixed and will cause skipped alarms to run. Accounts with affected alarms will receive an email next week with more details
