Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ponder Implicitly Removes Null Bytes from Event Arguments #1456

Open
shrugs opened this issue Jan 20, 2025 · 2 comments
Open

Ponder Implicitly Removes Null Bytes from Event Arguments #1456

shrugs opened this issue Jan 20, 2025 · 2 comments

Comments

@shrugs
Copy link
Contributor

shrugs commented Jan 20, 2025

Version

0.8.28

Current behavior

In indexing ENS, some labels with null bytes are emitted, but these labels cannot be stored in postgres (which is likely the reason ponder trims them to begin with). Unfortunately this removes information from the application-layer who might handle these cases differently. In ENS's case, it's important that 'vitalik\x00'.eth and vitalik.eth are differentiable.

In this specific case, we can use const hadNullBytes = labelhash(name) !== label as a proxy for null-byte detection, since the label argument in that event is just labelhash(name) and will be different for the original name value that includes the null byte.

More information is available here, where I debugged this behavior 👇

namehash/ensnode#36

Expected behavior

I believe the current behavior of ponder is extremely reasonable, but I worry that the developer integrating ponder loses this information and may accidentally index the wrong data. In our case we were treating these two names as identical, which messed up the resulting data (and would have perhaps enabled someone resolving vitalik.eth to receive the imposter's information. We assumed we were getting the raw argument data from ponder, intending to handle the null-byte ourselves, but were actually receiving strings with null bytes already removed.

It seems that the default behavior of postgres is to throw an error if a null byte is inserted, though I'm not sure why this didn't cause subgraph indexers from failing to index these names in the original postmortem.

https://ens.mirror.xyz/9GN77d-MqGvRypm72FcwgxlUnPSuKWhG3rWxddHhRwM

In any case, I think that the default behavior, where postgres throws an unrecoverable error, forcing the user to handle this case, is likely the correct path. Ponder can/should provide null-byte helper functions to make this easier on users, for example a containsNullCharacters or removeNullCharacters helper fn.

Steps to reproduce

You don't really need to reproduce this, but here's an offending tx: 0x535bf909ec258054cb03ab03498b4de0b365e1037a43fcdfbdcb28202382a9cd which you can access via ponder with the following config:

EthRegistrarControllerOld: {
      network: "mainnet",
      abi: EthRegistrarControllerOld,
      address: "0x283Af0B28c62C092C9727F1Ee09c02CA627EB7F5",
      startBlock: 16369364,
      endBlock: 16369364 + 1,
    },

and you can log the arguments provided to your NameRegistered event handler to see that the null byte is removed.

the relevant line in ponder is https://github.com/ponder-sh/ponder/blob/178bcbfd36b5eb418ee118c0e6b02a46c9c7ed6d/packages/core/src/sync/events.ts#L505C25-L505C45

Link to repository

https://github.com/namehash/ensnode

Anything else?

No response

@typedarray
Copy link
Collaborator

Hi, thanks for opening. Was not aware of the historical context here, thanks for sharing.

We shipped this change ~2 weeks ago in 0.8.18, indeed by removing the null characters from decoded event arguments (event.args) before passing them to userland.

I can definitely see how this makes it difficult to achieve the desired behavior here for ENS, so I'm open to a different solution.

For context, we received several reports (both ENS and non-ENS apps) running into this issue because they were trying to store user-provided text in a postgres TEXT column. They had no idea where these characters came from. To put it another way - if we don't remove them, any ponder app that stores decoded event args containing user-provided bytes or string values is essentially vulnerable to the postgres insertion error crashing their app.

So, it's a bit of a dilemma. Couple ideas:

  • Add a global configuration option removeNullBytes which defaults to true
  • Remove the built-in sanitization but improve observability around this, eg log a warning when a decoded arg contains null bytes, and improve the postgres insertion error with a link to the docs
  • Change some behavior at the query builder / driver layer ?

My instinct is that ENS is the outlier here, so I struggle with any option that doesn't sanitize by default. That direction regresses to every ponder app having to manually sanitize event.args before writing to the database. If we can find examples of protocols other than ENS where null bytes need special treatment, that would be helpful.

Appreciate your take!

@shrugs
Copy link
Contributor Author

shrugs commented Jan 20, 2025

It's a bit strict of me but I think the postgres error should be intentional behavior. It does mean user-education & burden on devs to sanitize user-provided strings, which is definitely an overhead.

I don't love a configuration flag, in a code-over-config way, it'd be more explicit for someone in userland to say .values({ comment: removeNullCharacters(event.args.comment) }) than a config laying about in a different file. It specifies which arguments are user-provided and that they explicitly no longer contain null bytes.

I agree that ENS is an outlier, but it does seem that storing user-provided strings is pretty universal. The unique aspect here is that in ENS these strings are hyper specific unique values and identity matters, versus something like the zora on-chain comments.

I'll admit that if postgres threw an error, it'd be a really shitty experience to not know about this and then have your index crash randomly in the future because of a user providing a malformed string. So ensuring that doesn't happen would be quite nice for DX — that's a point for the config option w/ defaults. That said the alternative (bad data because dev didn't know about null character removal) is a much worse type of bug (implicit, difficult to track down) that could occur in the future as well, depending on how the dev is using those user-provided values. Failing hard and fast seems preferable in this case.

But if ENS-style apps are a clear outlier, then the config + documentation seems like a reasonable choice.
A balanced option would be a WARNING log, like you mentioned, though in a long-running index those logs disappear quite quickly — any ideas for ensuring the dev becomes aware of the warning?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants