Better handling of string lexing #140

chewxy · 2020-12-23T00:40:24Z

The Problem

Given a schema like this:

type Post {
    id: ID!
    raw: String!
}

Here's a way to break a mutation:

mutation MyMutation {
  addPost(input: {raw: "\xaa Raw content"}) {
    numUids
  }
}

giving rise to the following error:

input:3: Unexpected <Invalid>

The issue is that \xaa is not parsed correctly.

The Solution

The solution is quite simple: add a case to the readString() method of the lexer to handle x as a escape character.

Side Note: How Parsing of Strings Should Work

Please note that the handling of bad inputs for \xHH (where H is a meta-variable standing in for a hexadecimal number) is not quite the same as elsewhere in the package. I've got a good reason for this, and I am willing to make changes to the other escape sequences as well.

With this PR, the bad inputs will lead to the literals being returned - so that "\xZZ" will return "\xZZ", while good inputs such as "\xaa" will be parsed correctly, returning "ª".

I believe this is more user friendly.

In the example I had listed, the scenario is one where the server is receiving a mutation. The input string, could be an end user input. And end users do often type the wrong things.

For example, consider a dyslexic person trying to write the sentence "they will give it to me/u". Said person would often type something along the lines of "they will give it to me\u". In this case, the extra parsing for UTF-8 characters in the string will cause this input to fail. What the user meant to type, in a string representation, is "they will give it to me\\u".

An argument could be made that it is the onus of the user of this library to escape the string "they will give it to me\u" to "they will give it to me\\u". My counter argument is that the role of a lexer is to simply find tokens. A string token that contains the literal bytes in "they will give it to me\u" would qualify as a string token. That gqlparser goes above and beyond in order to parse out the UTF8 characters in the contents of a string is most commendable.

But it should not return an error. In the example I have given so far, it would be very unfriendly to the end user, as well as the user of the library.

There can be a further argument to be made - that having the user of this library parse the string and escape any invalid sequences would be extra computation wasted. Now as a total, the program has to go through the string twice - once to escape bad sequences (done by the user of this library), and the second, to parse the correct escape sequences into UTF8 chars (done by this library). If we could save computation by doing all at once, we would make the world a much nicer place.

As I mentioned, I am willing to put the changes in for handling the rest of the bad escape sequences. I am also OK if you want to just keep returning errors for string parsing, and will modify this PR. Your call @vektah .

End Notes

This was originally filed as an issue in Dgraph's community forum: https://discuss.dgraph.io/t/graphql-string-semantics-unclear-and-inconsistent-in-the-presence-of-escape-sequences/12019

coveralls · 2020-12-23T00:41:43Z

Coverage decreased (-0.03%) to 92.248% when pulling 9de70ae on chewxy:master into 7e475a7 on vektah:master.

… the cases easier to read

abhimanyusinghgaur · 2020-12-24T07:13:16Z

Hi @lwc @vvakame,

Tagging you guys for PR review. Need your blessings :)

Thanks

JatinDev543

looks good!!
After this PR we should handle cases for other bad escape sequences also.
Hi @lwc @vvakame, a gentle reminder to review this PR.
Thanks.

chewxy · 2021-01-12T00:08:17Z

Please also review the notes in the PR message above and give me an idea what the plans are wrt handling bad inputs.

JatinDev543 · 2021-01-12T06:49:10Z

Please also review the notes in the PR message above and give me an idea what the plans are wrt handling bad inputs.

Some clarifications needed.
1.Why can't we treat everything as literal. Like instead of processing for example characters after "\x", if we just accept everything and left it to user how he process this. Because from our end we need to accept and store the data.
And for other bad escape sequences also "\v","\0" etc., if we just accept them and store instead of throwing error.

2.I also have doubt in your implementation.

you are converting two characters after "\x" to haxadecimal if they are valid,otherwise treat them as literal. What  If someone wants to treat valid input as literal instead of converting them into hex, for example 
"\xaa"  should be "\xaa" not ª . is it valid requirement ?

StevenACoffman · 2022-01-26T13:37:32Z

@chewxy Sorry for the long delay. I'm looking this over, but I'm happy to merge this if you address the concerns of @JatinDev543

Better handling of string lexing vektah#140

chewxy added 2 commits December 23, 2020 10:12

added functionality to parse \\xHH with a decent fall back

d497d4b

go-fuzz found a crasher. Fixed it

444fa1a

boyscout: clean up switch cases so they all line up vertically. Makes…

9de70ae

… the cases easier to read

JatinDev543 approved these changes Jan 11, 2021

View reviewed changes

speedoops mentioned this pull request Jul 9, 2022

Better handling of string lexing #140 speedoops/gqlparser#1

Merged

speedoops added a commit to speedoops/gqlparser that referenced this pull request Jul 9, 2022

Merge pull request #1 from chewxy/master

f230bab

Better handling of string lexing vektah#140

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better handling of string lexing #140

Better handling of string lexing #140

chewxy commented Dec 23, 2020 •

edited

Loading

coveralls commented Dec 23, 2020 •

edited

Loading

abhimanyusinghgaur commented Dec 24, 2020

JatinDev543 left a comment •

edited

Loading

chewxy commented Jan 12, 2021

JatinDev543 commented Jan 12, 2021 •

edited

Loading

StevenACoffman commented Jan 26, 2022

Better handling of string lexing #140

Are you sure you want to change the base?

Better handling of string lexing #140

Conversation

chewxy commented Dec 23, 2020 • edited Loading

The Problem

The Solution

Side Note: How Parsing of Strings Should Work

End Notes

coveralls commented Dec 23, 2020 • edited Loading

abhimanyusinghgaur commented Dec 24, 2020

JatinDev543 left a comment • edited Loading

Choose a reason for hiding this comment

chewxy commented Jan 12, 2021

JatinDev543 commented Jan 12, 2021 • edited Loading

StevenACoffman commented Jan 26, 2022

chewxy commented Dec 23, 2020 •

edited

Loading

coveralls commented Dec 23, 2020 •

edited

Loading

JatinDev543 left a comment •

edited

Loading

JatinDev543 commented Jan 12, 2021 •

edited

Loading