
Adds option to pass token end locations #201

Closed
wants to merge 1 commit into from

Conversation

EricMCornelius

Hi there,

I've been working on an expression parsing language with syntax highlighting and thus found a need for the end locations for each rule application to be passed to the postprocessor functions. Not sure if there's a better alternative here, but it seemed sensible to thread the parser options reference down to the Column and State objects for future configuration flexibility.

Not sure if this is the best approach, as I've just started familiarizing myself with this project yesterday, so please let me know if there is a preferable alternative way to achieve this.

@tjvr
Collaborator

tjvr commented Mar 11, 2017

Hello! Thanks for working on this :-)

Unfortunately I don't think we can merge it—it's quite invasive, and passing the options down to individual States makes me sad and really isn't something we want to do!

Ideally, you would combine location information with your characters/tokens before passing them to nearley, something like this:

parser.feed(Array.from(source).map((char, index) => ({ char, index })))

Unfortunately it then becomes harder to match them in nearley, since the "tokens" are no longer strictly equal (===) to the literals defined in your rules. :-(

Custom tokens are actually something we've been thinking about very recently -- @Hardmath123 I think this is another argument for allowing user-defined matchers as per #198! This seems like a case where adding a custom test() function to each one of your literals is clearly less convenient than supplying a custom match function.
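[The two ideas above, wrapping characters with their indices before feeding, and matching via a test() predicate rather than a plain literal, can be sketched together. The token shape and the matchChar helper are illustrative only, not nearley API:]

```javascript
// Wrap each character with its index before feeding (as suggested above).
const source = "a+b";
const tokens = Array.from(source).map((char, index) => ({ char, index }));

// A plain literal "+" no longer matches, since {char: "+", index: 1} === "+"
// is false. A matcher object with a test() predicate can look inside the
// wrapper instead (matchChar is a hypothetical helper, not nearley API):
const matchChar = c => ({ test: tok => tok.char === c });
```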

@tjvr
Collaborator

tjvr commented Mar 11, 2017

Out of curiosity—what are you using position for in your postprocessors? :-) Ignore me; you already said!

@EricMCornelius
Author

EricMCornelius commented Mar 11, 2017

Returning full token location information in each AST node - which gives the ability to easily do semantic highlighting of various node types in the original input string.

As far as passing options through, the easiest (read HACKY) way to achieve this functionality was just to thread the location through and pass it as a 4th argument to the postprocessor functions. However, the resulting signature:

(data, start, reject, end) => {}

leaves a bit to be desired.

Have you discussed any ideas for how to pass a global context through to the postprocessors? It seems like a very useful thing to support without having to resort to arbitrary globals. That was my motivation for trying to add options references, but I'd usually expect that sort of thing to be achieved with a closure context instead of explicitly assigning a reference to each object instance.
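[For concreteness, the 4-argument signature under discussion could be used like this. Passing end as a fourth argument is exactly what this PR proposes; it is not part of stock nearley, and the node shape is illustrative:]

```javascript
// Hypothetical postprocessor under the proposed (data, start, reject, end)
// signature: attach the matched span to every AST node it builds.
const comparison = (data, start, reject, end) => ({
  type: "comparison",
  children: data,
  loc: { start, end }, // extent of the input covered by this rule application
});

const node = comparison(["field1", "=", "value1"], 0, null, 15);
```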

@tjvr
Collaborator

tjvr commented Mar 11, 2017

how to pass a global context through to the postprocessors

Smells like #63. That might be something we should revisit.

(data, start, reject, end) => {}

Hold on; I'd forgotten postprocessors already take start as an argument! Given that's the case, you can work out the end index yourself as part of your AST building :-) [Admittedly not ideal, but definitely doable.]
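[A minimal sketch of what "work out the end index yourself" entails: every terminal must carry its own span so that parents can derive theirs. The node shapes here are illustrative, not nearley output:]

```javascript
// Each terminal records start and end; a parent's extent is derived from
// its first and last child. This only works if *every* matched terminal,
// including whitespace you would rather discard, is kept in the data.
const terminal = (value, start) => ({ value, start, end: start + value.length });

const parent = children => ({
  children,
  start: children[0].start,
  end: children[children.length - 1].end,
});

// "field1 = value1": field1 at 0..6, value1 at 9..15
const cmp = parent([terminal("field1", 0), terminal("value1", 9)]);
```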

@EricMCornelius
Author

It's not possible, as far as I can see, to calculate the end index at each node without requiring that every parsed terminal return that information, which is a major performance hit and source of grammar complexity when dealing with discarded whitespace terminals.

@tjvr
Copy link
Collaborator

tjvr commented Mar 11, 2017

Have you considered using a tokenizer?

@EricMCornelius
Author

In stricter languages I certainly prefer to, but token boundary classification would be fairly challenging for this particular DSL, unfortunately.

@tjvr
Collaborator

tjvr commented Mar 11, 2017

@Hardmath123 How about in postprocessors binding this to the Parser? (Or making the Parser the fourth argument?) Massive potential for abuse, of course; but it would stop us from adding a potentially unbounded number of arguments to State :-)

It would trivially solve @EricMCornelius's problem; you could just read this.current.

It might even make the people who want context happy...
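[A sketch of the binding idea above. Note that `current` is an assumed name for the parser's input position; nothing here is actual nearley behavior, and the stand-in parser object merely simulates the proposed binding:]

```javascript
// Hypothetical: if nearley called postprocessors with `this` bound to the
// Parser, a rule could read the current position directly and record a span.
function withSpan(data, start /*, reject */) {
  return { data, start, end: this.current };
}

// Simulate the proposed binding with a stand-in parser object:
const fakeParser = { current: 15 };
const node = withSpan.call(fakeParser, ["field1", "=", "value1"], 0);
```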

@EricMCornelius
Author

As an aside, the PEG.js semantic predicates provide the full location information for rule applications, which is what I've used in the past / am most familiar with: https://github.com/pegjs/pegjs#--predicate-

It certainly does make quality error messages significantly easier to generate when you want to do (lazy) context-sensitive error generation during parsing (undefined variable names, etc.)

@tjvr
Collaborator

tjvr commented Mar 11, 2017

It certainly does make quality error messages significantly easier to generate when you want to do (lazy) context-sensitive error generation during parsing (undefined variable names, etc.)

Nearley doesn't allow giving a rejection reason, since many postprocessors may run or parsing may even continue after a rejection—see #195

@EricMCornelius
Author

I'd argue that valid production matches with invalid context are great times to generate quality errors. Undefined variable names being one of the canonical examples.

That being said, it's off of the main topic, which is that there's definitely value in having the end positions of production matches available, if only for applications like syntax highlighting and making semantic modifications to original input strings.

@kach
Owner

kach commented Mar 11, 2017

I might be misunderstanding, but can't you get the "end" offset by subtracting 1 from the "start" offset of the next item in your AST?

@EricMCornelius
Author

I don't want to clutter up the AST generating nodes for whitespace and other punctuation in this particular case (operators, parentheses, etc.), and whitespace could be arbitrarily long. This makes it impossible to determine which section of the input was actually matched by a given production rule in a general manner.
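[The objection can be made concrete with a tiny example; the offsets are computed from a sample input, not nearley output:]

```javascript
// Why "end = next item's start - 1" over-reports when whitespace and
// punctuation are discarded from the AST:
const source = "field1    = value1";
const value1Start = source.indexOf("value1");      // 12
const inferredEnd = value1Start - 1;               // 11: swallows spaces and '='
const actualEnd = 0 + "field1".length - 1;         // 5: where field1 really ends
```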

@EricMCornelius
Author

EricMCornelius commented Mar 11, 2017

Drastically simplified example:

field1 = value1
AND (field2=value2  || field3=value3)

Here I'd want to have an AST representing the boolean expression tree, with leaf nodes representing comparison operations. Don't need to know location information specifically about the parentheses or the arbitrarily structured whitespace components. Only the fields, values, and boolean expression boundaries are necessary, but there's no straightforward way to get the full extent w/o major amounts of explicit bookkeeping when bubbling up data.
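[The target AST shape described above might look like this; the node type names ("and", "or", "eq") are illustrative, and spans are computed directly from the sample input rather than by a parser:]

```javascript
// Comparison leaves carry spans for fields and values; parentheses and
// whitespace contribute nothing to the tree.
const source = "field1 = value1\nAND (field2=value2  || field3=value3)";
const span = text => {
  const start = source.indexOf(text);
  return { start, end: start + text.length };
};

const ast = {
  type: "and",
  left:  { type: "eq", field: span("field1"), value: span("value1") },
  right: {
    type: "or",
    left:  { type: "eq", field: span("field2"), value: span("value2") },
    right: { type: "eq", field: span("field3"), value: span("value3") },
  },
};
```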

@tjvr
Collaborator

tjvr commented Mar 11, 2017

I'm closing this since we're not going to merge it; however, you're welcome to continue the discussion here, or on my proposed solution: #203

@tjvr tjvr closed this Mar 11, 2017
@EricMCornelius
Author

Yep.
