
Adds option to pass token end locations #201

Closed
wants to merge 1 commit into from

Conversation

EricMCornelius

Hi there,

I've been working on an expression parsing language with syntax highlighting and thus found a need for the end locations for each rule application to be passed to the postprocessor functions. Not sure if there's a better alternative here, but it seemed sensible to thread the parser options reference down to the Column and State objects for future configuration flexibility.

Not sure if this is the best approach, as I've just started familiarizing myself with this project yesterday, so please let me know if there is a preferable alternative way to achieve this.

@tjvr
Collaborator

tjvr commented Mar 11, 2017

Hello! Thanks for working on this :-)

Unfortunately I don't think we can merge it—it's quite invasive, and passing the options down to individual States makes me sad and really isn't something we want to do!

Ideally, you would combine location information with your characters/tokens before passing them to nearley, something like this:

parser.feed(Array.from(source).map((char, index) => ({ char, index })))

Unfortunately it then becomes harder to match them in nearley, since the "tokens" are no longer strictly equal (===) to the literals defined in your rules. :-(

Custom tokens are actually something we've been thinking about very recently -- @Hardmath123 I think this is another argument for allowing user-defined matchers as per #198! This seems like a case where adding a custom test() function to each one of your literals is clearly less convenient than supplying a custom match function.
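[The two ideas above, wrapping characters with their indices before feeding, and matching via a test() predicate rather than a plain literal, can be sketched together. The token shape and the matchChar helper are illustrative only, not nearley API:]

```javascript
// Wrap each character with its index before feeding (as suggested above).
const source = "a+b";
const tokens = Array.from(source).map((char, index) => ({ char, index }));

// A plain literal "+" no longer matches, since {char: "+", index: 1} === "+"
// is false. A matcher object with a test() predicate can look inside the
// wrapper instead (matchChar is a hypothetical helper, not nearley API):
const matchChar = c => ({ test: tok => tok.char === c });
```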

@tjvr
Collaborator

tjvr commented Mar 11, 2017

Out of curiosity—what are you using position for in your postprocessors? :-) Ignore me; you already said!

@EricMCornelius
Author

EricMCornelius commented Mar 11, 2017

Returning full token location information in each AST node - which gives the ability to easily do semantic highlighting of various node types in the original input string.

As far as passing options through, the easiest (read HACKY) way to achieve this functionality was just to thread the location through and pass it as a 4th argument to the postprocessor functions. However, the resulting signature:

(data, start, reject, end) => {}

leaves a bit to be desired.

Have you discussed any ideas for how to pass a global context through to the postprocessors? It seems like a very useful thing to support without having to resort to arbitrary globals. That was my motivation for trying to add options references, but I'd usually expect that sort of thing to be achieved with a closure context instead of explicitly assigning a reference to each object instance.
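[For concreteness, the 4-argument signature under discussion could be used like this. Passing end as a fourth argument is exactly what this PR proposes; it is not part of stock nearley, and the node shape is illustrative:]

```javascript
// Hypothetical postprocessor under the proposed (data, start, reject, end)
// signature: attach the matched span to every AST node it builds.
const comparison = (data, start, reject, end) => ({
  type: "comparison",
  children: data,
  loc: { start, end }, // extent of the input covered by this rule application
});

const node = comparison(["field1", "=", "value1"], 0, null, 15);
```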

@tjvr
Collaborator

tjvr commented Mar 11, 2017

how to pass a global context through to the postprocessors

Smells like #63. That might be something we should revisit.

(data, start, reject, end) => {}

Hold on; I'd forgotten postprocessors already take start as an argument! Given that's the case, you can work out the end index yourself as part of your AST building :-) [Admittedly not ideal, but definitely doable.]
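[A minimal sketch of what "work out the end index yourself" entails: every terminal must carry its own span so that parents can derive theirs. The node shapes here are illustrative, not nearley output:]

```javascript
// Each terminal records start and end; a parent's extent is derived from
// its first and last child. This only works if *every* matched terminal,
// including whitespace you would rather discard, is kept in the data.
const terminal = (value, start) => ({ value, start, end: start + value.length });

const parent = children => ({
  children,
  start: children[0].start,
  end: children[children.length - 1].end,
});

// "field1 = value1": field1 at 0..6, value1 at 9..15
const cmp = parent([terminal("field1", 0), terminal("value1", 9)]);
```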

@EricMCornelius
Author

It's not possible, as far as I can see, to calculate the end index at each node without requiring that every parsed terminal return that information, which is a major performance hit and source of grammar complexity when dealing with discarded whitespace terminals.

@tjvr
Copy link
Collaborator

tjvr commented Mar 11, 2017

Have you considered using a tokenizer?

@EricMCornelius
Author

In stricter languages I certainly prefer to, but token boundary classification would be fairly challenging for this particular DSL, unfortunately.

@tjvr
Collaborator

tjvr commented Mar 11, 2017

@Hardmath123 How about in postprocessors binding this to the Parser? (Or making the Parser the fourth argument?) Massive potential for abuse, of course; but it would stop us from adding a potentially unbounded number of arguments to State :-)

It would trivially solve @EricMCornelius's problem; you could just read this.current.

It might even make the people who want context happy...
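[A sketch of the binding idea above. Note that `current` is an assumed name for the parser's input position; nothing here is actual nearley behavior, and the stand-in parser object merely simulates the proposed binding:]

```javascript
// Hypothetical: if nearley called postprocessors with `this` bound to the
// Parser, a rule could read the current position directly and record a span.
function withSpan(data, start /*, reject */) {
  return { data, start, end: this.current };
}

// Simulate the proposed binding with a stand-in parser object:
const fakeParser = { current: 15 };
const node = withSpan.call(fakeParser, ["field1", "=", "value1"], 0);
```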

@EricMCornelius
Author

As an aside, the PEG.js semantic predicates provide the full location information for rule applications, which is what I've used in the past / am most familiar with: https://github.com/pegjs/pegjs#--predicate-

It certainly does make quality error messages significantly easier to generate when you want to do (lazy) context-sensitive error generation during parsing (undefined variable names, etc.)

@tjvr
Collaborator

tjvr commented Mar 11, 2017

It certainly does make quality error messages significantly easier to generate when you want to do (lazy) context-sensitive error generation during parsing (undefined variable names, etc.)

Nearley doesn't allow giving a rejection reason, since many postprocessors may run or parsing may even continue after a rejection—see #195

@EricMCornelius
Author

I'd argue that valid production matches with invalid context are great times to generate quality errors. Undefined variable names being one of the canonical examples.

That being said, it's off of the main topic, which is that there's definitely value in having the end positions of production matches available, if only for applications like syntax highlighting and making semantic modifications to original input strings.

@kach
Owner

kach commented Mar 11, 2017

I might be misunderstanding, but can't you get the "end" offset by subtracting 1 from the "start" offset of the next item in your AST?

@EricMCornelius
Author

I don't want to clutter up the AST generating nodes for whitespace and other punctuation in this particular case (operators, parentheses, etc.), and whitespace could be arbitrarily long. This makes it impossible to determine which section of the input was actually matched by a given production rule in a general manner.
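[The objection can be made concrete with a tiny example; the offsets are computed from a sample input, not nearley output:]

```javascript
// Why "end = next item's start - 1" over-reports when whitespace and
// punctuation are discarded from the AST:
const source = "field1    = value1";
const value1Start = source.indexOf("value1");      // 12
const inferredEnd = value1Start - 1;               // 11: swallows spaces and '='
const actualEnd = 0 + "field1".length - 1;         // 5: where field1 really ends
```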

@EricMCornelius
Author

EricMCornelius commented Mar 11, 2017

Drastically simplified example:

field1 = value1
AND (field2=value2  || field3=value3)

Here I'd want to have an AST representing the boolean expression tree, with leaf nodes representing comparison operations. Don't need to know location information specifically about the parentheses or the arbitrarily structured whitespace components. Only the fields, values, and boolean expression boundaries are necessary, but there's no straightforward way to get the full extent w/o major amounts of explicit bookkeeping when bubbling up data.
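[The target AST shape described above might look like this; the node type names ("and", "or", "eq") are illustrative, and spans are computed directly from the sample input rather than by a parser:]

```javascript
// Comparison leaves carry spans for fields and values; parentheses and
// whitespace contribute nothing to the tree.
const source = "field1 = value1\nAND (field2=value2  || field3=value3)";
const span = text => {
  const start = source.indexOf(text);
  return { start, end: start + text.length };
};

const ast = {
  type: "and",
  left:  { type: "eq", field: span("field1"), value: span("value1") },
  right: {
    type: "or",
    left:  { type: "eq", field: span("field2"), value: span("value2") },
    right: { type: "eq", field: span("field3"), value: span("value3") },
  },
};
```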

@tjvr
Collaborator

tjvr commented Mar 11, 2017

I'm closing this since we're not going to merge it; however, you're welcome to continue the discussion here, or on my proposed solution: #203

@tjvr tjvr closed this Mar 11, 2017
@EricMCornelius
Author

Yep.
