Parsing locations #159

ejgallego · 2023-02-06T18:11:35Z

Dear yojson devs,

I'm reading some JSON data that I then analize, however, as to emit correct source locations for the analysis I'd need that the parsed JSON would contain locations on the original input file / buffer.

Is that something possible to do with yojson?

TIA!

mjambon · 2023-02-09T06:37:33Z

Historically, yojson was focused on speed and that's why the tree contains the data and nothing else. Getting a proper AST with locations and more is useful for some applications. Having this in yojson is up to the maintainers but it would be an addition that doesn't have much in common with the existing implementation.

With some extra preprocessor hacking, we may be able to reuse the existing lexer which currently uses a hack to disable location tracking to improve performance. I'm not sure what others think.

There's also the YAML path, which would consist in finding a YAML parser that returns a precise AST and has an option to restrict itself to JSON (modern YAML being a strict superset of JSON IIRC). I don't know much about the possibilities of ocaml-yaml but it may be worth looking into it.

ejgallego · 2023-02-09T09:43:54Z

Thanks for the feedback @mjambon , indeed location tracking can decrease efficiently greatly, good point.

I guess it makes sense to keep this issue open to see if a solution could be found, for now, for my own hacking purposes I derived a JSON parser from RWC that does indeed parse with locations, sharing it here in case could be useful to others:

ejgallego/coq-lsp#306

abishekvashok · 2023-03-29T09:53:05Z

Hey, yes, having location can greatly help my cause as well.
Would appreciate adding that capability like a switch if possible. I guess all it needs is to uncomment a couple of lines in read.mll ?

I will be happy to make a patch if the maintainers are receptive of it :)

Leonidas-from-XIV · 2023-03-29T13:00:28Z

The problem is how to represent this in the type. Can't really change the AST without breaking every single user of Yojson, so a new AST would need to be created.

In such case, it would be useful to revisit the structure of the AST as it is now anyway (see #158), to not have to do this in the future… this turns into a whole rabbit hole of issues so it's not particularly easy to tackle I would say without creating a lot of issues and questions as @mjambon mentions.

abishekvashok · 2023-03-29T14:14:21Z

Can't we reuse the existing lexer,

yojson/lib/read.mll

Lines 2 to 23 in cf2f130

    
             module Lexing = 
        
               (* 
        
                 We override Lexing.engine in order to avoid creating a new position 
        
                 record each time a rule is matched. 
        
                 This reduces total parsing time by about 31%. 
        
               *) 
        
             struct 
        
               include Lexing 
        
               external c_engine : lex_tables -> int -> lexbuf -> int = "caml_lex_engine" 
        
               let engine tbl state buf = 
        
                 let result = c_engine tbl state buf in 
        
                 (* 
        
                 if result >= 0 then begin 
        
                   buf.lex_start_p <- buf.lex_curr_p; 
        
                   buf.lex_curr_p <- {buf.lex_curr_p 
        
                                      with pos_cnum = buf.lex_abs_pos + buf.lex_curr_pos}; 
        
                 end; 
        
                 *) 
        
                 result 
        
             end

I don't know what all needs to be done after uncommenting those lines, and don't no whether conditional include or overriding is permitted in Ocaml so as to make this a switch.

At least some sort of support, like to have line numbers while matched, would be cool

Leonidas-from-XIV · 2023-03-29T14:56:13Z

You probably can, as the locations are in there so that's not the problem. The main issue is how to attach line numbers to the AST.

You could use a similar approach as opam-file-format, wrapping the current AST into with_pos, but this adds a lot of intermediate records and allocations:

https://github.com/ocaml/opam-file-format/blob/303ab85afb67c7c22ea548b87241b08616b9f6f2/src/opamParserTypes.ml#L59-L62

gcluzel · 2024-03-20T09:38:43Z

Having locations embedded directly within the AST would really be beneficial in scenarios where the JSON file originates directly from a user. In such cases, prioritizing parsing efficiency alone seems inefficient when compared to the enhanced utility that could be gained from parsing locations, allowing for more informative messages to be displayed to the user.

What about creating new ASTs that store the locations directly in the nodes, more or less like that:

type pos = Lexing.position * Lexing.position
type t = [
  | `Null of pos
  | `Int of int * pos
  | `Assoc of (string * t * pos) list * pos
  ... ]

Would it make sense?

Leonidas-from-XIV · 2024-03-20T10:08:21Z

Such an AST would work (though I'd probably define the type as (t * pos) to not have to write pos on every constructor), the issue is that it is a completely new AST so it would be yet another type that very few libraries use while requiring implementation and maintenance work for a comparatively small improvement.

Given the amount of active maintainers in this project is very low and the demand is sporadic I have a hard time seeing this move forward anytime soon.

ejgallego · 2024-03-20T13:56:25Z

I guess more than seeing the improvement as "small", it is more in the category of "critical for some users" vs "non important for others", for example if you need to parse user-provided JSON , locations are certainly a must if you need to provide feedback.

Maybe indeed such parsing library would make sense as an independent project? I'm happy to fork the lib I wrote for coq-lsp and make it into a (Yo)json parser that is better suited for interactive use (error recovery, locations, etc...)

Leonidas-from-XIV · 2024-03-20T14:12:29Z

Agreed, you do make a good point. It's probably better to think of it an inessential improvement for current users of Yojson that e.g. value API stability, interop and speed vs. an essential improvement for people who need the error reporting.

The problem I see here is that extending the current parser to save locations would make parsing slower due to a lot more allocations and making that conditional would make the parser a good deal harder to maintain. Creating another parser solves this, but creates another issue of compliance - we probably don't want to have a parser with locations that parses data one way and a parser without locations that parses the same data another way.

However, if someone would want to step up as a co-maintainer to make sure a parser with locations and an extended AST that is identical to, say Yojson.Safe.t with the locations stripped, I think that could be a good compromise and even allows people to use the parsed JSON data structure in libraries that requires Yojson.Safe.t. With e.g. QCheck we have the tooling for it, so it's more a question of developer-time.

ejgallego · 2024-03-20T16:47:47Z

Indeed!

While I'm not expert on parsing, it seems to me that a parser geared towards user input is very different than a parser geared towards de-serialization. In particular locations are not enough, you also want to recover properly which makes the parser harder (tho easy to do with modern tools like the amazing menhir)

So that would be a different project, maybe in a different repository.

I also agree that testing for compliance would be important, but I think it should be easy to write a test driver that parses the test-cases twice and indeed check they are identical to the yojson reference parser.

Leonidas-from-XIV added the feature request label Jun 4, 2024

Leonidas-from-XIV changed the title ~~[feature request] Parsing locations~~ Parsing locations Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsing locations #159

Parsing locations #159

ejgallego commented Feb 6, 2023

mjambon commented Feb 9, 2023

ejgallego commented Feb 9, 2023

abishekvashok commented Mar 29, 2023

Leonidas-from-XIV commented Mar 29, 2023

abishekvashok commented Mar 29, 2023 •

edited

Loading

Leonidas-from-XIV commented Mar 29, 2023

gcluzel commented Mar 20, 2024

Leonidas-from-XIV commented Mar 20, 2024

ejgallego commented Mar 20, 2024

Leonidas-from-XIV commented Mar 20, 2024

ejgallego commented Mar 20, 2024

Parsing locations #159

Parsing locations #159

Comments

ejgallego commented Feb 6, 2023

mjambon commented Feb 9, 2023

ejgallego commented Feb 9, 2023

abishekvashok commented Mar 29, 2023

Leonidas-from-XIV commented Mar 29, 2023

abishekvashok commented Mar 29, 2023 • edited Loading

Leonidas-from-XIV commented Mar 29, 2023

gcluzel commented Mar 20, 2024

Leonidas-from-XIV commented Mar 20, 2024

ejgallego commented Mar 20, 2024

Leonidas-from-XIV commented Mar 20, 2024

ejgallego commented Mar 20, 2024

abishekvashok commented Mar 29, 2023 •

edited

Loading