Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing locations #159

Open
ejgallego opened this issue Feb 6, 2023 · 11 comments
Open

Parsing locations #159

ejgallego opened this issue Feb 6, 2023 · 11 comments

Comments

@ejgallego
Copy link

Dear yojson devs,

I'm reading some JSON data that I then analize, however, as to emit correct source locations for the analysis I'd need that the parsed JSON would contain locations on the original input file / buffer.

Is that something possible to do with yojson?

TIA!

@mjambon
Copy link
Member

mjambon commented Feb 9, 2023

Historically, yojson was focused on speed and that's why the tree contains the data and nothing else. Getting a proper AST with locations and more is useful for some applications. Having this in yojson is up to the maintainers but it would be an addition that doesn't have much in common with the existing implementation.

With some extra preprocessor hacking, we may be able to reuse the existing lexer which currently uses a hack to disable location tracking to improve performance. I'm not sure what others think.

There's also the YAML path, which would consist in finding a YAML parser that returns a precise AST and has an option to restrict itself to JSON (modern YAML being a strict superset of JSON IIRC). I don't know much about the possibilities of ocaml-yaml but it may be worth looking into it.

@ejgallego
Copy link
Author

Thanks for the feedback @mjambon , indeed location tracking can decrease efficiently greatly, good point.

I guess it makes sense to keep this issue open to see if a solution could be found, for now, for my own hacking purposes I derived a JSON parser from RWC that does indeed parse with locations, sharing it here in case could be useful to others:

ejgallego/coq-lsp#306

@abishekvashok
Copy link

Hey, yes, having location can greatly help my cause as well.
Would appreciate adding that capability like a switch if possible. I guess all it needs is to uncomment a couple of lines in read.mll ?

I will be happy to make a patch if the maintainers are receptive of it :)

@Leonidas-from-XIV
Copy link
Member

The problem is how to represent this in the type. Can't really change the AST without breaking every single user of Yojson, so a new AST would need to be created.

In such case, it would be useful to revisit the structure of the AST as it is now anyway (see #158), to not have to do this in the future… this turns into a whole rabbit hole of issues so it's not particularly easy to tackle I would say without creating a lot of issues and questions as @mjambon mentions.

@abishekvashok
Copy link

abishekvashok commented Mar 29, 2023

Can't we reuse the existing lexer,

yojson/lib/read.mll

Lines 2 to 23 in cf2f130

module Lexing =
(*
We override Lexing.engine in order to avoid creating a new position
record each time a rule is matched.
This reduces total parsing time by about 31%.
*)
struct
include Lexing
external c_engine : lex_tables -> int -> lexbuf -> int = "caml_lex_engine"
let engine tbl state buf =
let result = c_engine tbl state buf in
(*
if result >= 0 then begin
buf.lex_start_p <- buf.lex_curr_p;
buf.lex_curr_p <- {buf.lex_curr_p
with pos_cnum = buf.lex_abs_pos + buf.lex_curr_pos};
end;
*)
result
end

I don't know what all needs to be done after uncommenting those lines, and don't no whether conditional include or overriding is permitted in Ocaml so as to make this a switch.

At least some sort of support, like to have line numbers while matched, would be cool

@Leonidas-from-XIV
Copy link
Member

You probably can, as the locations are in there so that's not the problem. The main issue is how to attach line numbers to the AST.

You could use a similar approach as opam-file-format, wrapping the current AST into with_pos, but this adds a lot of intermediate records and allocations:

https://github.com/ocaml/opam-file-format/blob/303ab85afb67c7c22ea548b87241b08616b9f6f2/src/opamParserTypes.ml#L59-L62

@gcluzel
Copy link
Contributor

gcluzel commented Mar 20, 2024

Having locations embedded directly within the AST would really be beneficial in scenarios where the JSON file originates directly from a user. In such cases, prioritizing parsing efficiency alone seems inefficient when compared to the enhanced utility that could be gained from parsing locations, allowing for more informative messages to be displayed to the user.

What about creating new ASTs that store the locations directly in the nodes, more or less like that:

type pos = Lexing.position * Lexing.position
type t = [
  | `Null of pos
  | `Int of int * pos
  | `Assoc of (string * t * pos) list * pos
  ... ]

Would it make sense?

@Leonidas-from-XIV
Copy link
Member

Such an AST would work (though I'd probably define the type as (t * pos) to not have to write pos on every constructor), the issue is that it is a completely new AST so it would be yet another type that very few libraries use while requiring implementation and maintenance work for a comparatively small improvement.

Given the amount of active maintainers in this project is very low and the demand is sporadic I have a hard time seeing this move forward anytime soon.

@ejgallego
Copy link
Author

I guess more than seeing the improvement as "small", it is more in the category of "critical for some users" vs "non important for others", for example if you need to parse user-provided JSON , locations are certainly a must if you need to provide feedback.

Maybe indeed such parsing library would make sense as an independent project? I'm happy to fork the lib I wrote for coq-lsp and make it into a (Yo)json parser that is better suited for interactive use (error recovery, locations, etc...)

@Leonidas-from-XIV
Copy link
Member

Agreed, you do make a good point. It's probably better to think of it an inessential improvement for current users of Yojson that e.g. value API stability, interop and speed vs. an essential improvement for people who need the error reporting.

The problem I see here is that extending the current parser to save locations would make parsing slower due to a lot more allocations and making that conditional would make the parser a good deal harder to maintain. Creating another parser solves this, but creates another issue of compliance - we probably don't want to have a parser with locations that parses data one way and a parser without locations that parses the same data another way.

However, if someone would want to step up as a co-maintainer to make sure a parser with locations and an extended AST that is identical to, say Yojson.Safe.t with the locations stripped, I think that could be a good compromise and even allows people to use the parsed JSON data structure in libraries that requires Yojson.Safe.t. With e.g. QCheck we have the tooling for it, so it's more a question of developer-time.

@ejgallego
Copy link
Author

Indeed!

While I'm not expert on parsing, it seems to me that a parser geared towards user input is very different than a parser geared towards de-serialization. In particular locations are not enough, you also want to recover properly which makes the parser harder (tho easy to do with modern tools like the amazing menhir)

So that would be a different project, maybe in a different repository.

I also agree that testing for compliance would be important, but I think it should be easy to write a test driver that parses the test-cases twice and indeed check they are identical to the yojson reference parser.

@Leonidas-from-XIV Leonidas-from-XIV changed the title [feature request] Parsing locations Parsing locations Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants