[mysql/oracle] Add remaining ports to support all targets. #4345

Merged
merged 29 commits into from
Dec 6, 2024

Conversation

kaby76
Contributor

@kaby76 kaby76 commented Nov 29, 2024

This PR implements the rest of the ports of the mysql/oracle grammar. This is a first step in generating each of the ports automatically from the official source (here).

These ports are necessary because people want to work with a grammar with a minimum of effort. Asking them to implement a port of the grammar is time-consuming and difficult.

In fact, this PR was difficult. Emitting two tokens from the DOT_IDENTIFIER rule was outrageously hard to implement because every target has its idiosyncrasies. (It probably would have been easier to have "emitDot()" emit both the DOT_SYMBOL and the IDENTIFIER. I think I've seen this issue in other grammars.) The use of base class methods in the Antlr4ng source is inconsistent. (Actions need to be implemented as base class methods for the grammar to be target-agnostic.)
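To illustrate the two-token problem, here is a minimal, self-contained sketch of the general pattern: one rule match produces a DOT_SYMBOL and an IDENTIFIER, with the second token held in a plain queue that nextToken() drains first. It does not use the ANTLR runtime, and the class, the token shape, and the regular expression are all hypothetical. It is not the code in this PR.

```typescript
// Hypothetical token shape; a real port would build runtime Token objects instead.
type SketchTokenType = "DOT_SYMBOL" | "IDENTIFIER" | "EOF";

interface SketchToken {
    type: SketchTokenType;
    text: string;
    start: number; // index of the first character
    stop: number;  // index of the last character
}

class DotIdentifierLexerSketch {
    // A plain array used as a FIFO queue of tokens waiting to be returned.
    private readonly pending: SketchToken[] = [];
    private readonly input: string;
    private pos = 0;

    public constructor(input: string) {
        this.input = input;
    }

    public nextToken(): SketchToken {
        // Return queued tokens before scanning any further input.
        const queued = this.pending.shift();
        if (queued !== undefined) {
            return queued;
        }
        if (this.pos >= this.input.length) {
            return { type: "EOF", text: "<EOF>", start: this.pos, stop: this.pos - 1 };
        }

        // Stand-in for the lexer rules: an identifier, optionally preceded by a dot.
        const match = /^\.?[A-Za-z_][A-Za-z0-9_]*/.exec(this.input.slice(this.pos));
        if (match === null) {
            throw new Error(`unexpected character at index ${this.pos}`);
        }
        const text = match[0];
        const start = this.pos;
        const stop = start + text.length - 1;
        this.pos += text.length;

        if (text.startsWith(".")) {
            // One match, two tokens: queue the identifier, return the dot now.
            // Each token gets its own start and stop index.
            this.pending.push({ type: "IDENTIFIER", text: text.slice(1), start: start + 1, stop });
            return { type: "DOT_SYMBOL", text: ".", start, stop: start };
        }
        return { type: "IDENTIFIER", text, start, stop };
    }
}
```

The part that is easy to get wrong in each target is the bookkeeping: the start and stop indices (and, in a real lexer, line and column) of both tokens have to be adjusted by hand.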

Previously, from the original Antlr4ng port, initialization was done in the driver:

lexer.serverVersion = 80200;
lexer.sqlModeFromString("ANSI_QUOTES");
lexer.charSets = charSets;
parser.serverVersion = lexer.serverVersion;
parser.sqlModes = lexer.sqlModes;

This is actually a bad place to put initialization because it requires the user to read the driver code or the readme. Most people who use this repo don't do either. Driver code is not part of the grammar. Initialization should be done in a constructor. NB: The Go port required a workaround.
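As a rough sketch of what "initialization in a constructor" means here, the classes below set the same defaults as the driver snippet above, but inside the base class constructors. The class names, the SqlMode type, and the mode-name table are hypothetical and simplified; the real base classes extend the ANTLR runtime's lexer and parser classes and handle more modes.

```typescript
// Hypothetical, simplified stand-in for the SqlMode type used by the ports.
type SqlMode = "AnsiQuotes" | "HighNotPrecedence" | "PipesAsConcat" | "IgnoreSpace" | "NoBackslashEscapes";

// Assumed mapping from mode strings to modes, for illustration only.
const MODE_NAMES = new Map<string, SqlMode>([
    ["ANSI_QUOTES", "AnsiQuotes"],
    ["HIGH_NOT_PRECEDENCE", "HighNotPrecedence"],
    ["PIPES_AS_CONCAT", "PipesAsConcat"],
    ["IGNORE_SPACE", "IgnoreSpace"],
    ["NO_BACKSLASH_ESCAPES", "NoBackslashEscapes"],
]);

class LexerBaseSketch {
    public serverVersion: number;
    public sqlModes: Set<SqlMode>;

    public constructor() {
        // Defaults live here, so the grammar works without any driver setup.
        this.serverVersion = 80200;
        this.sqlModes = new Set<SqlMode>();
        this.sqlModeFromString("ANSI_QUOTES");
    }

    // Replaces the current modes with those named in a comma-separated string.
    // Unknown names are ignored in this sketch.
    public sqlModeFromString(modes: string): void {
        this.sqlModes = new Set<SqlMode>();
        for (const name of modes.toUpperCase().split(",")) {
            const mode = MODE_NAMES.get(name.trim());
            if (mode !== undefined) {
                this.sqlModes.add(mode);
            }
        }
    }
}

class ParserBaseSketch {
    public serverVersion: number;
    public sqlModes: Set<SqlMode>;

    public constructor() {
        // Same defaults as the lexer, so the two agree out of the box.
        this.serverVersion = 80200;
        this.sqlModes = new Set<SqlMode>(["AnsiQuotes"]);
    }
}
```

A driver can still override the fields after construction; the difference is that forgetting to do so no longer leaves the lexer and parser uninitialized.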

What was done?

  • The existing ports (Antlr4ng, CSharp, Java, Python3, TypeScript) were fixed to set defaults in constructors. For the Go port, the constructor for the base class is not called. To get around this, many of the fields were placed in a static var, and an init() was defined to initialize it.
  • The custom Test.* files were removed.
  • StackQueue<> was removed and all ports now use a standard queue data structure.
  • SqlMode and SqlModes were placed in their own files.
  • The Cpp, Dart, Go, and JavaScript ports were added.

Performance

As I mentioned in a discussion, I have been writing scripts to output a graphical comparison of the relative speed across targets. Here is the graph for this grammar.

[Figure "times": relative parse times across targets]
data.zip

As mentioned previously, the large variance in the times for Java is due to anti-virus and disk caching for the first run of the parser app. (There are over 800 .class files generated by the Java compiler!)

To do

  • I should probably have a metric for "tokens parsed per second" to compare different grammars, e.g., mysql/oracle vs. postgresql. And, I should also devise other metrics, such as total parse tree size for the test suite.
  • charSets is still not implemented.
  • General clean up, tighten up, and sort the base class methods.
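For the first to-do item, the "tokens parsed per second" metric could be computed along these lines. This is only a sketch: the RunSample shape and its field names are hypothetical, and nothing like it exists in the scripts yet.

```typescript
// Hypothetical record for one timed run of a parser app on the test suite.
interface RunSample {
    target: string;     // e.g. "Java" or "Python3"
    tokenCount: number; // total tokens the lexer produced for the test suite
    elapsedMs: number;  // wall-clock parse time in milliseconds
}

// Tokens parsed per second for one run.
function tokensPerSecond(sample: RunSample): number {
    if (sample.elapsedMs <= 0) {
        throw new RangeError("elapsedMs must be positive");
    }
    return sample.tokenCount / (sample.elapsedMs / 1000);
}
```

Because the result is normalized by token count, it could be compared across grammars with test suites of different sizes, which raw run times cannot.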

@mike-lischke This is a good first step in getting the ports generated via a script from the source. This grammar in target-agnostic format works fine. As with all grammars in this repo, Python3 is pretty slow. But, it does work.

…, so Test.* can be generated/removed. Add Dart port.
…n a class, so enums have to be global! What a screwed up OO language.
It should probably not be done this way, but this does work, and the properties for indices, line and column, and text are now all correct. *Original source code by Mike has bugs.*
@kaby76 kaby76 marked this pull request as ready for review December 3, 2024 03:44
Member

@mike-lischke mike-lischke left a comment

You put a lot of work into this. Are you planning to do the same for all grammars? That could be a job that takes several years :-)

@mike-lischke
Member

Very nice to see the performance charts! Interesting numbers and interesting to see that Dart does better than C++ (although only by a small margin).

About the TypeScript code: this was meant to be test/example code. The users of the grammar have to implement the base classes on their own anyway, to match their environment. I don't think it makes much sense to port this demo code to all supported targets. And to serve as a model/template, a single implementation is all that's needed.

In any case, the effort to make this work cross-target is enormous, and I want to get rid of this kind of target-specific action altogether in my TS port of ANTLR4. From today's perspective it was a bad decision to allow native code to go into a grammar, but hey, this is how things go sometimes.

@kaby76
Contributor Author

kaby76 commented Dec 3, 2024

Yes, there's a lot of work to do here to make ports for every split grammar. But, I think time would be better spent on understanding how to detect and remove grammar ambiguity and fallbacks, and apply it to the TIOBE-rated programming language grammars. Antlr is wonderful in accepting all sorts of bad grammars, but that is also why people tend to give it a bad rap. (You should see the trash-talk about Antlr on Reddit.) Looking forward to Antlrng picking up where Antlr4 stops.

@teverett
Member

teverett commented Dec 6, 2024

@kaby76 thanks!

@teverett teverett merged commit b411201 into antlr:master Dec 6, 2024
31 checks passed
3 participants