
Unicode support? #40

Open · manticore-projects opened this issue Jan 7, 2023 · 49 comments

@manticore-projects

select * from मकान;

Parse error at line 1, column 10. Encountered: from
select * from मकान;

Same for Thai, Traditional Chinese and German Umlauts.
I believe we will need to allow Unicode letters explicitly?

@kaikalur commented Jan 7, 2023

Hmm, looks like the spec requires things to be Unicode-escaped?

| <Unicode_delimited_identifier: "U" "&" "\"" <Unicode_delimiter_body> "\"" ( <Unicode_escape_specifier> )?  >

is one of the rules.

@kaikalur commented Jan 7, 2023

Oh never mind. Didn't implement it lol:

| <#identifier_start: ["a"-"z"]>  // temp

It's a todo

@kaikalur commented Jan 7, 2023

This will be a good exercise: define the Unicode char classes in a separate file as token macros and use them to define identifiers.

@manticore-projects (Author)

Yes, I thought as much.
Thanks for the feedback.

@kaikalur commented Jan 8, 2023

https://www.unicode.org/Public/UNIDATA/UnicodeData.txt

has the definitions. Should be easy to produce private token definitions from this, at least up to FFFD. Looks like Unicode is now more than 16 bits :( JavaCC doesn't support that yet.
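(For illustration only - a minimal Java sketch of that kind of generator, not the code from the PR; the class name, path, and category filter are made up:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: emit a JavaCC-style char list for one Unicode General Category
// from UnicodeData.txt. Stays at or below 0xFFFD because JavaCC tokens are 16-bit.
public class EmitCharList {
    public static void main(String[] args) throws IOException {
        String category = "Ll"; // e.g. lowercase letters; placeholder choice
        for (String line : Files.readAllLines(Paths.get("UnicodeData.txt"))) {
            String[] f = line.split(";", -1); // fields: code;name;category;...
            int cp = Integer.parseInt(f[0], 16);
            // "<..., First>"/"<..., Last>" block entries need separate handling
            boolean rangeMarker = f[1].endsWith(", First>") || f[1].endsWith(", Last>");
            if (cp <= 0xFFFD && f[2].equals(category) && !rangeMarker)
                System.out.printf(", \"\\u%04X\"    // %s%n", cp, f[1]);
        }
    }
}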

@kaikalur commented Jan 8, 2023

#41

Initial version! Also added test in my mother tongue (Telugu) lol:

"SELECT 1 ఒకట;"

@manticore-projects (Author) commented Jan 8, 2023

> #41
>
> Initial version! Also added test in my mother tongue (Telugu) lol:
>
> "SELECT 1 ఒకట;"

I must confess that I am not able to differentiate between Thai, Khmer, Lao and Telugu.
But I am glad that it works!

@kaikalur commented Jan 8, 2023

> #41
> Initial version! Also added test in my mother tongue (Telugu) lol:
> "SELECT 1 ఒకట;"

> I must confess that I am not able to differentiate between Thai, Khmer, Lao and Telugu. But I am glad that it works!

It's Telugu - spoken in the south-east Indian state of Andhra Pradesh, which is where I'm from originally (though now I'm basically a native Californian - after 30 years lol).

@manticore-projects (Author) commented Jan 8, 2023

That's a lot of tokens! Can't we define ranges instead?


PART_LETTER
         ::= [$@0-9A-Z_a-z#x23#x0-#x8#xE-#x1B#x7F-#x9F#xA2-#xA5#xAA#xB5#xBA#xC0-#xD6#xD8-#xF6#xF8-#x21F#x222-#x233#x250-#x2AD#x2B0-#x2B8#x2BB-#x2C1#x2D0-#x2D1#x2E0-#x2E4#x2EE#x300-#x34E#x360-#x362#x37A#x386#x388-#x38A#x38C#x38E-#x3A1#x3A3-#x3CE#x3D0-#x3D7#x3DA-#x3F3#x400-#x481#x483-#x486#x48C-#x4C4#x4C7-#x4C8#x4CB-#x4CC#x4D0-#x4F5#x4F8-#x4F9#x531-#x556#x559#x561-#x587#x591-#x5A1#x5A3-#x5B9#x5BB-#x5BD#x5BF#x5C1-#x5C2#x5C4#x5D0-#x5EA#x5F0-#x5F2#x621-#x63A#x640-#x655#x660-#x669#x670-#x6D3#x6D5-#x6DC#x6DF-#x6E8#x6EA-#x6ED#x6F0-#x6FC#x70F-#x72C#x730-#x74A#x780-#x7B0#x901-#x903#x905-#x939#x93C-#x94D#x950-#x954#x958-#x963#x966-#x96F#x981-#x983#x985-#x98C#x98F-#x990#x993-#x9A8#x9AA-#x9B0#x9B2#x9B6-#x9B9#x9BC#x9BE-#x9C4#x9C7-#x9C8#x9CB-#x9CD#x9D7#x9DC-#x9DD#x9DF-#x9E3#x9E6-#x9F3#xA02#xA05-#xA0A#xA0F-#xA10#xA13-#xA28#xA2A-#xA30#xA32-#xA33#xA35-#xA36#xA38-#xA39#xA3C#xA3E-#xA42#xA47-#xA48#xA4B-#xA4D#xA59-#xA5C#xA5E#xA66-#xA74#xA81-#xA83#xA85-#xA8B#xA8D#xA8F-#xA91#xA93-#xAA8#xAAA-#xAB0#xAB2-#xAB3#xAB5-#xAB9#xABC-#xAC5#xAC7-#xAC9#xACB-#xACD#xAD0#xAE0#xAE6-#xAEF#xB01-#xB03#xB05-#xB0C#xB0F-#xB10#xB13-#xB28#xB2A-#xB30#xB32-#xB33#xB36-#xB39#xB3C-#xB43#xB47-#xB48#xB4B-#xB4D#xB56-#xB57#xB5C-#xB5D#xB5F-#xB61#xB66-#xB6F#xB82-#xB83#xB85-#xB8A#xB8E-#xB90#xB92-#xB95#xB99-#xB9A#xB9C#xB9E-#xB9F#xBA3-#xBA4#xBA8-#xBAA#xBAE-#xBB5#xBB7-#xBB9#xBBE-#xBC2#xBC6-#xBC8#xBCA-#xBCD#xBD7#xBE7-#xBEF#xC01-#xC03#xC05-#xC0C#xC0E-#xC10#xC12-#xC28#xC2A-#xC33#xC35-#xC39#xC3E-#xC44#xC46-#xC48#xC4A-#xC4D#xC55-#xC56#xC60-#xC61#xC66-#xC6F#xC82-#xC83#xC85-#xC8C#xC8E-#xC90#xC92-#xCA8#xCAA-#xCB3#xCB5-#xCB9#xCBE-#xCC4#xCC6-#xCC8#xCCA-#xCCD#xCD5-#xCD6#xCDE#xCE0-#xCE1#xCE6-#xCEF#xD02-#xD03#xD05-#xD0C#xD0E-#xD10#xD12-#xD28#xD2A-#xD39#xD3E-#xD43#xD46-#xD48#xD4A-#xD4D#xD57#xD60-#xD61#xD66-#xD6F#xD82-#xD83#xD85-#xD96#xD9A-#xDB1#xDB3-#xDBB#xDBD#xDC0-#xDC6#xDCA#xDCF-#xDD4#xDD6#xDD8-#xDDF#xDF2-#xDF3#xE01-#xE3A#xE3F-#xE4E#xE50-#xE59#xE81-#xE82#xE84#xE87-#xE88#xE8A#xE8D#xE94-#xE97#xE99-#xE9F#xEA1-#xEA3#xEA5#xEA7#xEAA-#xEAB#xEAD-#xEB9#xEBB-#xEBD#xEC0-#xEC4#xEC6#xEC8-#xECD#xED0-#xED9#xEDC-#xEDD#xF00#xF18-#xF19#xF20-#xF29#xF35#xF37#xF39#xF3E-#xF47#xF49-#xF6A#xF71-#xF84#xF86-#xF8B#xF90-#xF97#xF99-#xFBC#xFC6#x1000-#x1021#x1023-#x1027#x1029-#x102A#x102C-#x1032#x1036-#x1039#x1040-#x1049#x1050-#x1059#x10A0-#x10C5#x10D0-#x10F6#x1100-#x1159#x115F-#x11A2#x11A8-#x11F9#x1200-#x1206#x1208-#x1246#x1248#x124A-#x124D#x1250-#x1256#x1258#x125A-#x125D#x1260-#x1286#x1288#x128A-#x128D#x1290-#x12AE#x12B0#x12B2-#x12B5#x12B8-#x12BE#x12C0#x12C2-#x12C5#x12C8-#x12CE#x12D0-#x12D6#x12D8-#x12EE#x12F0-#x130E#x1310#x1312-#x1315#x1318-#x131E#x1320-#x1346#x1348-#x135A#x1369-#x1371#x13A0-#x13F4#x1401-#x166C#x166F-#x1676#x1681-#x169A#x16A0-#x16EA#x1780-#x17D3#x17DB#x17E0-#x17E9#x180B-#x180E#x1810-#x1819#x1820-#x1877#x1880-#x18A9#x1E00-#x1E9B#x1EA0-#x1EF9#x1F00-#x1F15#x1F18-#x1F1D#x1F20-#x1F45#x1F48-#x1F4D#x1F50-#x1F57#x1F59#x1F5B#x1F5D#x1F5F-#x1F7D#x1F80-#x1FB4#x1FB6-#x1FBC#x1FBE#x1FC2-#x1FC4#x1FC6-#x1FCC#x1FD0-#x1FD3#x1FD6-#x1FDB#x1FE0-#x1FEC#x1FF2-#x1FF4#x1FF6-#x1FFC#x200C-#x200F#x202A-#x202E#x203F-#x2040#x206A-#x206F#x207F#x20A0-#x20AF#x20D0-#x20DC#x20E1#x2102#x2107#x210A-#x2113#x2115#x2119-#x211D#x2124#x2126#x2128#x212A-#x212D#x212F-#x2131#x2133-#x2139#x2160-#x2183#x3005-#x3007#x3021-#x302F#x3031-#x3035#x3038-#x303A#x3041-#x3094#x3099-#x309A#x309D-#x309E#x30A1-#x30FE#x3105-#x312C#x3131-#x318E#x31A0-#x31B7#x3400-#x4DB5#x4E00-#x9FA5#xA000-#xA48C#xAC00-#xD7A3#xF900-#xFA2D#xFB00-#xFB06#xFB13-#xFB17#xFB1D-#xFB28#xFB2A-#xFB36#xFB38-#xFB3C#xFB3E#xFB40-#xFB41#xFB43-#xFB44#xFB46-#xFBB1#xFBD3-#xFD
3D#xFD50-#xFD8F#xFD92-#xFDC7#xFDF0-#xFDFB#xFE20-#xFE23#xFE33-#xFE34#xFE4D-#xFE4F#xFE69#xFE70-#xFE72#xFE74#xFE76-#xFEFC#xFEFF#xFF04#xFF10-#xFF19#xFF21-#xFF3A#xFF3F#xFF41-#xFF5A#xFF65-#xFFBE#xFFC2-#xFFC7#xFFCA-#xFFCF#xFFD2-#xFFD7#xFFDA-#xFFDC#xFFE0-#xFFE1#xFFE5-#xFFE6#xFFF9-#xFFFB]

@kaikalur commented Jan 8, 2023

Hmm, where did you get that from? See my PR - I just massaged the UnicodeData.txt file (attached above) and produced char sets and ranges, exactly as defined in that Unicode data file. They are not tokens. They are explicitly listed out along with comments, so if there is a problem we can check easily. Look at the unicode-identifiers.txt file in my PR.

@kaikalur commented Jan 8, 2023

> That's a lot of tokens! Can't we define ranges instead?
>
> [PART_LETTER range definition quoted above]

Like I mentioned earlier, spec compliance and easy verification against the spec are what I'm going for in this project.

@manticore-projects (Author)

> See my PR - I just massaged the UnicodeData.txt file (attached above) and produced char sets and ranges

TOKEN:
{
<#Ll:
  [ "\u0061"    // LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  , "\u0062"    // LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
  , "\u0063"    // LATIN SMALL LETTER C;Ll;0;L;;;;;N;;;0043;;0043
  , "\u0064"    // LATIN SMALL LETTER D;Ll;0;L;;;;;N;;;0044;;0044
  , "\u0065"    // LATIN SMALL LETTER E;Ll;0;L;;;;;N;;;0045;;0045
  , "\u0066"    // LATIN SMALL LETTER F;Ll;0;L;;;;;N;;;0046;;0046
  , "\u0067"    // LATIN SMALL LETTER G;Ll;0;L;;;;;N;;;0047;;0047
  , "\u0068"    // LATIN SMALL LETTER H;Ll;0;L;;;;;N;;;0048;;0048
  , "\u0069"    // LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
  , "\u006A"    // LATIN SMALL LETTER J;Ll;0;L;;;;;N;;;004A;;004A
  , "\u006B"    // LATIN SMALL LETTER K;Ll;0;L;;;;;N;;;004B;;004B
  , "\u006C"    // LATIN SMALL LETTER L;Ll;0;L;;;;;N;;;004C;;004C
  , "\u006D"    // LATIN SMALL LETTER M;Ll;0;L;;;;;N;;;004D;;004D
  , "\u006E"    // LATIN SMALL LETTER N;Ll;0;L;;;;;N;;;004E;;004E
  , "\u006F"    // LATIN SMALL LETTER O;Ll;0;L;;;;;N;;;004F;;004F
  , "\u0070"    // LATIN SMALL LETTER P;Ll;0;L;;;;;N;;;0050;;0050
  , "\u0071"    // LATIN SMALL LETTER Q;Ll;0;L;;;;;N;;;0051;;0051
  , "\u0072"    // LATIN SMALL LETTER R;Ll;0;L;;;;;N;;;0052;;0052
  , "\u0073"    // LATIN SMALL LETTER S;Ll;0;L;;;;;N;;;0053;;0053
  , "\u0074"    // LATIN SMALL LETTER T;Ll;0;L;;;;;N;;;0054;;0054
  , "\u0075"    // LATIN SMALL LETTER U;Ll;0;L;;;;;N;;;0055;;0055
  , "\u0076"    // LATIN SMALL LETTER V;Ll;0;L;;;;;N;;;0056;;0056
  , "\u0077"    // LATIN SMALL LETTER W;Ll;0;L;;;;;N;;;0057;;0057
  , "\u0078"    // LATIN SMALL LETTER X;Ll;0;L;;;;;N;;;0058;;0058
  , "\u0079"    // LATIN SMALL LETTER Y;Ll;0;L;;;;;N;;;0059;;0059
  , "\u007A"    // LATIN SMALL LETTER Z;Ll;0;L;;;;;N;;;005A;;005A
  , "\u00B5"    // MICRO SIGN;Ll;0;L; | <compat> 03BC;;;;N;;;039C;;039C

I did that, and I found that every single character is defined one by one (unless I am completely lost again). This creates a huge grammar which I could not open in Git or the IDE (unless I switch off the internal parser/plugins).

Of course it is correct and pure, but you can achieve the same by defining ranges (instead of character by character).

@kaikalur commented Jan 8, 2023

> See my PR - I just massaged the UnicodeData.txt file (attached above) and produced char sets and ranges
>
> [TOKEN <#Ll: ...> char list quoted above]
>
> I did that, and I found that every single character is defined one by one (unless I am completely lost again). This creates a huge grammar which I could not open in Git or the IDE (unless I switch off the internal parser/plugins).
>
> Of course it is correct and pure, but you can achieve the same by defining ranges (instead of character by character).

It's not that big - only 22k. Also, they are character lists; why are you displaying char lists as tokens? There are only 6 local tokens, like Ll, Lu, Lo etc. So the tool processing this grammar should do a better job lol. Or we can strip out the comments if grammar loading is a problem.

@kaikalur commented Jan 8, 2023

> See my PR - I just massaged the UnicodeData.txt file (attached above) and produced char sets and ranges
>
> [TOKEN <#Ll: ...> char list quoted above]
>
> I did that, and I found that every single character is defined one by one (unless I am completely lost again). This creates a huge grammar which I could not open in Git or the IDE (unless I switch off the internal parser/plugins).
> Of course it is correct and pure, but you can achieve the same by defining ranges (instead of character by character).

No, that could be bug-prone. But if you want to do that, write a preprocessor that takes this file and compacts it into ranges before the concatenation.


@manticore-projects (Author)

> It's not that big - only 22k.

The TEXT file is 1.5 MB, and without mangling it will end up like that in the grammar (where it will be translated into one single token, of course; my use of the term "tokens" was not correct when I was referring to the explicit characters). Although I agree that the shell script can at least suppress the comments, and maybe compress it into ranges.

Although I don't understand why we don't want to provide ranges as per the Unicode page -- but that's rather a matter of taste and not worth debating.

@manticore-projects (Author)

This one: https://jrgraphix.net/r/Unicode/

@kaikalur commented Jan 8, 2023

> It's not that big - only 22k.
>
> The TEXT file is 1.5 MB, and without mangling it will end up like that in the grammar. Although I agree that the shell script can at least suppress the comments, and maybe compress it into ranges.
>
> Although I don't understand why we don't want to provide ranges as per the Unicode page -- but that's rather a matter of taste and not worth debating.

There is no easy way to find the ranges in general. But it should be easy to do in your document generator! When there are char lists, you compact them for display purposes.
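(A sketch of that compaction, in Java rather than awk, just to pin down the idea - merge consecutive code points, and emit singletons without the "-"; class and method names are invented here:)

import java.util.ArrayList;
import java.util.List;

// Sketch: fold a sorted list of code points into "a"-"b" style ranges.
public class CompactRanges {
    static List<int[]> compact(int[] sorted) {
        List<int[]> ranges = new ArrayList<>();
        int start = sorted[0], prev = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] != prev + 1) {          // gap found: close the current range
                ranges.add(new int[] { start, prev });
                start = sorted[i];
            }
            prev = sorted[i];
        }
        ranges.add(new int[] { start, prev });
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : compact(new int[] { 0x61, 0x62, 0x63, 0x7A, 0xB5 })) {
            // the "range has only 1 character" case prints as a single char, not "a"-"a"
            String fmt = (r[0] == r[1]) ? "\"\\u%04X\"%n" : "\"\\u%04X\"-\"\\u%04X\"%n";
            System.out.printf(fmt, r[0], r[1]); // extra argument is ignored for singletons
        }
    }
}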

@kaikalur commented Jan 8, 2023

> This one: https://jrgraphix.net/r/Unicode/

That's not enough. You need Lu, Ll etc. as defined for each of those separate languages in the Unicode data txt file I attached. Also, this is not the official spec, so we can't use that lol.

@manticore-projects (Author) commented Jan 8, 2023

Some ranges are commented out already?

//  , "\u10480"    // OSMANYA LETTER ALEF;Lo;0;L;;;;;N;;;;;
//  , "\u10481"    // OSMANYA LETTER BA;Lo;0;L;;;;;N;;;;;
//  , "\u10482"    // OSMANYA LETTER TA;Lo;0;L;;;;;N;;;;;
//  , "\u10483"    // OSMANYA LETTER JA;Lo;0;L;;;;;N;;;;;
//  , "\u10484"    // OSMANYA LETTER XA;Lo;0;L;;;;;N;;;;;
//  , "\u10485"    // OSMANYA LETTER KHA;Lo;0;L;;;;;N;;;;;
//  , "\u10486"    // OSMANYA LETTER DEEL;Lo;0;L;;;;;N;;;;;
//  , "\u10487"    // OSMANYA LETTER RA;Lo;0;L;;;;;N;;;;;

Certainly Mr. Erdogan won't be too happy about that, but I would like to understand the idea behind it, please.

@kaikalur commented Jan 8, 2023

> Some ranges are commented out already?
>
> [commented-out OSMANYA lines quoted above]
>
> Certainly Mr. Erdogan won't be too happy about that, but I would like to understand the idea behind it, please.

Unfortunately, JavaCC cannot handle characters of more than 2 bytes :( Anything > 0xFFFF (i.e., beyond "\uffff") won't work.
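(The underlying reason, sketched in plain Java: the lexer sees 16-bit chars, and a supplementary character above 0xFFFF is two UTF-16 code units, so it never arrives as a single char - a minimal illustration, not project code:)

// Sketch: a supplementary character (> 0xFFFF) is a surrogate pair in UTF-16.
public class Surrogates {
    public static void main(String[] args) {
        String osmanya = new String(Character.toChars(0x10480)); // OSMANYA LETTER ALEF
        System.out.println(osmanya.codePointCount(0, osmanya.length())); // 1 code point...
        System.out.println(osmanya.length());                            // ...but 2 chars
        System.out.println(Character.isHighSurrogate(osmanya.charAt(0))); // true
    }
}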

@manticore-projects (Author)

> Unfortunately, JavaCC cannot handle characters of more than 2 bytes :( Anything > 0xFFFF won't work.

Ok, I get that. You are the boss, but in my opinion, if we can't support all of Unicode for technical reasons anyway, then we could easily stick with "practical" support of the most relevant Unicode, even if it misses one or two obscure characters.

Although, for me this is not the hill to die on. I will replace this text file for my own purposes.

@kaikalur commented Jan 8, 2023

> Ok, I get that. You are the boss, but in my opinion, if we can't support all of Unicode for technical reasons anyway, then we could easily stick with "practical" support of the most relevant Unicode, even if it misses one or two obscure characters.

This part of the Unicode spec is relatively new, and I don't think anyone supports it properly yet (not sure if Java even supports it - haven't checked). If/when Java supports it, we can extend JavaCC to do that as well.

> Although, for me this is not the hill to die on. I will replace this text file for my own purposes.

👍🏾

@manticore-projects (Author)

Yes, let's turn that into a selling point: "The only OSMANYA-supporting SQL parser in the world!" :-D

@kaikalur commented Jan 8, 2023

Somewhat related - the spec does not allow digits/numbers from other languages!

<digit> ::=
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

@manticore-projects (Author)

In my opinion, this makes a kind of sense, since the database would eventually need to calculate with the numbers.
On the upside, your identifiers can start with Thai digits, but not with Latin/Arabic digits -- so something for everyone.

@manticore-projects (Author) commented Jan 8, 2023

In case anyone else wants to use it: you can call it like

./unicode.sh Downloads/unicode-identifiers.txt

I kept it verbose for debugging purposes. We can remove the noise when verified.
Cheers!

Edit: imagine working on a parser and not knowing how large 2 bytes are :P

unicode-identifiers_compact.txt
unicode.sh.zip

@kaikalur commented Jan 9, 2023

Check out the PR again - I added an awk script (lol) to fix up the char ranges. We should add a test that it actually captures all the Unicode chars in all the ranges properly. Now we have < 500 ranges.

@manticore-projects (Author) commented Jan 9, 2023

Good morning, thank you for investing in this.

Unfortunately, your script throws irritating warnings for me. At first I thought it was not working, but then I found the generated file.

are@archlinux ~/D/s/p/p/grammar (unicode-support)> ./prepare-javacc-grammar.sh
awk: ./compact_char_sets.awk:3: warning: regexp escape sequence `\"' is not a known regexp operator
awk: ./compact_char_sets.awk:16: (FILENAME=- FNR=4) warning: regexp escape sequence `\u' is not a known regexp operator

Have you tested whether the output has as many characters as the output from my script? My understanding was that they should produce the same ranges and characters.

I have compared the output and we achieve more or less the same. The main difference is the segregation into Unicode categories: with categories, more ranges; without categories, a more compact output.

Although my AWK skills are extremely poor, I have 3 comments:

1. I did not understand where you handle the "range has only 1 character" case, e.g.
   <#Nl: ["ᛮ"-"ᛰ","Ⅰ"-"ↂ","ↅ"-"ↈ","〇","〡"-"〩","〸"-"〺","ꛦ"-"ꛯ"]>
   Confirmed when I found the generated file. All good.

2. I liked very much the split into the Nl, Lu ... Ll categories, as it helps to understand the characters we are dealing with. Also (at least in Thai) certain characters can't start an identifier. I believe identifiers can start only with Ll or Lu. If I am right, keeping the categories may be useful.

3. AWK may not be available on all machines (at least not in the standard installation).

So my recommendation would be to a) check in the script and b) check in the compacted Unicode files as well, and use those by default. They should literally never change. Although I feel I am digressing here.

One more thing: if we start pre-pre-processing with AWK scripts now, should we not rather operate on the official source https://www.unicode.org/Public/UNIDATA/UnicodeData.txt instead of on the intermediate unicode-identifiers.txt?

@manticore-projects (Author) commented Jan 9, 2023

> We should add a test that it actually captures all the Unicode chars in all the ranges properly.

I have rebuilt the ranges based on https://www.unicode.org/Public/UNIDATA/UnicodeData.txt, using the "L" category only.

> Now we have < 500 ranges.

I get a few more Ranges. Please see attached.
UnicodeData_compact.txt
unicode2.sh.zip

Example:
"\u00F8"-"\u02B8" vs "\u00F8"-"\u02C1"

02B7;MODIFIER LETTER SMALL W;Lm;0;L;<super> 0077;;;;N;;;;;
02B8;MODIFIER LETTER SMALL Y;Lm;0;L;<super> 0079;;;;N;;;;;
02B9;MODIFIER LETTER PRIME;Lm;0;ON;;;;;N;;;;;
02BA;MODIFIER LETTER DOUBLE PRIME;Lm;0;ON;;;;;N;;;;;
02BB;MODIFIER LETTER TURNED COMMA;Lm;0;L;;;;;N;;;;;

\u02B7 is a letter of category Lm with bidirectional class L, so it should be in.
But \u02B9 and \u02BA are not of class L (despite category Lm) -- should they still go in?
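(The two columns being compared here are field 2, the General Category, and field 4, the bidirectional class; a tiny sketch using the lines quoted above - class and method names are invented:)

// Sketch: UnicodeData.txt field 2 is the General Category (e.g. Lm),
// field 4 the bidirectional class (L vs. ON) - easy to conflate.
public class Fields {
    static void show(String line) {
        String[] f = line.split(";", -1);
        System.out.printf("U+%s  category=%s  bidi=%s%n", f[0], f[2], f[4]);
    }
    public static void main(String[] args) {
        show("02B8;MODIFIER LETTER SMALL Y;Lm;0;L;<super> 0079;;;;N;;;;;"); // Lm, bidi L
        show("02B9;MODIFIER LETTER PRIME;Lm;0;ON;;;;;N;;;;;");              // Lm, bidi ON
    }
}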

@manticore-projects (Author)

Then what about those:

0BBE;TAMIL VOWEL SIGN AA;Mc;0;L;;;;;N;;;;;
0BBF;TAMIL VOWEL SIGN I;Mc;0;L;;;;;N;;;;;
0BC0;TAMIL VOWEL SIGN II;Mn;0;NSM;;;;;N;;;;;
0BC1;TAMIL VOWEL SIGN U;Mc;0;L;;;;;N;;;;;
0BC2;TAMIL VOWEL SIGN UU;Mc;0;L;;;;;N;;;;;
0BC6;TAMIL VOWEL SIGN E;Mc;0;L;;;;;N;;;;;
0BC7;TAMIL VOWEL SIGN EE;Mc;0;L;;;;;N;;;;;
0BC8;TAMIL VOWEL SIGN AI;Mc;0;L;;;;;N;;;;;
0BCA;TAMIL VOWEL SIGN O;Mc;0;L;0BC6 0BBE;;;;N;;;;;
0BCB;TAMIL VOWEL SIGN OO;Mc;0;L;0BC7 0BBE;;;;N;;;;;
0BCC;TAMIL VOWEL SIGN AU;Mc;0;L;0BC6 0BD7;;;;N;;;;;

I don't speak Tamil, but with my understanding of Thai I would have expected those to be "in":

0E34;THAI CHARACTER SARA I;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA I;;;;
0E35;THAI CHARACTER SARA II;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA II;;;;
0E36;THAI CHARACTER SARA UE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UE;;;;
0E37;THAI CHARACTER SARA UEE;Mn;0;NSM;;;;;N;THAI VOWEL SIGN SARA UEE;;;;
0E38;THAI CHARACTER SARA U;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA U;;;;
0E39;THAI CHARACTER SARA UU;Mn;103;NSM;;;;;N;THAI VOWEL SIGN SARA UU;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;

You can't write Thai without those (although they never stand alone).

@manticore-projects (Author)

Currency symbols (Sc)?

0E3F;THAI CURRENCY SYMBOL BAHT;Sc;0;ET;;;;;N;THAI BAHT SIGN;;;;

@manticore-projects (Author)

Configurable categories:

CATEGORIES=("Lu" "Ll" "Lt" "Lm" "Lo" "Mn" "Mc" "Me" "Nl" "No" "Sc")

My favorite Thai vowels and the Small Dollar Sign are in, as well as the currency symbols.

UnicodeData_compact.txt
unicode3.sh.zip

@kaikalur commented Jan 9, 2023

The original Unicode list is complete, so they should all work if allowed! Here is the snippet:

> 1) An <identifier start> is any character in the Unicode General Category classes "Lu", "Ll", "Lt", "Lm", "Lo", or "Nl".

@kaikalur commented Jan 9, 2023

> Good morning, thank you for investing in this.
>
> Unfortunately, your script throws irritating warnings for me. At first I thought it was not working, but then I found the generated file.
>
> are@archlinux ~/D/s/p/p/grammar (unicode-support)> ./prepare-javacc-grammar.sh
> awk: ./compact_char_sets.awk:3: warning: regexp escape sequence `\"' is not a known regexp operator
> awk: ./compact_char_sets.awk:16: (FILENAME=- FNR=4) warning: regexp escape sequence `\u' is not a known regexp operator

Hmm let me check

> Have you tested whether the output has as many characters as the output from my script? My understanding was that they should produce the same ranges and characters.

It should be trivial to do that since we have the original file with all the allowed chars. Just generate a test from that.

> I have compared the output and we achieve more or less the same. The main difference is the segregation into Unicode categories: with categories, more ranges; without categories, a more compact output.

Yeah - when we are doing that - might as well do the best possible job!

> Although my AWK skills are extremely poor, I have 3 comments:
>
> 1. I did not understand where you handle the "range has only 1 character" case, e.g. <#Nl: ["ᛮ"-"ᛰ","Ⅰ"-"ↂ","ↅ"-"ↈ","〇","〡"-"〩","〸"-"〺","ꛦ"-"ꛯ"]>
>    Confirmed when I found the generated file. All good.

> 2. I liked very much the split into the Nl, Lu ... Ll categories, as it helps to understand the characters we are dealing with.

That's already in the original file

> Also (at least in Thai) certain characters can't start an identifier. I believe identifiers can start only with Ll or Lu. If I am right, keeping the categories may be useful.
>
> 3. AWK may not be available on all machines (at least not in the standard installation).

It is - it is one of the really old tools that's there on any self-respecting Linux installation lol

> So my recommendation would be to a) check in the script and b) check in the compacted Unicode files as well, and use those by default. They should literally never change. Although I feel I am digressing here.

> One more thing: if we start pre-pre-processing with AWK scripts now, should we not rather operate on the official source https://www.unicode.org/Public/UNIDATA/UnicodeData.txt instead of on the intermediate unicode-identifiers.txt?

Interesting idea! Should be doable. In fact that's a CSV - so we could bootstrap using sqlite or something lol

@manticore-projects (Author)

> Interesting idea! Should be doable. In fact that's a CSV - so we could bootstrap using sqlite or something lol

I have done that; please check the attached Bash script above.

@kaikalur commented Jan 9, 2023

> Interesting idea! Should be doable. In fact that's a CSV - so we could bootstrap using sqlite or something lol

> I have done that; please check the attached Bash script above.

One minor issue is the few ranges that file has - entries like "<..., First>" and "<..., Last>" mark the beginning/end of a range - annoying :(
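(Handling those is mechanical - a hedged sketch with invented names: pair each ", First>" entry with the following ", Last>" entry:)

import java.util.ArrayList;
import java.util.List;

// Sketch: UnicodeData.txt writes big blocks as two lines, e.g.
//   4E00;<CJK Ideograph, First>;Lo;...   and   9FA5;<CJK Ideograph, Last>;Lo;...
public class RangeEntries {
    static List<int[]> collect(List<String> lines) {
        List<int[]> ranges = new ArrayList<>();
        int pendingFirst = -1;
        for (String line : lines) {
            String[] f = line.split(";", -1);
            int cp = Integer.parseInt(f[0], 16);
            if (f[1].endsWith(", First>"))
                pendingFirst = cp;                          // remember block start
            else if (f[1].endsWith(", Last>"))
                ranges.add(new int[] { pendingFirst, cp }); // close the block
            else
                ranges.add(new int[] { cp, cp });           // ordinary single entry
        }
        return ranges;
    }
}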

@manticore-projects (Author) commented Jan 9, 2023

> 1) An <identifier start> is any character in the Unicode General Category classes "Lu", "Ll", "Lt", "Lm", "Lo", or "Nl".

Sorry, this does not make sense without "Mn" at least. Example: อักษรไทย - you will need the A vowel.

@kaikalur commented Jan 9, 2023

> 1) An <identifier start> is any character in the Unicode General Category classes "Lu", "Ll", "Lt", "Lm", "Lo", or "Nl".

> Sorry, this does not make sense without "Mn" at least.

I didn't come up with the spec lol, so I can't do that if it's not in there!

Also, they had:

> An <identifier extend> is U+00B7, "Middle Dot", or any character in the Unicode General Category classes "Mn", "Mc", "Nd", "Pc", or "Cf".

I need to add that as well
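(For reference, these two spec classes map one-to-one onto Java's own category constants, so a quick cross-check is possible without any generated tables - a sketch with an invented class name; note that Character.getType reflects whatever Unicode version the JDK ships:)

// Sketch: the spec's <identifier start> / <identifier extend> classes
// expressed via Character.getType (BMP chars only, matching the grammar).
public class SpecIdentifier {
    static boolean isStart(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:   // Lu
            case Character.LOWERCASE_LETTER:   // Ll
            case Character.TITLECASE_LETTER:   // Lt
            case Character.MODIFIER_LETTER:    // Lm
            case Character.OTHER_LETTER:       // Lo
            case Character.LETTER_NUMBER:      // Nl
                return true;
            default:
                return false;
        }
    }

    static boolean isExtend(char c) {
        if (c == '\u00B7') return true;        // Middle Dot
        switch (Character.getType(c)) {
            case Character.NON_SPACING_MARK:        // Mn
            case Character.COMBINING_SPACING_MARK:  // Mc
            case Character.DECIMAL_DIGIT_NUMBER:    // Nd
            case Character.CONNECTOR_PUNCTUATION:   // Pc
            case Character.FORMAT:                  // Cf
                return true;
            default:
                return false;
        }
    }
}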

@kaikalur commented Jan 9, 2023

Ideally we should have these predefined in JavaCC, so any other grammar wanting to use them will benefit. If you want to contribute that to JavaCC, please go ahead. I will keep it like this for now.

@kaikalur commented Jan 9, 2023

In the grammar, I had:

| <#identifier_extend: ["\u00B7", "0"-"9", "_"]> // temp
//!! See the Syntax Rules.

One more todo lol. I will add it today. That will give you 'Mn' (I also noticed that for Telugu, having another vowel gives a syntax error).

@manticore-projects (Author)

Agreed. We can close this issue.

@manticore-projects (Author) commented Jan 9, 2023

> That will give you 'Mn'

I was good with 7-bit ASCII, but my wife stood right behind me!

> (I also noticed that for Telugu - having another vowel gives a syntax error)

While you are at it: your OUTPUT file has a spelling error (sorry for being pedantic).


@kaikalur commented Jan 9, 2023

> Agreed. We can close this issue.

Yeah, like my PR title said - it's the initial implementation. Let me know if you can take it over and fix it up, using the general principle of going from the Unicode spec.

@manticore-projects (Author)

Yes, if SH/Bash is acceptable.
No, if it needs to be an AWK script -- that's beyond my pay grade :)

@kaikalur commented Jan 9, 2023

> Yes, if SH/Bash is acceptable. No, if it needs to be an AWK script -- that's beyond my pay grade :)

Sure, as long as it works and we can verify that it does, we are good! Let's add a test generated from UnicodeData.txt - one test for each allowed char (and also a negative test to make sure we did not add anything extra).

@kaikalur commented Jan 9, 2023

OK, added the identifier-extend stuff, and now it works for vowel additions in Telugu. So it should work for Thai as well. Check it out. The PR is now clean: I removed my awk script, and it just uses the full list. You can add your shell script separately. I will keep the reference grammar clean.

@kaikalur commented Feb 7, 2023

FYI - the spec does NOT allow underscore "_" as part of an identifier lol - learnt that the hard way today.

@manticore-projects (Author) commented Mar 7, 2023

Greetings! From my other project I have learned that the CJK block needs to be added explicitly:

// CJK Unified Ideographs block according to https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
| <#CJK: ["\u4E00"-"\u62FF", "\u6300"-"\u77FF", "\u7800"-"\u8CFF", "\u8D00"-"\u9FCC"]>
