# Improve Arabic support (include positional forms and ligatures) #16
Can we get a minimum list of unencoded glyphs for an Arabic language from Shaperglot checks? To understand this issue better, and whether Shaperglot can be useful in solving it, I performed the following test: open GF_Arabic_Core.glyphs in Glyphs, then immediately export a binary .ttf to test in Shaperglot. There's no need to draw any glyphs, as Shaperglot doesn't care about outlines.
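For reference, the check can be run from Python as well as the CLI. This is a sketch using Shaperglot's Checker and Languages classes as documented in its README; the font filename is a placeholder, and the exact reporting attributes may differ between Shaperglot versions:

```python
from shaperglot.checker import Checker
from shaperglot.languages import Languages

checker = Checker("GFArabicCore-Regular.ttf")  # placeholder path to the exported binary
langs = Languages()

results = checker.check(langs["arb"])  # "arb" = Standard Arabic
if results.is_success:
    print("Font supports Standard Arabic")
else:
    for fail in results.fails:
        print(fail)
```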
That looks promising. Next, use some regex search/replace and a few manual edits to translate those errors into nice glyph names that Glyphs will recognize, like these:
Add them all to GF_Arabic_Core.glyphs, then Font Info > Features > Update to generate all the OpenType features. Export a new binary and run it through Shaperglot again:
Success! This means Shaperglot can indeed be used to infer a minimum glyph set for an Arabic language.
I'm glad you found this - it's the perfect example of why you should not use glyph names to detect language support, and why the shaperglot approach (looking at font behaviour) is needed instead. Suppose I have a font which decomposes the diacritics such that…
Thanks, that's also a perfect example of the type of problem I suspected I would run into, but hadn't been able to identify so far due to my limited experience with Arabic and shaping in general. So, the distinction I should make is: it would be relatively easy for TalkingLeaves to tell the user, "here's a glyph set that will work for the Arabic language you selected." But it would be much harder to tell the user, "given your current glyph set, you could add these additional glyphs in order to support the Arabic language you selected." Do you think the latter is even possible? If not, I'll look for another approach.

The first purpose of TalkingLeaves is to tell the user which encoded characters are definitely needed for a given language. That was a relatively easy problem to solve. The second purpose is to inform the user when other things are needed (or might be needed) to support a given language. If that means leaving the advice open-ended and pointing them to a book or an online resource for Arabic rather than recommending specific glyphs, that's fine, if it's the best solution.
To be honest, I think the entire concept is cursed. Type designers, particularly those who have been spoiled by Glyphs, confuse "adding glyphs" with "adding support". For a lot of scripts, there are many things you need to do to a font to add language support; adding glyphs is just "level one" of them. (Even in Latin this is true. Thinking holistically about the font is important: is…) This is why I am not a fan of tools which tell designers which glyphs or codepoints they need to add to their font to make languages work - at worst they encourage in-fill-ism ("10 more glyphs and I've made a Sindhi font!"), but even at best they can mislead people into thinking they have added language support when in fact they haven't. In short, a glyph-based approach isn't enough. You have to think about font behaviour.
You've given me a lot to think about. I think I need to start by changing the communication in the UI to make it clearer that character sets are only part of what you need to support many languages. I don't expect I'll be able to build a magic tool that sorts out everything you need to support a given language, but I'd be extremely happy if I could get to the point where TalkingLeaves visually highlights any languages that have known requirements beyond codepoints, and wherever possible gives the user some general notes about what they need to look at. Hyperglot has notes on some languages, like explaining the two Eng forms for Sami and African languages, and Shaperglot's checks can be used to flag things like anchors that might be required in a given language.
For Arabic positional forms, you have most (?) of them encoded as Unicode codepoints anyway. You can also rely on Unicode joining types (Hyperglot includes the Unicode data and has a convenience method to access it), so you know which Arabic letters need init/medi/fina forms. In Hyperglot we check this in the font checks, where we test actual shaping akin to Shaperglot, e.g. from here on. In Glyphs you don't have a compiled font to check actual shaping for a string, but you can get all the encoded positional forms' Unicodes, and you have access to GlyphData to get the glyph names of all positional forms, to check whether they are in the font.
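To illustrate the joining-type route (a sketch, not Hyperglot's own convenience method: this assumes the youseedee package, which Shaperglot uses to read Unicode Character Database files, and that it exposes the Joining_Type field from ArabicShaping.txt):

```python
from youseedee import ucd_data

def needed_positional_forms(char):
    # Joining_Type comes from Unicode's ArabicShaping.txt:
    # "D" = dual-joining, "R" = right-joining, others don't join.
    joining_type = ucd_data(ord(char)).get("Joining_Type", "U")
    if joining_type == "D":
        return [".init", ".medi", ".fina"]
    if joining_type == "R":
        return [".fina"]
    return []

print(needed_positional_forms("\u0628"))  # BEH: ['.init', '.medi', '.fina']
print(needed_positional_forms("\u0627"))  # ALEF: ['.fina']
```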
Nope. Legacy presentation forms aren't always encoded in Arabic fonts. Noto Nastaliq Urdu, for example, doesn't bother, nor does SIL Scheherazade.
You don’t need to assign those Unicode code points in your font. But their being in Unicode can act as a repository of important glyphs (encoded as characters) and their joining behavior. I don’t know how complete or reliable that repository is, though.
Ah yes, indeed, Glyphs adds them just with a name but without a Unicode, so most likely the Unicodes are not set for those in the Glyphs file. I also assumed GlyphInfo would have a better mapping back to the root of a positional form, but that too seems a bit tricky. It probably needs matching on the glyph name suffix, or on x.glyphInfo.desc (I think that matches Python's unicodedata.name(), minus the positional text, e.g. "ARABIC LETTER BEH" vs "ARABIC LETTER BEH MEDIAL FORM").
A naive example of how to get the Arabic positional forms (assuming the default GlyphsData.xml with those specific suffixes):
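(The original snippet is not reproduced here; the following is a rough reconstruction of the idea: a sketch that matches Glyphs' default positional suffixes against the glyph names in the open font. `Font` is the frontmost font in the Glyphs macro environment.)

```python
# Sketch: collect positional forms present in the current Glyphs font,
# keyed by base glyph name, by matching the default naming suffixes.
POSITIONAL_SUFFIXES = ('.init', '.medi', '.fina', '.isol')

def positional_forms_in_font(font):
    forms = {}
    for glyph in font.glyphs:
        for suffix in POSITIONAL_SUFFIXES:
            if glyph.name.endswith(suffix):
                base = glyph.name[:-len(suffix)]
                forms.setdefault(base, []).append(suffix)
    return forms

print(positional_forms_in_font(Font))
```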
Of course you could also just ignore any finessed directionality check, simply brute-force it, and see whether the suffixed glyph names resolve to real GlyphData entries at all.
Now any font which doesn't follow the Glyphs naming convention (again, Noto Nastaliq Urdu) "doesn't support Arabic". When I say the whole thing is cursed, I mean it. You should not try doing it this way.
Point taken. I suppose the only avenue to at least some kind of certainty would be to compile a font on the fly with the features and glyph names from the file, then check whether it actually shapes anything for given sequences.
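Once a binary exists, the shaping check itself is straightforward with uharfbuzz. A minimal sketch (the font path is a placeholder, and BEH is just a sample dual-joining letter): shape the letter alone and between joiners, and compare the resulting glyph IDs.

```python
import uharfbuzz as hb

def shaped_glyphs(font, text):
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # sets script, direction, language
    hb.shape(font, buf)
    return [info.codepoint for info in buf.glyph_infos]  # glyph IDs after shaping

blob = hb.Blob.from_file_path("GFArabicCore-Regular.ttf")  # placeholder path
font = hb.Font(hb.Face(blob))

beh = "\u0628"  # ARABIC LETTER BEH
zwj = "\u200D"  # ZERO WIDTH JOINER, forces a joining context
isolated = shaped_glyphs(font, beh)
medial = shaped_glyphs(font, zwj + beh + zwj)

# If the font really shapes Arabic, the medial context must yield a
# different glyph than the isolated one.
print("medial form present:", isolated != medial)
```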
@simoncozens makes an extremely important point: checking language support is really independent of any glyph naming conventions. But TalkingLeaves is not meant to be a tool for checking language support, and I'd like to make that clearer to the user somehow. I think a small improvement would be to change these checkbox labels to "Show complete" and "Show incomplete". Then, as @kontur suggested, I might eventually add a feature that lets the user run a language support check via Hyperglot or Shaperglot on their last exported font file. This would help solidify the intended workflow: use TalkingLeaves to build and expand your glyph set, then run a check to confirm that your font indeed supports your target languages.

@kontur I couldn't get your example code using joining types to produce a font that passes Hyperglot or Shaperglot checks, but your second idea of brute-forcing works beautifully with some minor modifications. Comparing the two GlyphData indexes (for the bare name and for the suffixed name) does the trick:

```python
# Run inside the Glyphs macro environment, where Glyphs, GSGlyph and
# Font are predefined. Requires the hyperglot package (import paths
# are for recent Hyperglot versions).
from hyperglot.language import Language
from hyperglot.orthography import Orthography

def get_required_glyphs(char):
    info = Glyphs.glyphInfoForUnicode(ord(char))
    required = [info.name]
    positions = ['.init', '.medi', '.fina', '.isol']
    for pos in positions:
        # If GlyphData knows the suffixed name, its index differs from
        # the bare name's index, so the positional form is required.
        if Glyphs.glyphInfoForName(info.name).index != Glyphs.glyphInfoForName(info.name + pos).index:
            required.append(info.name + pos)
    return required

arabic_ort = Orthography(Language("arb").get_orthography())
arabic_chars = arabic_ort.base_chars + arabic_ort.base_marks

required = []
for char in arabic_chars:
    required.extend(get_required_glyphs(char))

for name in required:
    glyph = GSGlyph(name)
    Font.glyphs.append(glyph)
```

The above script adds all the glyphs needed to pass Hyperglot's check for Standard Arabic (arb). Shaperglot's character set for Standard Arabic is a little different, so if I add those missing codepoints and run the script on them, the resulting font passes Shaperglot's check for Standard Arabic as well.

TL;DR: we now have a way to generate a set of glyph names, to be used in Glyphs, that will result in an Arabic font that passes Hyperglot/Shaperglot checks.
Re: @simoncozens’ earlier comments, c681157 adds a small section, "What does it mean to 'support' a language?", near the top of README.md, plus some other small changes to help inform users that TalkingLeaves is not a tool for checking language support, and that it currently deals only with character sets, which are just one piece of adding language support.
There are a few tricky issues to overcome for TalkingLeaves to fully support Arabic. I do not have much background knowledge of the Arabic script, so any help/comments/feedback/corrections will be greatly appreciated.
## Hyperglot defines character sets, not glyph sets
Hyperglot is the core dataset used by TalkingLeaves to define required Unicode characters for any given language. It does not currently define unencoded "alternate" glyphs that are required by many languages, especially in complex scripts such as Arabic.
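To illustrate, pulling a language's character set out of Hyperglot takes only a few lines. This is a sketch using the same Language/Orthography API as the script earlier in the thread (import paths are for recent Hyperglot versions):

```python
from hyperglot.language import Language
from hyperglot.orthography import Orthography

# Standard Arabic ("arb"): Hyperglot lists encoded characters only.
ort = Orthography(Language("arb").get_orthography())
print(ort.base_chars)  # base letters (encoded codepoints)
print(ort.base_marks)  # combining marks
# Unencoded positional forms and ligatures are not part of this data.
```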
But, as I discovered recently, Unicode includes many Arabic ligatures and positional forms as "compatibility characters". These are in fact included in Hyperglot's Arabic language definitions, but I have no idea how complete they are, since there must be many possible ligatures and positional forms that aren't in Unicode but are needed for some languages. When these characters are added to a font in Glyphs, Glyphs strips the codepoint and gives them a "nice name" following the naming conventions from which the necessary feature code will be auto-generated.
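For example, the compatibility decompositions in the Unicode Character Database record both the base letter and the positional form, which Python's unicodedata module can read directly (a small illustration; U+0628 BEH is just an arbitrary sample letter):

```python
import unicodedata

# List the Arabic Presentation Forms whose compatibility decomposition
# points back to U+0628 ARABIC LETTER BEH.
for cp in range(0xFB50, 0xFF00):
    decomposition = unicodedata.decomposition(chr(cp))
    if decomposition.startswith('<') and '0628' in decomposition.split():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {decomposition}")

# Expected output includes:
# U+FE8F ARABIC LETTER BEH ISOLATED FORM: <isolated> 0628
# U+FE90 ARABIC LETTER BEH FINAL FORM: <final> 0628
# U+FE91 ARABIC LETTER BEH INITIAL FORM: <initial> 0628
# U+FE92 ARABIC LETTER BEH MEDIAL FORM: <medial> 0628
```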
## Are there minimum glyph sets for Arabic languages?
Is it even possible to define a minimum glyph set (including alternates) for any Arabic language? Or does it vary entirely with the type designer's preferences and the project they're working on? Since Arabic relies heavily on shaping and OpenType programming, there must be many ways to define an Arabic glyph set. But I think it might still be possible to define a "recommended" glyph set for an Arabic language that covers all the minimum requirements and also allows Glyphs to auto-generate the OpenType feature code.
## What would a glyph set definition for an Arabic language look like?
Hopefully, just a list of glyph names following Glyphs naming conventions, with underscore (_) separators indicating ligatures, and .init, .medi, .fina, and .isol suffixes indicating positional forms (for example, lam_alef-ar for the lam-alef ligature, or beh-ar.init for an initial beh).

## Where will the glyph set definitions come from?
I don't know of any data sources that clearly define the required glyphs for Arabic languages. But I haven't done much searching yet, as I'm only just beginning to wrap my head around how the Arabic script works. I'll look for more info on possible data sources that could help define Arabic glyph sets.
Shaperglot has some shaping checks, which essentially feed an input string and the font through a shaping engine (HarfBuzz) and look at what changes in the output string. So, if Shaperglot requires certain characters to be substituted for certain Arabic languages, then it might be possible to infer which positional forms or ligatures are needed in order to pass those checks. I'm in the process of digging into this.
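To make that concrete, here is a hypothetical sketch of such an inference loop. It builds on the shaped_glyphs() uharfbuzz helper sketched earlier in the thread and is not Shaperglot's actual implementation; it just applies the same idea: force each joining context with ZERO WIDTH JOINER and see whether the glyph changes.

```python
# Sketch: infer which positional forms a font actually implements,
# reusing shaped_glyphs() and font from the earlier uharfbuzz example.
ZWJ = "\u200D"
CONTEXTS = {
    ".init": lambda c: c + ZWJ,
    ".medi": lambda c: ZWJ + c + ZWJ,
    ".fina": lambda c: ZWJ + c,
}

def implemented_forms(font, char):
    isolated = set(shaped_glyphs(font, char))
    forms = []
    for name, make_context in CONTEXTS.items():
        in_context = set(shaped_glyphs(font, make_context(char)))
        # If the isolated glyph still appears in the joined context,
        # no substitution happened, so the form is missing.
        if not isolated & in_context:
            forms.append(name)
    return forms

# e.g. implemented_forms(font, "\u0628") -> ['.init', '.medi', '.fina']
```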