Dutch hyphenation #46

Open
bertfrees opened this issue May 18, 2015 · 8 comments

@bertfrees
Member

See snaekobbi/issues#2 for the various options for implementing a hyphenator.

@bertfrees
Member Author

Maybe a useful tip from CBB (Christelijke Bibliotheek voor Blinden en Slechtzienden): they use a version of hyph_nl_NL.dic from OpenTaal.

@dkager

dkager commented May 21, 2015

The OpenTaal data sounds promising. Will look at this next week and maybe you can fill me in on the best way to implement this in mod-braille (from what I read there is OpenOffice data available for this dict).

@dkager

dkager commented May 28, 2015

I'm guessing this is the hyphenation dictionary from OpenTaal.org that CBB is using. Maybe I can use the same approach as in snaekobbi/issues#2 for this?
I don't have test data yet, so integrating the dictionary into mod-braille could be done first.

@bertfrees
Member Author

The dictionary you linked is the one that is already included in Pipeline. I think CBB was maybe referring to an updated version. We'd have to ask them.

We need test data before we can do anything else. Then, if you need to modify the dictionary, it's best you copy the file to a new project (like Jukka did with pipeline-mod-celia) because the dictionary from LibreOffice is downloaded and packaged automatically.

@dkager

dkager commented May 28, 2015

I believe the OpenTaal data dates from 2011, but I'll see if I can confirm this with someone from CBB.
What sort of test data are you looking for?

@bertfrees
Member Author

Hyphenated words, I guess. I understand you may not have that kind of data just lying around. But if there's nothing to test, then our job is done and we just take what's currently available. I think at the minimum we should have a small test, if only so we can easily add more to it later. Jukka's test data is also very limited, but it's easy to add more. He did it in pipeline-mod-celia because that's where his dictionary lives, but we could have your tests in functional-testing.

@dkager

dkager commented Jun 2, 2015

So if I understand this correctly, we have:

  • The hyphenation dict in mod-braille.
  • Generic code to use this dict also in mod-braille.

And we need:

  • Test data in functional-testing.

For Finnish the test data is in the JUnit test case. I could clone this into another module, but I think it would be a bit nicer to have something similar to liblouis' harness tests for this, i.e. experts only worry about JSON or some other format, and the JUnit tests pull these in and run them.

Also, which of the three libs (Libhyphen, Hyphenator, TexHyphenator) should we use?

@bertfrees
Member Author

I suggest we use XML instead of JSON. Something like this. If everybody includes test data in that format in the functional-testing repo, then I can have one test (JUnit or XSpec) that runs them all. Of course, from the developer's point of view it is nice to have the tests closer to the implementation, but since you don't intend to modify the dictionary for the time being, that's not a problem. Later we can still copy or move the test to its own module.
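
To make the data-driven idea concrete, here is a minimal sketch. It is only an illustration: the XML layout (the `hyphenation-tests` and `test` elements and the `input`/`expected` attributes) is a hypothetical format, not one agreed on in this thread, and `hyphenate_stub` is a stand-in lookup table rather than a real hyphenator. The sketch is in Python for brevity; in the Pipeline the single runner would be a JUnit or XSpec test.

```python
import xml.etree.ElementTree as ET

# Hypothetical test-data layout; element and attribute names are assumptions.
TEST_DATA = """\
<hyphenation-tests xml:lang="nl">
  <test input="water" expected="wa-ter"/>
  <test input="woordenboek" expected="woor-den-boek"/>
  <test input="fiets" expected="fiets"/>
</hyphenation-tests>
"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def load_cases(xml_text):
    """Return (language, [(input, expected), ...]) from a test-data document."""
    root = ET.fromstring(xml_text)
    cases = [(t.get("input"), t.get("expected")) for t in root.iter("test")]
    return root.get(XML_LANG), cases


def run_cases(hyphenate, cases):
    """Run every case through `hyphenate`; return (word, expected, actual) failures."""
    failures = []
    for word, expected in cases:
        actual = hyphenate(word)
        if actual != expected:
            failures.append((word, expected, actual))
    return failures


# Stand-in for a real hyphenator: a lookup table, for illustration only.
_STUB = {"water": "wa-ter", "woordenboek": "woor-den-boek"}


def hyphenate_stub(word):
    return _STUB.get(word, word)
```

With this shape, language experts would only edit the XML file, and one generic runner would call `load_cases` and `run_cases` for every file in the repo.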

Which of the libraries we use is not so important, I think. What I've done for Finnish is convert the patterns into several formats at build time, so that several implementations become available in DP2. As long as all implementations behave the same (which they should in theory, and we can easily test each of them with the same test data), we don't have to worry about which one is actually used.
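
Checking that all implementations behave the same could look roughly like the sketch below. Everything in it is an assumption for illustration: `impl_a` and `impl_b` are stand-in callables, not bindings to Libhyphen, Hyphenator or TexHyphenator, and the labels are arbitrary.

```python
def compare_implementations(implementations, cases):
    """Run the same cases through every implementation.

    `implementations` maps a label to a callable (word -> hyphenated word).
    Returns {label: [(word, expected, actual), ...]} for failing labels only.
    """
    report = {}
    for label, hyphenate in implementations.items():
        failures = []
        for word, expected in cases:
            actual = hyphenate(word)
            if actual != expected:
                failures.append((word, expected, actual))
        if failures:
            report[label] = failures
    return report


# Shared test data: (input, expected) pairs.
CASES = [("water", "wa-ter"), ("fiets", "fiets")]


# Stand-in implementations, for illustration only.
def impl_a(word):
    return {"water": "wa-ter"}.get(word, word)


def impl_b(word):
    return word  # deliberately never hyphenates, to show a reported failure


report = compare_implementations({"impl-a": impl_a, "impl-b": impl_b}, CASES)
```

An empty report would mean every implementation agrees with the expected output on the shared test data; here only `impl-b` is reported.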
