NBSP for other French names? #763
Comments
I think the reason I didn't do that is that le and la are far too common in regular Spanish prose, so there would be a lot of false positives in Spanish quotes. But I don't remember whether I checked the corpus to see if my hunch was correct.
OK, I do remember you saying that now. I’ll do some tests on the corpus and see what turns out. Regardless of typogrify, if we have instances in a production, should we manually add the NBSP? In this case, I have dozens of les Touches and le Croisic. Or not worth the bother?
You could, but I don't think it's worth a rule, it would essentially be overlooked by everyone.
Oh, definitely, I was asking for permission, not a rule. :)
Back to the typogrify code. I assume you know that there are almost 1500 instances of "De [A-Z]…" in the corpus with a plain space, i.e. it doesn't appear that the new typogrify has been run across the corpus.
The first test I ran was just for the lowercase de:
for i in $(ls -d ./*) ; do gc -o " de ([A-Z][a-z]+?\b)" $i/src/epub/text/*.xhtml ; done
It gets roughly 20K hits. Obviously not all of them are standalone, e.g. Lord George de Bruce Carruthers from Trollope, but I'm assuming we would want those to stick together as well?
I then ran a similar test for le/la/les:
for i in $(ls -d ./*) ; do gc -o " (la|le|les) ([A-Z][a-z]+?\b)" $i/src/epub/text/*.xhtml ; done
That gets another 5600+ hits. A few of those are in Shakespeare, in personas; maybe we don't want to change those, maybe it doesn't matter. If we don't want to change them, determining whether they sit inside a tag might be harder, since they would typically have at least a first name before the particle.
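For anyone who wants to reproduce this kind of tally without the shell loop, here is a minimal Python sketch of the same search. It is only an illustration: the function name `count_particle_names` and the sample sentence are made up for this example, and the pattern is a slightly simplified version of the ones above (ASCII letters only, so accented names are missed).

```python
import re
from collections import Counter

# A lowercase particle preceded by a space, followed by a capitalized word.
# Hypothetical pattern for illustration; ASCII-only, like the grep above.
PARTICLE_NAME = re.compile(r"(?<= )(de|la|le|les) ([A-Z][a-z]+)\b")

def count_particle_names(text: str) -> Counter:
    """Count 'particle Name' pairs such as 'de Bruce' or 'le Croisic'."""
    return Counter(
        f"{m.group(1)} {m.group(2)}" for m in PARTICLE_NAME.finditer(text)
    )

sample = (
    "Lord George de Bruce Carruthers met les Touches near le Croisic, "
    "then saw le Croisic again."
)
counts = count_particle_names(sample)
for pair, n in counts.most_common():
    print(n, pair)
```

Grouping hits by the exact pair makes it easier to see which names dominate the count before deciding whether any of them deserve manual attention.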
Yes, when I added that rule it was too much work to go back and review all those. Typogrify can't be safely run on a finished ebook because it might (rarely) do the wrong thing. The thing with updating the corpus for things that are not automated (like le/la/les) is that we can update the corpus manually today, but over time it will just drift back, because new ebooks aren't having an automated rule applied. So I question the usefulness of doing it: why go through the effort if it will just drift away again? Then someone in the future will question why old ebooks have an nbsp there but new ones don't. You can certainly update the corpus for De, because that's an automated rule. But you would have to check each instance to make sure we're not in Spanish or a Latin title (like the book
Sorry, two things are being conflated here, probably my fault.
Why wouldn't we want to keep them together with what follows, even if they're in a Spanish/Latin/etc. title? By "title" do you mean that literally, i.e. within a Re the instances of
No, I mean that something like
We don't want to link together words in that case in the same way we wouldn't link "Of" with nbsp. Ideally everything would be tagged correctly in the corpus, and we could ignore things with xpath or regex; in practice, there are occasional gaps.
Typogrify was updated last year to put an NBSP between De and a name. Do we want to try to do that for any other similar names, e.g. [Ll]a Fontaine, or [Ll]e Clerc, etc.? I'm proofing the next Balzac and just ran across a "le Blah" that was split across a line. (I believe I remember asking about lowercase de before, but I had no luck trying to search for that in my email.)
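To make the idea concrete, here is a rough Python sketch of what a "De + name" rule could look like. This is not the actual typogrify code, just an assumption about its general shape; `join_de_to_name` and `DE_NAME` are hypothetical names.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# Capitalized "De" followed by a space and a capitalized word.
# The lookahead leaves the name itself untouched.
DE_NAME = re.compile(r"\b(De) (?=[A-Z][a-z])")

def join_de_to_name(text: str) -> str:
    """Replace the space after 'De' with a no-break space before a name."""
    return DE_NAME.sub(lambda m: m.group(1) + NBSP, text)

print(repr(join_de_to_name("De Marsay spoke to De Bruce.")))
```

A pattern like this cannot tell a surname from a capitalized Latin or Spanish title that starts with De, so hits outside French names would still need checking by hand.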