NBSP for other French names? #763

vr8hub · 2024-11-02T01:45:19Z

Typogrify was updated last year to put a NBSP between De and a name. Do we want to try to do that for any other similar names, e.g. [Ll]a Fontaine, or [Ll]e Clerc, etc.? I'm proofing the next Balzac and just ran across a "le Blah" that was split across a line. (I believe I remember asking about lowercase de before, but I had no luck trying to search for that in my email.)

acabal · 2024-11-02T15:59:45Z

I think the reason I didn't do that is because le and la are way too common in regular Spanish prose and there would be a lot of false positives in Spanish quotes. But I don't remember if I checked the corpus or not to see if my hunch was correct.
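
To make the false-positive concern concrete, here is a minimal sketch of what a le/la/les rule might look like. The pattern and the `join_particles` name are hypothetical illustrations, not the actual typogrify code, and the character classes are deliberately naive (ASCII only).

```python
import re

NBSP = "\u00a0"

# Hypothetical pattern (NOT the real typogrify regex): a lowercase particle
# followed by a single space and a capitalized ASCII word.
PARTICLE_RE = re.compile(r"\b(la|le|les) (?=[A-Z][a-z]+\b)")


def join_particles(text: str) -> str:
    """Replace the space after a matched particle with a no-break space."""
    return PARTICLE_RE.sub(lambda m: m.group(1) + NBSP, text)


if __name__ == "__main__":
    # Intended case: a French place name stays on one line.
    print(repr(join_particles("They sailed from le Croisic to les Touches.")))
    # Unintended case: ordinary Spanish prose gets joined as well.
    print(repr(join_particles("Rezaba a la Virgen cada noche.")))
```

The second call shows the worry: the pattern has no way to tell a name particle from an ordinary article in a foreign-language quote.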

vr8hub · 2024-11-02T17:18:15Z

OK, I do remember you saying that now. I’ll do some tests on the corpus and see what turns out.

Regardless of typogrify, if we have instances in a production, should we manually add the NBSP? In this case, I have dozens of les Touches and le Croisic. Or not worth the bother?

acabal · 2024-11-03T02:31:49Z

You could, but I don't think it's worth a rule; it would essentially be overlooked by everyone.

vr8hub · 2024-11-04T21:32:34Z

Oh, definitely, I was asking for permission, not a rule. :)

Back to the typogrify code. I assume you know that there are almost 1500 instances of "De [A-Z]…" in the corpus with a space, i.e. it doesn't appear that the new typogrify has been run across the corpus.

The first test I ran was just for the lowercase de:

for i in ./* ; do gc -o " de ([A-Z][a-z]+?\b)" "$i"/src/epub/text/*.xhtml ; done

It gets about 20K hits. Obviously not all of them are standalone, e.g. Lord George de Bruce Carruthers from Trollope, but I'm assuming we would want those to stick together as well?

I then ran a similar test for le/la/les:

for i in ./* ; do gc -o " (la|le|les) ([A-Z][a-z]+?\b)" "$i"/src/epub/text/*.xhtml ; done

That gets another 5600+ hits. A few of those are in Shakespeare in personas; maybe we don't want to change those, maybe it doesn't matter. If we don't want to change them, determining whether they exist in a tag might be harder, since they would typically have at least a first name before the particle.
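
For anyone who wants to repeat the count without the shell loop, the same tally can be sketched in Python. This is an assumption-laden sketch: the function names are hypothetical, the directory layout is taken from the commands above, and the regex is the same naive one used in those commands.

```python
import re
from collections import Counter
from pathlib import Path

# Same idea as the shell tests above: a leading space, a lowercase particle,
# a space, then a capitalized ASCII word.
PARTICLE_RE = re.compile(r" (de|la|le|les) ([A-Z][a-z]+)\b")


def count_particles(text: str) -> Counter:
    """Tally matches per particle in one chunk of text."""
    return Counter(m.group(1) for m in PARTICLE_RE.finditer(text))


def count_corpus(root: Path) -> Counter:
    """Sum the tallies over every ebook directory under `root`.

    Assumes each ebook keeps its text in src/epub/text/*.xhtml,
    as in the shell commands above.
    """
    total: Counter = Counter()
    for path in root.glob("*/src/epub/text/*.xhtml"):
        total += count_particles(path.read_text(encoding="utf-8"))
    return total


if __name__ == "__main__":
    sample = "Lord George de Bruce Carruthers met la Fontaine at les Touches."
    print(count_particles(sample))
```

Splitting the tally by particle makes it easier to see whether de, la, le, or les contributes most of the hits.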

acabal · 2024-11-04T21:45:13Z

Yes, when I added that rule it was too much work to go back and review all those. Typogrify can't be safely run on a finished ebook because it might, rarely, do the wrong thing.

The thing with updating the corpus for things that are not automated (like le/la/les) is that we can update the corpus manually today, but over time it will just drift back because new ebooks aren't having an automated rule applied. So I question the usefulness of doing it: why go through the effort if it will just drift away again? Then someone in the future will question why old ebooks have an nbsp there but new ones don't.

You can certainly update the corpus for De, because that's an automated rule. But you would have to check each instance to make sure we're not in Spanish or in a Latin title (like the book De Re Publica), or maybe in other Romance languages.

vr8hub · 2024-11-04T22:18:47Z

Sorry, two things are being conflated here, probably my fault.

Regardless of what decision ultimately gets made about typogrify, I asked if I could manually change the le Croisic and les Touches in my current production. That was a yes. That's finished, and I have nothing else to say about that. :)
Everything else is about typogrify. So my tests on lowercase de and lowercase la/le/les above were to see whether, judging from the corpus, we would want to update typogrify to also put a NBSP on them—not updating the corpus, but updating typogrify.

Why wouldn't we want to keep them together with what follows, even if they're in a Spanish/Latin/etc. title? By "title" do you mean that literally, i.e. within a <title> tag? If so, the existing typogrify regex already does that with the ([^>]) at the beginning of the regex. If not, why wouldn't we put the NBSP in a book (or whatever) tag? It doesn't hurt anything, and it keeps the name from breaking just like the ones not in a book tag.

Re the instances of de|la|le|les [A-Z][a-z]+ appearing in a language tag (xml:lang=".*?"), that happens 1300+ times with de and another 775 with la|le|les. If we want to exclude those from typogrify, then that will require a fancier regex; I'll have to research that a bit. Or, typogrify could update the ones in a tag to something known, update everything else, then update the ones in a tag back to their original state.
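
The "swap out, update, swap back" idea could be sketched roughly as follows. Everything here is an assumption for illustration: `join_outside_lang` and both patterns are hypothetical, not typogrify code, and the element matcher is deliberately simplified (no nested elements of the same name, double-quoted attributes only, and the root html element is skipped since it normally carries its own xml:lang).

```python
import re

NBSP = "\u00a0"

# Hypothetical particle rule: lowercase particle + space + capitalized word.
PARTICLE_RE = re.compile(r"\b(de|la|le|les) (?=[A-Z][a-z]+\b)")

# Simplified matcher for one element that carries xml:lang.
LANG_ELEMENT_RE = re.compile(
    r'<(?!html\b)(\w+)\b[^>]*\bxml:lang="[^"]*"[^>]*>.*?</\1>',
    re.DOTALL,
)


def join_outside_lang(xhtml: str) -> str:
    """Insert NBSPs after particles, leaving xml:lang elements untouched."""
    stash: list[str] = []

    def protect(match: re.Match) -> str:
        # Swap the element for a NUL-delimited index; NUL cannot occur in XML.
        stash.append(match.group(0))
        return f"\x00{len(stash) - 1}\x00"

    protected = LANG_ELEMENT_RE.sub(protect, xhtml)
    joined = PARTICLE_RE.sub(lambda m: m.group(1) + NBSP, protected)
    # Put the protected elements back exactly as they were.
    return re.sub(r"\x00(\d+)\x00", lambda m: stash[int(m.group(1))], joined)


if __name__ == "__main__":
    src = '<p>He met la Fontaine, who said <i xml:lang="es">la Virgen</i> in le Croisic.</p>'
    print(repr(join_outside_lang(src)))
```

This avoids a single "fancier" regex at the cost of two extra passes, and it only helps where the foreign-language text is actually tagged.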

acabal · 2024-11-05T03:44:23Z

> Why wouldn't we want to keep them together with what follows, even if they're in a Spanish/Latin/etc. title? By "title" do you mean that literally, i.e. within a <title> tag?

No, I mean that something like De Re Rustica is the actual title of a real book in Latin, which could be (and is) referenced in body text.

De ("Of") is common in Latin book titles that are often referenced in places like nonfiction endnotes, e.g. De Agricultura, Carmen de Moribus, etc.

We don't want to link together words in that case in the same way we wouldn't link "Of" with nbsp.

Ideally everything would be tagged correctly in the corpus, and we could ignore things with xpath or regex; in practice, there are occasional gaps.
