semanticate: Work on the XML elements tree #753

gvtulder · 2024-09-23T17:59:27Z

Rough example (see #752) of how the semanticate rules could be applied to the content only by walking recursively through the XML tree.
Not sure if this is a good idea, so this is a fairly quick proof-of-concept.

The tricky bit is that the regular expressions might introduce new XML elements. In this version, this works by applying the regex to text string (which doesn't contain any tags), parsing the regex output to create any new XML elements, and finally inserting those in the document at the right place.

acabal · 2024-09-23T18:38:55Z

I understand the reasoning for this PR, but I think regex is acceptable for this use case 95% of the time. It's much simpler to understand and - importantly - maintain, and with full XML parsing there will likely be complex edge cases that we will have to custom code around. Also any errors introduced by a regex are easily corrected using git diff - it never results in big catastrophes.

If you want to play around with it further go ahead, but I think you'll find that this will become very unwieldy very fast. If you're able to put together a complete solution that is 100% correct vs. our current implementation's 95% correct, and it doesn't look like a maintenance hellscape, then we can consider it.

alerque · 2024-09-23T19:58:01Z

Sure 95% of the time a regex will get you there, but your existing regexes are needing to be stuffed with lookaheads and other features to compensate for the lack of context. Flatly XML cannot be parsed with regexes. On the other hand walking XML trees using something built for the job is pretty straight forward and should be much easier to maintain and contribute to in the long run.

acabal · 2024-09-23T20:01:34Z

I understand. We do not need lookaheads in the general case (or, strictly speaking, in this case). That is because the occasions in which the regexes without lookaheads are wrong, given our narrow use cases, are rare, and undoing the mistake one of these regex makes is simple - an error is never catastrophic. Remember that these tools are not general purpose XML processors, they are for the very narrow use case of ebooks in the English language that are made with our style guide.

On the other hand, parsing an entire syntax tree and dealing with the hundreds of edge cases of the various English language constructs we must examine is extremely complex, error prone, and most importantly, difficult to maintain. It will also likely be quite slow because the tree must be walked and updated for almost every check requiring even the most basic of logic.

I understand the desire for intellectual purity and formal correctness, but using regex is easier to develop and maintain as a practical matter. Of course I'm happy to be proved wrong!

gvtulder · 2024-09-23T21:02:54Z

I agree with the point that regexes are simpler, and I don't know where the sweet spot lies. However, a more XML-based approach could still use a lot of regexes, the main difference is choosing where they are applied.

In my current implementation, the regex rules are very similar to what they are in the current implementation. I copy-pasted them from the old function and made only some minor changes:

tools/se/formatting.py

Lines 176 to 180 in b4e9357

    
           sub(r"(?<!(?:\.|\B))(P\.(?:P\.)?S\.(?:S\.)?\B)", r"""<abbr epub:type="z3998:initialism">\1</abbr>""") 
        
           sub(r"(?<!(?:\.|\B))inst\.", r"""<abbr xml:lang="la">inst.</abbr>""") # `inst.` is short for `instante mense` but it is not italicized 
        
           sub(r"(?<!(?:\.|\B))([Ii])\.e\.", r"""<abbr epub:type="z3998:initialism">\1.e.</abbr>""") 
        
           sub(r"(?<!(?:\.|\B))([Ee])\.g\.", r"""<abbr epub:type="z3998:initialism">\1.g.</abbr>""") 
        
           sub(r"(?<!(?:\.|\B))\bN\.?B\.\B", r"""<abbr epub:type="z3998:initialism">N.B.</abbr>""")

So the representation of the rules doesn't have to change much. The difference is that these regexes are no longer applied to the entire raw XML document at once, but only to one bit of text at a time. This avoids changes to attribute values and makes it easier to say things like: "do not apply these rules inside <abbr> tags".

One thing that does become a bit more difficult is when the regexes look at XML tags. For example, when deciding to add class="eoc" to <abbr> tags at the end of a paragraph <p>: (original regexes in the comments)

tools/se/formatting.py

Lines 228 to 239 in b4e9357

    
           # sub(r"<abbr>etc\.</abbr>([”’]?(?:</p>|\s+[“‘]?[\p{Uppercase_Letter}]))", r"""<abbr class="eoc">etc.</abbr>\1""") 
        
           # sub(r"""<abbr( epub:type="[^"]+")?>([^<]+\.)</abbr>([”’]?</p>)""", r"""<abbr class="eoc"\1>\2</abbr>\3""") 
        
           if el.text == "etc." and not el.children: 
        
           	if el.tail and regex.match(r"[”’]?\s+[“‘]?[\p{Uppercase_Letter}]", el.tail): 
        
           		eoc = True 
        
           if el.text.endswith(".") and not el.children: 
        
           	if el.parent.tag == "p" and el.next is None and (not el.tail or el.tail in "”’"): 
        
           		eoc = True 
        
           if eoc and not "eoc" in (el.get_attr("class") or ""): 
        
           	el.add_attr_value("class", "eoc")

Or a simpler example, checking we're inside a <p> and using the $ marker in the regex:

tools/se/formatting.py

Lines 192 to 194 in b4e9357

    
           # keep a period after TV that terminates a clause 
        
           if el.tag == "p": 
        
           	sub(r"(?<!(?:\.|\B))T\.?V\.([”’]?)$", r"""<abbr epub:type="z3998:initialism">TV</abbr>.\1""")

Secondly, it also becomes more difficult to apply regexes that match text on both sides of a tag, but there currently aren't any rules that need this.

I'm not convinced this is necessarily better, but it's good to know that it's not a binary choice between regex and syntax tree parsing.

(WIP) semanticate: Recursively apply regex rules to the XML tree.

b4e9357

acabal force-pushed the master branch from d8b1b7d to 35d971c Compare October 15, 2024 15:48

acabal force-pushed the master branch from 2d840d2 to 047269d Compare November 7, 2024 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

semanticate: Work on the XML elements tree #753

semanticate: Work on the XML elements tree #753

gvtulder commented Sep 23, 2024

acabal commented Sep 23, 2024

alerque commented Sep 23, 2024

acabal commented Sep 23, 2024 •

edited

Loading

gvtulder commented Sep 23, 2024

semanticate: Work on the XML elements tree #753

Are you sure you want to change the base?

semanticate: Work on the XML elements tree #753

Conversation

gvtulder commented Sep 23, 2024

acabal commented Sep 23, 2024

alerque commented Sep 23, 2024

acabal commented Sep 23, 2024 • edited Loading

gvtulder commented Sep 23, 2024

acabal commented Sep 23, 2024 •

edited

Loading