Formatted examples #1138

1313ou · 2024-11-12T17:40:22Z

One may have noticed that the punctuation and capitalization of examples are rather sloppy.

This PR remedies that, based on a classification of the examples. Classification was partly automated, using syntactic dependencies provided first by spaCy and then by Stanza, analyzed by an ad-hoc algorithm and cross-checked by 'hand' review. Unfortunately, deep models were largely ineffective: I didn't find one trained on dictionary data, and the others didn't perform well.

Examples are split into:

sentences (or standalone): first character is upper-cased, final period added if needed
phrases (or incomplete): first character is lower-cased, no final punctuation is added
(in the case of subjectless verb phrases like '… had scarcely rung the bell when the door flew open', a '…' ellipsis is prepended). Care has been taken to leave the initial upper case where required, as in 'God's mercy'.
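
A minimal sketch of these formatting rules, for illustration only. It is not the code used for this PR: the function name, the class labels and the `keep_case` override flag are hypothetical, and the end-of-sentence check is deliberately simplistic (it ignores closing quotes, for instance).

```python
def format_example(text: str, kind: str, keep_case: bool = False) -> str:
    """Format one example according to its class.

    kind      -- 'sentence', 'phrase' or 'verb_phrase' (hypothetical labels)
    keep_case -- hypothetical override: leave the initial capital untouched,
                 as needed for examples like "God's mercy"
    """
    text = text.strip()
    if not text:
        return text

    if kind == "sentence":
        # Upper-case the first character, add a final period if needed.
        text = text[0].upper() + text[1:]
        if text[-1] not in ".!?…":
            text += "."
        return text

    # Phrases: lower-case the first character, add no final punctuation.
    if not keep_case:
        text = text[0].lower() + text[1:]

    # Subjectless verb phrases get a leading ellipsis (unless already there).
    if kind == "verb_phrase" and not text.startswith("…"):
        text = "… " + text
    return text
```

For example, `format_example("A big house", "phrase")` lower-cases the initial letter, while passing `keep_case=True` leaves "God's mercy" unchanged.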

The dividing line is sometimes hard to draw between verb phrases and imperatives: 'treat the infection with antibiotics' could be an instruction (imperative) or just a verb phrase expressing collocations, usually object complements. Fortunately, words such as 'you', 'your' ... favor classification as imperative, while 'one', 'one's' ... favor classification as verb phrase. Sometimes an imperative context is hard to imagine (as for 'square the circle').
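
The cue-word idea can be sketched as follows. This is an assumption about how such a hint might be computed, not the actual algorithm: the word lists contain only the words named above (the real lists are longer), the function name is hypothetical, and a hit is only a hint to be confirmed by review ('one' can also be a numeral, for instance).

```python
import re

# Only the cue words mentioned above; the real lists are presumably longer.
IMPERATIVE_CUES = {"you", "your"}
VERB_PHRASE_CUES = {"one", "one's"}


def cue_vote(text: str) -> str:
    """Return 'imperative', 'verb_phrase' or 'undecided' based on cue words."""
    # Lower-cased word tokens, keeping a possessive suffix such as "one's".
    tokens = set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    if tokens & IMPERATIVE_CUES:
        return "imperative"
    if tokens & VERB_PHRASE_CUES:
        return "verb_phrase"
    return "undecided"
```

An example with no cue word, such as 'treat the infection with antibiotics', comes back as 'undecided' and has to be settled by hand.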

As this is partly automated and the volume reviewed is huge, errors must have slipped through because of misassessment, mistakes or simply fatigue. This is inevitable.


arademaker · 2024-11-12T22:10:13Z

It would be informative if you shared the code used to automate the fixes. For instance, how did you manage the following?

Care has been taken to leave initial upper case where required like in 'God's mercy'

1313ou · 2024-11-13T08:45:49Z

@arademaker, you'll find here the classification on which this formatting is based (see second sheet for more explanations).

Basically it's a Stanza-assisted 'hand' review of all examples. Stanza's deep dependencies and constituency dependency (in the last two columns) are designed to flag certain conditions, but a great number of their findings are overridden by the review.

Dependencies are analyzed along these lines: find a verb phrase, check whether it has a subject, consider the mood feature, etc., to determine whether the input is a sentence.
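
A rough sketch of that line of analysis, assuming Universal Dependencies-style annotations of the kind a dependency parser produces. It does not call Stanza: the `Token` class, the `classify` function and its labels are hypothetical stand-ins working on hand-built tokens, and the actual ad-hoc algorithm certainly considers more conditions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for one parsed word (UD-style fields)."""
    text: str
    upos: str                    # universal POS tag, e.g. 'VERB'
    deprel: str                  # dependency relation to the head
    head: int                    # 1-based index of the head, 0 for the root
    feats: Optional[str] = None  # e.g. 'Mood=Imp|VerbForm=Fin'


SUBJECT_RELS = {"nsubj", "nsubj:pass", "csubj", "csubj:pass", "expl"}


def classify(tokens: List[Token]) -> str:
    """Return 'sentence', 'imperative', 'verb_phrase' or 'phrase'."""
    root_idx = next((i for i, t in enumerate(tokens, 1) if t.head == 0), None)
    if root_idx is None:
        return "phrase"
    root = tokens[root_idx - 1]
    children = [t for t in tokens if t.head == root_idx]

    # Find a verb phrase: a verbal root, or a nominal root with a copula.
    verbal = root.upos in {"VERB", "AUX"} or any(c.deprel == "cop" for c in children)
    if not verbal:
        return "phrase"

    # Does it have a subject?
    if any(c.deprel in SUBJECT_RELS for c in children):
        return "sentence"

    # Consider the mood feature of the root.
    if root.feats and "Mood=Imp" in root.feats.split("|"):
        return "imperative"
    return "verb_phrase"
```

With real parser output, the same fields would be read from the parser's word objects; whatever this returns is only a flag for the review to confirm or override.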

A column is dedicated to directions that override the standard formatting behaviour (like in 'God's mercy')

The (limited) redundancy doesn't mean it's fool-proof: 49k examples are a lot to review and errors are bound to slip in. Fixes will be welcome.

jmccrae · 2024-11-13T09:29:25Z

This is a large PR and I would flag that it is automatically constructed, which is something that we advise against in our contribution guidelines. I would probably reject this from a new contributor, but as @1313ou has made many good PRs, I trust the quality of this contribution.

A quick check shows that 1,323/49,638 (2.6%) of examples end with a period and 11,337/49,638 (22.8%) of examples start with a capital letter. As such, it seems that we have an inconsistency that this PR would improve.
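
A check of this kind can be sketched in a few lines. This is an illustration under assumptions, not necessarily how the figures above were obtained: the function name is hypothetical and the examples are assumed to be available as a plain list of strings.

```python
from typing import Dict, List


def punctuation_stats(examples: List[str]) -> Dict[str, float]:
    """Count examples ending with a period or starting with a capital letter."""
    total = len(examples)
    with_period = sum(1 for e in examples if e.rstrip().endswith("."))
    capitalized = sum(1 for e in examples if e[:1].isupper())

    def pct(n: int) -> float:
        # Percentage of the total, guarding against an empty list.
        return 100.0 * n / total if total else 0.0

    return {
        "total": total,
        "with_period": with_period,
        "with_period_pct": pct(with_period),
        "capitalized": capitalized,
        "capitalized_pct": pct(capitalized),
    }
```

Running it before and after the change would show whether the two shares move in the direction the classification predicts.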

I would choose to accept this, but I will leave it open to the rest of the community.

1313ou added 6 commits October 31, 2024 10:14

Normalize

1d719d7

Merge remote-tracking branch 'upstream/main'

50eef25

Formatted examples

5346b77

Merge remote-tracking branch 'upstream/main'

b4b2ba7

Formatted examples

8c8a8bf

Merge branch 'pr_formatted_examples' of https://github.com/1313ou/eng…

b08537c

…lish-wordnet into pr_formatted_examples

Fixed 2 cases as ellipsized sentences

afa572d
