Formatted examples #1138

Open
wants to merge 7 commits into
base: main

1313ou
Contributor

@1313ou 1313ou commented Nov 12, 2024

One may have noticed that the punctuation and capitalization of examples are rather sloppy.

This PR aims to remedy this, based on a classification of examples. Classification was partly automated using syntactic dependencies provided by spaCy, then Stanza, analyzed by an ad-hoc algorithm and cross-checked by 'hand' review. Unfortunately, deep models were largely ineffective: I didn't find one trained on dictionary data, and the others didn't perform well.

Examples are split into:

  • sentences (or standalone) : first character is upper-cased, final period added if needed
  • phrases (or incomplete) : first character is lower-cased, no final punctuation is added
    (in the case of subjectless verb phrases like '… had scarcely rung the bell when the door flew open', '…' ellipsis is prepended). Care has been taken to leave initial upper case where required like in 'God's mercy'
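The two formatting rules above can be sketched as follows. This is only an illustration, not the code used for this PR: the function name `format_example`, the labels `"sentence"`, `"phrase"` and `"verb_phrase"`, and the `keep_case` override flag are all hypothetical.

```python
ELLIPSIS = "…"
FINAL_PUNCTUATION = ".!?…"  # assumed set of characters that already close a sentence


def format_example(text: str, kind: str, keep_case: bool = False) -> str:
    """Format one example according to its (hypothetical) class label.

    kind: "sentence", "phrase" or "verb_phrase" (a subjectless verb phrase).
    keep_case: override that preserves a required initial capital.
    """
    text = text.strip()
    if not text:
        return text
    if kind == "sentence":
        # Upper-case the first character and add a final period if needed.
        text = text[0].upper() + text[1:]
        if text[-1] not in FINAL_PUNCTUATION:
            text += "."
        return text
    # Phrases: lower-case the first character, never add final punctuation.
    if not keep_case:
        text = text[0].lower() + text[1:]
    if kind == "verb_phrase":
        # Subjectless verb phrases get a leading ellipsis.
        text = ELLIPSIS + " " + text
    return text
```

For instance, `format_example("God's mercy", "phrase", keep_case=True)` leaves the capital untouched, while `format_example("Treat the infection", "phrase")` lower-cases the first letter.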

The dividing line is sometimes hard to draw between verb phrases and imperatives like 'treat the infection with antibiotics', which could be an instruction (imperative) or just a verb phrase expressing collocations, usually object complements. Fortunately, such words as 'you', 'your' ... favor imperative classification, while 'one', 'one's' ... favor classification as a verb phrase. Sometimes an imperative context can hardly be thought of (like for 'square the circle').
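The pronoun cue described above could be expressed as a small voting heuristic. This is a minimal sketch under assumptions: the cue sets contain only the words named in the text, and the function name `cue_vote` and its return labels are hypothetical. It does not distinguish 'one' as a pronoun from 'one' as a numeral, which a review would still have to catch.

```python
import re

# Cue words taken from the description; the real lists may be longer.
IMPERATIVE_CUES = {"you", "your"}
VERB_PHRASE_CUES = {"one", "one's"}


def cue_vote(text: str) -> str:
    """Return "imperative", "verb_phrase" or "undecided" from pronoun cues."""
    # Lower-case words, keeping an apostrophe suffix so "one's" stays whole.
    words = set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    imperative = len(words & IMPERATIVE_CUES)
    verb_phrase = len(words & VERB_PHRASE_CUES)
    if imperative > verb_phrase:
        return "imperative"
    if verb_phrase > imperative:
        return "verb_phrase"
    return "undecided"
```

An example without any cue word, such as 'treat the infection with antibiotics', comes back as `"undecided"` and is left to manual review.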

As this is partly automated and the volume reviewed is huge, some mistakes must have slipped through because of misassessment, oversight or simply fatigue. This is inevitable.

@arademaker
Member

It would be informative if you shared the code used to automate the fixes. For instance, how did you manage this:

Care has been taken to leave initial upper case where required like in 'God's mercy'

@1313ou
Contributor Author

1313ou commented Nov 13, 2024

@arademaker, you'll find here the classification on which this formatting is based (see second sheet for more explanations).

Basically it's a Stanza-assisted 'hand'-review of all examples. Stanza's deep dependencies and constituency dependency (in the last two columns) are designed to flag certain conditions but a great number of their findings are overridden by the review.

Dependencies are analyzed along these lines: find a verb phrase, find whether it has a subject, consider the mood feature, etc., to determine whether the input is a sentence.
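The steps just listed might look roughly like the sketch below. It is not the ad-hoc algorithm used for this PR. To stay self-contained it does not call Stanza; instead it assumes a hypothetical `Token` record carrying Universal Dependencies style annotations (`upos`, `deprel`, `head`, `feats`), which would have to be filled from a parser's output. The subject relation list and the returned labels are assumptions as well.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical record holding Universal Dependencies style annotations."""
    text: str
    upos: str                    # universal POS tag, e.g. "VERB", "NOUN"
    deprel: str                  # dependency relation, e.g. "root", "nsubj"
    head: int                    # 1-based index of the head token, 0 for the root
    feats: Optional[str] = None  # e.g. "Mood=Imp|VerbForm=Fin"


SUBJECT_RELATIONS = {"nsubj", "nsubj:pass", "csubj", "csubj:pass", "expl"}


def classify(tokens: List[Token]) -> str:
    """Return "sentence", "verb_phrase" or "phrase" for one parsed example."""
    root_index = next((i for i, t in enumerate(tokens, start=1) if t.head == 0), None)
    if root_index is None:
        return "phrase"
    root = tokens[root_index - 1]
    children = [t for t in tokens if t.head == root_index]

    # Step 1: find a verb phrase (verbal root, or a nominal root with a copula).
    has_predicate = root.upos in {"VERB", "AUX"} or any(c.deprel == "cop" for c in children)
    if not has_predicate:
        return "phrase"

    # Step 2: find whether the predicate has a subject.
    if any(c.deprel in SUBJECT_RELATIONS for c in children):
        return "sentence"

    # Step 3: consider the mood feature; an imperative needs no overt subject.
    if "Mood=Imp" in (root.feats or "").split("|"):
        return "sentence"

    return "verb_phrase"
```

A verdict from such a function would only flag a candidate; as noted above, many automatic findings were overridden during review.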

A column is dedicated to directions that override the standard formatting behaviour (like in 'God's mercy').

The (limited) redundancy doesn't mean it's fool-proof: 49k examples are a lot to review and errors are bound to slip in. Fixes will be welcome.

@jmccrae
Member

jmccrae commented Nov 13, 2024

This is a large PR and I would flag that it is automatically constructed, which is something that we advise against in our contribution guidelines. I would probably reject this from a new contributor, but as @1313ou has made many good PRs, I trust the quality of this contribution.

A quick check shows that 1,323/49,638 (2.6%) of examples end with a period and 11,337/49,638 (22.8%) of examples start with a capital letter. As such, it seems that we have an inconsistency that this PR would improve.
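A check of this kind could be run along the following lines. This is a sketch only, not the script behind the figures above: the function name `punctuation_stats` is hypothetical, and it assumes the examples have already been loaded into a list of strings.

```python
from typing import Dict, List


def punctuation_stats(examples: List[str]) -> Dict[str, float]:
    """Count examples that end with a period or start with a capital letter."""
    total = len(examples)
    ends_with_period = sum(1 for e in examples if e.rstrip().endswith("."))
    # [:1] avoids an IndexError on empty strings; "".isupper() is False.
    starts_with_capital = sum(1 for e in examples if e.lstrip()[:1].isupper())
    return {
        "total": total,
        "ends_with_period": ends_with_period,
        "starts_with_capital": starts_with_capital,
        "share_period": ends_with_period / total if total else 0.0,
        "share_capital": starts_with_capital / total if total else 0.0,
    }
```

Running it before and after the PR would show how far each share has moved.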

I would choose to accept this, but I will leave it open to the rest of the community.

3 participants