Skip to content

Sentence complete classifier #389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

LeonardPuettmannKern
Copy link
Contributor

refinery

  • Tested by creator on refinery
  • Tested by reviewer on refinery
  • Ensured that output of brick conforms with refinery structure (to be checked by reviewer)

API

  • Tested by creator on localhost:8000/docs
  • Tested by reviewer on localhost:8000/docs

common code

  • Common code tested in notebook/ script by creator
  • Common code tested in notebook/ script by reviewer
  • Common code contains docstrings and type hints

additional points:

  • Docstring and README is existing
  • Import statements (in __init__.py)
  • (If necessary) Added dependency to requirements.txt
  • (If necessary) Added dependency to issue for refinery env here
  • Published brick to Strapi CMS (locally)

@FelixKirschKern Not sure at all about the logic behind this brick, would love to get your feedback. Classifying whether or not a sentence is complete is not easy, but I think it's alright to check if a sentence already has some features we would expect in a usual sentence (Uppercase character in the beginning, end with a punctuation and contains nouns and a verb). This will of course miss some sentences. I also expect that the input will rarely just be one sentence, but multiple sentences in a text that might just be cut off at the end due to chunking. There might be better ways for for the aggregation part, too.

@FelixKirschKern FelixKirschKern marked this pull request as ready for review October 23, 2023 08:24
@@ -0,0 +1 @@
Languages can be very dynamic and complicated. This brick does not actually try to be able to accurately classify all sentences, which would be quite complex. Instead, this brick is meant to check if some characteristics apply that a lot of complete sentences have. These characteristics being: does the sentence starts with an uppercase character, if it ends on a punctuation and if it contains at least two nouns and a verb. The name `starts_with_uppercase_ends_with_punctuation_and_contains_two_nouns_and_a_verb` would be a bit long for a brick, though.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the name and description of the brick, I would not expect the aggregation logic.
I suggest mentioning it in the README

Comment on lines +21 to +24
cognition_init_mapping = {
"incomplete": "Needs fix",
"complete": "null"
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the mapping for partly complete is missing

"incomplete": "Needs fix",
"complete": "null"
},
integrator_inputs={
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs are missing

Comment on lines +21 to +36
classifications = []
for sent in doc.sents:
if sent[0].is_title and sent[-1].is_punct:
has_noun = 2
has_verb = 1
for token in sent:
if token.pos_ in ["NOUN", "PROPN", "PRON"]:
has_noun -= 1
elif token.pos_ == "VERB":
has_verb -= 1
if has_noun < 1 and has_verb < 1:
classifications.append("complete")
else:
classifications.append("incomplete")
else:
classifications.append("incomplete")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the following restructuring of the code?

classifications = []
for sent in doc.sents:
    if not (sent[0].is_title and sent[-1].is_punct):
        classification.append("incomplete")
        continue
   has_noun = 2
   has_verb = 1
   for token in sent:
       if token.pos_ in ["NOUN", "PROPN", "PRON"]:
            has_noun -= 1
        elif token.pos_ == "VERB":
            has_verb -= 1
            
        if has_noun < 1 and has_verb < 1:
            classifications.append("complete")
            continue
   classifications.append("incomplete")

another option could also be to encapsulate the per sentence classification in a function, and call this via a list comprehension

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants