-
Notifications
You must be signed in to change notification settings - Fork 24
Sentence complete classifier #389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -0,0 +1 @@ | |||
Languages can be very dynamic and complicated. This brick does not actually try to be able to accurately classify all sentences, which would be quite complex. Instead, this brick is meant to check if some characteristics apply that a lot of complete sentences have. These characteristics being: does the sentence starts with an uppercase character, if it ends on a punctuation and if it contains at least two nouns and a verb. The name `starts_with_uppercase_ends_with_punctuation_and_contains_two_nouns_and_a_verb` would be a bit long for a brick, though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the name and description of the brick, I would not expect the aggregation logic.
I suggest mentioning it in the README
cognition_init_mapping = { | ||
"incomplete": "Needs fix", | ||
"complete": "null" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the mapping for partly complete
is missing
"incomplete": "Needs fix", | ||
"complete": "null" | ||
}, | ||
integrator_inputs={ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outputs
are missing
classifications = [] | ||
for sent in doc.sents: | ||
if sent[0].is_title and sent[-1].is_punct: | ||
has_noun = 2 | ||
has_verb = 1 | ||
for token in sent: | ||
if token.pos_ in ["NOUN", "PROPN", "PRON"]: | ||
has_noun -= 1 | ||
elif token.pos_ == "VERB": | ||
has_verb -= 1 | ||
if has_noun < 1 and has_verb < 1: | ||
classifications.append("complete") | ||
else: | ||
classifications.append("incomplete") | ||
else: | ||
classifications.append("incomplete") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of the following restructuring of the code?
classifications = []
for sent in doc.sents:
if not (sent[0].is_title and sent[-1].is_punct):
classification.append("incomplete")
continue
has_noun = 2
has_verb = 1
for token in sent:
if token.pos_ in ["NOUN", "PROPN", "PRON"]:
has_noun -= 1
elif token.pos_ == "VERB":
has_verb -= 1
if has_noun < 1 and has_verb < 1:
classifications.append("complete")
continue
classifications.append("incomplete")
another option could also be to encapsulate the per sentence classification in a function, and call this via a list comprehension
refinery
API
common code
additional points:
__init__.py
)@FelixKirschKern Not sure at all about the logic behind this brick, would love to get your feedback. Classifying whether or not a sentence is complete is not easy, but I think it's alright to check if a sentence already has some features we would expect in a usual sentence (Uppercase character in the beginning, end with a punctuation and contains nouns and a verb). This will of course miss some sentences. I also expect that the input will rarely just be one sentence, but multiple sentences in a text that might just be cut off at the end due to chunking. There might be better ways for for the aggregation part, too.