-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#420 #433 - Berkeley Parser improvements #421
Conversation
Instead of throwing an exception, it is better to print an error message.
return AnalysisEngineFactory.createEngineDescription( | ||
ParserAnnotator.class, | ||
ParserAnnotator.PARAM_PARSER_MODEL_PATH, | ||
modelPath, | ||
ParserWrapper_ImplBase.PARAM_OUTPUT_TYPES_HELPER_CLASS_NAME, | ||
DefaultOutputTypesHelper.class.getName()); | ||
DefaultOutputTypesHelper.class.getName(), | ||
ParserWrapper_ImplBase.PARAM_TOKENIZER_CLASS_NAME, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mjlaali - It is more UIMA-like to make the tokenizer a separate annotator which adds tokens to the CAS. This way the ParserAnnotator can just focus on extracting what it needs from the CAS independently of how tokenization and other preprocessing was performed.
For examples of this decoupling, I would look at our ClearNLP wrappers.
Right now, it seems like the Similarly, if you want to normalize the text and operate on changed representations of the text, I would suggest writing normalization as its own AnalysisEngine/Annotator as well. The UIMA approach would put the normalized text in a new view that the |
Lee, thank you for your feedbacks. I separated the Berkeley tokenizer and pos tagger from the parser. Regard the normalization, I am not sure. Normalizing the text is only done to make the text compatible with the Penn Treebank conventions (such as changing '(' to '-LRB') so that parser model matches with the input text. I believe creating an extra view just to save the PTB style of text is not only add overhead, but also adding all tokens to this view does not make sense to me. |
public class DefaultBerkeleyTokenizerTest extends BerkeleyTestBase { | ||
|
||
@Test | ||
public void givenASentenctWhenTokenizingThenAllTokenAreReturned() throws UIMAException{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in test name. Change givenASentenctWhenTokenizingThenAllTokenAreReturned
to givenASentenceWhenTokenizingThenAllTokenAreReturned
.
Ping? @leebecker @mjlaali @bethard anybody still listening here? Any opinion whether it might be worth saving this effort or discarding it? Unfortunately, the PR settings does not allow other people than @mjlaali to contribute, so I cannot even let GitHub update the PR and check if it still builds... |
Looks like the Java Berkeley Parser hasn't been updated in 7 years: https://github.com/slavpetrov/berkeleyparser. So I'm guessing it would be fined to abandon this (and the parser wrapper). |
Fixes #420 #422