Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#420 #433 - Berkeley Parser improvements #421

Closed
wants to merge 24 commits into from

Conversation

mjlaali
Copy link

@mjlaali mjlaali commented Oct 19, 2015

Fixes #420 #422

@mjlaali mjlaali closed this Oct 21, 2015
@mjlaali mjlaali reopened this Oct 21, 2015
return AnalysisEngineFactory.createEngineDescription(
ParserAnnotator.class,
ParserAnnotator.PARAM_PARSER_MODEL_PATH,
modelPath,
ParserWrapper_ImplBase.PARAM_OUTPUT_TYPES_HELPER_CLASS_NAME,
DefaultOutputTypesHelper.class.getName());
DefaultOutputTypesHelper.class.getName(),
ParserWrapper_ImplBase.PARAM_TOKENIZER_CLASS_NAME,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjlaali - It is more UIMA-like to make the tokenizer a separate annotator which adds tokens to the CAS. This way the ParserAnnotator can just focus on extracting what it needs from the CAS independently of how tokenization and other preprocessing was performed.

For examples of this decoupling, I would look at our ClearNLP wrappers.

@leebecker
Copy link
Contributor

Right now, it seems like the ParserAnnotator is too closely coupled to the tokenizers and normalizers. Before it is merged, I would prefer to see the Tokenizer split into its own annotator, so the logic of the ParserAnnotator can remain mostly unchanged.

Similarly, if you want to normalize the text and operate on changed representations of the text, I would suggest writing normalization as its own AnalysisEngine/Annotator as well. The UIMA approach would put the normalized text in a new view that the ParserAnnotator would run on instead of the initialView.

@mjlaali
Copy link
Author

mjlaali commented Oct 29, 2015

Lee, thank you for your feedbacks. I separated the Berkeley tokenizer and pos tagger from the parser.

Regard the normalization, I am not sure. Normalizing the text is only done to make the text compatible with the Penn Treebank conventions (such as changing '(' to '-LRB') so that parser model matches with the input text. I believe creating an extra view just to save the PTB style of text is not only add overhead, but also adding all tokens to this view does not make sense to me.

public class DefaultBerkeleyTokenizerTest extends BerkeleyTestBase {

@Test
public void givenASentenctWhenTokenizingThenAllTokenAreReturned() throws UIMAException{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in test name. Change givenASentenctWhenTokenizingThenAllTokenAreReturned to givenASentenceWhenTokenizingThenAllTokenAreReturned.

@reckart reckart changed the title Issue/420 #420 / #433 - Berkeley Parser improvements Jul 29, 2022
@reckart reckart changed the title #420 / #433 - Berkeley Parser improvements #420 #433 - Berkeley Parser improvements Jul 29, 2022
@reckart reckart marked this pull request as draft November 4, 2022 12:45
@reckart
Copy link
Contributor

reckart commented Nov 5, 2022

Ping? @leebecker @mjlaali @bethard anybody still listening here? Any opinion whether it might be worth saving this effort or discarding it?

Unfortunately, the PR settings does not allow other people than @mjlaali to contribute, so I cannot even let GitHub update the PR and check if it still builds...

@bethard
Copy link
Contributor

bethard commented Nov 5, 2022

Looks like the Java Berkeley Parser hasn't been updated in 7 years: https://github.com/slavpetrov/berkeleyparser. So I'm guessing it would be fined to abandon this (and the parser wrapper).

@mjlaali mjlaali closed this Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tokenization in the Berkeley parser wrapper may be not compatible with the PTB
4 participants