Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A question regarding CFG morphological analysis #11

Open
dobijan opened this issue Nov 27, 2017 · 2 comments
Open

A question regarding CFG morphological analysis #11

dobijan opened this issue Nov 27, 2017 · 2 comments
Labels

Comments

@dobijan
Copy link

dobijan commented Nov 27, 2017

In the third homework one of the exercises is about implementing morphological analysis. It is unclear to me how to proceed with it. When we parse sentences, then the whitespaces determine the tag boundaries. But if all we get is one word, then what are the boundaries? Letters? Should we write a CFG grammar that accepts the input one letter at a time? Or are we supposed to break the word into chunks (leg, obb, at, et, etc...) based on some logic, and then apply the parsing on it? The Tree provided as an example implies the latter, but that chunking is basically almost the parsing itself... Besides, without the underlying automaton, if we look at the surface form, then the 'tag' boundaries can be ambiguous. So the 'some logic' can be very complex, for example a full FST... So, I am simply not sure what the exercise demands from me.

@dobijan dobijan changed the title A question regarding CFG morphological analyis A question regarding CFG morphological analysis Nov 27, 2017
@DavidNemeskey
Copy link
Collaborator

Whitespaces don't determine the tag boundaries. If you look closely, you will see that we called .split() on the inputs to the parser exactly because of that. However, the parse() method expects a sequence of tokens, so for the Arithmetics exercise, these are equivalent:

aparser.parse('1 - 2 / ( 3 - 4 )'.split())  # list
aparser.parse(['1', '-', '2', '/', '(', '3', '-', '4', ')'])  # same as the previous one, actually
aparser.parse('1-2/(3-4)')  # string

So you can see that some parsing is going on. Two hints:

  • think about what your terminal symbols should be
  • your grammar doesn't have to be CNF for this exercise

@dobijan
Copy link
Author

dobijan commented Nov 27, 2017

I decided to consider letters as base tokens. I wrote a letter-based grammar. Due to this the parse tree was a little cluttered, but then I postprocessed that tree to collapse sibling letter terminals into word parts. It works, I don't know whether this is the best solution, but the alternative seemed far more complex (somehow universally find the word parts in the word without an FSA, regexes for this purpose would be quite numerous and long). This way at least, only the grammar defines the word parts, and this information is not split between two distinct parts of the program.

I'm interested whether the others take a different approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants