Having problem with "yetersizliği" #8

Open
eakarsu opened this issue Sep 19, 2014 · 7 comments
Comments

@eakarsu

eakarsu commented Sep 19, 2014

I am testing TRMorph with server mode like this:

flookup -S -A 127.0.0.1 trmorph.fst

When I send the word "yetersizliği" from a UDP client, the server reports this error:
sendto() failed: Message too long

and the client hangs.

@coltekin
Owner

The problem is both in TRmorph and in foma's server mode.

Foma's server mode uses UDP, so the size of a reply packet is limited to 64 KB (give or take a few hundred bytes). One solution would be to modify flookup to use TCP (or implement some other mechanism that can handle multiple packets).
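
On the client side, the hang can at least be avoided with a receive timeout. The following is a minimal sketch, not part of foma or TRmorph: it sends one word as a UDP datagram and waits for a single reply. The function name lookup, the default port 6062 (believed to be flookup's default, but check your build), and the assumption that the server answers each word with exactly one datagram are all illustrative.

```python
import socket

# Largest UDP payload over IPv4: 65535 - 20 (IP header) - 8 (UDP header).
MAX_UDP_PAYLOAD = 65507


def lookup(word, host="127.0.0.1", port=6062, timeout=2.0):
    """Send one word to a UDP lookup server.

    Returns the reply split into lines, or None if no reply arrives
    within `timeout` seconds (for example because the server could not
    fit its answer into one datagram).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(word.encode("utf-8"), (host, port))
        try:
            data, _ = sock.recvfrom(MAX_UDP_PAYLOAD)
        except socket.timeout:
            return None
    return data.decode("utf-8").splitlines()
```

A None result tells the caller to fall back (for example, to a smaller transducer) instead of blocking forever.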

The problem on the TRmorph side is that it simply generates too many analyses for the word. The main trouble is that the -(s)I suffix can be deleted after -sIz in a small number of cases. For example, 'at arabası' + sız can surface as 'at arabasız'. So, in your example, TRmorph hallucinates a -(s)I after sIz, which triples the number of analyses. Besides that, 'yeter' and 'yetersiz' are also listed in the lexicon as adjectives. Equivalent derivations are also among the analyses, but I think these are rather lexicalized, so they should stay there.

For now, here are a few workarounds that may solve the problem (at least for this word):

  • remove 'yeter' from lexicon/adjective
  • remove the line with
    s %^I z |
    between lines
    l %^I k |
    and
    %^C %^I ],
    from morph-phon.xfst. This is line 222 in the current version in the master branch.

... and recompile TRmorph.

In case you do not need full analysis, e.g., if you are only stemming, you should probably use the relevant .fst, which will probably produce much smaller output.

If you do need full analyses, you will eventually hit some form that generates enough output to cause the same problem in flookup.

I will notify you in case I have a better solution.

@eakarsu
Author

eakarsu commented Sep 19, 2014

Hi Çağrı,

Thanks for the quick response.

> in case you do not need full analysis, e.g., if you are only stemming, you should probably use the relevant .fst which will probably produce a lot smaller output.

Yes, I only need stemming, and I am using trmorph.fst:

flookup -S -A 127.0.0.1 trmorph.fst

Which .fst should I use? stem.fst?

Erol Akarsu

@coltekin
Owner

If you are compiling the FST files with the Makefile, running make stemmer generates an .fst file called stem.fst. With it, the word 'yetersizliği' should give only three possibilities (and should in general be much less likely to exceed the UDP packet size limit).

$ echo yetersizliği|flookup stem.fst 
yetersizliği    yetmek<V>
yetersizliği    yeter<Adj>
yetersizliği    yetersiz<Adj>

You can get rid of the part-of-speech tags too if you set the relevant switch in option.sh before compiling. All stemmer-related options should start with STEMMER_ and should be described sufficiently in the file.
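
If recompiling is not convenient, the tags can also be stripped after the fact. The following is a minimal sketch under two assumptions: each flookup output line has the input word, whitespace, then the result, and tags are always written in angle brackets as in the output above. The name parse_stems is hypothetical.

```python
import re

# Matches one analysis tag such as <V>, <Adj> or <N:prop>.
TAG = re.compile(r"<[^>]*>")


def parse_stems(output):
    """Map each input word to its list of distinct stems, tags removed."""
    stems = {}
    for line in output.splitlines():
        parts = line.split(None, 1)  # input word, then the rest of the line
        if len(parts) < 2:
            continue  # blank or malformed line
        word, analysis = parts
        stem = TAG.sub("", analysis).strip()
        bucket = stems.setdefault(word, [])
        if stem and stem not in bucket:
            bucket.append(stem)
    return stems
```

For the three output lines shown above, this yields a single entry for yetersizliği whose stems are yetmek, yeter and yetersiz, in that order.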

@eakarsu
Author

eakarsu commented Sep 19, 2014

Çağrı,

Excellent. The stemmer module is much better.

> You can get rid of the part of speech tags too, if you set the relevant switch in option.sh before compiling. All stemmer related options should start with STEMMER_ and should be described sufficiently in the file.

I am new to TRmorph and don't know which stemmer options I should choose or which part-of-speech tags I should remove.

My only concern is finding the stem of a word. I have tested several words below, and it is difficult to tell from the flookup output which one is the correct root. In the first run, fındık is the second result. In the second run, the second result (yeter) and the third (yetersiz) are both root candidates.
Is there any rule that can tell me which root is correct?

Thanks for your help.

eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "Fındıklı" | flookup stem.fst
Fındıklı Fındıklı<N:prop>
Fındıklı fındık

eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo yetersizliği | flookup stem.fst
yetersizliği yetmek
yetersizliği yeter
yetersizliği yetersiz

eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "Sütlü" | flookup stem.fst
Sütlü süt
Sütlü sütlü

eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "girdiler" | flookup stem.fst
girdiler girmek
girdiler girdi

eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "omurgasız" | flookup stem.fst
omurgasız omurga

eakarsu@ubuntu:~/SolrTurkihsAnalysers/TRmorph-master$ echo "omurgasızlar" | flookup stem.fst
omurgasızlar omurga
omurgasızlar omurgasızlar

@coltekin
Owner

Unfortunately, there is no easy way. The analyzer (and the stemmer) tries to produce all possible forms, so the results should be disambiguated outside the finite-state tools. The TRmorph distribution has a simple Python script to select the most likely analysis (scripts/disambigate.py). It does not make use of sentential context at all, but it turns out that it does not perform a lot worse than more complex, context-aware disambiguators. I have been working on a better disambiguator and a few other solutions/tools that may make users' lives easier, but they are not usable yet.

To get the most likely analysis (for some definition of "most likely"), one needs to analyze the input, pick the highest-scoring analysis, and strip off the analysis symbols. The Python script provided can be modified to do that; alternatively, the disambiguation code is rather simple, so porting it to another language should not take much time.
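
The three steps above can be sketched as follows. This is not the code in the TRmorph script: stem_of and best_stem are hypothetical helper names, and the scoring function is passed in as a parameter because the actual scoring model is not described here. longest_stem_score is only a toy placeholder so that the example runs.

```python
def stem_of(analysis):
    """Return the part of an analysis string before its first <tag>."""
    return analysis.split("<", 1)[0]


def best_stem(analyses, score):
    """Pick the highest-scoring analysis and strip its analysis symbols.

    `analyses` is a list of analysis strings for one input word and
    `score` is any function mapping an analysis string to a number.
    Returns None when there are no analyses.
    """
    if not analyses:
        return None
    return stem_of(max(analyses, key=score))


def longest_stem_score(analysis):
    """Toy placeholder score (an assumption): prefer the longest stem."""
    return len(stem_of(analysis))
```

A real scorer, such as the model used by the distributed script, would replace longest_stem_score without changing the other two functions.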

@eakarsu
Author

eakarsu commented Sep 20, 2014

Çağrı,

Thanks.
I have used trmorph.fst and taken the first option it returns, and that looks very good. I know it is computationally more intensive than stem.fst.

Is it possible to make foma multithreaded? I anticipate that I will have concurrent clients for foma.
Do you think that will be possible?

Thanks

Erol Akarsu

@coltekin
Owner

I think it shouldn't be very difficult to modify the foma UDP server code to make it multi-threaded, but I do not know whether the foma libraries are thread-safe. As far as I can see, the documentation does not mention it.
Regards,
Cagri
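
One conservative design, given that thread safety is unknown, is to accept requests on several threads but serialize the actual lookups behind a lock. The sketch below shows that idea with Python's standard socketserver module rather than with foma's own server code. The analyze function is a hypothetical stand-in that just echoes the word; a real front-end would call the analyzer there.

```python
import socketserver
import threading

# Only one thread at a time may enter the (possibly non-thread-safe) lookup.
_lookup_lock = threading.Lock()


def analyze(word):
    """Hypothetical stand-in for the real lookup; echoes the word."""
    return word + "\t" + word


class LookupHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is (datagram bytes, server socket).
        data, sock = self.request
        word = data.decode("utf-8").strip()
        with _lookup_lock:
            reply = analyze(word)
        sock.sendto(reply.encode("utf-8"), self.client_address)


def make_server(host="127.0.0.1", port=0):
    """Create a UDP server that handles each datagram in its own thread."""
    return socketserver.ThreadingUDPServer((host, port), LookupHandler)
```

The lock removes most of the benefit of threading for the lookup itself, but it keeps slow clients from blocking each other and can be dropped if the library turns out to be thread-safe.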
