UP Arabic #36
Comments
For Issue 2: PETRwriter.py post-processes the events of each story with this function, so that a story can contain only one unique (DATE, SRC, TGT, EVENT) tuple. The function is the same as the one in petrarch2. |
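To illustrate the per-story deduplication described above, here is a minimal sketch. It is not the actual PETRwriter.py code: the function name `unique_events`, the dictionary keys, and the sample values are all hypothetical and chosen only for illustration.

```python
def unique_events(events):
    """Keep the first event for each (date, source, target, code) tuple.

    `events` is a list of dicts for ONE story. The key names used here are
    assumptions for this sketch, not the field names UniversalPetrarch uses.
    """
    seen = set()
    unique = []
    for event in events:
        key = (event["date"], event["source"], event["target"], event["code"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Two sentences of the same story that yield the same tuple (illustrative values).
story_events = [
    {"sentence": "s1", "date": "20180720", "source": "USA", "target": "IRQ", "code": "195"},
    {"sentence": "s2", "date": "20180720", "source": "USA", "target": "IRQ", "code": "195"},
]
deduplicated = unique_events(story_events)
```

Under this scheme, a run can report two generated events for a story and still write only one, because both sentences collapse onto the same tuple.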
@JingL1014 What about question 3? Why does it correctly code an event when there's no preposition, but fail to code an event when there is a preposition? In the example @khaledJabr posted, it fails to correctly identify Iraq as the target when a preposition is present. Is this related to #26? |
Here is a more detailed look at the third issue: parsed file
Note: the two sentences above are exactly the same, except that in the second one the preposition (ADP) is removed. Here is the debug output
|
The difference is caused by the input parse tree. In the first sentence, the parser parsed tokens 2-6 as one noun phrase. In the second sentence, it parsed token 2 as one noun phrase and tokens 3-5 as another. In both sentences, the POS tag of "أمريكا" is incorrectly predicted as "X". I think this happens because the word "أمريكا" does not appear in the training dataset that UDPipe uses to train the Arabic parser. Does "X" appear often in the parse trees when you preprocess Arabic sentences? |
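One way to answer the question about how often "X" shows up is to count UPOS tags in the parser output. The sketch below assumes the parse is available as plain CoNLL-U text (ten tab-separated columns per token, UPOS in the fourth column); the function names `upos_counts` and `x_ratio` are hypothetical, and the parsed file that UniversalPetrarch reads may wrap this format differently.

```python
from collections import Counter


def upos_counts(conllu_text):
    """Count UPOS tags in CoNLL-U text, skipping comments and non-word lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a token line
        if "-" in columns[0] or "." in columns[0]:
            continue  # multiword token range or empty node
        counts[columns[3]] += 1
    return counts


def x_ratio(conllu_text):
    """Share of tokens tagged "X" (0.0 when there are no tokens)."""
    counts = upos_counts(conllu_text)
    total = sum(counts.values())
    return counts["X"] / total if total else 0.0
```

A high ratio on a sample of news text would suggest that the model often falls back to "X" for words it has not seen, which is the failure mode described above.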
I see the point you're making there, but I am not quite sure it is only UDPipe. To test, I changed the word "أمريكا" (the US) to Iraq, and the problem is still the same. Here is what the parsed file looks like
I am not sure how often, but I do come across it every now and then |
In the new sentences, the sentence "شنت العراق ضربة جوية على العراق" should generate the correct output, since its parse tree has a subject (token 2) and an object (tokens 3-6). The sentence "شنت العراق ضربة جوية العراق", however, has only one noun phrase according to its parse tree. Are you using UDPipe 1.1.0 with language model version 1.4 from the GitHub repository? Could you try UDPipe 1.2.0 with language model version 2.0? |
Good catch! yes, that was the issue, and it fixed it! |
Then make sure we close that and add the appropriate versions and
dependencies so this does not come up for others.
On Fri, Jul 20, 2018, 16:49 MalandroKLD wrote:
Good catch! yes, that was the issue, and it fixed it!
|
UP Arabic
I pushed a separate branch for UniversalPetrarch Arabic development, and here are some of the things and issues I'd like to mention.
Code changes:
The code itself, aside from the dictionaries, is almost identical to the master branch, and it is up to date as of the time of writing this issue. There is only one minor change, in the generatedParsedFile.py script, in which I used `pyarabic` to remove diacritics from Arabic text. Diacritics are extra short vowels added to Arabic words, and they are not used in modern Arabic writing. This includes our news sources and the data we collected from the coders using the prodigy interface.
Dictionaries:
The dictionaries on the Arabic branch are the most up to date. The CAMEO verbs dictionary that we generated using the utd interface was augmented: we passed the dictionary through UDPipe so that the dictionary entries are identical to the parsed news articles passed to UniversalPetrarch. The code for this can be found here. It took a lot of manual cleaning to make the dictionaries compatible and working with UniversalPetrarch. I would say we have good dictionaries now, but they are still a work in progress, as there are still things to improve. I will detail the Arabic dictionary development in a separate issue.
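The diacritic removal mentioned under Code changes plausibly matters for the dictionaries as well, since entries and parsed text can only match if both are normalized the same way. Below is a standard-library sketch of that normalization. It is a stand-in, not the branch's code: the branch uses `pyarabic` (whose `araby` module offers `strip_tashkeel`, as far as I know), and the helper name `strip_diacritics` and the exact character range here are assumptions.

```python
import re

# Arabic short-vowel and related marks (tashkeel), U+064B (fathatan)
# through U+0652 (sukun). Assumed to cover the marks of interest here.
_TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return `text` with Arabic diacritic marks removed."""
    return _TASHKEEL.sub("", text)


# The same verb written with and without diacritics.
vocalized = "\u0634\u064E\u0646\u0651\u064E\u062A"
plain = strip_diacritics(vocalized)
```

Applying the same function to dictionary entries and to input sentences keeps a vocalized spelling from missing its unvocalized counterpart.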
Performance:
The intention was to use the validation scripts from the dev-validate branch to test UniversalPetrarch in Arabic. However, due to the lack of validation records, I have not been able to do that yet. Instead, I ran a few basic sentences through UP to gauge its performance. Here are some of the issues I encountered:
1 On very basic input, it works.
Here is a basic example in which I used two sentences. The first sentence is "US airstriked Iraq", and the second is "US attacked Iraq"
The output was
This is correct according to our dictionaries, which means that UP identified and matched the event and actors correctly.
2 This is an issue I have noticed for a while. In the previous example there are two sentences that result in the same event, yet UP outputs only one event. Many other times, UP reports that X events were generated but outputs fewer (sometimes far fewer). I am guessing this is functionality that removes duplicate events, but can anyone confirm that it is only that?
3 UP Arabic gets thrown off when prepositions are found in Arabic. Take the first sentence in the previous example and modify it in Arabic to include a preposition before Iraq (in Arabic, it reads as "The US made an airstrike at Iraq"),
The output is
It mismatches the airstrike (actually just the 'strike' part of the word, as 'airstrike' is two words in Arabic), and it codes it as IRQ.
I will continue updating this issue as I test UP in Arabic more in the upcoming days.