UP Arabic #36

Closed
khaledJabr opened this issue Jul 4, 2018 · 8 comments

@khaledJabr

UP Arabic

I pushed a separate branch for UniversalPetrarch Arabic development. Here are some of the changes and issues I'd like to mention.

Code changes:
The code itself, aside from the dictionaries, is almost identical to the master branch and is up to date as of the time of writing this issue. There is only one minor change, in the generatedParsedFile.py script, where I used `pyarabic` to remove diacritics from Arabic text. Diacritics are short-vowel marks added to Arabic words, and they are generally omitted in modern Arabic writing. That includes our news sources and the data we collected from the coders using the prodigy interface.
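As a rough illustration of what diacritic removal does, here is a minimal standard-library sketch. It is a stand-in and not the `pyarabic` call used on the branch: the function name `strip_diacritics` is hypothetical, and the character range is an assumption covering the common short-vowel marks (U+064B to U+0652) plus the superscript alef (U+0670).

```python
import re

# Assumed set of marks to drop: tanween, short vowels, shadda, sukun
# (U+064B..U+0652) and the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(text):
    """Return text with Arabic short-vowel marks removed (hypothetical helper)."""
    return _DIACRITICS.sub("", text)

# A vocalized spelling of the verb from the examples in this issue,
# written with escapes so the marks are explicit.
vocalized = "\u0634\u064E\u0646\u064E\u0651\u062A\u0652"
print(strip_diacritics(vocalized) == "\u0634\u0646\u062A")
```

Text that carries no diacritics passes through unchanged, so applying this to already clean news text is harmless.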

Dictionaries:
The dictionaries on the arabic branch are the most up to date. The CAMEO verbs dictionary that we generated using the utd interface was augmented by passing it through udpipe, so that the dictionary entries are identical in form to the parsed news articles passed to UniversalPetrarch. The code for this can be found here. It took a lot of manual cleaning to make the dictionaries compatible with UniversalPetrarch. I would say we have good dictionaries now, but this is still a work in progress and there are still things to improve. I will detail the Arabic dictionary development in a separate issue.

Performance:
The intention was to use the validation scripts from the dev-validate branch to test UniversalPetrarch in Arabic. However, due to the lack of validation records, I have not been able to do that yet. Instead, I ran a few basic sentences through UP to gauge its performance. Here are some of the issues I encountered:

1 On very basic input, it works.

Here is a basic example where I used two sentences. The first sentence is "US airstriked Iraq", and the second is "US attacked Iraq".

<Sentences>
<Sentence date="20000715" id="AFP_ARB_20000715.0015_1" sentence="True" source="afp">
<Text>
شنت أمريكا ضربة جوية العراق
</Text>
<Parse>1 شنت   شن VERB  VP-A-3FS--  Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act 0  root  _  _
2  امريكا   أمريكا   X  X---------  Foreign=Foreign   1  nsubj _  _
3  ضربة  ضربة  NOUN  N------S4I  Case=Acc|Definite=Ind|Number=Sing   1  dobj  _  _
4  جوية  جوي   ADJ   A-----FS2I  Case=Gen|Definite=Ind|Gender=Fem|Number=Sing 3  amod  _  _
5  العراق   عراق  NOUN  N------S2D  Case=Gen|Definite=Def|Number=Sing   3  nmod  _  _
</Parse></Sentence>

<Sentence date="20000715" id="AFP_ARB_20000715.0015_2" sentence="True" source="afp">
<Text>
هاجمت أمريكا العراق
</Text>
<Parse>1 هاجمت هاجم  VERB  VP-A-3FS--  Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act 0  root  _  _
2  امريكا   أمريكا   X  X---------  Foreign=Foreign   1  nsubj _  _
3  العراق   عراق  NOUN  N------S4D  Case=Acc|Definite=Def|Number=Sing   1  dobj  _  _
</Parse></Sentence>


</Sentences>

The output was

Event: 20000715    USA    IRQ    124    AFP_ARB_20000715.0015_2    afp

This is correct according to our dictionaries, which means that UP identified and matched the event and actors correctly.

2 This is an issue I have noticed for a while. In the previous example there are two sentences that result in the same event, yet UP outputs only one event. Many other times, UP reports that X events were generated but outputs fewer (sometimes far fewer). I am guessing this is functionality to remove duplicate events, but can anyone confirm that it is only that?

3 UP Arabic gets thrown off when a sentence contains a preposition. Take the first sentence in the previous example and modify it to include a preposition before Iraq (in Arabic, it is written as "The US made an airstrike at Iraq"):

<?xml version='1.0' encoding='utf-8'?>
<Sentences>
<Sentence date="20000715" id="AFP_ARB_20000715.0015_1" sentence="True" source="afp">
<Text>
شنت أمريكا ضربة جوية  على العراق
</Text>
<Parse>1 شنت   شن VERB  VP-A-3FS--  Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act 0  root  _  _
2  امريكا   أمريكا   X  X---------  Foreign=Foreign   1  nsubj _  _
3  ضربة  ضربة  NOUN  N------S2I  Case=Gen|Definite=Ind|Number=Sing   2  nmod  _  _
4  جوية  جوي   ADJ   A-----FS2I  Case=Gen|Definite=Ind|Gender=Fem|Number=Sing 3  amod  _  _
5  على   على   ADP   P---------  AdpType=Prep   6  case  _  _
6  العراق   عراق  NOUN  N------S2D  Case=Gen|Definite=Def|Number=Sing   3  nmod  _  _
</Parse></Sentence>

</Sentences>

The output is

Event: 20000715	USA	---	124	AFP_ARB_20000715.0015_1	afp

It mismatches the airstrike (actually just the 'strike' part of the phrase, as 'airstrike' is two words in Arabic) and codes it as IRQ.

I will keep updating this issue as I test UP in Arabic further over the coming days.

@JingL1014
Collaborator

For issue 2: PETRwriter.py post-processes the events for each story using this function to make sure that there is only one unique (DATE, SRC, TGT, EVENT) tuple per story. This function is the same as the one in petrarch2.
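To make the effect concrete, here is a minimal sketch of that kind of per-story de-duplication. It is not the PETRwriter.py code: the function name `unique_events` and the input layout are hypothetical, and collecting the sentence ids per tuple is a choice made for illustration only.

```python
def unique_events(events):
    """Group a story's events by (date, source, target, code).

    events: list of (date, source, target, code, sentence_id) tuples.
    Returns a dict mapping each unique tuple to the sentence ids that produced it.
    """
    grouped = {}
    for date, src, tgt, code, sent_id in events:
        grouped.setdefault((date, src, tgt, code), []).append(sent_id)
    return grouped

# Assuming both sentences of the first example code to the same tuple,
# as the issue describes, only one unique event remains.
story = [
    ("20000715", "USA", "IRQ", "124", "AFP_ARB_20000715.0015_1"),
    ("20000715", "USA", "IRQ", "124", "AFP_ARB_20000715.0015_2"),
]
print(len(unique_events(story)))
```

Under this scheme, two events that differ in any field of the tuple (for example a target of IRQ versus ---) are both kept, so "Events generated" can exceed the number of events written only when tuples repeat within a story.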

@ahalterman
Member

@JingL1014 What about question 3? Why does it correctly code an event when there's no preposition, but fail to code the event correctly when there is a preposition? In the example @khaledJabr posted, it fails to identify Iraq as the target when a preposition is present. Is this related to #26?

@khaledJabr
Author

Here is a more detailed look at the third issue:

parsed file

<?xml version='1.0' encoding='utf-8'?>
<Sentences>
<Sentence date="20000715" id="AFP_ARB_20000715.0015_1" sentence="True" source="afp">
<Text>
شنت أمريكا ضربة جوية  على العراق
</Text>
<Parse>1	شنت	شن	VERB	VP-A-3FS--	Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act	0	root	_	_
2	امريكا	أمريكا	X	X---------	Foreign=Foreign	1	nsubj	_	_
3	ضربة	ضربة	NOUN	N------S2I	Case=Gen|Definite=Ind|Number=Sing	2	nmod	_	_
4	جوية	جوي	ADJ	A-----FS2I	Case=Gen|Definite=Ind|Gender=Fem|Number=Sing	3	amod	_	_
5	على	على	ADP	P---------	AdpType=Prep	6	case	_	_
6	العراق	عراق	NOUN	N------S2D	Case=Gen|Definite=Def|Number=Sing	3	nmod	_	_
</Parse></Sentence>

<Sentence date="20000715" id="AFP_ARB_20000715.0015_2" sentence="True" source="afp">
<Text>
شنت أمريكا ضربة جوية  العراق
</Text>
<Parse>1	شنت	شن	VERB	VP-A-3FS--	Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act	0	root	_	_
2	امريكا	أمريكا	X	X---------	Foreign=Foreign	1	nsubj	_	_
3	ضربة	ضربة	NOUN	N------S4I	Case=Acc|Definite=Ind|Number=Sing	1	dobj	_	_
4	جوية	جوي	ADJ	A-----FS2I	Case=Gen|Definite=Ind|Gender=Fem|Number=Sing	3	amod	_	_
5	العراق	عراق	NOUN	N------S2D	Case=Gen|Definite=Def|Number=Sing	3	nmod	_	_
</Parse></Sentence>


</Sentences>

Note: the above two sentences are exactly the same, except that in the second one the preposition (ADP) is removed.

Here is the debug output

petr_log    : INFO     Running
Using user-specified config: data/config/PETR_AR_config.ini
petr_log    : INFO     Using user-specified config: data/config/PETR_AR_config.ini

 use_Petrarch1_verb_dictionary = False
use_Petrarch2_verb_dictionary = True
new_actor_length = 0
stop_on_error = False
write_actor_root = False
write_actor_text = True
write_event_text = True
null_verbs = False
null_actors = False
require_dyad = False
code-by-sentence True
pause_by_sentence False
pause_by_story False
Comma-delimited clause elimination:
Initial : deactivated
Internal: min = 2    max = 8
Terminal: min = 2    max = 8
Internal Coding Ontology: arabic/PETR.Internal.Coding.Ontology.txt
petr_log    : INFO     Reading arabic/PETR.Internal.Coding.Ontology.txt
Verb dictionary: arabic/CAMEO.ar.0.2.txt
petr_log    : INFO     Reading arabic/CAMEO.ar.0.2.txt
Undefined synset &FIGHT_CRIM
Actor dictionaries: [u'arabic/actor_dict_ar_v2.txt']
Agent dictionary: [u'arabic/agents.all.ar_v2.txt']
petr_log    : INFO     Reading arabic/agents.all.ar_v2.txt

Discard dictionary: arabic/discards_ar.txt
petr_log    : INFO     Reading arabic/discards_ar.txt
Issues dictionary: arabic/issues_ar.txt
petr_log    : INFO     Reading arabic/issues_ar.txt



petr_log    : DEBUG    Incoming data from XML: 


Processing story AFP_ARB_20000715.0015

 AFP_ARB_20000715.0015_1
petr_log.getPhrase: DEBUG    extracting verb:شن
petr_log.get_source_target: DEBUG    find subj location:2
petr_log.getNP: DEBUG    noun:امريكا
petr_log.getNP: DEBUG    [5, 6]
petr_log.getNP: DEBUG    noun:ضربة جوية على العراق
petr_log.getNP: DEBUG    على العراق
petr_log.getNP: DEBUG    noun:العراق
petr_log.NPgetmeaning: DEBUG    npMainText:ضربة found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:ضربة جوية على العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:IRQ
petr_log.petrarch1: DEBUG    noun: ضربة code: IRQ
petr_log.NPgetmeaning: DEBUG    npMainText:العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:IRQ
petr_log.petrarch1: DEBUG    noun: العراق code: IRQ
petr_log.PETRgraph: DEBUG    finding code of verb:شن
petr_log.PETRgraph: DEBUG    match vp token:شن
petr_log.PETRgraph: DEBUG    124	قاوم	شن	شن	1
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    check event:2#-#1#0
petr_log.NPgetmeaning: DEBUG    npMainText:امريكا found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:امريكا ضربة جوية على العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:USA
petr_log.PETRgraph: DEBUG    source: امريكا code: USA
petr_log.PETRgraph: DEBUG    finding code of verb:شن
petr_log.PETRgraph: DEBUG    match vp token:شن
petr_log.PETRgraph: DEBUG    124	قاوم	شن	شن	1
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    Root verb:شن code:124 passive:False
petr_log.PETRgraph: DEBUG    ([u'USA'], ['---'], '124')
petr_log.PETRgraph: DEBUG    event transformation....
petr_log.PETRgraph: DEBUG    self.events: 0
petr_log    : DEBUG    check events of id:AFP_ARB_20000715.0015_1
petr_log    : DEBUG    event:2#-#1#0
petr_log    : DEBUG    ([u'USA'], ['---'], '124')
petr_log    : DEBUG    triplet:2#-#1#0
petr_log    : DEBUG    شن

 AFP_ARB_20000715.0015_2
petr_log.getPhrase: DEBUG    extracting verb:شن
petr_log.get_source_target: DEBUG    find subj location:2
petr_log.getNP: DEBUG    noun:امريكا
petr_log.getNP: DEBUG    [5]
petr_log.getNP: DEBUG    noun:ضربة جوية العراق
petr_log.getNP: DEBUG    nmod:5:العراق
petr_log.getNP: DEBUG    [5]
petr_log.getNP: DEBUG    noun:ضربة جوية العراق
petr_log.getNP: DEBUG    noun:العراق
petr_log.NPgetmeaning: DEBUG    npMainText:ضربة found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:ضربة جوية العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:IRQ
petr_log.petrarch1: DEBUG    noun: ضربة code: IRQ
petr_log.NPgetmeaning: DEBUG    npMainText:العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:IRQ
petr_log.petrarch1: DEBUG    noun: العراق code: IRQ
petr_log.PETRgraph: DEBUG    finding code of verb:شن
petr_log.PETRgraph: DEBUG    match vp token:شن
petr_log.PETRgraph: DEBUG    124	قاوم	شن	شن	1
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:ضربة#ضربة جوية العراق
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:ضربة#ضربة جوية العراق
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:ضربة#ضربة جوية العراق
petr_log.PETRgraph: DEBUG    check event:2#3#1#0
petr_log.NPgetmeaning: DEBUG    npMainText:امريكا found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:امريكا found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:USA
petr_log.PETRgraph: DEBUG    source: امريكا code: USA
petr_log.NPgetmeaning: DEBUG    npMainText:ضربة found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText:ضربة جوية العراق found_compound:False
petr_log.NPgetmeaning: DEBUG    npMainText meaning:IRQ
petr_log.PETRgraph: DEBUG    target: ضربة code: IRQ
petr_log.PETRgraph: DEBUG    finding code of verb:شن
petr_log.PETRgraph: DEBUG    match vp token:شن
petr_log.PETRgraph: DEBUG    124	قاوم	شن	شن	1
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:ضربة#ضربة جوية العراق
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:ضربة#ضربة جوية العراق
petr_log.PETRgraph: DEBUG    processing source:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:امريكا#امريكا
petr_log.PETRgraph: DEBUG    '*' matched
petr_log.PETRgraph: DEBUG    processing target:
petr_log.PETRgraph: DEBUG    mn-entry
petr_log.PETRgraph: DEBUG    noun:ضربة#ضربة جوية العراق
petr_log.PETRgraph: DEBUG    Root verb:شن code:124 passive:False
petr_log.PETRgraph: DEBUG    ([u'USA'], [u'IRQ'], '124')
petr_log.PETRgraph: DEBUG    event transformation....
petr_log.PETRgraph: DEBUG    self.events: 0
petr_log    : DEBUG    check events of id:AFP_ARB_20000715.0015_2
petr_log    : DEBUG    event:2#3#1#0
petr_log    : DEBUG    ([u'USA'], [u'IRQ'], '124')
petr_log    : DEBUG    triplet:2#3#1#0
petr_log    : DEBUG    شن

Summary:
Stories read: 1    Sentences coded: 2   Events generated: 2
Discards:  Sentence 0   Story 0   Sentences without events: 0
Average Coding time =  0.00654995441437
Event: 20000715	USA	IRQ	124	AFP_ARB_20000715.0015_2	afp
Event: 20000715	USA	---	124	AFP_ARB_20000715.0015_1	afp
Coding time: 0.0152440071106
Finished

@JingL1014
Collaborator

The difference is caused by the input parse tree. In the first sentence, the parser parsed tokens 2-6 as one noun phrase. In the second sentence, the parser parsed token 2 as one noun phrase and tokens 3-5 as another. In both sentences, the POS tag of "أمريكا" is incorrectly predicted as "X". I think this is because the word "أمريكا" does not appear in the training dataset that UDPipe used to train the Arabic parser. Does "X" appear a lot in the parse trees when you preprocess the Arabic sentences?

@khaledJabr
Author

I see the point you're making, but I am not sure the problem is only UDPipe. To test, I changed the word "أمريكا" (the US) to Iraq, and the problem is still the same. Here is what the parsed file looks like:

<?xml version='1.0' encoding='utf-8'?>
<Sentences>
<Sentence date="20000715" id="AFP_ARB_20000715.0015_1" sentence="True" source="afp">
<Text>
شنت العراق ضربة جوية  على العراق
</Text>
<Parse>1	شنت	شن	VERB	VP-A-3FS--	Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act	0	root	_	_
2	العراق	عراق	NOUN	N------S1D	Case=Nom|Definite=Def|Number=Sing	1	nsubj	_	_
3	ضربة	ضربة	NOUN	N------S2I	Case=Gen|Definite=Ind|Number=Sing	1	dobj	_	_
4	جوية	جوي	ADJ	A-----FS2I	Case=Gen|Definite=Ind|Gender=Fem|Number=Sing	3	amod	_	_
5	على	على	ADP	P---------	AdpType=Prep	6	case	_	_
6	العراق	عراق	NOUN	N------S2D	Case=Gen|Definite=Def|Number=Sing	3	nmod	_	_
</Parse></Sentence>

<Sentence date="20000715" id="AFP_ARB_20000715.0015_2" sentence="True" source="afp">
<Text>
شنت العراق ضربة جوية  العراق
</Text>
<Parse>1	شنت	شن	VERB	VP-A-3FS--	Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act	0	root	_	_
2	العراق	عراق	NOUN	N------S1D	Case=Nom|Definite=Def|Number=Sing	1	nsubj	_	_
3	ضربة	ضربة	NOUN	N------S2I	Case=Gen|Definite=Ind|Number=Sing	2	nmod	_	_
4	جوية	جوي	ADJ	A-----FS2I	Case=Gen|Definite=Ind|Gender=Fem|Number=Sing	3	amod	_	_
5	العراق	عراق	NOUN	N------S2D	Case=Gen|Definite=Def|Number=Sing	3	nmod	_	_
</Parse></Sentence>


</Sentences>
Regarding the question "Does "X" in the parsed tree appear a lot when you preprocess the Arabic sentences?":

I am not sure how often, but I do come across it every now and then.
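One way to put a number on this would be to count the UPOS column across a parsed file. The following is only a sketch: the function name `upos_counts` is hypothetical, it assumes the `<Sentences>/<Sentence>/<Parse>` layout shown in this thread with tab-separated CoNLL-U rows, and the sample rows use placeholder word forms rather than real data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def upos_counts(xml_text):
    """Count UPOS tags (4th CoNLL-U column) over every <Parse> element."""
    counts = Counter()
    for parse in ET.fromstring(xml_text).iter("Parse"):
        for line in (parse.text or "").splitlines():
            cols = line.strip().split("\t")
            # Keep only token rows: a numeric id and at least four columns.
            if len(cols) >= 4 and cols[0].isdigit():
                counts[cols[3]] += 1
    return counts

# Placeholder rows (id, form, lemma, UPOS); forms are not real tokens.
rows = [("1", "w1", "l1", "VERB"), ("2", "w2", "l2", "X"), ("3", "w3", "l3", "NOUN")]
sample = "<Sentences><Sentence><Parse>{}</Parse></Sentence></Sentences>".format(
    "\n".join("\t".join(r) for r in rows)
)
counts = upos_counts(sample)
print(counts["X"], sum(counts.values()))
```

Dividing the "X" count by the total token count gives the share of tokens the parser could not tag.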

@JingL1014
Collaborator

In the new sentences, "شنت العراق ضربة جوية على العراق" should generate the correct output, since its parse tree has a subject (token 2) and an object (tokens 3-6). But the sentence "شنت العراق ضربة جوية العراق" has only one noun phrase according to its parse tree. Are you using UDPipe 1.1.0 with language model version 1.4 from the GitHub repository? Could you try UDPipe 1.2.0 with language model version 2.0?
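This reading of the two parse trees can be checked mechanically. The sketch below is not UniversalPetrarch code: the function name `root_dependents` is hypothetical, and the (id, head, relation) triples are copied from the two parses in the previous comment. It lists each direct dependent of the root verb together with the token ids in its subtree.

```python
from collections import defaultdict

def root_dependents(rows):
    """rows: (token_id, head_id, deprel) triples from one CoNLL-U parse.

    Returns [(deprel, sorted subtree token ids)] for each child of the root.
    """
    children = defaultdict(list)
    deprel = {}
    root = None
    for tid, head, rel in rows:
        children[head].append(tid)
        deprel[tid] = rel
        if head == 0:
            root = tid

    def subtree(tid):
        ids = [tid]
        for child in children[tid]:
            ids.extend(subtree(child))
        return sorted(ids)

    return [(deprel[c], subtree(c)) for c in children[root]]

# Head and relation columns from the sentence with the preposition.
with_prep = [(1, 0, "root"), (2, 1, "nsubj"), (3, 1, "dobj"),
             (4, 3, "amod"), (5, 6, "case"), (6, 3, "nmod")]
# Head and relation columns from the sentence without the preposition.
without_prep = [(1, 0, "root"), (2, 1, "nsubj"), (3, 2, "nmod"),
                (4, 3, "amod"), (5, 3, "nmod")]

print(root_dependents(with_prep))     # [('nsubj', [2]), ('dobj', [3, 4, 5, 6])]
print(root_dependents(without_prep))  # [('nsubj', [2, 3, 4, 5])]
```

In the second parse the verb has a single dependent covering tokens 2-5, so there is no separate object phrase attached to the verb.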

@khaledJabr
Author

Good catch! Yes, that was the issue, and that fixed it!

@PTB-OEDA
Member

PTB-OEDA commented Jul 20, 2018 via email
