This repository has been archived by the owner on Aug 19, 2020. It is now read-only.
paulrodrigues edited this page Jul 8, 2015 · 8 revisions

Arabic-specific

The good:

  • Root-and-pattern morphology makes semantically related words easy to identify, because they share a consonantal root. A typical Arabic NLP pipeline first runs the text through a morphological analyzer, which breaks each word into its component pieces and reports their parts of speech.

k-t-b

  • kataba (he wrote)
  • maktab (office)
  • kitab (book)

d-r-s

  • darasa (he studied)
  • madrasa (a place of studying)
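
The root sharing shown above can be sketched in a few lines. This is a toy illustration on the Latin transliterations used on this page, not a morphological analyzer: it assumes the candidate roots are already known, and it only checks that the root consonants occur in order inside the word, which will over-match on real text. The function names are made up for this example.

```python
from collections import defaultdict


def contains_root(word, root):
    """Return True if the root's consonants occur in `word`, in order."""
    pos = 0
    for consonant in root.split("-"):
        pos = word.find(consonant, pos)
        if pos == -1:
            return False
        pos += len(consonant)
    return True


def group_by_root(words, roots):
    """Map each known root to the words that contain it."""
    groups = defaultdict(list)
    for word in words:
        for root in roots:
            if contains_root(word, root):
                groups[root].append(word)
    return dict(groups)


words = ["kataba", "maktab", "kitab", "darasa", "madrasa"]
groups = group_by_root(words, ["k-t-b", "d-r-s"])
print(groups)
```

A real analyzer such as the ones listed under "Arabic NLP tools" works from lexicons and handles prefixes, suffixes, and weak roots, none of which this sketch attempts.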

The bad:

  • Root-and-pattern morphology may cause problems in a character-based NN system. We saw one system tested on English and Chinese. English and Chinese both have concatenative systems where "morphemes" (bits of the word that have meaning) are added to the end of the word (dog+s=dogs). In Arabic, bits of meaning can be interspersed within the root (k-t-b+a-a=kataba). A possible hack is to shift these characters to the end of the word and treat them like a suffix (ktbaa). The right way to do it is probably an NN architecture with short-term memory.
  • Short vowels are optional in written Arabic. In formal (religious) text they are written, but in informal writing they usually are not. The vowels signal word sense, tense, and aspect, but most NLP researchers just strip them out during normalization. So while a human reader would be able to distinguish kataba (he wrote) from kutub (books), to a program they may both look like ktb.
  • Prepositions, determiners, conjunctions, and other word types may be attached to the word as prefixes. An Arabic tokenizer is typically used to separate these off.
  1. bimaktab (in an office - bi+maktab)
  2. almaktab (the office - al+maktab)
  3. bilmaktab (in the office - bi+al+maktab)
  • Informal Arabic is sometimes written in Latin script (20-35%), using an unstandardized alphabet. This text has to be converted back (normalized) to Arabic script before processing. So when you're collecting "Arabic" on Twitter, understand that you're only catching ~65%-80% of what's out there.
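
The issues in the list above can be made concrete with a rough preprocessing sketch. Everything here is an illustrative assumption, not how MADAMIRA or any other tool works: the function names are invented, the proclitic table covers only the bi+/al+ examples from this page, the stem set stands in for a real lexicon, and the script check only looks at basic Latin letters and the main Arabic Unicode block.

```python
import unicodedata

# Surface form -> underlying proclitic, in Latin transliteration. This tiny
# table is an assumption built from the bi+/al+ examples on this page.
PROCLITICS = {"bi": "bi", "al": "al", "l": "al"}


def shift_pattern_to_suffix(word, root):
    """The 'hack': move every character that is not part of the root to the end."""
    remaining = root.split("-")
    stem, rest = [], []
    for ch in word:
        if remaining and ch == remaining[0]:
            stem.append(ch)
            remaining.pop(0)
        else:
            rest.append(ch)
    return "".join(stem + rest)


def strip_diacritics(text):
    """Drop combining marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def split_proclitics(word, stems):
    """Greedily peel known prefixes until a known stem remains."""
    prefixes, rest = [], word
    while rest not in stems:
        for surface, underlying in PROCLITICS.items():
            if rest.startswith(surface) and len(rest) > len(surface):
                prefixes.append(underlying)
                rest = rest[len(surface):]
                break
        else:
            return [word]  # no analysis found; leave the token whole
    return prefixes + [rest]


def script_of(token):
    """Classify a token as 'arabic', 'latin', or 'other' by its characters."""
    if any("\u0600" <= ch <= "\u06FF" for ch in token):
        return "arabic"
    if any("a" <= ch.lower() <= "z" for ch in token):
        return "latin"
    return "other"


KTB = "\u0643\u062A\u0628"                           # k-t-b, no vowel signs
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"      # kataba, fully vowelled
KUTUB = "\u0643\u064F\u062A\u064F\u0628"             # kutub, vowelled

print(shift_pattern_to_suffix("kataba", "k-t-b"))
print(strip_diacritics(KATABA) == strip_diacritics(KUTUB) == KTB)
print(split_proclitics("bilmaktab", {"maktab"}))
print(script_of("ktb"), script_of(KTB))
```

Because the full form kataba carries three a's, the shifting function returns ktbaaa rather than the abbreviated ktbaa used in the prose. The proclitic splitter is greedy and never backtracks, so it is only safe with a stem list; a context-aware tokenizer is the real answer.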

Arabic NLP tools

MADAMIRA Morphologically analyzes and tokenizes in one pass, optimizing the two processes together and ranking the candidate segmentations by likelihood given the word's context. http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html http://www.lrec-conf.org/proceedings/lrec2014/pdf/593_Paper.pdf

Standard Arabic Morphological Analyzer (SAMA) (renamed from Buckwalter Arabic Morphological Analyzer (BAMA)) Lexicons (dictionaries) and a Perl script that output all possible morphological analyses of a word.
https://catalog.ldc.upenn.edu/LDC2010L01

AraMorph Open source product similar to SAMA, written in Java. (I haven't used it.) http://www.nongnu.org/aramorph/english/download.html

Arabic Sentiment Datasets


ArSenL http://www.oma-project.com

“We are pleased to announce the release of the first public large scale ARabic SENtiment Lexicon (ArSenL) version 1.0, as part of the OMA (Opinion Mining for Arabic) project. ArSenL includes 157,969 entries corresponding to 28,780 lemmas with three sentiment scores: positive, negative and objective, whose sum is equal to 1….”

More details can be found in: G. Badaro; R. Baly; H. Hajj; N. Habash and W. El-Hajj; "A Large Scale Arabic Sentiment Lexicon for Arabic Opinion Mining" in Proceedings of ANLP 2014, EMNLP 2014 Doha, Qatar
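
As a minimal sketch of how a lexicon with three scores per lemma might be consumed, the snippet below models an entry and checks the sum-to-1 property described in the announcement. The class, the function, the transliterated lemma keys, and the score values are all invented placeholders; the actual ArSenL file format and scores are not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    lemma: str
    positive: float
    negative: float
    objective: float

    def __post_init__(self):
        # The three scores are described as summing to 1.
        total = self.positive + self.negative + self.objective
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            raise ValueError(f"scores for {self.lemma!r} sum to {total}, not 1")


def polarity(lemmas, lexicon):
    """Average (positive - negative) over the lemmas found in the lexicon."""
    scores = [
        lexicon[lemma].positive - lexicon[lemma].negative
        for lemma in lemmas
        if lemma in lexicon
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Placeholder entries with made-up scores, keyed by transliterated lemma.
lexicon = {
    "jamil": LexiconEntry("jamil", 0.75, 0.0, 0.25),
    "sayyi": LexiconEntry("sayyi", 0.125, 0.625, 0.25),
}

print(polarity(["jamil", "kitab", "sayyi"], lexicon))
```

Lemmas missing from the lexicon are skipped, so the input should already be lemmatized, for example by one of the analyzers listed under "Arabic NLP tools".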


Twitter Data set for Arabic Sentiment Analysis https://archive.ics.uci.edu/ml/datasets/Twitter+Data+set+for+Arabic+Sentiment+Analysis

“2000 labelled tweets (1000 positive tweets and 1000 negative ones) on various topics such as: politics and arts. These tweets include opinions written in both Modern Standard Arabic (MSA) and the Jordanian dialect.”

Abdulla N. A., Mahyoub N. A., Shehab M., Al-Ayyoub M., Arabic Sentiment Analysis: Corpus-based and Lexicon-based, IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT 2013), December 3-12, 2013, Amman, Jordan.


OCA: Opinion corpus for Arabic http://sinai.ujaen.es/en/?s=oca&submit=Search http://repository.dlsi.ua.es/694/1/21598_ftp.pdf

Movies, star ratings…

Rushdi-Saleh, Mohammed and Martín-Valdivia, M.T. and Ureña-López, L.A. and Perea-Ortega, José M. (2011) OCA: Opinion corpus for Arabic. Journal of the American Society for Information Science and Technology. ISSN 1532-2890


Large Arabic Book Reviews Corpus http://www.mohamedaly.info/datasets/labr

“This a set of Arabic book reviews containing over 63,000 reviews. … downloaded from www.goodreads.com during the month of March 2013.”

Book star ratings… not terribly useful, with other resources now available.


AWATIF / ARABSENTI http://lrec.elra.info/proceedings/lrec2012/pdf/1057_Paper.pdf http://ella.slis.indiana.edu/~mabdulma/#software

Not available online (talk to Paul). Words from newswire, Wikipedia, and web forums, labeled {POS, NEG, NEUT}.


Linguistic Inquiry and Word Count (LIWC) – Arabic Dictionary http://www.liwc.net/ Translated from English, and not empirically validated.


The Arabic Modality Lexicon (AML) v1.0 http://www.rania-alsabbagh.com/publications-and-resources.html Direct Link: http://www.rania-alsabbagh.com/amlv1.html

“The Arabic Modality Lexicon (AML) v1.0 is a manually compiled lexicon of Arabic modality triggers (i.e. words and phrases that convey modality)…. 7535 entries…”

Rania Al-Sabbagh, Jana Diesner and Roxana Girju. (2013). Using the Semantic-Syntactic Interface for Reliable Arabic Modality Annotation.


CASL's Tweet dataset (Talk to Paul)