A list of Natural Language Processing resources for Moroccan Arabic (Darija)
- An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction
- Moroccan Dialect -Darija- Open Dataset
- Building the Moroccan Darija WordNet (MDW) using Bilingual Resources
- MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script
- Improving Sentiment Analysis of Moroccan Tweets Using Ensemble Learning
- MSTD: Moroccan Sentiment Twitter Dataset
- Standard and Dialectal Arabic Text Classification for Sentiment Analysis
- ASA: A framework for Arabic sentiment analysis
- An Arabic-Moroccan Darija Code-Switched Corpus
- Goud.ma: a News Article Dataset for Summarization in Moroccan Darija
- Automatic Text Summarization for Moroccan Arabic Dialect Using an Artificial Intelligence Approach
- Diacritization of Maghrebi Arabic Sub-Dialects
- Building a language model for Moroccan Darija using fastai
- Finetuning DziriBERT for Dialect Detection
- Moroccan Darija Wikipedia: Basics of Natural Language Processing for a Low-Resource Language: A workshop presented at AMLD Africa 2021.
- Modeling, Simulation and Data Analysis (MSDA) Datasets: Includes a dataset of 50k tweets from 5 countries, including Morocco, labeled for sentiment analysis, topic detection, and dialect detection.
- Darija Open Dataset (DODA): An open-source project for building a dataset of Darija-English vocabulary.
- DVOICE: A Darija audio dataset of audio files and their corresponding text.
- Darija Wikipedia articles
- Moroccan News and Comments from Hespress
- Moroccan Sentiment Analysis corpus
- ElecMorocco2016: A sentiment analysis dataset of Arabic Facebook comments about the 2016 Moroccan elections.
- Goud-sum: A text summarization dataset of 158k examples.
- Arabic POS dialect: A dialectal Arabic POS tagging dataset with 350 manually segmented and POS-tagged tweets for each of 4 dialects: Egyptian, Levantine, Gulf, and Maghrebi.
- DarijaBERT: A BERT-base model trained on ~3 million Darija sequences.
- Goud-summarization: Text summarization models trained on Goud-sum.
- t5-darija-summarization: T5 model for Darija text summarization.