A dataset of Arabic words with English glosses, sourced from Wikimedia and annotated with maximal diacritization

CAMeL-Lab/CamelProp

CamelProp

This repository contains CP-WIKI-D3K, a dataset of 3,362 Arabic proper nouns from Wikipedia, each annotated with gold-standard lemma diacritizations and aligned with their English equivalents. It includes:

  1. The full dataset
  2. The postprocessing pipeline used to convert ChatGPT-4o outputs into final annotations, as described in the paper cited in footnote 1
  3. Markdown tables listing the examples used for few-shot and one-shot prompting
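
As a rough illustration of the kind of sanity check that is useful when working with diacritized annotations, the sketch below strips Arabic diacritics from a diacritized form and compares the result with the undiacritized word. This is not the repository's postprocessing pipeline: the function names `strip_diacritics` and `is_consistent` are hypothetical, and the check assumes that the undiacritized word and the annotation share the same base letters with no further normalization.

```python
import re

# Arabic diacritics: tanween, short vowels, shadda, sukun (U+064B to U+0652)
# plus the dagger alif (U+0670). Assumption: no other marks need removing.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with Arabic diacritical marks removed."""
    return DIACRITICS.sub("", text)


def is_consistent(undiacritized: str, diacritized: str) -> bool:
    """Check that removing diacritics from `diacritized` yields `undiacritized`."""
    return strip_diacritics(diacritized) == undiacritized


if __name__ == "__main__":
    # The name "Muhammad": bare letters, then a fully diacritized spelling,
    # written with escapes so each code point is explicit.
    plain = "\u0645\u062D\u0645\u062F"
    diacritized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
    print(is_consistent(plain, diacritized))
```

A check like this only flags annotations whose base letters differ from the source word; it says nothing about whether the diacritics themselves are correct.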

Footnotes

  1. Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash (2025). Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset. arXiv:2505.02656.
