This repository contains CP-WIKI-D3K, a dataset of 3,362 Arabic proper nouns from Wikipedia, each annotated with gold-standard lemma diacritizations and aligned with their English equivalents. It includes:
- The full dataset
- The postprocessing pipeline used to convert ChatGPT-4o outputs into final annotations, as described in the accompanying paper [1]
- Markdown tables listing the examples used for few-shot and one-shot prompting
Footnotes
1. Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset. Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash (2025). arXiv:2505.02656