CamelProp

This repository contains CP-WIKI-D3K, a dataset of 3,362 Arabic proper nouns from Wikipedia, each annotated with gold-standard lemma diacritizations and aligned with their English equivalents. It includes:

- The full dataset
- The postprocessing pipeline used to convert ChatGPT-4o outputs into final annotations, as described in ¹
- Markdown tables listing the examples used for few-shot and one-shot prompting

¹ Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash (2025). Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset. arXiv:2505.02656.
