TurkLang 2023 is the Eleventh International Conference on Computer Processing of Turkic Languages, a venue where scholars present research on computational approaches to analyzing and processing Turkic languages. Its proceedings, a 1030-page collection of research papers, are published by the Academy of Sciences of the Republic of Tatarstan in Kazan.
ISSUES OF KYRGYZ SYNTACTIC ANNOTATION WITHIN THE
UNIVERSAL DEPENDENCIES FRAMEWORK
Aida Kasieva, Gulnura Dzhumalieva, Anna Thompson,
Murat Jumashev, Bermet Chontaeva, Jonathan Washington
This paper examines key issues encountered in syntactic annotation work
for a forthcoming Universal Dependencies (UD) corpus of Kyrgyz. It presents
an overview of the corpus creation process, including sentence sampling from
the Manas-UdS Kyrgyz corpus and manual annotation using UD guidelines. The
corpus contains over 1600 tokens across 230 sentences sampled from literary and
news domains. Four central issues in Kyrgyz UD annotation are then discussed in
depth: copula tokenization, categorization of “small words” like да and керек,
null-headed clauses (including relative clauses and -DAGI and -NIKI constructions),
and differentiating inflection from derivation. For each issue, multiple analysis
options are weighed and contrasted with the approaches taken in prior Turkic UD
treebanks. The copula analysis compares treating subject agreement morphology as
dependent subtokens versus independent words. The discourse and intensifier functions of да
are examined to determine optimal POS and dependency labels. Strategies for
representing implicit nominal heads in relative clauses and genitive constructions
are evaluated. Criteria for categorizing productive derivational morphology as
inflectional cases versus separate words are outlined. Throughout, examples with
dependency graphs illustrate the annotation decisions. Comparisons are made to the analysis
of related phenomena in existing UD treebanks for Kazakh [Tyers & Washington
2015, Makazhanov et al. 2015], Turkish, and the small Kyrgyz UD corpus [Benli
2023]. The work identifies ongoing challenges in representing Kyrgyz syntax
within UD, while developing an improved annotated resource. It highlights issues
where UD guidelines exhibit limitations for Turkic languages, providing analysis
to advance understanding of best practices for Kyrgyz and related languages.
Keywords: Kyrgyz, syntax, annotation, Universal Dependencies.
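
The copula tokenization question affects how tokens are counted, because the CoNLL-U format used for UD treebanks distinguishes surface tokens from syntactic words through multiword-token ranges. The following is a minimal sketch of that distinction, not the authors' tooling. The sample sentence, its POS tags, and its dependency labels are illustrative assumptions and are not taken from the corpus; the function name `count_units` is hypothetical. The sample shows only the "dependent subtoken" option, in which an agreement suffix is split off as its own syntactic word.

```python
# Illustrative CoNLL-U fragment (an assumption, NOT from the corpus described above).
# The range line "2-3" marks one surface token split into two syntactic words.
SAMPLE = (
    "# text = Мен мугалиммин.\n"
    "1\tМен\tмен\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2-3\tмугалиммин\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "2\tмугалим\tмугалим\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "3\tмин\t_\tAUX\t_\t_\t2\tcop\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def count_units(conllu: str) -> dict:
    """Count sentences, surface tokens, and syntactic words in CoNLL-U text."""
    sentences = 0
    surface_tokens = 0
    syntactic_words = 0
    covered_until = 0      # highest word ID covered by a multiword-token range
    in_sentence = False

    for raw in conllu.splitlines():
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if in_sentence:
                sentences += 1
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."

        token_id = line.split("\t")[0]
        in_sentence = True
        if "-" in token_id:
            # Multiword-token range line, e.g. "2-3": one surface token.
            _, end = token_id.split("-")
            surface_tokens += 1
            covered_until = int(end)
        elif "." in token_id:
            continue  # empty node (e.g. "1.1"): neither a token nor a word here
        else:
            syntactic_words += 1
            # Words outside any range are also surface tokens in their own right.
            if int(token_id) > covered_until:
                surface_tokens += 1

    if in_sentence:  # input that does not end with a blank line
        sentences += 1
    return {
        "sentences": sentences,
        "surface_tokens": surface_tokens,
        "syntactic_words": syntactic_words,
    }


if __name__ == "__main__":
    print(count_units(SAMPLE))
```

Under the alternative analysis, with no range line and the suffix left inside a single word, the same sentence would yield three syntactic words instead of four, so reported corpus sizes depend on which tokenization is adopted.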