From 5df6b985c100d025757aae75795412ea49597228 Mon Sep 17 00:00:00 2001
From: rlskoeser
Date: Wed, 4 Oct 2023 14:38:40 -0400
Subject: [PATCH] Add DOIs for Tharsen & Wang citations

---
 content/issues/4/sonorous-medieval/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md
index 814f5744..6dacc552 100644
--- a/content/issues/4/sonorous-medieval/index.md
+++ b/content/issues/4/sonorous-medieval/index.md
@@ -520,7 +520,7 @@ This work was made possible through our participation in the “[New Languages f
 
 [^15]: See David B. Honey's translation of Lu Deming's biography from the *New Tang History* (*Xin Tang shu* 新唐書) in *Decline of Factual Philology*, 216--19.
 
-[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian shiwen* systematically as a database; see their Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (paper, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016). We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful for Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*.
+[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian Shiwen* systematically as a database; see Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian Shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (poster, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015. [https://doi.org/10.6082/uchicago.8367](https://doi.org/10.6082/uchicago.8367)); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016. [https://doi.org/10.6082/uchicago.8368](https://doi.org/10.6082/uchicago.8368)). We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian Shiwen* purely from the perspective of training a Natural Language Processing model. Our approach therefore uses a different labeling schema, focused on NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful to Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach with ours. As part of our approach, we understand the *Jingdian Shiwen* as a dictionary, and we identify its semi-structured dictionary form as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently: it can only learn from the dictionary in the same way it learns from a sequential text like *Moby-Dick*.
 
 [^17]: For comparison, in terms of file size, the modern "gzip" algorithm compresses the entirety of the same corpus with a ratio of about 3:1.