
Commit

Add DOIs for Tharsen & Wang citations
rlskoeser committed Oct 4, 2023
1 parent 96d317e commit 5df6b98
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion content/issues/4/sonorous-medieval/index.md
@@ -520,7 +520,7 @@ This work was made possible through our participation in the “[New Languages f

[^15]: See David B. Honey's translation of Lu Deming's biography from the *New Tang History* (*Xin Tang shu* 新唐書) in *Decline of Factual Philology*, 216--19.

- [^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian shiwen* systematically as a database; see their Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (paper, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016). We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful for Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*.
+ [^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian Shiwen* systematically as a database; see Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian Shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (poster, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015. [https://doi.org/10.6082/uchicago.8367](https://doi.org/10.6082/uchicago.8367)); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016. [https://doi.org/10.6082/uchicago.8368](https://doi.org/10.6082/uchicago.8368)). We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian Shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful for Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian Shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*.

[^17]: For comparison, in terms of file size, the modern "gzip" algorithm compresses the entirety of the same corpus with a ratio of about 3:1.
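A compression ratio like the 3:1 figure in footnote 17 can be measured with Python's standard `gzip` module. The sketch below is illustrative only: the sample string stands in for the corpus, which is not reproduced here, and real text will compress less dramatically than this repetitive example.

```python
import gzip


def compression_ratio(text: str) -> float:
    """Return the ratio of uncompressed to gzip-compressed size for a UTF-8 text."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)


# Illustrative stand-in for a corpus; highly repetitive text compresses far
# better than the roughly 3:1 reported for the actual corpus.
sample = "之乎者也" * 10000
print(f"{compression_ratio(sample):.1f}:1")
```

The ratio depends on the redundancy of the input, so the same measurement run on different digitized editions of a text can differ noticeably.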

