Commit

Merge pull request #380 from Princeton-CDH/issue-1-a
Issue 1 revision 1
gwijthoff authored Oct 4, 2023
2 parents 926d25c + 519d3e7 commit 8d5765a
Showing 3 changed files with 39 additions and 6 deletions.
17 changes: 11 additions & 6 deletions content/issues/4/sonorous-medieval/index.md
@@ -8,8 +8,8 @@ authors:
- BudakNick
- RomingerGian
date: 2023-10-02
-doi: 10.5281/zenodo.8380842
-pdf: https://zenodo.org/record/8380842/files/startwords-4-sonorous-medieval.pdf
+doi: 10.5281/zenodo.8380841
+pdf: https://zenodo.org/record/8408357/files/startwords-4-sonorous-medieval.pdf
images: ["issues/4/sonorous-medieval/images/sonorous-medieval-social.png"]
summary: A distinctive set of challenges arises when training machines to process a historical language, especially one that was last spoken two millennia ago.
hook_height_override: 175
@@ -484,6 +484,11 @@ In using historical secondary sources for training an NLP model, we have found t

Our project continues the practice of reading classical texts as data, but with a different aim than previous iterations of this practice: our goal is to produce a machine-learning algorithm. Some of the outlined steps that we have used to parse the *Jingdian Shiwen* are still works in progress, and additionally, some of the most difficult work remains to be approached. Constructing a statistical model that can represent all the complexities of current hypotheses regarding Old Chinese syllables will strain the limits of contemporary NLP platforms, most of which have no concept whatsoever of phonology. Our aim here is to persuade our peers that this task, and others like it in other languages, is not just possible but wholly worthwhile for researchers in the humanities. For this reason, we believe that further collaboration --- with digital humanists, philologists, and others interested in expanding the debates around ancient texts to incorporate sound --- is one of the most generative approaches to making use of NLP frameworks in the study of ancient texts.

+## Acknowledgments
+
+This work was made possible through our participation in “[New Languages for NLP: Building Linguistic Diversity in the Digital Humanities](https://newnlp.princeton.edu/),” a National Endowment for the Humanities Institute for Advanced Topics in the Digital Humanities. Our thanks to the organizers, Natalia Ermolaev and Andrew Janco, and to Toma Tasovac, Quinn Dombrowski, and David Lassner. We also thank Gissoo Doroudian and Rebecca Sutton Koeser for their design and implementation of the figures in this article, in a lively exchange with Nick Budak. Figure 4 was further based on the work of Jeffrey R. Tharsen, whose work more generally has inspired many of the thoughts in this article. Lastly, thank you to Grant Wythoff for his erudite editorial work.


[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).

[^2]: This phrase refers to the idea that in order to get a sense of the meaning of a target word, one need only carefully select from among its surrounding context words; it draws from the title of Ashish Vaswani et al., "Attention Is All You Need," *NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems* (Red Hook: Curran Associates, 2017): 6000--10, [https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). "Attention" refers specifically to an algorithmic method of calculating the importance of context words relative to the target.
@@ -507,19 +512,19 @@ Our project continues the practice of reading classical texts as data, but with

[^11]: Compare, for example, Martin Kern, "The *Odes* in Excavated Manuscripts," in *Text and Ritual in Early China* (Seattle: University of Washington Press, 2005), 149--193, esp. 171. For broader overviews of the developments of Chinese writing in antiquity, compare Xigui Qiu, Gilbert Louis Mattos, and Jerry Norman, *Chinese Writing* (Berkeley: Society for the Study of Early China, 2000); William G. Boltz, *The Origin and the Development of the Chinese Writing System* (New Haven: American Oriental Society, 2003); and Imre Galambos, *Orthography of Early Chinese Writing: Evidence from Newly Excavated Manuscripts* (Budapest: Department of East Asian Studies, Eötvös Loránd University, 2006).

-[^12]: Compare David Schaberg, "Speaking of Documents: *Shu* Citations in Warring States Texts," in *Origins of Chinese Political Philosophy*, ed. Martin Kern and Dirk Meyer (Leiden: Brill, 2017), 320--59. Compare also recent arguments regarding the applicability of the concept of *mouvance* to early Chinese textual phenomena; see Martin Kern, "'Xi Shuai' 蟋蟀 ('Cricket') and Its Consequences: Issues in Early Chinese Poetry and Textual Studies," *Early China* 42 (2019): 39--74, esp. 56--62, [https://doi.org/10.1017/eac.2019.1](https://doi.org/10.1017/eac.2019.1); compare also Dirk Meyer, *Documentation and Argument in Early China: The* Shangshu *(Venerated Documents) and the* Shu *Traditions* (Berlin: De Gruyter, 2021), 17--19.
+[^12]: Compare David Schaberg, "Speaking of Documents: *Shu* Citations in Warring States Texts," in *Origins of Chinese Political Philosophy*, ed. Martin Kern and Dirk Meyer (Leiden: Brill, 2017), 320--59. Compare also recent arguments regarding the applicability of the concept of *mouvance* to early Chinese textual phenomena; see Martin Kern, "'Xi Shuai' 蟋蟀 ('Cricket') and Its Consequences: Issues in Early Chinese Poetry and Textual Studies," *Early China* 42 (2019): 39--74, esp. 56--62, [https://doi.org/10.1017/eac.2019.1](https://doi.org/10.1017/eac.2019.1); compare also Dirk Meyer, *Documentation and Argument in Early China: The* Shangshu *(Venerated Documents) and the* Shu *Traditions* (Berlin: De Gruyter, 2021), 17--19.

-[^13]: The Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55.
+[^13]: For the underlying logic of representing early Chinese texts through visualizations of their phonological features, and for a fuller aural representation of the poem “Guan ju” upon which figure 4 is largely based, see Jeffrey R. Tharsen, “From Form to Sound 自形至聲: Visual and Aural Representations of Premodern Chinese Phonology and Phonorhetoric with Applications for Phonetic Scripts,” *International Journal of Digital Humanities* 4 (2023), 115–129, https://doi.org/10.1007/s42803-022-00053-8. Further note that the Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55.

[^14]: For more background on different premodern Chinese lexicons, compare Zev Handel, "Early Lexicons," in *Literary Information in China: A History*, ed. Jack W. Chen et al., (New York: Columbia University Press, 2021), 53--64, esp. 60--61 for the *Qieyun* and sound-based lexicons; and also Victor H. Mair, "*Tzu-shu* 字書 or *tzu-tien* 字典 [Dictionaries]," in *The Indiana Companion to Traditional Chinese Literature*, ed. William H. Nienhauser, vol. 2 (Bloomington: Indiana University Press, 1998); for more on the *Jingdian Shiwen*, see David B. Honey, *Northern and Southern Dynasties, Sui, and Early Tang: The Decline of Factual Philology and the Rise of Speculative Hermeneutics*, vol. 3 of *A History of Classical Chinese Scholarship* (Washington: Academica Press, 2021), 215--20.

[^15]: See David B. Honey's translation of Lu Deming's biography from the *New Tang History* (*Xin Tang shu* 新唐書) in *Decline of Factual Philology*, 216--19.

-[^16]: In a way, the *Jingdian Shiwen* can be understood as a dictionary. This is also why an approach relying solely on the Transformer architecture will be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*.
+[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian Shiwen* systematically as a database; see Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian Shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (poster, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015. [https://doi.org/10.6082/uchicago.8367](https://doi.org/10.6082/uchicago.8367)); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016. [https://doi.org/10.6082/uchicago.8368](https://doi.org/10.6082/uchicago.8368)). We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian Shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful to Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian Shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*.

[^17]: For comparison, in terms of file size, the modern "gzip" algorithm compresses the entirety of the same corpus with a ratio of about 3:1.

-[^18]: This more abstract understanding of phonology may have reached Chinese scholars by way of Sanskrit and Indian linguistics, which had gained relevance with the increasing institutionalization of Chinese Buddhism in the sixth and seventh centuries; compare Mair, “*Tzu-shu* 字書,” 168. 
+[^18]: This more abstract understanding of phonology may have reached Chinese scholars by way of Sanskrit and Indian linguistics, which had gained relevance with the increasing institutionalization of Chinese Buddhism in the sixth and seventh centuries; compare Mair, “*Tzu-shu* 字書,” 168.

[^19]: "About Kanseki Repository," Kanripo, last accessed August 21, 2023, [https://www.kanripo.org/](https://www.kanripo.org/).

14 changes: 14 additions & 0 deletions themes/startwords/assets/scss/article/_single.scss
@@ -78,10 +78,22 @@ body.article {

.formats a { margin-right: rem(5px); }
}

+h2#acknowledgments {
+  margin-top: rem(120px);
+  font-size: rem(18px);
+  text-align: center;
+}
+#acknowledgments ~ p {
+  font-size: rem(16px);
+}

}

}



// common styles for articles and content pages
body.article article, body.page article {

@@ -328,8 +340,10 @@ details.code {
background-color: #272822;
}
}

}


/* fallback content included to be shown in TXT version only */
.txt-only {
display: none;
14 changes: 14 additions & 0 deletions themes/startwords/assets/scss/print.scss
@@ -370,6 +370,20 @@ iframe {
}
}

+body.article article {
+  h2#acknowledgments {
+    font-size: 14px;
+    text-align: center;
+  }
+  #acknowledgments ~ p {
+    font-size: 12px;
+  }
+  #acknowledgments ~ p:last-of-type {
+    margin-bottom: 120px;
+  }
+}


/* table styles */
body.article article table {
