From e45a07b97d7fe5f70f4800510866d3274929d7a0 Mon Sep 17 00:00:00 2001 From: rlskoeser Date: Tue, 3 Oct 2023 18:52:00 -0400 Subject: [PATCH 1/8] Adjust font size and spacing for acknowledgments --- themes/startwords/assets/scss/article/_single.scss | 14 ++++++++++++++ themes/startwords/assets/scss/print.scss | 14 ++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/themes/startwords/assets/scss/article/_single.scss b/themes/startwords/assets/scss/article/_single.scss index 53b1dedd..4fa603f3 100644 --- a/themes/startwords/assets/scss/article/_single.scss +++ b/themes/startwords/assets/scss/article/_single.scss @@ -78,10 +78,22 @@ body.article { .formats a { margin-right: rem(5px); } } + + h2#acknowledgments { + margin-top: rem(120px); + font-size: rem(18px); + text-align: center; + } + #acknowledgments ~ p { + font-size: rem(16px); + } + } } + + // common styles for articles and content pages body.article article, body.page article { @@ -328,8 +340,10 @@ details.code { background-color: #272822; } } + } + /* fallback content included to be shown in TXT version only */ .txt-only { display: none; diff --git a/themes/startwords/assets/scss/print.scss b/themes/startwords/assets/scss/print.scss index 905e1160..17e23249 100644 --- a/themes/startwords/assets/scss/print.scss +++ b/themes/startwords/assets/scss/print.scss @@ -370,6 +370,20 @@ iframe { } } +body.article article { + h2#acknowledgments { + font-size: 14px; + text-align: center; + } + #acknowledgments ~ p { + font-size: 12px; + } + #acknowledgments ~ p:last-of-type { + margin-bottom: 120px; + } +} + + /* table styles */ body.article article table { From df2afdf8da91d891fa67b55eecbd96b0c8431c3c Mon Sep 17 00:00:00 2001 From: rlskoeser Date: Tue, 3 Oct 2023 18:58:42 -0400 Subject: [PATCH 2/8] Add acknowledgements and expand footnotes --- content/issues/4/sonorous-medieval/index.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git 
a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index 139e1b4f..79f2ca01 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -484,6 +484,16 @@ In using historical secondary sources for training an NLP model, we have found t Our project continues the practice of reading classical texts as data, but with a different aim than previous iterations of this practice: our goal is to produce a machine-learning algorithm. Some of the outlined steps that we have used to parse the *Jingdian Shiwen* are still works in progress, and additionally, some of the most difficult work remains to be approached. Constructing a statistical model that can represent all the complexities of current hypotheses regarding Old Chinese syllables will strain the limits of contemporary NLP platforms, most of which have no concept whatsoever of phonology. Our aim here is to persuade our peers that this task, and others like it in other languages, is not just possible but wholly worthwhile for researchers in the humanities. For this reason, we believe that further collaboration --- with digital humanists, philologists, and others interested in expanding the debates around ancient texts to incorporate sound --- is one of the most generative approaches to making use of NLP frameworks in the study of ancient texts. +## Acknowledgments + +This work was made possible through our participation in the “[New Languages for NLP: Building Linguistic Diversity in the Digital Humanities](https://newnlp.princeton.edu/),” a National Endowment for Humanities Institute for Advanced Topics in the Digital Humanities. Our thanks to the organizers, Natalia Ermolaev and Andrew Janco, and to Toma Tasovac, Quinn Dombrowski, and David Lassner. + +We also thank Gissoo Doroudian and Rebecca Sutton Koeser, for their design and implementation of the figures of this article, in a lively exchange with Nick Budak. 
Figure 4 was further inspired by the work of Jeffrey R. Tharsen, whose work more generally has inspired many of the thoughts in this article. + +Lastly, thank you to Grant Wythoff for his erudite editorial work. + + + [^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922). [^2]: This phrase is referring to the idea that in order to get a sense of the meaning of a target word, one need only carefully to select from among its surrounding context words; it draws from the title of Ashish Vaswani et al., "Attention Is All You Need," *NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems* (Red Hook: Curran Associates, 2017): 6000--10, [https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).  "Attention" refers specifically to an algorithmic method of calculating the importance of context words relative to the target. @@ -507,19 +517,19 @@ Our project continues the practice of reading classical texts as data, but with [^11]: Compare, for example, Martin Kern, "The *Odes* in Excavated Manuscripts," in *Text and Ritual in Early China* (Seattle: University of Washington Press, 2005), 149--193, esp. 171. For broader overviews of the developments of Chinese writing in antiquity, compare Xigui Qiu, Gilbert Louis Mattos, and Jerry Norman, *Chinese Writing* (Berkeley: Society for the Study of Early China, 2000); William G. 
Boltz, *The Origin and the Development of the Chinese Writing System* (New Haven: American Oriental Society, 2003); and Imre Galambos, *Orthography of Early Chinese Writing: Evidence from Newly Excavated Manuscripts* (Budapest: Department of East Asian Studies, Eötvös Loránd University, 2006). -[^12]: Compare David Schaberg, "Speaking of Documents: *Shu* Citations in Warring States Texts," in *Origins of Chinese Political Philosophy*, ed. Martin Kern and Dirk Meyer (Leiden: Brill, 2017), 320--59. Compare also recent arguments regarding the applicability of the concept of *mouvance* to early Chinese textual phenomena; see Martin Kern, "'Xi Shuai' 蟋蟀 ('Cricket') and Its Consequences: Issues in Early Chinese Poetry and Textual Studies," *Early China* 42 (2019): 39--74, esp. 56--62, [https://doi.org/10.1017/eac.2019.1](https://doi.org/10.1017/eac.2019.1); compare also Dirk Meyer, *Documentation and Argument in Early China: The* Shangshu *(Venerated Documents) and the* Shu *Traditions* (Berlin: De Gruyter, 2021), 17--19. +[^12]: Compare David Schaberg, "Speaking of Documents: *Shu* Citations in Warring States Texts," in *Origins of Chinese Political Philosophy*, ed. Martin Kern and Dirk Meyer (Leiden: Brill, 2017), 320--59. Compare also recent arguments regarding the applicability of the concept of *mouvance* to early Chinese textual phenomena; see Martin Kern, "'Xi Shuai' 蟋蟀 ('Cricket') and Its Consequences: Issues in Early Chinese Poetry and Textual Studies," *Early China* 42 (2019): 39--74, esp. 56--62, [https://doi.org/10.1017/eac.2019.1](https://doi.org/10.1017/eac.2019.1); compare also Dirk Meyer, *Documentation and Argument in Early China: The* Shangshu *(Venerated Documents) and the* Shu *Traditions* (Berlin: De Gruyter, 2021), 17--19. 
-[^13]: The Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55.  +[^13]: For the underlying logic of representing early Chinese texts through visualizations of their phonological features, and for a better aural representation of the poem “Guan ju” that inspired figure 4, see Jeffrey R. Tharsen, “From Form to Sound 自形至聲: Visual and Aural Representations of Premodern Chinese Phonology and Phonorhetoric with Applications for Phonetic Scripts. International Journal of Digit Humanities 4 (2023), 115–129, https://doi.org/10.1007/s42803-022-00053-8.
The Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55. [^14]: For more background on different premodern Chinese lexicons, compare Zev Handel, "Early Lexicons," in *Literary Information in China: A History*, ed. Jack W. Chen et al., (New York: Columbia University Press, 2021), 53--64, esp. 60--61 for the *Qieyun* and sound-based lexicons; and also Victor H. Mair, "*Tzu-shu* 字書 or *tzu-tien* 字典 [Dictionaries]," in *The Indiana Companion to Traditional Chinese Literature*, ed. William H. Nienhauser, vol. 2 (Bloomington: Indiana University Press, 1998); for more on the *Jingdian Shiwen*, see David B. Honey, *Northern and Southern Dynasties, Sui, and Early Tang: The Decline of Factual Philology and the Rise of Speculative Hermeneutics*, vol. 3 of *A History of Classical Chinese Scholarship* (Washington: Academica Press, 2021), 215--20. [^15]: See David B. Honey's translation of Lu Deming's biography from the *New Tang History* (*Xin Tang shu* 新唐書) in *Decline of Factual Philology*, 216--19. -[^16]: In a way, the *Jingdian Shiwen* can be understood as a dictionary. This is also why an approach relying solely on the Transformer architecture will be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*. +[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the Jingdian shiwen systematically as a database; see their Jeffrey R. 
Tharsen and Hantao Wang, “Digitizing the Jingdian shiwen《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (paper, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015; see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016).
Here, we understand the *Jingdian Shiwen* as a dictionary. This is also why an approach relying solely on the Transformer architecture will be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*. [^17]: For comparison, in terms of file size, the modern "gzip" algorithm compresses the entirety of the same corpus with a ratio of about 3:1. -[^18]: This more abstract understanding of phonology may have reached Chinese scholars by way of Sanskrit and Indian linguistics, which had gained relevance with the increasing institutionalization of Chinese Buddhism in the sixth and seventh centuries; compare Mair, “*Tzu-shu* 字書,” 168.  +[^18]: This more abstract understanding of phonology may have reached Chinese scholars by way of Sanskrit and Indian linguistics, which had gained relevance with the increasing institutionalization of Chinese Buddhism in the sixth and seventh centuries; compare Mair, “*Tzu-shu* 字書,” 168. [^19]: "About Kanseki Repository," Kanripo, last accessed August 21, 2023, [https://www.kanripo.org/](https://www.kanripo.org/). 
From 4dbd43b44941a5b6bc65d24990ef9ae3ebb5ecc0 Mon Sep 17 00:00:00 2001 From: rlskoeser Date: Tue, 3 Oct 2023 19:02:53 -0400 Subject: [PATCH 3/8] Switch to the DOI for all versions of this record --- content/issues/4/sonorous-medieval/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index 79f2ca01..1f245152 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -8,7 +8,7 @@ authors: - BudakNick - RomingerGian date: 2023-10-02 -doi: 10.5281/zenodo.8380842 +doi: 10.5281/zenodo.8380841 pdf: https://zenodo.org/record/8380842/files/startwords-4-sonorous-medieval.pdf images: ["issues/4/sonorous-medieval/images/sonorous-medieval-social.png"] summary: A distinctive set of challenges arises when training machines to process a historical language, especially one that was last spoken two millennia ago. From 3a2639586a371e04733c7725b303165cbf760069 Mon Sep 17 00:00:00 2001 From: Grant Wythoff <5580840+gwijthoff@users.noreply.github.com> Date: Tue, 3 Oct 2023 20:53:13 -0400 Subject: [PATCH 4/8] minor edit --- content/issues/4/sonorous-medieval/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index 1f245152..6d993e59 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -486,7 +486,7 @@ Our project continues the practice of reading classical texts as data, but with ## Acknowledgments -This work was made possible through our participation in the “[New Languages for NLP: Building Linguistic Diversity in the Digital Humanities](https://newnlp.princeton.edu/),” a National Endowment for Humanities Institute for Advanced Topics in the Digital Humanities. 
Our thanks to the organizers, Natalia Ermolaev and Andrew Janco, and to Toma Tasovac, Quinn Dombrowski, and David Lassner. +This work was made possible through our participation in “[New Languages for NLP: Building Linguistic Diversity in the Digital Humanities](https://newnlp.princeton.edu/),” a National Endowment for Humanities Institute for Advanced Topics in the Digital Humanities. Our thanks to the organizers, Natalia Ermolaev and Andrew Janco, and to Toma Tasovac, Quinn Dombrowski, and David Lassner. We also thank Gissoo Doroudian and Rebecca Sutton Koeser, for their design and implementation of the figures of this article, in a lively exchange with Nick Budak. Figure 4 was further inspired by the work of Jeffrey R. Tharsen, whose work more generally has inspired many of the thoughts in this article. From 2e75ed924ab445f2f7f01bd2305a41737488662e Mon Sep 17 00:00:00 2001 From: Grant Wythoff <5580840+gwijthoff@users.noreply.github.com> Date: Tue, 3 Oct 2023 21:02:49 -0400 Subject: [PATCH 5/8] italicize journal title --- content/issues/4/sonorous-medieval/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index 6d993e59..f505d62f 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -519,7 +519,7 @@ Lastly, thank you to Grant Wythoff for his erudite editorial work. [^12]: Compare David Schaberg, "Speaking of Documents: *Shu* Citations in Warring States Texts," in *Origins of Chinese Political Philosophy*, ed. Martin Kern and Dirk Meyer (Leiden: Brill, 2017), 320--59. Compare also recent arguments regarding the applicability of the concept of *mouvance* to early Chinese textual phenomena; see Martin Kern, "'Xi Shuai' 蟋蟀 ('Cricket') and Its Consequences: Issues in Early Chinese Poetry and Textual Studies," *Early China* 42 (2019): 39--74, esp. 
56--62, [https://doi.org/10.1017/eac.2019.1](https://doi.org/10.1017/eac.2019.1); compare also Dirk Meyer, *Documentation and Argument in Early China: The* Shangshu *(Venerated Documents) and the* Shu *Traditions* (Berlin: De Gruyter, 2021), 17--19. -[^13]: For the underlying logic of representing early Chinese texts through visualizations of their phonological features, and for a better aural representation of the poem “Guan ju” that inspired figure 4, see Jeffrey R. Tharsen, “From Form to Sound 自形至聲: Visual and Aural Representations of Premodern Chinese Phonology and Phonorhetoric with Applications for Phonetic Scripts. International Journal of Digit Humanities 4 (2023), 115–129, https://doi.org/10.1007/s42803-022-00053-8.
The Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55. +[^13]: For the underlying logic of representing early Chinese texts through visualizations of their phonological features, and for a better aural representation of the poem “Guan ju” that inspired figure 4, see Jeffrey R. Tharsen, “From Form to Sound 自形至聲: Visual and Aural Representations of Premodern Chinese Phonology and Phonorhetoric with Applications for Phonetic Scripts. *International Journal of Digit Humanities* 4 (2023), 115–129, https://doi.org/10.1007/s42803-022-00053-8.
The Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55. [^14]: For more background on different premodern Chinese lexicons, compare Zev Handel, "Early Lexicons," in *Literary Information in China: A History*, ed. Jack W. Chen et al., (New York: Columbia University Press, 2021), 53--64, esp. 60--61 for the *Qieyun* and sound-based lexicons; and also Victor H. Mair, "*Tzu-shu* 字書 or *tzu-tien* 字典 [Dictionaries]," in *The Indiana Companion to Traditional Chinese Literature*, ed. William H. Nienhauser, vol. 2 (Bloomington: Indiana University Press, 1998); for more on the *Jingdian Shiwen*, see David B. Honey, *Northern and Southern Dynasties, Sui, and Early Tang: The Decline of Factual Philology and the Rise of Speculative Hermeneutics*, vol. 3 of *A History of Classical Chinese Scholarship* (Washington: Academica Press, 2021), 215--20. 
From 96d317e4f430bd26640398327c955647ea4680f1 Mon Sep 17 00:00:00 2001 From: rlskoeser Date: Wed, 4 Oct 2023 10:35:48 -0400 Subject: [PATCH 6/8] Additional revisions from @GDRom --- content/issues/4/sonorous-medieval/index.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index f505d62f..814f5744 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -486,12 +486,7 @@ Our project continues the practice of reading classical texts as data, but with ## Acknowledgments -This work was made possible through our participation in “[New Languages for NLP: Building Linguistic Diversity in the Digital Humanities](https://newnlp.princeton.edu/),” a National Endowment for Humanities Institute for Advanced Topics in the Digital Humanities. Our thanks to the organizers, Natalia Ermolaev and Andrew Janco, and to Toma Tasovac, Quinn Dombrowski, and David Lassner. - -We also thank Gissoo Doroudian and Rebecca Sutton Koeser, for their design and implementation of the figures of this article, in a lively exchange with Nick Budak. Figure 4 was further inspired by the work of Jeffrey R. Tharsen, whose work more generally has inspired many of the thoughts in this article. - -Lastly, thank you to Grant Wythoff for his erudite editorial work. - +This work was made possible through our participation in the “[New Languages for NLP: Building Linguistic Diversity in the Digital Humanities](https://newnlp.princeton.edu/),” a National Endowment for the Humanities Institute for Advanced Topics in the Digital Humanities. Our thanks to the organizers, Natalia Ermolaev and Andrew Janco, and to Toma Tasovac, Quinn Dombrowski, and David Lassner. We also thank Gissoo Doroudian and Rebecca Sutton Koeser for their design and implementation of the figures of this article, in a lively exchange with Nick Budak. 
Figure 4 was further based on the work of Jeffrey R. Tharsen, whose work more generally has inspired many of the thoughts in this article. Lastly, thank you to Grant Wythoff for his erudite editorial work. [^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922). @@ -519,13 +514,13 @@ Lastly, thank you to Grant Wythoff for his erudite editorial work. [^12]: Compare David Schaberg, "Speaking of Documents: *Shu* Citations in Warring States Texts," in *Origins of Chinese Political Philosophy*, ed. Martin Kern and Dirk Meyer (Leiden: Brill, 2017), 320--59. Compare also recent arguments regarding the applicability of the concept of *mouvance* to early Chinese textual phenomena; see Martin Kern, "'Xi Shuai' 蟋蟀 ('Cricket') and Its Consequences: Issues in Early Chinese Poetry and Textual Studies," *Early China* 42 (2019): 39--74, esp. 56--62, [https://doi.org/10.1017/eac.2019.1](https://doi.org/10.1017/eac.2019.1); compare also Dirk Meyer, *Documentation and Argument in Early China: The* Shangshu *(Venerated Documents) and the* Shu *Traditions* (Berlin: De Gruyter, 2021), 17--19. -[^13]: For the underlying logic of representing early Chinese texts through visualizations of their phonological features, and for a better aural representation of the poem “Guan ju” that inspired figure 4, see Jeffrey R. Tharsen, “From Form to Sound 自形至聲: Visual and Aural Representations of Premodern Chinese Phonology and Phonorhetoric with Applications for Phonetic Scripts. *International Journal of Digit Humanities* 4 (2023), 115–129, https://doi.org/10.1007/s42803-022-00053-8.
The Ming dynasty scholar Chen Di 陳第 (1541--1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154--55. +[^13]: For the underlying logic of representing early Chinese texts through visualizations of their phonological features, and for a fuller aural representation of the poem “Guan ju” upon which figure 4 is largely based, see Jeffrey R. Tharsen, “From Form to Sound 自形至聲: Visual and Aural Representations of Premodern Chinese Phonology and Phonorhetoric with Applications for Phonetic Scripts.” *International Journal of Digital Humanities* 4 (2023), 115–129, https://doi.org/10.1007/s42803-022-00053-8. Further note that the Ming dynasty scholar Chen Di 陳第 (1541–1617) used this problem of the Odes not rhyming to persuasively make the case that the Chinese language had undergone significant phonological change since ancient times; see William H. Baxter, *A Handbook of Old Chinese Phonology* (Berlin: Mouton De Gruyter, 1992), 154–55. [^14]: For more background on different premodern Chinese lexicons, compare Zev Handel, "Early Lexicons," in *Literary Information in China: A History*, ed. Jack W. Chen et al., (New York: Columbia University Press, 2021), 53--64, esp. 60--61 for the *Qieyun* and sound-based lexicons; and also Victor H. Mair, "*Tzu-shu* 字書 or *tzu-tien* 字典 [Dictionaries]," in *The Indiana Companion to Traditional Chinese Literature*, ed. William H. Nienhauser, vol. 2 (Bloomington: Indiana University Press, 1998); for more on the *Jingdian Shiwen*, see David B. Honey, *Northern and Southern Dynasties, Sui, and Early Tang: The Decline of Factual Philology and the Rise of Speculative Hermeneutics*, vol. 3 of *A History of Classical Chinese Scholarship* (Washington: Academica Press, 2021), 215--20. [^15]: See David B. 
Honey's translation of Lu Deming's biography from the *New Tang History* (*Xin Tang shu* 新唐書) in *Decline of Factual Philology*, 216--19. -[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the Jingdian shiwen systematically as a database; see their Jeffrey R. Tharsen and Hantao Wang, “Digitizing the Jingdian shiwen《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (paper, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015; see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016).
Here, we understand the *Jingdian Shiwen* as a dictionary. This is also why an approach relying solely on the Transformer architecture will be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*. +[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian shiwen* systematically as a database; see their Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (paper, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016). We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful for Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. 
Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*. [^17]: For comparison, in terms of file size, the modern "gzip" algorithm compresses the entirety of the same corpus with a ratio of about 3:1. From 5df6b985c100d025757aae75795412ea49597228 Mon Sep 17 00:00:00 2001 From: rlskoeser Date: Wed, 4 Oct 2023 14:38:40 -0400 Subject: [PATCH 7/8] Add DOIs for Tharsen & Wang citations --- content/issues/4/sonorous-medieval/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index 814f5744..6dacc552 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -520,7 +520,7 @@ This work was made possible through our participation in the “[New Languages f [^15]: See David B. Honey's translation of Lu Deming's biography from the *New Tang History* (*Xin Tang shu* 新唐書) in *Decline of Factual Philology*, 216--19. -[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian shiwen* systematically as a database; see their Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (paper, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016). 
We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful for Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*. +[^16]: The use of the *Jingdian Shiwen* as a data source was pioneered by Jeffrey R. Tharsen and Hantao Wang, who categorized and segmented the *Jingdian Shiwen* systematically as a database; see Jeffrey R. Tharsen and Hantao Wang, “Digitizing the *Jingdian Shiwen*《經典釋文》: Deriving a Lexical Database from Ancient Glosses” (poster, Chicago Colloquium on Digital Humanities and Computer Science (DHCS), University of Chicago, USA, November 14, 2015. [https://doi.org/10.6082/uchicago.8367](https://doi.org/10.6082/uchicago.8367)); see also Jeffrey R. Tharsen, “Understanding the Databases of Premodern China: Harnessing the Potential of Textual Corpora as Digital Data Sources” (paper, Digital Research in East Asian Studies: Corpora, Methods, and Challenges Conference, Leiden University, the Netherlands, July 12, 2016. [https://doi.org/10.6082/uchicago.8368](https://doi.org/10.6082/uchicago.8368)). 
We were initially unaware of the work of Tharsen and Wang, and at first approached the *Jingdian Shiwen* purely from the needs of training a Natural Language Processing model. Our approach therefore utilizes a different labeling schema that was focused on the NLP model training, and a different digitized version of the source text. Nonetheless, we are grateful to Tharsen and Wang for generously sharing their data, which allowed us to compare their data and approach to ours. As part of our approach, we understand the *Jingdian Shiwen* as a dictionary, and we identify its semi-structured dictionary form also as the reason why an approach relying solely on the Transformer architecture would be misleading. Compare this to the ability of a human reader who speaks English to look up any word she may find in a source text in the *Oxford English Dictionary*. GPT-3 processes the same text differently, and the only way it learns from the dictionary as a text is in the same way it understands a sequential text like *Moby-Dick*. [^17]: For comparison, in terms of file size, the modern "gzip" algorithm compresses the entirety of the same corpus with a ratio of about 3:1. 
From 519d3e7021364f46ca1f2e66e8403b1138985dd8 Mon Sep 17 00:00:00 2001 From: gwijthoff <5580840+gwijthoff@users.noreply.github.com> Date: Wed, 4 Oct 2023 15:25:05 -0400 Subject: [PATCH 8/8] update PDF link with new file --- content/issues/4/sonorous-medieval/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md index 6dacc552..eeaebcec 100644 --- a/content/issues/4/sonorous-medieval/index.md +++ b/content/issues/4/sonorous-medieval/index.md @@ -9,7 +9,7 @@ authors: - RomingerGian date: 2023-10-02 doi: 10.5281/zenodo.8380841 -pdf: https://zenodo.org/record/8380842/files/startwords-4-sonorous-medieval.pdf +pdf: https://zenodo.org/record/8408357/files/startwords-4-sonorous-medieval.pdf images: ["issues/4/sonorous-medieval/images/sonorous-medieval-social.png"] summary: A distinctive set of challenges arises when training machines to process a historical language, especially one that was last spoken two millennia ago. hook_height_override: 175