Caroline Gish | [email protected]
This project was undertaken by Caroline Gish as the course project for the Data Science for Linguists 2022 course.
The overall goal of this project was to see how well the UDS semantic annotation framework, a novel framework claimed to have better coverage of non-prototypical instances, is equipped to handle child speech, which may contain non-prototypical instances dissimilar from the UDS training sentences. My personal goal in undertaking this project was to gain experience with a massive dataset that comes with its own library and is designed specifically for semantic research.
To read what my classmates had to say about my project during the semester, be sure to visit my project guestbook!
Data were sourced from both the Decomp repository of the Decompositional Semantics Initiative and the CHILDES child language component of the TalkBank system.
- final_report.md is the final write-up for my project.
- README.md is what you are currently reading! It contains an overview of my project, links to all files, and information on the licensing and works cited.
- presentation_gish.pdf is a PDF copy of the slides for the presentation I gave in the 2022 Data Science for Linguists class. These slides do not contain any of my notes, so please feel free to contact me for more information about them!
- progress_report.md contains three progress reports, each detailing my project progress over the course of the semester.
- project_plan.md is the initial project plan that I proposed at the beginning of the semester.
- LICENSE.md is the license for the code elements of the repository.
- LICENSE-cc.md is the license for the non-code elements of the repository.
- code_notebooks/ contains all of my Jupyter notebooks (.ipynb files) and license information for the PyLangAcq library and Decomp toolkit I used.
  - childes_exploration.ipynb is all my code dedicated to the CHILDES dataset. Here is the same notebook through Jupyter's nbviewer since GitHub sometimes messes with the formatting.
  - UDS_exploration_CRC.ipynb is all my code dedicated to the UDS dataset. Here is the same notebook through Jupyter's nbviewer since GitHub sometimes messes with the formatting.
- data_samples/ contains samples of the CHILDES CHAT data separated into subdirectories by child grade level. The full dataset is available on this page.
- resources/ is a collection of handy resources and notes for ease of access.
- visualizations/ contains an example of a UDS graph and my plots saved as .png files.
The code content of this repo is covered by the GNU General Public License v3, and the non-code content of this repo is covered by the CC BY-NC-SA 3.0 license.
The Universal Decompositional Semantics Dataset and Decomp Toolkit (White et al., LREC 2020)
ACL version: Aaron Steven White, Elias Stengel-Eskin, Siddharth Vashishtha, Venkata Subrahmanyan Govindarajan, Dee Ann Reisinger, Tim Vieira, Keisuke Sakaguchi, Sheng Zhang, Francis Ferraro, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2020. The Universal Decompositional Semantics Dataset and Decomp Toolkit. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5698–5707, Marseille, France. European Language Resources Association.
Hicks, D. (1990). Kinds of texts: Narrative genre skills among children from two communities. In A. McCabe (Ed.), Developing narrative structure. Hillsdale, NJ: Erlbaum.
Additional references that go along with Hicks, D. (1990) include:
Berman, R. A. and D. I. Slobin (1994). Relating events in narrative: A crosslinguistic developmental study. Hillsdale, NJ, Lawrence Erlbaum Associates.
Heath, S. (1983). Ways with words: Language, life and work in communities and classrooms. Cambridge, Cambridge University Press.
Quirk, R., S. Greenbaum, et al. (1972). A grammar of contemporary English. London, Longman.