About the Project

The goal of this project is to use computational methods to connect every word in the Babylonian Talmud with its proper entry in the Jastrow Talmudic dictionary. In other words, it is a single-word translation feature, à la the Google Translate Chrome extension. Development is near completion, and the dataset will be used by TalmudLab and Sefaria for their own digital implementations of the Talmud.

Support and Dependencies

This project relies on two proprietary datasets, so the code cannot be run without access to them. These are:

  1. Sefaria's digitized Jastrow database.
  2. Dicta's comprehensive lexicon of Talmudic Aramaic.

The developer was generously granted access to both of these datasets, without which this project could never have been completed to the degree it has reached at this time.

Two open source projects were used in this project:

  1. YAP, a Part-of-Speech tagger for Modern Hebrew, has been used to tag the POS of words in Rabbinic Hebrew, which has a close syntactic relationship to Modern Hebrew, as the penultimate step of the data pipeline.
  2. I have relied upon the aligned CAL-Sefaria Talmud data generated by Noah Santacruz for his PSHAT project. This data was aligned with the vowelized text of the Dicta Talmud in order to create a dataset of 70,000+ vowelized Aramaic words for training the language classifier.

Python module dependencies: scikit-learn, pandas, NumPy, Requests.

Data Pipeline

  1. align_and_classify.py -- with the raw Dicta Talmud in a local directory (data/dicta_talmud), this downloads the corresponding text of each Talmudic tractate from the Sefaria API and aligns the two versions word by word. It also classifies the "type" of each segment (henceforth, "chunk") of the Talmud as "m" for Mishna (written in a mix of Rabbinic Hebrew and Biblical Hebrew) or "g" for Gemara (written in a mix of Aramaic, Rabbinic Hebrew, and Biblical Hebrew). The program asks for user input when the words do not line up perfectly; most tractates take only a few minutes to align, with very few human decisions. The output is a json file for each tractate that replaces each word with a word "container" storing the word as it appears in Sefaria's version, along with the two possible spellings of that word provided by Dicta. The output can be found in data/aligned_talmud.
  2. connect_sources.py -- this uses the Sefaria API to download the pre-Talmudic sources (Bible, Mishna, Tosefta, Midrash) referenced by each chunk and stores them, together with the aligned text itself, in another json file. This part requires no human input, but takes some time depending on the length of the tractate and the number of sources it references. The output can be found in data/connected_talmud.
  3. Training the language classifier:
     (i) scripts/vowelize_aram_train_data.py -- this generates a training set for the language classifier model by taking the aligned CAL/Sefaria Talmud text generated by Noah Santacruz (data/cal_sefaria_matched) and aligning each text with the corresponding text in the data generated in part 1 (data/aligned_talmud). The vowelized Aramaic words are selected out, and each tractate is output as a separate json file (data/vowelized_cal_text).
     (ii) The contents of the vowelized Mishna from Sefaria are downloaded and saved as a json file (data/vowelized_cal_texts/download_mishnas.py). The words of the Mishna word corpus are then shuffled and truncated to the same size as the Aramaic training data. These training sets are compiled into a single json file for training the model (71667_each_training_data.py).
     (iii) generate_LangTagger.py -- this creates an SVM model for language classification, trained on the data from 3(ii). The model is saved as a joblib file so that it can be loaded quickly in the future; it is located at src/languagetagger/GemaraLanguageTagger.joblib. A simple SVM model trained on only 40,000 words has remarkable success at distinguishing between vowelized Hebrew and vowelized Aramaic words, context-independently: tests showed 96% accuracy on a test set. The words are converted into vectors using a simple one-hot encoding of characters: each character is mapped to a binary vector whose entries are all 0, except for the position representing that character, which is 1. See the Jupyter notebook for more information.
  4. tag_language.py -- using the language tagger and simple heuristics (e.g. any word that appears in a linked Biblical source should be tagged as 'B'), every word in a tractate is tagged as Biblical Hebrew (B), Rabbinic Hebrew (R), or Aramaic (A). The output is another json file for each tractate, but without page numbers and linked sources, as these are no longer needed. The output can be found at data/lang_tagged_talmud.
  5. tag_heb_pos.py -- uses YAP to tag the POS of all words that were marked as Rabbinic Hebrew in part 4. The output is another json file, located at data/pos_tagged_talmud, in which every word in the Talmud has a POS tag; words not marked as 'R' are labelled 'yydot' by YAP. This step is necessary because, unlike for Aramaic and Biblical Hebrew, there is no database mapping Rabbinic Hebrew words to their roots that could be linked directly to the Jastrow database. As a workaround, the Hebrew translator pipes the word through the Morfix mobile API, which returns a range of context- and vowel-independent root suggestions, along with their parts of speech. Hence, knowing the probable POS of a Rabbinic Hebrew word helps whittle down and rank the options.
  6. translate_masekhet.py -- translates the text, linking each word in the Talmud to its proper location (RID) in the Jastrow. This step has not yet been implemented, as it requires compiling one or more additional datasets, which are currently in progress.
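
As a rough illustration of the character encoding described in step 3, the sketch below turns a vowelized word into a flat one-hot vector. It is not the project's actual code (see the Jupyter notebook for that): the function names, the fixed `max_len`, the right-padding with zeros, and the choice to build the alphabet from the training words are all assumptions made for the example.

```python
from typing import Dict, List


def build_alphabet(words: List[str]) -> Dict[str, int]:
    """Assign each distinct character (letters and vowel marks) a column index."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_word(word: str, alphabet: Dict[str, int], max_len: int) -> List[int]:
    """Encode a word as max_len consecutive one-hot blocks, zero-padded on the right.

    Assumption: words longer than max_len are truncated, and characters
    missing from the alphabet are left as an all-zero block.
    """
    size = len(alphabet)
    vector = [0] * (size * max_len)
    for pos, ch in enumerate(word[:max_len]):
        idx = alphabet.get(ch)
        if idx is not None:
            vector[pos * size + idx] = 1
    return vector


# Hypothetical two-word "corpus", used only to build the alphabet.
words = ["אָמַר", "מֶלֶךְ"]
alphabet = build_alphabet(words)
vec = one_hot_word(words[0], alphabet, max_len=12)
```

Vectors of this shape, paired with Hebrew/Aramaic labels, could then be passed to a scikit-learn classifier such as `sklearn.svm.SVC`; the kernel and parameters used by the project are not stated here.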

Results

Although part 6 has not yet been completed, parts 1-5 are done, which is the bulk of the project. A proof of concept, demonstrating that the penultimate step has indeed been reached, can be found at proof_of_concept/Meilah_full.csv. This csv maps every word in tractate Meilah to one of the following: (1) a Biblical Hebrew word root, which will be linked directly to a Jastrow entry through a database; (2) an Aramaic word root from Dicta, which will be linked directly to a Jastrow entry through a database; or (3) a Modern Hebrew word root that appears in the Jastrow dictionary. This guarantees that all of these words will be mapped to their corresponding dictionary entries once the final step is complete.

Future work

The first order of business is to finish compiling the direct mapping of Dicta roots (and Biblical Hebrew roots) to corresponding Jastrow headwords. Once this is done, the job will essentially be complete.

Other datasets may also be useful, especially a dataset of Aramaic stopwords and the names of Rabbinic sages.

The Hebrew POS tagger is currently a workaround for the issue of Morfix providing too many options. In the future, it would be better to train a machine learning model to recognize the proper root of a word from a list of options. This sub-project is a work-in-progress.
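
The current POS-based workaround can be pictured with the minimal sketch below, which moves root suggestions whose part of speech matches the tagger's prediction to the front of the list. The candidate format (`root` and `pos` keys), the tag names, and the function name are hypothetical; the real Morfix response and the YAP tag set are not reproduced here, and in practice the two tag sets would likely need to be mapped onto each other first.

```python
from typing import Dict, List


def rank_candidates(candidates: List[Dict[str, str]], predicted_pos: str) -> List[Dict[str, str]]:
    """Put candidates whose POS matches the prediction first.

    sorted() is stable, so candidates keep their original relative order
    within the matching and non-matching groups.
    """
    return sorted(candidates, key=lambda c: c["pos"] != predicted_pos)


# Hypothetical root suggestions for a single word.
candidates = [
    {"root": "root_a", "pos": "NN"},
    {"root": "root_b", "pos": "VB"},
    {"root": "root_c", "pos": "NN"},
]
ranked = rank_candidates(candidates, predicted_pos="VB")
```

Ranking rather than discarding non-matching candidates keeps a fallback available when the predicted POS is wrong.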

Notes

Please note that files within the "scripts" folder, if run, must be run from the root directory.

Mishnayot are always tagged by the language tagger as Rabbinic or Biblical Hebrew. There are some instances where Aramaic appears in the Mishna, but since these instances are so rare, they should be properly mapped manually on a case-by-case basis. A complete list of these can be found in Strack and Stemberger, "Introduction to the Talmud and Midrash."
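
The Mishna rule above, combined with the Biblical-source heuristic from step 4 of the pipeline, might look roughly like the following. This is a sketch under stated assumptions, not the logic of tag_language.py itself: the function name, its arguments, the order of the checks, and the idea that the classifier returns 'R' or 'A' are all illustrative.

```python
from typing import Callable, Set


def tag_word(
    word: str,
    chunk_type: str,
    biblical_words: Set[str],
    classify: Callable[[str], str],
) -> str:
    """Return 'B', 'R', or 'A' for a single word.

    chunk_type is 'm' (Mishna) or 'g' (Gemara); biblical_words holds the
    words of the Biblical sources linked to the chunk; classify is a
    stand-in for the language tagger (assumed to return 'R' or 'A').
    """
    if word in biblical_words:
        return "B"  # heuristic: the word appears in a linked Biblical source
    if chunk_type == "m":
        return "R"  # Mishna chunks are never tagged as Aramaic
    return classify(word)


# Hypothetical usage with a dummy classifier that always answers 'A'.
always_aramaic = lambda w: "A"
tag_in_mishna = tag_word("word", "m", set(), always_aramaic)
tag_in_gemara = tag_word("word", "g", set(), always_aramaic)
tag_biblical = tag_word("word", "g", {"word"}, always_aramaic)
```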