forked from tmallon/morpheus
Transform Greek and Latin texts into morphology databases using Perseus' Morpheus service.
kerberosdelhades/morpheus
morpheus and morpheuslib README

Dated 14 October 2013

Contents:
  I.   Preliminary
  II.  Terminology and Background
  III. Usage
  IV.  Outputs
  V.   Programming Notes

I. Preliminary

1. Status
   Produces correct output, but the output is rather verbose.

2. Programmer/maintainer: Timothy Mallon
   Send bug reports, complaints, and requests to [email protected].

3. Purposes
   Fetch morphological analyses of Greek and Latin words from the
   Perseus Morpheus service for quick reference; assemble clauses for a
   Prolog knowledge base and/or JSON records for use in a Mongo
   database.

II. Terminology and Background

1. Terminology
   This is an explanation of how I intend to use the following words in
   code and documentation. I plan to regularize any deviations from
   this standard as I find them.

   word:     a word found in a text. May be extended with a triple of
             location ordinals indicating the word's location in the
             text, the location of the clause containing the word, and
             the location of the containing sentence.
   analysis: a tuple of values that make up one possible inflectional
             analysis of a word. The values can be thought of as
             situating the word in a paradigm (for nouns and verbs).
   form:     a value in an analysis that in principle is just the word
             analyzed, but in the case of Greek words may differ from
             the word (see the discussion below, under "Greek URLs").
   lemma:    a value in an analysis that is the headword under which
             the word is found in a dictionary. Sometimes, but not
             generally, equal to the form or the word.
   sentence: a unit of text terminated by a period or equivalent
             (question mark or, in later texts, exclamation mark).
   clause:   a unit of text terminated by a comma or equivalent
             (semicolon or colon), or by a period or equivalent.

2. Perseus and Morpheus
   Perseus is a website run by Tufts University
   (http://www.perseus.tufts.edu) which offers texts and text-related
   services.
   One of those services is Morpheus, which takes a Greek or Latin word
   as input and returns morphological analyses in the form of XML
   documents (see below for an example). The analyses include the lemma
   (i.e. dictionary headword), the part of speech, and other
   morphological features. The features separate into a core set,
   common to all analyses, and features specific to a part of speech.

   The core features are:

     form:         the word submitted
     lemma:        the headword or word listed in a dictionary or
                   lexicon, to which the form belongs
     expandedForm: usually = form, but may include a suffixed word
     pos:          part of speech: noun, verb, part(iciple),
                   adj(ective), conj(unction), exclam(ation), adv(erb)
     lang:         'greek' or 'la'
     dialect:      especially for Greek, e.g. 'doric'
     feature:      something not fitting into the preceding

   As an example of non-core features, a noun will have case, gender
   and number.

   Note that many words submitted to Morpheus will have several
   analyses, which may refer to one lemma ('feminae' may be one of
   three forms of 'femina') or to more than one ('amare' may be the
   infinitive 'to love' (lemma 'amo') or a vocative adjective
   'o bitter man' (lemma 'amarus')).

3. Modifications Made to Morpheus Data
   Morpheuslib makes a few changes to Morpheus data in methods of class
   Analysis.

   First, the present participle in Latin is supplemented with a voice
   feature (see fixpart()).

   Second, Latin pronouns are supplemented with the person feature (see
   fixpron()). Note: this requires that the pronoun's person be listed
   in the info.la data file.

   Third, the supine and infinitive are changed from "moods" (of noun
   and verb, respectively) to their own parts of speech (see
   fix_bad_mood()). This classification accords better with
   contemporary understanding of their history and function.

4. Handling of Greek Text
   This program was written against Perseus' dialect of Beta Code,
   which uses lower case Latin letters. Upper case letters are
   converted to lower case before submission to Morpheus.
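The submission step can be sketched as follows. This is a minimal illustration, not the code in morpheuslib: the helper names cleanse() and lookup_url() are hypothetical, the base URL is taken from the examples in this README, and the set of stripped characters assumes the usual Beta Code marks for breathings and accents. The stripping itself anticipates the "Greek URLs" section of this README.

```python
from urllib.parse import urlencode

# Base URL copied from the examples in this README.
BASE_URL = "http://www.perseus.tufts.edu/hopper/xmlmorph"

# Assumption: ")" and "(" mark breathings, "/", "\" and "=" mark accents.
BREATHINGS_AND_ACCENTS = set(")(/\\=")


def cleanse(word, lang):
    """Lower-case the word; for Greek, also drop breathing and accent marks."""
    word = word.lower()
    if lang == "greek":
        word = "".join(ch for ch in word if ch not in BREATHINGS_AND_ACCENTS)
    return word


def lookup_url(word, lang):
    """Build the xmlmorph query URL for one word (hypothetical helper)."""
    query = urlencode({"lang": lang, "lookup": cleanse(word, lang)})
    return BASE_URL + "?" + query


print(lookup_url("MHRU/SANTO", "greek"))
```

For the input shown, the printed URL ends in lookup=mhrusanto, the diacritic-free form that this README reports as returning analyses.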
   Morpheus returns Greek text as cased (for the lemma) or uncased (for
   the form) polytonic Greek letters (precomposed Unicode).

5. Greek URLs
   Greek words submitted with diacritics return an empty document. E.g.
   the URL

     http://www.perseus.tufts.edu/hopper/xmlmorph?lang=greek&lookup=mhru%2Fsanto

   produces:

     <?xml version="1.0" encoding="utf-8"?>
     <analyses>
     </analyses>

   whereas

     http://www.perseus.tufts.edu/hopper/xmlmorph?lang=greek&lookup=mhrusanto

   produces:

     <?xml version="1.0" encoding="utf-8"?>
     <analyses>
       <analysis>
         <form lang="greek">μηρύσαντο</form>
         <lemma>μηρύομαι</lemma>
         <expandedForm>μηρύσαντο</expandedForm>
         <pos>verb</pos>
         <person>3rd</person>
         <number>pl</number>
         <tense>aor</tense>
         <mood>ind</mood>
         <voice>mid</voice>
         <dialect>homeric ionic</dialect>
         <feature>unaugmented</feature>
       </analysis>
     </analyses>

   Therefore, Greek words are submitted without breathing or accent
   diacritics. This causes more matches than necessary to be returned.
   To address this, the program checks that the returned form matches
   the submitted form, and retains only those analyses that match.

III. Usage

1. Setup
   It is sufficient to copy the files morpheuslib.py and morpheus.py,
   along with the *.info and cachewords.* files, to a single directory.

2. Supported Python Version
   The program was written in Python 3.3.0.

3. Command and Arguments

     morpheus.py [-h] [--core CORE] [--word WORD] [--json JSON]
                 [--prolog PROLOG] [--oz OZ]
                 [--echo {basic,off,prolog,json,oz}] [--label LABEL]
                 [--log LOG] [--start START]
                 input {greek,la}

   NOTE: if you have both Python 2 and 3 installed, and "python" refers
   to v. 2, use "python3" in the command.

   OPTIONAL:

   --core    A list of core features to output: form, lemma, pos,
             dialect, expandedForm, feature. Most useful: lemma and
             form. If more than one, in quotes, separated by spaces.
   --word    A list of text (or word) features to export: label, w, c,
             s. "label" is the text label supplied as a scope for the
             word location ordinals, supplied in the optional --label
             argument. "w", "c", and "s" are the zero-based ordinals of
             the word, the clause containing the word, and the sentence
             containing the word, respectively. If more than one, in
             quotes, separated by spaces.
   --json    Specify a file for output in JSON. If the specification
             starts with '+', append output, otherwise overwrite.
   --prolog  Specify a file for output in the form of Prolog clauses.
             If the specification starts with '+', append output,
             otherwise overwrite.
   --oz      Specify a file for Oz record output. If the specification
             starts with '+', append output, otherwise overwrite.
   --echo    If not 'off', print output to the terminal in the form
             requested: Prolog, JSON, or Oz as above. 'basic' prints a
             condensed version of the data returned by Perseus in
             feature:value pairs for those features that have values.
   --label   A label for the text. Completely at the user's discretion;
             it is not parsed or verified. Intended only to provide a
             scope for word, clause and sentence locations.
   --log     Specify a file for logging of non-exceptions.
   --start   Zero-based ordinal of the word to start processing at.
             Useful if a process terminates on error (e.g. from an
             HTTP 503).

   REQUIRED:

   input     A string of words for analysis OR the specification of a
             file containing text. If the string cannot be used to open
             a file, it is treated as input.
   lang      'greek' or 'la' as befits.

4. Use cases
   Some recommended commands for typical uses:

   for cache generation:

     morpheus.py cachewords.la la
     OR
     morpheus.py cachewords.greek greek

   for a text database:

     Prolog:
       morpheus.py --core "form lemma" --word "label w c s"
                   --prolog PROLOG --label LABEL text {greek,la}
     JSON:
       morpheus.py --core "form lemma pos" --word "label w c s"
                   --json JSON --label LABEL text {greek,la}
     Oz:
       morpheus.py --core "form lemma pos" --word "label w c s"
                   --oz OZ --label LABEL text {greek,la}

   for raw morphology data:

     Prolog:
       morpheus.py --core "form lemma lang" --prolog PROLOG
                   wordlist {greek,la}
     JSON:
       morpheus.py --core "form lemma lang pos" --json JSON
                   wordlist {greek,la}
     Oz:
       morpheus.py --core "form lemma lang pos" --oz OZ
                   wordlist {greek,la}

   for morphology lookup (prints feature:value pairs):

     morpheus.py --echo basic word {greek,la}

IV. Outputs

1. Prolog Output
   Output of this program was run on SWI Prolog 5.10.4 (the web site is
   http://www.swi-prolog.org/). It is generic Prolog and should run on
   any standard Prolog. There is a useful comparison of Prolog
   implementations at
   http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations.

2. JSON (JavaScript Object Notation) Output
   This output is suitable for import into MongoDB, a NoSQL database. I
   haven't used this database, but have entered Greek and Latin JSON
   records into MongoDB's online tutorial, and they were accepted and
   queryable. For more information, see http://mongodb.org/.

3. Oz Output
   Output records are designed for conversion into Oz (Mozart 1.4)
   language records. The format is analysis|names|values, where names
   is a colon-separated list of feature names and values is a
   colon-separated list of single-quoted feature values, each of which
   is converted to an Oz atom.

   Mozart 1.4 does not handle UTF-8 encoded strings; however, my tests
   have shown that UTF-8 byte arrays are converted to atoms without
   exception and the resulting atoms can be tested for equality.
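To make the analysis|names|values layout concrete, here is a minimal sketch of how one analysis could be serialized. It is not the code in morpheuslib: oz_record() is a hypothetical helper, the leading "analysis" is assumed to be a literal tag, and no escaping of quote characters inside values is attempted.

```python
def oz_record(features):
    """Format one analysis as 'analysis|names|values' (hypothetical helper)."""
    # Keep only the features that actually have a value.
    items = [(name, value) for name, value in features.items() if value]
    names = ":".join(name for name, _ in items)
    # Each value is wrapped in single quotes so it can be read as an Oz atom.
    values = ":".join("'" + value + "'" for _, value in items)
    return "analysis|" + names + "|" + values


sample = {"form": "amare", "lemma": "amo", "pos": "verb", "dialect": ""}
print(oz_record(sample))
```

The empty dialect value is dropped, so names and values stay aligned position by position.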
   Characters with code points greater than 255 are displayed as ?, and
   are not correctly converted to strings. I assume equality testing
   works because the test uses a low-level, character-agnostic method.
   Mozart 2 will support Unicode encodings, so I have left Greek text
   as-is in Oz output strings.

V. Programming Notes

1. Overview of morpheuslib Classes (brief)
   In quasi-functional notation the process from text to output is:

     text (string) -> [morpheuslib.Word] -> [[morpheuslib.Analysis]] -> [datum]

   Text is tokenized by class morpheuslib.WordStream into a stream of
   words, each ending with 0 or 1 separator or terminator characters.
   WordStream also maintains word location by word, clause, and
   sentence ordinal, all relative to the beginning of the text, which
   is zero. Words are wrapped with location and language in instances
   of class Word.

   Instances of morpheuslib.MorpheusUrl use Word instances to form the
   URL string, and then download the analyses XML document from
   Perseus. This document is wrapped in an instance of
   morpheuslib.Analyses, which supplies instances of class Analysis
   through Python's iteration mechanism. Each Analysis is convertible
   into string form for output to file or terminal.

2. Caching
   Class morpheus.Cache implements two caches: volatile (not saved
   between runs) and persistent (saved). Both caches store Morpheus XML
   <analyses> as values. The key is the form submitted to Morpheus (in
   the case of Greek, "cleansed" Beta Code). Cache.lookup() returns an
   Analyses object, so the fetch by URL and the cache lookup have the
   same interface.

   Both caches are implemented with Python dict objects. The persistent
   one is saved between runs with the Python pickle facility.

   The contents of the persistent cache are determined by the
   cachewords files (.la and .greek). The words in them are mostly
   "function" words which appear often in texts. The user can add or
   delete words from these files, and the dictionaries will be adjusted
   to match.
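The two-tier arrangement can be sketched as below. This is an illustration under stated assumptions, not the actual Cache class: TwoTierCache, its method signatures and the fetch callback are all hypothetical, and the sketch returns the raw XML string where the real Cache.lookup() returns an Analyses object.

```python
import os
import pickle
import tempfile


class TwoTierCache:
    """Hypothetical stand-in for the cache described above."""

    def __init__(self, path):
        self.path = path
        self.volatile = {}      # lives only for this run
        self.persistent = {}    # reloaded from the pickle file if present
        if os.path.exists(path):
            with open(path, "rb") as fh:
                self.persistent = pickle.load(fh)

    def lookup(self, form, fetch):
        """Return the XML for form, calling fetch(form) only on a miss."""
        if form in self.persistent:
            return self.persistent[form]
        if form not in self.volatile:
            self.volatile[form] = fetch(form)
        return self.volatile[form]

    def save(self):
        """Write the persistent dict to disk with pickle."""
        with open(self.path, "wb") as fh:
            pickle.dump(self.persistent, fh)


calls = []


def fake_fetch(form):
    """Stands in for the download from Perseus; records each call."""
    calls.append(form)
    return "<analyses/>"


path = os.path.join(tempfile.mkdtemp(), "cache.pickle")
cache = TwoTierCache(path)
cache.persistent["et"] = "<analyses/>"   # as if built from cachewords.la
cache.save()
cache.lookup("et", fake_fetch)           # persistent hit, no fetch
cache.lookup("amare", fake_fetch)        # miss, fetched once
cache.lookup("amare", fake_fetch)        # volatile hit, no second fetch
print(calls)
```

Only "amare" reaches fake_fetch, and only once; "et" is served from the persistent dict, which a new TwoTierCache on the same path would reload.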
   The persistent cache can be (re)created with the commands:

     morpheus.py cachewords.la la
     morpheus.py cachewords.greek greek

   The volatile cache holds words not added to the persistent cache. It
   is most likely to be useful with long texts, or texts with a lot of
   repetition.

3. Motivation
   My original motivation for writing this program was to create a more
   computationally useful form for morphological analysis data than
   XML. Prolog facts seemed to be a promising choice for that form.

4. Future directions
   1. Adding support for text that mixes Greek and Latin words.
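As a closing illustration for these notes, the word, clause and sentence ordinals can be sketched with a small tokenizer. It follows the definitions in the Terminology section (a sentence ends at a period, question mark or exclamation mark; a clause ends at a comma, semicolon, colon or any sentence terminator) and is not the actual WordStream class; locate_words() is a hypothetical name.

```python
import re

SENTENCE_END = ".?!"
CLAUSE_END = ",;:" + SENTENCE_END   # a sentence end also closes a clause


def locate_words(text):
    """Yield (word, w, c, s) with zero-based word, clause, sentence ordinals."""
    w = c = s = 0
    # Each match is a word followed by at most one separator or terminator.
    for match in re.finditer(r"([^\s,;:.?!]+)([,;:.?!]?)", text):
        word, mark = match.group(1), match.group(2)
        yield word, w, c, s
        w += 1
        # Guard on mark: the empty string is "in" every string.
        if mark and mark in CLAUSE_END:
            c += 1
        if mark and mark in SENTENCE_END:
            s += 1


for item in locate_words("veni, vidi, vici. amare?"):
    print(item)
```

Each word is reported with the ordinals in force before its own terminator is counted, so "vici" still belongs to sentence 0 and "amare" opens sentence 1.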