GitHub - kerberosdelhades/morpheus: Transform Greek and Latin texts into morphology databases using Perseus' Morpheus service.

kerberosdelhades / morpheus Public
forked from tmallon/morpheus
Notifications You must be signed in to change notification settings
Fork 0
Star 0
Transform Greek and Latin texts into morphology databases using Perseus' Morpheus service.
GPL-3.0 license
0 stars 1 fork Branches Tags Activity
Star
Notifications
Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
LICENSE		LICENSE
README		README
cachewords.greek		cachewords.greek
cachewords.la		cachewords.la
info.greek		info.greek
info.la		info.la
morpheus.py		morpheus.py
morpheuslib.conf		morpheuslib.conf
morpheuslib.py		morpheuslib.py
morpheuslib2.py		morpheuslib2.py
newlib.py		newlib.py
run3.py		run3.py
run4.py		run4.py
run5.py		run5.py
textstructure.py		textstructure.py
Repository files navigation

morpheus and morpheuslib README

Dated 14 October 2013

Contents:
        I.   Preliminary
        II.  Terminology and Background
        III. Usage
        IV.  Outputs
        V.   Programming Notes

I. Preliminary 

1. Status
        Produces correct output, but produces rather verbose
        output.

2. Programmer/maintainer:
        Timothy Mallon
        Bug reports, complaints, requests to [email protected].

3. Purposes:
	Fetch morphological analyses of Greek and Latin words from Perseus 
        Morpheus service for quick reference; assemble clauses for a Prolog
        knowledge base and/or JSON records for use in a Mongo database.

II. Background

1. Terminology
        This is an explanation of how I intend to use these the following 
        words in code and documentation. I plan to regularize any deviations 
        from this standard as I find them.

        word: a word found in a text. May be extended with a triple of location
        ordinals indicating the word's location in the text; the location of the
        clause containing the word; the location of the containing sentence.

        analysis: a tuple of values that are a possible inlectional analysis of
        a word. The values can be thought of as situating the word in a paradigm
        (for nouns and verbs).

        form: a value in an analysis that in priniple is just the word analyzed,
        but in the case of Greek words may differ from the word (see the dis-
        cussion below, under "Greek URLs").

        lemma: a value in an analysis that is the headword under which the word 
        is found in a dictionary. Sometimes, but not generally equal to the form
        or the word.
        
        sentence: a unit of text terminated by a period, or equivalent (question
        mark, or in later texts, exclamation mark).

        clause: a unit of text terminated by comma or equivalent
        (semicolon or colon) or period or equivalent.

2. Perseus and Morpheus
	Perseus is an website run by Tufts University 
        (http://www.perseus.tufts.edu) which offers texts and text-related 
        services. One of those services is Morpheus, which takes a Greek or 
        Latin word as input, and returns morphological analyses in the form of
        XML documents see below for an example). The analyses include:
        lemma (i.e. dictionary headword), part of speech, and other 
        morphological features. The features separate into a core set, common to
        all analyses; and features specific to a part of speech. The core 
        features are: form (the word submitted), lemma (headword or word listed 
        in a dictionary or lexicon and to which the form belongs); 
        expandedForm (usually = form, but may include a suffixed word); pos 
        (part of speech - noun, verb, part(iciple), adj(ective), conj(unction), 
        exclam(ation). adv(erb)); lang ('greek' or 'la'); dialect (especially 
        Greek, e.g. 'doric'); feature (something not fitting into the 
        preceding). As an example of non-core features, a noun will have case, 
        gender and number.

        Note that many words submitted to Morpheus will have several analyses,
        which may refer to one lemma ('feminae' msy be one of three forms of 
        'femina') or more than one ('amare' may be the infinitive 'to love' 
        (lemma 'amo') or a vocative adjective 'o bitter man'(lemma 'amarus')).

3. Modififications Made to Morpheus Data
        Morpheuslib makes a few changes to Morpheus data in methods of class 
        Analysis. First, the present participle in Latin is supplemented with a 
        voice feature (see fixpart()). Second, Latin pronouns are supplemented
        with the person feature (see fixpron()). Note: this requires that the
        pronoun's person be listed in the info.la data file. Third, the supine 
        and  infinitive are changed from "moods"  (of noun and verb, 
        respectively) to their own parts of speech (see fix_bad_mood()). This 
        classification accords better with contemporary understanding of their 
        history and function.
 
4. Handling of Greek Text:
        This program was written against Perseus' dialect of Beta Code, which 
        uses lower case Latin letters. Upper case letters are converted to lower
        case before submission to Morpheus. 

        Morpheus returns Greek text as cased (for the lemma)  or uncased (for the 
        form) polytonic Greek letters (precomposed Unicode).
        
5. Greek URLS:
        Greek words submitted with diacritics return an empty document.
        E.g. the url:
        http://www.perseus.tufts.edu/hopper/xmlmorph?lang=greek&lookup=mhru%2Fsanto 
        produces:
        <?xml version="1.0" encoding="utf-8"?>
        <analyses>
        </analyses>;
        whereas 
        http://www.perseus.tufts.edu/hopper/xmlmorph?lang=greek&lookup=mhrusanto 
        produces:
        <?xml version="1.0" encoding="utf-8"?>
        <analyses>
	        <analysis>
                        <form lang="greek">μηρύσαντο</form>
                        <lemma>μηρύομαι</lemma>
                        <expandedForm>μηρύσαντο</expandedForm>
                        <pos>verb</pos>
                        <person>3rd</person>
                        <number>pl</number>
                        <tense>aor</tense>
                        <mood>ind</mood>
                        <voice>mid</voice>
                        <dialect>homeric ionic</dialect>
                        <feature>unaugmented</feature>
                </analysis>
        </analyses>.
        Therefore, Greek words are subitted without breathing or accent 
        diacritics. This causes more matches than necessary to be returned.
        This is addressed by checking that the returned form matches the 
        submitted form, and retaining only those analyses that match.  

III. Usage

1. Setup
        It is sufficient to copy the files morpheuslib.py and morpheus.py, along 
        with *.info and cachewords.* files to a single directory. 
        

2. Supported Python Version
        The program was written in Python 3.3.0.
         										
3. Command and Arguments
        morpheus.py [-h] [--core CORE] [--word WORD] [--json JSON]
                    [--prolog PROLOG] [--oz OZ]
                    [--echo {basic,off,prolog,json,oz}] [--label LABEL]
                    [--log LOG] [--start START]
                    input {greek,la}


        NOTE: if you have both Python 2 and 3 installed, and "python" refers to 
        v. 2, use "python3" in the command.

        OPTIONAL:

        --core A list of core features to output: form, lemma, pos, dialect, 
               expandedForm, feature. Most useful: lemma and form. If more than 
               one, in quotes, separated by spaces.

        --word A list ot text (or word features to export: label, w, c, s.
               "label" is the text label supplied as a scope for the word
               location ordinals, supplied in the optional --label argument.
               "w", "c", and "s" are the zero based ordinals of the word,
               the clause containing the word, and the sentence containing the
               word, respectively.

               If more than one, in quotes, separated by spaces.

        --json Specify a file for output in JSON. If specificartion starts with 
               '+', append output, otherwise overwrite.

        --prolog Specify a file for output in the form of Prolog clauses. If 
                 specificartion starts with '+', append output, otherwise 
                 overwrite.

        --oz    Specify a file for Oz record output. If 
                specificartion starts with '+', append output, otherwise 
                overwrite. 

        --echo If not 'off', print output to terminal in the form requested: 
               Prolog, JSON, or Oz as above. 'basic' prints a condensed version of 
               the data returned by Perseus in feature:value pairs for those 
               features that have values.

        --label A label for the text. Competely at the user's discretion, it is 
                not parsed or verified. Intended only to provide a scope for 
                word, clause and sentence locations.

        --log Specify a file for logging of non- exceptions.

        --start Zero-based ordinal of the word to start proccesing at. Useful if
                a process terminates on error (e.g. from  an HTTP 503).        

        REQUIRED:
        
        input A string of words for analysis OR specification of a file 
              containing text. If the string cannot be used to open a file, it 
              is treated as input.

        lang 'greek' or 'la' as befits.

4. Use cases
        Some recommended commands for typical uses:

        for cache generation: 
            morpheus.py cachewords.la la OR morpheus.py cachewords.greek greek

        for a text database: 
            prolog: morpheus.py --core "form lemma" --word "label w c s" --prolog PROLOG --label LABEL text {greek,la}
            JSON: morpheus.py --core "form lemma pos" --word "label w c s" --json JSON --label LABEL text {greek,la}
            Oz: morpheus.py --core "form lemma pos" --word "label w c s" --oz OZ --label LABEL text {greek,la}

        for raw morphology data:
            prolog: morpheus.py --core "form lemma lang" --prolog PROLOG wordlist {greek,la}
            JSON: morpheus.py --core "form lemma lang pos" --json JSON wordlist {greek,la}
            Oz: morpheus.py --core "form lemma lang pos" --oz OZ wordlist {greek,la}

        for morphology lookup (prints feature:value pairs):
            morpheus.py --echo basic word {greek, la}
        
V. Outputs

1. Prolog Output:
        Output of this program was run on SWI Prolog 5.10.4 (the web site is
        http://www.swi-prolog.org/). It is generic Prolog and should run on any
        standard Prolog. There is a useful comparison of Prolog implementations at
        http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations. 

2. JSON (JavaScript Object Notation) Output
        This output is suitable for import into MongoDb, a NoSQL database. I 
        haven't used this database,but have entered Greek and Latin JSON 
        records into MongoDb's online tutorial, and they were accepted and 
        queryable. For more information, http://mongodb.org/.

3. Oz Output
        Output records are designed for conversion into Oz (Mozart 1.4)
        language records. The format is analysis|names|values, where names is a
        colon-separated list of feature names, values is a colon separated list 
        of single-quoted feature values, each of which is converted to an Oz 
        atom. 

        Mozart 1.4 does not handle UTF-8 encoded strings, however my tests have
        shown that UTF-8 byte arrays are converted to atoms without exception 
        and the resulting atoms can be tested for equality. characters with code
        points greater than 255 are displayed as ?, and are not correctly con-
        verted to strings. I assume equa;ity testing works because the test uses
        a low-level, character-agnostic method.

        Mozart 2 will support Unicode encodings, so I have left Greek text as-is
        in Oz output strings.

VI. Programming Notes

1. Overview of morpheuslib Classes (brief)
        In quasi-functional notation the process frm text to output is:

        text (string) -> [morpheuslib.Word] -> [[morpheuslib.Analysis]]->[datum]
        
        Text is tokenized by class morpheuslib.WordStream into a stream of words
        ending with 0 or 1 separator or terminator characters. WordStream also 
        maintains word location by word, clause, and sentence ordinal, all 
        relative to text beginning, which is zero.

        Words are wrapped with location and language in instances of class Word.
        Instances of morpheuslib.MorpheusUrl use Word instances to form the URL 
        string, and then download the analyses XML document from Perseus. This 
        document is wrapped in instance of morpheuslib.Analyses, which supplies 
        instances of class Analysis through Python's iteration mechanism. Each 
        Analysis is convertible into string form for output to file or terminal. 

2. Caching
        Class morpheus.Cache implements two caches, volatile (not saved between
        runs) an persistent (saved). Both caches store Morpheus XML <analyses>
        as values. The key is the form submitted to the Morpheus (in the case of
        Greek, "cleansed" Beta Code). Cache.lookup() returns an Analyses object,
        so the fetch by url and cache lookup have the same interface.

        Both caches are implemented with Python dict objects. The persistent one
        is saved between runs with the Python pickle facility.

        The contents of the persistent cache are determined by the cachewords 
        files (.la and .greek). The words in them are mostly "function" words 
        which appear often in texts. The user can add or delete words from these
        files, and the dictionaries will be adjusted to match. The persistent
        cache can be (re)created with the commands:
        
        morpheus.py cachewords.la la
        morpheus.py cachewords.greek greek

        The volatile cache holds words not added to the persistent cache. It is
        most likely to be useful with long texts, or texts with a lot of 
        repetition.

3. Motivation
        My original motivation for writing this program was to create a more
        computationally useful form for morphological analysis data than XML.
        Prolog facts seemed to be a promising choice for that form.

4. Future directions
        1. Adding support for text that mixes Greek and Latin words.