alignment output idea. #5

Open
da1nerd opened this issue May 2, 2018 · 6 comments

Comments

@da1nerd
Contributor

da1nerd commented May 2, 2018

We need to support alignments across verses.
Here are three possible solutions.

{
  "confidence": 0.516905944153279,
  "sourceNgram": [0],
  "targetNgram": [10, { // add an object
    "position": 0,
    "verse": 2,
    "chapter": 3
  }],
  // or separate object
  "versification": {
    "target": {
      "nextVerseId": 1
    },
    "source": {}
  }
},

Or we can just keep the numeric ids and format them like this: 11001, i.e. chapter 11 and verse 1.

@da1nerd
Contributor Author

da1nerd commented May 2, 2018

This is better

{
  "confidence": 0.516905944153279,
  "sourceNgram": [0],
  "targetNgram": [10, 0, 1, 10001]
},

@jag3773
Contributor

jag3773 commented May 4, 2018

@neutrinog Can you explain what the values in [10, 0, 1, 10001] refer to?

@da1nerd
Contributor Author

da1nerd commented May 4, 2018

NOTE: I think the 10 was in there by accident.

Structure

Given a context of verse 9, and that verse 9 contains two tokens, here is an alignment of three tokens from the target text to one token in the source text:

{
  "confidence": 0.516905944153279,
  "sourceNgram": [0],
  "targetNgram": [0, 1, 10001]
}

Within the targetNgram we see two tokens from verse 9, indicated by the positional values 0 and 1.
Additionally, the alignment includes the second token (index 1, zero-indexed) of verse 10, indicated by 10001.

Rules

Referring to tokens outside of the current context proceeds as follows:

Prepend additional context as required.

  • If from a different verse, prepend the numerical verse number.
  • If from a different chapter, prepend the numerical chapter number.

NOTE: Aligning tokens across different books is not supported, and we believe it is unnecessary.

As additional context is prepended, the previous context must be zero-filled to three digits.

Here the above example is shown in its expanded, simplified, and parsed forms:

form      chapter  verse  token
expanded  000      010    001
simple             10     001
parsed             10     1
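
The prepend-and-zero-fill rules above can be sketched as a small encoder. This is only an illustration under assumptions: `TokenRef` and `encodeTokenRef` are hypothetical names, not part of wordMap, and the sketch assumes verse and token values each fit in three digits.

```typescript
// Hypothetical helper illustrating the rules above; not part of wordMap.
interface TokenRef {
  chapter?: number; // set only when the token is in a different chapter
  verse?: number;   // set only when the token is in a different verse
  token: number;    // zero-indexed token position within its verse
}

function encodeTokenRef(ref: TokenRef): number {
  // Each field below the leading one occupies exactly three digits.
  const check = (n: number, name: string): void => {
    if (Math.floor(n) !== n || n < 0 || n > 999) {
      throw new Error(name + " must be an integer between 0 and 999");
    }
  };
  check(ref.token, "token");
  if (ref.chapter !== undefined) {
    // A chapter prefix only makes sense together with a verse.
    if (ref.verse === undefined) {
      throw new Error("a chapter requires a verse");
    }
    check(ref.verse, "verse");
    // Multiplying by powers of 1000 is equivalent to zero-filling
    // the lower fields to three digits.
    return ref.chapter * 1000000 + ref.verse * 1000 + ref.token;
  }
  if (ref.verse !== undefined) {
    check(ref.verse, "verse");
    return ref.verse * 1000 + ref.token;
  }
  // Same verse: the plain positional index is used as-is.
  return ref.token;
}

// The example from this comment: token 1 of verse 10.
const encoded = encodeTokenRef({ verse: 10, token: 1 }); // 10001
```

Using arithmetic rather than string padding keeps the stored value a plain number, which matches the numeric entries in targetNgram.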

To parse such a value, cast it to a string and split it into chunks of 3 characters, starting from the end (right side).
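
A minimal sketch of that right-to-left chunking follows. The names `ParsedTokenRef` and `parseTokenRef` are assumptions made for illustration and do not come from wordMap.

```typescript
// Hypothetical parser for the numeric form described above; not part of wordMap.
interface ParsedTokenRef {
  chapter?: number;
  verse?: number;
  token: number;
}

function parseTokenRef(value: number): ParsedTokenRef {
  if (Math.floor(value) !== value || value < 0) {
    throw new Error("value must be a non-negative integer");
  }
  const digits = String(value);
  const chunks: number[] = [];
  // Walk from the right-hand end in steps of three characters.
  // The leftmost chunk may be shorter than three characters.
  for (let end = digits.length; end > 0; end -= 3) {
    chunks.unshift(Number(digits.slice(Math.max(0, end - 3), end)));
  }
  if (chunks.length > 3) {
    throw new Error("more than chapter, verse and token were given");
  }
  // The last chunk is always the token; earlier chunks are optional context.
  const result: ParsedTokenRef = { token: chunks[chunks.length - 1] };
  if (chunks.length >= 2) {
    result.verse = chunks[chunks.length - 2];
  }
  if (chunks.length === 3) {
    result.chapter = chunks[0];
  }
  return result;
}

// The example from this comment: 10001 parses to verse 10, token 1.
const parsed = parseTokenRef(10001);
```

Values without any prepended context, such as 0 or 1, come back with only `token` set, so the caller can fall back to the current verse.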

@da1nerd
Contributor Author

da1nerd commented May 4, 2018

@klappy fyi, I included this description ^

@da1nerd
Contributor Author

da1nerd commented May 8, 2019

The most recent approach being considered involves storing a context id inside of the tokens. This will allow wordMap to be agnostic to the concept of crossing verse and chapter boundaries.

Here is a contrived example in which wordMap has received two tokens:

{
  "text": "Lord",
  "occurrence": 1,
  "occurrences": 1,
  "contextId": "BOOK001001"
}
{
  "text": "The",
  "occurrence": 1,
  "occurrences": 1,
  "contextId": "BOOK001002"
}

In this case the token at index 0 is from verse 1 and the token at index 1 is from verse 2.
wordMap will be able to process these tokens as usual, and the alignment will contain these token objects for later reference. See 4499068 as an example of passing the token object to the output.

With this method it should be noted that cross-verse alignment would not be supported (at least not in a deterministic way) with simple string input to wordMap. The input must be pre-tokenized, with the context id added as needed.
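
As a rough illustration of what the context id makes possible on the consumer side, the sketch below checks whether the tokens of one alignment come from more than one context. `ContextToken`, `distinctContexts` and `crossesContext` are hypothetical names; the interface only mirrors the JSON shown above and is not wordMap's actual token class. The context id is treated as an opaque string, since its internal layout is not specified here.

```typescript
// Hypothetical token shape mirroring the JSON above; not wordMap's token class.
interface ContextToken {
  text: string;
  occurrence: number;
  occurrences: number;
  contextId: string; // treated as an opaque identifier
}

// Collect the context ids used by a group of tokens, in first-seen order.
function distinctContexts(tokens: ContextToken[]): string[] {
  const seen: string[] = [];
  for (const t of tokens) {
    if (seen.indexOf(t.contextId) === -1) {
      seen.push(t.contextId);
    }
  }
  return seen;
}

// True when the tokens come from more than one context (e.g. two verses).
function crossesContext(tokens: ContextToken[]): boolean {
  return distinctContexts(tokens).length > 1;
}

// The two tokens from the example above.
const exampleTokens: ContextToken[] = [
  { text: "Lord", occurrence: 1, occurrences: 1, contextId: "BOOK001001" },
  { text: "The", occurrence: 1, occurrences: 1, contextId: "BOOK001002" },
];

const spansVerses = crossesContext(exampleTokens); // true
```

Because the check lives outside wordMap and only compares ids, wordMap itself stays agnostic to verse and chapter boundaries, as described above.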

@da1nerd
Contributor Author

da1nerd commented May 8, 2019

@PhotoNomad0 ☝️
