mfekadu/nimbus-transformer

nimbus-transformer

It's like Nimbus, but it uses a transformer language model.

Written in a Functional Programming style.

Getting Started

Works with macOS, Linux, Windows.

2. Setup virtual environment

pipenv install

This will create a virtual environment with the required dependencies.

3. Open virtual environment

pipenv shell

4. Verify your python version

$ python --version
Python 3.6.8

Usage

from ntfp.ntfp import get_context, transformer
question = "what is Dr. Foaad Khosmood email?"
_, _, context = get_context(question)
answer, _ = transformer(question, context)
print("answer: ", answer)
>>> answer:  foaad@ calpoly.edu.

Demo

$ python main.py
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html


question: what is Dr. Foaad Khosmood email?
len(context):  911
Converting examples to features: 100%|██| 1/1 [00:00<00:00, 95.61it/s]



answer:  foaad@ calpoly.edu.
appended new row to data.csv

demo.png

How it works

Assumptions

  • "Context" is limited to Cal Poly, so expect non-Cal-Poly "Questions" to fail
  • "Answer" is expected to exist publicly on the web, such that Google can access it.

Pipeline

  1. User asks a Question through a web application.
  2. Scrape Google for Context, limited to 10 URL results.
  3. Store Context into database.
  4. Transform ( Question, Context ) >> Answer
  5. Reply with Answer
  6. Mark the Answer as good or bad, to learn from later.
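
The pipeline above could be sketched as a composition of small functions, in keeping with the functional style of this project. This is a minimal sketch, not the code in ntfp: the names answer_question and mark, the type aliases, and the record layout are all hypothetical, and the scraper, reader, and store are passed in as arguments so that get_context and transformer (or test stubs) can be plugged in.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases for illustration; they are not part of ntfp.
Scraper = Callable[[str, int], List[str]]           # (question, limit) -> passages
Reader = Callable[[str, str], Tuple[str, object]]   # (question, context) -> (answer, extra)
Store = Callable[[dict], None]                      # persists one record


def answer_question(question: str, scrape: Scraper, read: Reader,
                    store: Store, limit: int = 10) -> dict:
    """Run steps 2 to 5 of the pipeline for one question."""
    passages = scrape(question, limit)[:limit]       # step 2: at most `limit` results
    context = "\n".join(passages)
    store({"question": question, "context": context})  # step 3: keep the Context
    answer, _ = read(question, context)              # step 4: (Question, Context) >> Answer
    is_good: Optional[bool] = None                   # not yet judged by anyone
    return {"question": question, "context": context,
            "answer": answer, "is_good": is_good}    # step 5: caller replies with "answer"


def mark(record: dict, is_good: bool) -> dict:
    """Step 6: return a labelled copy of the record, leaving the original untouched."""
    return {**record, "is_good": is_good}
```

Passing the collaborators in as arguments keeps each step free of hidden state, so the whole flow can be exercised with plain lambdas and a list instead of Google and a database.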

TODO

  • a simple web UI with an input box and a section for answers
    • if bad-answer then offer user a toggle: isItAnyOf(ans1,ans2..)
    • if user does not choose a toggle then mark as possibly-answerable
    • set up a nice UI for verification team to complete task.
  • database code for
  • test performance
    • avoid test generation by code because the test itself should not depend on subject-under-test.
    • measure precision & recall of this system
  • make improvements to assumptions
  • consider git rev-parse HEAD to get latest commit hash to associate with data.
  • consider learning new facts from TrustedUser
    • e.g. Dr. Khosmood is a TrustedUser and can offer the system either:
      • URL
        • e.g. a published google doc containing a professor's syllabus.
        • e.g. a professor's personal website
      • UserContext
        • e.g. the plain-text of a professor's syllabus.
        • either provided through real-time chat client
        • or provided through a simple input box
        • also consider ChatContext
      • (Question, Answer) mappings
      • so, when any User asks a previously mapped question, then the correct answer can be returned
      • or, when the most relevant UserContext is found for the given question, a reasonable answer can still be returned.
  • question/answer data augmentation
    • remember augmentations need grammar check by human
    • try Question-Paraphrasing
    • also try style-transformations
      • "PHRASE REPLACEMENT TRANSFORM" (Khosmood, pg. 118)
        • I wanted to be with you alone
          • => I desired to be with you only.
        • class phraseXform
          • update it to latest technologies: SpaCy! BabelNet?
        • similar to /r/IncreasinglyVerbose
        • I teach at Cal Poly
          • => I teach at a university in California
            • (replace Cal Poly with definition)
          • => I impart skills or knowledge to students at a university in California
            • (replace teach with definition and append students)
          • => I impart skills or knowledge to students at an establishment where a seat of higher learning is housed in California
            • (replace university with definition)
          • => I impart skills or knowledge to students at an establishment where a seat of higher learning is housed in San Luis Obispo, California
            • (apply knowledge of city location of Cal Poly)
        • "Translation-Tours" (Khosmood, pg. 141)
          • "Translation tour with Spanish, French, German" (Khosmood, pg. 141)
            • I teach at Cal Poly
              • => Enseño en Cal Poly (English => Spanish)
              • => J'enseigne à Cal Poly (Spanish => French)
              • => Ich unterrichte an der Cal Poly (French => German)
              • => I teach at Cal Poly (German => English)
            • I teach at Cal Poly.
              • => Doy clases en Cal Poly. (English => Spanish)
              • => Ich unterrichte an der Cal Poly.
              • => I teach at Cal Poly.
          • Alternative Translation Tours
            • I teach at Cal Poly
              • => እኔ በካሊ ፖሊ አስተምራለሁ ፡፡ (English => Amharic)
              • => I teach by Kali Poly. (Amharic => English)
  • chart useful metrics
    • e.g. average confidence score of the transformer over time (or over code changes), which requires logging the commit hash
    • e.g. lexical similarity (fuzz ratio) of question to context over time (or over code changes), which requires logging the commit hash
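
Two of the TODO items above (associating data with the latest commit hash, and charting lexical similarity) might look roughly like the sketch below. The function names are hypothetical, and difflib.SequenceMatcher from the standard library is used here only as a stand-in for a fuzz ratio; a dedicated fuzzy-matching package may score strings differently. The subprocess call sticks to arguments available in Python 3.6.

```python
import subprocess
from difflib import SequenceMatcher
from typing import Optional


def current_commit_hash() -> Optional[str]:
    """Return the output of `git rev-parse HEAD`, or None if it cannot be obtained."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            universal_newlines=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git is not installed, or this is not a git repository
    return result.stdout.strip()


def lexical_similarity(question: str, context: str) -> int:
    """Case-insensitive similarity of two strings on a 0 to 100 scale."""
    ratio = SequenceMatcher(None, question.lower(), context.lower()).ratio()
    return round(100 * ratio)
```

Logging both values with every answered question would make it possible to plot a metric against code changes rather than only against time.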

What is data.csv?

data.csv is a temporary "database" for appending question samples with the generated meta-data and final answer of this system.
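
As a rough illustration of this append-only "database", the sketch below adds one row per question using the standard csv module. The column names in FIELDS and the function name append_row are assumptions made for this example; the actual columns of data.csv may differ.

```python
import csv
import os
from typing import Dict, Sequence

# Hypothetical column names; the real data.csv may use different ones.
FIELDS = ("question", "context_length", "answer", "commit_hash")


def append_row(path: str, row: Dict[str, object],
               fields: Sequence[str] = FIELDS) -> None:
    """Append one row to a CSV file, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fields))
        if needs_header:
            writer.writeheader()
        writer.writerow(row)  # keys missing from `row` are written as empty cells
```

For example, append_row("data.csv", {"question": "what is Dr. Foaad Khosmood email?", "context_length": 911, "answer": "foaad@ calpoly.edu."}) would add a row with an empty commit_hash cell.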

Keeping track of this data will help with measuring the model's performance and making improvements based on performance metrics.
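
One possible performance metric is token-level precision and recall between the system's answer and a human-verified reference answer. The sketch below is an assumption about how that could be measured, not something this project already does: it presumes a reference answer is available for each question, and token_precision_recall is a hypothetical name.

```python
from collections import Counter
from typing import Tuple


def token_precision_recall(predicted: str, reference: str) -> Tuple[float, float]:
    """Compare two answers by their whitespace-separated, lower-cased tokens."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0  # nothing to compare against
    # Multiset intersection: each shared token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return precision, recall
```

A long answer that contains the reference scores high recall but low precision, which is useful for spotting answers that are correct yet padded with extra context.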

data.png
