Merge pull request #263 from JohnSnowLabs/release/531
Release/531
C-K-Loan authored May 21, 2024
2 parents 909a4cc + 1f227de commit 76161f0
Showing 40 changed files with 3,006 additions and 28 deletions.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@ See how easy it is to use any of the **thousands** of models in 1 line of code,
This 1 line lets you visualize and play with **1000+ SOTA NLU & NLP models** in **200** languages

```shell
streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py
streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py
```
<img src="https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/docs/assets/streamlit_docs_assets/gif/start.gif">


Large diffs are not rendered by default.

183 changes: 183 additions & 0 deletions examples/colab/ocr/ocr_form_relation.ipynb
@@ -0,0 +1,183 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_form_relation_extractor.ipynb)\n",
"\n",
"[Tutorial Notebook](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_form_relation_extractor.ipynb \"https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_form_relation_extractor.ipynb\")\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"# **FormRelationExtractor**\n",
"\n",
"\n",
"The **FormRelationExtractor** identifies the relationships between the keys and values of a form. It works on the entities produced by a visual Named Entity Recognition (NER) model, such as VisualDocumentNer.\n",
"\n",
"**All the available models:**\n",
"\n",
"| NLU Spell | Transformer Class |\n",
"|----------------------|-----------------------------------------------------------------------------------------|\n",
"| nlu.load(`visual_form_relation_extractor`) | [FormRelationExtractor](https://nlp.johnsnowlabs.com/docs/en/ocr_visual_document_understanding#formrelationextractor) |"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## **Install NLU**"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install johnsnowlabs\n",
"from johnsnowlabs import nlp\n",
"nlp.install(visual=True,force_browser=True)\n",
"nlp.start(visual=True)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## **Form Relation Extraction**"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🚨 Outdated Medical Secrets in license file. Version=5.3.1 but should be Version=5.1.1\n",
"🚨 Outdated OCR Secrets in license file. Version=5.3.1 but should be Version=5.0.2\n",
"📋 Loading license number 0 from C:\\Users\\gadde/.johnsnowlabs\\licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json\n",
"👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.\n",
"👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.\n",
"👷 Setting up John Snow Labs home in C:\\Users\\gadde/.johnsnowlabs, this might take a few minutes.\n",
"Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.1.1.jar\n",
"🙆 JSL Home setup in C:\\Users\\gadde/.johnsnowlabs\n",
"🤓 Looks like you are missing some jars, trying fetching them ...\n",
"👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.\n",
"Downloading 🫘+💊 Java Library spark-nlp-jsl-5.1.1.jar\n",
"Downloading 🫘+🕶 Java Library spark-ocr-assembly-5.0.2.jar\n",
"🙆 JSL Home setup in C:\\Users\\gadde/.johnsnowlabs\n",
"👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.\n",
"👌 Launched \u001B[92mcpu optimized\u001B[39m session with with: 🚀Spark-NLP==5.3.1, 💊Spark-Healthcare==5.1.1, 🕶Spark-OCR==5.0.2, running on ⚡ PySpark==3.1.2\n",
"Warning::Spark Session already created, some configs may not take.\n",
"Warning::Spark Session already created, some configs may not take.\n",
"lilt_roberta_funsd_v1 download started this may take some time.\n",
"Approximate size to download 419.6 MB\n"
]
}
],
"source": [
"from johnsnowlabs import nlp,visual\n",
"model = nlp.load('visual_form_relation_extractor')"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-05-13T08:27:37.781697200Z",
"start_time": "2024-05-13T08:17:43.901075500Z"
}
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n"
]
}
],
"source": [
"res = model.predict(['form.png','form2.jpg'])"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 5,
"outputs": [
{
"data": {
"text/plain": " form_relation_prediction_key form_relation_prediction_value \\\n0 division allied health \n0 course hce \n0 number 116 \n0 title calculations medical dosage \n0 credits 2 \n0 developed by dr . by taz \n0 lecture / lab lecture / o ratio 2 \n0 course activity no \n0 cip code 51 . 0800 \n0 semester fall and \n0 ge category none \n0 separate lab no \n0 course awareness no \n0 course no \n1 name : dribbler , bbb \n1 study date : 12 - 09 - 2006 , 6 : 34 \n1 bp : 120 / 80 mmhg \n1 mrn : 12341820060912 \n1 patient location : room \n1 hr : 100 bpm \n1 dob : 19 - 06 - 1979 \n1 gender : male \n1 height : 123 cm \n1 age : 27 years \n1 weight : 25 kg \n1 reason for study : mi \n1 bsa : 0 . 92 m \n1 history : asfgfdgsdg \n1 medications : heparine , paracetamol \n1 performed . the study technically limited . \n1 . no \n\n path \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n0 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... 
\n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... \n1 file:/F:/Work/repos/nlu_new/ner/nlu/examples/c... ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>form_relation_prediction_key</th>\n <th>form_relation_prediction_value</th>\n <th>path</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>division</td>\n <td>allied health</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>course</td>\n <td>hce</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>number</td>\n <td>116</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>title</td>\n <td>calculations medical dosage</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>credits</td>\n <td>2</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>developed by</td>\n <td>dr . by taz</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>lecture / lab lecture / o ratio</td>\n <td>2</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>course activity</td>\n <td>no</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>cip code</td>\n <td>51 . 
0800</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>semester</td>\n <td>fall and</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>ge category</td>\n <td>none</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>separate lab</td>\n <td>no</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>course awareness</td>\n <td>no</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>0</th>\n <td>course</td>\n <td>no</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>name :</td>\n <td>dribbler , bbb</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>study date :</td>\n <td>12 - 09 - 2006 , 6 : 34</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>bp :</td>\n <td>120 / 80 mmhg</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>mrn :</td>\n <td>12341820060912</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>patient location :</td>\n <td>room</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>hr :</td>\n <td>100 bpm</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>dob :</td>\n <td>19 - 06 - 1979</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>gender :</td>\n <td>male</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>height :</td>\n <td>123 cm</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>age :</td>\n <td>27 years</td>\n 
<td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>weight :</td>\n <td>25 kg</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>reason for study :</td>\n <td>mi</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>bsa :</td>\n <td>0 . 92 m</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>history :</td>\n <td>asfgfdgsdg</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>medications :</td>\n <td>heparine , paracetamol</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>performed .</td>\n <td>the study technically limited .</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>.</td>\n <td>no</td>\n <td>file:/F:/Work/repos/nlu_new/ner/nlu/examples/c...</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res_filtered = res[['form_relation_prediction_key','form_relation_prediction_value','path']]\n",
"res_filtered"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-05-13T08:40:51.701641600Z",
"start_time": "2024-05-13T08:40:51.627215600Z"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"name": "myenv",
"language": "python",
"display_name": "myenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
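The `res_filtered` frame in the notebook above holds one row per key/value pair, with the DataFrame index and the `path` column identifying the source image. A minimal sketch of one way to regroup those rows per document follows. The helper name `relations_by_document` is hypothetical, the sample rows are hand-copied from the notebook output, and the `path` values are shortened placeholders because the real paths are truncated in the output.

```python
import pandas as pd

# Hand-copied sample of the notebook output. The `path` values are
# placeholders (assumption); the notebook shows truncated absolute paths.
res_filtered = pd.DataFrame(
    {
        "form_relation_prediction_key": ["division", "course", "course", "name :", "hr :"],
        "form_relation_prediction_value": ["allied health", "hce", "no", "dribbler , bbb", "100 bpm"],
        "path": ["form.png", "form.png", "form.png", "form2.jpg", "form2.jpg"],
    },
    index=[0, 0, 0, 1, 1],
)


def relations_by_document(df: pd.DataFrame) -> dict:
    """Group key/value pairs per source file; repeated keys keep every value."""
    docs = {}
    for path, key, value in zip(
        df["path"],
        df["form_relation_prediction_key"],
        df["form_relation_prediction_value"],
    ):
        clean_key = key.rstrip(" :")  # drop the trailing " :" kept on some keys
        docs.setdefault(path, {}).setdefault(clean_key, []).append(value)
    return docs


forms = relations_by_document(res_filtered)
```

Values are collected in lists because the same key can occur more than once in a form, as `course` does in the first document.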
968 changes: 968 additions & 0 deletions examples/colab/ocr/ocr_visual_document_ner.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion examples/colab/ocr/table_extraction.ipynb
@@ -2752,4 +2752,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}
2 changes: 1 addition & 1 deletion nlu/__init__.py
@@ -1,4 +1,4 @@
__version__ = '5.1.5rc19'
__version__ = '5.3.1'


import nlu.utils.environment.env_utils as env_utils
Empty file.
@@ -0,0 +1,8 @@

class FormRelationExtractor:
    @staticmethod
    def get_default_model():
        from sparkocr.transformers import FormRelationExtractor
        return FormRelationExtractor() \
            .setInputCol("text_entity") \
            .setOutputCol("ocr_relations")
Empty file.
7 changes: 7 additions & 0 deletions nlu/ocr_components/utils/hocr_tokenizer/hocr_tokenizer.py
@@ -0,0 +1,7 @@
class HocrTokenizer:
    @staticmethod
    def get_default_model():
        from sparkocr.transformers import HocrTokenizer
        return HocrTokenizer() \
            .setInputCol("hocr") \
            .setOutputCol("text_tokenized")
Empty file.
Empty file.
@@ -0,0 +1,8 @@
class VisualDocumentNer:
    @staticmethod
    def get_default_model():
        from sparkocr.transformers import VisualDocumentNer
        return VisualDocumentNer()\
            .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")\
            .setInputCols(["text_tokenized", "image"])\
            .setOutputCol("text_entity")
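The three default models added in this commit chain through their column names: `HocrTokenizer` writes `text_tokenized`, `VisualDocumentNer` reads it and writes `text_entity`, and `FormRelationExtractor` reads that and writes `ocr_relations`. The sketch below only checks that wiring in plain Python and does not call Spark OCR. The `Stage` class and `missing_inputs` helper are hypothetical, and it assumes the `hocr` and `image` columns come from upstream stages not shown in this diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    inputs: List[str]
    output: str


# Column names copied from the three get_default_model() helpers in this commit.
STAGES = [
    Stage("HocrTokenizer", ["hocr"], "text_tokenized"),
    Stage("VisualDocumentNer", ["text_tokenized", "image"], "text_entity"),
    Stage("FormRelationExtractor", ["text_entity"], "ocr_relations"),
]


def missing_inputs(stages: List[Stage], available: List[str]) -> List[str]:
    """Return 'stage.column' for every input that no earlier stage provides."""
    have = set(available)
    missing = []
    for stage in stages:
        missing += [f"{stage.name}.{col}" for col in stage.inputs if col not in have]
        have.add(stage.output)  # later stages may consume this column
    return missing


# Assumption: `hocr` and `image` are produced before these three stages run.
unresolved = missing_inputs(STAGES, ["hocr", "image"])
```

An empty `unresolved` list means every stage finds its input columns, given the assumed upstream columns.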
24 changes: 20 additions & 4 deletions nlu/pipe/col_substitution/col_name_substitution_utils.py
@@ -14,9 +14,6 @@

import nlu
from nlu.pipe.col_substitution import substitution_map_OS
from nlu.universe.feature_universes import NLP_FEATURES
from nlu.pipe.col_substitution import substitution_map_OS
from nlu.pipe.col_substitution import col_substitution_OS
import logging

from nlu.pipe.extractors.extractor_base_data_classes import SparkOCRExtractorConfig
@@ -139,7 +136,26 @@ def get_final_output_cols_of_component(c, df, anno_2_ex) -> List[str]:
result_cols = []
if isinstance(configs, SparkOCRExtractorConfig):
# TODO better OCR-EX handling --> Col Name generator function which we use everywhere for unified col naming !!!!!
return ['text']
# return ['text']
for col in df.columns:
if 'meta_' + configs.output_col_prefix in col:
base_meta_prefix = 'meta_' + configs.output_col_prefix
meta_col_name = base_meta_prefix + col.split(base_meta_prefix)[-1]
if meta_col_name in df.columns:
# special case for overlapping names with _
if col.split(base_meta_prefix)[-1].split('_')[1].isnumeric() and not \
c.spark_output_column_names[0].split('_')[-1].isnumeric(): continue
if col.split(base_meta_prefix)[-1].split('_')[1].isnumeric() and \
c.spark_output_column_names[0].split('_')[-1].isnumeric():
id1 = int(col.split(base_meta_prefix)[-1].split('_')[1])
id2 = int(c.spark_output_column_names[0].split('_')[-1])
if id1 != id2: continue
result_cols.append(meta_col_name)
elif c.type == AnnoTypes.CHUNK_CLASSIFIER:
result_cols.append(col)
else:
logger.info(f"Could not find meta col for os_components={c}, col={col}. Omitting col..")
return result_cols
if isinstance(c.model, MultiDocumentAssembler):
return [f'{NLP_FEATURES.DOCUMENT_QUESTION}_results', f'{NLP_FEATURES.DOCUMENT_QUESTION_CONTEXT}_results']
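The new OCR branch above picks the `meta_` columns that belong to one component and skips columns carrying a numeric suffix that belongs to a differently numbered component. A simplified, standalone sketch of that selection rule follows. The function name `select_meta_cols` and the example column names are hypothetical, and the sketch leaves out the `AnnoTypes.CHUNK_CLASSIFIER` branch and the logging of the original.

```python
from typing import List


def select_meta_cols(columns: List[str], output_col_prefix: str, spark_output_col: str) -> List[str]:
    """Pick the meta columns that belong to one component (simplified sketch)."""
    base = 'meta_' + output_col_prefix
    out_suffix = spark_output_col.split('_')[-1]
    selected = []
    for col in columns:
        if base not in col:
            continue
        tail = col.split(base)[-1]  # e.g. '_confidence' or '_1_confidence'
        parts = tail.split('_')
        col_id = parts[1] if len(parts) > 1 else ''
        if col_id.isnumeric():
            # Numbered column: keep it only when this component's output
            # column carries the same number.
            if not out_suffix.isnumeric() or int(col_id) != int(out_suffix):
                continue
        selected.append(base + tail)
    return selected


# Hypothetical column names, for illustration only.
cols = ["text", "meta_form_relation_confidence", "meta_form_relation_1_confidence", "path"]
unnumbered = select_meta_cols(cols, "form_relation", "ocr_relations")
numbered = select_meta_cols(cols, "form_relation", "ocr_relations_1")
```

Unnumbered meta columns are kept for every component, which mirrors the original: both of its `continue` checks fire only when the column itself carries a numeric id.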
