PDF Deidentification, MPNet Classifier and Pipeline Tracer in NLU 5.4.0

We are excited to announce that NLU 5.4.0 has been released!
It adds support for deidentifying PDFs with a combination of OCR and Medical NLP models.
Additionally, you can now use MPNet for sequence classification, and the Pipeline Tracer is now supported.

Visual PDF Deidentification

Tutorial Notebook

Introducing our healthcare deidentification model for PDFs, which can be loaded with a single line of code. It combines components such as ner_deid_subentity_augmented, ContextualParser, RegexMatcher, and TextMatcher with a de-identification stage. It masks sensitive entities such as names, locations, and medical records in medical texts to support compliance and data security. Using OCR, it also redacts the detected information in the document before saving the processed file to the specified location.

nlu.load() reference	Spark NLP Model Reference
en.image_deid	pdf_deid_pdf_output

! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/deid2.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/download.pdf

from johnsnowlabs import nlp

# Load the model via its load() reference from the table above
model = nlp.load("en.image_deid")

# Provide the input and output paths (one output path per input file)
input_path, output_path = ['download.pdf', 'deid2.pdf'], ['download_deidentified.pdf', 'deid2_deidentified.pdf']

# Predict and save the deidentified PDFs
dfs = model.predict(input_path, output_path=output_path)
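
The example above pairs each input file with an output file named `<name>_deidentified.pdf`. For longer file lists, that pairing could be generated rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name `deidentified_paths` and its `suffix` parameter are illustrative assumptions and not part of NLU.

```python
from pathlib import Path


def deidentified_paths(input_paths, suffix="_deidentified"):
    """Return one output path per input path, e.g. a.pdf -> a_deidentified.pdf.

    Hypothetical helper for illustration; it is not part of the NLU API.
    """
    outputs = []
    for raw in input_paths:
        path = Path(raw.strip())  # drop stray whitespace around file names
        outputs.append(str(path.with_name(f"{path.stem}{suffix}{path.suffix}")))
    return outputs


input_path = ["download.pdf", "deid2.pdf"]
output_path = deidentified_paths(input_path)
print(output_path)  # ['download_deidentified.pdf', 'deid2_deidentified.pdf']
```

The resulting `input_path` and `output_path` lists can then be passed to `model.predict(input_path, output_path=output_path)` as shown above.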

MPNetForSequenceClassification

Tutorial Notebook

MPNetForSequenceClassification is a state-of-the-art annotator in Spark NLP, designed for sequence classification tasks. It uses the MPNet architecture, which combines the strengths of BERT and XLNet, addressing their limitations.

MPNet, or Masked and Permuted Pre-training for Language Understanding, models the dependency among predicted tokens through permuted language modeling, as XLNet does, and additionally takes auxiliary position information as input so that the model sees the positions of the full sentence. This reduces the position discrepancy seen in XLNet.

The annotator is suited to tasks such as document classification and sentiment analysis. Integrated into Spark NLP, it provides scalable and efficient sequence classification.

Read More: Paper

Language	nlp.load() reference	Spark NLP Model reference
en	en.classify.mpnet.ukr_message	mpnet_sequence_classifier_ukr_message
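
The model in the table can presumably be used like any other NLU pipeline: load it by its reference and call `predict`. Below is a minimal sketch under that assumption. The wrapper `classify_messages` and its `loader` parameter are hypothetical and only make the snippet easy to try without a Spark session; the only library calls used are `nlp.load` and `predict`.

```python
def classify_messages(texts, ref="en.classify.mpnet.ukr_message", loader=None):
    """Classify one or more texts with the MPNet sequence classifier.

    `classify_messages` and `loader` are illustrative assumptions, not NLU API.
    """
    if isinstance(texts, str):
        texts = [texts]
    if loader is None:
        # Imported lazily so the helper can be defined without Spark installed.
        from johnsnowlabs import nlp
        loader = nlp.load
    pipe = loader(ref)               # load the pipeline by its reference
    return pipe.predict(list(texts))  # predict accepts a list of strings
```

Calling `classify_messages(["some message"])` with the default loader requires a working johnsnowlabs installation; the columns of the returned result depend on the model and are not shown here.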

Pipeline Tracer

Tutorial Notebook

The PipelineTracer is now accessible on NLU pipelines. It is a versatile class designed to trace and analyze the stages of a pipeline, offering in-depth insights into entities, assertions, deidentification, classification, and relationships. It also facilitates the creation of parser dictionaries for building a PipelineOutputParser. Key functions include printing the pipeline schema, creating parser dictionaries, and retrieving possible assertions, relations, and entities. It also provides direct access to parser dictionaries and available pipeline schemas.

Load a pipe

pipe = nlp.load("en.explain_doc.clinical_oncology.pipeline")

Get all assertions predictable with pipe

pipe.getPossibleAssertions()
>>> ['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']

Get all entities predictable with pipe

pipe.getPossibleEntities()
>>> ['Cycle_Number','Direction','Histological_Type', .... ]

Get all relations predictable with pipe

pipe.getPossibleRelations()
>>> ['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']
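
The three getters above can be combined into a single overview of what a pipeline can predict. This is a minimal sketch; the helper name `describe_pipeline` is an assumption for illustration, and it relies only on the `getPossibleAssertions`, `getPossibleEntities`, and `getPossibleRelations` methods shown above.

```python
def describe_pipeline(pipe):
    """Collect the labels a loaded pipeline can predict into one dict.

    Hypothetical helper; it only calls the getters shown in this section.
    """
    return {
        "assertions": sorted(pipe.getPossibleAssertions()),
        "entities": sorted(pipe.getPossibleEntities()),
        "relations": sorted(pipe.getPossibleRelations()),
    }
```

For a pipe loaded as above, `describe_pipeline(pipe)["relations"]` would hold the same labels as `pipe.getPossibleRelations()`, in sorted order.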

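Predict with parsed output using a parser config

import pandas as pd

# `data` is the clinical text (or list of texts) to annotate; define it before running
column_maps = pipe.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
pipe = nlp.load("en.explain_doc.clinical_oncology.pipeline")
res = pipe.predict(data, parser_output=True, parser_config=column_maps)
# Flatten the parsed entities of the first result into a DataFrame
pd.json_normalize(res['result'][0]["entities"])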

Powered By: PipelineTracer

📖Additional NLU resources

140+ NLU Tutorials
Streamlit visualizations docs
The complete list of all 20000+ models & pipelines in 300+ languages is available on Models Hub
Spark NLP publications
NLU documentation
Discussions: engage with other community members, share ideas, and show off how you use Spark NLP and NLU!

Installation

pip install johnsnowlabs
