PDF Deidentification, MPNet Classifier and Pipeline Tracer in NLU 5.4.0

@C-K-Loan released this 13 Jul 16:15

We are excited to announce that NLU 5.4.0 has been released!
It adds support for deidentifying PDFs by combining OCR with medical NLP models.
Additionally, you can now use MPNet for sequence classification, and the Pipeline Tracer is now supported.


Visual PDF Deidentification

Tutorial Notebook

This release introduces a healthcare deidentification pipeline for PDFs that can be loaded with a single line of code. It combines components such as ner_deid_subentity_augmented, ContextualParser, RegexMatcher, and TextMatcher with a de-identification stage. It masks sensitive entities such as names, locations, and medical records, supporting compliance and data security in medical texts. Using OCR, it also redacts the detected information in the document before saving the processed file to the specified location.

Powered By: PdfToImage, ImageDrawRegions, ImageToPdf, PositionFinder

nlu.load() reference | Spark NLP Model Reference
en.image_deid | pdf_deid_pdf_output
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/deid2.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/download.pdf

from johnsnowlabs import nlp

# Load the PDF deidentification pipeline
model = nlp.load('en.image_deid')

# Provide the input and output paths (one output path per input file)
input_path = ['download.pdf', 'deid2.pdf']
output_path = ['download_deidentified.pdf', 'deid2_deidentified.pdf']

# Predict and save the deidentified PDFs
dfs = model.predict(input_path, output_path=output_path)
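
When many PDFs are processed, the output paths can be derived from the input paths rather than typed by hand. The following is a minimal sketch using only the standard library. The helper name `deidentified_paths` and the `_deidentified` suffix are assumptions modeled on the file names in the example above; they are not part of NLU.

```python
from pathlib import Path


def deidentified_paths(input_paths, suffix="_deidentified"):
    """Return one output path per input path, e.g. a.pdf -> a_deidentified.pdf.

    Hypothetical helper for illustration; not an NLU function.
    """
    outputs = []
    for raw in input_paths:
        path = Path(raw.strip())  # tolerate stray whitespace around a path
        outputs.append(str(path.with_name(path.stem + suffix + path.suffix)))
    return outputs


input_path = ["download.pdf", "deid2.pdf"]
output_path = deidentified_paths(input_path)
print(output_path)
```

The two resulting lists can then be passed to `model.predict(input_path, output_path=output_path)` as in the example above.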



MPNetForSequenceClassification

Tutorial Notebook

MPNetForSequenceClassification is a state-of-the-art annotator in Spark NLP, designed for sequence classification tasks. It uses the MPNet architecture, which combines the strengths of BERT and XLNet, addressing their limitations.

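MPNet (Masked and Permuted Pre-training for Language Understanding) models the dependency among predicted tokens through permuted language modeling, which BERT's masked language modeling neglects. It also feeds the model auxiliary position information so that it sees the positions of the full sentence, which reduces the position discrepancy between pre-training and fine-tuning seen in XLNet.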
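
To make the "masked and permuted" idea concrete, here is a simplified sketch of the input layout described in the MPNet paper: the tokens are permuted, the last few are chosen for prediction, and mask placeholders carrying the positions of those tokens are added so that every position of the sentence is visible. The function name `mpnet_style_input` and the `[MASK]` string are assumptions for illustration; this is not the Spark NLP implementation, and it omits attention masking and training entirely.

```python
import random

MASK = "[MASK]"  # placeholder string chosen for this sketch


def mpnet_style_input(tokens, num_predicted, seed=0):
    """Lay out tokens and positions in the style described in the MPNet paper."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # permuted order of the original positions
    split = len(tokens) - num_predicted
    context, predicted = order[:split], order[split:]

    # Non-predicted tokens, then mask placeholders, then the tokens to predict.
    layout_tokens = (
        [tokens[i] for i in context]
        + [MASK] * len(predicted)
        + [tokens[i] for i in predicted]
    )
    # Each mask placeholder carries the position of the token it stands in for,
    # so the positions of the whole sentence appear in the input.
    layout_positions = context + predicted + predicted
    return layout_tokens, layout_positions


sentence = "the task is sentence classification".split()
layout_tokens, layout_positions = mpnet_style_input(sentence, num_predicted=2)
for token, position in zip(layout_tokens, layout_positions):
    print(position, token)
```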

The annotator targets tasks such as document classification and sentiment analysis, building on MPNet's pre-training and on fine-tuning for the target task. Because it is integrated into Spark NLP, sequence classification can be run at scale.

Read More: Paper

Powered by MPNet

Language | nlp.load() reference | Spark NLP Model reference
en | en.classify.mpnet.ukr_message | mpnet_sequence_classifier_ukr_message

Pipeline Tracer

Tutorial Notebook

The PipelineTracer is now accessible on NLU pipelines. It is a versatile class designed to trace and analyze the stages of a pipeline, offering in-depth insights into entities, assertions, deidentification, classification, and relationships. It also facilitates the creation of parser dictionaries for building a PipelineOutputParser. Key functions include printing the pipeline schema, creating parser dictionaries, and retrieving possible assertions, relations, and entities. It also provides direct access to parser dictionaries and available pipeline schemas.

Load a pipe

pipe = nlp.load("en.explain_doc.clinical_oncology.pipeline")

Get all assertions predictable with pipe

pipe.getPossibleAssertions()
>>> ['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']

Get all entities predictable with pipe

pipe.getPossibleEntities()
>>> ['Cycle_Number','Direction','Histological_Type', .... ] 

Get all relations predictable with pipe

pipe.getPossibleRelations()
>>> ['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']
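
One use for these lists is to check, before running downstream code, that the labels it expects are ones the pipeline can actually predict. The sketch below assumes a hypothetical helper `unsupported_labels` and a made-up label `Planned`; the list of assertions is copied from the output shown above.

```python
def unsupported_labels(requested, possible):
    """Return the requested labels that are missing from the possible labels."""
    possible_set = set(possible)
    return [label for label in requested if label not in possible_set]


# Copied from the getPossibleAssertions() output shown above.
possible_assertions = ['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']

# 'Planned' is a made-up label used only to show a failing check.
missing = unsupported_labels(["Present", "Planned"], possible_assertions)
print(missing)
```

In practice, the `possible` argument would be the list returned by `pipe.getPossibleAssertions()`, `pipe.getPossibleEntities()`, or `pipe.getPossibleRelations()`.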

Predict parsed output with a parser config

import pandas as pd

column_maps = pipe.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
pipe = nlp.load("en.explain_doc.clinical_oncology.pipeline")
# `data` holds the input text(s) to annotate
res = pipe.predict(data, parser_output=True, parser_config=column_maps)
# Flatten the parsed entities of the first result into a DataFrame
pd.json_normalize(res['result'][0]["entities"])


Powered By: PipelineTracer


📖Additional NLU resources


Installation

pip install johnsnowlabs