We are excited to announce NLU 5.4.0 has been released!
It comes with support for deidentifying PDFs, leveraging a combination of OCR and medical NLP models.
Additionally, you can now leverage MPNet for sequence classification, and PipelineTracer is now supported.
Visual PDF Deidentification
Introducing our advanced healthcare deidentification model, effortlessly deployable with a single line of code. This powerful solution integrates state-of-the-art algorithms like ner_deid_subentity_augmented, ContextualParser, RegexMatcher, and TextMatcher, alongside a streamlined de-identification stage. It efficiently masks sensitive entities such as names, locations, and medical records, ensuring compliance and data security in medical texts. Utilizing OCR capabilities, it also redacts detected information before saving the processed file to the specified location.
Powered By: PdfToImage, ImageDrawRegions, ImageToPdf, PositionFinder
| nlu.load() reference | Spark NLP Model Reference |
|---|---|
| en.image_deid | pdf_deid_pdf_output |
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/deid2.pdf
! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/download.pdf
import nlu
# Load the visual deidentification pipeline
model = nlu.load('en.image_deid')
# Provide the input and the output paths
input_path, output_path = ['download.pdf', 'deid2.pdf'], ['download_deidentified.pdf', 'deid2_deidentified.pdf']
# Predict and save the deidentified PDFs
dfs = model.predict(input_path, output_path=output_path)
MPNetForSequenceClassification
MPNetForSequenceClassification is a state-of-the-art annotator in Spark NLP, designed for sequence classification tasks. It uses the MPNet architecture, which combines the strengths of BERT and XLNet, addressing their limitations.
MPNet, or Masked and Permuted Pre-training for Language Understanding, improves token dependency understanding and sentence position information. This enhances sentence structure comprehension and reduces position discrepancies seen in XLNet.
The annotator excels in tasks like document classification and sentiment analysis, offering superior performance due to its innovative pre-training and fine-tuning on large datasets. Integrated into Spark NLP, it ensures scalable, efficient, and high-accuracy sequence classification.
Read More: Paper
Powered by MPNet
| Language | nlp.load() reference | Spark NLP Model reference |
|---|---|---|
| en | en.classify.mpnet.ukr_message | mpnet_sequence_classifier_ukr_message |
Pipeline Tracer
The PipelineTracer is now accessible on NLU pipelines. It is a versatile class designed to trace and analyze the stages of a pipeline, offering in-depth insights into entities, assertions, deidentification, classification, and relationships. It also facilitates the creation of parser dictionaries for building a PipelineOutputParser. Key functions include printing the pipeline schema, creating parser dictionaries, and retrieving the possible assertions, relations, and entities. It also provides direct access to parser dictionaries and the available pipeline schemas.
Load a pipe
pipe = nlp.load("en.explain_doc.clinical_oncology.pipeline")
Get all assertions predictable with pipe
pipe.getPossibleAssertions()
>>> ['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']
Get all entities predictable with pipe
pipe.getPossibleEntities()
>>> ['Cycle_Number','Direction','Histological_Type', .... ]
Get all relations predictable with pipe
pipe.getPossibleRelations()
>>> ['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']
Predict and parse the output with custom configs
column_maps = pipe.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
pipe = nlp.load("en.explain_doc.clinical_oncology.pipeline")
res = pipe.predict(data, parser_output=True, parser_config=column_maps)
pd.json_normalize(res['result'][0]["entities"])
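Since `parser_output=True` returns a JSON-like dict, flattening it with `pandas.json_normalize` works as for any nested record. A minimal, self-contained sketch using a hypothetical parser result (the field names here are illustrative, not the oncology pipeline's actual schema):

```python
import pandas as pd

# Hypothetical parser output mimicking the {"result": [...]} shape used above;
# the keys actually produced by the pipeline may differ.
res = {
    "result": [
        {
            "document_identifier": "clinical_deidentification",
            "entities": [
                {"chunk": "tumor", "entity": "Histological_Type", "begin": 10, "end": 14},
                {"chunk": "left breast", "entity": "Direction", "begin": 22, "end": 32},
            ],
        }
    ]
}

# Flatten the entity list of the first document: one row per detected entity
df = pd.json_normalize(res["result"][0]["entities"])
print(df)
```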
Powered By: PipelineTracer
📖Additional NLU resources
- 140+ NLU Tutorials
- Streamlit visualizations docs
- The complete list of all 20000+ models & pipelines in 300+ languages is available on Models Hub
- Spark NLP publications
- NLU documentation
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP and NLU!
Installation
pip install johnsnowlabs