Transformer model may be ignoring general entity types #1463

michhar · 2024-10-07T17:41:29Z

Amazing project, thank you so much for presidio and all of the work here.

I am noticing, with a few different Hugging Face transformer models, that some of the entities listed as supported by the engine are not being picked up, even with a mapping.

I am using the following on macOS:

presidio_analyzer==2.2.355
presidio_anonymizer==2.2.355

For example, here is my code to set up the engine:

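from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine
from presidio_anonymizer import AnonymizerEngine

# Transformer model config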
tf_model_config = [
    {"lang_code": "en",
     "model_name": {
         "spacy": "en_core_web_sm", 
         "transformers": "StanfordAIMI/stanford-deidentifier-base"
    }
}]

# Entity mappings
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="ORGANIZATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="ORGANIZATION",
    FACILITY="LOCATION"
)

tf_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand", # "strict", "contract", "expand"
    aggregation_strategy="max") # "simple", "first", "average", "max"

tf_engine = TransformersNlpEngine(
    models=tf_model_config,
    ner_model_configuration=tf_model_configuration)

# Transformer-based analyzer
analyzer_tf = AnalyzerEngine(
    nlp_engine=tf_engine, 
    supported_languages=["en"]
)

# Default anonymizer
anonymizer_default = AnonymizerEngine()

When I list the entities for the transformer engine with this code:

print(f'Entities for transformer engine:  {analyzer_tf.nlp_engine.get_supported_entities()}')

I get Entities for transformer engine: ['DATE_TIME', 'NRP', 'EMAIL', 'PHONE_NUMBER', 'ORGANIZATION', 'AGE', 'LOCATION', 'PERSON', 'ID']

Then, I do the following to call the engine/analyzer+anonymizer:

for text in text_gens[:5]:
   
    analyzer_tf_results = analyzer_tf.analyze(text=text, language="en")
    anonymized_tf_results = anonymizer_default.anonymize(
        text=text, analyzer_results=analyzer_tf_results
    )

    print(f"Transformer anonymized text    : {anonymized_tf_results.text}\n")

Even though my texts contain many instances of typical US addresses, I never get LOCATION as one of my entities. (By the way, the text below is completely made up / synthetic and does not reflect any real people or businesses!)

Here is one example output from the loop above. I am also including the result from the default spaCy engine for comparison, although I did not show its setup:

Original text: Jasmine Rivers, with social security number 987-65-4321, can be reached at 555-123-4567. Her driver's license number is 1234567890. She lives at 1234 Maple Street, Anytown, USA. Jasmine works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.
SpaCy anonymized text (default, including in case helpful here): <PERSON>, with social security number <US_ITIN>, can be reached at <PHONE_NUMBER>. Her driver's license number is <PHONE_NUMBER>. She lives at <LOCATION>, <LOCATION>, <LOCATION>. <PERSON> works for Stellar Solutions, located at <LOCATION>, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for <DATE_TIME>.
Transformer anonymized text    : <PERSON>, with social security number <ID>, can be reached at <PHONE_NUMBER>. Her driver's license number is <ID>. She lives at 1234 Maple Street, Anytown, USA. <PERSON> works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.

Any thoughts on how to debug this? Thanks again!!

omri374 · 2024-10-08T06:10:45Z

Hi @michhar, when running the transformer model directly, I also never get locations. It does output HOSPITAL, which is mapped to ORGANIZATION in your mapping, and VENDOR, which isn't mapped. The issue, though, is that ORGANIZATION is listed as an entity to ignore (see here and here). The reason for this is the high number of false positives for this entity with the default spaCy model. Having said that, I can understand why one would miss that. If you have any suggestions on how to improve this, please reach out!

So to get this working, I updated the labels_to_ignore parameter so that ORGANIZATION is no longer ignored, and changed the mapping to include HOSPITAL -> LOCATION and VENDOR -> ORGANIZATION:

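from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

# Transformer model config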
tf_model_config = [
    {"lang_code": "en",
     "model_name": {
         "spacy": "en_core_web_sm", 
         "transformers": "StanfordAIMI/stanford-deidentifier-base"
    }
}]

# Entity mappings
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="LOCATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="LOCATION",
    FACILITY="LOCATION",
    VENDOR="ORGANIZATION",
)

tf_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand", # "strict", "contract", "expand"
    aggregation_strategy="max", # "simple", "first", "average", "max"
    labels_to_ignore=["O"])  # overrides the default ignore list, so ORGANIZATION is kept

tf_engine = TransformersNlpEngine(
    models=tf_model_config,
    ner_model_configuration=tf_model_configuration)

# Transformer-based analyzer
analyzer_tf = AnalyzerEngine(
    nlp_engine=tf_engine, 
    supported_languages=["en"]
)

text = "Jasmine Rivers, with social security number 987-65-4321, can be reached at 555-123-4567. Her driver's license number is 1234567890. She lives at 1234 Maple Street, Anytown, USA. Jasmine works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year."
res = analyzer_tf.analyze(text, language="en")

from presidio_anonymizer import AnonymizerEngine

anon = AnonymizerEngine()
anon.anonymize(text, res)

This returns:

text: <PERSON>, with social security number <ID>, can be reached at <PHONE_NUMBER>. Her driver's license number is <ID>. She lives at <LOCATION>. <PERSON> works for <ORGANIZATION>, located at <LOCATION>, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.
items:
[
    {'start': 186, 'end': 196, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 159, 'end': 173, 'entity_type': 'ORGANIZATION', 'text': '<ORGANIZATION>', 'operator': 'replace'},
    {'start': 140, 'end': 148, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'},
    {'start': 128, 'end': 138, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 109, 'end': 113, 'entity_type': 'ID', 'text': '<ID>', 'operator': 'replace'},
    {'start': 62, 'end': 76, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
    {'start': 38, 'end': 42, 'entity_type': 'ID', 'text': '<ID>', 'operator': 'replace'},
    {'start': 0, 'end': 8, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]
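
The fix above came from noticing which labels the model emits and where each one lands in the mapping. A minimal sketch of that check follows, in plain Python with no Presidio dependency. The helper names (`strip_bio_prefix`, `audit_mapping`) and the sample label list are hypothetical and chosen for illustration only. For a real Hugging Face model, the label set is usually available from the model config (for example `AutoConfig.from_pretrained(model_name).id2label` in the transformers library); that call is not made here so the example runs offline.

```python
from collections import defaultdict


def strip_bio_prefix(label: str) -> str:
    """Drop a leading tagging-scheme prefix, e.g. 'B-HOSPITAL' -> 'HOSPITAL'."""
    if len(label) > 2 and label[1] == "-" and label[0] in "BILUES":
        return label[2:]
    return label


def audit_mapping(model_labels, mapping):
    """Return (unmapped labels, {presidio_entity: [model labels feeding it]})."""
    # Normalize labels and drop the "outside" label, which is never an entity.
    base = sorted({strip_bio_prefix(label) for label in model_labels} - {"O"})
    unmapped = [label for label in base if label not in mapping]
    by_entity = defaultdict(list)
    for label in base:
        if label in mapping:
            by_entity[mapping[label]].append(label)
    return unmapped, dict(by_entity)


# Hypothetical label list for illustration; NOT read from the real model.
sample_labels = ["O", "PATIENT", "HCW", "HOSPITAL", "VENDOR", "DATE", "ID", "PHONE"]

# Subset of the original mapping from the question (before the fix).
sample_mapping = {
    "PATIENT": "PERSON",
    "HCW": "PERSON",
    "HOSPITAL": "ORGANIZATION",
    "DATE": "DATE_TIME",
    "ID": "ID",
    "PHONE": "PHONE_NUMBER",
}

unmapped, by_entity = audit_mapping(sample_labels, sample_mapping)
print("Unmapped model labels:", unmapped)
print("Model labels per Presidio entity:", by_entity)
```

With this sample input, VENDOR shows up as unmapped and HOSPITAL is the only label feeding ORGANIZATION, which are the two things that had to change in the mapping above.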

omri374 · 2024-10-08T06:11:46Z

It could also be somewhat related to this change, which hasn't been released yet: #1454

michhar · 2024-10-08T20:49:05Z

@omri374 Thank you for your quick response! How do I go about getting the info I need to make the mapping better? Is there a good way to list out all entities supported by a model so I can make sure I map them correctly?

To improve the approach (avoiding the false positives from the spaCy model pipeline), could I exclude the spaCy model altogether in my transformer-based analyzer?

omri374 · 2024-10-10T06:46:06Z

The spaCy part of the pipeline is not used for NER, but for all the other NLP tasks (tokenization, lemmatization, etc.). We use spacy-huggingface-pipelines to integrate a Hugging Face NER model instead of the spaCy NER model.

Since Presidio comes with spaCy's en_core_web_lg as the default option, and this model has low accuracy for ORG, the default setting is to ignore that entity. A user who switches to a different model also needs to change this setting. As I said previously, this isn't a good developer experience, as it's kind of buried in the code. If you have any suggestions on how to improve this, I'd be happy to discuss!
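
To make the interaction between the mapping and the ignore list concrete, here is a simplified model of it in plain Python. It is not Presidio's actual code: the function name `resolve_entity` is hypothetical, the two-item ignore set is only a stand-in for the real default list, and passing unmapped labels through unchanged is an assumption. It only illustrates the behavior described in this thread, where a label mapped to ORGANIZATION is dropped unless labels_to_ignore is overridden.

```python
def resolve_entity(model_label, mapping, labels_to_ignore):
    """Simplified model: return the output entity for a model label, or None if dropped."""
    # A label can be dropped before mapping (e.g. the "outside" label "O").
    if model_label in labels_to_ignore:
        return None
    # Assumption: labels missing from the mapping pass through unchanged.
    entity = mapping.get(model_label, model_label)
    # A label can also be dropped after mapping, based on the mapped entity name.
    if entity in labels_to_ignore:
        return None
    return entity


mapping = {"HOSPITAL": "ORGANIZATION", "PATIENT": "PERSON"}

# Stand-in for the default list; the thread only confirms ORGANIZATION is on it.
default_like_ignore = {"O", "ORGANIZATION"}

# Equivalent of passing labels_to_ignore=["O"] as in the fix above.
override_ignore = {"O"}

for ignore in (default_like_ignore, override_ignore):
    resolved = {label: resolve_entity(label, mapping, ignore) for label in mapping}
    print(sorted(ignore), "->", resolved)
```

Under these assumptions, HOSPITAL resolves to nothing with the default-like list and to ORGANIZATION once the list is reduced to "O", while PATIENT resolves to PERSON in both cases.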

omri374 · 2024-10-10T06:46:53Z

More on this can be found here: https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/

omri374 mentioned this issue Oct 10, 2024

Updates to the transformers conf docs and yaml file #1467
