Transformer model may be ignoring general entity types #1463

Open
michhar opened this issue Oct 7, 2024 · 5 comments

michhar commented Oct 7, 2024

Amazing project, thank you so much for Presidio and all of the work here.

With a few different Hugging Face transformer models, I am noticing that some of the entities listed for the engine are not being picked up, even with a mapping in place.

I am using the following on macOS:

presidio_analyzer==2.2.355
presidio_anonymizer==2.2.355

For example, here is my code to set up the engine:

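from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine
from presidio_anonymizer import AnonymizerEngine

# Transformer model config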
tf_model_config = [
    {"lang_code": "en",
     "model_name": {
         "spacy": "en_core_web_sm", 
         "transformers": "StanfordAIMI/stanford-deidentifier-base"
    }
}]

# Entity mappings
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="ORGANIZATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="ORGANIZATION",
    FACILITY="LOCATION"
)

tf_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand", # "strict", "contract", "expand"
    aggregation_strategy="max") # "simple", "first", "average", "max"

tf_engine = TransformersNlpEngine(
    models=tf_model_config,
    ner_model_configuration=tf_model_configuration)

# Transformer-based analyzer
analyzer_tf = AnalyzerEngine(
    nlp_engine=tf_engine, 
    supported_languages=["en"]
)

# Default anonymizer
anonymizer_default = AnonymizerEngine()

When I list the entities for the transformer engine with this code:

print(f'Entities for transformer engine:  {analyzer_tf.nlp_engine.get_supported_entities()}')

I get:

Entities for transformer engine: ['DATE_TIME', 'NRP', 'EMAIL', 'PHONE_NUMBER', 'ORGANIZATION', 'AGE', 'LOCATION', 'PERSON', 'ID']

Then, I do the following to call the engine/analyzer+anonymizer:

for text in text_gens[:5]:
   
    analyzer_tf_results = analyzer_tf.analyze(text=text, language="en")
    anonymized_tf_results = anonymizer_default.anonymize(
        text=text, analyzer_results=analyzer_tf_results
    )

    print(f"Transformer anonymized text    : {anonymized_tf_results.text}\n")

Even though the texts contain many typical US addresses, I never get LOCATION as one of my entities.

Here is one example output from the loop above. I have included the default spaCy output as well for completeness, although its setup is not shown above. The text is entirely synthetic and does not reflect any real people or businesses.

Original text: Jasmine Rivers, with social security number 987-65-4321, can be reached at 555-123-4567. Her driver's license number is 1234567890. She lives at 1234 Maple Street, Anytown, USA. Jasmine works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.
SpaCy anonymized text (default, including in case helpful here): <PERSON>, with social security number <US_ITIN>, can be reached at <PHONE_NUMBER>. Her driver's license number is <PHONE_NUMBER>. She lives at <LOCATION>, <LOCATION>, <LOCATION>. <PERSON> works for Stellar Solutions, located at <LOCATION>, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for <DATE_TIME>.
Transformer anonymized text    : <PERSON>, with social security number <ID>, can be reached at <PHONE_NUMBER>. Her driver's license number is <ID>. She lives at 1234 Maple Street, Anytown, USA. <PERSON> works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.
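
One way to quantify the gap between the advertised entities and what is actually detected is to tally the entity types in the analyzer results. This is a minimal sketch: `tally_entity_types` is a hypothetical helper, and it assumes only that each result exposes an `entity_type` attribute, as the results returned by `analyzer_tf.analyze` do.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_entity_types(
    results: Iterable, supported: Iterable[str]
) -> Tuple[Counter, List[str]]:
    """Count detected entity types and list supported types that never appeared.

    `results` is any iterable of objects with an `entity_type` attribute.
    `supported` is the list of entity names the engine claims to support.
    """
    counts = Counter(r.entity_type for r in results)
    missing = sorted(set(supported) - set(counts))
    return counts, missing
```

Calling it with `analyzer_tf_results` and `analyzer_tf.nlp_engine.get_supported_entities()` inside the loop above would show which supported types never appear for a given text. Note that the analyzer results may also contain hits from other recognizers, not only from the transformer model.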

Any thoughts on how to debug this? Thanks again!!

omri374 commented Oct 8, 2024

Hi @michhar, when I run the transformer model directly, I also never get locations. It does output HOSPITAL, which is mapped to ORGANIZATION in your mapping, and VENDOR, which isn't mapped. The issue, though, is that ORGANIZATION is listed as an entity to ignore (see here and here). The reason is the high number of false positives for this entity with the default spaCy model. Having said that, I can understand why one would miss that. If you have any suggestions on how to improve this, please reach out!

To get this working, I updated the labels_to_ignore parameter and changed the mapping so that HOSPITAL maps to LOCATION and VENDOR maps to ORGANIZATION:

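from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

# Transformer model config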
tf_model_config = [
    {"lang_code": "en",
     "model_name": {
         "spacy": "en_core_web_sm", 
         "transformers": "StanfordAIMI/stanford-deidentifier-base"
    }
}]

# Entity mappings
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="LOCATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="LOCATION",
    FACILITY="LOCATION",
    VENDOR="ORGANIZATION",
)

tf_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand", # "strict", "contract", "expand"
    aggregation_strategy="max", # "simple", "first", "average", "max"
    labels_to_ignore=["O"])  # replaces the default ignore list, which includes ORGANIZATION

tf_engine = TransformersNlpEngine(
    models=tf_model_config,
    ner_model_configuration=tf_model_configuration)

# Transformer-based analyzer
analyzer_tf = AnalyzerEngine(
    nlp_engine=tf_engine, 
    supported_languages=["en"]
)

text = "Jasmine Rivers, with social security number 987-65-4321, can be reached at 555-123-4567. Her driver's license number is 1234567890. She lives at 1234 Maple Street, Anytown, USA. Jasmine works for Stellar Solutions, located at 5678 Oak Avenue, Suite 100, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year."
res = analyzer_tf.analyze(text, language="en")

from presidio_anonymizer import AnonymizerEngine

anon = AnonymizerEngine()
anon.anonymize(text, res)

This returns:

text: <PERSON>, with social security number <ID>, can be reached at <PHONE_NUMBER>. Her driver's license number is <ID>. She lives at <LOCATION>. <PERSON> works for <ORGANIZATION>, located at <LOCATION>, with 50 employees. The organization's financial health is strong, with a stock price of $100 per share and a positive forecast for the upcoming year.
items:
[
    {'start': 186, 'end': 196, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 159, 'end': 173, 'entity_type': 'ORGANIZATION', 'text': '<ORGANIZATION>', 'operator': 'replace'},
    {'start': 140, 'end': 148, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'},
    {'start': 128, 'end': 138, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 109, 'end': 113, 'entity_type': 'ID', 'text': '<ID>', 'operator': 'replace'},
    {'start': 62, 'end': 76, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
    {'start': 38, 'end': 42, 'entity_type': 'ID', 'text': '<ID>', 'operator': 'replace'},
    {'start': 0, 'end': 8, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]
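
To illustrate why the original configuration dropped these entities, here is a simplified sketch of mapping followed by an ignore-list filter. It is not Presidio's actual implementation: `map_and_filter` is a hypothetical function, the choice to drop unmapped labels is an assumption made to mirror the observed behavior, and the example spans and their labels are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple


def map_and_filter(
    spans: Iterable[Tuple[str, str]],
    mapping: Dict[str, str],
    labels_to_ignore: Iterable[str],
) -> List[Tuple[str, str]]:
    """Map model labels to target entity names, then drop ignored ones.

    Assumptions (simplifications, not the library's code):
    - a label is dropped if it is ignored before or after mapping
    - a label with no entry in `mapping` is dropped
    """
    ignore = set(labels_to_ignore)
    kept = []
    for text, model_label in spans:
        if model_label in ignore or model_label not in mapping:
            continue
        target = mapping[model_label]
        if target in ignore:
            continue
        kept.append((text, target))
    return kept


# Illustrative spans; the labels are assumed for the sake of the example.
spans = [("1234 Maple Street", "HOSPITAL"), ("Stellar Solutions", "VENDOR")]

# Original setup: HOSPITAL -> ORGANIZATION, and ORGANIZATION is ignored.
before = map_and_filter(spans, {"HOSPITAL": "ORGANIZATION"}, ["ORGANIZATION", "O"])

# Updated setup: new mapping targets, and only "O" is ignored.
after = map_and_filter(
    spans, {"HOSPITAL": "LOCATION", "VENDOR": "ORGANIZATION"}, ["O"]
)
```

Under these assumptions, `before` is empty while `after` keeps both spans, which matches the difference between the two anonymized outputs in this thread.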

omri374 commented Oct 8, 2024

It could also be somewhat related to this change, which hasn't been released yet: #1454

michhar commented Oct 8, 2024

@omri374 Thank you for your quick response! How do I go about getting the information I need to improve the mapping? Is there a good way to list all of the entities a model supports, so I can make sure I map them correctly?

To improve the approach and avoid the false positives from the spaCy model pipeline, could I exclude the spaCy model altogether from my transformer-based analyzer?

omri374 commented Oct 10, 2024

The spaCy part of the pipeline is not used for NER, but for all the other NLP tasks (tokenization, lemmatization, etc.). We use spacy-huggingface-pipelines to integrate a Hugging Face NER model in place of the spaCy NER model.

Since Presidio comes with spaCy's en_core_web_lg as the default option, and this model has low accuracy for ORG, the default setting is to ignore it. A user who switches to a different model also needs to change this setting. As said previously, this isn't a good developer experience, as it's kind of buried in the code. If you have any suggestions on how to improve this, I'd be happy to discuss!

omri374 commented Oct 10, 2024

More on this can be found here: https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/
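
On the earlier question of how to list the labels a model can emit: Hugging Face token-classification models usually carry an `id2label` dictionary in their config, which can be compared against the mapping keys. The following is a minimal sketch, not part of Presidio. `base_labels` and `unmapped_labels` are hypothetical helpers, and it is an assumption that the model's labels may carry BIO-style prefixes such as `B-` and `I-`.

```python
from typing import Dict, List, Set

# Prefixes used by common tagging schemes (BIO, BILOU, IOBES).
_PREFIXES = {"B-", "I-", "L-", "U-", "E-", "S-"}


def base_labels(id2label: Dict[int, str]) -> Set[str]:
    """Strip tagging-scheme prefixes and drop the outside label 'O'."""
    labels = set()
    for raw in id2label.values():
        label = raw[2:] if raw[:2] in _PREFIXES else raw
        if label != "O":
            labels.add(label)
    return labels


def unmapped_labels(id2label: Dict[int, str], mapping: Dict[str, str]) -> List[str]:
    """Return model labels that have no key in the mapping."""
    return sorted(base_labels(id2label) - set(mapping))


# With the transformers package installed, the dictionary can typically be
# loaded like this (downloads the model config, so it is left commented out):
#   from transformers import AutoConfig
#   id2label = AutoConfig.from_pretrained(
#       "StanfordAIMI/stanford-deidentifier-base"
#   ).id2label
```

Passing the loaded `id2label` and the `mapping` dictionary from this thread to `unmapped_labels` should surface labels such as VENDOR that the mapping does not cover.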
