Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating spacy-huggingface-pipelines and refactoring NlpEngine logic #1159

Merged
merged 70 commits into from
Oct 19, 2023

Conversation

omri374
Copy link
Contributor

@omri374 omri374 commented Aug 28, 2023

Change Description

This PR improves the handling of transformers models using the NlpEngine flow. Note that there are two ways to introduce NER models into presidio: NlpEngine and a standalone recognizer. See more info on #1083.

Main changes:

  1. The entire handling of models, entities, mapping and alignment of model spans to Presidio's spans is done as part of the NlpEngine config (either SpacyNlpEngine or TransformersNlpEngine). In the future, we could also add FlairNlpEngine or other NER packages.
  2. Introduced a new NerModelConfiguration dataclass which holds the user configuration. Configuration can be set in a conf file, for example:
    image
    with most of the configuration coming from here: https://github.com/explosion/spacy-huggingface-pipelines#token-classification
  3. Simplified the logic in SpacyRecognizer which now only gets the entities out of NlpArtifacts and returns them
  4. Integrated spacy-huggingface-pipelines to improve the handling of transformer models inside spaCy pipelines.

Flow before:

sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>Spacy: call spaCy NER model
    Spacy->>SpacyNlpEngine: return PII entities
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts, <BR>map to Presidio entities, <BR>filter requested entities, <BR> add confidence scores
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]

Loading

Flow after:

sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>Spacy: call spaCy NER model
    Spacy->>SpacyNlpEngine: return PII entities
    Note over SpacyNlpEngine: Map entity names to Presidio's, <BR>update scores, <BR>remove unwanted entities <BR> based on NerModelConfiguration
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens, scores etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]

Loading

Note to reviewer: A PR with updates to docs can be found here #1177

Issue reference

This PR fixes issue #1083

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests: partially/wip
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required: wip

@omri374
Copy link
Contributor Author

omri374 commented Aug 28, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor Author

omri374 commented Aug 29, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374 omri374 marked this pull request as ready for review August 29, 2023 08:01
@omri374 omri374 requested a review from a team as a code owner August 29, 2023 08:01
SharonHart
SharonHart previously approved these changes Oct 1, 2023
@LSD-98
Copy link

LSD-98 commented Oct 2, 2023

Hello @omri374,

By running some test i got a warning from spacy-huggingface-pipeline:
UserWarning: Skipping annotation, {'entity_group': 'LOC', 'score': 0.46050456, 'word': 'ian', 'start': 33, 'end': 36} is overlapping or can't be aligned for doc 'Je suis Tom et je travaille à Ardian'

In the end, it detected the right entities to anonymize but I wondered whether it could lead to a "false negative" (i.e. skipping an entity that should be detected). I saw you also had the same issue here. I did not understand the technical points of the answer so I wanted to check if you got a solution for this.

Many thanks in advance!

PS : please tell me if I should post here or on the discussion tab.

@omri374
Copy link
Contributor Author

omri374 commented Oct 4, 2023

@LSD-98 thanks for raising this. This is caused by a mismatch between wordpiece tokenization (in transformers) and the spaCy tokenizer. It is defined as "expand" as usually wordpiece tokens are a subset of a spaCy token, but there could be cases where this might result in a false negative or false positive (most likely false positive, as the wordpiece token would be expanded to more than the PII itself). I don't see an immediate workaround for this, so I guess we'd have to live with it. If someone is making sure there are no alignment errors, they could still add a transformers recognizer as an independent recognizer and not as part of the NLPEngine mechanism.

SharonHart
SharonHart previously approved these changes Oct 12, 2023
navalev
navalev previously approved these changes Oct 17, 2023
@omri374
Copy link
Contributor Author

omri374 commented Oct 17, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Collaborator

@niwilso niwilso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! Thank you for providing the swimlane flow diagram.

Just a few points to address before approval.

@omri374 omri374 dismissed stale reviews from SharonHart and navalev via e9007d0 October 18, 2023 10:52
niwilso
niwilso previously approved these changes Oct 18, 2023
@omri374 omri374 merged commit 3a7b8f6 into main Oct 19, 2023
23 checks passed
@omri374 omri374 deleted the omri/new_transformers_engine branch October 19, 2023 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants