Recognizers are the main building blocks in Presidio. Each recognizer is in charge of detecting one or more entities in one or more languages. Recognizers define the logic for detection, as well as the confidence a prediction receives and a list of words to be used when context is leveraged.
Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets. For tools and documentation on evaluating and analyzing recognizers, refer to the presidio-research GitHub repository.
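As a rough illustration of how false positives and false negatives translate into numbers, the following standalone sketch computes exact-match precision and recall over labeled spans. The span format, the function name `precision_recall`, and the sample data are assumptions made for this example; they are not taken from presidio-research, which offers more complete evaluation tooling.

```python
from typing import Set, Tuple

# A span is (start, end, entity_type); this format is an assumption for the sketch.
Span = Tuple[int, int, str]


def precision_recall(predicted: Set[Span], expected: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over (start, end, entity_type) spans."""
    true_positives = len(predicted & expected)
    # Precision drops with false positives, recall drops with false negatives.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labeled sample: two expected entities; the recognizer found one
# of them (a true positive) plus one spurious span (a false positive).
expected = {(0, 8, "PERSON"), (20, 32, "PHONE_NUMBER")}
predicted = {(0, 8, "PERSON"), (40, 45, "PERSON")}

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recording these two numbers together with the dataset they were computed on is one way to document how a recognizer was tested.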
!!! note "Note"
    When contributing recognizers to the Presidio OSS, new predefined recognizers should be added to the supported entities list and should follow the contribution guidelines.
Make sure your recognizer doesn't take too long to process text. As a rule of thumb, anything above 100 ms per request for a text of about 100 tokens is probably too slow.
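One way to check a recognizer against this latency budget is to time repeated calls on a text of about 100 tokens. The sketch below uses only the standard library; `median_latency_ms` and `dummy_analyze` are hypothetical names, and `dummy_analyze` merely stands in for the recognizer call under test.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(analyze: Callable[[str], object], text: str, runs: int = 20) -> float:
    """Return the median wall-clock time of analyze(text) in milliseconds."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        analyze(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_analyze(text: str) -> list:
    """Stand-in for a recognizer call; replace with the recognizer under test."""
    return [token for token in text.split() if token.istitle()]


sample_text = " ".join(["Token"] * 100)  # 100 whitespace-separated tokens
latency = median_latency_ms(dummy_analyze, sample_text)
print(f"median latency: {latency:.3f} ms (budget: 100 ms)")
```

The median is used rather than the mean so that a single slow run (for example, a cold start) does not dominate the result.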
When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies. In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a `RemoteRecognizer` on the presidio-analyzer side to interact with the model's endpoint. In addition, make sure the license on the 3rd party dependency allows you to use it for any purpose.
Generally speaking, there are three types of recognizers:
A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]`) to detect a "Title" entity.
See this documentation on adding a new recognizer. The `PatternRecognizer` class has built-in support for a deny-list input.
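To make the idea concrete, here is a standalone sketch of what deny-list matching amounts to, using the titles example and only Python's `re` module. It is an illustration under simplified assumptions, not Presidio's implementation; within Presidio itself, the deny-list input of `PatternRecognizer` would be used instead. The function name `deny_list_matches` is hypothetical.

```python
import re
from typing import List, Tuple


def deny_list_matches(text: str, deny_list: List[str]) -> List[Tuple[int, int, str]]:
    """Find occurrences of deny-list entries that are not part of a longer word."""
    # Longest entries first, so a longer entry is preferred when one entry
    # is a prefix of another.
    ordered = sorted(deny_list, key=len, reverse=True)
    alternatives = "|".join(re.escape(entry) for entry in ordered)
    # No word character may directly precede or follow the match.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


titles = ["Mr.", "Mrs.", "Ms.", "Dr."]
text = "I spoke to Dr. Lee and Mrs. Kim."
matches = deny_list_matches(text, titles)
for start, end, value in matches:
    print(f"TITLE at [{start}:{end}] -> {value}")
```

Escaping each entry with `re.escape` matters here because the titles contain a literal dot, which would otherwise match any character.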
Pattern based recognizers use regular expressions to identify entities in text.
See this documentation on adding a new recognizer via code.
The `PatternRecognizer` class should be extended.
See some examples here:
!!! example "Examples"
    Examples of pattern based recognizers are the `CreditCardRecognizer` and `EmailRecognizer`.
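The interplay between a regular expression, a base confidence, and context words can be sketched without any framework. In the example below, the ZIP code pattern, the scores, the context words, and the names `Match` and `find_zip_codes` are all hypothetical, and the context check is deliberately simplistic (a substring test over the whole text); Presidio's own context handling is more elaborate.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical, deliberately loose pattern: any five digits, optional +4 suffix.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")
BASE_SCORE = 0.25      # low confidence, since many five-digit numbers are not ZIP codes
CONTEXT_BOOST = 0.5    # added when a context word appears in the text
CONTEXT_WORDS = ("zip", "zipcode", "postal")


def find_zip_codes(text: str) -> List[Match]:
    """Return regex matches, with a higher score when a context word is present."""
    lowered = text.lower()
    has_context = any(word in lowered for word in CONTEXT_WORDS)
    score = min(BASE_SCORE + CONTEXT_BOOST, 1.0) if has_context else BASE_SCORE
    return [Match("ZIP_CODE", m.start(), m.end(), score) for m in ZIP_PATTERN.finditer(text)]


print(find_zip_codes("My zip is 90210"))      # context word present
print(find_zip_codes("Order 12345 shipped"))  # no context word
```

A weak pattern with a low base score, raised only when supporting context is found, is one way to limit false positives from a loose regular expression.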
Many PII entities are undetectable using naive approaches like deny-lists or regular expressions. In these cases, we would use a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer. There are four options for adding ML and rule based recognizers:
Presidio currently uses spaCy as a framework for text analysis and Named Entity Recognition (NER), and stanza as an alternative. To avoid introducing new tools, it is recommended to first try to use `spaCy` or `stanza` over other tools if possible.
`spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
`spaCy` and `stanza` models could be trained from scratch, used in combination with pre-trained embeddings, or retrained to detect new entities.
When integrating such a model into Presidio, a class inheriting from `EntityRecognizer` should be created.
`Scikit-learn` models tend to be fast, but usually have lower accuracy than deep learning methods. However, for well defined problems with well defined features, they can provide very good results.
When integrating such a model into Presidio, a class inheriting from `EntityRecognizer` should be created.
In some cases, rule-based logic provides the best way of detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy`/`stanza` extracted features like lemmas, part of speech, dependencies and more to create your logic. When integrating such logic into Presidio, a class inheriting from `EntityRecognizer` should be created.
Deep learning methods offer excellent detection rates for NER. They are, however, more complex to train and deploy, and tend to be slower than traditional approaches. When creating a DL based method for PII detection, there are two main alternatives for integrating it with Presidio:
- Create an external endpoint (either local or remote) which is isolated from the `presidio-analyzer` process. On the `presidio-analyzer` side, one would extend the `RemoteRecognizer` class and implement the network interface between `presidio-analyzer` and the endpoint of the model's container.
- Integrate the model as an additional `EntityRecognizer` within the `presidio-analyzer` flow.
!!! attention "Considerations for selecting one option over another"
    - Ease of integration.
    - Runtime considerations (for example, if the new model requires a GPU).
    - 3rd party dependencies of the new model vs. the existing `presidio-analyzer` package.
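For the external endpoint alternative, the network interface boils down to sending text to the model's endpoint and converting the response into entity spans. The sketch below shows only that client-side logic with the standard library. The URL, the request payload, the response schema, and every function name are assumptions made for illustration; an actual integration would wrap comparable logic in a class extending `RemoteRecognizer` and follow whatever schema the model endpoint defines.

```python
import json
from typing import Callable, Dict, List
from urllib import request


def post_json(url: str, payload: Dict) -> str:
    """Send payload as JSON via HTTP POST and return the response body."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_entities(body: str, text_length: int) -> List[Dict]:
    """Keep only well-formed spans that fall inside the analyzed text."""
    entities = []
    for item in json.loads(body):
        start, end = int(item["start"]), int(item["end"])
        if 0 <= start < end <= text_length:
            entities.append({
                "entity_type": str(item["entity_type"]),
                "start": start,
                "end": end,
                "score": float(item["score"]),
            })
    return entities


def analyze_remote(text: str, url: str,
                   transport: Callable[[str, Dict], str] = post_json) -> List[Dict]:
    """Send text to the (hypothetical) model endpoint and parse its response."""
    return parse_entities(transport(url, {"text": text}), len(text))


def fake_transport(url: str, payload: Dict) -> str:
    """Offline stand-in for the model endpoint, so the sketch runs without a server."""
    return json.dumps([
        {"entity_type": "PERSON", "start": 0, "end": 5, "score": 0.9},
        {"entity_type": "PERSON", "start": 50, "end": 60, "score": 0.8},  # out of range
    ])


remote_results = analyze_remote("Alice went home", "http://localhost:8000/analyze",
                                transport=fake_transport)
print(remote_results)
```

Passing the transport as a parameter keeps the parsing logic testable without a running model container, and validating span offsets guards against a model that returns positions outside the submitted text.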