Filter recognizers based on locale/country #1328

Open
omri374 opened this issue Mar 13, 2024 · 22 comments
Labels
analyzer, enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@omri374
Contributor

omri374 commented Mar 13, 2024

Is your feature request related to a problem? Please describe.
The number of recognizers in Presidio is growing, which is great, but it also means that recognizers are more likely to be irrelevant to some users. Running all recognizers means more processing time and more false positives.

Describe the solution you'd like
The ability to provide the RecognizerRegistry (or AnalyzerEngine) with a country code to either include or exclude would allow users to run only the subset of recognizers that is relevant to their tasks.

omri374 added the enhancement, good first issue, and analyzer labels on Mar 13, 2024
@AmanSal1

@omri374 Can I give it a shot?

@omri374
Contributor Author

omri374 commented Mar 13, 2024

Absolutely!

@AmanSal1

@omri374 I am having difficulty setting this up locally on Windows. The documentation is a bit unclear to me. Can you help?

@omri374
Contributor Author

omri374 commented Mar 14, 2024

Sure. Which documentation are you following? What issues are you facing?

@AmanSal1

@omri374 This is the documentation I am following:

https://microsoft.github.io/presidio/development/

Steps I followed:
1. Created a PyCharm project with a virtual environment enabled.
2. Cloned the repo.
3. Ran pip install --user pipenv
4. Ran pipenv install --dev --skip-lock (in the presidio-analyzer directory)

Then it says 'installing dependencies' and gets stuck there: it neither shows an error nor proceeds.

@omri374
Contributor Author

omri374 commented Mar 14, 2024

@AmanSal1 I'm not sure why this is happening, but pipenv isn't mandatory. You can install the package locally (preferably in a virtual environment) using pip: pip install -e . (in the presidio-analyzer folder)

@AmanSal1

@omri374 Okay, got it. I just want to confirm: is the UI built using Docker? And when I change the code base, I need to rebuild the image, right?

@omri374
Contributor Author

omri374 commented Mar 14, 2024

By UI do you mean the demo website? It is built using Docker, see code here: https://github.com/microsoft/presidio/blob/main/docs/samples/python/streamlit/index.md

I suggest starting with the Python API, and then adding this capability to the REST API later on.

@AmanSal1

AmanSal1 commented Mar 14, 2024

@omri374 Okay, I will look into that, but I've thought of a solution. First, we would modify the structure of the recognizers_map in the RecognizerRegistry so that it is keyed by country as well as language. Then, when an AnalyzerEngine object is created, we would add a country code parameter to the get_recognizers method of that class, and it would return only the recognizers that have been initialized for that country.

Something like this:

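    recognizers_map = {
        "en": {
            "US": [
                UsBankRecognizer,
                UsLicenseRecognizer,
                UsItinRecognizer,
                UsPassportRecognizer,
                UsSsnRecognizer,
            ],
            # NHS numbers are issued in the UK, FIN numbers in Singapore.
            "UK": [NhsRecognizer],
            "SG": [SgFinRecognizer],
            "AU": [
                AuAbnRecognizer,
                AuAcnRecognizer,
                AuTfnRecognizer,
                AuMedicareRecognizer,
            ],
        },
        # Other languages follow the same pattern.
    }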

Do you think this is the right approach?
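
The lookup this proposal implies can be sketched as follows. This is only an illustration under stated assumptions: the recognizer classes are empty stand-ins so the snippet runs without Presidio installed, and `RECOGNIZERS_MAP` and `get_recognizer_classes` are hypothetical names, not part of the Presidio API.

```python
from typing import Dict, List, Type


# Stand-in classes so the sketch runs without Presidio installed.
class UsSsnRecognizer: ...
class UsPassportRecognizer: ...
class AuAbnRecognizer: ...
class AuTfnRecognizer: ...


# Hypothetical nested layout: language -> country code -> recognizer classes.
RECOGNIZERS_MAP: Dict[str, Dict[str, List[Type]]] = {
    "en": {
        "US": [UsSsnRecognizer, UsPassportRecognizer],
        "AU": [AuAbnRecognizer, AuTfnRecognizer],
    },
}


def get_recognizer_classes(language: str, country: str) -> List[Type]:
    """Return the recognizer classes registered for a language and country."""
    by_country = RECOGNIZERS_MAP.get(language, {})
    # Normalize the country code and return a copy so callers
    # cannot mutate the shared map.
    return list(by_country.get(country.upper(), []))


names = [cls.__name__ for cls in get_recognizer_classes("en", "au")]
print(names)
```

An unknown language or country yields an empty list here rather than an error; whether that is the right behavior for Presidio is an open design choice.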

@omri374
Contributor Author

omri374 commented Mar 14, 2024

Thanks, yes, that sounds like a good approach. Note that there will be some universal recognizers, such as credit card or URL, that don't belong to any country.

@omri374
Contributor Author

omri374 commented Mar 14, 2024

Do we also need to add a country field to the EntityRecognizer class?

@AmanSal1

> Thanks, yes, that sounds like a good approach. Note that there will be some universal recognizers, such as credit card or URL, that don't belong to any country.

        "es": {
            "ES": [EsNifRecognizer],
            "ALL": [
                CreditCardRecognizer,
                CryptoRecognizer,
                DateRecognizer,
                EmailRecognizer,
                IbanRecognizer,
                IpRecognizer,
                MedicalLicenseRecognizer,
                PhoneRecognizer,
                UrlRecognizer,
            ],
        },
        "it": {
            "IT": [
                ItDriverLicenseRecognizer,
                ItFiscalCodeRecognizer,
                ItVatCodeRecognizer,
                ItIdentityCardRecognizer,
                ItPassportRecognizer,
            ],
            "ALL": [
                CreditCardRecognizer,
                CryptoRecognizer,
                DateRecognizer,
                EmailRecognizer,
                IbanRecognizer,
                IpRecognizer,
                MedicalLicenseRecognizer,
                PhoneRecognizer,
                UrlRecognizer,
            ],
        },

It will look like this for the common recognizers.
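
Since the "ALL" list above is identical for every language, one possible variation keeps the universal recognizers in a single shared list and merges them in at lookup time. The sketch below assumes that variation; recognizer names are plain strings so it runs without Presidio, and `UNIVERSAL`, `COUNTRY_SPECIFIC`, and `recognizers_for` are hypothetical names.

```python
from typing import Dict, List

# Universal recognizers, defined once instead of once per language.
UNIVERSAL: List[str] = ["CreditCardRecognizer", "EmailRecognizer", "UrlRecognizer"]

# Hypothetical layout: only country-specific entries are stored per language.
COUNTRY_SPECIFIC: Dict[str, Dict[str, List[str]]] = {
    "es": {"ES": ["EsNifRecognizer"]},
    "it": {"IT": ["ItFiscalCodeRecognizer", "ItVatCodeRecognizer"]},
}


def recognizers_for(language: str, country: str) -> List[str]:
    """Country-specific recognizers first, followed by the universal ones."""
    specific = COUNTRY_SPECIFIC.get(language, {}).get(country, [])
    # List concatenation builds a new list, so UNIVERSAL is never mutated.
    return specific + UNIVERSAL


print(recognizers_for("es", "ES"))
```

A language and country pair with no specific entries falls back to the universal list alone.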

@AmanSal1

AnalyzerEngine

@omri374 Adding the country code directly to the EntityRecognizer class might not be the most feasible option, because the country code is typically associated with specific recognizers rather than being a generic property of all recognizers. What do you think?
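
For comparison, the alternative being discussed here, a country field on the recognizer itself, could be sketched with an optional attribute where `None` marks a universal recognizer. `SketchRecognizer` and `filter_by_country` are hypothetical stand-ins for illustration only; they are not Presidio's EntityRecognizer or any existing Presidio function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SketchRecognizer:
    """Hypothetical stand-in for a recognizer with an optional country field."""

    name: str
    country: Optional[str] = None  # None marks a universal recognizer


def filter_by_country(
    recognizers: List[SketchRecognizer], country: str
) -> List[SketchRecognizer]:
    """Keep recognizers for the given country plus the universal ones."""
    return [r for r in recognizers if r.country in (None, country)]


registry = [
    SketchRecognizer("UsSsnRecognizer", country="US"),
    SketchRecognizer("AuAbnRecognizer", country="AU"),
    SketchRecognizer("EmailRecognizer"),  # universal
]
selected = [r.name for r in filter_by_country(registry, "AU")]
print(selected)
```

With this shape the country is optional metadata rather than a required property, which may ease the concern that it does not apply to every recognizer.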

@AmanSal1

@omri374 Additionally, I think declaring the country code parameter in the RecognizerRegistry is a good approach, since it makes the registry the central place for it. That said, it could also be declared directly in the AnalyzerEngine without declaring it in the RecognizerRegistry.

@AmanSal1

> By UI do you mean the demo website? It is built using Docker, see code here: https://github.com/microsoft/presidio/blob/main/docs/samples/python/streamlit/index.md
>
> I suggest starting with the Python API, and then adding this capability to the REST API later on.

@omri374 When I follow the steps at the URL above, I get an error:

[screenshot 1]

@omri374
Contributor Author

omri374 commented Mar 20, 2024

Thanks. I'll look into why the demo is not working.

@AmanSal1

@omri374 Yes, I think there is some problem with the recognizer_registry file, because the same error is shown even when we don't use the demo and use the package externally.

@omri374
Contributor Author

omri374 commented Mar 23, 2024

@AmanSal1 I couldn't reproduce this. Is your code completely in sync with the one in the main branch?

@AmanSal1

AmanSal1 commented Mar 23, 2024

@omri374 Yes, you are right. Now it works.

    from presidio_analyzer import AnalyzerEngine

    text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"

    # Create an engine with its default settings.
    analyzer = AnalyzerEngine()

    # Analyze the English text and print the results.
    results = analyzer.analyze(text=text1, language="en")
    print(results)

Is this valid code?

@omri374
Contributor Author

omri374 commented Mar 23, 2024

Yes

@AmanSal1

[screenshot 2]
@omri374 I get this error when I run the code above.

@omri374
Contributor Author

omri374 commented Mar 24, 2024

Can you please share your TEST.py?
