Adding a sample for identifying PII in a PDF (#1023)
andrwca authored Jan 30, 2023
1 parent ad68363 commit 07b854d
Showing 4 changed files with 262 additions and 6 deletions.
1 change: 1 addition & 0 deletions docs/samples/index.md
@@ -16,6 +16,7 @@
| Usage | Python | [Using Transformers as an external PII model](python/transformers_recognizer/index.md) |
| Usage | Python Notebook | [Anonymizing known values](python/Anonymizing%20known%20values.ipynb)
| Usage | Python Notebook | [Redacting text PII from DICOM images](python/example_dicom_image_redactor.ipynb)
| Usage | Python Notebook | [Annotating PII in a PDF](python/example_pdf_annotation.ipynb)
| Usage | REST API (postman) | [Presidio as a REST endpoint](docker/index.md) |
| Deployment | App Service | [Presidio with App Service](deployments/app-service/index.md) |
| Deployment | Kubernetes | [Presidio with Kubernetes](deployments/k8s/index.md) |
253 changes: 253 additions & 0 deletions docs/samples/python/example_pdf_annotation.ipynb
@@ -0,0 +1,253 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Annotating PII in a PDF\n",
"\n",
"This sample takes a PDF as an input, extracts the text, identifies PII using Presidio and annotates the PII using highlight annotations.\n",
"\n",
"## Prerequisites\n",
"Before getting started, make sure the following packages are installed. For detailed documentation, see the [installation docs](https://microsoft.github.io/presidio/installation).\n",
"\n",
"Install from PyPI:\n",
"```bash\n",
"pip install presidio_analyzer\n",
"pip install presidio_anonymizer\n",
"python -m spacy download en_core_web_lg\n",
"pip install pdfminer.six\n",
"pip install pikepdf\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For Presidio\n",
"from presidio_analyzer import AnalyzerEngine, PatternRecognizer\n",
"from presidio_anonymizer import AnonymizerEngine\n",
"from presidio_anonymizer.entities import OperatorConfig\n",
"\n",
"# For console output\n",
"from pprint import pprint\n",
"\n",
"# For extracting text\n",
"from pdfminer.high_level import extract_text, extract_pages\n",
"from pdfminer.layout import LTTextContainer, LTChar, LTTextLine\n",
"\n",
"# For updating the PDF\n",
"from pikepdf import Pdf, AttachedFileSpec, Name, Dictionary, Array"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze the text in the PDF\n",
"\n",
"To extract the text from the PDF, we use the pdf miner library. We extract the text from the PDF at the text container level. This is roughly equivalent to a paragraph. \n",
"\n",
"We then use Presidio Analyzer to identify the PII and it's location in the text.\n",
"\n",
"The Presidio analyzer is using pre-defined entity recognizers, and offers the option to create custom recognizers.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"analyzer = AnalyzerEngine()\n",
"\n",
"analyzed_character_sets = []\n",
"\n",
"for page_layout in extract_pages(\"./sample_data/sample.pdf\"):\n",
" for text_container in page_layout:\n",
" if isinstance(text_container, LTTextContainer):\n",
"\n",
" # The element is a LTTextContainer, containing a paragraph of text.\n",
" text_to_anonymize = text_container.get_text()\n",
"\n",
" # Analyze the text using the analyzer engine\n",
" analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')\n",
" \n",
" if text_to_anonymize.isspace() == False:\n",
" print(text_to_anonymize)\n",
" print(analyzer_results)\n",
"\n",
" characters = list([])\n",
"\n",
" # Grab the characters from the PDF\n",
" for text_line in filter(lambda t: isinstance(t, LTTextLine), text_container):\n",
" for character in filter(lambda t: isinstance(t, LTChar), text_line):\n",
" characters.append(character)\n",
"\n",
"\n",
" # Slice out the characters that match the analyzer results.\n",
" for result in analyzer_results:\n",
" start = result.start\n",
" end = result.end\n",
" analyzed_character_sets.append({\"characters\": characters[start:end], \"result\": result})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create phrase bounding boxes\n",
"\n",
"The next task is to take the character data, and inflate it into full phrase bounding boxes.\n",
"\n",
"For example, for an email address, we'll turn the bounding boxes for each character in the email address into one single bounding box."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Combine the bounding boxes into a single bounding box.\n",
"def combine_rect(rectA, rectB):\n",
" a, b = rectA, rectB\n",
" startX = min( a[0], b[0] )\n",
" startY = min( a[1], b[1] )\n",
" endX = max( a[2], b[2] )\n",
" endY = max( a[3], b[3] )\n",
" return (startX, startY, endX, endY)\n",
"\n",
"analyzed_bounding_boxes = []\n",
"\n",
"# For each character set, combine the bounding boxes into a single bounding box.\n",
"for analyzed_character_set in analyzed_character_sets:\n",
" completeBoundingBox = analyzed_character_set[\"characters\"][0].bbox\n",
" \n",
" for character in analyzed_character_set[\"characters\"]:\n",
" completeBoundingBox = combine_rect(completeBoundingBox, character.bbox)\n",
" \n",
" analyzed_bounding_boxes.append({\"boundingBox\": completeBoundingBox, \"result\": analyzed_character_set[\"result\"]})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add highlight annotations\n",
"\n",
"We finally iterate through all the analyzed bounding boxes and create highlight annotations for all of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf = Pdf.open(\"./sample_data/sample.pdf\")\n",
"\n",
"annotations = []\n",
"\n",
"# Create a highlight annotation for each bounding box.\n",
"for analyzed_bounding_box in analyzed_bounding_boxes:\n",
"\n",
" boundingBox = analyzed_bounding_box[\"boundingBox\"]\n",
"\n",
" # Create the annotation. \n",
" # We could also create a redaction annotation if the ongoing workflows supports them.\n",
" highlight = Dictionary(\n",
" Type=Name.Annot,\n",
" Subtype=Name.Highlight,\n",
" QuadPoints=[boundingBox[0], boundingBox[3],\n",
" boundingBox[2], boundingBox[3],\n",
" boundingBox[0], boundingBox[1],\n",
" boundingBox[2], boundingBox[1]],\n",
" Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],\n",
" C=[1, 0, 0],\n",
" CA=0.5,\n",
" T=analyzed_bounding_box[\"result\"].entity_type,\n",
" )\n",
" \n",
" annotations.append(highlight)\n",
"\n",
"# Add the annotations to the PDF.\n",
"pdf.pages[0].Annots = pdf.make_indirect(annotations)\n",
"\n",
"# And save.\n",
"pdf.save(\"./sample_data/sample_annotated.pdf\")"
]
},
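{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a variation on the cell above, a redaction annotation could be created instead of a highlight. The cell below is a minimal sketch rather than part of the original workflow: the `make_redact_annotation` helper name is illustrative, the fields follow the PDF specification's /Redact annotation type, and actually applying the redaction (removing the underlying text) is a separate step that depends on the downstream viewer or tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical variant: build a redaction annotation instead of a highlight.\n",
"# Subtype /Redact only marks the region for redaction; a separate step in a\n",
"# PDF viewer or library is still needed to actually remove the content.\n",
"def make_redact_annotation(boundingBox, label):\n",
"    return Dictionary(\n",
"        Type=Name.Annot,\n",
"        Subtype=Name.Redact,\n",
"        QuadPoints=[boundingBox[0], boundingBox[3],\n",
"                    boundingBox[2], boundingBox[3],\n",
"                    boundingBox[0], boundingBox[1],\n",
"                    boundingBox[2], boundingBox[1]],\n",
"        Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],\n",
"        IC=[0, 0, 0],\n",
"        T=label,\n",
"    )\n",
"\n",
"# Example usage with the last analyzed bounding box:\n",
"# redact = make_redact_annotation(boundingBox, analyzed_bounding_box[\"result\"].entity_type)"
]
},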
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Result\n",
"\n",
"The output from the samples above creates a new PDF. This contains the original content, with text highlight annotations where the PII has been found.\n",
"\n",
"Each text annotation contains the name of the entity found.\n",
"\n",
"## Note\n",
"\n",
"Before relying on this methodology to detect or markup PII from a PDF, please be aware of the following:\n",
"\n",
"### Text extraction\n",
"\n",
"We purposely use a different library specifically for extracting text from the PDF. This is because text extraction is hard to get right, and it's worth using a library specifically developed with the purpose in mind.\n",
"\n",
"For more details, see:\n",
"\n",
"[https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html](https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html)\n",
"\n",
"That said, even with a purpose built library, there may be occasions where PII is present and visible in a PDF, but it is not detected by the sample code.\n",
"\n",
"This includes, but is not limited to:\n",
"\n",
"- Text that cannot be reliable extracted to be analyzed. (e.g. incorrect spacing, wrong reading order)\n",
"- Text present in previous iterations of the PDF which is hidden from text extraction. (See incremental editing)\n",
"- Text present in images. (requires OCRing)\n",
"- Text present in annotations."
]
}
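,
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For PII that sits inside images (whether embedded in a PDF or standalone), Presidio's image redactor can be used, as in the DICOM sample referenced in the samples index. The cell below is a minimal sketch, assuming the `presidio-image-redactor` and `Pillow` packages plus a Tesseract OCR installation are available, and that `extracted_image.png` is a placeholder path for an image you have extracted yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from PIL import Image\n",
"from presidio_image_redactor import ImageRedactorEngine\n",
"\n",
"# Open an image that may contain PII (placeholder path).\n",
"image = Image.open(\"./sample_data/extracted_image.png\")\n",
"\n",
"# OCR the image, detect PII and cover it with a solid fill color.\n",
"redactor_engine = ImageRedactorEngine()\n",
"redacted_image = redactor_engine.redact(image, (255, 192, 203))\n",
"\n",
"redacted_image.save(\"./sample_data/extracted_image_redacted.png\")"
]
}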
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"metadata": {
"interpreter": {
"hash": "1baa965d5efe3ac65b79dfc60c0d706280b1da80fedb7760faf2759126c4f253"
}
},
"vscode": {
"interpreter": {
"hash": "32dc7f589eabbfcc8721d5e42a136d6d28d1f0d89a4194393b25e6e4360e3a4e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
14 changes: 8 additions & 6 deletions docs/samples/python/index.md
@@ -11,9 +11,11 @@ Presidio service can be used as python packages inside python scripts
3. [Remote Recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py)
4. [Azure Text Analytics Integration](text_analytics/index.md)
5. [Anonymizing known values](Anonymizing%20known%20values.ipynb)
6. [Custom Anonymizer with lambda expression](example_custom_lambda_anonymizer.py)
7. [Running Presidio on structured / semi-structured data in batch](batch_processing.ipynb)
8. [Getting the detected text value using a custom operator](getting_entity_values.ipynb)
9. [Creating a simple demo website](streamlit/index.md)
10. [Using Flair as an external PII model](flair_recognizer.py)
11. [Using Transformers as an external PII model](transformers_recognizer.py)
6. [Redacting text PII from DICOM images](example_dicom_image_redactor.ipynb)
7. [Annotating PII in a PDF](example_pdf_annotation.ipynb)
8. [Custom Anonymizer with lambda expression](example_custom_lambda_anonymizer.py)
9. [Running Presidio on structured / semi-structured data in batch](batch_processing.ipynb)
10. [Getting the detected text value using a custom operator](getting_entity_values.ipynb)
11. [Creating a simple demo website](streamlit/index.md)
12. [Using Flair as an external PII model](flair_recognizer.py)
13. [Using Transformers as an external PII model](transformers_recognizer.py)
Binary file added docs/samples/python/sample_data/sample.pdf
Binary file not shown.
