This code will anonymise text files and produce two outputs:
- the redacted text file
- a metadata file describing which PII elements were found, with their position in the document.
The metadata output can be in JSON or XML format. SMI uses the XML format for two reasons: it is more comprehensive, and it can be used as input to the eHOST program for manual annotation and correction. This is essential for training and verification.
The anonymisation is implemented using a rule based approach.
- Installation
- Configuration
- Rules
- Run the anonymisation process
- Update rules to improve anonymisation
- Testing rules separately
This version no longer requires Python 2; it works with Python 3.
If using the anonymiser standalone then only the Python dependencies
need to be installed; see the CogStack-SemEHR/requirements.txt file.
The Python regular expression parser re cannot handle some of the
regular expressions in the anonymisation rules, especially on some of
the larger documents, so the anonymiser tries to use Google's replacement, called re2.
You need to apt install libre2-dev first, then pip install pyre2.
If re2 is not installed it will silently fall back to the standard re module, but this may hang
on the complex patterns. Note: do not pip install re2, it doesn't work.
There is also google-re2 but this is untested.
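The fallback described above can be sketched as follows. This is a minimal illustration, not the anonymiser's actual code; it assumes the pyre2 package exposes a module named re2 with the same compile/search interface as re, and the names regex_engine and USING_RE2 are hypothetical.

```python
# Prefer re2 (assumed to be provided by the pyre2 package);
# fall back silently to the standard library re module.
try:
    import re2 as regex_engine
    USING_RE2 = True
except ImportError:
    import re as regex_engine
    USING_RE2 = False

# Either engine is then used through the same interface.
pattern = regex_engine.compile(r"\bID\s*(\d+)")
match = pattern.search("Patient ID 12345")
```

Checking USING_RE2 at start-up is one way to find out whether the slower engine is in use before processing large documents.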
If using the anonymiser to anonymise text inside Structured Reports in
DICOM format, i.e. within the SMI environment, then use the script
src/tools/anon_init.sh
- Repo name: CogStack-SemEHR
- Entry script: ./anonymisation/anonymiser.py
- Configuration file: ./anonymisation/conf/anonymisation_task.json
The template configuration file in conf/anonymisation_task.json can be copied and modified.
{
"mode": "mt",
"number_threads": 20,
"rules_folder": "./conf/rules/",
"rule_file_pattern": ".*_rules.json",
"rule_group_name": "PHI_rules",
"working_fields": ["Finding", "Text", "ContentSequence"],
"sensitive_fields": ["Patient ID", "Patient Name", "Person Observer Name", "Referring Physician Name"],
"annotation_mode": false,
"text_data_path": "./test_data",
"anonymisation_output": "./test_output/",
"extracted_phi": "./test_output/extracted_phi.json",
"grouped_phi_output": "./test_output/grouped_phi.txt",
"logging_level": "DEBUG",
"logging_file": "./test_output/anonymisation.log",
"use_spacy": false
}
mode is either mt or dir, meaning multithreaded or not.
There is no requirement to use multiple threads.
If mode is mt then number_threads is the number of threads used.
rules_folder is the relative path to the directory containing JSON-format rules files. The filenames in that directory are matched against rule_file_pattern to find rules files.
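As an illustration of how filenames might be selected, here is a small sketch. The function find_rule_files is hypothetical, and whether the anonymiser anchors the pattern at the start of the filename (as re.match does here) is an assumption.

```python
import os
import re
import tempfile

def find_rule_files(rules_folder, rule_file_pattern):
    """Return sorted paths of files in rules_folder whose name matches the pattern."""
    matcher = re.compile(rule_file_pattern)
    return sorted(
        os.path.join(rules_folder, name)
        for name in os.listdir(rules_folder)
        if matcher.match(name)
    )

# Demonstration in a temporary folder with two rules files and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("phi_rules.json", "structure_rules.json", "README.md"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in find_rule_files(folder, ".*_rules.json")]
```

Note that the dot in the template pattern ".*_rules.json" is an unescaped regex dot, so it matches any character, not only a literal full stop.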
rule_group_name is the name of the group inside the rules files which will be used. Your rules files can contain many groups for different purposes, but only this one group will be used.
working_fields is a list of document sections which will be anonymised. A section is denoted by a line starting with its tag, for example [[ContentSequence]]; text outside such sections is ignored.
sensitive_fields is a list of document sections where sensitive information (names) can be provided, typically extracted from the document (DICOM) header. For example, [[Patient Name]] Nicol McNicol would automatically remove any mention of Nicol or McNicol from the document.
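A rough sketch of that name removal is shown below. It is not the anonymiser's implementation: the function redact_names, the whole-word matching, the case-insensitive comparison and the XXX replacement string are all assumptions made for illustration.

```python
import re

def redact_names(text, names, replacement="XXX"):
    """Replace each whole-word occurrence of any given name, ignoring case."""
    # Longest names first, so a longer name is never partly replaced by a shorter one.
    for name in sorted(names, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", replacement, text,
                      flags=re.IGNORECASE)
    return text

# The value of the [[Patient Name]] header, split into forename and surname.
names = "Nicol McNicol".split()
body = "Mr Nicol was seen today. The McNicol family history was noted."
redacted = redact_names(body, names)
```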
annotation_mode should be true to save annotations in XML format.
text_data_path is the path to the text files to be anonymised.
anonymisation_output is the path to the output directory.
extracted_phi is the filename of the 'phi' file, which will be in JSON format and contain the anonymised parts of all documents.
grouped_phi_output is the filename of the grouped 'phi' data.
logging_level can be DEBUG to log debugging information, or INFO.
logging_file is the filename of the log file.
use_spacy defaults to false, but if set to true and spaCy is installed
then spaCy is used to anonymise as well.
The language model is currently hard-coded as en_core_web_sm.
It only anonymises PERSON entities, but has the disadvantage that
it may also remove the names of drugs.
The rules used to anonymise text are stored in the rules_folder directory.
All files matching the rule_file_pattern will be loaded.
Rules are defined using regular expressions and grouped into categories.
Two types of rules are defined.
- Document structure rules: used for parsing document structures (Note: not used in SMI)
- PHI (Protected health information) rules: used to identify PHI mentions
Rule document structure
{
"RULE_CATEGORY_NAME": {
"RULE_SET_NAME": [
"RULE": {
...
}
]
}
}
The general data structure of an atom rule:
"RULE_NAME": {
"pattern": "REGULAR_EXPRESSION_(WITH_GROUPS)",
"flags": ["multiline",...],
"data_labels": ["LABEL1", "LABEL2"],
"data_type": "DATA TYPE",
"disabled": false
}
For example
{
"PHI_rules": {
"clinic": [
{
"comment": "A full description of this rule in plain English.",
"test_true": [ "list of strings which the pattern must match", "more" ],
"test_false": [ "list of strings which the pattern must not match", "more" ],
"pattern": "\\bplease\\s+contact(\\s+\\w+(\\s+\\w+){0,2})",
"flags": [ "ignorecase" ],
"data_labels": [ "name" ],
"data_type": "institute"
},
The pattern is a Python regex, but because it is stored in JSON each
backslash must be doubled, so things like \b for a word boundary should be written \\b.
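The effect of the doubled backslash can be checked with the standard json module. The rule text below is a made-up example, written as a raw string so that it reads exactly as it would in a rules file.

```python
import json
import re

# Correct: doubled backslashes in the JSON text decode to single backslashes.
good = json.loads(r'{"pattern": "\\bID\\s+(\\d+)"}')
decoded_pattern = good["pattern"]
match = re.search(decoded_pattern, "ID 42")

# Wrong but silent: in JSON, \b is the escape for a backspace character,
# so the regex never sees a word boundary.
bad = json.loads(r'{"pattern": "\bID"}')
has_backspace = bad["pattern"].startswith("\x08")

# Wrong and loud: \s is not a valid JSON escape, so parsing fails.
try:
    json.loads(r'{"pattern": "\s+"}')
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True
```

The \b case is the one to watch for, because the rules file still loads and the rule simply never matches.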
Note that the regex will be searched in fragments of
the document, not the whole document and not necessarily sentences.
(In fact a fragment may be a whole section defined by working_fields.) This
has implications for anchors such as ^ and $, and for the multiline flag.
The flags may contain ignorecase and/or multiline, having the same meaning
as documented in the Python re library.
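One plausible way to turn the flag names into re flags is sketched below, together with the effect of multiline on the ^ anchor when a fragment spans several lines. The names FLAG_NAMES and compile_rule are hypothetical and the sample rule is invented.

```python
import re

# Assumed mapping from the flag names used in rules files to re constants.
FLAG_NAMES = {"ignorecase": re.IGNORECASE, "multiline": re.MULTILINE}

def compile_rule(rule):
    """Compile a rule's pattern, combining any flags it lists."""
    flags = 0
    for name in rule.get("flags", []):
        flags |= FLAG_NAMES[name]
    return re.compile(rule["pattern"], flags)

fragment = "Report\nSeen by Dr Jobs"

# Without multiline, ^ only matches at the very start of the fragment.
plain = compile_rule({"pattern": "^seen by", "flags": ["ignorecase"]})
# With multiline, ^ also matches after each newline inside the fragment.
multi = compile_rule({"pattern": "^seen by", "flags": ["ignorecase", "multiline"]})

plain_hit = plain.search(fragment)
multi_hit = multi.search(fragment)
```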
The data_labels are names given to each regex capture 'group' (the parts
inside round brackets). The order of the names must match the order of the groups.
Labels are optional, but if a group captures the name or number to be anonymised
then its entry in data_labels must be name or number, as the text which matches
that capture group will be found and replaced.
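To make the link between data_labels and capture groups concrete, here is a sketch using the example rule shown earlier. The redact function and the choice of X characters as the replacement are assumptions; the real anonymiser may replace text differently.

```python
import re

rule = {
    "pattern": "\\bplease\\s+contact(\\s+\\w+(\\s+\\w+){0,2})",
    "flags": ["ignorecase"],
    "data_labels": ["name"],
}

def redact(text, rule):
    """Overwrite every group labelled name or number with X characters."""
    compiled = re.compile(rule["pattern"], re.IGNORECASE)
    spans = []
    for match in compiled.finditer(text):
        # Label i (1-based) describes capture group i; unlabelled groups are ignored.
        for index, label in enumerate(rule["data_labels"], start=1):
            if label in ("name", "number") and match.group(index) is not None:
                spans.append(match.span(index))
    # Replace from the end so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "X" * (end - start) + text[end:]
    return text

redacted = redact("Please contact Dr John Smith", rule)
```

Only the first group is labelled here, so the inner repeated group is left unlabelled and is not processed on its own.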
The data_type is used to identify what type of information was extracted.
disabled is optional; when true, the rule is not used.
comment is optional but should be used to describe the rule in plain English.
The tests are optional but should be used to allow automated testing of rules,
using the test_rules.py script. All strings in the test_true list should
contain something which matches the pattern, and all strings in the test_false list
should contain nothing that matches the pattern.
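The kind of check these lists enable can be sketched as follows. This is not the test_rules.py script itself; check_rule is a hypothetical helper and the test strings are invented.

```python
import re

def check_rule(rule):
    """Return a list of (list_name, string) pairs for every failed test string."""
    flags = re.IGNORECASE if "ignorecase" in rule.get("flags", []) else 0
    compiled = re.compile(rule["pattern"], flags)
    failures = []
    for text in rule.get("test_true", []):
        if not compiled.search(text):
            failures.append(("test_true", text))
    for text in rule.get("test_false", []):
        if compiled.search(text):
            failures.append(("test_false", text))
    return failures

rule = {
    "pattern": "\\bplease\\s+contact(\\s+\\w+(\\s+\\w+){0,2})",
    "flags": ["ignorecase"],
    "test_true": ["Please contact Dr Smith"],
    "test_false": ["No contact details were given"],
}
failures = check_rule(rule)
```

An empty list means every test_true string matched and no test_false string did.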
Note: these are not used in SMI.
These rules are used to identify locators of section headings in the document.
Everything after a locator and before the next locator belongs to a section.
These rules are used to identify typed PHIs. They are stored in the part of
the rule document that is indexed with the key sent_rules, as described below.
It is composed of a list of rule sets.
"sent_rules":{
"RULE_SET_NAME": [
{
"pattern": "\\b(ID)\\:{0,}\\s{0,}(\\d+)\\b",
"flags": [
"ignorecase"
],
"data_labels": [
"label",
"name"
],
"data_type": "ID"
},
...
]
}
Each rule set identifies one type of PHI. The following is a snapshot of a rule set called IDS,
which identifies identifiers in the text.
"IDS": [
{
"pattern": "\\b(ID)\\:{0,}\\s{0,}(\\d+)\\b",
"flags": [
"ignorecase"
],
"data_labels": [
"label",
"name"
],
"data_type": "ID"
},
... // more rules
]
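As a rough illustration of how such a rule set could be applied to a fragment of text, the sketch below runs the IDS rule shown above and records what each labelled group captured. The find_phi function and the shape of the returned records are assumptions, not the anonymiser's own output format.

```python
import re

ids_rules = [
    {
        "pattern": "\\b(ID)\\:{0,}\\s{0,}(\\d+)\\b",
        "flags": ["ignorecase"],
        "data_labels": ["label", "name"],
        "data_type": "ID",
    }
]

def find_phi(text, rules):
    """Return one record per match, pairing data_labels with captured groups."""
    found = []
    for rule in rules:
        if rule.get("disabled", False):
            continue  # disabled rules are skipped
        flags = re.IGNORECASE if "ignorecase" in rule.get("flags", []) else 0
        for match in re.finditer(rule["pattern"], text, flags):
            labelled = dict(zip(rule["data_labels"], match.groups()))
            found.append({"type": rule["data_type"],
                          "start": match.start(),
                          "labels": labelled})
    return found

hits = find_phi("Patient id: 12345 attended clinic.", ids_rules)
```

Here the first group is labelled label and the second name, so only the digits would be treated as the text to replace.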
If SemEHR is installed as the Docker version, open a bash terminal in the container with docker-compose:
docker-compose -f YOUR-COMPOSE-FILE-YML-PATH run --entrypoint /bin/bash semehr
If not using docker then just run the script. Pass the path to a configuration file. You can use any path; in this example we are using the provided template config file.
cd CogStack-SemEHR/anonymisation
python3 anonymiser.py conf/anonymisation_task.json
The program will anonymise all the text files in the input folder
and place annotations and/or anonymised text in the output folder.
The folders are specified in the config file as:
text_data_path for input files,
anonymisation_output for output files,
extracted_phi for the filename of the found names,
grouped_phi_output similarly, and
logging_file for the log file.
Set annotation_mode to true to write the XML annotations.
Input files must be in the SMI format for best results. This is the
output from CTP_DicomToText.py (see the SmiServices repo) but is
easily created manually. It has headers like this:
[[Patient Name]] Anne Boleyn
[[Referring Physician Name]] Charles Dickens
[[ContentSequence]]
The headers are defined in the config file as sensitive_fields.
The anonymiser collects the names given in the headers (from any tag listed in the
sensitive_fields config) so they can be replaced if found in the text. Forenames and surnames
are handled separately.
It then anonymises all text after the [[ContentSequence]] header, or any
tag listed in the working_fields config. If the input contains no field
from the working_fields config then nothing is anonymised.
The output files are given the same name as the input files.
If XML has been requested then additional files will be written having
the same name but with .knowtator.xml appended. The phi file will
be in JSON format.
The XML format contains a set of annotations like this:
<?xml version="1.0" ?>
<annotations>
<annotation>
<mention id="filename-1"/>
<annotator id="filename-1">semehr</annotator>
<span start="125" end="135"/>
<spannedText>Tom Sawyer</spannedText>
<creationDate>Wed November 11 13:04:51 2020</creationDate>
</annotation>
<classMention id="filename-1">
<mentionClass id="semehr_sensitive_info">Tom Sawyer</mentionClass>
</classMention>
</annotations>
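Annotations in this layout can be read back with the standard library XML parser, for example to check spans against the original text. The sketch below assumes the element names shown above; read_spans is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0" ?>
<annotations>
<annotation>
<mention id="filename-1"/>
<annotator id="filename-1">semehr</annotator>
<span start="125" end="135"/>
<spannedText>Tom Sawyer</spannedText>
<creationDate>Wed November 11 13:04:51 2020</creationDate>
</annotation>
<classMention id="filename-1">
<mentionClass id="semehr_sensitive_info">Tom Sawyer</mentionClass>
</classMention>
</annotations>"""

def read_spans(xml_string):
    """Return (start, end, text) for every annotation element."""
    root = ET.fromstring(xml_string)
    spans = []
    for annotation in root.iter("annotation"):
        span = annotation.find("span")
        text = annotation.findtext("spannedText")
        spans.append((int(span.get("start")), int(span.get("end")), text))
    return spans

spans = read_spans(xml_text)
```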
The phi output looks like this:
{
"doc": "inputfile1.txt",
"pos": 520,
"start": 520,
"type": "date",
"sent": "23/04/15"
},
{
"doc": "inputfile2.txt",
"pos": 1435,
"start": 1447,
"type": "assistant",
"sent": "Dr Jobs"
},
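A quick way to summarise the phi output is to count records by type. The sketch below assumes the file holds a JSON list of records with the keys shown above; the excerpt does not show the enclosing structure, so that is a guess.

```python
import json
from collections import Counter

# Assumed file content: a JSON list of the records shown above.
phi_json = """[
  {"doc": "inputfile1.txt", "pos": 520, "start": 520, "type": "date", "sent": "23/04/15"},
  {"doc": "inputfile2.txt", "pos": 1435, "start": 1447, "type": "assistant", "sent": "Dr Jobs"}
]"""

records = json.loads(phi_json)
by_type = Counter(record["type"] for record in records)
docs = sorted({record["doc"] for record in records})
```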
Please refer to the Rules section for details about rule design.
As of June 2021, we have the following rule sets defined for the SMI project.
- NB: always make a copy of the current rule file before making any changes to existing rules.
- Add a rule file NEW_rules.json to the rule file folder, or edit an existing rule file.
- Prepare a set of documents for testing. Ideally the set contains both the new situations you would like to improve on and a good sample of other mentions of the same type of PHI that you are modifying.
- Run the anonymiser script to test and validate.
- There is also a test script, test_rules.py, which allows you to test the rules on a fragment of text and shows you which rules matched.
Use the test_rules.py script to test all of the rules against a given string.