# Presidio Data Generator
This data generator takes a text file with templates (e.g. `my name is {{person}}`) and creates a list of InputSamples which contain fake PII entities instead of placeholders. It further creates spans (the start and end of each entity) for model training and evaluation.
## Scenarios
There are two main scenarios for using the Presidio Data Generator:
- Create a fake dataset for evaluation or training purposes, given a list of predefined templates (see this file for an example)
- Augment an existing labeled dataset with additional fake values.
In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates, and then scenario 1 is applied.
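For scenario 2, each labeled sentence is turned into a template by replacing its annotated spans with `{{entity}}` placeholders. The snippet below is a minimal, self-contained sketch of that conversion step under a simplified `(text, spans)` input format; the `to_template` function is illustrative and is not one of the repository's own conversion utilities.

```python
from typing import List, Tuple

def to_template(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace labeled spans with {{entity}} placeholders.

    `spans` is a list of (start, end, entity_name) tuples, e.g.
    [(11, 15, "person")] for "My name is John".
    """
    # Replace from the end of the string so earlier offsets stay valid.
    for start, end, entity in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + "{{" + entity + "}}" + text[end:]
    return text

print(to_template("My name is John", [(11, 15, "person")]))
# -> "My name is {{person}}"
```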
## Process
This generator heavily relies on the Faker package, with a few differences:

- `PresidioDataGenerator` returns not only fake text, but also the spans in which fake entities appear in the text.
- `Faker` samples each value independently. In many cases we would want to keep the semantic dependency between two values. For example, for the template `My name is {{name}} and my email is {{email}}`, we would prefer a result which has the name within the email address, such as `My name is Mike and my email is mike1243@gmail.com`. For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented. It accepts a dictionary / pandas DataFrame and favors returning objects from the same record (if possible). A conceptual sketch of this record-based sampling is shown below.
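The following is a conceptual sketch of the record-based idea, not the `RecordGenerator` API itself: it draws a whole record from a small table and fills all placeholders in a template from that single record, so related values (such as a name and an email derived from it) stay consistent. The `records` table, `fill_from_record` helper, and the sample values are all made up for illustration.

```python
import random

# A tiny "records" table; in the library this could be a dict or a pandas DataFrame.
records = [
    {"name": "Mike Smith", "email": "mike1243@gmail.com"},
    {"name": "Dana Levi", "email": "dana.levi@example.org"},
]

def fill_from_record(template: str, record: dict) -> str:
    # Fill every placeholder from the same record, keeping related values consistent.
    result = template
    for key, value in record.items():
        result = result.replace("{{" + key + "}}", value)
    return result

template = "My name is {{name}} and my email is {{email}}"
print(fill_from_record(template, random.choice(records)))
# e.g. "My name is Mike Smith and my email is mike1243@gmail.com"
```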
## Example
For a full example, see the Generate Data Notebook.
Simple example:
```python
from presidio_evaluator.data_generator import PresidioDataGenerator

sentence_templates = [
    "My name is {{name}}",
    "Please send it to {{address}}",
    "I just moved to {{city}} from {{country}}"
]

data_generator = PresidioDataGenerator()
fake_records = data_generator.generate_fake_data(
    templates=sentence_templates, n_samples=10
)

fake_records = list(fake_records)

# Print the fake text and spans of the first sample
print(fake_records[0].fake)
print(fake_records[0].spans)
```
At a high level, the process is the following:

1. Translate a NER dataset (e.g. CoNLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) Add new Faker providers to the `PresidioDataGenerator` to support types of PII not returned by Faker (a sketch appears after the notes below)
3. (Optional) Map dataset entity names into provider equivalents by calling `PresidioDataGenerator.add_provider_alias`. This will create entity aliases (e.g. Faker supports "name" but templates contain "person")
4. Generate samples using the templates list
5. Split the generated dataset into train/test/validation sets while making sure that samples from the same template only appear in one set
6. Adapt datasets for the various models (spaCy, Flair, CRF, sklearn)
7. Train models
8. Evaluate using one of the evaluation notebooks
Notes:
- For steps 5, 6, 7 see the main README.
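As a sketch of steps 2 and 3, the snippet below shows the standard Faker pattern for a custom provider together with the alias call mentioned above. The `MarsIdProvider` class and its `mars_id` element are made-up examples, and the exact method and parameter names on `PresidioDataGenerator` (`add_provider`, `provider_name`, `new_name`) are assumptions based on this README rather than verified signatures.

```python
from faker.providers import BaseProvider
from presidio_evaluator.data_generator import PresidioDataGenerator

# Hypothetical custom provider, following the standard Faker pattern.
class MarsIdProvider(BaseProvider):
    def mars_id(self) -> str:
        # Return a made-up ID; any string-generating logic works here.
        return self.bothify(text="MARS-####-????").upper()

data_generator = PresidioDataGenerator()

# Step 2 (method name assumed): register the custom provider so that
# templates can contain {{mars_id}}.
data_generator.add_provider(MarsIdProvider)

# Step 3 (parameter names assumed): map the dataset's "person" entity
# to Faker's built-in "name" provider.
data_generator.add_provider_alias(provider_name="name", new_name="person")

fake_records = list(
    data_generator.generate_fake_data(
        templates=["My name is {{person}}, my ID is {{mars_id}}"], n_samples=5
    )
)
print(fake_records[0].fake)
```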
Copyright notice:
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.