black and flake8-ing the entire code

omri374 2022-01-20 00:04:18 +02:00
Parent c9151c64c2
Commit 5655404a2c
27 changed files with 278 additions and 186 deletions

README.md
View file

@ -1,8 +1,10 @@
# Presidio-research
This package features data-science related tasks for developing new recognizers for [Presidio](https://github.com/microsoft/presidio).
It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models
This package features data-science related tasks for developing new recognizers for
[Presidio](https://github.com/microsoft/presidio).
It is used for the evaluation of the entire system,
as well as for evaluating specific PII recognizers or PII detection models.
In addition, it contains a fake data generator which creates fake sentences based on templates and fake PII.
## Who should use it?
- Anyone interested in **developing or evaluating a PII detection model**, an existing Presidio instance or a Presidio PII recognizer.
@ -42,72 +44,100 @@ Note that some dependencies (such as Flair and Stanza) are not installed to redu
See [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.
The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
The data generation process receives a file with templates,
e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values.
Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
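In code, the generation flow looks roughly like this (a minimal sketch modeled on the data generator's simple example; see its README for the full flow):
```python
from presidio_evaluator.data_generator import PresidioDataGenerator

sentence_templates = [
    "My name is {{name}}",
    "Please send it to {{address}}",
]

data_generator = PresidioDataGenerator()
fake_records = data_generator.generate_fake_data(
    templates=sentence_templates, n_samples=10
)

# Each generated record holds the fake text and the spans of the injected PII
for record in fake_records:
    print(record.fake, record.spans)
```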
- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
- For information on data generation/augmentation,
see the data generator [README](presidio_evaluator/data_generator/README.md).
- For an example for running the generation process, see [this notebook](notebooks/data%20generation/Generate%20data.ipynb).
- For an example of running the generation process,
see [this notebook](notebooks/1_Generate_data.ipynb).
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
Note that the generation process might not work off-the-shelf as we are not sharing the fake PII datasets and templates used in this analysis, due to copyright and other restrictions.
- For an understanding of the underlying fake PII data used,
see this [exploratory data analysis notebook](notebooks/2_PII_EDA.ipynb).
Once data is generated, it could be split into train/test/validation sets while ensuring that each template only exists in one set. See [this notebook for more details](notebooks/Split%20by%20pattern%20%23.ipynb).
Once data is generated, it can be split into train/test/validation sets
while ensuring that each template only exists in one set.
See [this notebook for more details](notebooks/3_Split_by_pattern_%23.ipynb).
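A hedged sketch of that split (not the notebook's exact code): group samples by `template_id`, assuming each `InputSample` carries the id of the template it was generated from, so that no template crosses sets.
```python
import random
from collections import defaultdict


def split_by_template(dataset, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split a dataset so that every template lands in exactly one subset."""
    by_template = defaultdict(list)
    for sample in dataset:
        by_template[sample.template_id].append(sample)

    template_ids = list(by_template)
    random.Random(seed).shuffle(template_ids)

    cut1 = int(ratios[0] * len(template_ids))
    cut2 = int((ratios[0] + ratios[1]) * len(template_ids))
    groups = (template_ids[:cut1], template_ids[cut1:cut2], template_ids[cut2:])
    return [[s for tid in ids for s in by_template[tid]] for ids in groups]
```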
## 2. Data representation
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [data_objects.py](presidio_evaluator/data_objects.py).
In order to standardize the process,
we use specific data objects that hold all the information needed for generating,
analyzing, modeling and evaluating data and models. Specifically,
see [data_objects.py](presidio_evaluator/data_objects.py).
## 3. Recognizer evaluation
The standardized structure, `List[InputSample]`, can be translated into different formats:
- CONLL
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [Evaluator](presidio_evaluator/evaluation/evaluator.py) class. It provides a structured way of evaluating models and recognizers.
```
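For example, mirroring the pattern used in the evaluation notebooks (a minimal sketch; it assumes a default Presidio `AnalyzerEngine` can be instantiated locally):
```python
from copy import deepcopy

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper

dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")

model = PresidioAnalyzerWrapper()
evaluator = Evaluator(model=model)
evaluation_results = evaluator.evaluate_all(deepcopy(dataset))
results = evaluator.calculate_score(evaluation_results)
print(results)  # aggregated precision/recall
```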
### Ready model / engine wrappers
- spaCy v3
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
```
Some evaluators were developed for analysis and reference. These include:
- Flair
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)
```
#### Presidio analyzer evaluation
- json
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")
```
Allows you to evaluate an existing Presidio instance. [See this notebook for details](notebooks/Evaluate%20Presidio%20Analyzer.ipynb).
## 3. PII models evaluation
#### One recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model,
or a specific PII recognizer for precision and recall.
Evaluate one specific recognizer for precision and recall.
Similar to the analyzer evaluation, just focusing on one type of PII recognizer.
See [presidio_recognizer_wrapper.py](presidio_evaluator/models/presidio_recognizer_wrapper.py).
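For instance, wrapping a single built-in recognizer could look like the following (a hedged sketch; the exact constructor arguments are documented in the wrapper itself):
```python
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioRecognizerWrapper

dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")

model = PresidioRecognizerWrapper(
    recognizer=CreditCardRecognizer(),
    nlp_engine=NlpEngineProvider().create_engine(),
    entities_to_keep=["CREDIT_CARD"],
)
evaluator = Evaluator(model=model)
results = evaluator.calculate_score(evaluator.evaluate_all(dataset))
```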
#### Conditional Random Fields
### Examples:
- [Evaluate Presidio](notebooks/4_Evaluate_Presidio_Analyzer.ipynb)
- [Evaluate spaCy models](notebooks/models/Evaluate%20spacy%20models.ipynb)
- [Evaluate Stanza models](notebooks/models/Evaluate%20stanza%20models.ipynb)
- [Evaluate CRF models](notebooks/models/Evaluate%20crf%20models.ipynb)
- [Evaluate Flair models](notebooks/models/Evaluate%20flair%20models.ipynb)
To train a CRF on a new dataset, see [this notebook](notebooks/models/Train CRF.ipynb).
To evaluate a CRF model, see the [same notebook](notebooks/models/Train CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
#### spaCy based models
## 4. Training PII detection models
There are three ways of interacting with spaCy models:
### CRF
1. Evaluate an existing trained model
2. Train with pretrained embeddings
3. Fine-tune an existing spaCy model
To train a vanilla CRF on a new dataset, see [this notebook](notebooks/models/Train%20CRF.ipynb). To evaluate, see [this notebook](notebooks/models/Evaluate%20CRF%20models.ipynb).
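Under the hood the CRF is a `sklearn_crfsuite` model (the package is listed in `setup.py`); the training step itself is roughly the following toy sketch, with real feature extraction left to the notebook and to `CRFModel`:
```python
import sklearn_crfsuite

# Toy example: one sentence as a list of per-token feature dicts plus BIO tags.
X_train = [[{"word.lower()": "my"}, {"word.lower()": "name"},
            {"word.lower()": "is"}, {"word.lower()": "john"}]]
y_train = [["O", "O", "O", "B-PERSON"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```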
Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
### spaCy
##### Evaluate an existing spaCy model
To train a new spaCy model, first save the dataset in a spaCy format:
```python
# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
```
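From there, training can proceed with spaCy v3's config-driven CLI, e.g. `python -m spacy train config.cfg --paths.train dataset.spacy --paths.dev dataset.spacy` (see the spaCy documentation for generating a config file).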
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
To evaluate, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb)
#### Flair based models
To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
To train a Flair model, run:
### Flair
- To train Flair models, see this [helper class](presidio_evaluator/models/flair_train.py) or this snippet:
```python
from presidio_evaluator.models import FlairTrainer
train_samples = "../data/generated_train.json"
test_samples = "../data/generated_test.json"
val_samples = "../data/generated_validation.json"
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
@ -116,9 +146,8 @@ corpus = trainer.read_corpus("")
trainer.train(corpus)
```
To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).
# For more information
## For more information
- [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)
- [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)

View file

@ -1,2 +1,2 @@
0.0.2
0.1.0

View file

@ -27,9 +27,9 @@
"\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.max_columns', None) \n",
"pd.set_option('display.max_rows', None) \n",
"pd.set_option('display.max_colwidth', None)\n",
"pd.set_option(\"display.max_columns\", None)\n",
"pd.set_option(\"display.max_rows\", None)\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2"
@ -82,12 +82,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -153,8 +157,8 @@
"metadata": {},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -238,7 +242,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors,entity=['PHONE_NUMBER'])"
"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"PHONE_NUMBER\"])"
]
},
{
@ -259,7 +263,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in errors]"
"[print(error, \"\\n\") for error in errors]"
]
}
],

View file

@ -31,9 +31,9 @@
"\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.max_columns', None) \n",
"pd.set_option('display.max_rows', None) \n",
"pd.set_option('display.max_colwidth', None)\n",
"pd.set_option(\"display.max_columns\", None)\n",
"pd.set_option(\"display.max_rows\", None)\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2"
@ -98,12 +98,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -160,7 +164,7 @@
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_path}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_path}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -197,8 +201,8 @@
},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -290,7 +294,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors, entity=['GPE'])"
"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"GPE\"])"
]
},
{
@ -311,7 +315,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in errors]"
"[print(error, \"\\n\") for error in errors]"
]
},
{

View file

@ -27,9 +27,9 @@
"\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.max_columns', None) \n",
"pd.set_option('display.max_rows', None) \n",
"pd.set_option('display.max_colwidth', None)\n",
"pd.set_option(\"display.max_columns\", None)\n",
"pd.set_option(\"display.max_rows\", None)\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2"
@ -51,7 +51,9 @@
"outputs": [],
"source": [
"dataset_name = \"synth_dataset_v2.json\"\n",
"dataset = InputSample.read_dataset_json(Path(Path.cwd().parent.parent, \"data\", dataset_name))\n",
"dataset = InputSample.read_dataset_json(\n",
" Path(Path.cwd().parent.parent, \"data\", dataset_name)\n",
")\n",
"print(len(dataset))"
]
},
@ -82,12 +84,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -109,7 +115,13 @@
"flair_ner_fast = \"ner-english-fast\"\n",
"flair_ontonotes_fast = \"ner-english-ontonotes-fast\"\n",
"flair_ontonotes_large = \"ner-english-ontonotes-large\"\n",
"models = [flair_ner, flair_ner_fast, flair_ontonotes_fast ,flair_ner_fast, flair_ontonotes_large]"
"models = [\n",
" flair_ner,\n",
" flair_ner_fast,\n",
" flair_ontonotes_fast,\n",
" flair_ner_fast,\n",
" flair_ontonotes_large,\n",
"]"
]
},
{
@ -138,7 +150,7 @@
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -171,8 +183,8 @@
"metadata": {},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -265,7 +277,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors, entity=['GPE'])"
"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"GPE\"])"
]
},
{
@ -286,7 +298,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in errors]"
"[print(error, \"\\n\") for error in errors]"
]
},
{

View file

@ -136,15 +136,13 @@
" print(f\"Evaluating model {model_name}\")\n",
"\n",
" nlp = spacy.load(model_name)\n",
" model = SpacyModel(\n",
" model=nlp, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"]\n",
" )\n",
" model = SpacyModel(model=nlp, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"])\n",
" evaluator = Evaluator(model=model)\n",
" evaluation_results = evaluator.evaluate_all(deepcopy(dataset))\n",
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -200,8 +198,8 @@
},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -272,7 +270,6 @@
},
"outputs": [],
"source": [
"\n",
"ModelError.most_common_fn_tokens(errors, n=50, entity=[\"PERSON\"])"
]
},

View file

@ -52,7 +52,9 @@
"outputs": [],
"source": [
"dataset_name = \"synth_dataset_v2.json\"\n",
"dataset = InputSample.read_dataset_json(Path(Path.cwd().parent.parent, \"data\", dataset_name))\n",
"dataset = InputSample.read_dataset_json(\n",
" Path(Path.cwd().parent.parent, \"data\", dataset_name)\n",
")\n",
"print(len(dataset))"
]
},
@ -83,12 +85,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -128,14 +134,16 @@
" experiment = get_experiment_tracker()\n",
" print(\"-----------------------------------\")\n",
" print(f\"Evaluating model {model_name}\")\n",
" \n",
" model = StanzaModel(model_name=model_name, entities_to_keep=['PERSON', 'GPE', 'ORG', 'NORP'])\n",
"\n",
" model = StanzaModel(\n",
" model_name=model_name, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"]\n",
" )\n",
" evaluator = Evaluator(model=model)\n",
" evaluation_results = evaluator.evaluate_all(deepcopy(dataset))\n",
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -168,8 +176,8 @@
"metadata": {},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -243,7 +251,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors=results.model_errors,entity=['GPE'])"
"fns_df = ModelError.get_fns_dataframe(errors=results.model_errors, entity=[\"GPE\"])"
]
},
{
@ -264,7 +272,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in results.model_errors]"
"[print(error, \"\\n\") for error in results.model_errors]"
]
},
{

View file

@ -9,22 +9,30 @@ for model training and evaluation.
There are two main scenarios for using the Presidio Data Generator:
1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates (see [this file](raw_data/templates.txt) for example)
1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates
(see [this file](raw_data/templates.txt) for an example)
2. Augment an existing labeled dataset with additional fake values.
In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates, and then scenario 1 is applied.
In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates,
and then scenario 1 is applied.
## Process
This generator heavily relies on the [Faker package](https://www.github.com/joke2k/faker) with a few differences:
1. `PresidioDataGenerator` returns not only fake text, but all the spans in which fake entities appear in the text
1. `PresidioDataGenerator` returns not only fake text, but also the spans in which fake entities appear in the text.
2. Faker samples each value independently. In many cases we would want to keep the semantic dependency between two values. For example, for the template `My name is {{name}} and my email is {{email}}`, we would prefer a result which has the name within the email address, such as `My name is Mike and my email is mike1243@gmail.com`. For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented. It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record (if possible).
2. `Faker` samples each value independently.
In many cases we would want to keep the semantic dependency between two values.
For example, for the template `My name is {{name}} and my email is {{email}}`,
we would prefer a result which has the name within the email address,
such as `My name is Mike and my email is mike1243@gmail.com`.
For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented.
It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record (if possible).
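A hedged sketch of that behavior (the record values below are made up for illustration):
```python
import pandas as pd

from presidio_evaluator.data_generator.faker_extensions import RecordsFaker

# Each row is one coherent fake identity; values in a row belong together
records = pd.DataFrame(
    [
        {"name": "Mike Smith", "email": "mike1243@gmail.com"},
        {"name": "Dana Levy", "email": "dana.levy@example.org"},
    ]
)
faker = RecordsFaker(records=records)

# Sampled values should now favor coming from the same record
print(faker.name(), faker.email())
```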
## Example
For a full example, see the [Generation Data Notebook](../../notebooks/1_Generate_data.ipynb).
For a full example, see the [Generate Data Notebook](../../notebooks/1_Generate_data.ipynb).
Simple example:
@ -57,44 +65,18 @@ The process in high level is the following:
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) adapt the FakeDataGenerator to support new extensions
which could generate fake PII entities
3. Generate X samples using the templates list + a fake PII dataset +
extensions that add additional PII entities
2. (Optional) add new Faker providers to the `PresidioDataGenerator` to support types of PII not returned by Faker
3. Generate samples using the templates list
4. Split the generated dataset into train/test/validation while making sure
that samples from the same template only appear in one set
5. Adapt datasets for the various models (spaCy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
7. Evaluate using one of the [evaluation notebooks](../../notebooks/models)
Notes:
- For steps 5, 6, 7 see the main [README](../../README.md).
- For a simple data generation pipeline,
[see this notebook](../../notebooks/data%20generation/Generate%20data.ipynb).
- For information on transforming a NER dataset into templates,
see the notebooks in the [helper notebooks](../../notebooks/data%20generation) folder.
Example run:
```python
from presidio_evaluator.data_generator import generate
TEMPLATES_FILE = 'raw_data/templates.txt'
OUTPUT = "generated_.txt"
## Should be downloaded from FakeNameGenerator
fake_pii_csv = 'raw_data/FakeNameGenerator.csv'
examples = generate(fake_pii_csv=fake_pii_csv,
                    utterances_file=TEMPLATES_FILE,
                    dictionary_path=None,
                    output_file=OUTPUT,
                    lower_case_ratio=0.1,
                    num_of_examples=100,
                    ignore_types={"IP_ADDRESS", 'US_SSN', 'URL'},
                    keep_only_tagged=False,
                    span_to_tag=True)
```
*Copyright notice:*

View file

@ -9,7 +9,7 @@ from .providers import (
IpAddressProvider,
AddressProviderNew,
PhoneNumberProviderNew,
AgeProvider
AgeProvider,
)
__all__ = [
@ -24,5 +24,5 @@ __all__ = [
"AddressProviderNew",
"PhoneNumberProviderNew",
"AgeProvider",
"RecordsFaker"
"RecordsFaker",
]

View file

@ -7,7 +7,6 @@ from presidio_evaluator.data_generator.faker_extensions import RecordGenerator
class RecordsFaker(Faker):
def __init__(self, records: Union[pd.DataFrame, List[Dict]], **kwargs):
if isinstance(records, pd.DataFrame):
records = records.to_dict(orient="records")

View file

@ -281,7 +281,7 @@ class PresidioDataGenerator:
"prefix_female",
"prefix_male",
"last_name_female",
"last_name_male"
"last_name_male",
],
),
axis=1,

View file

@ -5,7 +5,7 @@ from typing import List, Optional, Union, Dict, Any, Tuple
import pandas as pd
import spacy
from spacy import Language
from spacy.tokens import Token, Doc, DocBin
from spacy.tokens import Doc, DocBin
from spacy.training import iob_to_biluo
from tqdm import tqdm
@ -137,14 +137,14 @@ class InputSample(object):
:param full_text: The raw text of this sample
:param masked: Masked/Templated version of the raw text
:param spans: List of spans for entities
:param create_tags_from_span: True if tags (tokens+taks) should be added
:param create_tags_from_span: True if tags (tokens+tags) should be added
:param scheme: IO, BIO/IOB or BILOU. Only applicable if span_to_tag=True
:param tokens: spaCy Doc object
:param tags: list of strings representing the label for each token,
given the scheme
:param metadata: A dictionary of additional metadata on the sample,
in the English (or other language) vocabulary
:param template_id: Original template (utterance) of sample, in case it was generated
:param template_id: Original template (utterance) of sample, in case it was generated # noqa
"""
if tags is None:
tags = []
@ -532,7 +532,7 @@ class InputSample(object):
span.entity_value = "O"
@staticmethod
def create_flair_dataset(dataset):
def create_flair_dataset(dataset: List["InputSample"]) -> List[str]:
flair_samples = []
for sample in dataset:
flair_samples.append(sample.to_flair())

View file

@ -143,7 +143,9 @@ class Evaluator:
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
evaluation_results = []
if self.model.entity_mapping:
print(f"Mapping entity values using this dictionary: {self.model.entity_mapping}")
print(
f"Mapping entity values using this dictionary: {self.model.entity_mapping}"
)
for sample in tqdm(dataset, desc=f"Evaluating {self.model.__class__}"):
# Align tag values to the ones expected by the model

View file

@ -40,7 +40,7 @@ class ExperimentTracker:
labels=List[str],
):
self.confusion_matrix = matrix
self.labels=labels
self.labels = labels
def start(self):
pass

View file

@ -1,3 +1,4 @@
"""Helper scripts for calling different NER models."""
from .base_model import BaseModel
from .crf_model import CRFModel
from .presidio_analyzer_wrapper import PresidioAnalyzerWrapper

View file

@ -16,7 +16,7 @@ class CRFModel(BaseModel):
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
if model_pickle_path is None:

View file

@ -16,6 +16,15 @@ from presidio_evaluator.models import BaseModel
class FlairModel(BaseModel):
"""
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path: Path to a stored Flair model, used if `model` is not provided
:param entities_to_keep: List of entity types to focus on
:param verbose: Whether to print verbose output
:param entity_mapping: Mapping between input dataset entities
and model expected entity types
"""
def __init__(
self,
model=None,
@ -24,14 +33,7 @@ class FlairModel(BaseModel):
verbose: bool = False,
entity_mapping: Dict[str, str] = PRESIDIO_SPACY_ENTITIES,
):
"""
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path:
:param entities_to_keep:
:param verbose:
and model expected entity types
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,

View file

@ -1,5 +1,7 @@
from typing import List
import pandas as pd
try:
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
@ -21,11 +23,20 @@ from os import path
class FlairTrainer:
"""
Helper class for training Flair models
"""
@staticmethod
def to_flair_row(text, pos, label):
def to_flair_row(text: str, pos: str, label: str) -> str:
"""
Turn text, part of speech and label into one row.
:return: str
"""
return "{} {} {}".format(text, pos, label)
def to_flair(self, df, outfile="flair_train.txt"):
def to_flair(self, df: pd.DataFrame, outfile: str = "flair_train.txt") -> None:
"""Translate a pd.DataFrame to a flair dataset."""
sentence = 0
flair = []
for row in df.itertuples():
@ -43,10 +54,19 @@ class FlairTrainer:
def create_flair_corpus(
self, train_samples_path, test_samples_path, val_samples_path
):
"""
Create a flair Corpus object and save it to train, test and validation files.
:param train_samples_path: Path to train samples
:param test_samples_path: Path to test samples
:param val_samples_path: Path to validation samples
:return:
"""
if not path.exists("flair_train.txt"):
train_samples = InputSample.read_dataset_json(train_samples_path)
train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
print(f"Kept {len(train_tagged)} train samples after removal of non-tagged samples")
print(
f"Kept {len(train_tagged)} train samples after removal of non-tagged samples"
)
train_data = InputSample.create_conll_dataset(train_tagged)
self.to_flair(train_data, outfile="flair_train.txt")
@ -61,7 +81,12 @@ class FlairTrainer:
self.to_flair(val_data, outfile="flair_val.txt")
@staticmethod
def read_corpus(data_folder):
def read_corpus(data_folder: str):
"""
Read Flair Corpus object.
:param data_folder: Path with files
:return: Corpus object
"""
columns = {0: "text", 1: "pos", 2: "ner"}
corpus = ColumnCorpus(
data_folder,
@ -73,7 +98,12 @@ class FlairTrainer:
return corpus
@staticmethod
def train(corpus):
def train(corpus: Corpus):
"""
Train a Flair model
:param corpus: Corpus object
:return:
"""
print(corpus)
# 2. what tag do we want to predict?

View file

@ -15,7 +15,7 @@ class PresidioAnalyzerWrapper(BaseModel):
labeling_scheme: str = "BIO",
score_threshold: float = 0.4,
language: str = "en",
entity_mapping:Optional[Dict[str,str]]=None
entity_mapping: Optional[Dict[str, str]] = None,
):
"""
Evaluation wrapper for the Presidio Analyzer
@ -25,7 +25,7 @@ class PresidioAnalyzerWrapper(BaseModel):
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
self.score_threshold = score_threshold
self.language = language

View file

@ -9,6 +9,17 @@ from presidio_evaluator.span_to_tag import span_to_tag
class PresidioRecognizerWrapper(BaseModel):
"""
Class wrapper for one specific PII recognizer
To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
:param entities_to_keep: List of entity types to focus on while ignoring all the rest.
Default=None would look at all entity types
:param with_nlp_artifacts: Whether NLP artifacts should be obtained
(faster if not, but some recognizers need it)
"""
def __init__(
self,
recognizer: EntityRecognizer,
@ -19,21 +30,12 @@ class PresidioRecognizerWrapper(BaseModel):
entity_mapping: Optional[Dict[str, str]] = None,
verbose: bool = False,
):
"""
Evaluator for one specific PII recognizer
To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
:param entities_to_keep: List of entity types to focus on while ignoring all the rest.
Default=None would look at all entity types
:param with_nlp_artifacts: Whether NLP artifacts should be obtained
(faster if not, but some recognizers need it)
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
self.with_nlp_artifacts = with_nlp_artifacts
self.recognizer = recognizer

View file

@ -21,7 +21,7 @@ class SpacyModel(BaseModel):
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
if model is None:
@ -32,14 +32,19 @@ class SpacyModel(BaseModel):
self.model = model
def predict(self, sample: InputSample) -> List[str]:
"""
Predict a list of tags for an input sample.
:param sample: InputSample
:return: list of tags
"""
doc = self.model(sample.full_text)
tags = self.get_tags_from_doc(doc)
tags = self._get_tags_from_doc(doc)
if len(doc) != len(sample.tokens):
print("mismatch between input tokens and new tokens")
return tags
@staticmethod
def get_tags_from_doc(doc):
def _get_tags_from_doc(doc):
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
return tags

View file

@ -14,6 +14,17 @@ from presidio_evaluator.models import SpacyModel
class StanzaModel(SpacyModel):
"""
Class wrapping Stanza models, using spacy_stanza.
:param model: spaCy Language object representing a stanza model
:param model_name: Name of model, e.g. "en"
:param entities_to_keep: List of entities to predict on
:param verbose: Whether to print more
:param labeling_scheme: Whether to return IO, BIO or BILUO tags
:param entity_mapping: Mapping between input dataset entities and entities expected by the model
"""
def __init__(
self,
model: spacy.language.Language = None,
@ -23,6 +34,7 @@ class StanzaModel(SpacyModel):
labeling_scheme: str = "BIO",
entity_mapping: Optional[Dict[str, str]] = PRESIDIO_SPACY_ENTITIES,
):
if not model and not model_name:
raise ValueError("Either model_name or model object must be supplied")
if not model:
@ -40,6 +52,12 @@ class StanzaModel(SpacyModel):
)
def predict(self, sample: InputSample) -> List[str]:
"""
Predict the tags using a stanza model.
:param sample: InputSample with text
:return: list of tags
"""
doc = self.model(sample.full_text)
if doc.ents:
@ -48,7 +66,8 @@ class StanzaModel(SpacyModel):
)
# Stanza tokens might not be consistent with spaCy's tokens.
# Use spacy tokenization and not stanza to maintain consistency with other models:
# Use spacy tokenization and not stanza
# to maintain consistency with other models:
if not sample.tokens:
sample.tokens = tokenize(sample.full_text)

View file

@ -20,12 +20,12 @@ setup(
packages=find_packages(exclude=["tests"]),
url="https://www.github.com/microsoft/presidio-research",
license="MIT",
description="PII dataset generator, model evaluator for Presidio and PII data in general",
description="PII dataset generator, model evaluator for Presidio and PII data in general", # noqa
data_files=[
(
"presidio_evaluator/data_generator/raw_data",
[
"presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv",
"presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv", # noqa
"presidio_evaluator/data_generator/raw_data/templates.txt",
"presidio_evaluator/data_generator/raw_data/organizations.csv",
"presidio_evaluator/data_generator/raw_data/nationalities.csv",
@ -46,6 +46,6 @@ setup(
"schwifty",
"faker",
"sklearn_crfsuite",
"python-dotenv"
"python-dotenv",
],
)

View file

@ -69,7 +69,7 @@ def test_to_spacy_file_and_back(small_dataset):
output_path="dataset.spacy",
translate_tags=False,
spacy_pipeline=spacy_pipeline,
alignment_mode = "strict"
alignment_mode="strict",
)
db = DocBin()

View file

@ -23,7 +23,6 @@ def test_flair_simple():
os.path.join(dir_path, "data/generated_small.json")
)
flair_model = FlairModel(model_path="ner", entities_to_keep=["PERSON"])
evaluator = Evaluator(model=flair_model)
evaluation_results = evaluator.evaluate_all(input_samples)

View file

@ -55,10 +55,7 @@ cc_test_template_testdata = [
# credit card recognizer tests on template-generates data
@pytest.mark.parametrize(
"pii_csv, "
"utterances, "
"num_of_examples, "
"acceptance_threshold",
"pii_csv, " "utterances, " "num_of_examples, " "acceptance_threshold",
[testcase.to_pytest_param() for testcase in cc_test_template_testdata],
)
def test_credit_card_recognizer_with_template(

View file

@ -77,7 +77,7 @@ rocket_test_template_testdata = [
utterances="{}/data/rocket_example_sentences.txt",
num_of_examples=100,
acceptance_threshold=0,
max_mistakes_number=100
max_mistakes_number=100,
),
PatternRecognizerTestCase(
test_name="rocket-some-errors",