black and flake8-ing the entire code
Parent: c9151c64c2
Commit: 5655404a2c
README.md (117 changed lines)
@@ -1,8 +1,10 @@
 # Presidio-research
-This package features data-science related tasks for developing new recognizers for [Presidio](https://github.com/microsoft/presidio).
-It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models
+This package features data-science related tasks for developing new recognizers for
+[Presidio](https://github.com/microsoft/presidio).
+It is used for the evaluation of the entire system,
+as well as for evaluating specific PII recognizers or PII detection models
 In addition, it contains a fake data generator which creates fake sentences based on templates and fake PII.

 ## Who should use it?

 - Anyone interested in **developing or evaluating a PII detection model**, an existing Presidio instance or a Presidio PII recognizer.
@@ -42,72 +44,100 @@ Note that some dependencies (such as Flair and Stanza) are not installed to redu

 See [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.

-The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
-Then, it creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
+The data generation process receives a file with templates,
+e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
+Then, it creates new synthetic sentences by sampling templates and PII values.
+Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.

-- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
+- For information on data generation/augmentation,
+  see the data generator [README](presidio_evaluator/data_generator/README.md).

-- For an example for running the generation process, see [this notebook](notebooks/data%20generation/Generate%20data.ipynb).
+- For an example for running the generation process,
+  see [this notebook](notebooks/1_Generate_data.ipynb).

-- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
-Note that the generation process might not work off-the-shelf as we are not sharing the fake PII datasets and templates used in this analysis, do to copyright and other restrictions.
+- For an understanding of the underlying fake PII data used,
+  see this [exploratory data analysis notebook](notebooks/2_PII_EDA.ipynb).

-Once data is generated, it could be split into train/test/validation sets while ensuring that each template only exists in one set. See [this notebook for more details](notebooks/Split%20by%20pattern%20%23.ipynb).
+Once data is generated, it could be split into train/test/validation sets
+while ensuring that each template only exists in one set.
+See [this notebook for more details](notebooks/3_Split_by_pattern_%23.ipynb).
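For reference, the template-aware split described above can be sketched as follows. This is an illustration rather than code from this commit: `template_id` is the `InputSample` attribute defined in `data_objects.py`, while scikit-learn's `GroupShuffleSplit` stands in for the logic of the split notebook.

```python
# Keep every template in exactly one set by grouping samples on template_id.
from sklearn.model_selection import GroupShuffleSplit

from presidio_evaluator import InputSample

dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
groups = [sample.template_id for sample in dataset]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(dataset, groups=groups))
train = [dataset[i] for i in train_idx]
test = [dataset[i] for i in test_idx]
```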
 ## 2. Data representation

-In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [data_objects.py](presidio_evaluator/data_objects.py).
+In order to standardize the process,
+we use specific data objects that hold all the information needed for generating,
+analyzing, modeling and evaluating data and models. Specifically,
+see [data_objects.py](presidio_evaluator/data_objects.py).

-## 3. Recognizer evaluation
+The standardized structure, `List[InputSample]` could be translated into different formats:
+- CONLL
+
+```python
+from presidio_evaluator import InputSample
+dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
+conll = InputSample.create_conll_dataset(dataset)
+conll.to_csv("dataset.csv", sep="\t")
-The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
-The main logic lies in the [Evaluator](presidio_evaluator/evaluation/evaluator.py) class. It provides a structured way of evaluating models and recognizers.
+```

-### Ready model / engine wrappers
+- spaCy v3
+
+```python
+from presidio_evaluator import InputSample
+dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
+InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
+```

-Some evaluators were developed for analysis and references. These include:
+- Flair
+
+```python
+from presidio_evaluator import InputSample
+dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
+flair = InputSample.create_flair_dataset(dataset)
+```

-#### Presidio analyzer evaluation
+- json
+
+```python
+from presidio_evaluator import InputSample
+dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
+InputSample.to_json(dataset, output_file="dataset_json")
+```

-Allows you to evaluate an existing Presidio instance. [See this notebook for details](notebooks/Evaluate%20Presidio%20Analyzer.ipynb).
+## 3. PII models evaluation

-#### One recognizer evaluation
+The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model,
+or a specific PII recognizer for precision and recall.

-Evaluate one specific recognizer for precision and recall.
-Similar to the analyzer evaluation just focusing on one type of PII recognizer.
-See [presidio_recognizer_wrapper.py](presidio_evaluator/models/presidio_recognizer_wrapper.py)

-#### Conditional Random Fields
+### Examples:
+- [Evaluate Presidio](notebooks/4_Evaluate_Presidio_Analyzer.ipynb)
+- [Evaluate spaCy models](notebooks/models/Evaluate%20spacy%20models.ipynb)
+- [Evaluate Stanza models](notebooks/models/Evaluate%20stanza%20models.ipynb)
+- [Evaluate CRF models](notebooks/models/Evaluate%20crf%20models.ipynb)
+- [Evaluate Flair models](notebooks/models/Evaluate%20flair%20models.ipynb)

-To train a CRF on a new dataset, see [this notebook](notebooks/models/Train CRF.ipynb).
-To evaluate a CRF model, see the the [same notebook](notebooks/models/Train CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
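For reference, the evaluation flow that the section above describes can be sketched end to end from the notebook code touched in this commit; the import paths are assumptions based on the repository layout.

```python
from copy import deepcopy

import spacy

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError
from presidio_evaluator.models import SpacyModel

dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")

# Wrap any NER model; SpacyModel is one of the wrappers in this repo.
nlp = spacy.load("en_core_web_lg")
model = SpacyModel(model=nlp, entities_to_keep=["PERSON", "GPE", "ORG", "NORP"])

evaluator = Evaluator(model=model)
evaluation_results = evaluator.evaluate_all(deepcopy(dataset))
results = evaluator.calculate_score(evaluation_results)

# Drill into false negatives for one entity type, as the notebooks do.
fns_df = ModelError.get_fns_dataframe(results.model_errors, entity=["GPE"])
```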
-#### spaCy based models
+## 4. Training PII detection models

-There are three ways of interacting with spaCy models:
+### CRF

-1. Evaluate an existing trained model
-2. Train with pretrained embeddings
-3. Fine tune an existing spaCy model
+To train a vanilla CRF on a new dataset, see [this notebook](notebooks/models/Train%20CRF.ipynb). To evaluate, see [this notebook](notebooks/models/Evaluate%20CRF%20models.ipynb).

-Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
-See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
+### spaCy

-##### Evaluate an existing spaCy model
+To train a new spaCy model, first save the dataset in a spaCy format:
+
+```python
+# dataset is a List[InputSample]
+InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
+```

-To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
+To evaluate, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb)

-#### Flair based models

-To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
-For experimenting with other embedding types, change the `embeddings` object in the `train` method.
-To train a Flair model, run:
+### Flair

+- To train Flair models, see this [helper class](presidio_evaluator/models/flair_train.py) or this snippet:

 ```python
 from presidio_evaluator.models import FlairTrainer
-train_samples = "../data/generated_train.json"
-test_samples = "../data/generated_test.json"
-val_samples = "../data/generated_validation.json"
+train_samples = "data/generated_train.json"
+test_samples = "data/generated_test.json"
+val_samples = "data/generated_validation.json"

 trainer = FlairTrainer()
 trainer.create_flair_corpus(train_samples, test_samples, val_samples)
@@ -116,9 +146,8 @@ corpus = trainer.read_corpus("")
 trainer.train(corpus)
 ```

-To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).

-# For more information
+## For more information

 - [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)
 - [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)
VERSION (2 changed lines)

@@ -1,2 +1,2 @@
-0.0.2
+0.1.0
@@ -27,9 +27,9 @@
 "\n",
 "import pandas as pd\n",
 "\n",
-"pd.set_option('display.max_columns', None) \n",
-"pd.set_option('display.max_rows', None) \n",
-"pd.set_option('display.max_colwidth', None)\n",
+"pd.set_option(\"display.max_columns\", None)\n",
+"pd.set_option(\"display.max_rows\", None)\n",
+"pd.set_option(\"display.max_colwidth\", None)\n",
 "\n",
 "%reload_ext autoreload\n",
 "%autoreload 2"
@@ -82,12 +82,16 @@
 "print(dataset[1])\n",
 "\n",
 "print(\"\\nMin and max number of tokens in dataset:\")\n",
-"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
+"print(\n",
+" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
+")\n",
 "\n",
 "print(\"\\nMin and max sentence length in dataset:\")\n",
-"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
+"print(\n",
+" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
+")"
 ]
},
{
@@ -153,8 +157,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"sent = 'I am taiwanese but I live in Cambodia.'\n",
-"#sent = input(\"Enter sentence: \")\n",
+"sent = \"I am taiwanese but I live in Cambodia.\"\n",
+"# sent = input(\"Enter sentence: \")\n",
 "model.predict(InputSample(full_text=sent))"
 ]
},
@@ -238,7 +242,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fns_df = ModelError.get_fns_dataframe(errors,entity=['PHONE_NUMBER'])"
+"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"PHONE_NUMBER\"])"
 ]
},
{
@@ -259,7 +263,7 @@
 "outputs": [],
 "source": [
 "print(\"All errors:\\n\")\n",
-"[print(error,\"\\n\") for error in errors]"
+"[print(error, \"\\n\") for error in errors]"
 ]
}
],
@@ -31,9 +31,9 @@
 "\n",
 "import pandas as pd\n",
 "\n",
-"pd.set_option('display.max_columns', None) \n",
-"pd.set_option('display.max_rows', None) \n",
-"pd.set_option('display.max_colwidth', None)\n",
+"pd.set_option(\"display.max_columns\", None)\n",
+"pd.set_option(\"display.max_rows\", None)\n",
+"pd.set_option(\"display.max_colwidth\", None)\n",
 "\n",
 "%reload_ext autoreload\n",
 "%autoreload 2"
@@ -98,12 +98,16 @@
 "print(dataset[1])\n",
 "\n",
 "print(\"\\nMin and max number of tokens in dataset:\")\n",
-"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
+"print(\n",
+" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
+")\n",
 "\n",
 "print(\"\\nMin and max sentence length in dataset:\")\n",
-"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
+"print(\n",
+" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
+")"
 ]
},
{
@@ -160,7 +164,7 @@
 " results = evaluator.calculate_score(evaluation_results)\n",
 "\n",
 " # update params tracking\n",
-" params = {\"dataset_name\":dataset_name, \"model_name\": model_path}\n",
+" params = {\"dataset_name\": dataset_name, \"model_name\": model_path}\n",
 " params.update(model.to_log())\n",
 " experiment.log_parameters(params)\n",
 " experiment.log_dataset_hash(dataset)\n",
@@ -197,8 +201,8 @@
 },
 "outputs": [],
 "source": [
-"sent = 'I am taiwanese but I live in Cambodia.'\n",
-"#sent = input(\"Enter sentence: \")\n",
+"sent = \"I am taiwanese but I live in Cambodia.\"\n",
+"# sent = input(\"Enter sentence: \")\n",
 "model.predict(InputSample(full_text=sent))"
 ]
},
@@ -290,7 +294,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fns_df = ModelError.get_fns_dataframe(errors, entity=['GPE'])"
+"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"GPE\"])"
 ]
},
{
@@ -311,7 +315,7 @@
 "outputs": [],
 "source": [
 "print(\"All errors:\\n\")\n",
-"[print(error,\"\\n\") for error in errors]"
+"[print(error, \"\\n\") for error in errors]"
 ]
},
{
@@ -27,9 +27,9 @@
 "\n",
 "import pandas as pd\n",
 "\n",
-"pd.set_option('display.max_columns', None) \n",
-"pd.set_option('display.max_rows', None) \n",
-"pd.set_option('display.max_colwidth', None)\n",
+"pd.set_option(\"display.max_columns\", None)\n",
+"pd.set_option(\"display.max_rows\", None)\n",
+"pd.set_option(\"display.max_colwidth\", None)\n",
 "\n",
 "%reload_ext autoreload\n",
 "%autoreload 2"
@@ -51,7 +51,9 @@
 "outputs": [],
 "source": [
 "dataset_name = \"synth_dataset_v2.json\"\n",
-"dataset = InputSample.read_dataset_json(Path(Path.cwd().parent.parent, \"data\", dataset_name))\n",
+"dataset = InputSample.read_dataset_json(\n",
+" Path(Path.cwd().parent.parent, \"data\", dataset_name)\n",
+")\n",
 "print(len(dataset))"
 ]
},
@@ -82,12 +84,16 @@
 "print(dataset[1])\n",
 "\n",
 "print(\"\\nMin and max number of tokens in dataset:\")\n",
-"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
+"print(\n",
+" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
+")\n",
 "\n",
 "print(\"\\nMin and max sentence length in dataset:\")\n",
-"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
+"print(\n",
+" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
+")"
 ]
},
{
@@ -109,7 +115,13 @@
 "flair_ner_fast = \"ner-english-fast\"\n",
 "flair_ontonotes_fast = \"ner-english-ontonotes-fast\"\n",
 "flair_ontonotes_large = \"ner-english-ontonotes-large\"\n",
-"models = [flair_ner, flair_ner_fast, flair_ontonotes_fast ,flair_ner_fast, flair_ontonotes_large]"
+"models = [\n",
+" flair_ner,\n",
+" flair_ner_fast,\n",
+" flair_ontonotes_fast,\n",
+" flair_ner_fast,\n",
+" flair_ontonotes_large,\n",
+"]"
 ]
},
{
@@ -138,7 +150,7 @@
 " results = evaluator.calculate_score(evaluation_results)\n",
 "\n",
 " # update params tracking\n",
-" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
+" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
 " params.update(model.to_log())\n",
 " experiment.log_parameters(params)\n",
 " experiment.log_dataset_hash(dataset)\n",
@@ -171,8 +183,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"sent = 'I am taiwanese but I live in Cambodia.'\n",
-"#sent = input(\"Enter sentence: \")\n",
+"sent = \"I am taiwanese but I live in Cambodia.\"\n",
+"# sent = input(\"Enter sentence: \")\n",
 "model.predict(InputSample(full_text=sent))"
 ]
},
@@ -265,7 +277,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fns_df = ModelError.get_fns_dataframe(errors, entity=['GPE'])"
+"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"GPE\"])"
 ]
},
{
@@ -286,7 +298,7 @@
 "outputs": [],
 "source": [
 "print(\"All errors:\\n\")\n",
-"[print(error,\"\\n\") for error in errors]"
+"[print(error, \"\\n\") for error in errors]"
 ]
},
{
@@ -136,15 +136,13 @@
 " print(f\"Evaluating model {model_name}\")\n",
 "\n",
 " nlp = spacy.load(model_name)\n",
-" model = SpacyModel(\n",
-" model=nlp, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"]\n",
-" )\n",
+" model = SpacyModel(model=nlp, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"])\n",
 " evaluator = Evaluator(model=model)\n",
 " evaluation_results = evaluator.evaluate_all(deepcopy(dataset))\n",
 " results = evaluator.calculate_score(evaluation_results)\n",
 "\n",
 " # update params tracking\n",
-" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
+" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
 " params.update(model.to_log())\n",
 " experiment.log_parameters(params)\n",
 " experiment.log_dataset_hash(dataset)\n",
@@ -200,8 +198,8 @@
 },
 "outputs": [],
 "source": [
-"sent = 'I am taiwanese but I live in Cambodia.'\n",
-"#sent = input(\"Enter sentence: \")\n",
+"sent = \"I am taiwanese but I live in Cambodia.\"\n",
+"# sent = input(\"Enter sentence: \")\n",
 "model.predict(InputSample(full_text=sent))"
 ]
},
@@ -272,7 +270,6 @@
 },
 "outputs": [],
 "source": [
-"\n",
 "ModelError.most_common_fn_tokens(errors, n=50, entity=[\"PERSON\"])"
 ]
},
@@ -52,7 +52,9 @@
 "outputs": [],
 "source": [
 "dataset_name = \"synth_dataset_v2.json\"\n",
-"dataset = InputSample.read_dataset_json(Path(Path.cwd().parent.parent, \"data\", dataset_name))\n",
+"dataset = InputSample.read_dataset_json(\n",
+" Path(Path.cwd().parent.parent, \"data\", dataset_name)\n",
+")\n",
 "print(len(dataset))"
 ]
},
@@ -83,12 +85,16 @@
 "print(dataset[1])\n",
 "\n",
 "print(\"\\nMin and max number of tokens in dataset:\")\n",
-"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
+"print(\n",
+" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
+")\n",
 "\n",
 "print(\"\\nMin and max sentence length in dataset:\")\n",
-"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
-" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
+"print(\n",
+" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
+" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
+")"
 ]
},
{
@@ -128,14 +134,16 @@
 " experiment = get_experiment_tracker()\n",
 " print(\"-----------------------------------\")\n",
 " print(f\"Evaluating model {model_name}\")\n",
-" \n",
-" model = StanzaModel(model_name=model_name, entities_to_keep=['PERSON', 'GPE', 'ORG', 'NORP'])\n",
+"\n",
+" model = StanzaModel(\n",
+" model_name=model_name, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"]\n",
+" )\n",
 " evaluator = Evaluator(model=model)\n",
 " evaluation_results = evaluator.evaluate_all(deepcopy(dataset))\n",
 " results = evaluator.calculate_score(evaluation_results)\n",
 "\n",
 " # update params tracking\n",
-" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
+" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
 " params.update(model.to_log())\n",
 " experiment.log_parameters(params)\n",
 " experiment.log_dataset_hash(dataset)\n",
@@ -168,8 +176,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"sent = 'I am taiwanese but I live in Cambodia.'\n",
-"#sent = input(\"Enter sentence: \")\n",
+"sent = \"I am taiwanese but I live in Cambodia.\"\n",
+"# sent = input(\"Enter sentence: \")\n",
 "model.predict(InputSample(full_text=sent))"
 ]
},
@@ -243,7 +251,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fns_df = ModelError.get_fns_dataframe(errors=results.model_errors,entity=['GPE'])"
+"fns_df = ModelError.get_fns_dataframe(errors=results.model_errors, entity=[\"GPE\"])"
 ]
},
{
@@ -264,7 +272,7 @@
 "outputs": [],
 "source": [
 "print(\"All errors:\\n\")\n",
-"[print(error,\"\\n\") for error in results.model_errors]"
+"[print(error, \"\\n\") for error in results.model_errors]"
 ]
},
{
@@ -9,22 +9,30 @@ for model training and evaluation.

 There are two main scenarios for using the Presidio Data Generator:

-1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates (see [this file](raw_data/templates.txt) for example)
+1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates
+   (see [this file](raw_data/templates.txt) for example)
 2. Augment an existing labeled dataset with additional fake values.

-In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates, and then scenario 1 is applied.
+In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates,
+and then scenario 1 is applied.

 ## Process

 This generator heavily relies on the [Faker package](https://www.github.com/joke2k/faker) with a few differences:

-1. `PresidioDataGenerator` returns not only fake text, but all the spans in which fake entities appear in the text
+1. `PresidioDataGenerator` returns not only fake text, but also the spans in which fake entities appear in the text.

-2. Faker samples each value independently. In many cases we would want to keep the semantic dependency between two values. For example, for the template `My name is {{name}} and my email is {{email}}`, we would prefer a result which has the name within the email address, such as `My name is Mike and my email is mike1243@gmail.com`. For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented. It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record (if possible).
+2. `Faker` samples each value independently.
+   In many cases we would want to keep the semantic dependency between two values.
+   For example, for the template `My name is {{name}} and my email is {{email}}`,
+   we would prefer a result which has the name within the email address,
+   such as `My name is Mike and my email is mike1243@gmail.com`.
+   For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented.
+   It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record (if possible).
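For reference, the record-level behavior described in point 2 can be sketched like this; the `RecordsFaker` constructor signature comes from `records_faker.py` in this commit, while the `locale` argument and the sampling behavior noted in the comments are assumptions.

```python
import pandas as pd

from presidio_evaluator.data_generator.faker_extensions import RecordsFaker

records = pd.DataFrame(
    [
        {"name": "Mike Smith", "email": "mike1243@gmail.com"},
        {"name": "Dana Cohen", "email": "dana.cohen@example.org"},
    ]
)

# RecordsFaker(records, **kwargs) forwards extra kwargs to Faker.
fake = RecordsFaker(records, locale="en_US")

# Values drawn while filling one template favor the same record,
# keeping the generated name and email semantically consistent.
print(fake.name(), fake.email())
```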

 ## Example

-For a full example, see the [Generation Data Notebook](../../notebooks/1_Generate_data.ipynb).
+For a full example, see the [Generate Data Notebook](../../notebooks/1_Generate_data.ipynb).

 Simple example:

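The simple example itself sits outside this hunk; it looks roughly like the sketch below, based on the `PresidioDataGenerator` API in this repository (attribute names such as `.fake` and `.spans` are assumptions).

```python
from presidio_evaluator.data_generator import PresidioDataGenerator

sentence_templates = [
    "My name is {{name}}",
    "Please send it to {{address}}",
    "I just moved to {{city}} from {{country}}",
]

data_generator = PresidioDataGenerator()
fake_records = data_generator.generate_fake_data(
    templates=sentence_templates, n_samples=10
)

for record in fake_records:
    print(record.fake)   # the generated sentence
    print(record.spans)  # where each fake entity sits in the text
```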
@@ -57,44 +65,18 @@ The process in high level is the following:

 1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
    templates: `My name is John` -> `My name is [PERSON]`
-2. (Optional) adapt the FakeDataGenerator to support new extensions
-   which could generate fake PII entities
-3. Generate X samples using the templates list + a fake PII dataset +
-   extensions that add additional PII entities
+2. (Optional) add new Faker providers to the `PresidioDataGenerator` to support types of PII not returned by Faker
+3. Generate samples using the templates list
 4. Split the generated dataset to train/test/validation while making sure
    that samples from the same template would only appear in one set
 5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
 6. Train models
-7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
+7. Evaluate using one of the [evaluation notebooks](../../notebooks/models)

-Notes:
-
-- For steps 5, 6, 7 see the main [README](../../README.md).
-- For a simple data generation pipeline,
-  [see this notebook](../../notebooks/data%20generation/Generate%20data.ipynb).
-- For information on transforming a NER dataset into a templates,
-  see the notebooks in the [helper notebooks](../../notebooks/data%20generation) folder.
-
-Example run:
-
-```python
-from presidio_evaluator.data_generator import generate
-TEMPLATES_FILE = 'raw_data/templates.txt'
-OUTPUT = "generated_.txt"
-
-## Should be downloaded from FakeNameGenerator
-fake_pii_csv = 'raw_data/FakeNameGenerator.csv'
-
-examples = generate(fake_pii_csv=fake_pii_csv,
-                    utterances_file=TEMPLATES_FILE,
-                    dictionary_path=None,
-                    output_file=OUTPUT,
-                    lower_case_ratio=0.1,
-                    num_of_examples=100,
-                    ignore_types={"IP_ADDRESS", 'US_SSN', 'URL'},
-                    keep_only_tagged=False,
-                    span_to_tag=True)
-```

 *Copyright notice:*

@@ -9,7 +9,7 @@ from .providers import (
     IpAddressProvider,
     AddressProviderNew,
     PhoneNumberProviderNew,
-    AgeProvider
+    AgeProvider,
 )

 __all__ = [
@@ -24,5 +24,5 @@ __all__ = [
     "AddressProviderNew",
     "PhoneNumberProviderNew",
     "AgeProvider",
-    "RecordsFaker"
+    "RecordsFaker",
 ]
@@ -7,7 +7,6 @@ from presidio_evaluator.data_generator.faker_extensions import RecordGenerator


 class RecordsFaker(Faker):
-
     def __init__(self, records: Union[pd.DataFrame, List[Dict]], **kwargs):
         if isinstance(records, pd.DataFrame):
             records = records.to_dict(orient="records")
@@ -281,7 +281,7 @@ class PresidioDataGenerator:
                     "prefix_female",
                     "prefix_male",
                     "last_name_female",
-                    "last_name_male"
+                    "last_name_male",
                 ],
             ),
             axis=1,
@@ -5,7 +5,7 @@ from typing import List, Optional, Union, Dict, Any, Tuple
 import pandas as pd
 import spacy
 from spacy import Language
-from spacy.tokens import Token, Doc, DocBin
+from spacy.tokens import Doc, DocBin
 from spacy.training import iob_to_biluo
 from tqdm import tqdm

@@ -137,14 +137,14 @@ class InputSample(object):
         :param full_text: The raw text of this sample
         :param masked: Masked/Templated version of the raw text
         :param spans: List of spans for entities
-        :param create_tags_from_span: True if tags (tokens+taks) should be added
+        :param create_tags_from_span: True if tags (tokens+tags) should be added
         :param scheme: IO, BIO/IOB or BILOU. Only applicable if span_to_tag=True
         :param tokens: spaCy Doc object
         :param tags: list of strings representing the label for each token,
         given the scheme
         :param metadata: A dictionary of additional metadata on the sample,
         in the English (or other language) vocabulary
-        :param template_id: Original template (utterance) of sample, in case it was generated
+        :param template_id: Original template (utterance) of sample, in case it was generated  # noqa
         """
         if tags is None:
             tags = []
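For reference, the constructor documented above can be exercised as in the sketch below; the `Span` field names are assumptions based on `data_objects.py`.

```python
from presidio_evaluator import InputSample, Span

sample = InputSample(
    full_text="My name is Mike",
    masked="My name is [PERSON]",
    spans=[
        Span(
            entity_type="PERSON",
            entity_value="Mike",
            start_position=11,
            end_position=15,
        )
    ],
    create_tags_from_span=True,
    scheme="BILOU",
)
print(sample.tokens, sample.tags)
```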
@@ -532,7 +532,7 @@ class InputSample(object):
             span.entity_value = "O"

     @staticmethod
-    def create_flair_dataset(dataset):
+    def create_flair_dataset(dataset: List["InputSample"]) -> List[str]:
         flair_samples = []
         for sample in dataset:
             flair_samples.append(sample.to_flair())
@@ -143,7 +143,9 @@ class Evaluator:
     def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
         evaluation_results = []
         if self.model.entity_mapping:
-            print(f"Mapping entity values using this dictionary: {self.model.entity_mapping}")
+            print(
+                f"Mapping entity values using this dictionary: {self.model.entity_mapping}"
+            )
         for sample in tqdm(dataset, desc=f"Evaluating {self.model.__class__}"):

             # Align tag values to the ones expected by the model
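For reference, the `entity_mapping` consulted here is a plain `Dict[str, str]` that aligns the dataset's entity names with the names the wrapped model emits; one plausible shape (illustrative, not taken from this commit):

```python
# Map dataset entity names to the entity names a spaCy-style model produces.
entity_mapping = {"PERSON": "PERSON", "LOCATION": "GPE", "NORP": "NORP", "ORG": "ORG"}
```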
@@ -40,7 +40,7 @@ class ExperimentTracker:
         labels=List[str],
     ):
         self.confusion_matrix = matrix
-        self.labels=labels
+        self.labels = labels

     def start(self):
         pass
@@ -1,3 +1,4 @@
+"""Helper scripts for calling different NER models."""
 from .base_model import BaseModel
 from .crf_model import CRFModel
 from .presidio_analyzer_wrapper import PresidioAnalyzerWrapper
@@ -16,7 +16,7 @@ class CRFModel(BaseModel):
         super().__init__(
             entities_to_keep=entities_to_keep,
             verbose=verbose,
-            entity_mapping=entity_mapping
+            entity_mapping=entity_mapping,
         )

         if model_pickle_path is None:
@@ -16,6 +16,15 @@ from presidio_evaluator.models import BaseModel


 class FlairModel(BaseModel):
+    """
+    Evaluator for Flair models
+    :param model: model of type SequenceTagger
+    :param model_path:
+    :param entities_to_keep:
+    :param verbose:
+    and model expected entity types
+    """
+
     def __init__(
         self,
         model=None,
@@ -24,14 +33,7 @@ class FlairModel(BaseModel):
         verbose: bool = False,
         entity_mapping: Dict[str, str] = PRESIDIO_SPACY_ENTITIES,
     ):
-        """
-        Evaluator for Flair models
-        :param model: model of type SequenceTagger
-        :param model_path:
-        :param entities_to_keep:
-        :param verbose:
-        and model expected entity types
-        """
-
         super().__init__(
             entities_to_keep=entities_to_keep,
             verbose=verbose,
@@ -1,5 +1,7 @@
 from typing import List
+
+import pandas as pd

 try:
     from flair.data import Corpus, Sentence
     from flair.datasets import ColumnCorpus
@@ -21,11 +23,20 @@ from os import path


 class FlairTrainer:
+    """
+    Helper class for training Flair models
+    """
+
     @staticmethod
-    def to_flair_row(text, pos, label):
+    def to_flair_row(text: str, pos: str, label: str) -> str:
+        """
+        Turn text, part of speech and label into one row.
+        :return: str
+        """
         return "{} {} {}".format(text, pos, label)

-    def to_flair(self, df, outfile="flair_train.txt"):
+    def to_flair(self, df: pd.DataFrame, outfile: str = "flair_train.txt") -> None:
+        """Translate a pd.DataFrame to a flair dataset."""
         sentence = 0
         flair = []
         for row in df.itertuples():
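`to_flair_row` simply joins the three columns that `read_corpus` later declares (`text`, `pos`, `ner`), producing one CoNLL-style token row:

```python
FlairTrainer.to_flair_row("Mike", "PROPN", "U-PERSON")  # -> "Mike PROPN U-PERSON"
```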
@@ -43,10 +54,19 @@ class FlairTrainer:
     def create_flair_corpus(
         self, train_samples_path, test_samples_path, val_samples_path
     ):
+        """
+        Create a flair Corpus object and save it to train, test, validation files.
+        :param train_samples_path: Path to train samples
+        :param test_samples_path: Path to test samples
+        :param val_samples_path: Path to validation samples
+        :return:
+        """
         if not path.exists("flair_train.txt"):
             train_samples = InputSample.read_dataset_json(train_samples_path)
             train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
-            print(f"Kept {len(train_tagged)} train samples after removal of non-tagged samples")
+            print(
+                f"Kept {len(train_tagged)} train samples after removal of non-tagged samples"
+            )
             train_data = InputSample.create_conll_dataset(train_tagged)
             self.to_flair(train_data, outfile="flair_train.txt")

@@ -61,7 +81,12 @@ class FlairTrainer:
             self.to_flair(val_data, outfile="flair_val.txt")

     @staticmethod
-    def read_corpus(data_folder):
+    def read_corpus(data_folder: str):
+        """
+        Read Flair Corpus object.
+        :param data_folder: Path with files
+        :return: Corpus object
+        """
         columns = {0: "text", 1: "pos", 2: "ner"}
         corpus = ColumnCorpus(
             data_folder,
@@ -73,7 +98,12 @@ class FlairTrainer:
         return corpus

     @staticmethod
-    def train(corpus):
+    def train(corpus: Corpus):
+        """
+        Train a Flair model
+        :param corpus: Corpus object
+        :return:
+        """
         print(corpus)

         # 2. what tag do we want to predict?
@@ -15,7 +15,7 @@ class PresidioAnalyzerWrapper(BaseModel):
         labeling_scheme: str = "BIO",
         score_threshold: float = 0.4,
         language: str = "en",
-        entity_mapping:Optional[Dict[str,str]]=None
+        entity_mapping: Optional[Dict[str, str]] = None,
     ):
         """
         Evaluation wrapper for the Presidio Analyzer
@@ -25,7 +25,7 @@ class PresidioAnalyzerWrapper(BaseModel):
             entities_to_keep=entities_to_keep,
             verbose=verbose,
             labeling_scheme=labeling_scheme,
-            entity_mapping=entity_mapping
+            entity_mapping=entity_mapping,
         )
         self.score_threshold = score_threshold
         self.language = language
@@ -9,6 +9,17 @@ from presidio_evaluator.span_to_tag import span_to_tag


 class PresidioRecognizerWrapper(BaseModel):
+    """
+    Class wrapper for one specific PII recognizer
+    To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
+    :param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
+    :param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
+    :param entities_to_keep: List of entity types to focus on while ignoring all the rest.
+    Default=None would look at all entity types
+    :param with_nlp_artifacts: Whether NLP artifacts should be obtained
+    (faster if not, but some recognizers need it)
+    """
+
     def __init__(
         self,
         recognizer: EntityRecognizer,
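For reference, wrapping a single recognizer might look like the sketch below; the `presidio-analyzer` imports are assumptions based on that package's public API, and `PresidioRecognizerWrapper`'s parameters follow the docstring added above.

```python
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.models import PresidioRecognizerWrapper

wrapper = PresidioRecognizerWrapper(
    recognizer=CreditCardRecognizer(),
    nlp_engine=SpacyNlpEngine(),
    entities_to_keep=["CREDIT_CARD"],
)
# The wrapper can then be handed to Evaluator(model=wrapper) like any model.
```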
@@ -19,21 +30,12 @@ class PresidioRecognizerWrapper(BaseModel):
         entity_mapping: Optional[Dict[str, str]] = None,
         verbose: bool = False,
     ):
-        """
-        Evaluator for one specific PII recognizer
-        To evaluate the entire set of recognizers, refer to PresidioAnaylzerWrapper
-        :param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
-        :param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
-        :param entities_to_keep: List of entity types to focus on while ignoring all the rest.
-        Default=None would look at all entity types
-        :param with_nlp_artifacts: Whether NLP artifacts should be obtained
-        (faster if not, but some recognizers need it)
-        """
-
         super().__init__(
             entities_to_keep=entities_to_keep,
             verbose=verbose,
             labeling_scheme=labeling_scheme,
-            entity_mapping=entity_mapping
+            entity_mapping=entity_mapping,
         )
         self.with_nlp_artifacts = with_nlp_artifacts
         self.recognizer = recognizer
@@ -21,7 +21,7 @@ class SpacyModel(BaseModel):
             entities_to_keep=entities_to_keep,
             verbose=verbose,
             labeling_scheme=labeling_scheme,
-            entity_mapping=entity_mapping
+            entity_mapping=entity_mapping,
         )

         if model is None:
@@ -32,14 +32,19 @@ class SpacyModel(BaseModel):
         self.model = model

     def predict(self, sample: InputSample) -> List[str]:
+        """
+        Predict a list of tags for an input sample.
+        :param sample: InputSample
+        :return: list of tags
+        """
         doc = self.model(sample.full_text)
-        tags = self.get_tags_from_doc(doc)
+        tags = self._get_tags_from_doc(doc)
         if len(doc) != len(sample.tokens):
             print("mismatch between input tokens and new tokens")

         return tags

     @staticmethod
-    def get_tags_from_doc(doc):
+    def _get_tags_from_doc(doc):
         tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
         return tags
@@ -14,6 +14,17 @@ from presidio_evaluator.models import SpacyModel


 class StanzaModel(SpacyModel):
+    """
+    Class wrapping Stanza models, using spacy_stanza.
+
+    :param model: spaCy Language object representing a stanza model
+    :param model_name: Name of model, e.g. "en"
+    :param entities_to_keep: List of entities to predict on
+    :param verbose: Whether to print more
+    :param labeling_scheme: Whether to return IO, BIO or BILUO tags
+    :param entity_mapping: Mapping between input dataset entities and entities expected by the model
+    """
+
     def __init__(
         self,
         model: spacy.language.Language = None,
@@ -23,6 +34,7 @@ class StanzaModel(SpacyModel):
         labeling_scheme: str = "BIO",
         entity_mapping: Optional[Dict[str, str]] = PRESIDIO_SPACY_ENTITIES,
     ):
+
         if not model and not model_name:
             raise ValueError("Either model_name or model object must be supplied")
         if not model:
@@ -40,6 +52,12 @@ class StanzaModel(SpacyModel):
         )

     def predict(self, sample: InputSample) -> List[str]:
+        """
+        Predict the tags using a stanza model.
+
+        :param sample: InputSample with text
+        :return: list of tags
+        """

         doc = self.model(sample.full_text)
         if doc.ents:
@@ -48,7 +66,8 @@ class StanzaModel(SpacyModel):
             )

         # Stanza tokens might not be consistent with spaCy's tokens.
-        # Use spacy tokenization and not stanza to maintain consistency with other models:
+        # Use spacy tokenization and not stanza
+        # to maintain consistency with other models:
         if not sample.tokens:
             sample.tokens = tokenize(sample.full_text)
setup.py (6 changed lines)

@@ -20,12 +20,12 @@ setup(
     packages=find_packages(exclude=["tests"]),
     url="https://www.github.com/microsoft/presidio-research",
     license="MIT",
-    description="PII dataset generator, model evaluator for Presidio and PII data in general",
+    description="PII dataset generator, model evaluator for Presidio and PII data in general",  # noqa
     data_files=[
         (
             "presidio_evaluator/data_generator/raw_data",
             [
-                "presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv",
+                "presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv",  # noqa
                 "presidio_evaluator/data_generator/raw_data/templates.txt",
                 "presidio_evaluator/data_generator/raw_data/organizations.csv",
                 "presidio_evaluator/data_generator/raw_data/nationalities.csv",
@@ -46,6 +46,6 @@ setup(
         "schwifty",
         "faker",
         "sklearn_crfsuite",
-        "python-dotenv"
+        "python-dotenv",
     ],
 )
@@ -69,7 +69,7 @@ def test_to_spacy_file_and_back(small_dataset):
         output_path="dataset.spacy",
         translate_tags=False,
         spacy_pipeline=spacy_pipeline,
-        alignment_mode = "strict"
+        alignment_mode="strict",
     )

     db = DocBin()

@@ -23,7 +23,6 @@ def test_flair_simple():
         os.path.join(dir_path, "data/generated_small.json")
     )

-
     flair_model = FlairModel(model_path="ner", entities_to_keep=["PERSON"])
     evaluator = Evaluator(model=flair_model)
     evaluation_results = evaluator.evaluate_all(input_samples)

@@ -55,10 +55,7 @@ cc_test_template_testdata = [

 # credit card recognizer tests on template-generates data
 @pytest.mark.parametrize(
-    "pii_csv, "
-    "utterances, "
-    "num_of_examples, "
-    "acceptance_threshold",
+    "pii_csv, " "utterances, " "num_of_examples, " "acceptance_threshold",
     [testcase.to_pytest_param() for testcase in cc_test_template_testdata],
 )
 def test_credit_card_recognizer_with_template(

@@ -77,7 +77,7 @@ rocket_test_template_testdata = [
         utterances="{}/data/rocket_example_sentences.txt",
         num_of_examples=100,
         acceptance_threshold=0,
-        max_mistakes_number=100
+        max_mistakes_number=100,
     ),
     PatternRecognizerTestCase(
         test_name="rocket-some-errors",