updates to presidio 2 and spacy 3

omri374 2021-04-26 12:40:05 +03:00
Parent e2528bdca7
Commit 83bb254b5d
55 changed files: 2354 additions and 2316 deletions

View file

@ -11,15 +11,15 @@ To install the package, clone the repo and install all dependencies, preferably
``` sh
# Create conda env (optional)
conda create --name presidio python=3.7
conda create --name presidio python=3.8
conda activate presidio
# Install package+dependencies
pip install -r requirements.txt
python setup.py install
# Optionally link in the local development copy of presidio-analyzer
pip install -e [path to presidio-analyzer]
# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg
# Verify installation
pytest
@ -58,7 +58,7 @@ In order to standardize the process, we use specific data objects that hold all
## 3. Recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [ModelEvaluator](presidio_evaluator/model_evaluator.py) class. It provides a structured way of evaluating models and recognizers.
The main logic lies in the [ModelEvaluator](presidio_evaluator/models/base_model.py) class. It provides a structured way of evaluating models and recognizers.
### Ready evaluators
@ -72,14 +72,14 @@ Allows you to evaluate an existing Presidio deployment through the API. [See thi
Allows you to evaluate the local Presidio-Analyzer package. Faster than the API option but requires you to have Presidio-Analyzer installed locally. [See this class for more information](presidio_evaluator/presidio_analyzer.py)
#### 3. One recognizer evaluator
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/presidio_recognizer_evaluator.py)
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/models/presidio_recognizer_wrapper.py)
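For the local analyzer option above, a minimal evaluation sketch could look like the following. It uses the `PresidioAnalyzerWrapper` and `Evaluator` classes introduced in this commit; the dataset path is a placeholder and the exact constructor and `calculate_score` signatures are assumptions, so check those classes for details.

```python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper

# Load a generated dataset (placeholder path)
dataset = read_synth_dataset("data/synth_dataset.txt")

# Wrap the locally installed presidio-analyzer and run a token-level evaluation
model = PresidioAnalyzerWrapper()
evaluator = Evaluator(model=model)
results = evaluator.evaluate_all(dataset)

# Aggregate per-entity precision/recall (signature and return type assumed)
score = evaluator.calculate_score(results)
score.print()
```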
## 4. Modeling
### Conditional Random Fields
To train a CRF on a new dataset, see [this notebook](notebooks/models/CRF.ipynb).
To evaluate a CRF model, see the the [same notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/crf_evaluator.py).
To evaluate a CRF model, see the [same notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
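For illustration only, a minimal CRF training sketch with `sklearn_crfsuite` might look as follows; the toy features and labels stand in for the output of the feature-extraction helpers shown in the notebook.

```python
import pickle

import sklearn_crfsuite

# Toy data: one sentence as a list of per-token feature dicts plus BIO labels.
# In practice, build features with the sent2features/sent2labels helpers from the CRF notebook.
X_train = [[{"word.lower()": "my"}, {"word.lower()": "name"}, {"word.lower()": "is"}, {"word.lower()": "john"}]]
y_train = [["O", "O", "O", "B-PERSON"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True
)
crf.fit(X_train, y_train)

# Persist the model so it can later be loaded for evaluation
with open("crf.pickle", "wb") as f:
    pickle.dump(crf, f)
```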
### spaCy based models
There are three ways of interacting with spaCy models:
@ -93,39 +93,6 @@ See [this notebook for creating spaCy datasets](notebooks/models/Create%20datase
#### Evaluate an existing trained model
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
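In outline, such an evaluation wraps a spaCy pipeline in a model class and passes it to the `Evaluator`. The sketch below assumes a spaCy wrapper class under `presidio_evaluator.models`; the class name `SpacyModel`, its constructor arguments, and the `calculate_score` signature are assumptions, so check the notebook for the exact imports.

```python
import spacy

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import SpacyModel  # class name is an assumption

dataset = read_synth_dataset("data/synth_dataset.txt")  # placeholder path
model = SpacyModel(model=spacy.load("en_core_web_lg"))  # constructor arguments are an assumption
evaluator = Evaluator(model=model)
results = evaluator.evaluate_all(dataset)
evaluator.calculate_score(results).print()  # signature assumed
```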
#### Train with pretrained embeddings
In order to train a new spaCy model from scratch with pretrained embeddings (FastText wiki news subword in this case), follow these three steps:
##### 1. Download FastText pretrained (sub) word embeddings
``` sh
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
```
##### 2. Init spaCy model with pre-trained embeddings
Using spaCy CLI:
``` sh
python -m spacy init-model en spacy_fasttext --vectors-loc wiki-news-300d-1M-subword.vec
```
##### 3. Train spaCy NER model
Using spaCy CLI:
``` sh
python -m spacy train en spacy_fasttext_100 train.json test.json --vectors spacy_fasttext --pipeline ner -n 100
```
#### Fine-tune an existing spaCy model
See [this code for retraining an existing spaCy model](models/spacy_retrain.py).
First, create train and test pickle files for your train and test sets (see [this notebook](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb) for more information). Then run a SpacyRetrainer:
```python
from models import SpacyRetrainer
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
experiment_name='new_spacy_experiment',
n_iter=500, dropout=0.1, aml_config=None)
spacy_retrainer.run()
```
### Flair based models
To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
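For example, a different stack could combine classic word vectors with contextual Flair embeddings. The identifiers below are standard Flair embedding names; this is only a sketch of what could be assigned to the `embeddings` object in `train`.

```python
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

# Alternative embedding stack: GloVe word vectors plus forward/backward contextual Flair embeddings
embeddings = StackedEmbeddings(
    embeddings=[
        WordEmbeddings("glove"),
        FlairEmbeddings("news-forward"),
        FlairEmbeddings("news-backward"),
    ]
)
```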

View file

@ -1,2 +1,2 @@
0.0
0.0.2

View file

@ -15,11 +15,12 @@ pool:
vmImage: 'ubuntu-latest'
strategy:
matrix:
Python36:
python.version: '3.6'
Python37:
python.version: '3.7'
Python38:
python.version: '3.8'
Python39:
python.version: '3.9'
steps:
- task: UsePythonVersion@0
inputs:

View file

@ -5481,7 +5481,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "SvenZimmer@fleckens.hu",
"start_position": 39,
"end_position": 61
@ -9288,7 +9288,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "EmilySanderson@jourrapide.com",
"start_position": 59,
"end_position": 88
@ -20492,7 +20492,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "NatalinaLucchese@superrito.com",
"start_position": 59,
"end_position": 89
@ -25723,7 +25723,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "HannaUkkonen@dayrep.com",
"start_position": 39,
"end_position": 62
@ -32783,7 +32783,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "yahyaeriksson@gustr.com",
"start_position": 23,
"end_position": 46
@ -40833,7 +40833,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "VictorAndreyev@cuvox.de",
"start_position": 23,
"end_position": 46
@ -44468,7 +44468,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "HarrisonBarnes@fleckens.hu",
"start_position": 59,
"end_position": 85
@ -49165,7 +49165,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "MathiasEJespersen@armyspy.com",
"start_position": 23,
"end_position": 52
@ -62644,7 +62644,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "ElishaFedorov@fleckens.hu",
"start_position": 39,
"end_position": 64
@ -68659,7 +68659,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "HartmannAntonsson@jourrapide.com",
"start_position": 59,
"end_position": 91
@ -72669,7 +72669,7 @@
"masked": null,
"spans": [
{
"entity_type": "EMAIL",
"entity_type": "EMAIL_ADDRESS",
"entity_value": "MakarMaslow@teleworm.us",
"start_position": 39,
"end_position": 62

View file

@ -1,10 +1,13 @@
from typing import List
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
try:
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
except ImportError:
print("Flair is not installed")
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset

View file

@ -1,206 +0,0 @@
import logging
import pickle
import random
import sys
from pathlib import Path
import spacy
from azureml.core import Workspace, Experiment
from spacy.util import minibatch, compounding
from presidio_evaluator import SpacyEvaluator, InputSample
logging.basicConfig(level=logging.INFO)
root = logging.getLogger()
root.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
root.addHandler(handler)
class SpacyRetrainer:
def __init__(self, original_model_name=None, experiment_name=None, n_iter=100, dropout=0.5,
aml_config='config.json', output_dir='../../model-outputs', train_pickle='../data/train.pickle',
test_pickle='../data/test.pickle'):
self.experiment_name = experiment_name
if aml_config:
self.ws = Workspace.from_config(aml_config)
self.experiment = Experiment(workspace=self.ws, name=experiment_name)
self.aml_run = self.experiment.start_logging()
self.has_aml = True
else:
self.has_aml = False
self.model = original_model_name
self.n_iter = n_iter
self.output_dir = output_dir
self.train_file = train_pickle
self.test_file = test_pickle
self.dropout = dropout
def run(self):
if self.has_aml:
self.aml_run.log("model", self.model)
self.aml_run.log("n_iter", self.n_iter)
self.aml_run.log("train_file", self.train_file)
self.aml_run.log("test_file", self.test_file)
self.aml_run.log("dropout rate", self.dropout)
model_path = self._train(self.model, self.output_dir, self.n_iter, self.train_file, self.experiment_name)
self._score_validate(model_path, self.test_file)
if self.has_aml:
self.aml_run.complete()
def print_scores(self, split, evaluation_result):
"""
Logs results into experiment run.
:param split: Name of this split. For ex 'train' or 'valid'
:param evaluation_result: EvaluationResult containing various metrics
:return: None. Writes to experiment runner and logs locally.
"""
logging.info('SPLIT: {0}. PII_precision: {1}, PII_recall: {2},'
'Person_precision: {3}, Person_recall: {4}'. \
format(split, evaluation_result.pii_precision, evaluation_result.pii_recall,
evaluation_result.entity_precision_dict['PERSON'],
evaluation_result.entity_recall_dict['PERSON']))
if self.has_aml:
self.aml_run.log('Precision', evaluation_result.pii_precision, split)
self.aml_run.log('Recall', evaluation_result.pii_recall, split)
@staticmethod
def _score(model, data):
"""
Score the model against the data
:param model: Trained model
:param data: Data split which is being scored.
:return: An EvaluationResult containing various metrics
"""
spacy_evaluator = SpacyEvaluator(model=model)
results = []
for text, ground_truth_annotations in data:
ground_truth_entities = ground_truth_annotations['entities']
input_sample = InputSample.from_spacy(text, ground_truth_entities)
results.append(spacy_evaluator.evaluate_sample(input_sample))
return spacy_evaluator.calculate_score(evaluation_results=results)
def _score_validate(self, model_path, test_data_file):
"""
Validation step for the model. Also prints the scores.
:param model_path: Path to trained model.
:param test_data_file: Data file which has the dataset for this split.
:return: None. Prints the scores.
"""
with open(test_data_file, 'rb') as f:
valid_data = pickle.load(f)
nlp = spacy.load(model_path)
self.print_scores('Valid', self._score(nlp, valid_data))
# @plac.annotations(
# model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
# output_dir=("Optional output directory", "option", "o", Path),
# n_iter=("Number of training iterations", "option", "n", int),
# train_file=("File containing pickled training Spacy NER formatted data", "option", "d", Path),
# test_file=("File containing pickled test Spacy NER formatted data", "option", "d", Path),
# exp_name=("Name of this experiment", "option", "e")
# )
def _train(self, model, output_dir, n_iter, train_file, exp_name):
"""Load the model, set up the pipeline and train the entity recognizer."""
nlp = self.load_or_create_empty_model(model)
if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
else:
ner = nlp.get_pipe("ner")
with open(train_file, 'rb') as f:
train_data = pickle.load(f)
# DEBUG
train_data = train_data[:50]
# add labels
for _, annotations in train_data:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes): # only train NER
# reset and initialize the weights randomly – but only if we're
# training a new model
if model is None:
nlp.begin_training()
for itn in range(n_iter):
random.shuffle(train_data)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, drop=self.dropout, losses=losses, )
logging.debug("Losses", losses)
if self.has_aml:
self.aml_run.log('Losses', losses['ner'])
self.print_scores('Itn {}'.format(itn), self._score(nlp, train_data))
self.print_scores('Train', self._score(nlp, train_data))
saved_model_path = self.save_model(exp_name, nlp, output_dir)
return saved_model_path
@staticmethod
def save_model(exp_name, model, output_dir):
"""
Saves model to disk for later use.
:param exp_name: Name of the running experiment. This is used as folder name for storing the model.
:param model: Model being saved
:param output_dir: Directory where to save the model.
:return: Full path to saved model.
"""
saved_model_path = Path(output_dir, exp_name)
if not saved_model_path.exists():
saved_model_path.mkdir(parents=True)
model.to_disk(saved_model_path)
logging.info("Saved model to {}".format(output_dir))
return saved_model_path
@staticmethod
def load_model(exp_name, model_dir):
"""
Loads a spacy model from disk
:param exp_name: Name of experiment under which the model was saved
:param model_dir: path to saved model
:return: spacy model
"""
saved_model_path = Path(model_dir, exp_name)
return spacy.load(saved_model_path)
@staticmethod
def load_or_create_empty_model(model=None):
"""
Loads a given model or creates a blank english model.
:param model: Optional Model to load.
:return: Loaded or blank model.
"""
if model:
nlp = spacy.load(model)
logging.debug("Loaded model {}".format(model))
else:
nlp = spacy.blank("en")
logging.debug("Created blank 'en' model")
return nlp
if __name__ == "__main__":
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
experiment_name='spacy_new_ontonotes28',
n_iter=500, dropout=0.5, aml_config=None)
spacy_retrainer.run()

View file

@ -1,6 +1,21 @@
from .span_to_tag import span_to_tag, tokenize
from .data_objects import Span, InputSample, EvaluationResult, ModelError
from .model_evaluator import ModelEvaluator
from .spacy_evaluator import SpacyEvaluator
from .presidio_api_evaluator import PresidioAPIEvaluator
from .presidio_analyzer_evaluator import PresidioAnalyzerEvaluator
from .data_objects import Span, InputSample
from .validation import (
split_dataset,
split_by_template,
get_samples_by_pattern,
group_by_template,
save_to_json,
)
__all__ = [
"span_to_tag",
"tokenize",
"Span",
"InputSample",
"split_dataset",
"split_by_template",
"get_samples_by_pattern",
"group_by_template",
"save_to_json",
]

View file

@ -1,97 +0,0 @@
import pickle
from typing import List
from presidio_evaluator import ModelEvaluator, InputSample
class CRFEvaluator(ModelEvaluator):
def __init__(self,
model_pickle_path: str = "../models/crf.pickle",
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True):
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
if model_pickle_path is None:
raise ValueError("model_pickle_path must be supplied")
with open(model_pickle_path, 'rb') as f:
self.model = pickle.load(f)
def predict(self, sample: InputSample) -> List[str]:
tags = CRFEvaluator.crf_predict(sample,self.model)
if len(tags) != len(sample.tokens):
print("mismatch between previous tokens and new tokens")
# translated_tags = sample.rename_from_spacy_tags(tags)
return tags
@staticmethod
def crf_predict(sample, model):
sample.translate_input_sample_tags()
conll = sample.to_conll(translate_tags=True)
sentence = [(di['text'], di['pos'], di['label']) for di in conll]
features = CRFEvaluator.sent2features(sentence)
return model.predict([features])[0]
@staticmethod
def word2features(sent, i):
word = sent[i][0]
postag = sent[i][1]
features = {
'bias': 1.0,
'word.lower()': word.lower(),
'word[-3:]': word[-3:],
'word[-2:]': word[-2:],
'word.isupper()': word.isupper(),
'word.istitle()': word.istitle(),
'word.isdigit()': word.isdigit(),
'postag': postag,
'postag[:2]': postag[:2],
}
if i > 0:
word1 = sent[i - 1][0]
postag1 = sent[i - 1][1]
features.update({
'-1:word.lower()': word1.lower(),
'-1:word.istitle()': word1.istitle(),
'-1:word.isupper()': word1.isupper(),
'-1:postag': postag1,
'-1:postag[:2]': postag1[:2],
})
else:
features['BOS'] = True
if i < len(sent) - 1:
word1 = sent[i + 1][0]
postag1 = sent[i + 1][1]
features.update({
'+1:word.lower()': word1.lower(),
'+1:word.istitle()': word1.istitle(),
'+1:word.isupper()': word1.isupper(),
'+1:postag': postag1,
'+1:postag[:2]': postag1[:2],
})
else:
features['EOS'] = True
return features
@staticmethod
def sent2features(sent):
return [CRFEvaluator.word2features(sent, i) for i in range(len(sent))]
@staticmethod
def sent2labels(sent):
return [label for token, postag, label in sent]
@staticmethod
def sent2tokens(sent):
return [token for token, postag, label in sent]

View file

@ -1,16 +1,27 @@
# PII dataset generator
This data generator takes a text file with templates (e.g. `my name is [PERSON]`) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
It also creates Spans (start and end of each entity), tokens (`spaCy` tokenizer) and tags in various schemas (BIO/IOB, IO, BILOU)
In addition it provides some off-the-shelf features on each token, like `pos`, `dep` and `is_in_vocabulary`
This data generator takes a text file with templates (e.g. `my name is [PERSON]`)
and creates a list of InputSamples which contain fake PII entities
instead of placeholders.
It also creates Spans (start and end of each entity), tokens (`spaCy` tokenizer),
and tags in various schemas (BIO/IOB, IO, BILOU).
In addition, it provides some off-the-shelf features for each token,
such as `pos`, `dep` and `is_in_vocabulary`.
The main class is `FakeDataGenerator` however the `main` module has two functions for creating and reading a fake dataset.
During the generation process, the tool either takes fake PII from a provided CSV with a known format, and/or from extension functions which can be found in the extensions.py file.
The main class is `FakeDataGenerator`; however, the `main` module has two functions
for creating and reading a fake dataset.
During the generation process, the tool takes fake PII from a provided CSV with
a known format and/or from extension functions, which can be found
in the extensions.py file.
At a high level, the process is as follows:
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) adapt the FakeDataGenerator to support new extensions which could generate fake PII entities
3. Generate X samples using the templates list + a fake PII dataset + extensions that add additional PII entities
4. Split the generated dataset to train/test/validation while making sure that samples from the same template would only appear in one set
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) adapt the FakeDataGenerator to support new extensions
which could generate fake PII entities
3. Generate X samples using the templates list + a fake PII dataset +
extensions that add additional PII entities
4. Split the generated dataset into train/test/validation while making sure
that samples from the same template only appear in one set (see the sketch after this list)
5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
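As a sketch of step 4, the `split_dataset` helper exported from `presidio_evaluator` can be used to split the generated samples into train/test/validation sets. The file path is a placeholder and the ratio argument is an assumption; check `presidio_evaluator/validation.py` for the exact signature.

```python
from presidio_evaluator import split_dataset
from presidio_evaluator.data_generator import read_synth_dataset

dataset = read_synth_dataset("generated_samples.txt")  # placeholder path

# Assumed signature: a list of train/test/validation ratios (see presidio_evaluator/validation.py)
train, test, validation = split_dataset(dataset, [0.7, 0.2, 0.1])
```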
@ -19,12 +30,15 @@ The process in high level is the following:
Notes:
- For steps 5, 6, 7 see the main [README](../../README.md).
- For a simple data generation pipeline, [see this notebook](../../notebooks/Generate data.ipynb).
- For information on transforming a NER dataset into a templates, see the notebooks in the [helper notebooks](helper%20notebooks) folder.
- For a simple data generation pipeline,
[see this notebook](../../notebooks/Generate data.ipynb).
- For information on transforming a NER dataset into templates,
see the notebooks in the [helper notebooks](helper%20notebooks) folder.
Example run:
```python
from presidio_evaluator.data_generator import generate
TEMPLATES_FILE = 'raw_data/templates.txt'
OUTPUT = "generated_.txt"
@ -45,4 +59,7 @@ examples = generate(fake_pii_csv=fake_pii_csv,
*Copyright notice:*
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
Fake Name Generator identities by the Fake Name Generator are licensed under a
Creative Commons Attribution-Share Alike 3.0 United States License.
Fake Name Generator and the Fake Name Generator logo
are trademarks of Corban Works, LLC.

View file

@ -1,8 +1,7 @@
import random
from typing import List, Optional
import re
from collections import Counter
from typing import List, Optional, Dict
import pandas as pd
from spacy.tokens import Token
@ -40,30 +39,30 @@ class FakeDataGenerator:
labeling_scheme="BILOU",
):
"""
Fake data generator.
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
e.g. "My name is [FIRST_NAME]"
:param fake_pii_df:
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
:param templates: A list of templates
with place holders for PII entities.
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
Note that in case you have multiple entities of the same type
in a template, you should put a number on the second. For example:
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
More than two are currently not supported but extending this
is straightforward.
:param lower_case_ratio: Percentage of names that should start
with lower case
:param include_metadata: Whether to include additional
information in the output
(e.g. NameSet from which the name was taken, gender, country etc.)
:param dictionary_path: A path to a csv containing a vocabulary of
a language, to check if a token exists in the vocabulary or not.
:param ignore_types: set of types to ignore
:param span_to_tag: whether to tokenize the generated samples or not
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
"""
Fake data generator.
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
e.g. "My name is [FIRST_NAME]"
:param fake_pii_df:
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
:param templates: A list of templates
with place holders for PII entities.
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
Note that in case you have multiple entities of the same type
in a template, you should put a number on the second. For example:
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
More than two are currently not supported but extending this
is straightforward.
:param lower_case_ratio: Percentage of names that should start
with lower case
:param include_metadata: Whether to include additional
information in the output
(e.g. NameSet from which the name was taken, gender, country etc.)
:param dictionary_path: A path to a csv containing a vocabulary of
a language, to check if a token exists in the vocabulary or not.
:param ignore_types: set of types to ignore
:param span_to_tag: whether to tokenize the generated samples or not
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
"""
if ignore_types is None:
ignore_types = {}
self.lower_case_ratio = lower_case_ratio
@ -110,7 +109,7 @@ class FakeDataGenerator:
"TelephoneNumber": "PHONE_NUMBER",
"CCNumber": "CREDIT_CARD",
"Birthday": "BIRTHDAY",
"EmailAddress": "EMAIL",
"EmailAddress": "EMAIL_ADDRESS",
"StreetAddress": "FULL_ADDRESS",
"Domain": "DOMAIN_NAME",
"NameSet": "NAMESET",
@ -143,9 +142,9 @@ class FakeDataGenerator:
) # replace previous country which has limited options
# Copied entities
if "DATE" not in self.ignore_types:
if "DATE_TIME" not in self.ignore_types:
if "BIRTHDAY" in df:
df["DATE"] = df["BIRTHDAY"]
df["DATE_TIME"] = df["BIRTHDAY"]
else:
print("DATE is taken from the BIRTHDAY column which is missing")
@ -165,7 +164,9 @@ class FakeDataGenerator:
if "TITLE" not in self.ignore_types:
print("Generating titles")
if "GENDER" not in df:
print("Cannot generate title without a GENDER column. Generating FEMALE_TITLE and MALE_TITLE")
print(
"Cannot generate title without a GENDER column. Generating FEMALE_TITLE and MALE_TITLE"
)
else:
df["TITLE"] = generate_titles(df["GENDER"])
df["FEMALE_TITLE"] = [generate_title("female") for _ in range(len(df))]
@ -275,7 +276,9 @@ class FakeDataGenerator:
return template, templates, entities_count
def sample_examples(self, count, genders:List[str]=None, namesets:List[str]=None):
def sample_examples(
self, count, genders: List[str] = None, namesets: List[str] = None
):
if self.fake_pii is None:
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
@ -305,9 +308,7 @@ class FakeDataGenerator:
values[h] = str(fake_pii_sample_duplicated[h])
else:
print(
"Warning: entity {} is in the templates but not in the PII dataset. Ignoring.".format(
h
)
f"Warning: entity {h} is in the templates but not in the PII dataset. Ignoring."
)
values[h] = ""
@ -335,7 +336,7 @@ class FakeDataGenerator:
yield input_sample
@staticmethod
def _consolidate_names(input_sample):
def _consolidate_names(input_sample: InputSample):
locations = ("LOCATION", "CITY", "STATE", "COUNTRY", "ADDRESS", "STREET")
names = ("FIRST_NAME", "LAST_NAME", "PERSON")
@ -353,7 +354,9 @@ class FakeDataGenerator:
input_sample.masked = masked
def _create_input_sample(self, original_sentence, values):
def _create_input_sample(
self, original_sentence: str, values: Dict[str, str]
) -> InputSample:
"""
Creates an InputSample out of a template sentence
and a dict of entity names and values
@ -417,7 +420,10 @@ class FakeDataGenerator:
# Not creating tokens here since we're consolidating names afterwards
return InputSample(
sentence, original_sentence, spans, create_tags_from_span=False
full_text=sentence,
spans=spans,
masked=original_sentence,
create_tags_from_span=False,
)
def _add_duplicated_entities(self, fake_pii_sample, entity_counts):

View file

@ -1,5 +1,6 @@
import datetime
import json
import warnings
import pandas as pd
@ -12,14 +13,16 @@ def read_utterances(utterances_file):
return f.readlines()
def generate(fake_pii_csv,
utterances_file,
output_file=None,
num_of_examples=1000,
dictionary_path=None,
store_masked_text=False,
keep_only_tagged=False,
**kwargs):
def generate(
fake_pii_csv,
utterances_file,
output_file=None,
num_of_examples=1000,
dictionary_path=None,
store_masked_text=False,
keep_only_tagged=False,
**kwargs
):
"""
:param fake_pii_csv: csv containing fake PII
@ -34,18 +37,18 @@ def generate(fake_pii_csv,
"""
if not output_file:
raise ValueError("Please provide an output file path")
warnings.warn("Warning: no output_file value provided.")
templates = read_utterances(utterances_file)
if keep_only_tagged:
templates = [template for template in templates if "[" in template]
df = pd.read_csv(fake_pii_csv, encoding='utf-8')
df = pd.read_csv(fake_pii_csv, encoding="utf-8")
generator = FakeDataGenerator(fake_pii_df=df,
dictionary_path=dictionary_path,
templates=templates, **kwargs)
generator = FakeDataGenerator(
fake_pii_df=df, dictionary_path=dictionary_path, templates=templates, **kwargs
)
counter = 0
examples = []
@ -56,7 +59,7 @@ def generate(fake_pii_csv,
examples_json = [example.to_dict() for example in examples]
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
with open("{}".format(output_file), "w+", encoding="utf-8") as f:
json.dump(examples_json, f, ensure_ascii=False, indent=4)
print("generated {} examples".format(len(examples)))
@ -67,6 +70,7 @@ def generate(fake_pii_csv,
def read_synth_dataset(filepath=None, length=None):
import json
with open(filepath, "r", encoding="utf-8") as f:
dataset = json.load(f)
@ -84,28 +88,32 @@ if __name__ == "__main__":
EXAMPLES = 30
PII_FILE_SIZE = 3000
SPAN_TO_TAG = True
TEMPLATES_FILE = 'raw_data/templates.txt'
TEMPLATES_FILE = "raw_data/templates.txt"
KEEP_ONLY_TAGGED = False
LOWER_CASE_RATIO = 0.1
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}
IGNORE_TYPES = {"IP_ADDRESS", "US_SSN", "URL"}
cur_time = datetime.date.today().strftime("%B %d %Y")
OUTPUT = "generated_size_{}_date_{}.txt".format(EXAMPLES, cur_time)
fake_pii_csv = '../../presidio_evaluator/data_generator/' \
'raw_data/FakeNameGenerator.com_{}.csv'.format(PII_FILE_SIZE)
fake_pii_csv = (
"../../presidio_evaluator/data_generator/"
"raw_data/FakeNameGenerator.com_{}.csv".format(PII_FILE_SIZE)
)
utterances_file = TEMPLATES_FILE
dictionary_path = None
examples = generate(fake_pii_csv=fake_pii_csv,
utterances_file=utterances_file,
dictionary_path=dictionary_path,
output_file=OUTPUT,
lower_case_ratio=LOWER_CASE_RATIO,
num_of_examples=EXAMPLES,
ignore_types=IGNORE_TYPES,
keep_only_tagged=KEEP_ONLY_TAGGED,
span_to_tag=SPAN_TO_TAG)
examples = generate(
fake_pii_csv=fake_pii_csv,
utterances_file=utterances_file,
dictionary_path=dictionary_path,
output_file=OUTPUT,
lower_case_ratio=LOWER_CASE_RATIO,
num_of_examples=EXAMPLES,
ignore_types=IGNORE_TYPES,
keep_only_tagged=KEEP_ONLY_TAGGED,
span_to_tag=SPAN_TO_TAG,
)
# sanity
input_samples = read_synth_dataset(OUTPUT)

View file

@ -15,24 +15,34 @@ class NationalityGenerator:
def get_country(self):
## [COUNTRY]
return NationalityGenerator.capitalizeWords(random.choice(self.df['country'].values))
return NationalityGenerator.capitalizeWords(
random.choice(self.df["country"].values)
)
def get_nationality(self):
## [NATIONALITY]
return NationalityGenerator.capitalizeWords(random.choice(self.df['nationality'].values))
return NationalityGenerator.capitalizeWords(
random.choice(self.df["nationality"].values)
)
def get_nation_woman(self):
## [NATION_WOMAN]
return NationalityGenerator.capitalizeWords(random.choice(self.df['woman'].values))
return NationalityGenerator.capitalizeWords(
random.choice(self.df["woman"].values)
)
def get_nation_man(self):
## [NATION_MAN]
return NationalityGenerator.capitalizeWords(random.choice(self.df['man'].values))
return NationalityGenerator.capitalizeWords(
random.choice(self.df["man"].values)
)
def get_nation_plural(self):
## [NATION_PLURAL]
return NationalityGenerator.capitalizeWords(random.choice(self.df['plural'].values))
return NationalityGenerator.capitalizeWords(
random.choice(self.df["plural"].values)
)
@staticmethod
def capitalizeWords(s):
return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)
return re.sub(r"\w+", lambda m: m.group(0).capitalize(), s)

View file

@ -1,6 +1,7 @@
from typing import List, Set, Dict
from presidio_analyzer import RecognizerResult
from presidio_anonymizer import AnonymizerEngine
from presidio_evaluator.data_generator import FakeDataGenerator
@ -13,7 +14,6 @@ class PresidioPerturb(FakeDataGenerator):
fake_pii_df: pd.DataFrame,
lower_case_ratio: float = 0.0,
ignore_types: Set[str] = None,
entity_dict: Dict[str, str] = None,
):
super().__init__(
fake_pii_df=fake_pii_df,
@ -29,12 +29,9 @@ class PresidioPerturb(FakeDataGenerator):
:param lower_case_ratio: Percentage of names that should start
with lower case
:param ignore_types: set of types to ignore
:param entity_dict: Dictionary with mapping of entity names between Presidio and the fake_pii_df.
For example, {"EMAIL_ADDRESS": "EMAIL"}
"""
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
self.entity_dict = entity_dict
def perturb(
self,
@ -56,19 +53,14 @@ class PresidioPerturb(FakeDataGenerator):
presidio_response = sorted(presidio_response, key=lambda resp: resp.start)
delta = 0
text = original_text
for resp in presidio_response:
start = resp.start + delta
end = resp.end + delta
entity_text = original_text[start:end]
entity_type = resp.entity_type
if self.entity_dict:
if entity_type in self.entity_dict:
entity_type = self.entity_dict[entity_type]
anonymizer_engine = AnonymizerEngine()
anonymized_result = anonymizer_engine.anonymize(
text=original_text, analyzer_results=presidio_response
)
text = anonymized_result.text
text = text.replace(">", "}").replace("<", "{")
text = f"{text[:start]}{{{entity_type}}}{text[end:]}"
delta = len(entity_type) + 2 - len(entity_text)
self.templates = [text]
return [
sample.full_text

View file

@ -24,11 +24,11 @@ I will be travelling to [COUNTRY] next week, so I need my passport to be ready b
Who's coming to [COUNTRY] with me?
[COUNTRY] was super fun to visit!
Could you please email me the statement for laste month , my credit card number is [CREDIT_CARD]?
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL]?
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL_ADDRESS]?
How do I change my address to [ADDRESS] for post mail?
My name appears incorrectly on credit card statement could you please correct it to [TITLE] [PERSON]?
card number [CREDIT_CARD] is lost, can you please send a new one to [ADDRESS] i am in [CITY] for a business trip
Please transfer all funds from my account to this hackers' [EMAIL]
Please transfer all funds from my account to this hackers' [EMAIL_ADDRESS]
I can't browse to your site, keep getting address [IP_ADDRESS] blocked error
My religion does not allow speaking to bots, they are evil and hacked by the Devil
Excuse me, Sir bot, but I really don't like this tone
@ -47,7 +47,7 @@ I would like to remove my kid [FIRST_NAME] from the will. How do I do that?
The name in the account is not correct, please change it to [PERSON]
Hello I moved, please update my new address is [ADDRESS]
I need to add addresses, here they are: [ADDRESS], [ADDRESS]
Please send my portfolio to this email [EMAIL]
Please send my portfolio to this email [EMAIL_ADDRESS]
Hello, this is [TITLE] [PERSON]. Who are you?
I want to add [PERSON] as a beneficiary to my account
I want to cancel my card [CREDIT_CARD] because I lost it
@ -58,11 +58,11 @@ My nam is [FIRST_NAME]
I'm moving out of the country, so please cancel my subscription
My name is [PERSON] but everyone calls me [FIRST_NAME]
Please tell me your date of birth. It's [BIRTHDAY]
You said your email is [EMAIL]. Is that correct?
You said your email is [EMAIL_ADDRESS]. Is that correct?
I once lived in [ADDRESS]. I now live in [ADDRESS]
I'd like to order a taxi to [ADDRESS]
Please charge my credit card. Number is [CREDIT_CARD]
What's your email? [EMAIL]
What's your email? [EMAIL_ADDRESS]
What's your credit card? [CREDIT_CARD]
What's your name? [PERSON]
What's your last name? [LAST_NAME]

View file

@ -1,8 +1,9 @@
from typing import List, Counter, Dict
from typing import List, Optional
import spacy
import srsly
from spacy.tokens import Token
from spacy.training import docs_to_json
from tqdm import tqdm
from presidio_evaluator import span_to_tag, tokenize
@ -15,7 +16,7 @@ SPACY_PRESIDIO_ENTITIES = {
"FAC": "LOCATION",
"PERSON": "PERSON",
"LOCATION": "LOCATION",
"ORGANIZATION": "ORGANIZATION"
"ORGANIZATION": "ORGANIZATION",
}
PRESIDIO_SPACY_ENTITIES = {
"ORGANIZATION": "ORG",
@ -55,8 +56,10 @@ class Span:
"""
# if they do not overlap the intersection is 0
if self.end_position < other.start_position or other.end_position < \
self.start_position:
if (
self.end_position < other.start_position
or other.end_position < self.start_position
):
return 0
# if we are accounting for entity type a diff type means intersection 0
@ -65,25 +68,38 @@ class Span:
# otherwise the intersection is min(end) - max(start)
return min(self.end_position, other.end_position) - max(
self.start_position,
other.start_position)
self.start_position, other.start_position
)
def __repr__(self):
return "Type: {}, value: {}, start: {}, end: {}".format(
self.entity_type, self.entity_value, self.start_position,
self.end_position)
return (
f"Type: {self.entity_type}, "
f"value: {self.entity_value}, "
f"start: {self.start_position}, "
f"end: {self.end_position}"
)
def __eq__(self, other):
return self.entity_type == other.entity_type \
and self.entity_value == other.entity_value \
and self.start_position == other.start_position \
and self.end_position == other.end_position
return (
self.entity_type == other.entity_type
and self.entity_value == other.entity_value
and self.start_position == other.start_position
and self.end_position == other.end_position
)
def __hash__(self):
return hash(('entity_type', self.entity_type,
'entity_value', self.entity_value,
'start_position', self.start_position,
'end_position', self.end_position))
return hash(
(
"entity_type",
self.entity_type,
"entity_value",
self.entity_value,
"start_position",
self.start_position,
"end_position",
self.end_position,
)
)
@classmethod
def from_json(cls, data):
@ -108,12 +124,17 @@ class SimpleToken(object):
A class mimicking the Spacy Token class, for serialization purposes
"""
def __init__(self, text, idx, tag_=None,
pos_=None,
dep_=None,
lemma_=None,
spacy_extensions: SimpleSpacyExtensions = None,
**kwargs):
def __init__(
self,
text,
idx,
tag_=None,
pos_=None,
dep_=None,
lemma_=None,
spacy_extensions: SimpleSpacyExtensions = None,
**kwargs,
):
self.text = text
self.idx = idx
self.tag_ = tag_
@ -145,13 +166,15 @@ class SimpleToken(object):
else:
spacy_extensions = None
return cls(text=token.text,
idx=token.idx,
tag_=token.tag_,
pos_=token.pos_,
dep_=token.dep_,
lemma_=token.lemma_,
spacy_extensions=spacy_extensions)
return cls(
text=token.text,
idx=token.idx,
tag_=token.tag_,
pos_=token.pos_,
dep_=token.dep_,
lemma_=token.lemma_,
spacy_extensions=spacy_extensions,
)
def to_dict(self):
return {
@ -161,7 +184,7 @@ class SimpleToken(object):
"pos_": self.pos_,
"dep_": self.dep_,
"lemma_": self.lemma_,
"_": self._.to_dict()
"_": self._.to_dict(),
}
def __repr__(self):
@ -170,21 +193,27 @@ class SimpleToken(object):
@classmethod
def from_json(cls, data):
if '_' in data:
data['spacy_extensions'] = \
SimpleSpacyExtensions(**data['_'])
if "_" in data:
data["spacy_extensions"] = SimpleSpacyExtensions(**data["_"])
return cls(**data)
class InputSample(object):
def __init__(self, full_text: str, masked: str, spans: List[Span],
tokens=[], tags=[],
create_tags_from_span=True, scheme="IO", metadata=None, template_id=None):
def __init__(
self,
full_text: str,
spans: Optional[List[Span]] = None,
masked: Optional[str] = None,
tokens: Optional[List[SimpleToken]] = None,
tags: Optional[List[str]] = None,
create_tags_from_span=True,
scheme="IO",
metadata=None,
template_id=None,
):
"""
Holds all the information needed for evaluation in the
Hold all the information needed for evaluation in the
presidio-evaluator framework.
Can generate tags (BIO/BILOU/IO) based on spans
:param full_text: The raw text of this sample
:param masked: Masked version of the raw text (desired output)
@ -198,6 +227,10 @@ class InputSample(object):
in the English (or other language) vocabulary
:param template_id: Original template (utterance) of sample, in case it was generated
"""
if tags is None:
tags = []
if tokens is None:
tokens = []
self.full_text = full_text
self.masked = masked
self.spans = spans if spans else []
@ -218,11 +251,12 @@ class InputSample(object):
self.tags = tags
def __repr__(self):
return "Full text: {}\n" \
"Spans: {}\n" \
"Tokens: {}\n" \
"Tags: {}\n".format(self.full_text, self.spans, self.tokens,
self.tags)
return (
f"Full text: {self.full_text}\n"
f"Spans: {self.spans}\n"
f"Tokens: {self.tokens}\n"
f"Tags: {self.tags}\n"
)
def to_dict(self):
@ -230,20 +264,20 @@ class InputSample(object):
"full_text": self.full_text,
"masked": self.masked,
"spans": [span.__dict__ for span in self.spans],
"tokens": [SimpleToken.from_spacy_token(token).to_dict()
for token in self.tokens],
"tokens": [
SimpleToken.from_spacy_token(token).to_dict() for token in self.tokens
],
"tags": self.tags,
"template_id": self.template_id,
"metadata": self.metadata
"metadata": self.metadata,
}
@classmethod
def from_json(cls, data):
if 'spans' in data:
data['spans'] = [Span.from_json(span) for span in data['spans']]
if 'tokens' in data:
data['tokens'] = [SimpleToken.from_json(val) for val in
data['tokens']]
if "spans" in data:
data["spans"] = [Span.from_json(span) for span in data["spans"]]
if "tokens" in data:
data["tokens"] = [SimpleToken.from_json(val) for val in data["tokens"]]
return cls(**data, create_tags_from_span=False)
def get_tags(self, scheme="IOB"):
@ -252,33 +286,43 @@ class InputSample(object):
tags = [span.entity_type for span in self.spans]
tokens = tokenize(self.full_text)
labels = span_to_tag(scheme=scheme, text=self.full_text, tag=tags,
start=start_indices, end=end_indices,
tokens=tokens)
labels = span_to_tag(
scheme=scheme,
text=self.full_text,
tag=tags,
start=start_indices,
end=end_indices,
tokens=tokens,
)
return tokens, labels
def to_conll(self, translate_tags, scheme="BIO"):
def to_conll(self, translate_tags):
conll = []
for i, token in enumerate(self.tokens):
if translate_tags:
label = self.translate_tag(self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
label = self.translate_tag(
self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
)
else:
label = self.tags[i]
conll.append({"text": token.text,
"pos": token.pos_,
"tag": token.tag_,
"Template#": self.metadata['Template#'],
"gender": self.metadata['Gender'],
"country": self.metadata['Country'],
"label": label},
)
conll.append(
{
"text": token.text,
"pos": token.pos_,
"tag": token.tag_,
"Template#": self.metadata["Template#"],
"gender": self.metadata["Gender"],
"country": self.metadata["Country"],
"label": label,
},
)
return conll
def get_template_id(self):
return self.metadata['Template#']
return self.metadata["Template#"]
@staticmethod
def create_conll_dataset(dataset, translate_tags=True, to_bio=True):
@ -291,66 +335,76 @@ class InputSample(object):
sample.bilou_to_bio()
conll = sample.to_conll(translate_tags=translate_tags)
for token in conll:
token['sentence'] = i
token["sentence"] = i
conlls.append(token)
i += 1
return pd.DataFrame(conlls)
def to_spacy(self, entities=None, translate_tags=True):
entities = [(span.start_position, span.end_position, span.entity_type)
for span in self.spans if (entities is None) or (span.entity_type in entities)]
entities = [
(span.start_position, span.end_position, span.entity_type)
for span in self.spans
if (entities is None) or (span.entity_type in entities)
]
new_entities = []
if translate_tags:
for entity in entities:
new_tag = self.translate_tag(entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
new_tag = self.translate_tag(
entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
)
new_entities.append((entity[0], entity[1], new_tag))
else:
new_entities = entities
return (self.full_text,
{"entities": new_entities})
return self.full_text, {"entities": new_entities}
@classmethod
def from_spacy(cls, text, annotations, translate_from_spacy=True):
spans = []
for annotation in annotations:
tag = cls.rename_from_spacy_tags([annotation[2]])[0] if translate_from_spacy else annotation[2]
span = Span(tag, text[annotation[0]: annotation[1]], annotation[0], annotation[1])
tag = (
cls.rename_from_spacy_tags([annotation[2]])[0]
if translate_from_spacy
else annotation[2]
)
span = Span(
tag, text[annotation[0] : annotation[1]], annotation[0], annotation[1]
)
spans.append(span)
return cls(full_text=text, masked=None, spans=spans)
@staticmethod
def create_spacy_dataset(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
def create_spacy_dataset(
dataset, entities=None, sort_by_template_id=False, translate_tags=True
):
def template_sort(x):
return x.metadata['Template#']
return x.metadata["Template#"]
if sort_by_template_id:
dataset.sort(key=template_sort)
return [sample.to_spacy(entities=entities, translate_tags=translate_tags) for sample in dataset]
return [
sample.to_spacy(entities=entities, translate_tags=translate_tags)
for sample in dataset
]
def to_spacy_json(self, entities=None, translate_tags=True):
token_dicts = []
for i, token in enumerate(self.tokens):
if entities:
tag = self.tags[i] if self.tags[i][2:] in entities else 'O'
tag = self.tags[i] if self.tags[i][2:] in entities else "O"
else:
tag = self.tags[i]
if translate_tags:
tag = self.translate_tag(tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
token_dicts.append({
"orth": token.text,
"tag": token.tag_,
"ner": tag
})
tag = self.translate_tag(
tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
)
token_dicts.append({"orth": token.text, "tag": token.tag_, "ner": tag})
spacy_json_sentence = {
"raw": self.full_text,
"sentences": [{
"tokens": token_dicts
}
]
"sentences": [{"tokens": token_dicts}],
}
return spacy_json_sentence
@ -359,29 +413,37 @@ class InputSample(object):
doc = self.tokens
spacy_spans = []
for span in self.spans:
start_token = [token.i for token in self.tokens if token.idx == span.start_position][0]
end_token = [token.i for token in self.tokens if token.idx + len(token.text) == span.end_position][0] + 1
spacy_span = spacy.tokens.span.Span(doc, start=start_token, end=end_token,
label=span.entity_type)
start_token = [
token.i for token in self.tokens if token.idx == span.start_position
][0]
end_token = [
token.i
for token in self.tokens
if token.idx + len(token.text) == span.end_position
][0] + 1
spacy_span = spacy.tokens.span.Span(
doc, start=start_token, end=end_token, label=span.entity_type
)
spacy_spans.append(spacy_span)
doc.ents = spacy_spans
return doc
@staticmethod
def create_spacy_json(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
def create_spacy_json(
dataset, entities=None, sort_by_template_id=False, translate_tags=True
):
def template_sort(x):
return x.metadata['Template#']
return x.metadata["Template#"]
if sort_by_template_id:
dataset.sort(key=template_sort)
json_str = []
for i, sample in tqdm(enumerate(dataset)):
paragraph = sample.to_spacy_json(entities=entities, translate_tags=translate_tags)
json_str.append({
"id": i,
"paragraphs": [paragraph]
})
paragraph = sample.to_spacy_json(
entities=entities, translate_tags=translate_tags
)
json_str.append({"id": i, "paragraphs": [paragraph]})
return json_str
@ -402,10 +464,12 @@ class InputSample(object):
@staticmethod
def translate_tag(tag, dictionary, ignore_unknown):
has_prefix = len(tag) > 2 and tag[1] == '-'
has_prefix = len(tag) > 2 and tag[1] == "-"
no_prefix = tag[2:] if has_prefix else tag
if no_prefix in dictionary.keys():
return tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
return (
tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
)
else:
if ignore_unknown:
return "O"
@ -416,41 +480,48 @@ class InputSample(object):
new_tags = []
for tag in self.tags:
new_tag = tag
has_prefix = len(tag) > 2 and tag[1] == '-'
has_prefix = len(tag) > 2 and tag[1] == "-"
if has_prefix:
if tag[0] == 'U':
new_tag = 'B' + tag[1:]
elif tag[0] == 'L':
new_tag = 'I' + tag[1:]
if tag[0] == "U":
new_tag = "B" + tag[1:]
elif tag[0] == "L":
new_tag = "I" + tag[1:]
new_tags.append(new_tag)
self.tags = new_tags
@staticmethod
def rename_from_spacy_tags(spacy_tags, ignore_unknown=False):
return InputSample.translate_tags(spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown)
return InputSample.translate_tags(
spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown
)
@staticmethod
def rename_to_spacy_tags(tags, ignore_unknown=True):
return InputSample.translate_tags(tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown)
return InputSample.translate_tags(
tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown
)
@staticmethod
def write_spacy_json_from_docs(dataset, filename="spacy_output.json"):
docs = [sample.to_spacy_doc() for sample in dataset]
srsly.write_json(filename, [spacy.gold.docs_to_json(docs)])
srsly.write_json(filename, [spacy.training.docs_to_json(docs)])
def to_flair(self):
for i, token in enumerate(self.tokens):
return "{} {} {}".format(token, token.pos_, self.tags[i])
return f"{token} {token.pos_} {self.tags[i]}"
def translate_input_sample_tags(self, dictionary=PRESIDIO_SPACY_ENTITIES, ignore_unknown=True):
self.tags = InputSample.translate_tags(self.tags, dictionary, ignore_unknown=ignore_unknown)
def translate_input_sample_tags(self, dictionary=None, ignore_unknown=True):
if dictionary is None:
dictionary = PRESIDIO_SPACY_ENTITIES
self.tags = InputSample.translate_tags(
self.tags, dictionary, ignore_unknown=ignore_unknown
)
for span in self.spans:
if span.entity_value in PRESIDIO_SPACY_ENTITIES:
span.entity_value = PRESIDIO_SPACY_ENTITIES[span.entity_value]
elif ignore_unknown:
span.entity_value = 'O'
span.entity_value = "O"
@staticmethod
def create_flair_dataset(dataset):
@ -459,83 +530,3 @@ class InputSample(object):
flair_samples.append(sample.to_flair())
return flair_samples
class ModelError:
def __init__(self, error_type, annotation, prediction, token, full_text, metadata):
"""
Holds information about an error a model made for analysis purposes
:param error_type: str, e.g. FP, FN, Person->Address etc.
:param annotation: ground truth value
:param prediction: predicted value
:param token: token in question
:param full_text: full input text
:param metadata: metadata on text from InputSample
"""
self.error_type = error_type
self.annotation = annotation
self.prediction = prediction
self.token = token
self.full_text = full_text
self.metadata = metadata
def __str__(self):
return "type: {}, " \
"Annotation = {}, " \
"prediction = {}, " \
"Token = {}, " \
"Full text = {}, " \
"Metadata = {}".format(self.error_type,
self.annotation,
self.prediction,
self.token,
self.full_text,
self.metadata)
def __repr__(self):
return r"<ModelError {{0}}>".format(self.__str__())
class EvaluationResult(object):
def __init__(self, results: Counter, model_errors: List[ModelError], text: str = None):
"""
Holds the output of a comparison between ground truth and predicted
:param results: List of objects of type Counter
with structure {(actual, predicted) : count}
:param model_errors: List of ModelError
:param text: sample's full text (if used for one sample)
:type results: Counter
:type model_errors : List[ModelError]
:type text: object
"""
self.results = results
self.model_errors = model_errors
self.text = text
self.pii_recall = None
self.pii_precision = None
self.pii_f = None
self.entity_recall_dict = None
self.entity_precision_dict = None
def print(self):
recall_dict = self.entity_recall_dict
precision_dict = self.entity_precision_dict
recall_dict["PII"] = self.pii_recall
precision_dict["PII"] = self.pii_precision
entities = recall_dict.keys()
recall = recall_dict.values()
precision = precision_dict.values()
row_format = "{:>30}{:>30.2%}{:>30.2%}"
header_format = "{:>30}" * 3
print(header_format.format(*("Entity", "Precision", "Recall")))
for entity, precision, recall in zip(entities, precision, recall):
print(row_format.format(entity, precision, recall))
print("PII F measure: {}".format(self.pii_f))

View file

@ -0,0 +1,4 @@
from .dataset_formatter import DatasetFormatter
from .conll_formatter import CONLL2003Formatter
__all__ = ["DatasetFormatter", "CONLL2003Formatter"]

View file

@ -0,0 +1,62 @@
from pathlib import Path
from typing import List, Optional
import requests
from spacy.training import converters
from presidio_evaluator import InputSample
from presidio_evaluator.dataset_formatters import DatasetFormatter
class CONLL2003Formatter(DatasetFormatter):
def __init__(
self,
files_path=Path("../data/conll2003").resolve(),
glob_pattern: str = "*.iob",
):
self.files_path = files_path
self.glob_pattern = glob_pattern
@staticmethod
def download(
local_data_path=Path("../data/conll2003").resolve(),
conll_gh_path="https://raw.githubusercontent.com/glample/tagger/master/dataset/",
):
for fold in ("eng.train", "eng.testa", "eng.testb"):
fold_path = conll_gh_path + fold
if not local_data_path.exists():
local_data_path.mkdir(parents=True)
dataset_file = Path(local_data_path, fold)
if dataset_file.exists():
print("Dataset already exists, skipping download")
return
response = requests.get(fold_path)
dataset_raw = response.text
with open(dataset_file, "w") as f:
f.write(dataset_raw)
print(f"Finished writing fold {fold} to {local_data_path}")
print("Finished downloading CoNNL2003")
def to_input_samples(self, fold: Optional[str] = None) -> List[InputSample]:
files_found = False
for i, file_path in enumerate(self.files_path.glob(self.glob_pattern)):
if fold and fold not in file_path.name:
continue
files_found = True
with open(file_path, "r", encoding="utf-8") as file:
text = file.readlines()
text = "".join(text)
output_docs = converters.conll_ner2json(
input_data=text, n_sents=None, no_print=True
)
# TODO: Translate to InputSample
if not files_found:
raise FileNotFoundError(f"No files found for pattern {self.glob_pattern}")

View file

@ -0,0 +1,14 @@
from abc import ABC, abstractmethod
from typing import List
from presidio_evaluator import InputSample
class DatasetFormatter(ABC):
@abstractmethod
def to_input_samples(self) -> List[InputSample]:
"""
Translate a dataset structure into a list of documents, to be used by models and for evaluation
:return:
"""
pass

View file

@ -0,0 +1,5 @@
from .model_error import ModelError
from .evaluation_result import EvaluationResult
from .evaluator import Evaluator
__all__ = ["ModelError", "EvaluationResult", "Evaluator"]

View file

@ -0,0 +1,51 @@
from collections import Counter
from typing import List, Optional
from presidio_evaluator.evaluation import ModelError
class EvaluationResult(object):
def __init__(
self,
results: Counter,
model_errors: Optional[List[ModelError]] = None,
text: str = None,
):
"""
Holds the output of a comparison between ground truth and predicted
:param results: List of objects of type Counter
with structure {(actual, predicted) : count}
:param model_errors: List of specific model errors for further inspection
:param text: sample's full text (if used for one sample)
"""
self.results = results
self.model_errors = model_errors
self.text = text
self.pii_recall = None
self.pii_precision = None
self.pii_f = None
self.entity_recall_dict = None
self.entity_precision_dict = None
def print(self):
recall_dict = self.entity_recall_dict
precision_dict = self.entity_precision_dict
recall_dict["PII"] = self.pii_recall
precision_dict["PII"] = self.pii_precision
entities = recall_dict.keys()
recall = recall_dict.values()
precision = precision_dict.values()
row_format = "{:>30}{:>30.2%}{:>30.2%}"
header_format = "{:>30}" * 3
print(header_format.format(*("Entity", "Precision", "Recall")))
for entity, precision, recall in zip(entities, precision, recall):
print(row_format.format(entity, precision, recall))
print("PII F measure: {}".format(self.pii_f))
def __repr__(self):
return f"stats={self.results}"

View file

@ -0,0 +1,318 @@
from collections import Counter
from typing import List, Optional, Dict
import numpy as np
from tqdm import tqdm
from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import EvaluationResult, ModelError
from presidio_evaluator.models import BaseModel, PresidioAnalyzerWrapper
class Evaluator:
def __init__(
self,
model: BaseModel,
verbose: bool = False,
compare_by_io=True,
entities_to_keep: Optional[List[str]] = None,
):
"""
Evaluate a PII detection model or a Presidio analyzer / recognizer
:param model: Instance of a fitted model (of base type BaseModel)
:param compare_by_io: True if comparison should be done on the entity
level and not the sub-entity level
:param entities_to_keep: List of entity names to focus the evaluator on (and ignore the rest).
Default is None = all entities. If the provided model has a list of entities to keep,
this list would be used for evaluation.
"""
self.model = model
self.verbose = verbose
self.compare_by_io = compare_by_io
self.entities_to_keep = entities_to_keep
if self.entities_to_keep is None and self.model.entities:
self.entities_to_keep = self.model.entities
def compare(self, input_sample: InputSample, prediction: List[str]):
"""
Compares ground truth tags (annotation) and predicted (prediction)
:param input_sample: input sample containing list of tags with scheme self.labeling_scheme
:param prediction: predicted value for each token
"""
annotation = input_sample.tags
tokens = input_sample.tokens
if len(annotation) != len(prediction):
print(
"Annotation and prediction do not have the"
"same length. Sample={}".format(input_sample)
)
return Counter(), []
results = Counter()
mistakes = []
new_annotation = annotation.copy()
if self.compare_by_io:
new_annotation = self._to_io(new_annotation)
prediction = self._to_io(prediction)
# Ignore annotations that aren't in the list of
# requested entities.
if self.entities_to_keep:
prediction = self._adjust_per_entities(prediction)
new_annotation = self._adjust_per_entities(new_annotation)
for i in range(0, len(new_annotation)):
results[(new_annotation[i], prediction[i])] += 1
if self.verbose:
print("Annotation:", new_annotation[i])
print("Prediction:", prediction[i])
print(results)
# check if there was an error
is_error = new_annotation[i] != prediction[i]
if is_error:
if prediction[i] == "O":
mistakes.append(
ModelError(
"FN",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata,
)
)
elif new_annotation[i] == "O":
mistakes.append(
ModelError(
"FP",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata,
)
)
else:
mistakes.append(
ModelError(
"Wrong entity",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata,
)
)
return results, mistakes
def _adjust_per_entities(self, tags):
if self.entities_to_keep:
return [tag if tag in self.entities_to_keep else "O" for tag in tags]
return tags
@staticmethod
def _to_io(tags):
"""
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
['B-PERSON','I-PERSON','L-PERSON'] is translated into
['PERSON','PERSON','PERSON']
:param tags: the input tags in BILOU/IOB/BIO format
:return: a new list of IO tags
"""
return [tag[2:] if "-" in tag else tag for tag in tags]
def evaluate_sample(
self, sample: InputSample, prediction: List[str]
) -> EvaluationResult:
if self.verbose:
print("Input sentence: {}".format(sample.full_text))
results, mistakes = self.compare(input_sample=sample, prediction=prediction)
return EvaluationResult(results, mistakes, sample.full_text)
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
evaluation_results = []
for sample in tqdm(dataset, desc="Evaluating {}".format(self.__class__)):
prediction = self.model.predict(sample)
evaluation_result = self.evaluate_sample(
sample=sample, prediction=prediction
)
evaluation_results.append(evaluation_result)
return evaluation_results
@staticmethod
def align_input_samples_to_presidio_analyzer(
input_samples: List[InputSample],
entities_mapping: Dict[
str, str
] = PresidioAnalyzerWrapper.presidio_entities_map,
) -> List[InputSample]:
"""
Change input samples to conform with Presidio's entities
:return: new list of InputSample
"""
new_input_samples = input_samples.copy()
# A list that will contain updated input samples,
new_list = []
# Iterate on all samples
for input_sample in new_input_samples:
contains_presidio_field = False
new_spans = []
# Update spans to match Presidio's entity name
for span in input_sample.spans:
if span.entity_type in entities_mapping.keys():
new_name = entities_mapping.get(span.entity_type)
span.entity_type = new_name
contains_presidio_field = True
# Add to new span list, if the span contains an entity relevant to Presidio
new_spans.append(span)
input_sample.spans = new_spans
# Update tags in case this sample has relevant entities for evaluation
if contains_presidio_field:
for i, tag in enumerate(input_sample.tags):
has_prefix = "-" in tag
if has_prefix:
prefix = tag[:2]
clean = tag[2:]
else:
prefix = ""
clean = tag
if clean in entities_mapping.keys():
new_name = entities_mapping.get(clean)
input_sample.tags[i] = "{}{}".format(prefix, new_name)
else:
input_sample.tags[i] = "O"
new_list.append(input_sample)
return new_list
def calculate_score(
self,
evaluation_results: List[EvaluationResult],
entities: Optional[List[str]] = None,
beta: float = 1,
) -> EvaluationResult:
"""
Returns the pii_precision, pii_recall and f measure, both per entity
and for all PII entities combined
:param evaluation_results: List of EvaluationResult
:param entities: List of entities to calculate the score for. Default is None: all entities
:param beta: beta value for the F measure, giving more or less weight to recall vs. precision
:return: EvaluationResult with precision, recall and f measures
"""
# aggregate results
all_results = sum([er.results for er in evaluation_results], Counter())
# compute pii_recall per entity
entity_recall = {}
entity_precision = {}
if not entities:
entities = list(set([x[0] for x in all_results.keys() if x[0] != "O"]))
for entity in entities:
# all annotation of given type
annotated = sum([all_results[x] for x in all_results if x[0] == entity])
predicted = sum([all_results[x] for x in all_results if x[1] == entity])
tp = all_results[(entity, entity)]
if annotated > 0:
entity_recall[entity] = tp / annotated
else:
entity_recall[entity] = np.NaN
if predicted > 0:
per_entity_tp = all_results[(entity, entity)]
entity_precision[entity] = per_entity_tp / predicted
else:
entity_precision[entity] = np.NaN
# compute pii_precision and pii_recall
annotated_all = sum([all_results[x] for x in all_results if x[0] != "O"])
predicted_all = sum([all_results[x] for x in all_results if x[1] != "O"])
if annotated_all > 0:
pii_recall = (
sum(
[
all_results[x]
for x in all_results
if (x[0] != "O" and x[1] != "O")
]
)
/ annotated_all
)
else:
pii_recall = np.NaN
if predicted_all > 0:
pii_precision = (
sum(
[
all_results[x]
for x in all_results
if (x[0] != "O" and x[1] != "O")
]
)
/ predicted_all
)
else:
pii_precision = np.NaN
# compute pii_f_beta-score
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
# aggregate errors
errors = []
for res in evaluation_results:
if res.model_errors:
errors.extend(res.model_errors)
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
evaluation_result.pii_precision = pii_precision
evaluation_result.pii_recall = pii_recall
evaluation_result.entity_recall_dict = entity_recall
evaluation_result.entity_precision_dict = entity_precision
evaluation_result.pii_f = pii_f_beta
return evaluation_result
@staticmethod
def precision(tp: int, fp: int) -> float:
return tp / (tp + fp + 1e-100)
@staticmethod
def recall(tp: int, fn: int) -> float:
return tp / (tp + fn + 1e-100)
@staticmethod
def f_beta(precision: float, recall: float, beta: float) -> float:
"""
Returns the F score for precision, recall and a beta parameter
:param precision: a float with the precision value
:param recall: a float with the recall value
:param beta: a float with the beta parameter of the F measure,
which gives more or less weight to precision
vs. recall
:return: a float value of the f(beta) measure.
"""
if np.isnan(precision) or np.isnan(recall) or (precision == 0 and recall == 0):
return np.nan
return ((1 + beta ** 2) * precision * recall) / (
((beta ** 2) * precision) + recall
)
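
A minimal end-to-end sketch of the new evaluation flow, tying the classes above together (the dataset path and entity list are illustrative, not part of this commit):

``` python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper

# Hypothetical dataset path
dataset = read_synth_dataset("data/synth_dataset.txt")
dataset = Evaluator.align_input_samples_to_presidio_analyzer(dataset)

model = PresidioAnalyzerWrapper(entities_to_keep=["PERSON", "EMAIL_ADDRESS"])
evaluator = Evaluator(model=model)

evaluation_results = evaluator.evaluate_all(dataset)
score = evaluator.calculate_score(evaluation_results, beta=2)
score.print()
```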


@ -0,0 +1,174 @@
from typing import Dict, List
from presidio_evaluator.data_objects import SimpleToken
import pandas as pd
class ModelError:
def __init__(
self,
error_type: str,
annotation: str,
prediction: str,
token: SimpleToken,
full_text: str,
metadata: Dict,
):
"""
Holds information about an error a model made for analysis purposes
:param error_type: str, e.g. FP, FN, Person->Address etc.
:param annotation: ground truth value
:param prediction: predicted value
:param token: token in question
:param full_text: full input text
:param metadata: metadata on text from InputSample
"""
self.error_type = error_type
self.annotation = annotation
self.prediction = prediction
self.token = token
self.full_text = full_text
self.metadata = metadata
def __str__(self):
return (
"type: {}, "
"Annotation = {}, "
"prediction = {}, "
"Token = {}, "
"Full text = {}, "
"Metadata = {}".format(
self.error_type,
self.annotation,
self.prediction,
self.token,
self.full_text,
self.metadata,
)
)
def __repr__(self):
return r"<ModelError {{0}}>".format(self.__str__())
@staticmethod
def most_common_fp_tokens(errors: List["ModelError"], n: int = 10, entity=None):
"""
Print the n most common false positive tokens (tokens thought to be an entity)
"""
fps = ModelError.get_false_positives(errors, entity)
tokens = [err.token.text for err in fps]
from collections import Counter
by_frequency = Counter(tokens)
most_common = by_frequency.most_common(n)
print("Most common false positive tokens:")
print(most_common)
print("Example sentence with each FP token:")
for tok, val in most_common:
with_tok = [err for err in fps if err.token.text == tok]
print(with_tok[0].full_text)
@staticmethod
def most_common_fn_tokens(errors: List["ModelError"], n: int = 10, entity=None):
"""
Print the n most common tokens that were missed by the model, including an example of the full text in which they appear
"""
fns = ModelError.get_false_negatives(errors, entity)
fns_tokens = [err.token.text for err in fns]
from collections import Counter
by_frequency_fns = Counter(fns_tokens)
most_common_fns = by_frequency_fns.most_common(n)
print(most_common_fns)
for tok, val in most_common_fns:
with_tok = [err for err in fns if err.token.text == tok]
print(
"Token: {}, Annotation: {}, Full text: {}".format(
with_tok[0].token, with_tok[0].annotation, with_tok[0].full_text
)
)
@staticmethod
def get_errors_df(
errors=List["ModelError"], entity: List[str] = None, error_type: str = "FN"
):
"""
Get ModelErrors as pd.DataFrame
"""
if error_type == "FN":
filtered_errors = ModelError.get_false_negatives(errors, entity)
elif error_type == "FP":
filtered_errors = ModelError.get_false_positives(errors, entity)
else:
raise ValueError("error_type should be either FP or FN")
if len(filtered_errors) == 0:
print(
"No errors of type {} and entity {} were found".format(
error_type, entity
)
)
return None
errors_df = pd.DataFrame.from_records(
[error.__dict__ for error in filtered_errors]
)
metadata_df = pd.DataFrame(errors_df["metadata"].tolist())
errors_df.drop(["metadata"], axis=1, inplace=True)
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
return new_errors_df
@staticmethod
def get_fps_dataframe(errors: List["ModelError"], entity: str = None):
"""
Get false positive ModelErrors as pd.DataFrame
"""
return ModelError.get_errors_df(errors, entity, error_type="FP")
@staticmethod
def get_fns_dataframe(errors: List["ModelError"], entity: str = None):
"""
Get false negative ModelErrors as pd.DataFrame
"""
return ModelError.get_errors_df(errors, entity, error_type="FN")
@staticmethod
def get_false_positives(errors: List["ModelError"], entity=None):
"""
Get a list of all false positive errors in the results
"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [
model_error
for model_error in errors
if model_error.error_type == "FP" and model_error.prediction in entity
]
else:
return [
model_error for model_error in errors if model_error.error_type == "FP"
]
@staticmethod
def get_false_negatives(errors: List["ModelError"], entity=None):
"""
Get a list of all false negative errors in the results (false negatives and wrong entity detections)
"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [
model_error
for model_error in errors
if model_error.error_type != "FP" and model_error.annotation in entity
]
else:
return [
model_error for model_error in errors if model_error.error_type != "FP"
]
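
Continuing the sketch above, the ModelError helpers can then be used for error analysis on the errors aggregated by calculate_score (the entity name is just an example):

``` python
from presidio_evaluator.evaluation import ModelError

# `score` is the EvaluationResult returned by Evaluator.calculate_score above
errors = score.model_errors

# Most frequent tokens wrongly detected as PERSON, with an example sentence for each
ModelError.most_common_fp_tokens(errors, n=10, entity="PERSON")

# False negatives as a DataFrame (sample metadata is expanded into columns)
fns_df = ModelError.get_fns_dataframe(errors, entity="PERSON")
if fns_df is not None:
    print(fns_df.head())
```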


@ -0,0 +1,159 @@
"""E2E scoring pipelines for the different models"""
import math
from typing import List, Optional
from presidio_analyzer import EntityRecognizer
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import EvaluationResult, Evaluator
from presidio_evaluator.models import (
PresidioRecognizerWrapper,
PresidioAnalyzerWrapper,
BaseModel,
)
def score_model(
model: BaseModel,
entities_to_keep: List[str],
input_samples: List[InputSample],
verbose: bool = False,
beta: float = 2.5,
) -> EvaluationResult:
"""
Run data through a model and gather results and stats
"""
print("Evaluating samples")
evaluator = Evaluator(model=model, entities_to_keep=entities_to_keep)
evaluated_samples = evaluator.evaluate_all(input_samples)
print("Estimating metrics")
evaluation_result = evaluator.calculate_score(
evaluation_results=evaluated_samples, beta=beta
)
precision = evaluation_result.pii_precision
recall = evaluation_result.pii_recall
entity_recall = evaluation_result.entity_recall_dict
entity_precision = evaluation_result.entity_precision_dict
f = evaluation_result.pii_f
errors = evaluation_result.model_errors
#
print(f"precision: {precision}")
print(f"Recall: {recall}")
print(f"F {beta}: {f}")
print(f"Precision per entity: {entity_precision}")
print(f"Recall per entity: {entity_recall}")
if verbose:
false_negatives = [
str(mistake) for mistake in errors if mistake.error_type == "FN"
]
false_positives = [
str(mistake) for mistake in errors if mistake.error_type == "FP"
]
other_mistakes = [
str(mistake) for mistake in errors if mistake.error_type not in ["FN", "FP"]
]
print("False negatives: ")
print("\n".join(false_negatives))
print("\n******************\n")
print("False positives: ")
print("\n".join(false_positives))
print("\n******************\n")
print("Other mistakes: ")
print("\n".join(other_mistakes))
return evaluation_result
def score_presidio_recognizer(
recognizer: EntityRecognizer,
entities_to_keep: List[str],
input_samples: Optional[List[InputSample]] = None,
labeling_scheme: str = "BILUO",
with_nlp_artifacts: bool = False,
verbose: bool = False,
) -> EvaluationResult:
"""
Run data through one EntityRecognizer and gather results and stats
"""
if not input_samples:
print("Reading dataset")
input_samples = read_synth_dataset("../../data/synth_dataset.txt")
else:
input_samples = list(input_samples)
print("Preparing dataset by aligning entity names to Presidio's entity names")
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(input_samples)
model = PresidioRecognizerWrapper(
recognizer=recognizer,
entities_to_keep=entities_to_keep,
labeling_scheme=labeling_scheme,
nlp_engine=SpacyNlpEngine(),
with_nlp_artifacts=with_nlp_artifacts,
)
return score_model(
model=model,
entities_to_keep=entities_to_keep,
input_samples=updated_samples,
verbose=verbose,
)
def score_presidio_analyzer(
input_samples: Optional[List[InputSample]] = None,
entities_to_keep: Optional[List[str]] = None,
labeling_scheme: str = "BILUO",
verbose: bool = True,
) -> EvaluationResult:
""""""
if not input_samples:
print("Reading dataset")
input_samples = read_synth_dataset("../../data/synth_dataset.txt")
else:
input_samples = list(input_samples)
print("Preparing dataset by aligning entity names to Presidio's entity names")
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(input_samples)
flatten = lambda l: [item for sublist in l for item in sublist]
from collections import Counter
count_per_entity = Counter(
[
span.entity_type
for span in flatten(
[input_sample.spans for input_sample in updated_samples]
)
]
)
if verbose:
print("Count per entity:")
print(count_per_entity)
analyzer = PresidioAnalyzerWrapper(
entities_to_keep=entities_to_keep, labeling_scheme=labeling_scheme
)
return score_model(
model=analyzer,
entities_to_keep=list(count_per_entity.keys()),
input_samples=updated_samples,
verbose=verbose,
)
if __name__ == "__main__":
score_presidio_analyzer()
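
As an illustration, score_presidio_recognizer (defined above) could evaluate a single predefined recognizer; the dataset path below is hypothetical and the spaCy model used by SpacyNlpEngine is assumed to be installed:

``` python
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer
from presidio_evaluator.data_generator import read_synth_dataset

# Hypothetical dataset path; samples are aligned to Presidio entity names internally
samples = read_synth_dataset("data/synth_dataset.txt")

result = score_presidio_recognizer(
    recognizer=CreditCardRecognizer(),
    entities_to_keep=["CREDIT_CARD"],
    input_samples=samples,
)
result.print()
```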


@ -1,398 +0,0 @@
from abc import ABC, abstractmethod
from typing import List, Tuple, Dict
from collections import Counter
import numpy as np
import pandas as pd
from presidio_evaluator import InputSample, EvaluationResult, ModelError
from tqdm import tqdm
class ModelEvaluator(ABC):
def __init__(self, entities_to_keep: List[str] = None,
verbose: bool = False,
use_spans: bool = False, labeling_scheme="BIO",
compare_by_io=True):
"""
Abstract class for evaluating NER models and others
:param entities_to_keep: Which entities should be evaluated? All other
entities are ignored. If None, none are filtered
:param verbose: Whether to print more debug info
:param labeling_scheme: Type of scheme used for labeling (BILOU,
BIO/LOB or IO)
:param compare_by_io: True if comparison should be done on the entity
level and not the sub-entity level
"""
self.entities = entities_to_keep
self.verbose = verbose
self.use_spans = use_spans
self.compare_by_io = compare_by_io
self.labeling_scheme = labeling_scheme
@abstractmethod
def predict(self, sample: InputSample) -> List[str]:
"""
Abstract. Returns the predicted tokens/spans from the evaluated model
:param sample: Sample to be evaluated
:return: if self.use spans: list of spans
if not self.use_spans: tags in self.labeling_scheme format
"""
pass
def compare(self, input_sample: InputSample, prediction: List[str]):
"""
Compares gound truth tags (annotation) and predicted (prediction)
:param input_sample: input sample containing list of tags with scheme
:param prediction: predicted value for each token
self.labeling_scheme
"""
annotation = input_sample.tags
tokens = input_sample.tokens
if len(annotation) != len(prediction):
print("Annotation and prediction do not have the"
"same length. Sample={}".format(input_sample))
return Counter(), []
results = Counter()
mistakes = []
new_annotation = annotation.copy()
if self.compare_by_io:
new_annotation = self._to_io(new_annotation)
prediction = self._to_io(prediction)
# Ignore annotations that aren't in the list of
# requested entities.
if self.entities:
prediction = self._adjust_per_entities(prediction)
new_annotation = self._adjust_per_entities(new_annotation)
for i in range(0, len(new_annotation)):
results[(new_annotation[i], prediction[i])] += 1
if self.verbose:
print('Annotation:', new_annotation[i])
print('Prediction:', prediction[i])
print(results)
# check if there was an error
is_error = (new_annotation[i] != prediction[i])
if is_error:
if prediction[i] == 'O':
mistakes.append(ModelError("FN",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata))
elif new_annotation[i] == 'O':
mistakes.append(ModelError("FP",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata))
else:
mistakes.append(ModelError("Wrong entity",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata))
return results, mistakes
def _adjust_per_entities(self, tags):
if self.entities:
return [tag if tag in self.entities else 'O' for tag in tags]
@staticmethod
def _to_io(tags):
"""
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
['B-PERSON','I-PERSON','L-PERSON'] is translated into
['PERSON','PERSON','PERSON']
:param tags: the input tags in BILOU/IOB/BIO format
:return: a new list of IO tags
"""
return [tag[2:] if '-' in tag else tag for tag in tags]
def evaluate_sample(self, sample: InputSample) -> EvaluationResult:
if self.verbose:
print("Input sentence: {}".format(sample.full_text))
prediction = self.predict(sample)
results, mistakes = self.compare(
input_sample=sample,
prediction=prediction)
return EvaluationResult(results, mistakes, sample.full_text)
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
evaluation_results = []
for sample in tqdm(dataset, desc='Evaluating {}'.format(self.__class__)):
evaluation_result = self.evaluate_sample(sample)
evaluation_results.append(evaluation_result)
return evaluation_results
def calculate_score(self, evaluation_results: List[
EvaluationResult], beta: float = 1) \
-> EvaluationResult:
"""
Returns the pii_precision, pii_recall and f_measure either for each entity
or for all entities (ignore_entity_type = True)
:param evaluation_results: List of EvaluationResult
:param beta: F measure beta value
between different entity types, or to treat these as misclassifications
:return: EvaluationResult with precision, recall and f measures
"""
# aggregate results
all_results = sum([er.results for er in evaluation_results], Counter())
# compute pii_recall per entity
entity_recall = {}
entity_precision = {}
if self.entities:
entities = self.entities
else:
entities = list(
set([x[0] for x in all_results.keys() if x[0] != 'O']))
for entity in entities:
# all annotation of given type
annotated = sum(
[all_results[x] for x in all_results if x[0] == entity])
predicted = sum(
[all_results[x] for x in all_results if x[1] == entity])
tp = all_results[(entity, entity)]
if annotated > 0:
entity_recall[entity] = tp / annotated
else:
entity_recall[entity] = np.NaN
if predicted > 0:
per_entity_tp = all_results[(entity, entity)]
entity_precision[entity] = per_entity_tp / predicted
else:
entity_precision[entity] = np.NaN
# compute pii_precision and pii_recall
annotated_all = sum(
[all_results[x] for x in all_results if x[0] != 'O'])
predicted_all = sum(
[all_results[x] for x in all_results if x[1] != 'O'])
if annotated_all > 0:
pii_recall = sum([all_results[x] for x in all_results if
(x[0] != 'O' and x[1] != 'O')]) / annotated_all
else:
pii_recall = np.NaN
if predicted_all > 0:
pii_precision = sum([all_results[x] for x in all_results if
(x[0] != 'O' and x[1] != 'O')]) / predicted_all
else:
pii_precision = np.NaN
# compute pii_f_beta-score
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
# aggregate errors
errors = []
for res in evaluation_results:
if res.model_errors:
errors.extend(res.model_errors)
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
evaluation_result.pii_precision = pii_precision
evaluation_result.pii_recall = pii_recall
evaluation_result.entity_recall_dict = entity_recall
evaluation_result.entity_precision_dict = entity_precision
evaluation_result.pii_f = pii_f_beta
return evaluation_result
@staticmethod
def precision(tp: int, fp: int) -> float:
return tp / (tp + fp + 1e-100)
@staticmethod
def recall(tp: int, fn: int) -> float:
return tp / (tp + fn + 1e-100)
@staticmethod
def f_beta(precision: float, recall: float, beta: float) -> float:
"""
Returns the F score for precision, recall and a beta parameter
:param precision: a float with the precision value
:param recall: a float with the recall value
:param beta: a float with the beta parameter of the F measure,
which gives more or less weight to precision
vs. recall
:return: a float value of the f(beta) measure.
"""
if np.isnan(precision) or np.isnan(recall) or (
precision == 0 and recall == 0):
return np.nan
return ((1 + beta ** 2) * precision * recall) / (
((beta ** 2) * precision) + recall)
@staticmethod
def align_input_samples_to_presidio_analyzer(input_samples: List[InputSample],
entities_mapping: Dict[str, str],
presidio_fields: List[str]=None) \
-> List[InputSample]:
"""
Change input samples to conform with Presidio's entities
:return: new list of InputSample
"""
new_input_samples = input_samples.copy()
# Match entity names to Presidio's
if not presidio_fields:
presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',
'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']
# A list that will contain updated input samples,
new_list = []
# Iterate on all samples
for input_sample in new_input_samples:
contains_presidio_field = False
new_spans = []
# Update spans to match Presidio's entity name
for span in input_sample.spans:
in_presidio_field = False
if span.entity_type in entities_mapping.keys():
new_name = entities_mapping.get(span.entity_type)
span.entity_type = new_name
contains_presidio_field = True
# Add to new span list, if the span contains an entity relevant to Presidio
new_spans.append(span)
input_sample.spans = new_spans
# Update tags in case this sample has relevant entities for evaluation
if contains_presidio_field:
for i, tag in enumerate(input_sample.tags):
has_prefix = '-' in tag
if has_prefix:
prefix = tag[:2]
clean = tag[2:]
else:
prefix = ""
clean = tag
if clean in entities_mapping.keys():
new_name = entities_mapping.get(clean)
input_sample.tags[i] = "{}{}".format(prefix, new_name)
else:
input_sample.tags[i] = 'O'
new_list.append(input_sample)
return new_list
@staticmethod
def get_false_positives(errors=List[ModelError], entity=None):
"""
Get a list of all false positive errors in the results
"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [model_error for model_error in errors if
model_error.error_type == 'FP' and model_error.prediction in entity]
else:
return [model_error for model_error in errors if model_error.error_type == 'FP']
@staticmethod
def get_false_negatives(errors=List[ModelError], entity=None):
"""
Get a list of all false positive negative errors in the results (False negatives and wrong entity detection)
"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [model_error for model_error in errors if
model_error.error_type != 'FP' and model_error.annotation in entity]
else:
return [model_error for model_error in errors if model_error.error_type != 'FP']
@staticmethod
def most_common_fp_tokens(errors=List[ModelError], n: int = 10, entity=None):
"""
Print the n most common false positive tokens (tokens thought to be an entity)
"""
fps = ModelEvaluator.get_false_positives(errors, entity)
tokens = [err.token.text for err in fps]
from collections import Counter
by_frequency = Counter(tokens)
most_common = by_frequency.most_common(n)
print("Most common false positive tokens:")
print(most_common)
print("Example sentence with each FP token:")
for tok, val in most_common:
with_tok = [err for err in fps if err.token.text == tok]
print(with_tok[0].full_text)
@staticmethod
def most_common_fn_tokens(errors=List[ModelError], n: int = 10, entity=None):
"""
Print all tokens that were missed by the model, including an example of the full text in which they appear
"""
fns = ModelEvaluator.get_false_negatives(errors, entity)
fns_tokens = [err.token.text for err in fns]
from collections import Counter
by_frequency_fns = Counter(fns_tokens)
most_common_fns = by_frequency_fns.most_common(50)
print(most_common_fns)
for tok, val in most_common_fns:
with_tok = [err for err in fns if err.token.text == tok]
print("Token: {}, Annotation: {}, Full text: {}".format(with_tok[0].token, with_tok[0].annotation,
with_tok[0].full_text))
@staticmethod
def get_errors_df(errors=List[ModelError], entity: List[str] = None, error_type: str = 'FN'):
"""
Get ModelErrors as pd.DataFrame
"""
if error_type == 'FN':
filtered_errors = ModelEvaluator.get_false_negatives(errors, entity)
elif error_type == 'FP':
filtered_errors = ModelEvaluator.get_false_positives(errors, entity)
else:
raise ValueError("error_type should be either FP or FN")
if len(filtered_errors) == 0:
print("No errors of type {} and entity {} were found".format(error_type,entity))
return None
errors_df = pd.DataFrame.from_records([error.__dict__ for error in filtered_errors])
metadata_df = pd.DataFrame(errors_df['metadata'].tolist())
errors_df.drop(['metadata'], axis=1, inplace=True)
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
return new_errors_df
@staticmethod
def get_fps_dataframe(errors=List[ModelError], entity: List[str] = None):
"""
Get false positive ModelErrors as pd.DataFrame
"""
return ModelEvaluator.get_errors_df(errors, entity, error_type='FP')
@staticmethod
def get_fns_dataframe(errors=List[ModelError], entity: List[str] = None):
"""
Get false negative ModelErrors as pd.DataFrame
"""
return ModelEvaluator.get_errors_df(errors, entity, error_type='FN')


@ -0,0 +1,15 @@
from .base_model import BaseModel
from .crf_model import CRFModel
from .presidio_analyzer_wrapper import PresidioAnalyzerWrapper
from .presidio_recognizer_wrapper import PresidioRecognizerWrapper
from .spacy_model import SpacyModel
from .flair_model import FlairModel
__all__ = [
"BaseModel",
"CRFModel",
"PresidioRecognizerWrapper",
"PresidioAnalyzerWrapper",
"SpacyModel",
"FlairModel",
]


@ -0,0 +1,37 @@
from abc import ABC, abstractmethod
from typing import List
from presidio_evaluator import InputSample
class BaseModel(ABC):
def __init__(
self,
labeling_scheme: str = "BILUO",
entities_to_keep: List[str] = None,
verbose: bool = False,
):
"""
Abstract class for evaluating NER models and others
:param entities_to_keep: Which entities should be evaluated? All other
entities are ignored. If None, none are filtered
:param labeling_scheme: Used to translate (if needed)
the prediction to a specific scheme (IO, BIO/IOB, BILUO)
:param verbose: Whether to print more debug info
"""
self.entities = entities_to_keep
self.labeling_scheme = labeling_scheme
self.verbose = verbose
@abstractmethod
def predict(self, sample: InputSample) -> List[str]:
"""
Abstract. Returns the predicted tokens/spans from the evaluated model
:param sample: Sample to be evaluated
:return: List of predicted tags, one per token, in self.labeling_scheme format
"""
pass
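
Plugging a custom model into the evaluation flow only requires subclassing BaseModel and implementing predict; a deliberately naive sketch:

``` python
from typing import List

from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class NoOpModel(BaseModel):
    """Toy model that predicts 'O' for every token (for illustration only)."""

    def predict(self, sample: InputSample) -> List[str]:
        return ["O" for _ in sample.tokens]
```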


@ -0,0 +1,104 @@
import pickle
from typing import List
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
class CRFModel(BaseModel):
def __init__(
self,
model_pickle_path: str = "../models/crf.pickle",
entities_to_keep: List[str] = None,
verbose: bool = False,
):
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
)
if model_pickle_path is None:
raise ValueError("model_pickle_path must be supplied")
with open(model_pickle_path, "rb") as f:
self.model = pickle.load(f)
def predict(self, sample: InputSample) -> List[str]:
tags = CRFModel.crf_predict(sample, self.model)
if self.entities:
tags = [tag for tag in tags if tag in self.entities]
if len(tags) != len(sample.tokens):
print("mismatch between previous tokens and new tokens")
# translated_tags = sample.rename_from_spacy_tags(tags)
return tags
@staticmethod
def crf_predict(sample, model):
sample.translate_input_sample_tags()
conll = sample.to_conll(translate_tags=True)
sentence = [(di["text"], di["pos"], di["label"]) for di in conll]
features = CRFModel.sent2features(sentence)
return model.predict([features])[0]
@staticmethod
def word2features(sent, i):
word = sent[i][0]
postag = sent[i][1]
features = {
"bias": 1.0,
"word.lower()": word.lower(),
"word[-3:]": word[-3:],
"word[-2:]": word[-2:],
"word.isupper()": word.isupper(),
"word.istitle()": word.istitle(),
"word.isdigit()": word.isdigit(),
"postag": postag,
"postag[:2]": postag[:2],
}
if i > 0:
word1 = sent[i - 1][0]
postag1 = sent[i - 1][1]
features.update(
{
"-1:word.lower()": word1.lower(),
"-1:word.istitle()": word1.istitle(),
"-1:word.isupper()": word1.isupper(),
"-1:postag": postag1,
"-1:postag[:2]": postag1[:2],
}
)
else:
features["BOS"] = True
if i < len(sent) - 1:
word1 = sent[i + 1][0]
postag1 = sent[i + 1][1]
features.update(
{
"+1:word.lower()": word1.lower(),
"+1:word.istitle()": word1.istitle(),
"+1:word.isupper()": word1.isupper(),
"+1:postag": postag1,
"+1:postag[:2]": postag1[:2],
}
)
else:
features["EOS"] = True
return features
@staticmethod
def sent2features(sent):
return [CRFModel.word2features(sent, i) for i in range(len(sent))]
@staticmethod
def sent2labels(sent):
return [label for token, postag, label in sent]
@staticmethod
def sent2tokens(sent):
return [token for token, postag, label in sent]
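
A small sketch of the CRF feature extraction above, using a made-up (token, POS tag, label) sentence; only the static helpers are exercised, so no pickled model is needed:

``` python
from presidio_evaluator.models import CRFModel

sentence = [("Dan", "NNP", "U-PERSON"), ("lives", "VBZ", "O"), ("here", "RB", "O")]

features = CRFModel.sent2features(sentence)
labels = CRFModel.sent2labels(sentence)

print(features[0]["word.lower()"])  # "dan"
print(features[0]["BOS"])           # True (first token of the sentence)
print(labels)                       # ["U-PERSON", "O", "O"]
```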


@ -1,42 +1,40 @@
from typing import List
import spacy
try:
from flair.data import Sentence, build_spacy_tokenizer
from flair.models import SequenceTagger
from flair.tokenization import SpacyTokenizer
except ImportError:
print("Flair is not installed by default")
from presidio_evaluator import ModelEvaluator, InputSample
import spacy
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
class FlairEvaluator(ModelEvaluator):
def __init__(self,
model=None,
model_path: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True,
translate_to_spacy_entities=True):
class FlairModel(BaseModel):
def __init__(
self,
model=None,
model_path: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
translate_to_spacy_entities=True,
):
"""
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path: path to load a SequenceTagger model from, if model is not provided
:param entities_to_keep: list of entity types to focus on (all others are ignored)
:param verbose: whether to print more debug info
:param translate_to_spacy_entities: whether to translate dataset entity names to spaCy entity names
"""
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
)
if model is None:
if model_path is None:
raise ValueError("Either model_path or model object must be supplied")
@ -44,11 +42,15 @@ class FlairEvaluator(ModelEvaluator):
else:
self.model = model
self.spacy_tokenizer = build_spacy_tokenizer(model=spacy.blank('en'))
self.spacy_tokenizer = SpacyTokenizer(model=spacy.load("en_core_web_lg"))
self.translate_to_spacy_entities = translate_to_spacy_entities
if self.translate_to_spacy_entities:
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
print(
"Translating entities using this dictionary: {}".format(
PRESIDIO_SPACY_ENTITIES
)
)
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_entities:
@ -59,13 +61,17 @@ class FlairEvaluator(ModelEvaluator):
tags = self.get_tags_from_sentence(sentence)
if len(tags) != len(sample.tokens):
print("mismatch between previous tokens and new tokens")
if self.entities:
tags = [tag for tag in tags if tag in self.entities]
return tags
@staticmethod
def get_tags_from_sentence(sentence):
tags = []
for token in sentence:
tags.append(token.get_tag('ner').value)
tags.append(token.get_tag("ner").value)
new_tags = []
for tag in tags:


@ -0,0 +1,80 @@
from typing import List
from presidio_analyzer import AnalyzerEngine
from presidio_evaluator import InputSample, span_to_tag
from presidio_evaluator.models import BaseModel
class PresidioAnalyzerWrapper(BaseModel):
def __init__(
self,
analyzer_engine=AnalyzerEngine(),
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme="BIO",
score_threshold=0.4,
):
"""
Evaluation wrapper for the Presidio Analyzer
:param analyzer_engine: object of type AnalyzerEngine (from presidio-analyzer)
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
)
self.analyzer_engine = analyzer_engine
self.score_threshold = score_threshold
def predict(self, sample: InputSample) -> List[str]:
results = self.analyzer_engine.analyze(
text=sample.full_text,
entities=self.entities,
language="en",
score_threshold=self.score_threshold,
)
starts = []
ends = []
scores = []
tags = []
#
for res in results:
starts.append(res.start)
ends.append(res.end)
tags.append(res.entity_type)
scores.append(res.score)
response_tags = span_to_tag(
scheme=self.labeling_scheme,
text=sample.full_text,
start=starts,
end=ends,
tokens=sample.tokens,
scores=scores,
tag=tags,
)
return response_tags
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
presidio_entities_map = {
"PERSON": "PERSON",
"EMAIL_ADDRESS": "EMAIL_ADDRESS",
"CREDIT_CARD": "CREDIT_CARD",
"FIRST_NAME": "PERSON",
"PHONE_NUMBER": "PHONE_NUMBER",
"BIRTHDAY": "DATE_TIME",
"DATE_TIME": "DATE_TIME",
"DOMAIN": "DOMAIN",
"CITY": "LOCATION",
"ADDRESS": "LOCATION",
"NATIONALITY": "LOCATION",
"IBAN": "IBAN_CODE",
"URL": "DOMAIN_NAME",
"US_SSN": "US_SSN",
"IP_ADDRESS": "IP_ADDRESS",
"ORGANIZATION": "ORG",
"O": "O",
}
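
A short sketch of using the wrapper directly: predict returns one tag per token of an InputSample (dataset path and entity list are illustrative):

``` python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.models import PresidioAnalyzerWrapper

samples = read_synth_dataset("data/synth_dataset.txt")  # hypothetical path

wrapper = PresidioAnalyzerWrapper(entities_to_keep=["PERSON", "PHONE_NUMBER"])
tags = wrapper.predict(samples[0])

# Pair each token with its predicted tag (BIO scheme by default)
print(list(zip([token.text for token in samples[0].tokens], tags)))
```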


@ -0,0 +1,75 @@
from typing import List
from presidio_analyzer import EntityRecognizer
from presidio_analyzer.nlp_engine import NlpEngine
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
from presidio_evaluator.span_to_tag import span_to_tag
class PresidioRecognizerWrapper(BaseModel):
def __init__(
self,
recognizer: EntityRecognizer,
nlp_engine: NlpEngine,
entities_to_keep: List[str] = None,
labeling_scheme: str = "BILUO",
with_nlp_artifacts: bool = False,
verbose: bool = False,
):
"""
Evaluator for one specific PII recognizer
To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
:param entities_to_keep: List of entity types to focus on while ignoring all the rest.
Default=None would look at all entity types
:param with_nlp_artifacts: Whether NLP artifacts should be obtained
(faster if not, but some recognizers need it)
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
)
self.with_nlp_artifacts = with_nlp_artifacts
self.recognizer = recognizer
self.nlp_engine = nlp_engine
#
def __make_nlp_artifacts(self, text: str):
return self.nlp_engine.process_text(text, "en")
#
def predict(self, sample: InputSample) -> List[str]:
nlp_artifacts = None
if self.with_nlp_artifacts:
nlp_artifacts = self.__make_nlp_artifacts(sample.full_text)
results = self.recognizer.analyze(
sample.full_text, self.entities, nlp_artifacts
)
starts = []
ends = []
tags = []
scores = []
for res in results:
if not res.start:
res.start = 0
starts.append(res.start)
ends.append(res.end)
tags.append(res.entity_type)
scores.append(res.score)
response_tags = span_to_tag(
scheme=self.labeling_scheme,
text=sample.full_text,
start=starts,
end=ends,
tag=tags,
tokens=sample.tokens,
scores=scores,
)
if len(sample.tags) == 0:
sample.tags = ["0" for _ in response_tags]
return response_tags


@ -0,0 +1,55 @@
from typing import List
import spacy
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
class SpacyModel(BaseModel):
def __init__(
self,
model: spacy.language.Language = None,
model_name: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
translate_to_spacy_entities=True,
):
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
)
if model is None:
if model_name is None:
raise ValueError("Either model_name or model object must be supplied")
self.model = spacy.load(model_name)
else:
self.model = model
self.translate_to_spacy_entities = translate_to_spacy_entities
if self.translate_to_spacy_entities:
print(
"Translating entites using this dictionary: {}".format(
PRESIDIO_SPACY_ENTITIES
)
)
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_entities:
sample.translate_input_sample_tags()
doc = self.model(sample.full_text)
tags = self.get_tags_from_doc(doc)
if len(doc) != len(sample.tokens):
print("mismatch between input tokens and new tokens")
return tags
@staticmethod
def get_tags_from_doc(doc):
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
return tags
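
A similar sketch for evaluating an off-the-shelf spaCy pipeline through SpacyModel (model name, entity list and dataset path are illustrative):

``` python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import SpacyModel

samples = read_synth_dataset("data/synth_dataset.txt")  # hypothetical path

spacy_model = SpacyModel(
    model_name="en_core_web_lg", entities_to_keep=["PERSON", "GPE", "ORG"]
)
evaluator = Evaluator(model=spacy_model)
score = evaluator.calculate_score(evaluator.evaluate_all(samples))
print(score.entity_recall_dict)
```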


@ -1,153 +0,0 @@
from typing import List
from presidio_analyzer import AnalyzerEngine
from presidio_evaluator import ModelEvaluator, InputSample, span_to_tag
from presidio_evaluator.data_generator import read_synth_dataset
class PresidioAnalyzerEvaluator(ModelEvaluator):
def __init__(
self,
analyzer=AnalyzerEngine(),
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme="BIO",
compare_by_io=True,
score_threshold=0.4,
):
"""
Evaluation wrapper for the Presidio Analyzer
:param analyzer: object of type AnalyzerEngine (from presidio-analyzer)
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io,
)
self.analyzer = analyzer
self.score_threshold = score_threshold
def predict(self, sample: InputSample) -> List[str]:
if self.entities is None or len(self.entities) == 0:
all_fields = True
else:
all_fields = None
results = self.analyzer.analyze(
text=sample.full_text,
entities=self.entities,
language="en",
all_fields=all_fields,
)
starts = []
ends = []
scores = []
tags = []
#
for res in results:
#
if res.score >= self.score_threshold:
starts.append(res.start)
ends.append(res.end)
tags.append(res.entity_type)
scores.append(res.score)
#
response_tags = span_to_tag(
scheme=self.labeling_scheme,
text=sample.full_text,
start=starts,
end=ends,
tokens=sample.tokens,
scores=scores,
tag=tags,
)
return response_tags
if __name__ == "__main__":
print("Reading dataset")
input_samples = read_synth_dataset("../data/synth_dataset.txt")
print("Preparing dataset by aligning entity names to Presidio's entity names")
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
entities_mapping = {
"PERSON": "PERSON",
"EMAIL": "EMAIL_ADDRESS",
"CREDIT_CARD": "CREDIT_CARD",
"FIRST_NAME": "PERSON",
"PHONE_NUMBER": "PHONE_NUMBER",
"BIRTHDAY": "DATE_TIME",
"DATE": "DATE_TIME",
"DOMAIN": "DOMAIN",
"CITY": "LOCATION",
"ADDRESS": "LOCATION",
"IBAN": "IBAN_CODE",
"URL": "DOMAIN_NAME",
"US_SSN": "US_SSN",
"IP_ADDRESS": "IP_ADDRESS",
"ORGANIZATION": "ORG",
"O": "O",
}
updated_samples = ModelEvaluator.align_input_samples_to_presidio_analyzer(
input_samples, entities_mapping
)
flatten = lambda l: [item for sublist in l for item in sublist]
from collections import Counter
count_per_entity = Counter(
[
span.entity_type
for span in flatten(
[input_sample.spans for input_sample in updated_samples]
)
]
)
print("Evaluating samples")
analyzer = PresidioAnalyzerEvaluator(entities_to_keep=count_per_entity.keys())
evaluated_samples = analyzer.evaluate_all(updated_samples)
print("Estimating metrics")
score = analyzer.calculate_score(evaluation_results=evaluated_samples, beta=2.5)
precision = score.pii_precision
recall = score.pii_recall
entity_recall = score.entity_recall_dict
entity_precision = score.entity_precision_dict
f = score.pii_f
errors = score.model_errors
#
print("precision: {}".format(precision))
print("Recall: {}".format(recall))
print("F 2.5: {}".format(f))
print("Precision per entity: {}".format(entity_precision))
print("Recall per entity: {}".format(entity_recall))
#
FN_mistakes = [str(mistake) for mistake in errors if mistake.error_type == "FN"]
FP_mistakes = [str(mistake) for mistake in errors if mistake.error_type == "FP"]
other_mistakes = [
str(mistake) for mistake in errors if mistake.error_type not in ["FN", "FP"]
]
fn = open("../data/fn_30000.txt", "w+", encoding="utf-8")
fn1 = "\n".join(FN_mistakes)
fn.write(fn1)
fn.close()
fp = open("../data/fp_30000.txt", "w+", encoding="utf-8")
fp1 = "\n".join(FP_mistakes)
fp.write(fp1)
fp.close()
mistakes_file = open("../data/mistakes_30000.txt", "w+", encoding="utf-8")
mistakes1 = "\n".join(other_mistakes)
mistakes_file.write(mistakes1)
mistakes_file.close()
from pickle import dump
dump(evaluated_samples, open("../data/evaluated_samples_30000.pickle", "wb"))


@ -1,133 +0,0 @@
import json
from typing import List
import requests
from presidio_evaluator import InputSample, ModelEvaluator
from presidio_evaluator.span_to_tag import span_to_tag, tokenize
ENDPOINT = "http://40.113.201.221:8080/api/v1/projects/test/analyze"
class PresidioAPIEvaluator(ModelEvaluator):
def __init__(self, endpoint=None, all_fields=False, entities_to_keep=None,
verbose=False, labeling_scheme="IO", **kwargs):
"""
evaluator model for the presidio API as a system
:param endpoint: url of presidio API
:param all_fields: boolean, true if no entities filtering should take
place
:param entities_to_keep: list of entities to return if found
:param labeling_scheme: BIO/IOB or BILOU
:param verbose:
:param kwargs:
"""
if not endpoint:
print(
"Endpoint is missing. using default presidio API at {}".format(
ENDPOINT))
self.endpoint = ENDPOINT
else:
self.endpoint = endpoint
if not entities_to_keep and not all_fields:
raise ValueError("Please provide either a list of entities or"
"all_fields=true")
if all_fields:
entities_to_keep = None
super().__init__(verbose=verbose, entities_to_keep=entities_to_keep,
labeling_scheme=labeling_scheme, **kwargs)
self.set_analyze_template(all_fields=all_fields,
entities=entities_to_keep)
def predict(self, sample: InputSample):
text = sample.full_text
request = {"text": text,
"analyzeTemplate": self.analyze_template
}
# Call presidio API
r = requests.post(self.endpoint, json=request)
starts = []
ends = []
tags = []
if r.status_code == 200:
analyzer_results = json.loads(r.text)
if self.verbose:
print(analyzer_results)
if analyzer_results:
for res in analyzer_results:
if not res['location'].get('start'):
res['location']['start'] = 0
starts.append(res['location']['start'])
ends.append(res['location']['end'])
tags.append(res['field']['name'])
response_tags = span_to_tag(scheme=self.labeling_scheme,
text=text,
start=starts,
end=ends,
tag=tags)
elif r.status_code == 400 or r.text == "":
if self.verbose:
print("Status 400 received")
response_tags = ['O' for token in sample.tokens]
else:
print("Error getting result from Presidio API")
print("Request = {}".format(request))
print("Response = {}".format(r.text))
raise Exception(r)
return response_tags
def set_analyze_template(self, all_fields: bool, entities: List[str]):
template = {
"fields": [{"name": "EMAIL_ADDRESS"}, {"name": "IP_ADDRESS"},
{"name": "US_DRIVER_LICENSE"},
{"name": "US_ITIN"}, {"name": "US_SSN"},
{"name": "DOMAIN_NAME"},
{"name": "IBAN_CODE"}, {"name": "PERSON"},
{"name": "PHONE_NUMBER"},
{"name": "US_BANK_NUMBER"}, {"name": "CRYPTO"},
{"name": "NRP"},
{"name": "UK_NHS"}, {"name": "CREDIT_CARD"},
{"name": "DATE_TIME"},
{"name": "LOCATION"}, {"name": "US_PASSPORT"}]}
if all_fields:
self.analyze_template = template
return
requested_fields = []
for entity in entities:
for field in template['fields']:
if entity == field['name']:
requested_fields.append(field)
new_template = {'fields': requested_fields}
self.analyze_template = new_template
if __name__ == "__main__":
# Example:
text = "My siblings are Dan and magen"
bilou_tags = ['O', 'O', 'O', 'U-PERSON', 'O', 'U-PERSON']
presidio = PresidioAPIEvaluator(verbose=True, all_fields=True, compare_by_io=True)
tokens = tokenize(text)
s = InputSample(text, masked=None, spans=None)
s.tokens = tokens
s.tags = bilou_tags
evaluated_sample = presidio.evaluate_sample(s)
p, r, entity_recall, f, mistakes = presidio.calculate_score([evaluated_sample])
print("Precision = {}\n"
"Recall = {}\n"
"F_3 = {}\n"
"Errors = {}".format(p, r, f, mistakes))


@ -1,88 +0,0 @@
"""
Presidio Analyzer not yet on PyPI, therefore it cannot be referenced explicitly
"""
import math
from typing import List, Tuple, Dict
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_evaluator import ModelEvaluator, InputSample, EvaluationResult
from presidio_evaluator.span_to_tag import span_to_tag
class PresidioRecognizerEvaluator(ModelEvaluator):
def __init__(
self,
recognizer,
nlp_engine,
entities_to_keep=None,
with_nlp_artifacts=False,
verbose=False,
compare_by_io=True,
):
"""
Evaluator for one recognizer
:param recognizer: An object of type EntityRecognizer (in presidion-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
compare_by_io=compare_by_io,
)
self.withNlpArtifacts = with_nlp_artifacts
self.recognizer = recognizer
self.nlp_engine = nlp_engine
#
def __make_nlp_artifacts(self, text: str):
return self.nlp_engine.process_text(text, "en")
#
def predict(self, sample: InputSample) -> List[str]:
nlpArtifacts = None
if self.withNlpArtifacts:
nlpArtifacts = self.__make_nlp_artifacts(sample.full_text)
results = self.recognizer.analyze(sample.full_text, self.entities, nlpArtifacts)
starts = []
ends = []
tags = []
scores = []
for res in results:
if not res.start:
res.start = 0
starts.append(res.start)
ends.append(res.end)
tags.append(res.entity_type)
scores.append(res.score)
response_tags = span_to_tag(
scheme=self.labeling_scheme,
text=sample.full_text,
start=starts,
end=ends,
tag=tags,
tokens=sample.tokens,
scores=scores,
io_tags_only=self.compare_by_io,
)
if len(sample.tags) == 0:
sample.tags = ["0" for word in response_tags]
return response_tags
def score_presidio_recognizer(
recognizer, entities_to_keep, input_samples, withNlpArtifacts=False
) -> EvaluationResult:
model = PresidioRecognizerEvaluator(
recognizer=recognizer,
entities_to_keep=entities_to_keep,
nlp_engine=SpacyNlpEngine(),
with_nlp_artifacts=withNlpArtifacts,
)
evaluated_samples = model.evaluate_all(input_samples[:])
evaluation_result = model.calculate_score(evaluated_samples, beta=2.5)
evaluation_result.print()
if math.isnan(evaluation_result.pii_precision):
evaluation_result.pii_precision = 0
return evaluation_result


@ -1,52 +0,0 @@
from typing import List
from presidio_evaluator import ModelEvaluator, InputSample
import spacy
from spacy.language import Language
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
class SpacyEvaluator(ModelEvaluator):
def __init__(self,
model: spacy.language.Language = None,
model_name: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True,
translate_to_spacy_ents = True):
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
if model is None:
if model_name is None:
raise ValueError("Either model_name or model object must be supplied")
self.model = spacy.load(model_name)
else:
self.model = model
self.translate_to_spacy_ents = translate_to_spacy_ents
if self.translate_to_spacy_ents:
print("Translating entites using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_ents:
sample.translate_input_sample_tags()
doc = self.model(sample.full_text)
tags = self.get_tags_from_doc(doc)
if len(doc) != len(sample.tokens):
print("mismatch between input tokens and new tokens")
return tags
@staticmethod
def get_tags_from_doc(doc):
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
return tags


@ -1,14 +1,14 @@
from collections import namedtuple
from typing import List
import spacy
from spacy.tokens import Token
loaded_spacy = {}
def get_spacy(loaded_spacy=loaded_spacy, model_version="en_core_web_lg"):
if model_version not in loaded_spacy:
disable = ['vectors', 'textcat', 'ner']
disable = ["vectors", "textcat", "ner"]
print("loading model {}".format(model_version))
loaded_spacy[model_version] = spacy.load(model_version, disable=disable)
return loaded_spacy[model_version]
@ -26,7 +26,7 @@ def _get_detailed_tags(scheme, cur_tags):
:return:
"""
if all([tag == 'O' for tag in cur_tags]):
if all([tag == "O" for tag in cur_tags]):
return cur_tags
return_tags = []
@ -52,7 +52,12 @@ def _get_detailed_tags(scheme, cur_tags):
def _sort_spans(start, end, tag, score):
if len(start) > 0:
tpl = [(a, b, c, d) for a, b, c, d in sorted(zip(start, end, tag, score), key=lambda pair: pair[0])]
tpl = [
(a, b, c, d)
for a, b, c, d in sorted(
zip(start, end, tag, score), key=lambda pair: pair[0]
)
]
start, end, tag, score = [[x[i] for x in tpl] for i in range(len(tpl[0]))]
return start, end, tag, score
@ -65,8 +70,8 @@ def _handle_overlaps(start, end, tag, score):
index = min(start)
number_of_spans = len(start)
i = 0
while i < number_of_spans-1:
for j in range(i+1,number_of_spans):
while i < number_of_spans - 1:
for j in range(i + 1, number_of_spans):
# Span j intersects with span i
if start[i] <= start[j] <= end[i]:
# i's score is higher, remove intersecting part
@ -98,14 +103,15 @@ def _handle_overlaps(start, end, tag, score):
return start, end, tag, score
def span_to_tag(scheme: str,
text: str,
start: List[int],
end: List[int],
tag: List[str],
scores: List[float] = None,
tokens: List[spacy.tokens.Token] = None,
io_tags_only=False) -> List[str]:
def span_to_tag(
scheme: str,
text: str,
start: List[int],
end: List[int],
tag: List[str],
scores: List[float] = None,
tokens: List[spacy.tokens.Token] = None,
) -> List[str]:
"""
Turns a list of start and end values with corresponding labels, into a NER
tagging (BILOU,BIO/IOB)
@ -116,7 +122,6 @@ def span_to_tag(scheme: str,
:param end: list of indices where entities in the text end
:param tag: list of entity names
:param scores: score of tag (confidence)
:param io_tags_only: Whether to return only I and O tags
:return: list of strings, representing either BILOU or BIO for the input
"""
@ -141,7 +146,7 @@ def span_to_tag(scheme: str,
if not found:
io_tags.append("O")
if io_tags_only or scheme == "IO":
if scheme == "IO":
return io_tags
# Set tagging based on scheme (BIO/IOB or BILOU)
@ -158,7 +163,9 @@ def span_to_tag(scheme: str,
new_return_tags = []
for i in range(len(changes) - 1):
new_return_tags.extend(
_get_detailed_tags(scheme=scheme,
cur_tags=io_tags[changes[i]:changes[i + 1]]))
_get_detailed_tags(
scheme=scheme, cur_tags=io_tags[changes[i] : changes[i + 1]]
)
)
return new_return_tags


@ -7,7 +7,7 @@ import json
from presidio_evaluator import InputSample
def split_dataset(dataset : List[InputSample], ratios):
def split_dataset(dataset: List[InputSample], ratios):
"""
Splits a provided dataset into n groups, by the Template# attribute in each sample's metadata
:param dataset: List of InputSamples to be split
@ -23,7 +23,9 @@ def split_dataset(dataset : List[InputSample], ratios):
for ratio in ratios:
if 1 >= ratio > 0:
first_templates, second_templates = split_by_template(remaining_dataset, ratio/remaining_ratio)
first_templates, second_templates = split_by_template(
remaining_dataset, ratio / remaining_ratio
)
first_split = get_samples_by_pattern(remaining_dataset, first_templates)
second_split = get_samples_by_pattern(remaining_dataset, second_templates)
splits.append(first_split)
@ -39,7 +41,7 @@ def group_by_template(dataset: List[InputSample]) -> Dict[str, List[InputSample]
"""
Creates a dict of key = template ID and value = List[InputSamples] for this template id
"""
samples_pattern_tup = [(sample.metadata["Template#"],sample) for sample in dataset]
samples_pattern_tup = [(sample.metadata["Template#"], sample) for sample in dataset]
group_by_template = defaultdict(list)
for sample in samples_pattern_tup:
@ -55,7 +57,9 @@ def split_by_template(input_samples: List[InputSample], train_pct: float = 0.7):
samples_grpd = group_by_template(input_samples)
templates = np.array(list(samples_grpd.keys()))
train_ind = set(random.sample(range(len(templates)), round(train_pct * len(templates))))
train_ind = set(
random.sample(range(len(templates)), round(train_pct * len(templates)))
)
test_ind = set(range(len(templates))) - train_ind
@ -75,5 +79,5 @@ def get_samples_by_pattern(input_samples, patterns_list):
def save_to_json(samples, output_file):
examples_dict = [example.to_dict() for example in samples]
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
with open("{}".format(output_file), "w+", encoding="utf-8") as f:
json.dump(examples_dict, f, ensure_ascii=False, indent=4)


@ -1,18 +1,15 @@
spacy
requests==2.22.0
numpy
jupyter
pandas
tqdm
haikunator
spacy==3.0.5
numpy==1.20.2
jupyter>=1
pandas~=1.2.4
tqdm~=4.60.0
haikunator~=2.1.0
schwifty
faker
sklearn
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz
regex
#azureml
#azureml-sdk
faker~=8.1.0
scikit_learn==0.24.1
#flair
sklearn_crfsuite
pytest
presidio_analyzer
sklearn_crfsuite==0.3.6
pytest~=6.2.3
presidio_analyzer
presidio_anonymizer
requests~=2.25.1


@ -1,4 +1,4 @@
from setuptools import setup
from setuptools import setup, find_packages
import os.path
# read the contents of the README file
from os import path
@ -7,7 +7,6 @@ this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
long_description = f.read()
# print(long_description)
__version__ = ""
with open(os.path.join(this_directory, 'VERSION')) as version_file:
__version__ = version_file.read().strip()
@ -17,16 +16,15 @@ setup(
long_description=long_description,
long_description_content_type='text/markdown',
version=__version__,
packages=['presidio_evaluator', 'presidio_evaluator.data_generator'
],
packages=find_packages(exclude=["tests"]),
url='https://www.github.com/microsoft/presidio',
license='MIT',
description='PII dataset generator, model evaluator for Presidio and PII data in general',
data_files=[('presidio_evaluator/data_generator/raw_data', ['presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv', 'presidio_evaluator/data_generator/raw_data/templates.txt', 'presidio_evaluator/data_generator/raw_data/organizations.csv', 'presidio_evaluator/data_generator/raw_data/nationalities.csv'])],
include_package_data=True,
install_requires=[
'spacy>=2.2.0',
'requests==2.22.0',
'spacy>=3.0.0',
'requests',
'numpy',
'pandas',
'tqdm>=4.32.1',


@ -12,7 +12,7 @@ def pytest_addoption(parser):
"--runslow", action="store_true", default=False, help="run slow tests"
)
parser.addoption(
"--runinconclusive", action="store_true", default=False, help="run slow tests"
"--runinconclusive", action="store_true", default=False, help="run inconclusive tests"
)

View file

@ -1,254 +1,139 @@
[
{
"full_text": "My full address is Avda. Alameda Sundheim 46",
"full_text": "I either live on 2347 Lauzon Parkway, Windsor N9A7A2 or ",
"masked": null,
"spans": [
{
"entity_type": "FULL_ADDRESS",
"entity_value": "Avda. Alameda Sundheim 46",
"start_position": 19,
"end_position": 44
"entity_type": "LOCATION",
"entity_value": "2347 Lauzon Parkway, Windsor N9A7A2",
"start_position": 17,
"end_position": 52
},
{
"entity_type": "LOCATION",
"entity_value": "",
"start_position": 56,
"end_position": 56
}
],
"tokens": [
{
"text": "My",
"idx": 0,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "full",
"idx": 3,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "amod",
"lemma_": "full",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "address",
"idx": 8,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "nsubj",
"lemma_": "address",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "is",
"idx": 16,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "ROOT",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Avda",
"idx": 19,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Avda",
"_": {
"is_in_vocabulary": false
}
},
{
"text": ".",
"idx": 23,
"tag_": ".",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ".",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Alameda",
"idx": 25,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "compound",
"lemma_": "Alameda",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Sundheim",
"idx": 33,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "ROOT",
"lemma_": "Sundheim",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "46",
"idx": 42,
"tag_": "CD",
"pos_": "NUM",
"dep_": "nummod",
"lemma_": "46",
"_": {
"is_in_vocabulary": false
}
}
],
"tags": [
"O",
"O",
"O",
"O",
"B-FULL_ADDRESS",
"I-FULL_ADDRESS",
"I-FULL_ADDRESS",
"I-FULL_ADDRESS",
"L-FULL_ADDRESS"
],
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "Croatian",
"Country": "Uganda",
"Lowercase": false,
"Template#": 9
}
},
{
"full_text": "You want my credit card? No problem: 4532368231815457",
"masked": null,
"spans": [
{
"entity_type": "CREDIT_CARD",
"entity_value": "4532368231815457",
"start_position": 37,
"end_position": 53
}
],
"tokens": [
{
"text": "You",
"text": "I",
"idx": 0,
"tag_": "PRP",
"pos_": "PRON",
"dep_": "nsubj",
"lemma_": "-PRON-",
"lemma_": "I",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "want",
"idx": 4,
"text": "either",
"idx": 2,
"tag_": "RB",
"pos_": "ADV",
"dep_": "advmod",
"lemma_": "either",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "live",
"idx": 9,
"tag_": "VBP",
"pos_": "VERB",
"dep_": "ROOT",
"lemma_": "want",
"lemma_": "live",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "my",
"idx": 9,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"text": "on",
"idx": 14,
"tag_": "IN",
"pos_": "ADP",
"dep_": "prep",
"lemma_": "on",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "credit",
"idx": 12,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "compound",
"lemma_": "credit",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "card",
"idx": 19,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "dobj",
"lemma_": "card",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "?",
"idx": 23,
"tag_": ".",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": "?",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "No",
"idx": 25,
"tag_": "DT",
"pos_": "DET",
"dep_": "det",
"lemma_": "no",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "problem",
"idx": 28,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "ROOT",
"lemma_": "problem",
"_": {
"is_in_vocabulary": false
}
},
{
"text": ":",
"idx": 35,
"tag_": ":",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ":",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "4532368231815457",
"idx": 37,
"text": "2347",
"idx": 17,
"tag_": "CD",
"pos_": "NUM",
"dep_": "nummod",
"lemma_": "2347",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Lauzon",
"idx": 22,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "compound",
"lemma_": "Lauzon",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Parkway",
"idx": 29,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "pobj",
"lemma_": "Parkway",
"_": {
"is_in_vocabulary": false
}
},
{
"text": ",",
"idx": 36,
"tag_": ",",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ",",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Windsor",
"idx": 38,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "compound",
"lemma_": "Windsor",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "N9A7A2",
"idx": 46,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "appos",
"lemma_": "4532368231815457",
"lemma_": "N9A7A2",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "or",
"idx": 53,
"tag_": "CC",
"pos_": "CCONJ",
"dep_": "cc",
"lemma_": "or",
"_": {
"is_in_vocabulary": false
}
@ -259,37 +144,38 @@
"O",
"O",
"O",
"O",
"O",
"O",
"O",
"O",
"U-CREDIT_CARD"
"B-LOCATION",
"I-LOCATION",
"I-LOCATION",
"I-LOCATION",
"I-LOCATION",
"L-LOCATION",
"O"
],
"template_id": null,
"metadata": {
"Gender": "female",
"NameSet": "Czech",
"Country": "Austria",
"Gender": "male",
"NameSet": "Polish",
"Country": "Croatia",
"Lowercase": false,
"Template#": 7
"Template#": 11
}
},
{
"full_text": "My first name is Rogelio and my last is Patrick",
"full_text": "My accounts are and ",
"masked": null,
"spans": [
{
"entity_type": "PERSON",
"entity_value": "Rogelio",
"start_position": 17,
"end_position": 24
"entity_type": "ACCOUNT_NUMBER",
"entity_value": "",
"start_position": 16,
"end_position": 16
},
{
"entity_type": "PERSON",
"entity_value": "Patrick",
"start_position": 40,
"end_position": 47
"entity_type": "ACCOUNT_NUMBER",
"entity_value": "",
"start_position": 21,
"end_position": 21
}
],
"tokens": [
@ -297,39 +183,28 @@
"text": "My",
"idx": 0,
"tag_": "PRP$",
"pos_": "DET",
"pos_": "PRON",
"dep_": "poss",
"lemma_": "-PRON-",
"lemma_": "my",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "first",
"text": "accounts",
"idx": 3,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "amod",
"lemma_": "first",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "name",
"idx": 9,
"tag_": "NN",
"tag_": "NNS",
"pos_": "NOUN",
"dep_": "nsubj",
"lemma_": "name",
"lemma_": "account",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "is",
"idx": 14,
"tag_": "VBZ",
"text": "are",
"idx": 12,
"tag_": "VBP",
"pos_": "AUX",
"dep_": "ROOT",
"lemma_": "be",
@ -338,19 +213,19 @@
}
},
{
"text": "Rogelio",
"idx": 17,
"tag_": "NNP",
"pos_": "PROPN",
"text": " ",
"idx": 16,
"tag_": "_SP",
"pos_": "SPACE",
"dep_": "attr",
"lemma_": "Rogelio",
"lemma_": " ",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "and",
"idx": 25,
"idx": 17,
"tag_": "CC",
"pos_": "CCONJ",
"dep_": "cc",
@ -358,47 +233,76 @@
"_": {
"is_in_vocabulary": false
}
},
}
],
"tags": [
"O",
"O",
"O",
"O",
"O"
],
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "Hispanic",
"Country": "Iraq",
"Lowercase": false,
"Template#": 14
}
},
{
"full_text": "I live in Uralaane",
"masked": null,
"spans": [
{
"text": "my",
"idx": 29,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
"entity_type": "LOCATION",
"entity_value": "Uralaane",
"start_position": 10,
"end_position": 18
}
],
"tokens": [
{
"text": "last",
"idx": 32,
"tag_": "JJ",
"pos_": "ADJ",
"text": "I",
"idx": 0,
"tag_": "PRP",
"pos_": "PRON",
"dep_": "nsubj",
"lemma_": "last",
"lemma_": "I",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "is",
"idx": 37,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "conj",
"lemma_": "be",
"text": "live",
"idx": 2,
"tag_": "VBP",
"pos_": "VERB",
"dep_": "ROOT",
"lemma_": "live",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Patrick",
"idx": 40,
"text": "in",
"idx": 7,
"tag_": "IN",
"pos_": "ADP",
"dep_": "prep",
"lemma_": "in",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Uralaane",
"idx": 10,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Patrick",
"dep_": "pobj",
"lemma_": "Uralaane",
"_": {
"is_in_vocabulary": false
}
@ -408,21 +312,15 @@
"O",
"O",
"O",
"O",
"U-PERSON",
"O",
"O",
"O",
"O",
"U-PERSON"
"U-LOCATION"
],
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "American",
"Country": "California",
"Gender": "female",
"NameSet": "Chechen (Latin)",
"Country": "United States Of America",
"Lowercase": false,
"Template#": 2
"Template#": 5
}
}
]

View file

@ -1,3 +1,11 @@
from .model_mock import IdentityTokensMockModel, \
FiftyFiftyIdentityTokensMockModel, \
MockTokensModel
from .model_mock import (
IdentityTokensMockModel,
FiftyFiftyIdentityTokensMockModel,
MockTokensModel,
)
__all__ = [
"IdentityTokensMockModel",
"FiftyFiftyIdentityTokensMockModel",
"MockTokensModel",
]

View file

@ -1,14 +1,15 @@
from typing import List
from typing import List, Optional
from presidio_evaluator import InputSample, ModelEvaluator
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
class MockTokensModel(ModelEvaluator):
class MockTokensModel(BaseModel):
"""
Simulates a real model, returns the prediction given in the constructor
"""
def __init__(self, prediction: List[str], entities_to_keep: List = None,
def __init__(self, prediction: Optional[List[str]], entities_to_keep: List = None,
verbose: bool = False, **kwargs):
super().__init__(entities_to_keep=entities_to_keep, verbose=verbose,
**kwargs)
@ -18,20 +19,19 @@ class MockTokensModel(ModelEvaluator):
return self.prediction
class IdentityTokensMockModel(ModelEvaluator):
class IdentityTokensMockModel(BaseModel):
"""
Simulates a real model, always return the label as prediction
"""
def __init__(self, entities_to_keep: List = None,
verbose: bool = False):
super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
def __init__(self, verbose: bool = False):
super().__init__(verbose=verbose)
def predict(self, sample: InputSample) -> List[str]:
return sample.tags
class FiftyFiftyIdentityTokensMockModel(ModelEvaluator):
class FiftyFiftyIdentityTokensMockModel(BaseModel):
"""
Simulates a real model, returns the label or no predictions (list of 'O')
alternately

View file

@ -1,6 +1,6 @@
import numpy as np
from presidio_evaluator.crf_evaluator import CRFEvaluator
from presidio_evaluator.models.crf_model import CRFModel
from presidio_evaluator.data_generator import read_synth_dataset
@ -12,7 +12,7 @@ def no_test_test_crf_simple():
model_path = os.path.abspath(os.path.join(dir_path, "..", "model-outputs/crf.pickle"))
crf_evaluator = CRFEvaluator(model_pickle_path=model_path,entities_to_keep=['PERSON'])
crf_evaluator = CRFModel(model_pickle_path=model_path, entities_to_keep=['PERSON'])
evaluation_results = crf_evaluator.evaluate_all(input_samples)
scores = crf_evaluator.calculate_score(evaluation_results)

298
tests/test_evaluator.py Normal file
View file

@ -0,0 +1,298 @@
import numpy as np
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import EvaluationResult, Evaluator
from tests.mocks import (
IdentityTokensMockModel,
FiftyFiftyIdentityTokensMockModel,
MockTokensModel,
)
def test_evaluator_simple():
prediction = ["O", "O", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=["ANIMAL"])
evaluator = Evaluator(model=model)
sample = InputSample(
full_text="I am the walrus", masked="I am the [ANIMAL]", spans=None
)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluated = evaluator.evaluate_sample(sample, prediction)
final_evaluation = evaluator.calculate_score([evaluated])
assert final_evaluation.pii_precision == 1
assert final_evaluation.pii_recall == 1
def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
prediction = ["O", "O", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["SPACESHIP"])
sample = InputSample(
full_text="I am the walrus", masked="I am the [ANIMAL]", spans=None
)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluated = evaluator.evaluate_sample(sample, prediction)
assert evaluated.results[("O", "O")] == 4
def test_evaluate_same_entity_correct_statistics():
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
sample = InputSample(
full_text="I dog the walrus", masked="I [ANIMAL] the [ANIMAL]", spans=None
)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluation_result = evaluator.evaluate_sample(sample, prediction)
assert evaluation_result.results[("O", "O")] == 2
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
assert evaluation_result.results[("O", "ANIMAL")] == 1
def test_evaluate_multiple_entities_to_keep_correct_statistics():
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
entities_to_keep = ["ANIMAL", "PLANT", "SPACESHIP"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=entities_to_keep)
sample = InputSample(
full_text="I dog the walrus", masked="I [ANIMAL] the [ANIMAL]", spans=None
)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluation_result = evaluator.evaluate_sample(sample, prediction)
assert evaluation_result.results[("O", "O")] == 2
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
assert evaluation_result.results[("O", "ANIMAL")] == 1
def test_evaluate_multiple_tokens_correct_statistics():
prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
sample = InputSample(
"I am the walrus amaericanus magnifico", masked=None, spans=None
)
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = evaluator.evaluate_sample(sample, prediction)
evaluation = evaluator.calculate_score([evaluated])
assert evaluation.pii_precision == 1
assert evaluation.pii_recall == 1
def test_evaluate_multiple_tokens_partial_match_correct_statistics():
prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
sample = InputSample(
"I am the walrus amaericanus magnifico", masked=None, spans=None
)
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = evaluator.evaluate_sample(sample, prediction)
evaluation = evaluator.calculate_score([evaluated])
assert evaluation.pii_precision == 1
assert evaluation.pii_recall == 4 / 6
def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
sample = InputSample(
"I am the walrus amaericanus magnifico", masked=None, spans=None
)
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = evaluator.evaluate_sample(sample, prediction)
evaluation = evaluator.calculate_score([evaluated])
assert np.isnan(evaluation.pii_precision)
assert evaluation.pii_recall == 0
def test_evaluate_multiple_examples_correct_statistics():
prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["PERSON"])
input_sample = InputSample("My name is Raphael or David", masked=None, spans=None)
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
evaluated = evaluator.evaluate_all(
[input_sample, input_sample, input_sample, input_sample]
)
scores = evaluator.calculate_score(evaluated)
assert scores.pii_precision == 0.5
assert scores.pii_recall == 0.5
def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
model = MockTokensModel(prediction=prediction)
evaluator = Evaluator(model=model, entities_to_keep=["PERSON", "TENNIS_PLAYER"])
input_sample = InputSample("My name is Raphael or David", masked=None, spans=None)
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
evaluated = evaluator.evaluate_all(
[input_sample, input_sample, input_sample, input_sample]
)
scores = evaluator.calculate_score(evaluated)
assert scores.pii_precision == 1
assert scores.pii_recall == 1
def test_confusion_matrix_correct_metrics():
from collections import Counter
evaluated = [
EvaluationResult(
results=Counter(
{
("O", "O"): 150,
("O", "PERSON"): 30,
("O", "COMPANY"): 30,
("PERSON", "PERSON"): 40,
("COMPANY", "COMPANY"): 40,
("PERSON", "COMPANY"): 10,
("COMPANY", "PERSON"): 10,
("PERSON", "O"): 30,
("COMPANY", "O"): 30,
}
),
model_errors=None,
text=None,
)
]
model = MockTokensModel(prediction=None)
evaluator = Evaluator(model=model, entities_to_keep=["PERSON", "COMPANY"])
scores = evaluator.calculate_score(evaluated, beta=2.5)
assert scores.pii_precision == 0.625
assert scores.pii_recall == 0.625
assert scores.entity_recall_dict["PERSON"] == 0.5
assert scores.entity_precision_dict["PERSON"] == 0.5
assert scores.entity_recall_dict["COMPANY"] == 0.5
assert scores.entity_precision_dict["COMPANY"] == 0.5
def test_confusion_matrix_2_correct_metrics():
from collections import Counter
evaluated = [
EvaluationResult(
results=Counter(
{
("O", "O"): 65467,
("O", "ORG"): 4189,
("GPE", "O"): 3370,
("PERSON", "PERSON"): 2024,
("GPE", "PERSON"): 1488,
("GPE", "GPE"): 1033,
("O", "GPE"): 964,
("ORG", "ORG"): 914,
("O", "PERSON"): 834,
("GPE", "ORG"): 401,
("PERSON", "ORG"): 35,
("PERSON", "O"): 33,
("ORG", "O"): 8,
("PERSON", "GPE"): 5,
("ORG", "PERSON"): 1,
}
),
model_errors=None,
text=None,
)
]
model = MockTokensModel(prediction=None)
evaluator = Evaluator(model=model)
scores = evaluator.calculate_score(evaluated, beta=2.5)
pii_tp = (
evaluated[0].results[("PERSON", "PERSON")]
+ evaluated[0].results[("ORG", "ORG")]
+ evaluated[0].results[("GPE", "GPE")]
+ evaluated[0].results[("ORG", "GPE")]
+ evaluated[0].results[("ORG", "PERSON")]
+ evaluated[0].results[("GPE", "ORG")]
+ evaluated[0].results[("GPE", "PERSON")]
+ evaluated[0].results[("PERSON", "GPE")]
+ evaluated[0].results[("PERSON", "ORG")]
)
pii_fp = (
evaluated[0].results[("O", "PERSON")]
+ evaluated[0].results[("O", "GPE")]
+ evaluated[0].results[("O", "ORG")]
)
pii_fn = (
evaluated[0].results[("PERSON", "O")]
+ evaluated[0].results[("GPE", "O")]
+ evaluated[0].results[("ORG", "O")]
)
assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)
def test_dataset_to_metric_identity_model():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
"{}/data/generated_small.txt".format(dir_path), length=10
)
model = IdentityTokensMockModel()
evaluator = Evaluator(model=model)
evaluation_results = evaluator.evaluate_all(input_samples)
metrics = evaluator.calculate_score(evaluation_results)
assert metrics.pii_precision == 1
assert metrics.pii_recall == 1
def test_dataset_to_metric_50_50_model():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
"{}/data/generated_small.txt".format(dir_path), length=100
)
# Replace 50% of the predictions with a list of "O"
model = FiftyFiftyIdentityTokensMockModel()
evaluator = Evaluator(model=model, entities_to_keep=["PERSON"])
evaluation_results = evaluator.evaluate_all(input_samples)
metrics = evaluator.calculate_score(evaluation_results)
print(metrics.pii_precision)
print(metrics.pii_recall)
print(metrics.pii_f)
assert metrics.pii_precision == 1
assert metrics.pii_recall < 0.75
assert metrics.pii_recall > 0.25

View file

@ -1,15 +1,18 @@
import pytest
from presidio_evaluator.evaluation import Evaluator
try:
    from flair.models import SequenceTagger
except ImportError:
    print("Flair is not installed by default")
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.flair_evaluator import FlairEvaluator
from presidio_evaluator.models.flair_model import FlairModel
import numpy as np
# no-unit because flair is not a dependency by default
@pytest.mark.skip(reason="Flair not installed by default")
def test_flair_simple():
@ -22,9 +25,10 @@ def test_flair_simple():
model = SequenceTagger.load("ner-ontonotes-fast") # .load('ner')
flair_evaluator = FlairEvaluator(model=model, entities_to_keep=["PERSON"])
evaluation_results = flair_evaluator.evaluate_all(input_samples)
scores = flair_evaluator.calculate_score(evaluation_results)
flair_model = FlairModel(model=model, entities_to_keep=["PERSON"])
evaluator = Evaluator(model=flair_model)
evaluation_results = evaluator.evaluate_all(input_samples)
scores = evaluator.calculate_score(evaluation_results)
np.testing.assert_almost_equal(
scores.pii_precision, scores.entity_precision_dict["PERSON"]

View file

@ -1,271 +0,0 @@
import numpy as np
import pytest
from presidio_evaluator import InputSample, EvaluationResult
from presidio_evaluator.data_generator import read_synth_dataset
from tests.mocks import IdentityTokensMockModel, \
FiftyFiftyIdentityTokensMockModel, MockTokensModel
def test_evaluator_simple():
prediction = ["O", "O", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample(full_text="I am the walrus",
masked="I am the [ANIMAL]",
spans=None)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluated = model.evaluate_sample(sample)
final_evaluation = model.calculate_score(
[evaluated])
assert final_evaluation.pii_precision == 1
assert final_evaluation.pii_recall == 1
def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
prediction = ["O", "O", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction,
entities_to_keep=['SPACESHIP'])
sample = InputSample(full_text="I am the walrus",
masked="I am the [ANIMAL]",
spans=None)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluated = model.evaluate_sample(sample)
assert evaluated.results[("O", "O")] == 4
def test_evaluate_same_entity_correct_statistics():
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample(full_text="I dog the walrus",
masked="I [ANIMAL] the [ANIMAL]",
spans=None)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluation_result = model.evaluate_sample(sample)
assert evaluation_result.results[("O", "O")] == 2
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
assert evaluation_result.results[("O", "ANIMAL")] == 1
def test_evaluate_multiple_entities_to_keep_correct_statistics():
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, labeling_scheme='BIO',
entities_to_keep=['ANIMAL', 'PLANT', 'SPACESHIP'])
sample = InputSample(full_text="I dog the walrus",
masked="I [ANIMAL] the [ANIMAL]",
spans=None)
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluation_result = model.evaluate_sample(sample)
assert evaluation_result.results[("O", "O")] == 2
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
assert evaluation_result.results[("O", "ANIMAL")] == 1
def test_evaluate_multiple_tokens_correct_statistics():
prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
spans=None)
sample.tokens = ["I", "am", "the",
"walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O",
"B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = model.evaluate_sample(sample)
evaluation = model.calculate_score(
[evaluated])
assert evaluation.pii_precision == 1
assert evaluation.pii_recall == 1
def test_evaluate_multiple_tokens_partial_match_correct_statistics():
prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
spans=None)
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = model.evaluate_sample(sample)
evaluation = model.calculate_score(
[evaluated])
assert evaluation.pii_precision == 1
assert evaluation.pii_recall == 4 / 6
def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
spans=None)
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = model.evaluate_sample(sample)
evaluation = model.calculate_score(
[evaluated])
assert np.isnan(evaluation.pii_precision)
assert evaluation.pii_recall == 0
def test_evaluate_multiple_examples_correct_statistics():
prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
model = MockTokensModel(prediction=prediction,
labeling_scheme='BILOU',
entities_to_keep=['PERSON'])
input_sample = InputSample("My name is Raphael or David", masked=None,
spans=None)
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
evaluated = model.evaluate_all(
[input_sample, input_sample, input_sample, input_sample])
scores = model.calculate_score(
evaluated)
assert scores.pii_precision == 0.5
assert scores.pii_recall == 0.5
def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
model = MockTokensModel(prediction=prediction,
labeling_scheme='BILOU',
entities_to_keep=['PERSON', 'TENNIS_PLAYER'])
input_sample = InputSample("My name is Raphael or David", masked=None,
spans=None)
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
evaluated = model.evaluate_all(
[input_sample, input_sample, input_sample, input_sample])
scores = model.calculate_score(evaluated)
assert scores.pii_precision == 1
assert scores.pii_recall == 1
def test_confusion_matrix_correct_metrics():
from collections import Counter
evaluated = [EvaluationResult(results=Counter({
('O', 'O'): 150,
('O', 'PERSON'): 30,
('O', 'COMPANY'): 30,
('PERSON', 'PERSON'): 40,
('COMPANY', 'COMPANY'): 40,
('PERSON', 'COMPANY'): 10,
('COMPANY', 'PERSON'): 10,
('PERSON', 'O'): 30,
('COMPANY', 'O'): 30}), model_errors=None, text=None)]
model = MockTokensModel(prediction=None,
entities_to_keep=['PERSON', 'COMPANY'])
scores = model.calculate_score(evaluated, beta=2.5)
assert scores.pii_precision == 0.625
assert scores.pii_recall == 0.625
assert scores.entity_recall_dict['PERSON'] == 0.5
assert scores.entity_precision_dict['PERSON'] == 0.5
assert scores.entity_recall_dict['COMPANY'] == 0.5
assert scores.entity_precision_dict['COMPANY'] == 0.5
def test_confusion_matrix_2_correct_metrics():
from collections import Counter
evaluated = [EvaluationResult(results=Counter(
{('O', 'O'): 65467,
('O', 'ORG'): 4189,
('GPE', 'O'): 3370,
('PERSON', 'PERSON'): 2024,
('GPE', 'PERSON'): 1488,
('GPE', 'GPE'): 1033,
('O', 'GPE'): 964,
('ORG', 'ORG'): 914,
('O', 'PERSON'): 834,
('GPE', 'ORG'): 401,
('PERSON', 'ORG'): 35,
('PERSON', 'O'): 33,
('ORG', 'O'): 8,
('PERSON', 'GPE'): 5,
('ORG', 'PERSON'): 1}), model_errors=None, text=None)]
model = MockTokensModel(prediction=None)
scores = model.calculate_score(evaluated, beta=2.5)
pii_tp = evaluated[0].results[('PERSON', 'PERSON')] + \
evaluated[0].results[('ORG', 'ORG')] + \
evaluated[0].results[('GPE', 'GPE')] + \
evaluated[0].results[('ORG', 'GPE')] + \
evaluated[0].results[('ORG', 'PERSON')] + \
evaluated[0].results[('GPE', 'ORG')] + \
evaluated[0].results[('GPE', 'PERSON')] + \
evaluated[0].results[('PERSON', 'GPE')] + \
evaluated[0].results[('PERSON', 'ORG')]
pii_fp = evaluated[0].results[('O', 'PERSON')] + \
evaluated[0].results[('O', 'GPE')] + \
evaluated[0].results[('O', 'ORG')]
pii_fn = evaluated[0].results[('PERSON', 'O')] + \
evaluated[0].results[('GPE', 'O')] + \
evaluated[0].results[('ORG', 'O')]
assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)
def test_dataset_to_metric_identity_model():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
"{}/data/generated_small.txt".format(dir_path), length=10)
model = IdentityTokensMockModel()
evaluation_results = model.evaluate_all(input_samples)
metrics = model.calculate_score(
evaluation_results)
assert metrics.pii_precision == 1
assert metrics.pii_recall == 1
def test_dataset_to_metric_50_50_model():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
"{}/data/generated_small.txt".format(dir_path), length=100)
# Replace 50% of the predictions with a list of "O"
model = FiftyFiftyIdentityTokensMockModel(entities_to_keep='PERSON')
evaluation_results = model.evaluate_all(input_samples)
metrics = model.calculate_score(
evaluation_results)
print(metrics.pii_precision)
print(metrics.pii_recall)
print(metrics.pii_f)
assert metrics.pii_precision == 1
assert metrics.pii_recall < 0.75
assert metrics.pii_recall > 0.25

View file

@ -2,27 +2,8 @@ import pytest
from presidio_evaluator import InputSample, Span
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_analyzer_evaluator import PresidioAnalyzerEvaluator
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
entities_mapping = {
"PERSON": "PERSON",
"EMAIL": "EMAIL_ADDRESS",
"CREDIT_CARD": "CREDIT_CARD",
"FIRST_NAME": "PERSON",
"PHONE_NUMBER": "PHONE_NUMBER",
"BIRTHDAY": "DATE_TIME",
"DATE": "DATE_TIME",
"DOMAIN": "DOMAIN",
"CITY": "LOCATION",
"ADDRESS": "LOCATION",
"IBAN": "IBAN_CODE",
"URL": "DOMAIN_NAME",
"US_SSN": "US_SSN",
"IP_ADDRESS": "IP_ADDRESS",
"ORGANIZATION": "ORG",
"O": "O",
}
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_analyzer_wrapper import PresidioAnalyzerWrapper
class GeneratedTextTestCase:
@ -54,8 +35,7 @@ analyzer_test_generate_text_testdata = [
def test_analyzer_simple_input():
model = PresidioAnalyzerEvaluator(entities_to_keep=["PERSON"])
model = PresidioAnalyzerWrapper(entities_to_keep=["PERSON"])
sample = InputSample(
full_text="My name is Mike",
masked="My name is [PERSON]",
@ -63,8 +43,11 @@ def test_analyzer_simple_input():
create_tags_from_span=True,
)
evaluated = model.evaluate_sample(sample)
metrics = model.calculate_score([evaluated])
prediction = model.predict(sample)
evaluator = Evaluator(model=model)
evaluated = evaluator.evaluate_sample(sample, prediction)
metrics = evaluator.calculate_score([evaluated])
assert metrics.pii_precision == 1
assert metrics.pii_recall == 1
@ -89,13 +72,14 @@ def test_analyzer_with_generated_text(test_input, acceptance_threshold):
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(test_input.format(dir_path))
updated_samples = PresidioAnalyzerEvaluator.align_input_samples_to_presidio_analyzer(
input_samples=input_samples, entities_mapping=entities_mapping
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(
input_samples=input_samples, entities_mapping=PresidioAnalyzerWrapper.presidio_entities_map
)
analyzer = PresidioAnalyzerEvaluator()
evaluated_samples = analyzer.evaluate_all(updated_samples)
scores = analyzer.calculate_score(evaluation_results=evaluated_samples)
analyzer = PresidioAnalyzerWrapper()
evaluator = Evaluator(model=analyzer)
evaluated_samples = evaluator.evaluate_all(updated_samples)
scores = evaluator.calculate_score(evaluation_results=evaluated_samples)
assert acceptance_threshold <= scores.pii_precision
assert acceptance_threshold <= scores.pii_recall

View file

@ -8,19 +8,16 @@ import pandas as pd
@pytest.mark.parametrize(
# fmt: off
"text, entity1, entity2, start1, end1, start2, end2",
[
(
"Hi I live in South Africa and my name is Toma",
"LOCATION",
"PERSON",
13,
25,
41,
45,
"LOCATION", "PERSON", 13, 25, 41, 45,
),
("Africa is my continent, James", "LOCATION", "PERSON", 0, 6, 24, 29,),
],
# fmt: on
)
def test_presidio_perturb_two_entities(
text, entity1, entity2, start1, end1, start2, end2
@ -51,15 +48,13 @@ def test_entity_translation():
RecognizerResult(entity_type="EMAIL_ADDRESS", start=12, end=27, score=0.5)
]
presidio_perturb = PresidioPerturb(
fake_pii_df=get_mock_fake_df(), entity_dict={"EMAIL_ADDRESS": "EMAIL"}
)
presidio_perturb = PresidioPerturb(fake_pii_df=get_mock_fake_df())
fake_df = presidio_perturb.fake_pii
perturbations = presidio_perturb.perturb(
original_text=text, presidio_response=presidio_response, count=1
)
assert fake_df["EMAIL"].str.lower()[0] in perturbations[0]
assert fake_df["EMAIL_ADDRESS"].str.lower()[0] in perturbations[0]
def test_subset_perturbation():
@ -76,7 +71,7 @@ def test_subset_perturbation():
"NameSet": ["Hebrew", "English"],
}
)
ignore_types = ("DATE", "LOCATION", "ADDRESS", "GENDER")
ignore_types = {"DATE", "LOCATION", "ADDRESS", "GENDER"}
presidio_perturb = PresidioPerturb(fake_pii_df=fake_df, ignore_types=ignore_types)

View file

@ -1,8 +1,8 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_recognizer_evaluator import score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest
from presidio_analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer
# test case parameters for tests with dataset which was previously generated.
class GeneratedTextTestCase:
@ -13,8 +13,12 @@ class GeneratedTextTestCase:
self.marks = marks
def to_pytest_param(self):
return pytest.param(self.test_input, self.acceptance_threshold,
id=self.test_name, marks=self.marks)
return pytest.param(
self.test_input,
self.acceptance_threshold,
id=self.test_name,
marks=self.marks,
)
# generated-text test cases
@ -24,35 +28,39 @@ cc_test_generate_text_testdata = [
test_name="small-set",
test_input="{}/data/generated_small.txt",
acceptance_threshold=1,
marks=pytest.mark.none
marks=pytest.mark.none,
),
# large set fixture which expects all type results. marked as "slow"
GeneratedTextTestCase(
test_name="large_set",
test_input="{}/data/generated_large.txt",
acceptance_threshold=1,
marks=pytest.mark.slow
)
marks=pytest.mark.slow,
),
]
# credit card recognizer tests on generated data
@pytest.mark.parametrize("test_input,acceptance_threshold",
[testcase.to_pytest_param()
for testcase in cc_test_generate_text_testdata])
@pytest.mark.parametrize(
"test_input,acceptance_threshold",
[testcase.to_pytest_param() for testcase in cc_test_generate_text_testdata],
)
def test_credit_card_recognizer_with_generated_text(test_input, acceptance_threshold):
"""
Test credit card recognizer with a generated dataset text file
:param test_input: input text file location
:param acceptance_threshold: minimim precision/recall
allowed for tests to pass
Test credit card recognizer with a generated dataset text file
:param test_input: input text file location
    :param acceptance_threshold: minimum precision/recall
allowed for tests to pass
"""
# read test input from generated file
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
test_input.format(dir_path))
input_samples = read_synth_dataset(test_input.format(dir_path))
scores = score_presidio_recognizer(
CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
recognizer=CreditCardRecognizer(),
entities_to_keep=["CREDIT_CARD"],
input_samples=input_samples,
)
assert acceptance_threshold <= scores.pii_f

View file

@ -1,15 +1,25 @@
from presidio_evaluator.data_generator import generate
from presidio_evaluator.presidio_recognizer_evaluator import \
score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest
import numpy as np
from presidio_analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer
# test case parameters for tests with dataset generated from a template and csv values
class TemplateTextTestCase:
def __init__(self, test_name, pii_csv, utterances, dictionary_path,
num_of_examples, acceptance_threshold, marks):
"""
Test case parameters for tests with dataset generated from a template and csv values
"""
def __init__(
self,
test_name,
pii_csv,
utterances,
dictionary_path,
num_of_examples,
acceptance_threshold,
marks,
):
self.test_name = test_name
self.pii_csv = pii_csv
self.utterances = utterances
@ -19,9 +29,15 @@ class TemplateTextTestCase:
self.marks = marks
def to_pytest_param(self):
return pytest.param(self.pii_csv, self.utterances, self.dictionary_path,
self.num_of_examples, self.acceptance_threshold,
id=self.test_name, marks=self.marks)
return pytest.param(
self.pii_csv,
self.utterances,
self.dictionary_path,
self.num_of_examples,
self.acceptance_threshold,
id=self.test_name,
marks=self.marks,
)
# template-dataset test cases
@ -34,46 +50,52 @@ cc_test_template_testdata = [
dictionary_path="{}/data/Dictionary_test.csv",
num_of_examples=100,
acceptance_threshold=0.9,
marks=pytest.mark.slow
marks=pytest.mark.slow,
)
]
# credit card recognizer tests on template-generated data
@pytest.mark.parametrize("pii_csv, "
"utterances, "
"dictionary_path, "
"num_of_examples, "
"acceptance_threshold",
[testcase.to_pytest_param()
for testcase in cc_test_template_testdata])
def test_credit_card_recognizer_with_template(pii_csv, utterances,
dictionary_path,
num_of_examples,
acceptance_threshold):
@pytest.mark.parametrize(
"pii_csv, "
"utterances, "
"dictionary_path, "
"num_of_examples, "
"acceptance_threshold",
[testcase.to_pytest_param() for testcase in cc_test_template_testdata],
)
def test_credit_card_recognizer_with_template(
pii_csv, utterances, dictionary_path, num_of_examples, acceptance_threshold
):
"""
Test credit card recognizer with a dataset generated from
template and a CSV values file
:param pii_csv: input csv file location
:param utterances: template file location
:param dictionary_path: dictionary/vocabulary file location
:param num_of_examples: number of samples to be used from dataset
to test
:param acceptance_threshold: minimim precision/recall
allowed for tests to pass
Test credit card recognizer with a dataset generated from
template and a CSV values file
:param pii_csv: input csv file location
:param utterances: template file location
:param dictionary_path: dictionary/vocabulary file location
:param num_of_examples: number of samples to be used from dataset
to test
:param acceptance_threshold: minimum precision/recall
allowed for tests to pass
"""
# read template and CSV files
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = generate(fake_pii_csv=pii_csv.format(dir_path),
utterances_file=utterances.format(dir_path),
dictionary_path=dictionary_path.format(dir_path),
lower_case_ratio=0.5,
num_of_examples=num_of_examples)
input_samples = generate(
fake_pii_csv=pii_csv.format(dir_path),
utterances_file=utterances.format(dir_path),
dictionary_path=dictionary_path.format(dir_path),
lower_case_ratio=0.5,
num_of_examples=num_of_examples,
)
scores = score_presidio_recognizer(
CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
recognizer=CreditCardRecognizer(),
entities_to_keep=["CREDIT_CARD"],
input_samples=input_samples,
)
if not np.isnan(scores.pii_f):
assert acceptance_threshold <= scores.pii_f

View file

@ -1,18 +1,32 @@
from presidio_evaluator.data_generator import FakeDataGenerator
from presidio_evaluator.presidio_recognizer_evaluator import \
score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pandas as pd
import pytest
import numpy as np
from presidio_analyzer import Pattern, PatternRecognizer
# test case parameters for tests with dataset generated from a template and
# two csv value files, one containing the common-entities and another one with custom entities
class PatternRecognizerTestCase:
def __init__(self, test_name, entity_name, pattern, score, pii_csv, ext_csv,
utterances, dictionary_path, num_of_examples, acceptance_threshold,
max_mistakes_number, marks):
"""
Test case parameters for tests with dataset generated from a template and
two csv value files, one containing the common-entities and another one with custom entities.
"""
def __init__(
self,
test_name,
entity_name,
pattern,
score,
pii_csv,
ext_csv,
utterances,
dictionary_path,
num_of_examples,
acceptance_threshold,
max_mistakes_number,
marks,
):
self.test_name = test_name
self.entity_name = entity_name
self.pattern = pattern
@ -27,12 +41,20 @@ class PatternRecognizerTestCase:
self.marks = marks
def to_pytest_param(self):
return pytest.param(self.pii_csv, self.ext_csv, self.utterances,
self.dictionary_path,
self.entity_name, self.pattern, self.score,
self.num_of_examples, self.acceptance_threshold,
self.max_mistakes_number, id=self.test_name,
marks=self.marks)
return pytest.param(
self.pii_csv,
self.ext_csv,
self.utterances,
self.dictionary_path,
self.entity_name,
self.pattern,
self.score,
self.num_of_examples,
self.acceptance_threshold,
self.max_mistakes_number,
id=self.test_name,
marks=self.marks,
)
# template-dataset test cases
@ -42,7 +64,7 @@ rocket_test_template_testdata = [
PatternRecognizerTestCase(
test_name="rocket-no-errors",
entity_name="ROCKET",
pattern=r'\W*(rocket)\W*',
pattern=r"\W*(rocket)\W*",
score=0.8,
pii_csv="{}/data/FakeNameGenerator.com_100.csv",
ext_csv="{}/data/FakeRocketGenerator.csv",
@ -51,14 +73,14 @@ rocket_test_template_testdata = [
num_of_examples=100,
acceptance_threshold=1,
max_mistakes_number=0,
marks=pytest.mark.slow
marks=pytest.mark.slow,
),
# large dataset fixture. marked as slow
# all input is correct, test is conclusive
PatternRecognizerTestCase(
test_name="rocket-all-errors",
entity_name="ROCKET",
pattern=r'\W*(rocket)\W*',
pattern=r"\W*(rocket)\W*",
score=0.8,
pii_csv="{}/data/FakeNameGenerator.com_100.csv",
ext_csv="{}/data/FakeRocketErrorsGenerator.csv",
@ -67,14 +89,14 @@ rocket_test_template_testdata = [
num_of_examples=100,
acceptance_threshold=0,
max_mistakes_number=100,
marks=pytest.mark.slow
marks=pytest.mark.slow,
),
# large dataset fixture. marked as slow
# some input is correct some is not, test is inconclusive
PatternRecognizerTestCase(
test_name="rocket-some-errors",
entity_name="ROCKET",
pattern=r'\W*(rocket)\W*',
pattern=r"\W*(rocket)\W*",
score=0.8,
pii_csv="{}/data/FakeNameGenerator.com_100.csv",
ext_csv="{}/data/FakeRocket50PercentErrorsGenerator.csv",
@ -83,8 +105,8 @@ rocket_test_template_testdata = [
num_of_examples=100,
acceptance_threshold=0.3,
max_mistakes_number=70,
marks=[pytest.mark.slow, pytest.mark.inconclusive]
)
marks=[pytest.mark.slow, pytest.mark.inconclusive],
),
]
@ -92,30 +114,39 @@ rocket_test_template_testdata = [
"pii_csv, ext_csv, utterances, dictionary_path, "
"entity_name, pattern, score, num_of_examples, "
"acceptance_threshold, max_mistakes_number",
[testcase.to_pytest_param()
for testcase in rocket_test_template_testdata])
def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
entity_name, pattern,
score, num_of_examples, acceptance_threshold,
max_mistakes_number):
[testcase.to_pytest_param() for testcase in rocket_test_template_testdata],
)
def test_pattern_recognizer(
pii_csv,
ext_csv,
utterances,
dictionary_path,
entity_name,
pattern,
score,
num_of_examples,
acceptance_threshold,
max_mistakes_number,
):
"""
Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
and another CSV values file with a custom entity
:param pii_csv: input csv file location with the common entities
:param ext_csv: input csv file location with custom entities
:param utterances: template file location
:param dictionary_path: vocabulary/dictionary file location
:param entity_name: custom entity name
:param pattern: recognizer pattern
:param num_of_examples: number of samples to be used from dataset to test
:param acceptance_threshold: minimim precision/recall
allowed for tests to pass
Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
and another CSV values file with a custom entity
:param pii_csv: input csv file location with the common entities
:param ext_csv: input csv file location with custom entities
:param utterances: template file location
:param dictionary_path: vocabulary/dictionary file location
:param entity_name: custom entity name
:param pattern: recognizer pattern
:param num_of_examples: number of samples to be used from dataset to test
:param acceptance_threshold: minimum precision/recall
allowed for tests to pass
"""
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
dfpii = pd.read_csv(pii_csv.format(dir_path), encoding='utf-8')
dfext = pd.read_csv(ext_csv.format(dir_path), encoding='utf-8')
dfpii = pd.read_csv(pii_csv.format(dir_path), encoding="utf-8")
dfext = pd.read_csv(ext_csv.format(dir_path), encoding="utf-8")
dictionary_path = dictionary_path.format(dir_path)
ext_column_name = dfext.columns[0]
@ -127,18 +158,23 @@ def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
dfpii[ext_column_name] = [get_from_ext(i) for i in range(0, dfpii.shape[0])]
# generate examples
generator = FakeDataGenerator(fake_pii_csv_file=dfpii,
utterances_file=utterances.format(dir_path),
dictionary_path=dictionary_path)
generator = FakeDataGenerator(
fake_pii_df=dfpii,
templates=utterances.format(dir_path),
dictionary_path=dictionary_path,
)
examples = generator.sample_examples(num_of_examples)
pattern = Pattern("test pattern", pattern, score)
pattern_recognizer = PatternRecognizer(entity_name,
name="test recognizer",
patterns=[pattern])
pattern_recognizer = PatternRecognizer(
entity_name, name="test recognizer", patterns=[pattern]
)
scores = score_presidio_recognizer(
pattern_recognizer, [entity_name], examples)
recognizer=pattern_recognizer,
entities_to_keep=[entity_name],
input_samples=examples,
)
if not np.isnan(scores.pii_f):
assert acceptance_threshold <= scores.pii_f
assert max_mistakes_number >= len(scores.model_errors)

View file

@ -1,5 +1,6 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.spacy_evaluator import SpacyEvaluator
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.spacy_model import SpacyModel
import numpy as np
@ -8,9 +9,10 @@ def test_spacy_simple():
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
evaluation_results = spacy_evaluator.evaluate_all(input_samples)
scores = spacy_evaluator.calculate_score(evaluation_results)
spacy_model = SpacyModel(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
evaluator = Evaluator(model=spacy_model)
evaluation_results = evaluator.evaluate_all(input_samples)
scores = evaluator.calculate_score(evaluation_results)
np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])

View file

@ -1,12 +1,14 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_recognizer_evaluator import \
score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest
from presidio_analyzer.predefined_recognizers.spacy_recognizer import SpacyRecognizer
# test case parameters for tests with dataset which was previously generated.
class GeneratedTextTestCase:
"""
Test case parameters for tests with dataset which was previously generated.
"""
def __init__(self, test_name, test_input, acceptance_threshold, marks):
self.test_name = test_name
self.test_input = test_input
@ -14,8 +16,12 @@ class GeneratedTextTestCase:
self.marks = marks
def to_pytest_param(self):
return pytest.param(self.test_input, self.acceptance_threshold,
id=self.test_name, marks=self.marks)
return pytest.param(
self.test_input,
self.acceptance_threshold,
id=self.test_name,
marks=self.marks,
)
# generated-text test cases
@ -25,35 +31,37 @@ cc_test_generate_text_testdata = [
test_name="small-set",
test_input="{}/data/generated_small.txt",
acceptance_threshold=0.5,
marks=pytest.mark.inconclusive
marks=pytest.mark.inconclusive,
),
# large dataset - test is slow and inconclusive
GeneratedTextTestCase(
test_name="large-set",
test_input="{}/data/generated_large.txt",
acceptance_threshold=0.5,
marks=pytest.mark.slow
)
marks=pytest.mark.slow,
),
]
# credit card recognizer tests on generated data
@pytest.mark.parametrize("test_input,acceptance_threshold",
[testcase.to_pytest_param() for testcase in
cc_test_generate_text_testdata])
@pytest.mark.parametrize(
"test_input,acceptance_threshold",
[testcase.to_pytest_param() for testcase in cc_test_generate_text_testdata],
)
def test_spacy_recognizer_with_generated_text(test_input, acceptance_threshold):
"""
Test spacy recognizer with a generated dataset text file
:param test_input: input text file location
:param acceptance_threshold: minimim precision/recall
allowed for tests to pass
Test spacy recognizer with a generated dataset text file
:param test_input: input text file location
    :param acceptance_threshold: minimum precision/recall
allowed for tests to pass
"""
# read test input from generated file
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
test_input.format(dir_path))
input_samples = read_synth_dataset(test_input.format(dir_path))
scores = score_presidio_recognizer(
SpacyRecognizer(), ['PERSON'], input_samples, True)
SpacyRecognizer(), ["PERSON"], input_samples, with_nlp_artifacts=True
)
assert acceptance_threshold <= scores.pii_f

View file

@ -2,8 +2,9 @@ from presidio_evaluator import span_to_tag
BILOU_SCHEME = "BILOU"
BIO_SCHEME = "BIO"
IO_SCHEME = "IO"
# fmt: off
def test_span_to_bio_multiple_tokens():
text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
start = 14
@ -166,8 +167,7 @@ def test_overlapping_entities_first_ends_in_mid_second():
expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
'US_PHONE_NUMBER', 'US_PHONE_NUMBER',
'O', 'O', 'O', 'O']
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
io_tags_only=True)
io = span_to_tag(IO_SCHEME, text, start, end, tag, scores)
assert io == expected
@ -180,8 +180,7 @@ def test_overlapping_entities_second_embedded_in_first_with_lower_score():
expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'PHONE_NUMBER',
'PHONE_NUMBER', 'PHONE_NUMBER',
'O', 'O', 'O', 'O']
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
io_tags_only=True)
io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
assert io == expected
@ -194,8 +193,7 @@ def test_overlapping_entities_second_embedded_in_first_has_higher_score():
expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
'PHONE_NUMBER', 'PHONE_NUMBER',
'O', 'O', 'O', 'O']
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
io_tags_only=True)
io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
assert io == expected
@ -207,6 +205,6 @@ def test_overlapping_entities_pyramid():
tag = ["A1", "B2","C3"]
expected = ['O', 'O', 'O', 'O', 'O', 'A1', 'B2', 'C3', 'B2',
'A1', 'O', 'O', 'O', 'O']
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
io_tags_only=True)
io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
assert io == expected
# fmt: on