Updates to Presidio 2 and spaCy 3

Parent: e2528bdca7
Commit: 83bb254b5d

README.md (45 lines changed)
@ -11,15 +11,15 @@ To install the package, clone the repo and install all dependencies, preferably

``` sh
# Create conda env (optional)
conda create --name presidio python=3.7
conda create --name presidio python=3.8
conda activate presidio

# Install package+dependencies
pip install -r requirements.txt
python setup.py install

# Optionally link in the local development copy of presidio-analyzer
pip install -e [path to presidio-analyzer]

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

# Verify installation
pytest
@ -58,7 +58,7 @@ In order to standardize the process, we use specific data objects that hold all

## 3. Recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [ModelEvaluator](presidio_evaluator/model_evaluator.py) class. It provides a structured way of evaluating models and recognizers.
The main logic lies in the [ModelEvaluator](presidio_evaluator/models/base_model.py) class. It provides a structured way of evaluating models and recognizers.
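To make the structure concrete, below is a minimal, hedged sketch of a custom model wrapper. It assumes the import path implied by the README link above and the `predict(InputSample) -> List[str]` contract used by the CRF wrapper elsewhere in this diff; names may differ in the actual package.

```python
from typing import List

from presidio_evaluator import InputSample
# Import path follows the README link above; treat it as an assumption.
from presidio_evaluator.models.base_model import ModelEvaluator


class UppercaseHeuristicModel(ModelEvaluator):
    """Toy wrapper: tags capitalized tokens as PERSON and everything else as O."""

    def predict(self, sample: InputSample) -> List[str]:
        # One tag per token, mirroring the contract of the CRF wrapper below.
        return ["PERSON" if token.text.istitle() else "O" for token in sample.tokens]


# Usage (assuming the base class accepts these keyword arguments, as the CRF wrapper does):
# model = UppercaseHeuristicModel(entities_to_keep=["PERSON"])
```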

### Ready evaluators

@ -72,14 +72,14 @@ Allows you to evaluate an existing Presidio deployment through the API. [See thi

Allows you to evaluate the local Presidio-Analyzer package. Faster than the API option but requires you to have Presidio-Analyzer installed locally. [See this class for more information](presidio_evaluator/presidio_analyzer.py)
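Before running a full evaluation, a quick sanity check that the locally installed presidio-analyzer (Presidio 2) responds can look like the sketch below; the sample text is illustrative.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # loads the default spaCy-based NLP engine
results = analyzer.analyze(
    text="My name is David and my email is david@example.com", language="en"
)
for res in results:
    print(res.entity_type, res.start, res.end, res.score)
```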

#### 3. One recognizer evaluator
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/presidio_recognizer_evaluator.py)
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/models/presidio_recognizer_wrapper.py)


## 4. Modeling

### Conditional Random Fields
To train a CRF on a new dataset, see [this notebook](notebooks/models/CRF.ipynb).
To evaluate a CRF model, see the [same notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/crf_evaluator.py).
To evaluate a CRF model, see the [same notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
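For context, the sketch below shows how such a CRF is typically trained with sklearn-crfsuite on (token, POS, label) sentences, using a reduced version of the feature functions that appear in the CRF code later in this diff; the toy training data and output path are assumptions.

```python
import pickle

import sklearn_crfsuite


def token_features(sentence, i):
    """Minimal per-token features; the repo's CRF code uses a richer set."""
    word, postag, _ = sentence[i]
    return {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "postag": postag,
    }


# Toy CoNLL-style data: one sentence of (token, POS tag, label) tuples.
train_sentences = [
    [("My", "PRP$", "O"), ("name", "NN", "O"), ("is", "VBZ", "O"), ("David", "NNP", "U-PERSON")],
]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sentences]
y_train = [[label for _, _, label in s] for s in train_sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

with open("crf.pickle", "wb") as f:  # file name mirrors the default in the CRF code
    pickle.dump(crf, f)
```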

### spaCy based models
There are three ways of interacting with spaCy models:
@ -93,39 +93,6 @@ See [this notebook for creating spaCy datasets](notebooks/models/Create%20datase

#### Evaluate an existing trained model
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
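As a rough illustration of what that evaluation inspects, the sketch below runs a spaCy model over a synthetic dataset and prints predicted versus annotated entities. It assumes `read_synth_dataset` returns `InputSample` objects (as the sanity check in the data generator's main module suggests); the dataset path is hypothetical.

```python
import spacy

from presidio_evaluator.data_generator import read_synth_dataset

nlp = spacy.load("en_core_web_lg")
samples = read_synth_dataset("data/generated_test.json")  # hypothetical path

for sample in samples[:5]:
    doc = nlp(sample.full_text)
    print(sample.full_text)
    print("Annotated:", [(span.entity_type, span.entity_value) for span in sample.spans])
    print("Predicted:", [(ent.label_, ent.text) for ent in doc.ents])
```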

#### Train with pretrained embeddings
In order to train a new spaCy model from scratch with pretrained embeddings (FastText wiki news subword in this case), follow these three steps:

##### 1. Download FastText pretrained (sub) word embeddings
``` sh
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
```

##### 2. Init spaCy model with pre-trained embeddings
Using spaCy CLI:
``` sh
python -m spacy init-model en spacy_fasttext --vectors-loc wiki-news-300d-1M-subword.vec
```

##### 3. Train spaCy NER model
Using spaCy CLI:
``` sh
python -m spacy train en spacy_fasttext_100 train.json test.json --vectors spacy_fasttext --pipeline ner -n 100
```

#### Fine-tune an existing spaCy model
See [this code for retraining an existing spaCy model](models/spacy_retrain.py).
First, create pickle files for your train and test sets (see [this notebook](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb) for more information), then run a SpacyRetrainer:

```python
from models import SpacyRetrainer
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
                                 experiment_name='new_spacy_experiment',
                                 n_iter=500, dropout=0.1, aml_config=None)
spacy_retrainer.run()
```

### Flair based models
To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
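For example, the `embeddings` stack could be swapped for a different combination, as in the sketch below, which only uses the Flair classes already imported in flair_train.py; the specific embedding names are illustrative.

```python
from typing import List

from flair.embeddings import (
    FlairEmbeddings,
    StackedEmbeddings,
    TokenEmbeddings,
    WordEmbeddings,
)

# Replace the `embeddings` object in FlairTrainer.train with a different stack,
# e.g. GloVe word embeddings plus contextual Flair embeddings.
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
```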

VERSION (2 lines changed)
|
@ -1,2 +1,2 @@
|
|||
0.0
|
||||
0.0.2
|
||||
|
||||
|
|
|
@ -15,11 +15,12 @@ pool:
|
|||
vmImage: 'ubuntu-latest'
|
||||
strategy:
|
||||
matrix:
|
||||
Python36:
|
||||
python.version: '3.6'
|
||||
Python37:
|
||||
python.version: '3.7'
|
||||
|
||||
Python38:
|
||||
python.version: '3.8'
|
||||
Python39:
|
||||
python.version: '3.9'
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
|
|
|
@ -5481,7 +5481,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "SvenZimmer@fleckens.hu",
|
||||
"start_position": 39,
|
||||
"end_position": 61
|
||||
|
@ -9288,7 +9288,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "EmilySanderson@jourrapide.com",
|
||||
"start_position": 59,
|
||||
"end_position": 88
|
||||
|
@ -20492,7 +20492,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "NatalinaLucchese@superrito.com",
|
||||
"start_position": 59,
|
||||
"end_position": 89
|
||||
|
@ -25723,7 +25723,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "HannaUkkonen@dayrep.com",
|
||||
"start_position": 39,
|
||||
"end_position": 62
|
||||
|
@ -32783,7 +32783,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "yahyaeriksson@gustr.com",
|
||||
"start_position": 23,
|
||||
"end_position": 46
|
||||
|
@ -40833,7 +40833,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "VictorAndreyev@cuvox.de",
|
||||
"start_position": 23,
|
||||
"end_position": 46
|
||||
|
@ -44468,7 +44468,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "HarrisonBarnes@fleckens.hu",
|
||||
"start_position": 59,
|
||||
"end_position": 85
|
||||
|
@ -49165,7 +49165,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "MathiasEJespersen@armyspy.com",
|
||||
"start_position": 23,
|
||||
"end_position": 52
|
||||
|
@ -62644,7 +62644,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "ElishaFedorov@fleckens.hu",
|
||||
"start_position": 39,
|
||||
"end_position": 64
|
||||
|
@ -68659,7 +68659,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "HartmannAntonsson@jourrapide.com",
|
||||
"start_position": 59,
|
||||
"end_position": 91
|
||||
|
@ -72669,7 +72669,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "MakarMaslow@teleworm.us",
|
||||
"start_position": 39,
|
||||
"end_position": 62
|
||||
|
|
|
@ -1,10 +1,13 @@
|
|||
from typing import List
|
||||
|
||||
from flair.data import Corpus, Sentence
|
||||
from flair.datasets import ColumnCorpus
|
||||
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
|
||||
from flair.models import SequenceTagger
|
||||
from flair.trainers import ModelTrainer
|
||||
try:
|
||||
from flair.data import Corpus, Sentence
|
||||
from flair.datasets import ColumnCorpus
|
||||
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
|
||||
from flair.models import SequenceTagger
|
||||
from flair.trainers import ModelTrainer
|
||||
except ImportError:
|
||||
print("Flair is not installed")
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
|
|
@ -1,206 +0,0 @@
|
|||
import logging
|
||||
import pickle
|
||||
import random
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import spacy
|
||||
from azureml.core import Workspace, Experiment
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
from presidio_evaluator import SpacyEvaluator, InputSample
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
root = logging.getLogger()
|
||||
root.setLevel(logging.INFO)
|
||||
|
||||
handler = logging.StreamHandler(sys.stdout)
|
||||
handler.setLevel(logging.INFO)
|
||||
root.addHandler(handler)
|
||||
|
||||
|
||||
class SpacyRetrainer:
|
||||
|
||||
def __init__(self, original_model_name=None, experiment_name=None, n_iter=100, dropout=0.5,
|
||||
aml_config='config.json', output_dir='../../model-outputs', train_pickle='../data/train.pickle',
|
||||
test_pickle='../data/test.pickle'):
|
||||
self.experiment_name = experiment_name
|
||||
if aml_config:
|
||||
self.ws = Workspace.from_config(aml_config)
|
||||
self.experiment = Experiment(workspace=self.ws, name=experiment_name)
|
||||
self.aml_run = self.experiment.start_logging()
|
||||
self.has_aml = True
|
||||
else:
|
||||
self.has_aml = False
|
||||
|
||||
self.model = original_model_name
|
||||
self.n_iter = n_iter
|
||||
self.output_dir = output_dir
|
||||
self.train_file = train_pickle
|
||||
self.test_file = test_pickle
|
||||
self.dropout = dropout
|
||||
|
||||
def run(self):
|
||||
if self.has_aml:
|
||||
self.aml_run.log("model", self.model)
|
||||
self.aml_run.log("n_iter", self.n_iter)
|
||||
self.aml_run.log("train_file", self.train_file)
|
||||
self.aml_run.log("test_file", self.test_file)
|
||||
self.aml_run.log("dropout rate", self.dropout)
|
||||
model_path = self._train(self.model, self.output_dir, self.n_iter, self.train_file, self.experiment_name)
|
||||
self._score_validate(model_path, self.test_file)
|
||||
if self.has_aml:
|
||||
self.aml_run.complete()
|
||||
|
||||
def print_scores(self, split, evaluation_result):
|
||||
"""
|
||||
Logs results into experiment run.
|
||||
:param split: Name of this split. For ex 'train' or 'valid'
|
||||
:param evaluation_result: EvaluationResult containing various metrics
|
||||
:return: None. Writes to experiment runner and logs locally.
|
||||
"""
|
||||
logging.info('SPLIT: {0}. PII_precision: {1}, PII_recall: {2},'
|
||||
'Person_precision: {3}, Person_recall: {4}'. \
|
||||
format(split, evaluation_result.pii_precision, evaluation_result.pii_recall,
|
||||
evaluation_result.entity_precision_dict['PERSON'],
|
||||
evaluation_result.entity_recall_dict['PERSON']))
|
||||
if self.has_aml:
|
||||
self.aml_run.log('Precision', evaluation_result.pii_precision, split)
|
||||
self.aml_run.log('Recall', evaluation_result.pii_recall, split)
|
||||
|
||||
@staticmethod
|
||||
def _score(model, data):
|
||||
"""
|
||||
Score the model against the data
|
||||
:param model: Trained model
|
||||
:param data: Data split which is being scored.
|
||||
:return: An EvaluationResult containing various metrics
|
||||
"""
|
||||
|
||||
spacy_evaluator = SpacyEvaluator(model=model)
|
||||
|
||||
results = []
|
||||
for text, ground_truth_annotations in data:
|
||||
ground_truth_entities = ground_truth_annotations['entities']
|
||||
input_sample = InputSample.from_spacy(text, ground_truth_entities)
|
||||
results.append(spacy_evaluator.evaluate_sample(input_sample))
|
||||
|
||||
return spacy_evaluator.calculate_score(evaluation_results=results)
|
||||
|
||||
def _score_validate(self, model_path, test_data_file):
|
||||
"""
|
||||
Validation step for the model. Also prints the scores.
|
||||
:param model_path: Path to trained model.
|
||||
:param test_data_file: Data file which has the dataset for this split.
|
||||
:return: None. Prints the scores.
|
||||
"""
|
||||
with open(test_data_file, 'rb') as f:
|
||||
valid_data = pickle.load(f)
|
||||
nlp = spacy.load(model_path)
|
||||
self.print_scores('Valid', self._score(nlp, valid_data))
|
||||
|
||||
# @plac.annotations(
|
||||
# model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
# output_dir=("Optional output directory", "option", "o", Path),
|
||||
# n_iter=("Number of training iterations", "option", "n", int),
|
||||
# train_file=("File containing pickled training Spacy NER formatted data", "option", "d", Path),
|
||||
# test_file=("File containing pickled test Spacy NER formatted data", "option", "d", Path),
|
||||
# exp_name=("Name of this experiment", "option", "e")
|
||||
# )
|
||||
|
||||
def _train(self, model, output_dir, n_iter, train_file, exp_name):
|
||||
"""Load the model, set up the pipeline and train the entity recognizer."""
|
||||
nlp = self.load_or_create_empty_model(model)
|
||||
|
||||
if "ner" not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner, last=True)
|
||||
else:
|
||||
ner = nlp.get_pipe("ner")
|
||||
|
||||
with open(train_file, 'rb') as f:
|
||||
train_data = pickle.load(f)
|
||||
|
||||
# DEBUG
|
||||
train_data = train_data[:50]
|
||||
|
||||
# add labels
|
||||
for _, annotations in train_data:
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
# reset and initialize the weights randomly – but only if we're
|
||||
# training a new model
|
||||
if model is None:
|
||||
nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(train_data)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, drop=self.dropout, losses=losses, )
|
||||
logging.debug("Losses", losses)
|
||||
if self.has_aml:
|
||||
self.aml_run.log('Losses', losses['ner'])
|
||||
self.print_scores('Itn {}'.format(itn), self._score(nlp, train_data))
|
||||
|
||||
self.print_scores('Train', self._score(nlp, train_data))
|
||||
|
||||
saved_model_path = self.save_model(exp_name, nlp, output_dir)
|
||||
return saved_model_path
|
||||
|
||||
@staticmethod
|
||||
def save_model(exp_name, model, output_dir):
|
||||
"""
|
||||
Saves model to disk for later use.
|
||||
:param exp_name: Name of the running experiment. This is used as folder name for storing the model.
|
||||
:param model: Model being saved
|
||||
:param output_dir: Directory where to save the model.
|
||||
:return: Full path to saved model.
|
||||
"""
|
||||
saved_model_path = Path(output_dir, exp_name)
|
||||
if not saved_model_path.exists():
|
||||
saved_model_path.mkdir(parents=True)
|
||||
model.to_disk(saved_model_path)
|
||||
logging.info("Saved model to {}".format(output_dir))
|
||||
return saved_model_path
|
||||
|
||||
@staticmethod
|
||||
def load_model(exp_name, model_dir):
|
||||
"""
|
||||
Loads a spacy model from disk
|
||||
|
||||
:param exp_name: Name of experiment under which the model was saved
|
||||
:param model_dir: path to saved model
|
||||
:return: spacy model
|
||||
"""
|
||||
saved_model_path = Path(model_dir, exp_name)
|
||||
return spacy.load(saved_model_path)
|
||||
|
||||
@staticmethod
|
||||
def load_or_create_empty_model(model=None):
|
||||
"""
|
||||
Loads a given model or creates a blank english model.
|
||||
:param model: Optional Model to load.
|
||||
:return: Loaded or blank model.
|
||||
"""
|
||||
if model:
|
||||
nlp = spacy.load(model)
|
||||
logging.debug("Loaded model {}".format(model))
|
||||
else:
|
||||
nlp = spacy.blank("en")
|
||||
logging.debug("Created blank 'en' model")
|
||||
return nlp
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
|
||||
experiment_name='spacy_new_ontonotes28',
|
||||
n_iter=500, dropout=0.5, aml_config=None)
|
||||
spacy_retrainer.run()
|
|
@ -1,6 +1,21 @@
|
|||
from .span_to_tag import span_to_tag, tokenize
|
||||
from .data_objects import Span, InputSample, EvaluationResult, ModelError
|
||||
from .model_evaluator import ModelEvaluator
|
||||
from .spacy_evaluator import SpacyEvaluator
|
||||
from .presidio_api_evaluator import PresidioAPIEvaluator
|
||||
from .presidio_analyzer_evaluator import PresidioAnalyzerEvaluator
|
||||
from .data_objects import Span, InputSample
|
||||
from .validation import (
|
||||
split_dataset,
|
||||
split_by_template,
|
||||
get_samples_by_pattern,
|
||||
group_by_template,
|
||||
save_to_json,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"span_to_tag",
|
||||
"tokenize",
|
||||
"Span",
|
||||
"InputSample",
|
||||
"split_dataset",
|
||||
"split_by_template",
|
||||
"get_samples_by_pattern",
|
||||
"group_by_template",
|
||||
"save_to_json",
|
||||
]
|
||||
|
|
|
@ -1,97 +0,0 @@
|
|||
import pickle
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
|
||||
|
||||
class CRFEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model_pickle_path: str = "../models/crf.pickle",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True):
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model_pickle_path is None:
|
||||
raise ValueError("model_pickle_path must be supplied")
|
||||
|
||||
with open(model_pickle_path, 'rb') as f:
|
||||
self.model = pickle.load(f)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
tags = CRFEvaluator.crf_predict(sample,self.model)
|
||||
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
# translated_tags = sample.rename_from_spacy_tags(tags)
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def crf_predict(sample, model):
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
conll = sample.to_conll(translate_tags=True)
|
||||
sentence = [(di['text'], di['pos'], di['label']) for di in conll]
|
||||
features = CRFEvaluator.sent2features(sentence)
|
||||
return model.predict([features])[0]
|
||||
|
||||
@staticmethod
|
||||
def word2features(sent, i):
|
||||
word = sent[i][0]
|
||||
postag = sent[i][1]
|
||||
|
||||
features = {
|
||||
'bias': 1.0,
|
||||
'word.lower()': word.lower(),
|
||||
'word[-3:]': word[-3:],
|
||||
'word[-2:]': word[-2:],
|
||||
'word.isupper()': word.isupper(),
|
||||
'word.istitle()': word.istitle(),
|
||||
'word.isdigit()': word.isdigit(),
|
||||
'postag': postag,
|
||||
'postag[:2]': postag[:2],
|
||||
}
|
||||
if i > 0:
|
||||
word1 = sent[i - 1][0]
|
||||
postag1 = sent[i - 1][1]
|
||||
features.update({
|
||||
'-1:word.lower()': word1.lower(),
|
||||
'-1:word.istitle()': word1.istitle(),
|
||||
'-1:word.isupper()': word1.isupper(),
|
||||
'-1:postag': postag1,
|
||||
'-1:postag[:2]': postag1[:2],
|
||||
})
|
||||
else:
|
||||
features['BOS'] = True
|
||||
|
||||
if i < len(sent) - 1:
|
||||
word1 = sent[i + 1][0]
|
||||
postag1 = sent[i + 1][1]
|
||||
features.update({
|
||||
'+1:word.lower()': word1.lower(),
|
||||
'+1:word.istitle()': word1.istitle(),
|
||||
'+1:word.isupper()': word1.isupper(),
|
||||
'+1:postag': postag1,
|
||||
'+1:postag[:2]': postag1[:2],
|
||||
})
|
||||
else:
|
||||
features['EOS'] = True
|
||||
|
||||
return features
|
||||
|
||||
@staticmethod
|
||||
def sent2features(sent):
|
||||
return [CRFEvaluator.word2features(sent, i) for i in range(len(sent))]
|
||||
|
||||
@staticmethod
|
||||
def sent2labels(sent):
|
||||
return [label for token, postag, label in sent]
|
||||
|
||||
@staticmethod
|
||||
def sent2tokens(sent):
|
||||
return [token for token, postag, label in sent]
|
|
@ -1,16 +1,27 @@
|
|||
# PII dataset generator
|
||||
This data generator takes a text file with templates (e.g. `my name is [PERSON]`) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
|
||||
It also creates Spans (start and end of each entity), tokens (`spaCy` tokenizer) and tags in various schemas (BIO/IOB, IO, BILOU)
|
||||
In addition it provides some off-the-shelf features on each token, like `pos`, `dep` and `is_in_vocabulary`
|
||||
This data generator takes a text file with templates (e.g. `my name is [PERSON]`)
and creates a list of InputSamples which contain fake PII entities
instead of placeholders.
It also creates Spans (start and end of each entity), tokens (`spaCy` tokenizer)
and tags in various schemas (BIO/IOB, IO, BILOU).
In addition, it provides some off-the-shelf features on each token,
such as `pos`, `dep` and `is_in_vocabulary`.
|
||||
|
||||
The main class is `FakeDataGenerator` however the `main` module has two functions for creating and reading a fake dataset.
|
||||
During the generation process, the tool either takes fake PII from a provided CSV with a known format, and/or from extension functions which can be found in the extensions.py file.
|
||||
The main class is `FakeDataGenerator`; however, the `main` module has two functions
for creating and reading a fake dataset.
During the generation process, the tool takes fake PII from a provided CSV with
a known format and/or from extension functions which can be found
in the extensions.py file.
|
||||
|
||||
The process in high level is the following:
|
||||
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
|
||||
2. (Optional) adapt the FakeDataGenerator to support new extensions which could generate fake PII entities
|
||||
3. Generate X samples using the templates list + a fake PII dataset + extensions that add additional PII entities
|
||||
4. Split the generated dataset to train/test/validation while making sure that samples from the same template would only appear in one set
|
||||
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]` (see the sketch after this list)
|
||||
2. (Optional) adapt the FakeDataGenerator to support new extensions
|
||||
which could generate fake PII entities
|
||||
3. Generate X samples using the templates list + a fake PII dataset +
|
||||
extensions that add additional PII entities
|
||||
4. Split the generated dataset to train/test/validation while making sure
|
||||
that samples from the same template would only appear in one set
|
||||
5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
|
||||
6. Train models
|
||||
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
|
||||
|
@ -19,12 +30,15 @@ The process in high level is the following:
|
|||
|
||||
Notes:
|
||||
- For steps 5, 6, 7 see the main [README](../../README.md).
|
||||
- For a simple data generation pipeline, [see this notebook](../../notebooks/Generate data.ipynb).
|
||||
- For information on transforming a NER dataset into templates, see the notebooks in the [helper notebooks](helper%20notebooks) folder.
|
||||
- For a simple data generation pipeline,
|
||||
[see this notebook](../../notebooks/Generate data.ipynb).
|
||||
- For information on transforming a NER dataset into templates,
|
||||
see the notebooks in the [helper notebooks](helper%20notebooks) folder.
|
||||
|
||||
Example run:
|
||||
|
||||
```python
|
||||
from presidio_evaluator.data_generator import generate
|
||||
TEMPLATES_FILE = 'raw_data/templates.txt'
|
||||
OUTPUT = "generated_.txt"
|
||||
|
||||
|
@ -45,4 +59,7 @@ examples = generate(fake_pii_csv=fake_pii_csv,
|
|||
|
||||
*Copyright notice:*
|
||||
|
||||
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
|
||||
Fake Name Generator identities by the Fake Name Generator are licensed under a
|
||||
Creative Commons Attribution-Share Alike 3.0 United States License.
|
||||
Fake Name Generator and the Fake Name Generator logo
|
||||
are trademarks of Corban Works, LLC.
|
|
@ -1,8 +1,7 @@
|
|||
import random
|
||||
from typing import List, Optional
|
||||
|
||||
import re
|
||||
from collections import Counter
|
||||
from typing import List, Optional, Dict
|
||||
|
||||
import pandas as pd
|
||||
from spacy.tokens import Token
|
||||
|
@ -40,30 +39,30 @@ class FakeDataGenerator:
|
|||
labeling_scheme="BILOU",
|
||||
):
|
||||
"""
|
||||
Fake data generator.
|
||||
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
|
||||
e.g. "My name is [FIRST_NAME]"
|
||||
:param fake_pii_df:
|
||||
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
|
||||
:param templates: A list of templates
|
||||
with place holders for PII entities.
|
||||
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
|
||||
Note that in case you have multiple entities of the same type
|
||||
in a template, you should put a number on the second. For example:
|
||||
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
|
||||
More than two are currently not supported but extending this
|
||||
is straightforward.
|
||||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param include_metadata: Whether to include additional
|
||||
information in the output
|
||||
(e.g. NameSet from which the name was taken, gender, country etc.)
|
||||
:param dictionary_path: A path to a csv containing a vocabulary of
|
||||
a language, to check if a token exists in the vocabulary or not.
|
||||
:param ignore_types: set of types to ignore
|
||||
:param span_to_tag: whether to tokenize the generated samples or not
|
||||
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
|
||||
"""
|
||||
Fake data generator.
|
||||
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
|
||||
e.g. "My name is [FIRST_NAME]"
|
||||
:param fake_pii_df:
|
||||
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
|
||||
:param templates: A list of templates
|
||||
with place holders for PII entities.
|
||||
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
|
||||
Note that in case you have multiple entities of the same type
|
||||
in a template, you should put a number on the second. For example:
|
||||
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
|
||||
More than two are currently not supported but extending this
|
||||
is straightforward.
|
||||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param include_metadata: Whether to include additional
|
||||
information in the output
|
||||
(e.g. NameSet from which the name was taken, gender, country etc.)
|
||||
:param dictionary_path: A path to a csv containing a vocabulary of
|
||||
a language, to check if a token exists in the vocabulary or not.
|
||||
:param ignore_types: set of types to ignore
|
||||
:param span_to_tag: whether to tokenize the generated samples or not
|
||||
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
|
||||
"""
|
||||
if ignore_types is None:
|
||||
ignore_types = {}
|
||||
self.lower_case_ratio = lower_case_ratio
|
||||
|
@ -110,7 +109,7 @@ class FakeDataGenerator:
|
|||
"TelephoneNumber": "PHONE_NUMBER",
|
||||
"CCNumber": "CREDIT_CARD",
|
||||
"Birthday": "BIRTHDAY",
|
||||
"EmailAddress": "EMAIL",
|
||||
"EmailAddress": "EMAIL_ADDRESS",
|
||||
"StreetAddress": "FULL_ADDRESS",
|
||||
"Domain": "DOMAIN_NAME",
|
||||
"NameSet": "NAMESET",
|
||||
|
@ -143,9 +142,9 @@ class FakeDataGenerator:
|
|||
) # replace previous country which has limited options
|
||||
|
||||
# Copied entities
|
||||
if "DATE" not in self.ignore_types:
|
||||
if "DATE_TIME" not in self.ignore_types:
|
||||
if "BIRTHDAY" in df:
|
||||
df["DATE"] = df["BIRTHDAY"]
|
||||
df["DATE_TIME"] = df["BIRTHDAY"]
|
||||
else:
|
||||
print("DATE is taken from the BIRTHDAY column which is missing")
|
||||
|
||||
|
@ -165,7 +164,9 @@ class FakeDataGenerator:
|
|||
if "TITLE" not in self.ignore_types:
|
||||
print("Generating titles")
|
||||
if "GENDER" not in df:
|
||||
print("Cannot generate title without a GENDER column. Generating FEMALE_TITLE and MALE_TITLE")
|
||||
print(
|
||||
"Cannot generate title without a GENDER column. Generating FEMALE_TITLE and MALE_TITLE"
|
||||
)
|
||||
else:
|
||||
df["TITLE"] = generate_titles(df["GENDER"])
|
||||
df["FEMALE_TITLE"] = [generate_title("female") for _ in range(len(df))]
|
||||
|
@ -275,7 +276,9 @@ class FakeDataGenerator:
|
|||
|
||||
return template, templates, entities_count
|
||||
|
||||
def sample_examples(self, count, genders:List[str]=None, namesets:List[str]=None):
|
||||
def sample_examples(
|
||||
self, count, genders: List[str] = None, namesets: List[str] = None
|
||||
):
|
||||
|
||||
if self.fake_pii is None:
|
||||
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
|
||||
|
@ -305,9 +308,7 @@ class FakeDataGenerator:
|
|||
values[h] = str(fake_pii_sample_duplicated[h])
|
||||
else:
|
||||
print(
|
||||
"Warning: entity {} is in the templates but not in the PII dataset. Ignoring.".format(
|
||||
h
|
||||
)
|
||||
f"Warning: entity {h} is in the templates but not in the PII dataset. Ignoring."
|
||||
)
|
||||
values[h] = ""
|
||||
|
||||
|
@ -335,7 +336,7 @@ class FakeDataGenerator:
|
|||
yield input_sample
|
||||
|
||||
@staticmethod
|
||||
def _consolidate_names(input_sample):
|
||||
def _consolidate_names(input_sample: InputSample):
|
||||
locations = ("LOCATION", "CITY", "STATE", "COUNTRY", "ADDRESS", "STREET")
|
||||
names = ("FIRST_NAME", "LAST_NAME", "PERSON")
|
||||
|
||||
|
@ -353,7 +354,9 @@ class FakeDataGenerator:
|
|||
|
||||
input_sample.masked = masked
|
||||
|
||||
def _create_input_sample(self, original_sentence, values):
|
||||
def _create_input_sample(
|
||||
self, original_sentence: str, values: Dict[str, str]
|
||||
) -> InputSample:
|
||||
"""
|
||||
Creates an InputSample out of a template sentence
|
||||
and a dict of entity names and values
|
||||
|
@ -417,7 +420,10 @@ class FakeDataGenerator:
|
|||
|
||||
# Not creating tokens here since we're consolidating names afterwards
|
||||
return InputSample(
|
||||
sentence, original_sentence, spans, create_tags_from_span=False
|
||||
full_text=sentence,
|
||||
spans=spans,
|
||||
masked=original_sentence,
|
||||
create_tags_from_span=False,
|
||||
)
|
||||
|
||||
def _add_duplicated_entities(self, fake_pii_sample, entity_counts):
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
import datetime
|
||||
import json
|
||||
import warnings
|
||||
|
||||
import pandas as pd
|
||||
|
||||
|
@ -12,14 +13,16 @@ def read_utterances(utterances_file):
|
|||
return f.readlines()
|
||||
|
||||
|
||||
def generate(fake_pii_csv,
|
||||
utterances_file,
|
||||
output_file=None,
|
||||
num_of_examples=1000,
|
||||
dictionary_path=None,
|
||||
store_masked_text=False,
|
||||
keep_only_tagged=False,
|
||||
**kwargs):
|
||||
def generate(
|
||||
fake_pii_csv,
|
||||
utterances_file,
|
||||
output_file=None,
|
||||
num_of_examples=1000,
|
||||
dictionary_path=None,
|
||||
store_masked_text=False,
|
||||
keep_only_tagged=False,
|
||||
**kwargs
|
||||
):
|
||||
"""
|
||||
|
||||
:param fake_pii_csv: csv containing fake PII
|
||||
|
@ -34,18 +37,18 @@ def generate(fake_pii_csv,
|
|||
"""
|
||||
|
||||
if not output_file:
|
||||
raise ValueError("Please provide an output file path")
|
||||
warnings.warn("Warning: no output_file value provided.")
|
||||
|
||||
templates = read_utterances(utterances_file)
|
||||
|
||||
if keep_only_tagged:
|
||||
templates = [template for template in templates if "[" in template]
|
||||
|
||||
df = pd.read_csv(fake_pii_csv, encoding='utf-8')
|
||||
df = pd.read_csv(fake_pii_csv, encoding="utf-8")
|
||||
|
||||
generator = FakeDataGenerator(fake_pii_df=df,
|
||||
dictionary_path=dictionary_path,
|
||||
templates=templates, **kwargs)
|
||||
generator = FakeDataGenerator(
|
||||
fake_pii_df=df, dictionary_path=dictionary_path, templates=templates, **kwargs
|
||||
)
|
||||
counter = 0
|
||||
|
||||
examples = []
|
||||
|
@ -56,7 +59,7 @@ def generate(fake_pii_csv,
|
|||
|
||||
examples_json = [example.to_dict() for example in examples]
|
||||
|
||||
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
|
||||
with open("{}".format(output_file), "w+", encoding="utf-8") as f:
|
||||
json.dump(examples_json, f, ensure_ascii=False, indent=4)
|
||||
|
||||
print("generated {} examples".format(len(examples)))
|
||||
|
@ -67,6 +70,7 @@ def generate(fake_pii_csv,
|
|||
|
||||
def read_synth_dataset(filepath=None, length=None):
|
||||
import json
|
||||
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
dataset = json.load(f)
|
||||
|
||||
|
@ -84,28 +88,32 @@ if __name__ == "__main__":
|
|||
EXAMPLES = 30
|
||||
PII_FILE_SIZE = 3000
|
||||
SPAN_TO_TAG = True
|
||||
TEMPLATES_FILE = 'raw_data/templates.txt'
|
||||
TEMPLATES_FILE = "raw_data/templates.txt"
|
||||
KEEP_ONLY_TAGGED = False
|
||||
LOWER_CASE_RATIO = 0.1
|
||||
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}
|
||||
IGNORE_TYPES = {"IP_ADDRESS", "US_SSN", "URL"}
|
||||
|
||||
cur_time = datetime.date.today().strftime("%B %d %Y")
|
||||
OUTPUT = "generated_size_{}_date_{}.txt".format(EXAMPLES, cur_time)
|
||||
|
||||
fake_pii_csv = '../../presidio_evaluator/data_generator/' \
|
||||
'raw_data/FakeNameGenerator.com_{}.csv'.format(PII_FILE_SIZE)
|
||||
fake_pii_csv = (
|
||||
"../../presidio_evaluator/data_generator/"
|
||||
"raw_data/FakeNameGenerator.com_{}.csv".format(PII_FILE_SIZE)
|
||||
)
|
||||
utterances_file = TEMPLATES_FILE
|
||||
dictionary_path = None
|
||||
|
||||
examples = generate(fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=utterances_file,
|
||||
dictionary_path=dictionary_path,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=LOWER_CASE_RATIO,
|
||||
num_of_examples=EXAMPLES,
|
||||
ignore_types=IGNORE_TYPES,
|
||||
keep_only_tagged=KEEP_ONLY_TAGGED,
|
||||
span_to_tag=SPAN_TO_TAG)
|
||||
examples = generate(
|
||||
fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=utterances_file,
|
||||
dictionary_path=dictionary_path,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=LOWER_CASE_RATIO,
|
||||
num_of_examples=EXAMPLES,
|
||||
ignore_types=IGNORE_TYPES,
|
||||
keep_only_tagged=KEEP_ONLY_TAGGED,
|
||||
span_to_tag=SPAN_TO_TAG,
|
||||
)
|
||||
|
||||
# sanity
|
||||
input_samples = read_synth_dataset(OUTPUT)
|
||||
|
|
|
@ -15,24 +15,34 @@ class NationalityGenerator:
|
|||
|
||||
def get_country(self):
|
||||
## [COUNTRY]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['country'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["country"].values)
|
||||
)
|
||||
|
||||
def get_nationality(self):
|
||||
## [NATIONALITY]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['nationality'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["nationality"].values)
|
||||
)
|
||||
|
||||
def get_nation_woman(self):
|
||||
## [NATION_WOMAN]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['woman'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["woman"].values)
|
||||
)
|
||||
|
||||
def get_nation_man(self):
|
||||
## [NATION_MAN]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['man'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["man"].values)
|
||||
)
|
||||
|
||||
def get_nation_plural(self):
|
||||
## [NATION_PLURAL]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['plural'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["plural"].values)
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def capitalizeWords(s):
|
||||
return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)
|
||||
return re.sub(r"\w+", lambda m: m.group(0).capitalize(), s)
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
from typing import List, Set, Dict
|
||||
|
||||
from presidio_analyzer import RecognizerResult
|
||||
from presidio_anonymizer import AnonymizerEngine
|
||||
|
||||
from presidio_evaluator.data_generator import FakeDataGenerator
|
||||
|
||||
|
@ -13,7 +14,6 @@ class PresidioPerturb(FakeDataGenerator):
|
|||
fake_pii_df: pd.DataFrame,
|
||||
lower_case_ratio: float = 0.0,
|
||||
ignore_types: Set[str] = None,
|
||||
entity_dict: Dict[str, str] = None,
|
||||
):
|
||||
super().__init__(
|
||||
fake_pii_df=fake_pii_df,
|
||||
|
@ -29,12 +29,9 @@ class PresidioPerturb(FakeDataGenerator):
|
|||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param ignore_types: set of types to ignore
|
||||
:param entity_dict: Dictionary with mapping of entity names between Presidio and the fake_pii_df.
|
||||
For example, {"EMAIL_ADDRESS": "EMAIL"}
|
||||
"""
|
||||
|
||||
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
|
||||
self.entity_dict = entity_dict
|
||||
|
||||
def perturb(
|
||||
self,
|
||||
|
@ -56,19 +53,14 @@ class PresidioPerturb(FakeDataGenerator):
|
|||
|
||||
presidio_response = sorted(presidio_response, key=lambda resp: resp.start)
|
||||
|
||||
delta = 0
|
||||
text = original_text
|
||||
for resp in presidio_response:
|
||||
start = resp.start + delta
|
||||
end = resp.end + delta
|
||||
entity_text = original_text[start:end]
|
||||
entity_type = resp.entity_type
|
||||
if self.entity_dict:
|
||||
if entity_type in self.entity_dict:
|
||||
entity_type = self.entity_dict[entity_type]
|
||||
anonymizer_engine = AnonymizerEngine()
|
||||
anonymized_result = anonymizer_engine.anonymize(
|
||||
text=original_text, analyzer_results=presidio_response
|
||||
)
|
||||
|
||||
text = anonymized_result.text
|
||||
text = text.replace(">", "}").replace("<", "{")
|
||||
|
||||
text = f"{text[:start]}{{{entity_type}}}{text[end:]}"
|
||||
delta = len(entity_type) + 2 - len(entity_text)
|
||||
self.templates = [text]
|
||||
return [
|
||||
sample.full_text
|
||||
|
|
|
@ -24,11 +24,11 @@ I will be travelling to [COUNTRY] next week, so I need my passport to be ready b
|
|||
Who's coming to [COUNTRY] with me?
|
||||
[COUNTRY] was super fun to visit!
|
||||
Could you please email me the statement for laste month , my credit card number is [CREDIT_CARD]?
|
||||
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL]?
|
||||
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL_ADDRESS]?
|
||||
How do I change my address to [ADDRESS] for post mail?
|
||||
My name appears incorrectly on credit card statement could you please correct it to [TITLE] [PERSON]?
|
||||
card number [CREDIT_CARD] is lost, can you please send a new one to [ADDRESS] i am in [CITY] for a business trip
|
||||
Please transfer all funds from my account to this hackers' [EMAIL]
|
||||
Please transfer all funds from my account to this hackers' [EMAIL_ADDRESS]
|
||||
I can't browse to your site, keep getting address [IP_ADDRESS] blocked error
|
||||
My religion does not allow speaking to bots, they are evil and hacked by the Devil
|
||||
Excuse me, Sir bot, but I really don't like this tone
|
||||
|
@ -47,7 +47,7 @@ I would like to remove my kid [FIRST_NAME] from the will. How do I do that?
|
|||
The name in the account is not correct, please change it to [PERSON]
|
||||
Hello I moved, please update my new address is [ADDRESS]
|
||||
I need to add addresses, here they are: [ADDRESS], [ADDRESS]
|
||||
Please send my portfolio to this email [EMAIL]
|
||||
Please send my portfolio to this email [EMAIL_ADDRESS]
|
||||
Hello, this is [TITLE] [PERSON]. Who are you?
|
||||
I want to add [PERSON] as a beneficiary to my account
|
||||
I want to cancel my card [CREDIT_CARD] because I lost it
|
||||
|
@ -58,11 +58,11 @@ My nam is [FIRST_NAME]
|
|||
I'm moving out of the country, so please cancel my subscription
|
||||
My name is [PERSON] but everyone calls me [FIRST_NAME]
|
||||
Please tell me your date of birth. It's [BIRTHDAY]
|
||||
You said your email is [EMAIL]. Is that correct?
|
||||
You said your email is [EMAIL_ADDRESS]. Is that correct?
|
||||
I once lived in [ADDRESS]. I now live in [ADDRESS]
|
||||
I'd like to order a taxi to [ADDRESS]
|
||||
Please charge my credit card. Number is [CREDIT_CARD]
|
||||
What's your email? [EMAIL]
|
||||
What's your email? [EMAIL_ADDRESS]
|
||||
What's your credit card? [CREDIT_CARD]
|
||||
What's your name? [PERSON]
|
||||
What's your last name? [LAST_NAME]
|
||||
|
|
|
@ -1,8 +1,9 @@
|
|||
from typing import List, Counter, Dict
|
||||
from typing import List, Optional
|
||||
|
||||
import spacy
|
||||
import srsly
|
||||
from spacy.tokens import Token
|
||||
from spacy.training import docs_to_json
|
||||
from tqdm import tqdm
|
||||
|
||||
from presidio_evaluator import span_to_tag, tokenize
|
||||
|
@ -15,7 +16,7 @@ SPACY_PRESIDIO_ENTITIES = {
|
|||
"FAC": "LOCATION",
|
||||
"PERSON": "PERSON",
|
||||
"LOCATION": "LOCATION",
|
||||
"ORGANIZATION": "ORGANIZATION"
|
||||
"ORGANIZATION": "ORGANIZATION",
|
||||
}
|
||||
PRESIDIO_SPACY_ENTITIES = {
|
||||
"ORGANIZATION": "ORG",
|
||||
|
@ -55,8 +56,10 @@ class Span:
|
|||
"""
|
||||
|
||||
# if they do not overlap the intersection is 0
|
||||
if self.end_position < other.start_position or other.end_position < \
|
||||
self.start_position:
|
||||
if (
|
||||
self.end_position < other.start_position
|
||||
or other.end_position < self.start_position
|
||||
):
|
||||
return 0
|
||||
|
||||
# if we are accounting for entity type a diff type means intersection 0
|
||||
|
@ -65,25 +68,38 @@ class Span:
|
|||
|
||||
# otherwise the intersection is min(end) - max(start)
|
||||
return min(self.end_position, other.end_position) - max(
|
||||
self.start_position,
|
||||
other.start_position)
|
||||
self.start_position, other.start_position
|
||||
)
|
||||
|
||||
def __repr__(self):
|
||||
return "Type: {}, value: {}, start: {}, end: {}".format(
|
||||
self.entity_type, self.entity_value, self.start_position,
|
||||
self.end_position)
|
||||
return (
|
||||
f"Type: {self.entity_type}, "
|
||||
f"value: {self.entity_value}, "
|
||||
f"start: {self.start_position}, "
|
||||
f"end: {self.end_position}"
|
||||
)
|
||||
|
||||
def __eq__(self, other):
|
||||
return self.entity_type == other.entity_type \
|
||||
and self.entity_value == other.entity_value \
|
||||
and self.start_position == other.start_position \
|
||||
and self.end_position == other.end_position
|
||||
return (
|
||||
self.entity_type == other.entity_type
|
||||
and self.entity_value == other.entity_value
|
||||
and self.start_position == other.start_position
|
||||
and self.end_position == other.end_position
|
||||
)
|
||||
|
||||
def __hash__(self):
|
||||
return hash(('entity_type', self.entity_type,
|
||||
'entity_value', self.entity_value,
|
||||
'start_position', self.start_position,
|
||||
'end_position', self.end_position))
|
||||
return hash(
|
||||
(
|
||||
"entity_type",
|
||||
self.entity_type,
|
||||
"entity_value",
|
||||
self.entity_value,
|
||||
"start_position",
|
||||
self.start_position,
|
||||
"end_position",
|
||||
self.end_position,
|
||||
)
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
|
@ -108,12 +124,17 @@ class SimpleToken(object):
|
|||
A class mimicking the Spacy Token class, for serialization purposes
|
||||
"""
|
||||
|
||||
def __init__(self, text, idx, tag_=None,
|
||||
pos_=None,
|
||||
dep_=None,
|
||||
lemma_=None,
|
||||
spacy_extensions: SimpleSpacyExtensions = None,
|
||||
**kwargs):
|
||||
def __init__(
|
||||
self,
|
||||
text,
|
||||
idx,
|
||||
tag_=None,
|
||||
pos_=None,
|
||||
dep_=None,
|
||||
lemma_=None,
|
||||
spacy_extensions: SimpleSpacyExtensions = None,
|
||||
**kwargs,
|
||||
):
|
||||
self.text = text
|
||||
self.idx = idx
|
||||
self.tag_ = tag_
|
||||
|
@ -145,13 +166,15 @@ class SimpleToken(object):
|
|||
else:
|
||||
spacy_extensions = None
|
||||
|
||||
return cls(text=token.text,
|
||||
idx=token.idx,
|
||||
tag_=token.tag_,
|
||||
pos_=token.pos_,
|
||||
dep_=token.dep_,
|
||||
lemma_=token.lemma_,
|
||||
spacy_extensions=spacy_extensions)
|
||||
return cls(
|
||||
text=token.text,
|
||||
idx=token.idx,
|
||||
tag_=token.tag_,
|
||||
pos_=token.pos_,
|
||||
dep_=token.dep_,
|
||||
lemma_=token.lemma_,
|
||||
spacy_extensions=spacy_extensions,
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
return {
|
||||
|
@ -161,7 +184,7 @@ class SimpleToken(object):
|
|||
"pos_": self.pos_,
|
||||
"dep_": self.dep_,
|
||||
"lemma_": self.lemma_,
|
||||
"_": self._.to_dict()
|
||||
"_": self._.to_dict(),
|
||||
}
|
||||
|
||||
def __repr__(self):
|
||||
|
@ -170,21 +193,27 @@ class SimpleToken(object):
|
|||
@classmethod
|
||||
def from_json(cls, data):
|
||||
|
||||
if '_' in data:
|
||||
data['spacy_extensions'] = \
|
||||
SimpleSpacyExtensions(**data['_'])
|
||||
if "_" in data:
|
||||
data["spacy_extensions"] = SimpleSpacyExtensions(**data["_"])
|
||||
return cls(**data)
|
||||
|
||||
|
||||
class InputSample(object):
|
||||
|
||||
def __init__(self, full_text: str, masked: str, spans: List[Span],
|
||||
tokens=[], tags=[],
|
||||
create_tags_from_span=True, scheme="IO", metadata=None, template_id=None):
|
||||
def __init__(
|
||||
self,
|
||||
full_text: str,
|
||||
spans: Optional[List[Span]] = None,
|
||||
masked: Optional[str] = None,
|
||||
tokens: Optional[List[SimpleToken]] = None,
|
||||
tags: Optional[List[str]] = None,
|
||||
create_tags_from_span=True,
|
||||
scheme="IO",
|
||||
metadata=None,
|
||||
template_id=None,
|
||||
):
|
||||
"""
|
||||
Holds all the information needed for evaluation in the
|
||||
Hold all the information needed for evaluation in the
|
||||
presidio-evaluator framework.
|
||||
Can generate tags (BIO/BILOU/IO) based on spans
|
||||
|
||||
:param full_text: The raw text of this sample
|
||||
:param masked: Masked version of the raw text (desired output)
|
||||
|
@ -198,6 +227,10 @@ class InputSample(object):
|
|||
in the English (or other language) vocabulary
|
||||
:param template_id: Original template (utterance) of sample, in case it was generated
|
||||
"""
|
||||
if tags is None:
|
||||
tags = []
|
||||
if tokens is None:
|
||||
tokens = []
|
||||
self.full_text = full_text
|
||||
self.masked = masked
|
||||
self.spans = spans if spans else []
|
||||
|
@ -218,11 +251,12 @@ class InputSample(object):
|
|||
self.tags = tags
|
||||
|
||||
def __repr__(self):
|
||||
return "Full text: {}\n" \
|
||||
"Spans: {}\n" \
|
||||
"Tokens: {}\n" \
|
||||
"Tags: {}\n".format(self.full_text, self.spans, self.tokens,
|
||||
self.tags)
|
||||
return (
|
||||
f"Full text: {self.full_text}\n"
|
||||
f"Spans: {self.spans}\n"
|
||||
f"Tokens: {self.tokens}\n"
|
||||
f"Tags: {self.tags}\n"
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
|
||||
|
@ -230,20 +264,20 @@ class InputSample(object):
|
|||
"full_text": self.full_text,
|
||||
"masked": self.masked,
|
||||
"spans": [span.__dict__ for span in self.spans],
|
||||
"tokens": [SimpleToken.from_spacy_token(token).to_dict()
|
||||
for token in self.tokens],
|
||||
"tokens": [
|
||||
SimpleToken.from_spacy_token(token).to_dict() for token in self.tokens
|
||||
],
|
||||
"tags": self.tags,
|
||||
"template_id": self.template_id,
|
||||
"metadata": self.metadata
|
||||
"metadata": self.metadata,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
if 'spans' in data:
|
||||
data['spans'] = [Span.from_json(span) for span in data['spans']]
|
||||
if 'tokens' in data:
|
||||
data['tokens'] = [SimpleToken.from_json(val) for val in
|
||||
data['tokens']]
|
||||
if "spans" in data:
|
||||
data["spans"] = [Span.from_json(span) for span in data["spans"]]
|
||||
if "tokens" in data:
|
||||
data["tokens"] = [SimpleToken.from_json(val) for val in data["tokens"]]
|
||||
return cls(**data, create_tags_from_span=False)
|
||||
|
||||
def get_tags(self, scheme="IOB"):
|
||||
|
@ -252,33 +286,43 @@ class InputSample(object):
|
|||
tags = [span.entity_type for span in self.spans]
|
||||
tokens = tokenize(self.full_text)
|
||||
|
||||
labels = span_to_tag(scheme=scheme, text=self.full_text, tag=tags,
|
||||
start=start_indices, end=end_indices,
|
||||
tokens=tokens)
|
||||
labels = span_to_tag(
|
||||
scheme=scheme,
|
||||
text=self.full_text,
|
||||
tag=tags,
|
||||
start=start_indices,
|
||||
end=end_indices,
|
||||
tokens=tokens,
|
||||
)
|
||||
|
||||
return tokens, labels
|
||||
|
||||
def to_conll(self, translate_tags, scheme="BIO"):
|
||||
def to_conll(self, translate_tags):
|
||||
|
||||
conll = []
|
||||
for i, token in enumerate(self.tokens):
|
||||
if translate_tags:
|
||||
label = self.translate_tag(self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
label = self.translate_tag(
|
||||
self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
|
||||
)
|
||||
else:
|
||||
label = self.tags[i]
|
||||
conll.append({"text": token.text,
|
||||
"pos": token.pos_,
|
||||
"tag": token.tag_,
|
||||
"Template#": self.metadata['Template#'],
|
||||
"gender": self.metadata['Gender'],
|
||||
"country": self.metadata['Country'],
|
||||
"label": label},
|
||||
)
|
||||
conll.append(
|
||||
{
|
||||
"text": token.text,
|
||||
"pos": token.pos_,
|
||||
"tag": token.tag_,
|
||||
"Template#": self.metadata["Template#"],
|
||||
"gender": self.metadata["Gender"],
|
||||
"country": self.metadata["Country"],
|
||||
"label": label,
|
||||
},
|
||||
)
|
||||
|
||||
return conll
|
||||
|
||||
def get_template_id(self):
|
||||
return self.metadata['Template#']
|
||||
return self.metadata["Template#"]
|
||||
|
||||
@staticmethod
|
||||
def create_conll_dataset(dataset, translate_tags=True, to_bio=True):
|
||||
|
@ -291,66 +335,76 @@ class InputSample(object):
|
|||
sample.bilou_to_bio()
|
||||
conll = sample.to_conll(translate_tags=translate_tags)
|
||||
for token in conll:
|
||||
token['sentence'] = i
|
||||
token["sentence"] = i
|
||||
conlls.append(token)
|
||||
i += 1
|
||||
|
||||
return pd.DataFrame(conlls)
|
||||
|
||||
def to_spacy(self, entities=None, translate_tags=True):
|
||||
entities = [(span.start_position, span.end_position, span.entity_type)
|
||||
for span in self.spans if (entities is None) or (span.entity_type in entities)]
|
||||
entities = [
|
||||
(span.start_position, span.end_position, span.entity_type)
|
||||
for span in self.spans
|
||||
if (entities is None) or (span.entity_type in entities)
|
||||
]
|
||||
new_entities = []
|
||||
if translate_tags:
|
||||
for entity in entities:
|
||||
new_tag = self.translate_tag(entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
new_tag = self.translate_tag(
|
||||
entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
|
||||
)
|
||||
new_entities.append((entity[0], entity[1], new_tag))
|
||||
else:
|
||||
new_entities = entities
|
||||
return (self.full_text,
|
||||
{"entities": new_entities})
|
||||
return self.full_text, {"entities": new_entities}
|
||||
|
||||
@classmethod
|
||||
def from_spacy(cls, text, annotations, translate_from_spacy=True):
|
||||
spans = []
|
||||
for annotation in annotations:
|
||||
tag = cls.rename_from_spacy_tags([annotation[2]])[0] if translate_from_spacy else annotation[2]
|
||||
span = Span(tag, text[annotation[0]: annotation[1]], annotation[0], annotation[1])
|
||||
tag = (
|
||||
cls.rename_from_spacy_tags([annotation[2]])[0]
|
||||
if translate_from_spacy
|
||||
else annotation[2]
|
||||
)
|
||||
span = Span(
|
||||
tag, text[annotation[0] : annotation[1]], annotation[0], annotation[1]
|
||||
)
|
||||
spans.append(span)
|
||||
return cls(full_text=text, masked=None, spans=spans)
|
||||
|
||||
@staticmethod
|
||||
def create_spacy_dataset(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
|
||||
def create_spacy_dataset(
|
||||
dataset, entities=None, sort_by_template_id=False, translate_tags=True
|
||||
):
|
||||
def template_sort(x):
|
||||
return x.metadata['Template#']
|
||||
return x.metadata["Template#"]
|
||||
|
||||
if sort_by_template_id:
|
||||
dataset.sort(key=template_sort)
|
||||
|
||||
return [sample.to_spacy(entities=entities, translate_tags=translate_tags) for sample in dataset]
|
||||
return [
|
||||
sample.to_spacy(entities=entities, translate_tags=translate_tags)
|
||||
for sample in dataset
|
||||
]
|
||||
|
||||
def to_spacy_json(self, entities=None, translate_tags=True):
|
||||
token_dicts = []
|
||||
for i, token in enumerate(self.tokens):
|
||||
if entities:
|
||||
tag = self.tags[i] if self.tags[i][2:] in entities else 'O'
|
||||
tag = self.tags[i] if self.tags[i][2:] in entities else "O"
|
||||
else:
|
||||
tag = self.tags[i]
|
||||
|
||||
if translate_tags:
|
||||
tag = self.translate_tag(tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
token_dicts.append({
|
||||
"orth": token.text,
|
||||
"tag": token.tag_,
|
||||
"ner": tag
|
||||
})
|
||||
tag = self.translate_tag(
|
||||
tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
|
||||
)
|
||||
token_dicts.append({"orth": token.text, "tag": token.tag_, "ner": tag})
|
||||
|
||||
spacy_json_sentence = {
|
||||
"raw": self.full_text,
|
||||
"sentences": [{
|
||||
"tokens": token_dicts
|
||||
}
|
||||
]
|
||||
"sentences": [{"tokens": token_dicts}],
|
||||
}
|
||||
|
||||
return spacy_json_sentence
|
||||
|
@ -359,29 +413,37 @@ class InputSample(object):
|
|||
doc = self.tokens
|
||||
spacy_spans = []
|
||||
for span in self.spans:
|
||||
start_token = [token.i for token in self.tokens if token.idx == span.start_position][0]
|
||||
end_token = [token.i for token in self.tokens if token.idx + len(token.text) == span.end_position][0] + 1
|
||||
spacy_span = spacy.tokens.span.Span(doc, start=start_token, end=end_token,
|
||||
label=span.entity_type)
|
||||
start_token = [
|
||||
token.i for token in self.tokens if token.idx == span.start_position
|
||||
][0]
|
||||
end_token = [
|
||||
token.i
|
||||
for token in self.tokens
|
||||
if token.idx + len(token.text) == span.end_position
|
||||
][0] + 1
|
||||
spacy_span = spacy.tokens.span.Span(
|
||||
doc, start=start_token, end=end_token, label=span.entity_type
|
||||
)
|
||||
spacy_spans.append(spacy_span)
|
||||
doc.ents = spacy_spans
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def create_spacy_json(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
|
||||
def create_spacy_json(
|
||||
dataset, entities=None, sort_by_template_id=False, translate_tags=True
|
||||
):
|
||||
def template_sort(x):
|
||||
return x.metadata['Template#']
|
||||
return x.metadata["Template#"]
|
||||
|
||||
if sort_by_template_id:
|
||||
dataset.sort(key=template_sort)
|
||||
|
||||
json_str = []
|
||||
for i, sample in tqdm(enumerate(dataset)):
|
||||
paragraph = sample.to_spacy_json(entities=entities, translate_tags=translate_tags)
|
||||
json_str.append({
|
||||
"id": i,
|
||||
"paragraphs": [paragraph]
|
||||
})
|
||||
paragraph = sample.to_spacy_json(
|
||||
entities=entities, translate_tags=translate_tags
|
||||
)
|
||||
json_str.append({"id": i, "paragraphs": [paragraph]})
|
||||
|
||||
return json_str
|
||||
|
||||
|
@ -402,10 +464,12 @@ class InputSample(object):
|
|||
|
||||
     @staticmethod
     def translate_tag(tag, dictionary, ignore_unknown):
-        has_prefix = len(tag) > 2 and tag[1] == '-'
+        has_prefix = len(tag) > 2 and tag[1] == "-"
         no_prefix = tag[2:] if has_prefix else tag
         if no_prefix in dictionary.keys():
-            return tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
+            return (
+                tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
+            )
         else:
             if ignore_unknown:
                 return "O"
|
||||
|
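For reference, a quick sketch of how this prefix-preserving translation behaves; the mapping below is a made-up example for illustration, not the actual PRESIDIO_SPACY_ENTITIES dictionary:

```python
from presidio_evaluator import InputSample

# Hypothetical mapping used only for illustration
example_mapping = {"PERSON": "PER", "LOCATION": "GPE"}

InputSample.translate_tag("B-PERSON", example_mapping, ignore_unknown=True)    # -> "B-PER"
InputSample.translate_tag("I-LOCATION", example_mapping, ignore_unknown=True)  # -> "I-GPE"
InputSample.translate_tag("U-TITLE", example_mapping, ignore_unknown=True)     # -> "O" (unknown entity)
```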
@ -416,41 +480,48 @@ class InputSample(object):
|
|||
         new_tags = []
         for tag in self.tags:
             new_tag = tag
-            has_prefix = len(tag) > 2 and tag[1] == '-'
+            has_prefix = len(tag) > 2 and tag[1] == "-"
             if has_prefix:
-                if tag[0] == 'U':
-                    new_tag = 'B' + tag[1:]
-                elif tag[0] == 'L':
-                    new_tag = 'I' + tag[1:]
+                if tag[0] == "U":
+                    new_tag = "B" + tag[1:]
+                elif tag[0] == "L":
+                    new_tag = "I" + tag[1:]
             new_tags.append(new_tag)

         self.tags = new_tags

     @staticmethod
     def rename_from_spacy_tags(spacy_tags, ignore_unknown=False):
-        return InputSample.translate_tags(spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown)
+        return InputSample.translate_tags(
+            spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown
+        )

     @staticmethod
     def rename_to_spacy_tags(tags, ignore_unknown=True):
-        return InputSample.translate_tags(tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown)
+        return InputSample.translate_tags(
+            tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown
+        )

     @staticmethod
     def write_spacy_json_from_docs(dataset, filename="spacy_output.json"):
         docs = [sample.to_spacy_doc() for sample in dataset]
-        srsly.write_json(filename, [spacy.gold.docs_to_json(docs)])
+        srsly.write_json(filename, [spacy.training.docs_to_json(docs)])

     def to_flair(self):
         for i, token in enumerate(self.tokens):
-            return "{} {} {}".format(token, token.pos_, self.tags[i])
+            return f"{token} {token.pos_} {self.tags[i]}"

-    def translate_input_sample_tags(self, dictionary=PRESIDIO_SPACY_ENTITIES, ignore_unknown=True):
-        self.tags = InputSample.translate_tags(self.tags, dictionary, ignore_unknown=ignore_unknown)
+    def translate_input_sample_tags(self, dictionary=None, ignore_unknown=True):
+        if dictionary is None:
+            dictionary = PRESIDIO_SPACY_ENTITIES
+        self.tags = InputSample.translate_tags(
+            self.tags, dictionary, ignore_unknown=ignore_unknown
+        )
         for span in self.spans:
             if span.entity_value in PRESIDIO_SPACY_ENTITIES:
                 span.entity_value = PRESIDIO_SPACY_ENTITIES[span.entity_value]
             elif ignore_unknown:
-                span.entity_value = 'O'
+                span.entity_value = "O"

     @staticmethod
     def create_flair_dataset(dataset):
|
||||
|
@ -459,83 +530,3 @@ class InputSample(object):
|
|||
flair_samples.append(sample.to_flair())
|
||||
|
||||
return flair_samples
|
||||
|
||||
|
||||
class ModelError:
|
||||
|
||||
def __init__(self, error_type, annotation, prediction, token, full_text, metadata):
|
||||
"""
|
||||
Holds information about an error a model made for analysis purposes
|
||||
:param error_type: str, e.g. FP, FN, Person->Address etc.
|
||||
:param annotation: ground truth value
|
||||
:param prediction: predicted value
|
||||
:param token: token in question
|
||||
:param full_text: full input text
|
||||
:param metadata: metadata on text from InputSample
|
||||
"""
|
||||
|
||||
self.error_type = error_type
|
||||
self.annotation = annotation
|
||||
self.prediction = prediction
|
||||
self.token = token
|
||||
self.full_text = full_text
|
||||
self.metadata = metadata
|
||||
|
||||
def __str__(self):
|
||||
return "type: {}, " \
|
||||
"Annotation = {}, " \
|
||||
"prediction = {}, " \
|
||||
"Token = {}, " \
|
||||
"Full text = {}, " \
|
||||
"Metadata = {}".format(self.error_type,
|
||||
self.annotation,
|
||||
self.prediction,
|
||||
self.token,
|
||||
self.full_text,
|
||||
self.metadata)
|
||||
|
||||
def __repr__(self):
|
||||
return r"<ModelError {{0}}>".format(self.__str__())
|
||||
|
||||
|
||||
class EvaluationResult(object):
|
||||
def __init__(self, results: Counter, model_errors: List[ModelError], text: str = None):
|
||||
"""
|
||||
Holds the output of a comparison between ground truth and predicted
|
||||
:param results: List of objects of type Counter
|
||||
with structure {(actual, predicted) : count}
|
||||
:param model_errors: List of ModelError
|
||||
:param text: sample's full text (if used for one sample)
|
||||
:type results: Counter
|
||||
:type model_errors : List[ModelError]
|
||||
:type text: object
|
||||
"""
|
||||
self.results = results
|
||||
self.model_errors = model_errors
|
||||
self.text = text
|
||||
|
||||
self.pii_recall = None
|
||||
self.pii_precision = None
|
||||
self.pii_f = None
|
||||
self.entity_recall_dict = None
|
||||
self.entity_precision_dict = None
|
||||
|
||||
def print(self):
|
||||
recall_dict = self.entity_recall_dict
|
||||
precision_dict = self.entity_precision_dict
|
||||
|
||||
recall_dict["PII"] = self.pii_recall
|
||||
precision_dict["PII"] = self.pii_precision
|
||||
|
||||
entities = recall_dict.keys()
|
||||
recall = recall_dict.values()
|
||||
precision = precision_dict.values()
|
||||
|
||||
row_format = "{:>30}{:>30.2%}{:>30.2%}"
|
||||
header_format = "{:>30}" * 3
|
||||
print(header_format.format(*("Entity", "Precision", "Recall")))
|
||||
for entity, precision, recall in zip(entities, precision, recall):
|
||||
print(row_format.format(entity, precision, recall))
|
||||
|
||||
print("PII F measure: {}".format(self.pii_f))
|
||||
|
||||
|
|
|
@ -0,0 +1,4 @@
|
|||
from .dataset_formatter import DatasetFormatter
|
||||
from .conll_formatter import CONLL2003Formatter
|
||||
|
||||
__all__ = ["DatasetFormatter", "CONLL2003Formatter"]
|
|
@ -0,0 +1,62 @@
|
|||
from pathlib import Path
|
||||
from typing import List, Optional
|
||||
|
||||
import requests
|
||||
from spacy.training import converters
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.dataset_formatters import DatasetFormatter
|
||||
|
||||
|
||||
class CONLL2003Formatter(DatasetFormatter):
|
||||
def __init__(
|
||||
self,
|
||||
files_path=Path("../data/conll2003").resolve(),
|
||||
glob_pattern: str = "*.iob",
|
||||
):
|
||||
self.files_path = files_path
|
||||
self.glob_pattern = glob_pattern
|
||||
|
||||
@staticmethod
|
||||
def download(
|
||||
local_data_path=Path("../data/conll2003").resolve(),
|
||||
conll_gh_path="https://raw.githubusercontent.com/glample/tagger/master/dataset/",
|
||||
):
|
||||
|
||||
for fold in ("eng.train", "eng.testa", "eng.testb"):
|
||||
fold_path = conll_gh_path + fold
|
||||
if not local_data_path.exists():
|
||||
local_data_path.mkdir(parents=True)
|
||||
|
||||
dataset_file = Path(local_data_path, fold)
|
||||
if dataset_file.exists():
|
||||
print("Dataset already exists, skipping download")
|
||||
return
|
||||
|
||||
response = requests.get(fold_path)
|
||||
dataset_raw = response.text
|
||||
with open(dataset_file, "w") as f:
|
||||
f.write(dataset_raw)
|
||||
print(f"Finished writing fold {fold} to {local_data_path}")
|
||||
|
||||
print("Finished downloading CoNNL2003")
|
||||
|
||||
def to_input_samples(self, fold: Optional[str] = None) -> List[InputSample]:
|
||||
files_found = False
|
||||
for i, file_path in enumerate(self.files_path.glob(self.glob_pattern)):
|
||||
if fold and fold not in file_path.name:
|
||||
continue
|
||||
|
||||
files_found = True
|
||||
with open(file_path, "r", encoding="utf-8") as file:
|
||||
text = file.readlines()
|
||||
|
||||
text = "".join(text)
|
||||
|
||||
output_docs = converters.conll_ner2json(
|
||||
input_data=text, n_sents=None, no_print=True
|
||||
)
|
||||
|
||||
# TODO: Translate to InputSample
|
||||
if not files_found:
|
||||
raise FileNotFoundError(f"No files found for pattern {self.glob_pattern}")
|
|
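A possible usage sketch for this formatter; note that to_input_samples is still a stub (see the TODO above), so its return value is not yet meaningful:

```python
from presidio_evaluator.dataset_formatters import CONLL2003Formatter

# Download the three CoNLL-2003 folds into the default local path, then read one fold
CONLL2003Formatter.download()
formatter = CONLL2003Formatter()
samples = formatter.to_input_samples(fold="train")
```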
@ -0,0 +1,14 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
|
||||
|
||||
class DatasetFormatter(ABC):
|
||||
@abstractmethod
|
||||
def to_input_samples(self) -> List[InputSample]:
|
||||
"""
|
||||
Translate a dataset structure into a list of InputSample objects, to be used by models and for evaluation
|
||||
:return:
|
||||
"""
|
||||
pass
|
|
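A new corpus can be plugged into the evaluation flow by subclassing this base class; a minimal illustrative sketch (the CSV layout here is hypothetical):

```python
from pathlib import Path
from typing import List

from presidio_evaluator import InputSample
from presidio_evaluator.dataset_formatters import DatasetFormatter


class MyCsvFormatter(DatasetFormatter):
    """Hypothetical formatter for a CSV file with text and span columns."""

    def __init__(self, path: Path):
        self.path = path

    def to_input_samples(self) -> List[InputSample]:
        samples = []
        # parse self.path row by row and build InputSample objects here
        return samples
```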
@ -0,0 +1,5 @@
|
|||
from .model_error import ModelError
|
||||
from .evaluation_result import EvaluationResult
|
||||
from .evaluator import Evaluator
|
||||
|
||||
__all__ = ["ModelError", "EvaluationResult", "Evaluator"]
|
|
@ -0,0 +1,51 @@
|
|||
from collections import Counter
|
||||
from typing import List, Optional
|
||||
from presidio_evaluator.evaluation import ModelError
|
||||
|
||||
|
||||
class EvaluationResult(object):
|
||||
def __init__(
|
||||
self,
|
||||
results: Counter,
|
||||
model_errors: Optional[List[ModelError]] = None,
|
||||
text: str = None,
|
||||
):
|
||||
"""
|
||||
Holds the output of a comparison between ground truth and predicted
|
||||
:param results: List of objects of type Counter
|
||||
with structure {(actual, predicted) : count}
|
||||
:param model_errors: List of specific model errors for further inspection
|
||||
:param text: sample's full text (if used for one sample)
|
||||
"""
|
||||
|
||||
self.results = results
|
||||
self.model_errors = model_errors
|
||||
self.text = text
|
||||
|
||||
self.pii_recall = None
|
||||
self.pii_precision = None
|
||||
self.pii_f = None
|
||||
self.entity_recall_dict = None
|
||||
self.entity_precision_dict = None
|
||||
|
||||
def print(self):
|
||||
recall_dict = self.entity_recall_dict
|
||||
precision_dict = self.entity_precision_dict
|
||||
|
||||
recall_dict["PII"] = self.pii_recall
|
||||
precision_dict["PII"] = self.pii_precision
|
||||
|
||||
entities = recall_dict.keys()
|
||||
recall = recall_dict.values()
|
||||
precision = precision_dict.values()
|
||||
|
||||
row_format = "{:>30}{:>30.2%}{:>30.2%}"
|
||||
header_format = "{:>30}" * 3
|
||||
print(header_format.format(*("Entity", "Precision", "Recall")))
|
||||
for entity, precision, recall in zip(entities, precision, recall):
|
||||
print(row_format.format(entity, precision, recall))
|
||||
|
||||
print("PII F measure: {}".format(self.pii_f))
|
||||
|
||||
def __repr__(self):
|
||||
return f"stats={self.results}"
|
|
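A short sketch of how an EvaluationResult is typically consumed once Evaluator.calculate_score (defined in the next file) has filled in the aggregate fields:

```python
# `score` is assumed to be an EvaluationResult returned by Evaluator.calculate_score
print(score.pii_precision, score.pii_recall, score.pii_f)
print(score.entity_recall_dict)   # recall per entity type
score.print()                     # formatted precision/recall table
```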
@ -0,0 +1,318 @@
|
|||
from collections import Counter
|
||||
from typing import List, Optional, Dict
|
||||
|
||||
import numpy as np
|
||||
from tqdm import tqdm
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.evaluation import EvaluationResult, ModelError
|
||||
from presidio_evaluator.models import BaseModel, PresidioAnalyzerWrapper
|
||||
|
||||
|
||||
class Evaluator:
|
||||
def __init__(
|
||||
self,
|
||||
model: BaseModel,
|
||||
verbose: bool = False,
|
||||
compare_by_io=True,
|
||||
entities_to_keep: Optional[List[str]] = None,
|
||||
):
|
||||
"""
|
||||
Evaluate a PII detection model or a Presidio analyzer / recognizer
|
||||
|
||||
:param model: Instance of a fitted model (of base type BaseModel)
|
||||
:param compare_by_io: True if comparison should be done on the entity
|
||||
level and not the sub-entity level
|
||||
:param entities_to_keep: List of entity names to focus the evaluator on (and ignore the rest).
|
||||
Default is None = all entities. If the provided model has a list of entities to keep,
|
||||
this list would be used for evaluation.
|
||||
"""
|
||||
self.model = model
|
||||
self.verbose = verbose
|
||||
self.compare_by_io = compare_by_io
|
||||
self.entities_to_keep = entities_to_keep
|
||||
if self.entities_to_keep is None and self.model.entities:
|
||||
self.entities_to_keep = self.model.entities
|
||||
|
||||
def compare(self, input_sample: InputSample, prediction: List[str]):
|
||||
|
||||
"""
|
||||
Compares ground truth tags (annotation) and predicted (prediction)
|
||||
:param input_sample: input sample containing a list of tags in the self.labeling_scheme format
:param prediction: predicted value for each token
|
||||
|
||||
"""
|
||||
annotation = input_sample.tags
|
||||
tokens = input_sample.tokens
|
||||
|
||||
if len(annotation) != len(prediction):
|
||||
print(
|
||||
"Annotation and prediction do not have the"
|
||||
"same length. Sample={}".format(input_sample)
|
||||
)
|
||||
return Counter(), []
|
||||
|
||||
results = Counter()
|
||||
mistakes = []
|
||||
|
||||
new_annotation = annotation.copy()
|
||||
|
||||
if self.compare_by_io:
|
||||
new_annotation = self._to_io(new_annotation)
|
||||
prediction = self._to_io(prediction)
|
||||
|
||||
# Ignore annotations that aren't in the list of
|
||||
# requested entities.
|
||||
if self.entities_to_keep:
|
||||
prediction = self._adjust_per_entities(prediction)
|
||||
new_annotation = self._adjust_per_entities(new_annotation)
|
||||
for i in range(0, len(new_annotation)):
|
||||
results[(new_annotation[i], prediction[i])] += 1
|
||||
|
||||
if self.verbose:
|
||||
print("Annotation:", new_annotation[i])
|
||||
print("Prediction:", prediction[i])
|
||||
print(results)
|
||||
|
||||
# check if there was an error
|
||||
is_error = new_annotation[i] != prediction[i]
|
||||
if is_error:
|
||||
if prediction[i] == "O":
|
||||
mistakes.append(
|
||||
ModelError(
|
||||
"FN",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata,
|
||||
)
|
||||
)
|
||||
elif new_annotation[i] == "O":
|
||||
mistakes.append(
|
||||
ModelError(
|
||||
"FP",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata,
|
||||
)
|
||||
)
|
||||
else:
|
||||
mistakes.append(
|
||||
ModelError(
|
||||
"Wrong entity",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata,
|
||||
)
|
||||
)
|
||||
|
||||
return results, mistakes
|
||||
|
||||
def _adjust_per_entities(self, tags):
|
||||
if self.entities_to_keep:
|
||||
return [tag if tag in self.entities_to_keep else "O" for tag in tags]
|
||||
|
||||
@staticmethod
|
||||
def _to_io(tags):
|
||||
"""
|
||||
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
|
||||
['B-PERSON','I-PERSON','L-PERSON'] is translated into
|
||||
['PERSON','PERSON','PERSON']
|
||||
:param tags: the input tags in BILOU/IOB/BIO format
|
||||
:return: a new list of IO tags
|
||||
"""
|
||||
return [tag[2:] if "-" in tag else tag for tag in tags]
|
||||
|
||||
def evaluate_sample(
|
||||
self, sample: InputSample, prediction: List[str]
|
||||
) -> EvaluationResult:
|
||||
if self.verbose:
|
||||
print("Input sentence: {}".format(sample.full_text))
|
||||
|
||||
results, mistakes = self.compare(input_sample=sample, prediction=prediction)
|
||||
return EvaluationResult(results, mistakes, sample.full_text)
|
||||
|
||||
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
|
||||
evaluation_results = []
|
||||
for sample in tqdm(dataset, desc="Evaluating {}".format(self.__class__)):
|
||||
prediction = self.model.predict(sample)
|
||||
evaluation_result = self.evaluate_sample(
|
||||
sample=sample, prediction=prediction
|
||||
)
|
||||
evaluation_results.append(evaluation_result)
|
||||
|
||||
return evaluation_results
|
||||
|
||||
@staticmethod
|
||||
def align_input_samples_to_presidio_analyzer(
|
||||
input_samples: List[InputSample],
|
||||
entities_mapping: Dict[
|
||||
str, str
|
||||
] = PresidioAnalyzerWrapper.presidio_entities_map,
|
||||
) -> List[InputSample]:
|
||||
"""
|
||||
Change input samples to conform with Presidio's entities
|
||||
:return: new list of InputSample
|
||||
"""
|
||||
|
||||
new_input_samples = input_samples.copy()
|
||||
|
||||
# A list that will contain updated input samples,
|
||||
new_list = []
|
||||
|
||||
# Iterate on all samples
|
||||
for input_sample in new_input_samples:
|
||||
contains_presidio_field = False
|
||||
new_spans = []
|
||||
# Update spans to match Presidio's entity name
|
||||
for span in input_sample.spans:
|
||||
in_presidio_field = False
|
||||
if span.entity_type in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(span.entity_type)
|
||||
span.entity_type = new_name
|
||||
contains_presidio_field = True
|
||||
|
||||
# Add to new span list, if the span contains an entity relevant to Presidio
|
||||
new_spans.append(span)
|
||||
input_sample.spans = new_spans
|
||||
|
||||
# Update tags in case this sample has relevant entities for evaluation
|
||||
if contains_presidio_field:
|
||||
for i, tag in enumerate(input_sample.tags):
|
||||
has_prefix = "-" in tag
|
||||
if has_prefix:
|
||||
prefix = tag[:2]
|
||||
clean = tag[2:]
|
||||
else:
|
||||
prefix = ""
|
||||
clean = tag
|
||||
|
||||
if clean in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(clean)
|
||||
input_sample.tags[i] = "{}{}".format(prefix, new_name)
|
||||
else:
|
||||
input_sample.tags[i] = "O"
|
||||
|
||||
new_list.append(input_sample)
|
||||
return new_list
|
||||
|
||||
def calculate_score(
|
||||
self,
|
||||
evaluation_results: List[EvaluationResult],
|
||||
entities: Optional[List[str]] = None,
|
||||
beta: float = 1,
|
||||
) -> EvaluationResult:
|
||||
"""
|
||||
Returns the pii_precision, pii_recall and f_measure either for each entity
|
||||
or for all entities (ignore_entity_type = True)
|
||||
:param evaluation_results: List of EvaluationResult
|
||||
:param entities: List of entities to calculate the score for. Default is None: all entities
:param beta: F measure beta value
|
||||
:return: EvaluationResult with precision, recall and f measures
|
||||
"""
|
||||
|
||||
# aggregate results
|
||||
all_results = sum([er.results for er in evaluation_results], Counter())
|
||||
|
||||
# compute pii_recall per entity
|
||||
entity_recall = {}
|
||||
entity_precision = {}
|
||||
if not entities:
|
||||
entities = list(set([x[0] for x in all_results.keys() if x[0] != "O"]))
|
||||
|
||||
for entity in entities:
|
||||
# all annotation of given type
|
||||
annotated = sum([all_results[x] for x in all_results if x[0] == entity])
|
||||
predicted = sum([all_results[x] for x in all_results if x[1] == entity])
|
||||
tp = all_results[(entity, entity)]
|
||||
|
||||
if annotated > 0:
|
||||
entity_recall[entity] = tp / annotated
|
||||
else:
|
||||
entity_recall[entity] = np.NaN
|
||||
|
||||
if predicted > 0:
|
||||
per_entity_tp = all_results[(entity, entity)]
|
||||
entity_precision[entity] = per_entity_tp / predicted
|
||||
else:
|
||||
entity_precision[entity] = np.NaN
|
||||
|
||||
# compute pii_precision and pii_recall
|
||||
annotated_all = sum([all_results[x] for x in all_results if x[0] != "O"])
|
||||
predicted_all = sum([all_results[x] for x in all_results if x[1] != "O"])
|
||||
if annotated_all > 0:
|
||||
pii_recall = (
|
||||
sum(
|
||||
[
|
||||
all_results[x]
|
||||
for x in all_results
|
||||
if (x[0] != "O" and x[1] != "O")
|
||||
]
|
||||
)
|
||||
/ annotated_all
|
||||
)
|
||||
else:
|
||||
pii_recall = np.NaN
|
||||
if predicted_all > 0:
|
||||
pii_precision = (
|
||||
sum(
|
||||
[
|
||||
all_results[x]
|
||||
for x in all_results
|
||||
if (x[0] != "O" and x[1] != "O")
|
||||
]
|
||||
)
|
||||
/ predicted_all
|
||||
)
|
||||
else:
|
||||
pii_precision = np.NaN
|
||||
# compute pii_f_beta-score
|
||||
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
|
||||
|
||||
# aggregate errors
|
||||
errors = []
|
||||
for res in evaluation_results:
|
||||
if res.model_errors:
|
||||
errors.extend(res.model_errors)
|
||||
|
||||
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
|
||||
evaluation_result.pii_precision = pii_precision
|
||||
evaluation_result.pii_recall = pii_recall
|
||||
evaluation_result.entity_recall_dict = entity_recall
|
||||
evaluation_result.entity_precision_dict = entity_precision
|
||||
evaluation_result.pii_f = pii_f_beta
|
||||
|
||||
return evaluation_result
|
||||
|
||||
@staticmethod
|
||||
def precision(tp: int, fp: int) -> float:
|
||||
return tp / (tp + fp + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def recall(tp: int, fn: int) -> float:
|
||||
return tp / (tp + fn + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def f_beta(precision: float, recall: float, beta: float) -> float:
|
||||
"""
|
||||
Returns the F score for precision, recall and a beta parameter
|
||||
:param precision: a float with the precision value
|
||||
:param recall: a float with the recall value
|
||||
:param beta: a float with the beta parameter of the F measure,
|
||||
which gives more or less weight to precision
|
||||
vs. recall
|
||||
:return: a float value of the f(beta) measure.
|
||||
"""
|
||||
if np.isnan(precision) or np.isnan(recall) or (precision == 0 and recall == 0):
|
||||
return np.nan
|
||||
|
||||
return ((1 + beta ** 2) * precision * recall) / (
|
||||
((beta ** 2) * precision) + recall
|
||||
)
|
|
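Tying it together, a hedged end-to-end sketch of evaluating a spaCy model with this class; the dataset path and entity list are assumptions:

```python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import SpacyModel

dataset = read_synth_dataset("../data/synth_dataset.txt")  # path is an assumption

model = SpacyModel(model_name="en_core_web_lg", entities_to_keep=["PERSON", "GPE"])
evaluator = Evaluator(model=model)

per_sample = evaluator.evaluate_all(dataset)
score = evaluator.calculate_score(per_sample, beta=2)  # beta > 1 favors recall
score.print()

# F-beta intuition: with precision=0.8, recall=0.6, beta=2:
# (1 + 4) * 0.8 * 0.6 / (4 * 0.8 + 0.6) is roughly 0.63
```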
@ -0,0 +1,174 @@
|
|||
from typing import Dict, List
|
||||
|
||||
from presidio_evaluator.data_objects import SimpleToken
|
||||
|
||||
import pandas as pd
|
||||
|
||||
|
||||
class ModelError:
|
||||
def __init__(
|
||||
self,
|
||||
error_type: str,
|
||||
annotation: str,
|
||||
prediction: str,
|
||||
token: SimpleToken,
|
||||
full_text: str,
|
||||
metadata: Dict,
|
||||
):
|
||||
"""
|
||||
Holds information about an error a model made for analysis purposes
|
||||
:param error_type: str, e.g. FP, FN, Person->Address etc.
|
||||
:param annotation: ground truth value
|
||||
:param prediction: predicted value
|
||||
:param token: token in question
|
||||
:param full_text: full input text
|
||||
:param metadata: metadata on text from InputSample
|
||||
"""
|
||||
|
||||
self.error_type = error_type
|
||||
self.annotation = annotation
|
||||
self.prediction = prediction
|
||||
self.token = token
|
||||
self.full_text = full_text
|
||||
self.metadata = metadata
|
||||
|
||||
def __str__(self):
|
||||
return (
|
||||
"type: {}, "
|
||||
"Annotation = {}, "
|
||||
"prediction = {}, "
|
||||
"Token = {}, "
|
||||
"Full text = {}, "
|
||||
"Metadata = {}".format(
|
||||
self.error_type,
|
||||
self.annotation,
|
||||
self.prediction,
|
||||
self.token,
|
||||
self.full_text,
|
||||
self.metadata,
|
||||
)
|
||||
)
|
||||
|
||||
def __repr__(self):
|
||||
return r"<ModelError {{0}}>".format(self.__str__())
|
||||
|
||||
@staticmethod
|
||||
def most_common_fp_tokens(errors: List["ModelError"], n: int = 10, entity=None):
|
||||
"""
|
||||
Print the n most common false positive tokens (tokens thought to be an entity)
|
||||
"""
|
||||
fps = ModelError.get_false_positives(errors, entity)
|
||||
|
||||
tokens = [err.token.text for err in fps]
|
||||
from collections import Counter
|
||||
|
||||
by_frequency = Counter(tokens)
|
||||
most_common = by_frequency.most_common(n)
|
||||
print("Most common false positive tokens:")
|
||||
print(most_common)
|
||||
print("Example sentence with each FP token:")
|
||||
for tok, val in most_common:
|
||||
with_tok = [err for err in fps if err.token.text == tok]
|
||||
print(with_tok[0].full_text)
|
||||
|
||||
@staticmethod
|
||||
def most_common_fn_tokens(errors: List["ModelError"], n: int = 10, entity=None):
|
||||
"""
|
||||
Print all tokens that were missed by the model, including an example of the full text in which they appear
|
||||
"""
|
||||
fns = ModelError.get_false_negatives(errors, entity)
|
||||
|
||||
fns_tokens = [err.token.text for err in fns]
|
||||
from collections import Counter
|
||||
|
||||
by_frequency_fns = Counter(fns_tokens)
|
||||
most_common_fns = by_frequency_fns.most_common(50)
|
||||
print(most_common_fns)
|
||||
for tok, val in most_common_fns:
|
||||
with_tok = [err for err in fns if err.token.text == tok]
|
||||
print(
|
||||
"Token: {}, Annotation: {}, Full text: {}".format(
|
||||
with_tok[0].token, with_tok[0].annotation, with_tok[0].full_text
|
||||
)
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def get_errors_df(
|
||||
errors=List["ModelError"], entity: List[str] = None, error_type: str = "FN"
|
||||
):
|
||||
"""
|
||||
Get ModelErrors as pd.DataFrame
|
||||
"""
|
||||
if error_type == "FN":
|
||||
filtered_errors = ModelError.get_false_negatives(errors, entity)
|
||||
elif error_type == "FP":
|
||||
filtered_errors = ModelError.get_false_positives(errors, entity)
|
||||
else:
|
||||
raise ValueError("error_type should be either FP or FN")
|
||||
|
||||
if len(filtered_errors) == 0:
|
||||
print(
|
||||
"No errors of type {} and entity {} were found".format(
|
||||
error_type, entity
|
||||
)
|
||||
)
|
||||
return None
|
||||
|
||||
errors_df = pd.DataFrame.from_records(
|
||||
[error.__dict__ for error in filtered_errors]
|
||||
)
|
||||
metadata_df = pd.DataFrame(errors_df["metadata"].tolist())
|
||||
errors_df.drop(["metadata"], axis=1, inplace=True)
|
||||
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
|
||||
return new_errors_df
|
||||
|
||||
@staticmethod
|
||||
def get_fps_dataframe(errors: List["ModelError"], entity: str = None):
|
||||
"""
|
||||
Get false positive ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelError.get_errors_df(errors, entity, error_type="FP")
|
||||
|
||||
@staticmethod
|
||||
def get_fns_dataframe(errors: List["ModelError"], entity: str = None):
|
||||
"""
|
||||
Get false negative ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelError.get_errors_df(errors, entity, error_type="FN")
|
||||
|
||||
@staticmethod
|
||||
def get_false_positives(errors: List["ModelError"], entity=None):
|
||||
"""
|
||||
Get a list of all false positive errors in the results
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
|
||||
if entity:
|
||||
return [
|
||||
model_error
|
||||
for model_error in errors
|
||||
if model_error.error_type == "FP" and model_error.prediction in entity
|
||||
]
|
||||
else:
|
||||
return [
|
||||
model_error for model_error in errors if model_error.error_type == "FP"
|
||||
]
|
||||
|
||||
@staticmethod
|
||||
def get_false_negatives(errors: List["ModelError"], entity=None):
|
||||
"""
|
||||
Get a list of all false negative errors in the results (false negatives and wrong entity detections)
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
if entity:
|
||||
return [
|
||||
model_error
|
||||
for model_error in errors
|
||||
if model_error.error_type != "FP" and model_error.annotation in entity
|
||||
]
|
||||
else:
|
||||
return [
|
||||
model_error for model_error in errors if model_error.error_type != "FP"
|
||||
]
|
|
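A brief sketch of how these helpers can be used to inspect errors after scoring; `score` is assumed to be an EvaluationResult with model_errors populated:

```python
from presidio_evaluator.evaluation import ModelError

errors = score.model_errors

ModelError.most_common_fp_tokens(errors, n=10)
fns_df = ModelError.get_fns_dataframe(errors, entity="PERSON")
if fns_df is not None:
    print(fns_df[["token", "annotation", "prediction", "full_text"]].head())
```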
@ -0,0 +1,159 @@
|
|||
"""E2E scoring pipelines for the different models"""
|
||||
|
||||
import math
|
||||
from typing import List, Optional
|
||||
|
||||
from presidio_analyzer import EntityRecognizer
|
||||
from presidio_analyzer.nlp_engine import SpacyNlpEngine
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from presidio_evaluator.evaluation import EvaluationResult, Evaluator
|
||||
from presidio_evaluator.models import (
|
||||
PresidioRecognizerWrapper,
|
||||
PresidioAnalyzerWrapper,
|
||||
BaseModel,
|
||||
)
|
||||
|
||||
|
||||
def score_model(
|
||||
model: BaseModel,
|
||||
entities_to_keep: List[str],
|
||||
input_samples: List[InputSample],
|
||||
verbose: bool = False,
|
||||
beta: float = 2.5,
|
||||
) -> EvaluationResult:
|
||||
"""
|
||||
Run data through a model and gather results and stats
|
||||
"""
|
||||
|
||||
print("Evaluating samples")
|
||||
|
||||
evaluator = Evaluator(model=model, entities_to_keep=entities_to_keep)
|
||||
evaluated_samples = evaluator.evaluate_all(input_samples)
|
||||
|
||||
print("Estimating metrics")
|
||||
evaluation_result = evaluator.calculate_score(
|
||||
evaluation_results=evaluated_samples, beta=beta
|
||||
)
|
||||
precision = evaluation_result.pii_precision
|
||||
recall = evaluation_result.pii_recall
|
||||
entity_recall = evaluation_result.entity_recall_dict
|
||||
entity_precision = evaluation_result.entity_precision_dict
|
||||
f = evaluation_result.pii_f
|
||||
errors = evaluation_result.model_errors
|
||||
#
|
||||
print(f"precision: {precision}")
|
||||
print(f"Recall: {recall}")
|
||||
print(f"F {beta}: {f}")
|
||||
print(f"Precision per entity: {entity_precision}")
|
||||
print(f"Recall per entity: {entity_recall}")
|
||||
|
||||
if verbose:
|
||||
|
||||
false_negatives = [
|
||||
str(mistake) for mistake in errors if mistake.error_type == "FN"
|
||||
]
|
||||
false_positives = [
|
||||
str(mistake) for mistake in errors if mistake.error_type == "FP"
|
||||
]
|
||||
other_mistakes = [
|
||||
str(mistake) for mistake in errors if mistake.error_type not in ["FN", "FP"]
|
||||
]
|
||||
|
||||
print("False negatives: ")
|
||||
print("\n".join(false_negatives))
|
||||
print("\n******************\n")
|
||||
|
||||
print("False positives: ")
|
||||
print("\n".join(false_positives))
|
||||
print("\n******************\n")
|
||||
|
||||
print("Other mistakes: ")
|
||||
print("\n".join(other_mistakes))
|
||||
|
||||
return evaluation_result
|
||||
|
||||
|
||||
def score_presidio_recognizer(
|
||||
recognizer: EntityRecognizer,
|
||||
entities_to_keep: List[str],
|
||||
input_samples: Optional[List[InputSample]] = None,
|
||||
labeling_scheme: str = "BILUO",
|
||||
with_nlp_artifacts: bool = False,
|
||||
verbose: bool = False,
|
||||
) -> EvaluationResult:
|
||||
"""
|
||||
Run data through one EntityRecognizer and gather results and stats
|
||||
"""
|
||||
|
||||
if not input_samples:
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../../data/synth_dataset.txt")
|
||||
else:
|
||||
input_samples = list(input_samples)
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(input_samples)
|
||||
|
||||
model = PresidioRecognizerWrapper(
|
||||
recognizer=recognizer,
|
||||
entities_to_keep=entities_to_keep,
|
||||
labeling_scheme=labeling_scheme,
|
||||
nlp_engine=SpacyNlpEngine(),
|
||||
with_nlp_artifacts=with_nlp_artifacts,
|
||||
)
|
||||
return score_model(
|
||||
model=model,
|
||||
entities_to_keep=entities_to_keep,
|
||||
input_samples=updated_samples,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
|
||||
def score_presidio_analyzer(
|
||||
input_samples: Optional[List[InputSample]] = None,
|
||||
entities_to_keep: Optional[List[str]] = None,
|
||||
labeling_scheme: str = "BILUO",
|
||||
verbose: bool = True,
|
||||
) -> EvaluationResult:
|
||||
""""""
|
||||
if not input_samples:
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../../data/synth_dataset.txt")
|
||||
else:
|
||||
input_samples = list(input_samples)
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(input_samples)
|
||||
|
||||
flatten = lambda l: [item for sublist in l for item in sublist]
|
||||
from collections import Counter
|
||||
|
||||
count_per_entity = Counter(
|
||||
[
|
||||
span.entity_type
|
||||
for span in flatten(
|
||||
[input_sample.spans for input_sample in updated_samples]
|
||||
)
|
||||
]
|
||||
)
|
||||
if verbose:
|
||||
print("Count per entity:")
|
||||
print(count_per_entity)
|
||||
analyzer = PresidioAnalyzerWrapper(
|
||||
entities_to_keep=entities_to_keep, labeling_scheme=labeling_scheme
|
||||
)
|
||||
|
||||
return score_model(
|
||||
model=analyzer,
|
||||
entities_to_keep=list(count_per_entity.keys()),
|
||||
input_samples=updated_samples,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
score_presidio_analyzer()
|
|
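These pipelines can also be called directly, e.g. to score a single predefined recognizer; the import path of the scoring helpers and the dataset path are assumptions:

```python
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.data_generator import read_synth_dataset
# Module path of the scoring helpers is an assumption
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

samples = read_synth_dataset("../data/synth_dataset.txt")
result = score_presidio_recognizer(
    recognizer=CreditCardRecognizer(),
    entities_to_keep=["CREDIT_CARD"],
    input_samples=samples,
)
print(result.pii_f)
```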
@ -1,398 +0,0 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List, Tuple, Dict
|
||||
from collections import Counter
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from presidio_evaluator import InputSample, EvaluationResult, ModelError
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
class ModelEvaluator(ABC):
|
||||
|
||||
def __init__(self, entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
use_spans: bool = False, labeling_scheme="BIO",
|
||||
compare_by_io=True):
|
||||
|
||||
"""
|
||||
Abstract class for evaluating NER models and others
|
||||
:param entities_to_keep: Which entities should be evaluated? All other
|
||||
entities are ignored. If None, none are filtered
|
||||
:param verbose: Whether to print more debug info
|
||||
:param labeling_scheme: Type of scheme used for labeling (BILOU,
|
||||
BIO/LOB or IO)
|
||||
:param compare_by_io: True if comparison should be done on the entity
|
||||
level and not the sub-entity level
|
||||
|
||||
"""
|
||||
self.entities = entities_to_keep
|
||||
self.verbose = verbose
|
||||
self.use_spans = use_spans
|
||||
self.compare_by_io = compare_by_io
|
||||
self.labeling_scheme = labeling_scheme
|
||||
|
||||
@abstractmethod
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
"""
|
||||
Abstract. Returns the predicted tokens/spans from the evaluated model
|
||||
:param sample: Sample to be evaluated
|
||||
:return: if self.use spans: list of spans
|
||||
if not self.use_spans: tags in self.labeling_scheme format
|
||||
"""
|
||||
pass
|
||||
|
||||
def compare(self, input_sample: InputSample, prediction: List[str]):
|
||||
|
||||
"""
|
||||
Compares gound truth tags (annotation) and predicted (prediction)
|
||||
:param input_sample: input sample containing list of tags with scheme
|
||||
:param prediction: predicted value for each token
|
||||
self.labeling_scheme
|
||||
|
||||
"""
|
||||
annotation = input_sample.tags
|
||||
tokens = input_sample.tokens
|
||||
|
||||
if len(annotation) != len(prediction):
|
||||
print("Annotation and prediction do not have the"
|
||||
"same length. Sample={}".format(input_sample))
|
||||
return Counter(), []
|
||||
|
||||
results = Counter()
|
||||
mistakes = []
|
||||
|
||||
new_annotation = annotation.copy()
|
||||
|
||||
if self.compare_by_io:
|
||||
new_annotation = self._to_io(new_annotation)
|
||||
prediction = self._to_io(prediction)
|
||||
|
||||
# Ignore annotations that aren't in the list of
|
||||
# requested entities.
|
||||
if self.entities:
|
||||
prediction = self._adjust_per_entities(prediction)
|
||||
new_annotation = self._adjust_per_entities(new_annotation)
|
||||
for i in range(0, len(new_annotation)):
|
||||
results[(new_annotation[i], prediction[i])] += 1
|
||||
|
||||
if self.verbose:
|
||||
print('Annotation:', new_annotation[i])
|
||||
print('Prediction:', prediction[i])
|
||||
print(results)
|
||||
|
||||
# check if there was an error
|
||||
is_error = (new_annotation[i] != prediction[i])
|
||||
if is_error:
|
||||
if prediction[i] == 'O':
|
||||
mistakes.append(ModelError("FN",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
elif new_annotation[i] == 'O':
|
||||
mistakes.append(ModelError("FP",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
else:
|
||||
mistakes.append(ModelError("Wrong entity",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
|
||||
return results, mistakes
|
||||
|
||||
def _adjust_per_entities(self, tags):
|
||||
if self.entities:
|
||||
return [tag if tag in self.entities else 'O' for tag in tags]
|
||||
|
||||
@staticmethod
|
||||
def _to_io(tags):
|
||||
"""
|
||||
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
|
||||
['B-PERSON','I-PERSON','L-PERSON'] is translated into
|
||||
['PERSON','PERSON','PERSON']
|
||||
:param tags: the input tags in BILOU/IOB/BIO format
|
||||
:return: a new list of IO tags
|
||||
"""
|
||||
return [tag[2:] if '-' in tag else tag for tag in tags]
|
||||
|
||||
def evaluate_sample(self, sample: InputSample) -> EvaluationResult:
|
||||
if self.verbose:
|
||||
print("Input sentence: {}".format(sample.full_text))
|
||||
|
||||
prediction = self.predict(sample)
|
||||
results, mistakes = self.compare(
|
||||
input_sample=sample,
|
||||
prediction=prediction)
|
||||
return EvaluationResult(results, mistakes, sample.full_text)
|
||||
|
||||
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
|
||||
evaluation_results = []
|
||||
for sample in tqdm(dataset, desc='Evaluating {}'.format(self.__class__)):
|
||||
evaluation_result = self.evaluate_sample(sample)
|
||||
evaluation_results.append(evaluation_result)
|
||||
|
||||
return evaluation_results
|
||||
|
||||
def calculate_score(self, evaluation_results: List[
|
||||
EvaluationResult], beta: float = 1) \
|
||||
-> EvaluationResult:
|
||||
"""
|
||||
Returns the pii_precision, pii_recall and f_measure either for each entity
|
||||
or for all entities (ignore_entity_type = True)
|
||||
:param evaluation_results: List of EvaluationResult
|
||||
:param beta: F measure beta value
|
||||
between different entity types, or to treat these as misclassifications
|
||||
:return: EvaluationResult with precision, recall and f measures
|
||||
"""
|
||||
|
||||
# aggregate results
|
||||
all_results = sum([er.results for er in evaluation_results], Counter())
|
||||
|
||||
# compute pii_recall per entity
|
||||
entity_recall = {}
|
||||
entity_precision = {}
|
||||
if self.entities:
|
||||
entities = self.entities
|
||||
else:
|
||||
entities = list(
|
||||
set([x[0] for x in all_results.keys() if x[0] != 'O']))
|
||||
|
||||
for entity in entities:
|
||||
# all annotation of given type
|
||||
annotated = sum(
|
||||
[all_results[x] for x in all_results if x[0] == entity])
|
||||
predicted = sum(
|
||||
[all_results[x] for x in all_results if x[1] == entity])
|
||||
tp = all_results[(entity, entity)]
|
||||
|
||||
if annotated > 0:
|
||||
entity_recall[entity] = tp / annotated
|
||||
else:
|
||||
entity_recall[entity] = np.NaN
|
||||
|
||||
if predicted > 0:
|
||||
per_entity_tp = all_results[(entity, entity)]
|
||||
entity_precision[entity] = per_entity_tp / predicted
|
||||
else:
|
||||
entity_precision[entity] = np.NaN
|
||||
|
||||
# compute pii_precision and pii_recall
|
||||
annotated_all = sum(
|
||||
[all_results[x] for x in all_results if x[0] != 'O'])
|
||||
predicted_all = sum(
|
||||
[all_results[x] for x in all_results if x[1] != 'O'])
|
||||
if annotated_all > 0:
|
||||
pii_recall = sum([all_results[x] for x in all_results if
|
||||
(x[0] != 'O' and x[1] != 'O')]) / annotated_all
|
||||
else:
|
||||
pii_recall = np.NaN
|
||||
if predicted_all > 0:
|
||||
pii_precision = sum([all_results[x] for x in all_results if
|
||||
(x[0] != 'O' and x[1] != 'O')]) / predicted_all
|
||||
else:
|
||||
pii_precision = np.NaN
|
||||
# compute pii_f_beta-score
|
||||
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
|
||||
|
||||
# aggregate errors
|
||||
errors = []
|
||||
for res in evaluation_results:
|
||||
if res.model_errors:
|
||||
errors.extend(res.model_errors)
|
||||
|
||||
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
|
||||
evaluation_result.pii_precision = pii_precision
|
||||
evaluation_result.pii_recall = pii_recall
|
||||
evaluation_result.entity_recall_dict = entity_recall
|
||||
evaluation_result.entity_precision_dict = entity_precision
|
||||
evaluation_result.pii_f = pii_f_beta
|
||||
|
||||
return evaluation_result
|
||||
|
||||
@staticmethod
|
||||
def precision(tp: int, fp: int) -> float:
|
||||
return tp / (tp + fp + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def recall(tp: int, fn: int) -> float:
|
||||
return tp / (tp + fn + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def f_beta(precision: float, recall: float, beta: float) -> float:
|
||||
"""
|
||||
Returns the F score for precision, recall and a beta parameter
|
||||
:param precision: a float with the precision value
|
||||
:param recall: a float with the recall value
|
||||
:param beta: a float with the beta parameter of the F measure,
|
||||
which gives more or less weight to precision
|
||||
vs. recall
|
||||
:return: a float value of the f(beta) measure.
|
||||
"""
|
||||
if np.isnan(precision) or np.isnan(recall) or (
|
||||
precision == 0 and recall == 0):
|
||||
return np.nan
|
||||
|
||||
return ((1 + beta ** 2) * precision * recall) / (
|
||||
((beta ** 2) * precision) + recall)
|
||||
|
||||
@staticmethod
|
||||
def align_input_samples_to_presidio_analyzer(input_samples: List[InputSample],
|
||||
entities_mapping: Dict[str, str],
|
||||
presidio_fields: List[str]=None) \
|
||||
-> List[InputSample]:
|
||||
"""
|
||||
Change input samples to conform with Presidio's entities
|
||||
:return: new list of InputSample
|
||||
"""
|
||||
|
||||
new_input_samples = input_samples.copy()
|
||||
|
||||
# Match entity names to Presidio's
|
||||
if not presidio_fields:
|
||||
presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',
|
||||
'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']
|
||||
|
||||
# A list that will contain updated input samples,
|
||||
new_list = []
|
||||
|
||||
# Iterate on all samples
|
||||
for input_sample in new_input_samples:
|
||||
contains_presidio_field = False
|
||||
new_spans = []
|
||||
# Update spans to match Presidio's entity name
|
||||
for span in input_sample.spans:
|
||||
in_presidio_field = False
|
||||
if span.entity_type in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(span.entity_type)
|
||||
span.entity_type = new_name
|
||||
contains_presidio_field = True
|
||||
|
||||
# Add to new span list, if the span contains an entity relevant to Presidio
|
||||
new_spans.append(span)
|
||||
input_sample.spans = new_spans
|
||||
|
||||
# Update tags in case this sample has relevant entities for evaluation
|
||||
if contains_presidio_field:
|
||||
for i, tag in enumerate(input_sample.tags):
|
||||
has_prefix = '-' in tag
|
||||
if has_prefix:
|
||||
prefix = tag[:2]
|
||||
clean = tag[2:]
|
||||
else:
|
||||
prefix = ""
|
||||
clean = tag
|
||||
|
||||
if clean in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(clean)
|
||||
input_sample.tags[i] = "{}{}".format(prefix, new_name)
|
||||
else:
|
||||
input_sample.tags[i] = 'O'
|
||||
|
||||
new_list.append(input_sample)
|
||||
return new_list
|
||||
|
||||
@staticmethod
|
||||
def get_false_positives(errors=List[ModelError], entity=None):
|
||||
"""
|
||||
Get a list of all false positive errors in the results
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
|
||||
if entity:
|
||||
return [model_error for model_error in errors if
|
||||
model_error.error_type == 'FP' and model_error.prediction in entity]
|
||||
else:
|
||||
return [model_error for model_error in errors if model_error.error_type == 'FP']
|
||||
|
||||
@staticmethod
|
||||
def get_false_negatives(errors=List[ModelError], entity=None):
|
||||
"""
|
||||
Get a list of all false positive negative errors in the results (False negatives and wrong entity detection)
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
if entity:
|
||||
return [model_error for model_error in errors if
|
||||
model_error.error_type != 'FP' and model_error.annotation in entity]
|
||||
else:
|
||||
return [model_error for model_error in errors if model_error.error_type != 'FP']
|
||||
|
||||
@staticmethod
|
||||
def most_common_fp_tokens(errors=List[ModelError], n: int = 10, entity=None):
|
||||
"""
|
||||
Print the n most common false positive tokens (tokens thought to be an entity)
|
||||
"""
|
||||
fps = ModelEvaluator.get_false_positives(errors, entity)
|
||||
|
||||
tokens = [err.token.text for err in fps]
|
||||
from collections import Counter
|
||||
by_frequency = Counter(tokens)
|
||||
most_common = by_frequency.most_common(n)
|
||||
print("Most common false positive tokens:")
|
||||
print(most_common)
|
||||
print("Example sentence with each FP token:")
|
||||
for tok, val in most_common:
|
||||
with_tok = [err for err in fps if err.token.text == tok]
|
||||
print(with_tok[0].full_text)
|
||||
|
||||
@staticmethod
|
||||
def most_common_fn_tokens(errors=List[ModelError], n: int = 10, entity=None):
|
||||
"""
|
||||
Print all tokens that were missed by the model, including an example of the full text in which they appear
|
||||
"""
|
||||
fns = ModelEvaluator.get_false_negatives(errors, entity)
|
||||
|
||||
fns_tokens = [err.token.text for err in fns]
|
||||
from collections import Counter
|
||||
by_frequency_fns = Counter(fns_tokens)
|
||||
most_common_fns = by_frequency_fns.most_common(50)
|
||||
print(most_common_fns)
|
||||
for tok, val in most_common_fns:
|
||||
with_tok = [err for err in fns if err.token.text == tok]
|
||||
print("Token: {}, Annotation: {}, Full text: {}".format(with_tok[0].token, with_tok[0].annotation,
|
||||
with_tok[0].full_text))
|
||||
|
||||
@staticmethod
|
||||
def get_errors_df(errors=List[ModelError], entity: List[str] = None, error_type: str = 'FN'):
|
||||
"""
|
||||
Get ModelErrors as pd.DataFrame
|
||||
"""
|
||||
if error_type == 'FN':
|
||||
filtered_errors = ModelEvaluator.get_false_negatives(errors, entity)
|
||||
elif error_type == 'FP':
|
||||
filtered_errors = ModelEvaluator.get_false_positives(errors, entity)
|
||||
else:
|
||||
raise ValueError("error_type should be either FP or FN")
|
||||
|
||||
if len(filtered_errors) == 0:
|
||||
print("No errors of type {} and entity {} were found".format(error_type,entity))
|
||||
return None
|
||||
|
||||
errors_df = pd.DataFrame.from_records([error.__dict__ for error in filtered_errors])
|
||||
metadata_df = pd.DataFrame(errors_df['metadata'].tolist())
|
||||
errors_df.drop(['metadata'], axis=1, inplace=True)
|
||||
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
|
||||
return new_errors_df
|
||||
|
||||
@staticmethod
|
||||
def get_fps_dataframe(errors=List[ModelError], entity: List[str] = None):
|
||||
"""
|
||||
Get false positive ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelEvaluator.get_errors_df(errors, entity, error_type='FP')
|
||||
|
||||
@staticmethod
|
||||
def get_fns_dataframe(errors=List[ModelError], entity: List[str] = None):
|
||||
"""
|
||||
Get false negative ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelEvaluator.get_errors_df(errors, entity, error_type='FN')
|
|
@ -0,0 +1,15 @@
|
|||
from .base_model import BaseModel
|
||||
from .crf_model import CRFModel
|
||||
from .presidio_analyzer_wrapper import PresidioAnalyzerWrapper
|
||||
from .presidio_recognizer_wrapper import PresidioRecognizerWrapper
|
||||
from .spacy_model import SpacyModel
|
||||
from .flair_model import FlairModel
|
||||
|
||||
__all__ = [
|
||||
"BaseModel",
|
||||
"CRFModel",
|
||||
"PresidioRecognizerWrapper",
|
||||
"PresidioAnalyzerWrapper",
|
||||
"SpacyModel",
|
||||
"FlairModel",
|
||||
]
|
|
@ -0,0 +1,37 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
|
||||
|
||||
class BaseModel(ABC):
|
||||
def __init__(
|
||||
self,
|
||||
labeling_scheme: str = "BILUO",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
):
|
||||
|
||||
"""
|
||||
Abstract class for evaluating NER models and others
|
||||
:param entities_to_keep: Which entities should be evaluated? All other
|
||||
entities are ignored. If None, none are filtered
|
||||
:param labeling_scheme: Used to translate (if needed)
|
||||
the prediction to a specific scheme (IO, BIO/IOB, BILUO)
|
||||
:param verbose: Whether to print more debug info
|
||||
|
||||
|
||||
"""
|
||||
self.entities = entities_to_keep
|
||||
self.labeling_scheme = labeling_scheme
|
||||
self.verbose = verbose
|
||||
|
||||
@abstractmethod
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
"""
|
||||
Abstract. Returns the predicted tokens/spans from the evaluated model
|
||||
:param sample: Sample to be evaluated
|
||||
:return: if self.use spans: list of spans
|
||||
if not self.use_spans: tags in self.labeling_scheme format
|
||||
"""
|
||||
pass
|
|
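Any tagger can participate in the evaluation by subclassing BaseModel and implementing predict; a minimal toy sketch:

```python
from typing import List

from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class EverythingIsOModel(BaseModel):
    """Toy baseline that predicts 'O' (no entity) for every token."""

    def predict(self, sample: InputSample) -> List[str]:
        return ["O" for _ in sample.tokens]
```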
@ -0,0 +1,104 @@
|
|||
import pickle
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.models import BaseModel
|
||||
|
||||
|
||||
class CRFModel(BaseModel):
|
||||
def __init__(
|
||||
self,
|
||||
model_pickle_path: str = "../models/crf.pickle",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
):
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
if model_pickle_path is None:
|
||||
raise ValueError("model_pickle_path must be supplied")
|
||||
|
||||
with open(model_pickle_path, "rb") as f:
|
||||
self.model = pickle.load(f)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
tags = CRFModel.crf_predict(sample, self.model)
|
||||
|
||||
if self.entities:
|
||||
tags = [tag for tag in tags if tag in self.entities]
|
||||
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
# translated_tags = sample.rename_from_spacy_tags(tags)
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def crf_predict(sample, model):
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
conll = sample.to_conll(translate_tags=True)
|
||||
sentence = [(di["text"], di["pos"], di["label"]) for di in conll]
|
||||
features = CRFModel.sent2features(sentence)
|
||||
return model.predict([features])[0]
|
||||
|
||||
@staticmethod
|
||||
def word2features(sent, i):
|
||||
word = sent[i][0]
|
||||
postag = sent[i][1]
|
||||
|
||||
features = {
|
||||
"bias": 1.0,
|
||||
"word.lower()": word.lower(),
|
||||
"word[-3:]": word[-3:],
|
||||
"word[-2:]": word[-2:],
|
||||
"word.isupper()": word.isupper(),
|
||||
"word.istitle()": word.istitle(),
|
||||
"word.isdigit()": word.isdigit(),
|
||||
"postag": postag,
|
||||
"postag[:2]": postag[:2],
|
||||
}
|
||||
if i > 0:
|
||||
word1 = sent[i - 1][0]
|
||||
postag1 = sent[i - 1][1]
|
||||
features.update(
|
||||
{
|
||||
"-1:word.lower()": word1.lower(),
|
||||
"-1:word.istitle()": word1.istitle(),
|
||||
"-1:word.isupper()": word1.isupper(),
|
||||
"-1:postag": postag1,
|
||||
"-1:postag[:2]": postag1[:2],
|
||||
}
|
||||
)
|
||||
else:
|
||||
features["BOS"] = True
|
||||
|
||||
if i < len(sent) - 1:
|
||||
word1 = sent[i + 1][0]
|
||||
postag1 = sent[i + 1][1]
|
||||
features.update(
|
||||
{
|
||||
"+1:word.lower()": word1.lower(),
|
||||
"+1:word.istitle()": word1.istitle(),
|
||||
"+1:word.isupper()": word1.isupper(),
|
||||
"+1:postag": postag1,
|
||||
"+1:postag[:2]": postag1[:2],
|
||||
}
|
||||
)
|
||||
else:
|
||||
features["EOS"] = True
|
||||
|
||||
return features
|
||||
|
||||
@staticmethod
|
||||
def sent2features(sent):
|
||||
return [CRFModel.word2features(sent, i) for i in range(len(sent))]
|
||||
|
||||
@staticmethod
|
||||
def sent2labels(sent):
|
||||
return [label for token, postag, label in sent]
|
||||
|
||||
@staticmethod
|
||||
def sent2tokens(sent):
|
||||
return [token for token, postag, label in sent]
|
|
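A short sketch of the feature-extraction and prediction flow; the pickle path is the default assumed above and `sample` stands for an existing InputSample:

```python
from presidio_evaluator.models import CRFModel

# (word, POS, label) triples -> per-token feature dicts for the underlying CRF
sentence = [("May", "NNP", "B-PERSON"), ("lives", "VBZ", "O"), ("here", "RB", "O")]
features = CRFModel.sent2features(sentence)
print(features[0]["word.lower()"])  # 'may'

# Load a trained model and predict tags for one sample
crf = CRFModel(model_pickle_path="../models/crf.pickle")
tags = crf.predict(sample)
```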
@ -1,42 +1,40 @@
|
|||
from typing import List
|
||||
|
||||
import spacy
|
||||
|
||||
try:
|
||||
from flair.data import Sentence, build_spacy_tokenizer
|
||||
from flair.models import SequenceTagger
|
||||
from flair.tokenization import SpacyTokenizer
|
||||
except ImportError:
|
||||
print("Flair is not installed by default")
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
import spacy
|
||||
|
||||
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.models import BaseModel
|
||||
|
||||
|
||||
class FlairEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model=None,
|
||||
model_path: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True,
|
||||
translate_to_spacy_entities=True):
|
||||
class FlairModel(BaseModel):
|
||||
def __init__(
|
||||
self,
|
||||
model=None,
|
||||
model_path: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
translate_to_spacy_entities=True,
|
||||
):
|
||||
"""
|
||||
Evaluator for Flair models
|
||||
:param model: model of type SequenceTagger
|
||||
:param model_path:
|
||||
:param entities_to_keep:
|
||||
:param verbose:
|
||||
:param labeling_scheme:
|
||||
:param compare_by_io:
|
||||
:param translate_to_spacy_entities:
|
||||
"""
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
)
|
||||
if model is None:
|
||||
if model_path is None:
|
||||
raise ValueError("Either model_path or model object must be supplied")
|
||||
|
@ -44,11 +42,15 @@ class FlairEvaluator(ModelEvaluator):
|
|||
else:
|
||||
self.model = model
|
||||
|
||||
self.spacy_tokenizer = build_spacy_tokenizer(model=spacy.blank('en'))
|
||||
self.spacy_tokenizer = SpacyTokenizer(model=spacy.load("en_core_web_lg"))
|
||||
self.translate_to_spacy_entities = translate_to_spacy_entities
|
||||
|
||||
if self.translate_to_spacy_entities:
|
||||
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
|
||||
print(
|
||||
"Translating entities using this dictionary: {}".format(
|
||||
PRESIDIO_SPACY_ENTITIES
|
||||
)
|
||||
)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.translate_to_spacy_entities:
|
||||
|
@ -59,13 +61,17 @@ class FlairEvaluator(ModelEvaluator):
|
|||
tags = self.get_tags_from_sentence(sentence)
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
|
||||
if self.entities:
|
||||
tags = [tag for tag in tags if tag in self.entities]
|
||||
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def get_tags_from_sentence(sentence):
|
||||
tags = []
|
||||
for token in sentence:
|
||||
tags.append(token.get_tag('ner').value)
|
||||
tags.append(token.get_tag("ner").value)
|
||||
|
||||
new_tags = []
|
||||
for tag in tags:
|
|
@ -0,0 +1,80 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_analyzer import AnalyzerEngine
|
||||
|
||||
from presidio_evaluator import InputSample, span_to_tag
|
||||
from presidio_evaluator.models import BaseModel
|
||||
|
||||
|
||||
class PresidioAnalyzerWrapper(BaseModel):
|
||||
def __init__(
|
||||
self,
|
||||
analyzer_engine=AnalyzerEngine(),
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme="BIO",
|
||||
score_threshold=0.4,
|
||||
):
|
||||
"""
|
||||
Evaluation wrapper for the Presidio Analyzer
|
||||
:param analyzer_engine: object of type AnalyzerEngine (from presidio-analyzer)
|
||||
"""
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
)
|
||||
self.analyzer_engine = analyzer_engine
|
||||
|
||||
self.score_threshold = score_threshold
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
|
||||
results = self.analyzer_engine.analyze(
|
||||
text=sample.full_text,
|
||||
entities=self.entities,
|
||||
language="en",
|
||||
score_threshold=self.score_threshold,
|
||||
)
|
||||
starts = []
|
||||
ends = []
|
||||
scores = []
|
||||
tags = []
|
||||
#
|
||||
for res in results:
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
|
||||
response_tags = span_to_tag(
|
||||
scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
tag=tags,
|
||||
)
|
||||
return response_tags
|
||||
|
||||
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
|
||||
presidio_entities_map = {
|
||||
"PERSON": "PERSON",
|
||||
"EMAIL_ADDRESS": "EMAIL_ADDRESS",
|
||||
"CREDIT_CARD": "CREDIT_CARD",
|
||||
"FIRST_NAME": "PERSON",
|
||||
"PHONE_NUMBER": "PHONE_NUMBER",
|
||||
"BIRTHDAY": "DATE_TIME",
|
||||
"DATE_TIME": "DATE_TIME",
|
||||
"DOMAIN": "DOMAIN",
|
||||
"CITY": "LOCATION",
|
||||
"ADDRESS": "LOCATION",
|
||||
"NATIONALITY": "LOCATION",
|
||||
"IBAN": "IBAN_CODE",
|
||||
"URL": "DOMAIN_NAME",
|
||||
"US_SSN": "US_SSN",
|
||||
"IP_ADDRESS": "IP_ADDRESS",
|
||||
"ORGANIZATION": "ORG",
|
||||
"O": "O",
|
||||
}
|
|
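A sketch of evaluating the full Presidio Analyzer through this wrapper (the dataset path is an assumption):

```python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper

samples = read_synth_dataset("../data/synth_dataset.txt")

# Rename dataset entities (e.g. FIRST_NAME, BIRTHDAY) to Presidio's entity names
samples = Evaluator.align_input_samples_to_presidio_analyzer(samples)

model = PresidioAnalyzerWrapper(entities_to_keep=["PERSON", "PHONE_NUMBER"])
evaluator = Evaluator(model=model)
score = evaluator.calculate_score(evaluator.evaluate_all(samples), beta=2)
score.print()
```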
@ -0,0 +1,75 @@
from typing import List

from presidio_analyzer import EntityRecognizer
from presidio_analyzer.nlp_engine import NlpEngine

from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
from presidio_evaluator.span_to_tag import span_to_tag


class PresidioRecognizerWrapper(BaseModel):

    def __init__(
        self,
        recognizer: EntityRecognizer,
        nlp_engine: NlpEngine,
        entities_to_keep: List[str] = None,
        labeling_scheme: str = "BILUO",
        with_nlp_artifacts: bool = False,
        verbose: bool = False,
    ):
        """
        Evaluator for one specific PII recognizer
        To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
        :param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
        :param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
        :param entities_to_keep: List of entity types to focus on while ignoring all the rest.
        Default=None would look at all entity types
        :param with_nlp_artifacts: Whether NLP artifacts should be obtained
        (faster if not, but some recognizers need it)
        """
        super().__init__(
            entities_to_keep=entities_to_keep,
            verbose=verbose,
            labeling_scheme=labeling_scheme,
        )
        self.with_nlp_artifacts = with_nlp_artifacts
        self.recognizer = recognizer
        self.nlp_engine = nlp_engine

    #
    def __make_nlp_artifacts(self, text: str):
        return self.nlp_engine.process_text(text, "en")

    #
    def predict(self, sample: InputSample) -> List[str]:
        nlp_artifacts = None
        if self.with_nlp_artifacts:
            nlp_artifacts = self.__make_nlp_artifacts(sample.full_text)

        results = self.recognizer.analyze(
            sample.full_text, self.entities, nlp_artifacts
        )
        starts = []
        ends = []
        tags = []
        scores = []
        for res in results:
            if not res.start:
                res.start = 0
            starts.append(res.start)
            ends.append(res.end)
            tags.append(res.entity_type)
            scores.append(res.score)
        response_tags = span_to_tag(
            scheme=self.labeling_scheme,
            text=sample.full_text,
            start=starts,
            end=ends,
            tag=tags,
            tokens=sample.tokens,
            scores=scores,
        )
        if len(sample.tags) == 0:
            sample.tags = ["0" for _ in response_tags]
        return response_tags
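
# Illustrative only, not part of this commit: a minimal sketch of scoring one predefined
# recognizer through the wrapper above. CreditCardRecognizer and SpacyNlpEngine ship with
# presidio-analyzer; the module path presidio_evaluator.models.presidio_recognizer_wrapper
# and the dataset path are assumptions for the example.
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_recognizer_wrapper import PresidioRecognizerWrapper

samples = read_synth_dataset("data/synth_dataset.txt")
model = PresidioRecognizerWrapper(
    recognizer=CreditCardRecognizer(),
    nlp_engine=SpacyNlpEngine(),
    entities_to_keep=["CREDIT_CARD"],
)

evaluator = Evaluator(model=model, entities_to_keep=["CREDIT_CARD"])
score = evaluator.calculate_score(evaluator.evaluate_all(samples), beta=2.5)
print(score.pii_precision, score.pii_recall)
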
@ -0,0 +1,55 @@
from typing import List

import spacy

from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class SpacyModel(BaseModel):
    def __init__(
        self,
        model: spacy.language.Language = None,
        model_name: str = None,
        entities_to_keep: List[str] = None,
        verbose: bool = False,
        labeling_scheme: str = "BIO",
        translate_to_spacy_entities=True,
    ):
        super().__init__(
            entities_to_keep=entities_to_keep,
            verbose=verbose,
            labeling_scheme=labeling_scheme,
        )

        if model is None:
            if model_name is None:
                raise ValueError("Either model_name or model object must be supplied")
            self.model = spacy.load(model_name)
        else:
            self.model = model

        self.translate_to_spacy_entities = translate_to_spacy_entities
        if self.translate_to_spacy_entities:
            print(
                "Translating entities using this dictionary: {}".format(
                    PRESIDIO_SPACY_ENTITIES
                )
            )

    def predict(self, sample: InputSample) -> List[str]:
        if self.translate_to_spacy_entities:
            sample.translate_input_sample_tags()

        doc = self.model(sample.full_text)
        tags = self.get_tags_from_doc(doc)
        if len(doc) != len(sample.tokens):
            print("mismatch between input tokens and new tokens")

        return tags

    @staticmethod
    def get_tags_from_doc(doc):
        tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
        return tags
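
# Illustrative only, not part of this commit: a minimal sketch of evaluating an
# off-the-shelf spaCy pipeline through the SpacyModel wrapper above. The module path
# presidio_evaluator.models.spacy_model and the dataset path are assumptions for the example.
import spacy

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.spacy_model import SpacyModel

samples = read_synth_dataset("data/synth_dataset.txt")
nlp = spacy.load("en_core_web_lg")
model = SpacyModel(model=nlp, entities_to_keep=["PERSON", "GPE", "ORG"])

evaluator = Evaluator(model=model)
score = evaluator.calculate_score(evaluator.evaluate_all(samples))
print(score.entity_precision_dict, score.entity_recall_dict)
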
@ -1,153 +0,0 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_analyzer import AnalyzerEngine
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample, span_to_tag
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
||||
|
||||
class PresidioAnalyzerEvaluator(ModelEvaluator):
|
||||
def __init__(
|
||||
self,
|
||||
analyzer=AnalyzerEngine(),
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme="BIO",
|
||||
compare_by_io=True,
|
||||
score_threshold=0.4,
|
||||
):
|
||||
"""
|
||||
Evaluation wrapper for the Presidio Analyzer
|
||||
:param analyzer: object of type AnalyzerEngine (from presidio-analyzer)
|
||||
"""
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io,
|
||||
)
|
||||
self.analyzer = analyzer
|
||||
|
||||
self.score_threshold = score_threshold
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.entities is None or len(self.entities) == 0:
|
||||
all_fields = True
|
||||
else:
|
||||
all_fields = None
|
||||
results = self.analyzer.analyze(
|
||||
text=sample.full_text,
|
||||
entities=self.entities,
|
||||
language="en",
|
||||
all_fields=all_fields,
|
||||
)
|
||||
starts = []
|
||||
ends = []
|
||||
scores = []
|
||||
tags = []
|
||||
#
|
||||
for res in results:
|
||||
#
|
||||
if res.score >= self.score_threshold:
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
#
|
||||
response_tags = span_to_tag(
|
||||
scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
tag=tags,
|
||||
)
|
||||
return response_tags
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../data/synth_dataset.txt")
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
|
||||
entities_mapping = {
|
||||
"PERSON": "PERSON",
|
||||
"EMAIL": "EMAIL_ADDRESS",
|
||||
"CREDIT_CARD": "CREDIT_CARD",
|
||||
"FIRST_NAME": "PERSON",
|
||||
"PHONE_NUMBER": "PHONE_NUMBER",
|
||||
"BIRTHDAY": "DATE_TIME",
|
||||
"DATE": "DATE_TIME",
|
||||
"DOMAIN": "DOMAIN",
|
||||
"CITY": "LOCATION",
|
||||
"ADDRESS": "LOCATION",
|
||||
"IBAN": "IBAN_CODE",
|
||||
"URL": "DOMAIN_NAME",
|
||||
"US_SSN": "US_SSN",
|
||||
"IP_ADDRESS": "IP_ADDRESS",
|
||||
"ORGANIZATION": "ORG",
|
||||
"O": "O",
|
||||
}
|
||||
|
||||
updated_samples = ModelEvaluator.align_input_samples_to_presidio_analyzer(
|
||||
input_samples, entities_mapping
|
||||
)
|
||||
|
||||
flatten = lambda l: [item for sublist in l for item in sublist]
|
||||
from collections import Counter
|
||||
|
||||
count_per_entity = Counter(
|
||||
[
|
||||
span.entity_type
|
||||
for span in flatten(
|
||||
[input_sample.spans for input_sample in updated_samples]
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
print("Evaluating samples")
|
||||
analyzer = PresidioAnalyzerEvaluator(entities_to_keep=count_per_entity.keys())
|
||||
evaluated_samples = analyzer.evaluate_all(updated_samples)
|
||||
|
||||
print("Estimating metrics")
|
||||
score = analyzer.calculate_score(evaluation_results=evaluated_samples, beta=2.5)
|
||||
precision = score.pii_precision
|
||||
recall = score.pii_recall
|
||||
entity_recall = score.entity_recall_dict
|
||||
entity_precision = score.entity_precision_dict
|
||||
f = score.pii_f
|
||||
errors = score.model_errors
|
||||
#
|
||||
print("precision: {}".format(precision))
|
||||
print("Recall: {}".format(recall))
|
||||
print("F 2.5: {}".format(f))
|
||||
print("Precision per entity: {}".format(entity_precision))
|
||||
print("Recall per entity: {}".format(entity_recall))
|
||||
#
|
||||
FN_mistakes = [str(mistake) for mistake in errors if mistake.error_type == "FN"]
|
||||
FP_mistakes = [str(mistake) for mistake in errors if mistake.error_type == "FP"]
|
||||
other_mistakes = [
|
||||
str(mistake) for mistake in errors if mistake.error_type not in ["FN", "FP"]
|
||||
]
|
||||
|
||||
fn = open("../data/fn_30000.txt", "w+", encoding="utf-8")
|
||||
fn1 = "\n".join(FN_mistakes)
|
||||
fn.write(fn1)
|
||||
fn.close()
|
||||
|
||||
fp = open("../data/fp_30000.txt", "w+", encoding="utf-8")
|
||||
fp1 = "\n".join(FP_mistakes)
|
||||
fp.write(fp1)
|
||||
fp.close()
|
||||
|
||||
mistakes_file = open("../data/mistakes_30000.txt", "w+", encoding="utf-8")
|
||||
mistakes1 = "\n".join(other_mistakes)
|
||||
mistakes_file.write(mistakes1)
|
||||
mistakes_file.close()
|
||||
|
||||
from pickle import dump
|
||||
|
||||
dump(evaluated_samples, open("../data/evaluated_samples_30000.pickle", "wb"))
|
|
@ -1,133 +0,0 @@
|
|||
import json
|
||||
from typing import List
|
||||
|
||||
import requests
|
||||
|
||||
from presidio_evaluator import InputSample, ModelEvaluator
|
||||
from presidio_evaluator.span_to_tag import span_to_tag, tokenize
|
||||
|
||||
ENDPOINT = "http://40.113.201.221:8080/api/v1/projects/test/analyze"
|
||||
|
||||
|
||||
class PresidioAPIEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self, endpoint=None, all_fields=False, entities_to_keep=None,
|
||||
verbose=False, labeling_scheme="IO", **kwargs):
|
||||
"""
|
||||
evaluator model for the presidio API as a system
|
||||
:param endpoint: url of presidio API
|
||||
:param all_fields: boolean, true if no entities filtering should take
|
||||
place
|
||||
:param entities_to_keep: list of entities to return if found
|
||||
:param labeling_scheme: BIO/IOB or BILOU
|
||||
:param verbose:
|
||||
:param kwargs:
|
||||
"""
|
||||
|
||||
if not endpoint:
|
||||
print(
|
||||
"Endpoint is missing. using default presidio API at {}".format(
|
||||
ENDPOINT))
|
||||
self.endpoint = ENDPOINT
|
||||
else:
|
||||
self.endpoint = endpoint
|
||||
|
||||
if not entities_to_keep and not all_fields:
|
||||
raise ValueError("Please provide either a list of entities or"
|
||||
"all_fields=true")
|
||||
|
||||
if all_fields:
|
||||
entities_to_keep = None
|
||||
super().__init__(verbose=verbose, entities_to_keep=entities_to_keep,
|
||||
labeling_scheme=labeling_scheme, **kwargs)
|
||||
|
||||
self.set_analyze_template(all_fields=all_fields,
|
||||
entities=entities_to_keep)
|
||||
|
||||
def predict(self, sample: InputSample):
|
||||
text = sample.full_text
|
||||
request = {"text": text,
|
||||
"analyzeTemplate": self.analyze_template
|
||||
}
|
||||
# Call presidio API
|
||||
r = requests.post(self.endpoint, json=request)
|
||||
starts = []
|
||||
ends = []
|
||||
tags = []
|
||||
|
||||
if r.status_code == 200:
|
||||
analyzer_results = json.loads(r.text)
|
||||
if self.verbose:
|
||||
print(analyzer_results)
|
||||
|
||||
if analyzer_results:
|
||||
for res in analyzer_results:
|
||||
if not res['location'].get('start'):
|
||||
res['location']['start'] = 0
|
||||
starts.append(res['location']['start'])
|
||||
ends.append(res['location']['end'])
|
||||
tags.append(res['field']['name'])
|
||||
|
||||
response_tags = span_to_tag(scheme=self.labeling_scheme,
|
||||
text=text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tag=tags)
|
||||
|
||||
elif r.status_code == 400 or r.text == "":
|
||||
if self.verbose:
|
||||
print("Status 400 received")
|
||||
response_tags = ['O' for token in sample.tokens]
|
||||
else:
|
||||
print("Error getting result from Presidio API")
|
||||
print("Request = {}".format(request))
|
||||
print("Response = {}".format(r.text))
|
||||
raise Exception(r)
|
||||
|
||||
return response_tags
|
||||
|
||||
def set_analyze_template(self, all_fields: bool, entities: List[str]):
|
||||
template = {
|
||||
"fields": [{"name": "EMAIL_ADDRESS"}, {"name": "IP_ADDRESS"},
|
||||
{"name": "US_DRIVER_LICENSE"},
|
||||
{"name": "US_ITIN"}, {"name": "US_SSN"},
|
||||
{"name": "DOMAIN_NAME"},
|
||||
{"name": "IBAN_CODE"}, {"name": "PERSON"},
|
||||
{"name": "PHONE_NUMBER"},
|
||||
{"name": "US_BANK_NUMBER"}, {"name": "CRYPTO"},
|
||||
{"name": "NRP"},
|
||||
{"name": "UK_NHS"}, {"name": "CREDIT_CARD"},
|
||||
{"name": "DATE_TIME"},
|
||||
{"name": "LOCATION"}, {"name": "US_PASSPORT"}]}
|
||||
|
||||
if all_fields:
|
||||
self.analyze_template = template
|
||||
return
|
||||
|
||||
requested_fields = []
|
||||
for entity in entities:
|
||||
for field in template['fields']:
|
||||
if entity == field['name']:
|
||||
requested_fields.append(field)
|
||||
|
||||
new_template = {'fields': requested_fields}
|
||||
|
||||
self.analyze_template = new_template
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example:
|
||||
text = "My siblings are Dan and magen"
|
||||
bilou_tags = ['O', 'O', 'O', 'U-PERSON', 'O', 'U-PERSON']
|
||||
presidio = PresidioAPIEvaluator(verbose=True, all_fields=True, compare_by_io=True)
|
||||
tokens = tokenize(text)
|
||||
s = InputSample(text, masked=None, spans=None)
|
||||
s.tokens = tokens
|
||||
s.tags = bilou_tags
|
||||
|
||||
evaluated_sample = presidio.evaluate_sample(s)
|
||||
p, r, entity_recall, f, mistakes = presidio.calculate_score([evaluated_sample])
|
||||
print("Precision = {}\n"
|
||||
"Recall = {}\n"
|
||||
"F_3 = {}\n"
|
||||
"Errors = {}".format(p, r, f, mistakes))
|
|
@ -1,88 +0,0 @@
|
|||
"""
|
||||
Presidio Analyzer not yet on PyPI, therefore it cannot be referenced explicitly
|
||||
"""
|
||||
|
||||
import math
|
||||
from typing import List, Tuple, Dict
|
||||
|
||||
from presidio_analyzer.nlp_engine import SpacyNlpEngine
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample, EvaluationResult
|
||||
from presidio_evaluator.span_to_tag import span_to_tag
|
||||
|
||||
|
||||
class PresidioRecognizerEvaluator(ModelEvaluator):
|
||||
def __init__(
|
||||
self,
|
||||
recognizer,
|
||||
nlp_engine,
|
||||
entities_to_keep=None,
|
||||
with_nlp_artifacts=False,
|
||||
verbose=False,
|
||||
compare_by_io=True,
|
||||
):
|
||||
"""
|
||||
Evaluator for one recognizer
|
||||
:param recognizer: An object of type EntityRecognizer (in presidion-analyzer)
|
||||
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
|
||||
"""
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
compare_by_io=compare_by_io,
|
||||
)
|
||||
self.withNlpArtifacts = with_nlp_artifacts
|
||||
self.recognizer = recognizer
|
||||
self.nlp_engine = nlp_engine
|
||||
|
||||
#
|
||||
def __make_nlp_artifacts(self, text: str):
|
||||
return self.nlp_engine.process_text(text, "en")
|
||||
|
||||
#
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
nlpArtifacts = None
|
||||
if self.withNlpArtifacts:
|
||||
nlpArtifacts = self.__make_nlp_artifacts(sample.full_text)
|
||||
results = self.recognizer.analyze(sample.full_text, self.entities, nlpArtifacts)
|
||||
starts = []
|
||||
ends = []
|
||||
tags = []
|
||||
scores = []
|
||||
for res in results:
|
||||
if not res.start:
|
||||
res.start = 0
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
response_tags = span_to_tag(
|
||||
scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tag=tags,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
io_tags_only=self.compare_by_io,
|
||||
)
|
||||
if len(sample.tags) == 0:
|
||||
sample.tags = ["0" for word in response_tags]
|
||||
return response_tags
|
||||
|
||||
|
||||
def score_presidio_recognizer(
|
||||
recognizer, entities_to_keep, input_samples, withNlpArtifacts=False
|
||||
) -> EvaluationResult:
|
||||
model = PresidioRecognizerEvaluator(
|
||||
recognizer=recognizer,
|
||||
entities_to_keep=entities_to_keep,
|
||||
nlp_engine=SpacyNlpEngine(),
|
||||
with_nlp_artifacts=withNlpArtifacts,
|
||||
)
|
||||
evaluated_samples = model.evaluate_all(input_samples[:])
|
||||
evaluation_result = model.calculate_score(evaluated_samples, beta=2.5)
|
||||
evaluation_result.print()
|
||||
if math.isnan(evaluation_result.pii_precision):
|
||||
evaluation_result.pii_precision = 0
|
||||
return evaluation_result
|
|
@ -1,52 +0,0 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
import spacy
|
||||
|
||||
from spacy.language import Language
|
||||
|
||||
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
|
||||
|
||||
|
||||
class SpacyEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model: spacy.language.Language = None,
|
||||
model_name: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True,
|
||||
translate_to_spacy_ents = True):
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model is None:
|
||||
if model_name is None:
|
||||
raise ValueError("Either model_name or model object must be supplied")
|
||||
self.model = spacy.load(model_name)
|
||||
else:
|
||||
self.model = model
|
||||
|
||||
self.translate_to_spacy_ents = translate_to_spacy_ents
|
||||
if self.translate_to_spacy_ents:
|
||||
print("Translating entites using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.translate_to_spacy_ents:
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
doc = self.model(sample.full_text)
|
||||
tags = self.get_tags_from_doc(doc)
|
||||
if len(doc) != len(sample.tokens):
|
||||
print("mismatch between input tokens and new tokens")
|
||||
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def get_tags_from_doc(doc):
|
||||
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
|
||||
return tags
|
||||
|
|
@ -1,14 +1,14 @@
from collections import namedtuple
from typing import List

import spacy
from spacy.tokens import Token

loaded_spacy = {}


def get_spacy(loaded_spacy=loaded_spacy, model_version="en_core_web_lg"):
    if model_version not in loaded_spacy:
        disable = ['vectors', 'textcat', 'ner']
        disable = ["vectors", "textcat", "ner"]
        print("loading model {}".format(model_version))
        loaded_spacy[model_version] = spacy.load(model_version, disable=disable)
    return loaded_spacy[model_version]
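
# Illustrative only, not part of this commit: get_spacy caches one spaCy pipeline per model
# name in the module-level dict above, so repeated calls reuse the already loaded object.
nlp_first = get_spacy(model_version="en_core_web_lg")   # prints "loading model en_core_web_lg"
nlp_second = get_spacy(model_version="en_core_web_lg")  # served from the cache, no reload
assert nlp_first is nlp_second
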
@ -26,7 +26,7 @@ def _get_detailed_tags(scheme, cur_tags):
    :return:
    """

    if all([tag == 'O' for tag in cur_tags]):
    if all([tag == "O" for tag in cur_tags]):
        return cur_tags

    return_tags = []

@ -52,7 +52,12 @@ def _get_detailed_tags(scheme, cur_tags):

def _sort_spans(start, end, tag, score):
    if len(start) > 0:
        tpl = [(a, b, c, d) for a, b, c, d in sorted(zip(start, end, tag, score), key=lambda pair: pair[0])]
        tpl = [
            (a, b, c, d)
            for a, b, c, d in sorted(
                zip(start, end, tag, score), key=lambda pair: pair[0]
            )
        ]
        start, end, tag, score = [[x[i] for x in tpl] for i in range(len(tpl[0]))]
    return start, end, tag, score


@ -65,8 +70,8 @@ def _handle_overlaps(start, end, tag, score):
    index = min(start)
    number_of_spans = len(start)
    i = 0
    while i < number_of_spans-1:
        for j in range(i+1,number_of_spans):
    while i < number_of_spans - 1:
        for j in range(i + 1, number_of_spans):
            # Span j intersects with span i
            if start[i] <= start[j] <= end[i]:
                # i's score is higher, remove intersecting part

@ -98,14 +103,15 @@ def _handle_overlaps(start, end, tag, score):
    return start, end, tag, score


def span_to_tag(scheme: str,
                text: str,
                start: List[int],
                end: List[int],
                tag: List[str],
                scores: List[float] = None,
                tokens: List[spacy.tokens.Token] = None,
                io_tags_only=False) -> List[str]:
def span_to_tag(
    scheme: str,
    text: str,
    start: List[int],
    end: List[int],
    tag: List[str],
    scores: List[float] = None,
    tokens: List[spacy.tokens.Token] = None,
) -> List[str]:
    """
    Turns a list of start and end values with corresponding labels, into a NER
    tagging (BILOU,BIO/IOB)

@ -116,7 +122,6 @@ def span_to_tag(scheme: str,
    :param end: list of indices where entities in the text end
    :param tag: list of entity names
    :param scores: score of tag (confidence)
    :param io_tags_only: Whether to return only I and O tags
    :return: list of strings, representing either BILOU or BIO for the input
    """


@ -141,7 +146,7 @@ def span_to_tag(scheme: str,
    if not found:
        io_tags.append("O")

    if io_tags_only or scheme == "IO":
    if scheme == "IO":
        return io_tags

    # Set tagging based on scheme (BIO/IOB or BILOU)

@ -158,7 +163,9 @@ def span_to_tag(scheme: str,
    new_return_tags = []
    for i in range(len(changes) - 1):
        new_return_tags.extend(
            _get_detailed_tags(scheme=scheme,
                               cur_tags=io_tags[changes[i]:changes[i + 1]]))
            _get_detailed_tags(
                scheme=scheme, cur_tags=io_tags[changes[i] : changes[i + 1]]
            )
        )

    return new_return_tags
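
# Illustrative only, not part of this commit: a minimal sketch of how span_to_tag turns
# character-level spans into token-level NER tags. Token boundaries come from the spaCy
# tokenizer used under the hood, so the expected output below is indicative only.
from presidio_evaluator.span_to_tag import span_to_tag

text = "My name is David"
tags = span_to_tag(
    scheme="BILUO",
    text=text,
    start=[11],        # "David" starts at character 11
    end=[16],          # and ends at character 16
    tag=["PERSON"],
    scores=[0.85],
)
# Expected (roughly): ["O", "O", "O", "U-PERSON"]
print(tags)
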
@ -7,7 +7,7 @@ import json
from presidio_evaluator import InputSample


def split_dataset(dataset : List[InputSample], ratios):
def split_dataset(dataset: List[InputSample], ratios):
    """
    Splits a provided dataset into n groups, by the Template# attribute in each sample's metadata
    :param dataset: List of InputSamples to be split

@ -23,7 +23,9 @@ def split_dataset(dataset : List[InputSample], ratios):

    for ratio in ratios:
        if 1 >= ratio > 0:
            first_templates, second_templates = split_by_template(remaining_dataset, ratio/remaining_ratio)
            first_templates, second_templates = split_by_template(
                remaining_dataset, ratio / remaining_ratio
            )
            first_split = get_samples_by_pattern(remaining_dataset, first_templates)
            second_split = get_samples_by_pattern(remaining_dataset, second_templates)
            splits.append(first_split)

@ -39,7 +41,7 @@ def group_by_template(dataset: List[InputSample]) -> Dict[str, List[InputSample]
    """
    Creates a dict of key = template ID and value = List[InputSamples] for this template id
    """
    samples_pattern_tup = [(sample.metadata["Template#"],sample) for sample in dataset]
    samples_pattern_tup = [(sample.metadata["Template#"], sample) for sample in dataset]

    group_by_template = defaultdict(list)
    for sample in samples_pattern_tup:

@ -55,7 +57,9 @@ def split_by_template(input_samples: List[InputSample], train_pct: float = 0.7):
    samples_grpd = group_by_template(input_samples)

    templates = np.array(list(samples_grpd.keys()))
    train_ind = set(random.sample(range(len(templates)), round(train_pct * len(templates))))
    train_ind = set(
        random.sample(range(len(templates)), round(train_pct * len(templates)))
    )

    test_ind = set(range(len(templates))) - train_ind


@ -75,5 +79,5 @@ def get_samples_by_pattern(input_samples, patterns_list):
def save_to_json(samples, output_file):
    examples_dict = [example.to_dict() for example in samples]

    with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
    with open("{}".format(output_file), "w+", encoding="utf-8") as f:
        json.dump(examples_dict, f, ensure_ascii=False, indent=4)
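
# Illustrative only, not part of this commit: a minimal sketch of splitting a synthetic
# dataset into train/test/validation by template, using split_dataset and save_to_json
# defined above. The dataset and output paths are examples.
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("data/synth_dataset.txt")
train, test, validation = split_dataset(samples, [0.7, 0.2, 0.1])

save_to_json(train, "data/train.json")
save_to_json(test, "data/test.json")
save_to_json(validation, "data/validation.json")
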
@ -1,18 +1,15 @@
spacy
requests==2.22.0
numpy
jupyter
pandas
tqdm
haikunator
spacy==3.0.5
numpy==1.20.2
jupyter>=1
pandas~=1.2.4
tqdm~=4.60.0
haikunator~=2.1.0
schwifty
faker
sklearn
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz
regex
#azureml
#azureml-sdk
faker~=8.1.0
scikit_learn==0.24.1
#flair
sklearn_crfsuite
pytest
presidio_analyzer
sklearn_crfsuite==0.3.6
pytest~=6.2.3
presidio_analyzer
presidio_anonymizer
requests~=2.25.1
setup.py
@ -1,4 +1,4 @@
from setuptools import setup
from setuptools import setup, find_packages
import os.path
# read the contents of the README file
from os import path

@ -7,7 +7,6 @@ this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
# print(long_description)
__version__ = ""

with open(os.path.join(this_directory, 'VERSION')) as version_file:
    __version__ = version_file.read().strip()

@ -17,16 +16,15 @@ setup(
    long_description=long_description,
    long_description_content_type='text/markdown',
    version=__version__,
    packages=['presidio_evaluator', 'presidio_evaluator.data_generator'
              ],
    packages=find_packages(exclude=["tests"]),
    url='https://www.github.com/microsoft/presidio',
    license='MIT',
    description='PII dataset generator, model evaluator for Presidio and PII data in general',
    data_files=[('presidio_evaluator/data_generator/raw_data', ['presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv', 'presidio_evaluator/data_generator/raw_data/templates.txt', 'presidio_evaluator/data_generator/raw_data/organizations.csv', 'presidio_evaluator/data_generator/raw_data/nationalities.csv'])],
    include_package_data=True,
    install_requires=[
        'spacy>=2.2.0',
        'requests==2.22.0',
        'spacy>=3.0.0',
        'requests',
        'numpy',
        'pandas',
        'tqdm>=4.32.1',
@ -12,7 +12,7 @@ def pytest_addoption(parser):
        "--runslow", action="store_true", default=False, help="run slow tests"
    )
    parser.addoption(
        "--runinconclusive", action="store_true", default=False, help="run slow tests"
        "--runinconclusive", action="store_true", default=False, help="run inconclusive tests"
    )
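
# Illustrative only, not part of this commit: a hypothetical test module showing how the
# custom pytest options above are typically consumed. It assumes conftest.py maps
# --runslow / --runinconclusive to "slow" / "inconclusive" markers in the usual pytest pattern.
import pytest


@pytest.mark.slow
def test_full_dataset_evaluation():
    # Collected but skipped unless pytest is invoked with --runslow
    assert True


@pytest.mark.inconclusive
def test_experimental_recognizer():
    # Collected but skipped unless pytest is invoked with --runinconclusive
    assert True
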
@ -1,254 +1,139 @@
|
|||
[
|
||||
{
|
||||
"full_text": "My full address is Avda. Alameda Sundheim 46",
|
||||
"full_text": "I either live on 2347 Lauzon Parkway, Windsor N9A7A2 or ",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "FULL_ADDRESS",
|
||||
"entity_value": "Avda. Alameda Sundheim 46",
|
||||
"start_position": 19,
|
||||
"end_position": 44
|
||||
"entity_type": "LOCATION",
|
||||
"entity_value": "2347 Lauzon Parkway, Windsor N9A7A2",
|
||||
"start_position": 17,
|
||||
"end_position": 52
|
||||
},
|
||||
{
|
||||
"entity_type": "LOCATION",
|
||||
"entity_value": "",
|
||||
"start_position": 56,
|
||||
"end_position": 56
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "My",
|
||||
"idx": 0,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "full",
|
||||
"idx": 3,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "amod",
|
||||
"lemma_": "full",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "address",
|
||||
"idx": 8,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "address",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 16,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "be",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Avda",
|
||||
"idx": 19,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Avda",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ".",
|
||||
"idx": 23,
|
||||
"tag_": ".",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ".",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Alameda",
|
||||
"idx": 25,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Alameda",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Sundheim",
|
||||
"idx": 33,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "Sundheim",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "46",
|
||||
"idx": 42,
|
||||
"tag_": "CD",
|
||||
"pos_": "NUM",
|
||||
"dep_": "nummod",
|
||||
"lemma_": "46",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"B-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"L-FULL_ADDRESS"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "Croatian",
|
||||
"Country": "Uganda",
|
||||
"Lowercase": false,
|
||||
"Template#": 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "You want my credit card? No problem: 4532368231815457",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "CREDIT_CARD",
|
||||
"entity_value": "4532368231815457",
|
||||
"start_position": 37,
|
||||
"end_position": 53
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "You",
|
||||
"text": "I",
|
||||
"idx": 0,
|
||||
"tag_": "PRP",
|
||||
"pos_": "PRON",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "-PRON-",
|
||||
"lemma_": "I",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "want",
|
||||
"idx": 4,
|
||||
"text": "either",
|
||||
"idx": 2,
|
||||
"tag_": "RB",
|
||||
"pos_": "ADV",
|
||||
"dep_": "advmod",
|
||||
"lemma_": "either",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "live",
|
||||
"idx": 9,
|
||||
"tag_": "VBP",
|
||||
"pos_": "VERB",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "want",
|
||||
"lemma_": "live",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "my",
|
||||
"idx": 9,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"text": "on",
|
||||
"idx": 14,
|
||||
"tag_": "IN",
|
||||
"pos_": "ADP",
|
||||
"dep_": "prep",
|
||||
"lemma_": "on",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "credit",
|
||||
"idx": 12,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "credit",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "card",
|
||||
"idx": 19,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "dobj",
|
||||
"lemma_": "card",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "?",
|
||||
"idx": 23,
|
||||
"tag_": ".",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": "?",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "No",
|
||||
"idx": 25,
|
||||
"tag_": "DT",
|
||||
"pos_": "DET",
|
||||
"dep_": "det",
|
||||
"lemma_": "no",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "problem",
|
||||
"idx": 28,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "problem",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ":",
|
||||
"idx": 35,
|
||||
"tag_": ":",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ":",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "4532368231815457",
|
||||
"idx": 37,
|
||||
"text": "2347",
|
||||
"idx": 17,
|
||||
"tag_": "CD",
|
||||
"pos_": "NUM",
|
||||
"dep_": "nummod",
|
||||
"lemma_": "2347",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Lauzon",
|
||||
"idx": 22,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Lauzon",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Parkway",
|
||||
"idx": 29,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "pobj",
|
||||
"lemma_": "Parkway",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ",",
|
||||
"idx": 36,
|
||||
"tag_": ",",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ",",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Windsor",
|
||||
"idx": 38,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Windsor",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "N9A7A2",
|
||||
"idx": 46,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "appos",
|
||||
"lemma_": "4532368231815457",
|
||||
"lemma_": "N9A7A2",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "or",
|
||||
"idx": 53,
|
||||
"tag_": "CC",
|
||||
"pos_": "CCONJ",
|
||||
"dep_": "cc",
|
||||
"lemma_": "or",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
|
@ -259,37 +144,38 @@
|
|||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-CREDIT_CARD"
|
||||
"B-LOCATION",
|
||||
"I-LOCATION",
|
||||
"I-LOCATION",
|
||||
"I-LOCATION",
|
||||
"I-LOCATION",
|
||||
"L-LOCATION",
|
||||
"O"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "female",
|
||||
"NameSet": "Czech",
|
||||
"Country": "Austria",
|
||||
"Gender": "male",
|
||||
"NameSet": "Polish",
|
||||
"Country": "Croatia",
|
||||
"Lowercase": false,
|
||||
"Template#": 7
|
||||
"Template#": 11
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "My first name is Rogelio and my last is Patrick",
|
||||
"full_text": "My accounts are and ",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "PERSON",
|
||||
"entity_value": "Rogelio",
|
||||
"start_position": 17,
|
||||
"end_position": 24
|
||||
"entity_type": "ACCOUNT_NUMBER",
|
||||
"entity_value": "",
|
||||
"start_position": 16,
|
||||
"end_position": 16
|
||||
},
|
||||
{
|
||||
"entity_type": "PERSON",
|
||||
"entity_value": "Patrick",
|
||||
"start_position": 40,
|
||||
"end_position": 47
|
||||
"entity_type": "ACCOUNT_NUMBER",
|
||||
"entity_value": "",
|
||||
"start_position": 21,
|
||||
"end_position": 21
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
|
@ -297,39 +183,28 @@
|
|||
"text": "My",
|
||||
"idx": 0,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"pos_": "PRON",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"lemma_": "my",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "first",
|
||||
"text": "accounts",
|
||||
"idx": 3,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "amod",
|
||||
"lemma_": "first",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "name",
|
||||
"idx": 9,
|
||||
"tag_": "NN",
|
||||
"tag_": "NNS",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "name",
|
||||
"lemma_": "account",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 14,
|
||||
"tag_": "VBZ",
|
||||
"text": "are",
|
||||
"idx": 12,
|
||||
"tag_": "VBP",
|
||||
"pos_": "AUX",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "be",
|
||||
|
@ -338,19 +213,19 @@
|
|||
}
|
||||
},
|
||||
{
|
||||
"text": "Rogelio",
|
||||
"idx": 17,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"text": " ",
|
||||
"idx": 16,
|
||||
"tag_": "_SP",
|
||||
"pos_": "SPACE",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Rogelio",
|
||||
"lemma_": " ",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "and",
|
||||
"idx": 25,
|
||||
"idx": 17,
|
||||
"tag_": "CC",
|
||||
"pos_": "CCONJ",
|
||||
"dep_": "cc",
|
||||
|
@ -358,47 +233,76 @@
|
|||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "Hispanic",
|
||||
"Country": "Iraq",
|
||||
"Lowercase": false,
|
||||
"Template#": 14
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "I live in Uralaane",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"text": "my",
|
||||
"idx": 29,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
"entity_type": "LOCATION",
|
||||
"entity_value": "Uralaane",
|
||||
"start_position": 10,
|
||||
"end_position": 18
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "last",
|
||||
"idx": 32,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"text": "I",
|
||||
"idx": 0,
|
||||
"tag_": "PRP",
|
||||
"pos_": "PRON",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "last",
|
||||
"lemma_": "I",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 37,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "conj",
|
||||
"lemma_": "be",
|
||||
"text": "live",
|
||||
"idx": 2,
|
||||
"tag_": "VBP",
|
||||
"pos_": "VERB",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "live",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Patrick",
|
||||
"idx": 40,
|
||||
"text": "in",
|
||||
"idx": 7,
|
||||
"tag_": "IN",
|
||||
"pos_": "ADP",
|
||||
"dep_": "prep",
|
||||
"lemma_": "in",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Uralaane",
|
||||
"idx": 10,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Patrick",
|
||||
"dep_": "pobj",
|
||||
"lemma_": "Uralaane",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
|
@ -408,21 +312,15 @@
|
|||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-PERSON",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-PERSON"
|
||||
"U-LOCATION"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "American",
|
||||
"Country": "California",
|
||||
"Gender": "female",
|
||||
"NameSet": "Chechen (Latin)",
|
||||
"Country": "United States Of America",
|
||||
"Lowercase": false,
|
||||
"Template#": 2
|
||||
"Template#": 5
|
||||
}
|
||||
}
|
||||
]
|
|
@ -1,3 +1,11 @@
from .model_mock import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, \
    MockTokensModel
from .model_mock import (
    IdentityTokensMockModel,
    FiftyFiftyIdentityTokensMockModel,
    MockTokensModel,
)

__all__ = [
    "IdentityTokensMockModel",
    "FiftyFiftyIdentityTokensMockModel",
    "MockTokensModel",
]
@ -1,14 +1,15 @@
from typing import List
from typing import List, Optional

from presidio_evaluator import InputSample, ModelEvaluator
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class MockTokensModel(ModelEvaluator):
class MockTokensModel(BaseModel):
    """
    Simulates a real model, returns the prediction given in the constructor
    """

    def __init__(self, prediction: List[str], entities_to_keep: List = None,
    def __init__(self, prediction: Optional[List[str]], entities_to_keep: List = None,
                 verbose: bool = False, **kwargs):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose,
                         **kwargs)

@ -18,20 +19,19 @@ class MockTokensModel(ModelEvaluator):
        return self.prediction


class IdentityTokensMockModel(ModelEvaluator):
class IdentityTokensMockModel(BaseModel):
    """
    Simulates a real model, always return the label as prediction
    """

    def __init__(self, entities_to_keep: List = None,
                 verbose: bool = False):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
    def __init__(self, verbose: bool = False):
        super().__init__(verbose=verbose)

    def predict(self, sample: InputSample) -> List[str]:
        return sample.tags


class FiftyFiftyIdentityTokensMockModel(ModelEvaluator):
class FiftyFiftyIdentityTokensMockModel(BaseModel):
    """
    Simulates a real model, returns the label or no predictions (list of 'O')
    alternately
@ -1,6 +1,6 @@
import numpy as np

from presidio_evaluator.crf_evaluator import CRFEvaluator
from presidio_evaluator.models.crf_model import CRFModel
from presidio_evaluator.data_generator import read_synth_dataset


@ -12,7 +12,7 @@ def no_test_test_crf_simple():

    model_path = os.path.abspath(os.path.join(dir_path, "..", "model-outputs/crf.pickle"))

    crf_evaluator = CRFEvaluator(model_pickle_path=model_path,entities_to_keep=['PERSON'])
    crf_evaluator = CRFModel(model_pickle_path=model_path, entities_to_keep=['PERSON'])
    evaluation_results = crf_evaluator.evaluate_all(input_samples)
    scores = crf_evaluator.calculate_score(evaluation_results)
@ -0,0 +1,298 @@
|
|||
import numpy as np
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from presidio_evaluator.evaluation import EvaluationResult, Evaluator
|
||||
from tests.mocks import (
|
||||
IdentityTokensMockModel,
|
||||
FiftyFiftyIdentityTokensMockModel,
|
||||
MockTokensModel,
|
||||
)
|
||||
|
||||
|
||||
def test_evaluator_simple():
|
||||
prediction = ["O", "O", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=["ANIMAL"])
|
||||
|
||||
evaluator = Evaluator(model=model)
|
||||
sample = InputSample(
|
||||
full_text="I am the walrus", masked="I am the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
final_evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert final_evaluation.pii_precision == 1
|
||||
assert final_evaluation.pii_recall == 1
|
||||
|
||||
|
||||
def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
|
||||
prediction = ["O", "O", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["SPACESHIP"])
|
||||
|
||||
sample = InputSample(
|
||||
full_text="I am the walrus", masked="I am the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
assert evaluated.results[("O", "O")] == 4
|
||||
|
||||
|
||||
def test_evaluate_same_entity_correct_statistics():
|
||||
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
full_text="I dog the walrus", masked="I [ANIMAL] the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluation_result = evaluator.evaluate_sample(sample, prediction)
|
||||
assert evaluation_result.results[("O", "O")] == 2
|
||||
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
|
||||
assert evaluation_result.results[("O", "ANIMAL")] == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_entities_to_keep_correct_statistics():
|
||||
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
|
||||
entities_to_keep = ["ANIMAL", "PLANT", "SPACESHIP"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=entities_to_keep)
|
||||
|
||||
sample = InputSample(
|
||||
full_text="I dog the walrus", masked="I [ANIMAL] the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluation_result = evaluator.evaluate_sample(sample, prediction)
|
||||
assert evaluation_result.results[("O", "O")] == 2
|
||||
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
|
||||
assert evaluation_result.results[("O", "ANIMAL")] == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
"I am the walrus amaericanus magnifico", masked=None, spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert evaluation.pii_precision == 1
|
||||
assert evaluation.pii_recall == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_partial_match_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
"I am the walrus amaericanus magnifico", masked=None, spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert evaluation.pii_precision == 1
|
||||
assert evaluation.pii_recall == 4 / 6
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
"I am the walrus amaericanus magnifico", masked=None, spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert np.isnan(evaluation.pii_precision)
|
||||
assert evaluation.pii_recall == 0
|
||||
|
||||
|
||||
def test_evaluate_multiple_examples_correct_statistics():
|
||||
prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON"])
|
||||
input_sample = InputSample("My name is Raphael or David", masked=None, spans=None)
|
||||
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
|
||||
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
|
||||
|
||||
evaluated = evaluator.evaluate_all(
|
||||
[input_sample, input_sample, input_sample, input_sample]
|
||||
)
|
||||
scores = evaluator.calculate_score(evaluated)
|
||||
assert scores.pii_precision == 0.5
|
||||
assert scores.pii_recall == 0.5
|
||||
|
||||
|
||||
def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
|
||||
prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON", "TENNIS_PLAYER"])
|
||||
input_sample = InputSample("My name is Raphael or David", masked=None, spans=None)
|
||||
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
|
||||
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
|
||||
|
||||
evaluated = evaluator.evaluate_all(
|
||||
[input_sample, input_sample, input_sample, input_sample]
|
||||
)
|
||||
scores = evaluator.calculate_score(evaluated)
|
||||
assert scores.pii_precision == 1
|
||||
assert scores.pii_recall == 1
|
||||
|
||||
|
||||
def test_confusion_matrix_correct_metrics():
|
||||
from collections import Counter
|
||||
|
||||
evaluated = [
|
||||
EvaluationResult(
|
||||
results=Counter(
|
||||
{
|
||||
("O", "O"): 150,
|
||||
("O", "PERSON"): 30,
|
||||
("O", "COMPANY"): 30,
|
||||
("PERSON", "PERSON"): 40,
|
||||
("COMPANY", "COMPANY"): 40,
|
||||
("PERSON", "COMPANY"): 10,
|
||||
("COMPANY", "PERSON"): 10,
|
||||
("PERSON", "O"): 30,
|
||||
("COMPANY", "O"): 30,
|
||||
}
|
||||
),
|
||||
model_errors=None,
|
||||
text=None,
|
||||
)
|
||||
]
|
||||
|
||||
model = MockTokensModel(prediction=None)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON", "COMPANY"])
|
||||
scores = evaluator.calculate_score(evaluated, beta=2.5)
|
||||
|
||||
assert scores.pii_precision == 0.625
|
||||
assert scores.pii_recall == 0.625
|
||||
assert scores.entity_recall_dict["PERSON"] == 0.5
|
||||
assert scores.entity_precision_dict["PERSON"] == 0.5
|
||||
assert scores.entity_recall_dict["COMPANY"] == 0.5
|
||||
assert scores.entity_precision_dict["COMPANY"] == 0.5
|
||||
|
||||
|
||||
def test_confusion_matrix_2_correct_metrics():
|
||||
from collections import Counter
|
||||
|
||||
evaluated = [
|
||||
EvaluationResult(
|
||||
results=Counter(
|
||||
{
|
||||
("O", "O"): 65467,
|
||||
("O", "ORG"): 4189,
|
||||
("GPE", "O"): 3370,
|
||||
("PERSON", "PERSON"): 2024,
|
||||
("GPE", "PERSON"): 1488,
|
||||
("GPE", "GPE"): 1033,
|
||||
("O", "GPE"): 964,
|
||||
("ORG", "ORG"): 914,
|
||||
("O", "PERSON"): 834,
|
||||
("GPE", "ORG"): 401,
|
||||
("PERSON", "ORG"): 35,
|
||||
("PERSON", "O"): 33,
|
||||
("ORG", "O"): 8,
|
||||
("PERSON", "GPE"): 5,
|
||||
("ORG", "PERSON"): 1,
|
||||
}
|
||||
),
|
||||
model_errors=None,
|
||||
text=None,
|
||||
)
|
||||
]
|
||||
|
||||
model = MockTokensModel(prediction=None)
|
||||
evaluator = Evaluator(model=model)
|
||||
scores = evaluator.calculate_score(evaluated, beta=2.5)
|
||||
|
||||
pii_tp = (
|
||||
evaluated[0].results[("PERSON", "PERSON")]
|
||||
+ evaluated[0].results[("ORG", "ORG")]
|
||||
+ evaluated[0].results[("GPE", "GPE")]
|
||||
+ evaluated[0].results[("ORG", "GPE")]
|
||||
+ evaluated[0].results[("ORG", "PERSON")]
|
||||
+ evaluated[0].results[("GPE", "ORG")]
|
||||
+ evaluated[0].results[("GPE", "PERSON")]
|
||||
+ evaluated[0].results[("PERSON", "GPE")]
|
||||
+ evaluated[0].results[("PERSON", "ORG")]
|
||||
)
|
||||
|
||||
pii_fp = (
|
||||
evaluated[0].results[("O", "PERSON")]
|
||||
+ evaluated[0].results[("O", "GPE")]
|
||||
+ evaluated[0].results[("O", "ORG")]
|
||||
)
|
||||
|
||||
pii_fn = (
|
||||
evaluated[0].results[("PERSON", "O")]
|
||||
+ evaluated[0].results[("GPE", "O")]
|
||||
+ evaluated[0].results[("ORG", "O")]
|
||||
)
|
||||
|
||||
assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
|
||||
assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)
|
||||
|
||||
|
||||
def test_dataset_to_metric_identity_model():
|
||||
import os
|
||||
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(
|
||||
"{}/data/generated_small.txt".format(dir_path), length=10
|
||||
)
|
||||
|
||||
model = IdentityTokensMockModel()
|
||||
evaluator = Evaluator(model=model)
|
||||
evaluation_results = evaluator.evaluate_all(input_samples)
|
||||
metrics = evaluator.calculate_score(evaluation_results)
|
||||
|
||||
assert metrics.pii_precision == 1
|
||||
assert metrics.pii_recall == 1
|
||||
|
||||
|
||||
def test_dataset_to_metric_50_50_model():
|
||||
import os
|
||||
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(
|
||||
"{}/data/generated_small.txt".format(dir_path), length=100
|
||||
)
|
||||
|
||||
# Replace 50% of the predictions with a list of "O"
|
||||
model = FiftyFiftyIdentityTokensMockModel()
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON"])
|
||||
evaluation_results = evaluator.evaluate_all(input_samples)
|
||||
metrics = evaluator.calculate_score(evaluation_results)
|
||||
|
||||
print(metrics.pii_precision)
|
||||
print(metrics.pii_recall)
|
||||
print(metrics.pii_f)
|
||||
|
||||
assert metrics.pii_precision == 1
|
||||
assert metrics.pii_recall < 0.75
|
||||
assert metrics.pii_recall > 0.25
|
|
@ -1,15 +1,18 @@
import pytest

from presidio_evaluator.evaluation import Evaluator

try:
    from flair.models import SequenceTagger
except:
    ImportError("Flair is not installed by default")

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.flair_evaluator import FlairEvaluator
from presidio_evaluator.models.flair_model import FlairModel

import numpy as np


# no-unit because flair is not a dependency by default
@pytest.mark.skip(reason="Flair not installed by default")
def test_flair_simple():

@ -22,9 +25,10 @@ def test_flair_simple():

    model = SequenceTagger.load("ner-ontonotes-fast")  # .load('ner')

    flair_evaluator = FlairEvaluator(model=model, entities_to_keep=["PERSON"])
    evaluation_results = flair_evaluator.evaluate_all(input_samples)
    scores = flair_evaluator.calculate_score(evaluation_results)
    flair_model = FlairModel(model=model, entities_to_keep=["PERSON"])
    evaluator = Evaluator(model=flair_model)
    evaluation_results = evaluator.evaluate_all(input_samples)
    scores = evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(
        scores.pii_precision, scores.entity_precision_dict["PERSON"]
@@ -1,271 +0,0 @@
import numpy as np
import pytest

from presidio_evaluator import InputSample, EvaluationResult
from presidio_evaluator.data_generator import read_synth_dataset
from tests.mocks import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, MockTokensModel


def test_evaluator_simple():
    prediction = ["O", "O", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample(full_text="I am the walrus",
                         masked="I am the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    final_evaluation = model.calculate_score(
        [evaluated])

    assert final_evaluation.pii_precision == 1
    assert final_evaluation.pii_recall == 1


def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
    prediction = ["O", "O", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction,
                            entities_to_keep=['SPACESHIP'])

    sample = InputSample(full_text="I am the walrus",
                         masked="I am the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    assert evaluated.results[("O", "O")] == 4


def test_evaluate_same_entity_correct_statistics():
    prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample(full_text="I dog the walrus",
                         masked="I [ANIMAL] the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluation_result = model.evaluate_sample(sample)
    assert evaluation_result.results[("O", "O")] == 2
    assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
    assert evaluation_result.results[("O", "ANIMAL")] == 1


def test_evaluate_multiple_entities_to_keep_correct_statistics():
    prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, labeling_scheme='BIO',
                            entities_to_keep=['ANIMAL', 'PLANT', 'SPACESHIP'])
    sample = InputSample(full_text="I dog the walrus",
                         masked="I [ANIMAL] the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluation_result = model.evaluate_sample(sample)
    assert evaluation_result.results[("O", "O")] == 2
    assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
    assert evaluation_result.results[("O", "ANIMAL")] == 1


def test_evaluate_multiple_tokens_correct_statistics():
    prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the",
                     "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O",
                   "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score(
        [evaluated])

    assert evaluation.pii_precision == 1
    assert evaluation.pii_recall == 1


def test_evaluate_multiple_tokens_partial_match_correct_statistics():
    prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score(
        [evaluated])

    assert evaluation.pii_precision == 1
    assert evaluation.pii_recall == 4 / 6


def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
    prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score(
        [evaluated])

    assert np.isnan(evaluation.pii_precision)
    assert evaluation.pii_recall == 0


def test_evaluate_multiple_examples_correct_statistics():
    prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
    model = MockTokensModel(prediction=prediction,
                            labeling_scheme='BILOU',
                            entities_to_keep=['PERSON'])
    input_sample = InputSample("My name is Raphael or David", masked=None,
                               spans=None)
    input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
    input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]

    evaluated = model.evaluate_all(
        [input_sample, input_sample, input_sample, input_sample])
    scores = model.calculate_score(
        evaluated)
    assert scores.pii_precision == 0.5
    assert scores.pii_recall == 0.5


def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
    prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
    model = MockTokensModel(prediction=prediction,
                            labeling_scheme='BILOU',
                            entities_to_keep=['PERSON', 'TENNIS_PLAYER'])
    input_sample = InputSample("My name is Raphael or David", masked=None,
                               spans=None)
    input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
    input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]

    evaluated = model.evaluate_all(
        [input_sample, input_sample, input_sample, input_sample])
    scores = model.calculate_score(evaluated)
    assert scores.pii_precision == 1
    assert scores.pii_recall == 1


def test_confusion_matrix_correct_metrics():
    from collections import Counter

    evaluated = [EvaluationResult(results=Counter({
        ('O', 'O'): 150,
        ('O', 'PERSON'): 30,
        ('O', 'COMPANY'): 30,
        ('PERSON', 'PERSON'): 40,
        ('COMPANY', 'COMPANY'): 40,
        ('PERSON', 'COMPANY'): 10,
        ('COMPANY', 'PERSON'): 10,
        ('PERSON', 'O'): 30,
        ('COMPANY', 'O'): 30}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None,
                            entities_to_keep=['PERSON', 'COMPANY'])

    scores = model.calculate_score(evaluated, beta=2.5)

    assert scores.pii_precision == 0.625
    assert scores.pii_recall == 0.625
    assert scores.entity_recall_dict['PERSON'] == 0.5
    assert scores.entity_precision_dict['PERSON'] == 0.5
    assert scores.entity_recall_dict['COMPANY'] == 0.5
    assert scores.entity_precision_dict['COMPANY'] == 0.5


def test_confusion_matrix_2_correct_metrics():
    from collections import Counter

    evaluated = [EvaluationResult(results=Counter(
        {('O', 'O'): 65467,
         ('O', 'ORG'): 4189,
         ('GPE', 'O'): 3370,
         ('PERSON', 'PERSON'): 2024,
         ('GPE', 'PERSON'): 1488,
         ('GPE', 'GPE'): 1033,
         ('O', 'GPE'): 964,
         ('ORG', 'ORG'): 914,
         ('O', 'PERSON'): 834,
         ('GPE', 'ORG'): 401,
         ('PERSON', 'ORG'): 35,
         ('PERSON', 'O'): 33,
         ('ORG', 'O'): 8,
         ('PERSON', 'GPE'): 5,
         ('ORG', 'PERSON'): 1}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None)

    scores = model.calculate_score(evaluated, beta=2.5)

    pii_tp = evaluated[0].results[('PERSON', 'PERSON')] + \
             evaluated[0].results[('ORG', 'ORG')] + \
             evaluated[0].results[('GPE', 'GPE')] + \
             evaluated[0].results[('ORG', 'GPE')] + \
             evaluated[0].results[('ORG', 'PERSON')] + \
             evaluated[0].results[('GPE', 'ORG')] + \
             evaluated[0].results[('GPE', 'PERSON')] + \
             evaluated[0].results[('PERSON', 'GPE')] + \
             evaluated[0].results[('PERSON', 'ORG')]

    pii_fp = evaluated[0].results[('O', 'PERSON')] + \
             evaluated[0].results[('O', 'GPE')] + \
             evaluated[0].results[('O', 'ORG')]

    pii_fn = evaluated[0].results[('PERSON', 'O')] + \
             evaluated[0].results[('GPE', 'O')] + \
             evaluated[0].results[('ORG', 'O')]

    assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
    assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)


def test_dataset_to_metric_identity_model():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        "{}/data/generated_small.txt".format(dir_path), length=10)

    model = IdentityTokensMockModel()

    evaluation_results = model.evaluate_all(input_samples)
    metrics = model.calculate_score(
        evaluation_results)

    assert metrics.pii_precision == 1
    assert metrics.pii_recall == 1


def test_dataset_to_metric_50_50_model():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        "{}/data/generated_small.txt".format(dir_path), length=100)

    # Replace 50% of the predictions with a list of "O"
    model = FiftyFiftyIdentityTokensMockModel(entities_to_keep='PERSON')

    evaluation_results = model.evaluate_all(input_samples)
    metrics = model.calculate_score(
        evaluation_results)

    print(metrics.pii_precision)
    print(metrics.pii_recall)
    print(metrics.pii_f)

    assert metrics.pii_precision == 1
    assert metrics.pii_recall < 0.75
    assert metrics.pii_recall > 0.25
@@ -2,27 +2,8 @@ import pytest

from presidio_evaluator import InputSample, Span
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_analyzer_evaluator import PresidioAnalyzerEvaluator

# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
entities_mapping = {
    "PERSON": "PERSON",
    "EMAIL": "EMAIL_ADDRESS",
    "CREDIT_CARD": "CREDIT_CARD",
    "FIRST_NAME": "PERSON",
    "PHONE_NUMBER": "PHONE_NUMBER",
    "BIRTHDAY": "DATE_TIME",
    "DATE": "DATE_TIME",
    "DOMAIN": "DOMAIN",
    "CITY": "LOCATION",
    "ADDRESS": "LOCATION",
    "IBAN": "IBAN_CODE",
    "URL": "DOMAIN_NAME",
    "US_SSN": "US_SSN",
    "IP_ADDRESS": "IP_ADDRESS",
    "ORGANIZATION": "ORG",
    "O": "O",
}
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_analyzer_wrapper import PresidioAnalyzerWrapper


class GeneratedTextTestCase:

@@ -54,8 +35,7 @@ analyzer_test_generate_text_testdata = [


def test_analyzer_simple_input():
    model = PresidioAnalyzerEvaluator(entities_to_keep=["PERSON"])

    model = PresidioAnalyzerWrapper(entities_to_keep=["PERSON"])
    sample = InputSample(
        full_text="My name is Mike",
        masked="My name is [PERSON]",

@@ -63,8 +43,11 @@ def test_analyzer_simple_input():
        create_tags_from_span=True,
    )

    evaluated = model.evaluate_sample(sample)
    metrics = model.calculate_score([evaluated])
    prediction = model.predict(sample)
    evaluator = Evaluator(model=model)

    evaluated = evaluator.evaluate_sample(sample, prediction)
    metrics = evaluator.calculate_score([evaluated])

    assert metrics.pii_precision == 1
    assert metrics.pii_recall == 1

@@ -89,13 +72,14 @@ def test_analyzer_with_generated_text(test_input, acceptance_threshold):
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(test_input.format(dir_path))

    updated_samples = PresidioAnalyzerEvaluator.align_input_samples_to_presidio_analyzer(
        input_samples=input_samples, entities_mapping=entities_mapping
    updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(
        input_samples=input_samples, entities_mapping=PresidioAnalyzerWrapper.presidio_entities_map
    )

    analyzer = PresidioAnalyzerEvaluator()
    evaluated_samples = analyzer.evaluate_all(updated_samples)
    scores = analyzer.calculate_score(evaluation_results=evaluated_samples)
    analyzer = PresidioAnalyzerWrapper()
    evaluator = Evaluator(model=analyzer)
    evaluated_samples = evaluator.evaluate_all(updated_samples)
    scores = evaluator.calculate_score(evaluation_results=evaluated_samples)

    assert acceptance_threshold <= scores.pii_precision
    assert acceptance_threshold <= scores.pii_recall
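The same pattern applies when scoring a local presidio-analyzer install: dataset entities are first mapped to Presidio's entity names via `Evaluator.align_input_samples_to_presidio_analyzer` and the wrapper's built-in `presidio_entities_map`, then evaluated through `Evaluator`. A minimal sketch, assuming presidio-analyzer is installed and `data/generated_small.txt` exists locally:

```python
# Sketch of evaluating a local presidio-analyzer install with the updated API.
# Assumptions: presidio-analyzer is installed and "data/generated_small.txt" exists.
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_analyzer_wrapper import PresidioAnalyzerWrapper

input_samples = read_synth_dataset("data/generated_small.txt")

# Map dataset entity names (e.g. FIRST_NAME, BIRTHDAY) to Presidio's entity names
aligned_samples = Evaluator.align_input_samples_to_presidio_analyzer(
    input_samples=input_samples,
    entities_mapping=PresidioAnalyzerWrapper.presidio_entities_map,
)

analyzer = PresidioAnalyzerWrapper()
evaluator = Evaluator(model=analyzer)
results = evaluator.evaluate_all(aligned_samples)
scores = evaluator.calculate_score(evaluation_results=results)
print(scores.pii_precision, scores.pii_recall)
```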
@@ -8,19 +8,16 @@ import pandas as pd


@pytest.mark.parametrize(
    # fmt: off
    "text, entity1, entity2, start1, end1, start2, end2",
    [
        (
            "Hi I live in South Africa and my name is Toma",
            "LOCATION",
            "PERSON",
            13,
            25,
            41,
            45,
            "LOCATION", "PERSON", 13, 25, 41, 45,
        ),
        ("Africa is my continent, James", "LOCATION", "PERSON", 0, 6, 24, 29,),
    ],
    # fmt: on
)
def test_presidio_perturb_two_entities(
    text, entity1, entity2, start1, end1, start2, end2

@@ -51,15 +48,13 @@ def test_entity_translation():
        RecognizerResult(entity_type="EMAIL_ADDRESS", start=12, end=27, score=0.5)
    ]

    presidio_perturb = PresidioPerturb(
        fake_pii_df=get_mock_fake_df(), entity_dict={"EMAIL_ADDRESS": "EMAIL"}
    )
    presidio_perturb = PresidioPerturb(fake_pii_df=get_mock_fake_df())
    fake_df = presidio_perturb.fake_pii
    perturbations = presidio_perturb.perturb(
        original_text=text, presidio_response=presidio_response, count=1
    )

    assert fake_df["EMAIL"].str.lower()[0] in perturbations[0]
    assert fake_df["EMAIL_ADDRESS"].str.lower()[0] in perturbations[0]


def test_subset_perturbation():

@@ -76,7 +71,7 @@ def test_subset_perturbation():
            "NameSet": ["Hebrew", "English"],
        }
    )
    ignore_types = ("DATE", "LOCATION", "ADDRESS", "GENDER")
    ignore_types = {"DATE", "LOCATION", "ADDRESS", "GENDER"}

    presidio_perturb = PresidioPerturb(fake_pii_df=fake_df, ignore_types=ignore_types)
@@ -1,8 +1,8 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_recognizer_evaluator import score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest

from presidio_analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

# test case parameters for tests with dataset which was previously generated.
class GeneratedTextTestCase:

@@ -13,8 +13,12 @@ class GeneratedTextTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.test_input, self.acceptance_threshold,
                            id=self.test_name, marks=self.marks)
        return pytest.param(
            self.test_input,
            self.acceptance_threshold,
            id=self.test_name,
            marks=self.marks,
        )


# generated-text test cases

@@ -24,35 +28,39 @@ cc_test_generate_text_testdata = [
        test_name="small-set",
        test_input="{}/data/generated_small.txt",
        acceptance_threshold=1,
        marks=pytest.mark.none
        marks=pytest.mark.none,
    ),
    # large set fixture which expects all type results. marked as "slow"
    GeneratedTextTestCase(
        test_name="large_set",
        test_input="{}/data/generated_large.txt",
        acceptance_threshold=1,
        marks=pytest.mark.slow
    )
        marks=pytest.mark.slow,
    ),
]


# credit card recognizer tests on generated data
@pytest.mark.parametrize("test_input,acceptance_threshold",
                         [testcase.to_pytest_param()
                          for testcase in cc_test_generate_text_testdata])
@pytest.mark.parametrize(
    "test_input,acceptance_threshold",
    [testcase.to_pytest_param() for testcase in cc_test_generate_text_testdata],
)
def test_credit_card_recognizer_with_generated_text(test_input, acceptance_threshold):
    """
        Test credit card recognizer with a generated dataset text file
        :param test_input: input text file location
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test credit card recognizer with a generated dataset text file
    :param test_input: input text file location
    :param acceptance_threshold: minimim precision/recall
    allowed for tests to pass
    """

    # read test input from generated file
    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        test_input.format(dir_path))
    input_samples = read_synth_dataset(test_input.format(dir_path))
    scores = score_presidio_recognizer(
        CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
        recognizer=CreditCardRecognizer(),
        entities_to_keep=["CREDIT_CARD"],
        input_samples=input_samples,
    )
    assert acceptance_threshold <= scores.pii_f
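`score_presidio_recognizer` now lives in `presidio_evaluator.evaluation.scorers` and is called with keyword arguments, as the updated test shows. A minimal end-to-end sketch, assuming presidio-analyzer is installed and `data/generated_small.txt` exists locally:

```python
# Sketch of scoring a single recognizer with the relocated scorer function.
# Assumptions: presidio-analyzer is installed and "data/generated_small.txt" exists.
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

input_samples = read_synth_dataset("data/generated_small.txt")

scores = score_presidio_recognizer(
    recognizer=CreditCardRecognizer(),
    entities_to_keep=["CREDIT_CARD"],
    input_samples=input_samples,
)
print(scores.pii_f)
```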
@@ -1,15 +1,25 @@
from presidio_evaluator.data_generator import generate
from presidio_evaluator.presidio_recognizer_evaluator import \
    score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest
import numpy as np

from presidio_analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer


# test case parameters for tests with dataset generated from a template and csv values
class TemplateTextTestCase:
    def __init__(self, test_name, pii_csv, utterances, dictionary_path,
                 num_of_examples, acceptance_threshold, marks):
    """
    Test case parameters for tests with dataset generated from a template and csv values
    """
    def __init__(
        self,
        test_name,
        pii_csv,
        utterances,
        dictionary_path,
        num_of_examples,
        acceptance_threshold,
        marks,
    ):
        self.test_name = test_name
        self.pii_csv = pii_csv
        self.utterances = utterances

@@ -19,9 +29,15 @@ class TemplateTextTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.pii_csv, self.utterances, self.dictionary_path,
                            self.num_of_examples, self.acceptance_threshold,
                            id=self.test_name, marks=self.marks)
        return pytest.param(
            self.pii_csv,
            self.utterances,
            self.dictionary_path,
            self.num_of_examples,
            self.acceptance_threshold,
            id=self.test_name,
            marks=self.marks,
        )


# template-dataset test cases

@@ -34,46 +50,52 @@ cc_test_template_testdata = [
        dictionary_path="{}/data/Dictionary_test.csv",
        num_of_examples=100,
        acceptance_threshold=0.9,
        marks=pytest.mark.slow
        marks=pytest.mark.slow,
    )
]


# credit card recognizer tests on template-generates data
@pytest.mark.parametrize("pii_csv, "
                         "utterances, "
                         "dictionary_path, "
                         "num_of_examples, "
                         "acceptance_threshold",
                         [testcase.to_pytest_param()
                          for testcase in cc_test_template_testdata])
def test_credit_card_recognizer_with_template(pii_csv, utterances,
                                              dictionary_path,
                                              num_of_examples,
                                              acceptance_threshold):
@pytest.mark.parametrize(
    "pii_csv, "
    "utterances, "
    "dictionary_path, "
    "num_of_examples, "
    "acceptance_threshold",
    [testcase.to_pytest_param() for testcase in cc_test_template_testdata],
)
def test_credit_card_recognizer_with_template(
    pii_csv, utterances, dictionary_path, num_of_examples, acceptance_threshold
):
    """
        Test credit card recognizer with a dataset generated from
        template and a CSV values file
        :param pii_csv: input csv file location
        :param utterances: template file location
        :param dictionary_path: dictionary/vocabulary file location
        :param num_of_examples: number of samples to be used from dataset
        to test
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test credit card recognizer with a dataset generated from
    template and a CSV values file
    :param pii_csv: input csv file location
    :param utterances: template file location
    :param dictionary_path: dictionary/vocabulary file location
    :param num_of_examples: number of samples to be used from dataset
    to test
    :param acceptance_threshold: minimum precision/recall
    allowed for tests to pass
    """

    # read template and CSV files
    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))

    input_samples = generate(fake_pii_csv=pii_csv.format(dir_path),
                             utterances_file=utterances.format(dir_path),
                             dictionary_path=dictionary_path.format(dir_path),
                             lower_case_ratio=0.5,
                             num_of_examples=num_of_examples)
    input_samples = generate(
        fake_pii_csv=pii_csv.format(dir_path),
        utterances_file=utterances.format(dir_path),
        dictionary_path=dictionary_path.format(dir_path),
        lower_case_ratio=0.5,
        num_of_examples=num_of_examples,
    )

    scores = score_presidio_recognizer(
        CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
        recognizer=CreditCardRecognizer(),
        entities_to_keep=["CREDIT_CARD"],
        input_samples=input_samples,
    )
    if not np.isnan(scores.pii_f):
        assert acceptance_threshold <= scores.pii_f
@@ -1,18 +1,32 @@
from presidio_evaluator.data_generator import FakeDataGenerator
from presidio_evaluator.presidio_recognizer_evaluator import \
    score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pandas as pd
import pytest
import numpy as np

from presidio_analyzer import Pattern, PatternRecognizer

# test case parameters for tests with dataset generated from a template and
# two csv value files, one containing the common-entities and another one with custom entities

class PatternRecognizerTestCase:
    def __init__(self, test_name, entity_name, pattern, score, pii_csv, ext_csv,
                 utterances, dictionary_path, num_of_examples, acceptance_threshold,
                 max_mistakes_number, marks):
    """
    Test case parameters for tests with dataset generated from a template and
    two csv value files, one containing the common-entities and another one with custom entities.
    """
    def __init__(
        self,
        test_name,
        entity_name,
        pattern,
        score,
        pii_csv,
        ext_csv,
        utterances,
        dictionary_path,
        num_of_examples,
        acceptance_threshold,
        max_mistakes_number,
        marks,
    ):
        self.test_name = test_name
        self.entity_name = entity_name
        self.pattern = pattern

@@ -27,12 +41,20 @@ class PatternRecognizerTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.pii_csv, self.ext_csv, self.utterances,
                            self.dictionary_path,
                            self.entity_name, self.pattern, self.score,
                            self.num_of_examples, self.acceptance_threshold,
                            self.max_mistakes_number, id=self.test_name,
                            marks=self.marks)
        return pytest.param(
            self.pii_csv,
            self.ext_csv,
            self.utterances,
            self.dictionary_path,
            self.entity_name,
            self.pattern,
            self.score,
            self.num_of_examples,
            self.acceptance_threshold,
            self.max_mistakes_number,
            id=self.test_name,
            marks=self.marks,
        )


# template-dataset test cases

@@ -42,7 +64,7 @@ rocket_test_template_testdata = [
    PatternRecognizerTestCase(
        test_name="rocket-no-errors",
        entity_name="ROCKET",
        pattern=r'\W*(rocket)\W*',
        pattern=r"\W*(rocket)\W*",
        score=0.8,
        pii_csv="{}/data/FakeNameGenerator.com_100.csv",
        ext_csv="{}/data/FakeRocketGenerator.csv",

@@ -51,14 +73,14 @@ rocket_test_template_testdata = [
        num_of_examples=100,
        acceptance_threshold=1,
        max_mistakes_number=0,
        marks=pytest.mark.slow
        marks=pytest.mark.slow,
    ),
    # large dataset fixture. marked as slow
    # all input is correct, test is conclusive
    PatternRecognizerTestCase(
        test_name="rocket-all-errors",
        entity_name="ROCKET",
        pattern=r'\W*(rocket)\W*',
        pattern=r"\W*(rocket)\W*",
        score=0.8,
        pii_csv="{}/data/FakeNameGenerator.com_100.csv",
        ext_csv="{}/data/FakeRocketErrorsGenerator.csv",

@@ -67,14 +89,14 @@ rocket_test_template_testdata = [
        num_of_examples=100,
        acceptance_threshold=0,
        max_mistakes_number=100,
        marks=pytest.mark.slow
        marks=pytest.mark.slow,
    ),
    # large dataset fixture. marked as slow
    # some input is correct some is not, test is inconclusive
    PatternRecognizerTestCase(
        test_name="rocket-some-errors",
        entity_name="ROCKET",
        pattern=r'\W*(rocket)\W*',
        pattern=r"\W*(rocket)\W*",
        score=0.8,
        pii_csv="{}/data/FakeNameGenerator.com_100.csv",
        ext_csv="{}/data/FakeRocket50PercentErrorsGenerator.csv",

@@ -83,8 +105,8 @@ rocket_test_template_testdata = [
        num_of_examples=100,
        acceptance_threshold=0.3,
        max_mistakes_number=70,
        marks=[pytest.mark.slow, pytest.mark.inconclusive]
    )
        marks=[pytest.mark.slow, pytest.mark.inconclusive],
    ),
]


@@ -92,30 +114,39 @@ rocket_test_template_testdata = [
    "pii_csv, ext_csv, utterances, dictionary_path, "
    "entity_name, pattern, score, num_of_examples, "
    "acceptance_threshold, max_mistakes_number",
    [testcase.to_pytest_param()
     for testcase in rocket_test_template_testdata])
def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
                            entity_name, pattern,
                            score, num_of_examples, acceptance_threshold,
                            max_mistakes_number):
    [testcase.to_pytest_param() for testcase in rocket_test_template_testdata],
)
def test_pattern_recognizer(
    pii_csv,
    ext_csv,
    utterances,
    dictionary_path,
    entity_name,
    pattern,
    score,
    num_of_examples,
    acceptance_threshold,
    max_mistakes_number,
):
    """
        Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
        and another CSV values file with a custom entity
        :param pii_csv: input csv file location with the common entities
        :param ext_csv: input csv file location with custom entities
        :param utterances: template file location
        :param dictionary_path: vocabulary/dictionary file location
        :param entity_name: custom entity name
        :param pattern: recognizer pattern
        :param num_of_examples: number of samples to be used from dataset to test
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
    and another CSV values file with a custom entity
    :param pii_csv: input csv file location with the common entities
    :param ext_csv: input csv file location with custom entities
    :param utterances: template file location
    :param dictionary_path: vocabulary/dictionary file location
    :param entity_name: custom entity name
    :param pattern: recognizer pattern
    :param num_of_examples: number of samples to be used from dataset to test
    :param acceptance_threshold: minimum precision/recall
    allowed for tests to pass
    """

    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))
    dfpii = pd.read_csv(pii_csv.format(dir_path), encoding='utf-8')
    dfext = pd.read_csv(ext_csv.format(dir_path), encoding='utf-8')
    dfpii = pd.read_csv(pii_csv.format(dir_path), encoding="utf-8")
    dfext = pd.read_csv(ext_csv.format(dir_path), encoding="utf-8")
    dictionary_path = dictionary_path.format(dir_path)
    ext_column_name = dfext.columns[0]

@@ -127,18 +158,23 @@ def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
    dfpii[ext_column_name] = [get_from_ext(i) for i in range(0, dfpii.shape[0])]

    # generate examples
    generator = FakeDataGenerator(fake_pii_csv_file=dfpii,
                                  utterances_file=utterances.format(dir_path),
                                  dictionary_path=dictionary_path)
    generator = FakeDataGenerator(
        fake_pii_df=dfpii,
        templates=utterances.format(dir_path),
        dictionary_path=dictionary_path,
    )
    examples = generator.sample_examples(num_of_examples)

    pattern = Pattern("test pattern", pattern, score)
    pattern_recognizer = PatternRecognizer(entity_name,
                                           name="test recognizer",
                                           patterns=[pattern])
    pattern_recognizer = PatternRecognizer(
        entity_name, name="test recognizer", patterns=[pattern]
    )

    scores = score_presidio_recognizer(
        pattern_recognizer, [entity_name], examples)
        recognizer=pattern_recognizer,
        entities_to_keep=[entity_name],
        input_samples=examples,
    )
    if not np.isnan(scores.pii_f):
        assert acceptance_threshold <= scores.pii_f
        assert max_mistakes_number >= len(scores.model_errors)
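The pattern-recognizer test above combines template-based data generation with an ad-hoc `PatternRecognizer`. A simplified sketch of that flow under stated assumptions: the CSV, template and dictionary file paths are illustrative, and `pii_f` may be NaN if the custom entity never appears in the generated samples, hence the same guard the test uses:

```python
# Sketch of generating template-based samples and scoring a custom pattern recognizer.
# Assumptions: the CSV, template and dictionary files exist locally and follow the
# formats used by this repo's tests; the regex and entity name are illustrative.
import numpy as np
import pandas as pd
from presidio_analyzer import Pattern, PatternRecognizer

from presidio_evaluator.data_generator import FakeDataGenerator
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

fake_pii_df = pd.read_csv("data/FakeNameGenerator.com_100.csv", encoding="utf-8")

generator = FakeDataGenerator(
    fake_pii_df=fake_pii_df,
    templates="data/templates.txt",  # illustrative template file name
    dictionary_path="data/Dictionary_test.csv",
)
examples = generator.sample_examples(100)

pattern = Pattern("rocket pattern", r"\W*(rocket)\W*", 0.8)
rocket_recognizer = PatternRecognizer(
    "ROCKET", name="rocket recognizer", patterns=[pattern]
)

scores = score_presidio_recognizer(
    recognizer=rocket_recognizer,
    entities_to_keep=["ROCKET"],
    input_samples=examples,
)
if not np.isnan(scores.pii_f):
    print(scores.pii_f)
```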
@@ -1,5 +1,6 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.spacy_evaluator import SpacyEvaluator
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.spacy_model import SpacyModel
import numpy as np


@@ -8,9 +9,10 @@ def test_spacy_simple():
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
    evaluation_results = spacy_evaluator.evaluate_all(input_samples)
    scores = spacy_evaluator.calculate_score(evaluation_results)
    spacy_model = SpacyModel(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
    evaluator = Evaluator(model=spacy_model)
    evaluation_results = evaluator.evaluate_all(input_samples)
    scores = evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
@@ -1,12 +1,14 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_recognizer_evaluator import \
    score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

import pytest
from presidio_analyzer.predefined_recognizers.spacy_recognizer import SpacyRecognizer

# test case parameters for tests with dataset which was previously generated.

class GeneratedTextTestCase:
    """
    Test case parameters for tests with dataset which was previously generated.
    """
    def __init__(self, test_name, test_input, acceptance_threshold, marks):
        self.test_name = test_name
        self.test_input = test_input

@@ -14,8 +16,12 @@ class GeneratedTextTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.test_input, self.acceptance_threshold,
                            id=self.test_name, marks=self.marks)
        return pytest.param(
            self.test_input,
            self.acceptance_threshold,
            id=self.test_name,
            marks=self.marks,
        )


# generated-text test cases

@@ -25,35 +31,37 @@ cc_test_generate_text_testdata = [
        test_name="small-set",
        test_input="{}/data/generated_small.txt",
        acceptance_threshold=0.5,
        marks=pytest.mark.inconclusive
        marks=pytest.mark.inconclusive,
    ),
    # large dataset - test is slow and inconclusive
    GeneratedTextTestCase(
        test_name="large-set",
        test_input="{}/data/generated_large.txt",
        acceptance_threshold=0.5,
        marks=pytest.mark.slow
    )
        marks=pytest.mark.slow,
    ),
]


# credit card recognizer tests on generated data
@pytest.mark.parametrize("test_input,acceptance_threshold",
                         [testcase.to_pytest_param() for testcase in
                          cc_test_generate_text_testdata])
@pytest.mark.parametrize(
    "test_input,acceptance_threshold",
    [testcase.to_pytest_param() for testcase in cc_test_generate_text_testdata],
)
def test_spacy_recognizer_with_generated_text(test_input, acceptance_threshold):
    """
        Test spacy recognizer with a generated dataset text file
        :param test_input: input text file location
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test spacy recognizer with a generated dataset text file
    :param test_input: input text file location
    :param acceptance_threshold: minimim precision/recall
    allowed for tests to pass
    """

    # read test input from generated file
    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        test_input.format(dir_path))
    input_samples = read_synth_dataset(test_input.format(dir_path))
    scores = score_presidio_recognizer(
        SpacyRecognizer(), ['PERSON'], input_samples, True)
        SpacyRecognizer(), ["PERSON"], input_samples, with_nlp_artifacts=True
    )
    assert acceptance_threshold <= scores.pii_f
@@ -2,8 +2,9 @@ from presidio_evaluator import span_to_tag

BILOU_SCHEME = "BILOU"
BIO_SCHEME = "BIO"
IO_SCHEME = "IO"


# fmt: off
def test_span_to_bio_multiple_tokens():
    text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
    start = 14

@@ -166,8 +167,7 @@ def test_overlapping_entities_first_ends_in_mid_second():
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'US_PHONE_NUMBER', 'US_PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(IO_SCHEME, text, start, end, tag, scores)
    assert io == expected


@@ -180,8 +180,7 @@ def test_overlapping_entities_second_embedded_in_first_with_lower_score():
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'PHONE_NUMBER',
                'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
    assert io == expected


@@ -194,8 +193,7 @@ def test_overlapping_entities_second_embedded_in_first_has_higher_score():
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
    assert io == expected


@@ -207,6 +205,6 @@ def test_overlapping_entities_pyramid():
    tag = ["A1", "B2", "C3"]
    expected = ['O', 'O', 'O', 'O', 'O', 'A1', 'B2', 'C3', 'B2',
                'A1', 'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
    assert io == expected
# fmt: on
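The `span_to_tag` change replaces the old `io_tags_only` flag with an explicit IO scheme argument. A minimal sketch of the new call; the text, offsets and scores below are illustrative:

```python
# Sketch of the updated span_to_tag call: the IO scheme replaces io_tags_only=True.
# The text, span offsets and scores are illustrative examples.
from presidio_evaluator import span_to_tag

IO_SCHEME = "IO"

text = "My name is Raphael"
start = [11]          # character offset where the PERSON span starts
end = [18]            # character offset where the PERSON span ends
tag = ["PERSON"]
scores = [1.0]

io_tags = span_to_tag(
    scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores
)
print(io_tags)  # one tag per token, e.g. ['O', 'O', 'O', 'PERSON']
```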