Updates to Presidio 2 and spaCy 3

Parent: e2528bdca7
Commit: 83bb254b5d

README.md (45 lines changed)
@ -11,15 +11,15 @@ To install the package, clone the repo and install all dependencies, preferably

``` sh
# Create conda env (optional)
conda create --name presidio python=3.7
conda create --name presidio python=3.8
conda activate presidio

# Install package+dependencies
pip install -r requirements.txt
python setup.py install

# Optionally link in the local development copy of presidio-analyzer
pip install -e [path to presidio-analyzer]

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

# Verify installation
pytest
@ -58,7 +58,7 @@ In order to standardize the process, we use specific data objects that hold all

## 3. Recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [ModelEvaluator](presidio_evaluator/model_evaluator.py) class. It provides a structured way of evaluating models and recognizers.
The main logic lies in the [ModelEvaluator](presidio_evaluator/models/base_model.py) class. It provides a structured way of evaluating models and recognizers.
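To make the structure concrete, below is a minimal, hedged sketch of a custom model wrapper. It assumes the import path implied by the README link above and the `predict(InputSample) -> List[str]` contract used by the CRF wrapper elsewhere in this diff; names may differ in the actual package.

```python
from typing import List

from presidio_evaluator import InputSample
# Import path follows the README link above; treat it as an assumption.
from presidio_evaluator.models.base_model import ModelEvaluator


class UppercaseHeuristicModel(ModelEvaluator):
    """Toy wrapper: tags capitalized tokens as PERSON and everything else as O."""

    def predict(self, sample: InputSample) -> List[str]:
        # One tag per token, mirroring the contract of the CRF wrapper below.
        return ["PERSON" if token.text.istitle() else "O" for token in sample.tokens]


# Usage (assuming the base class accepts these keyword arguments, as the CRF wrapper does):
# model = UppercaseHeuristicModel(entities_to_keep=["PERSON"])
```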

### Ready evaluators

@ -72,14 +72,14 @@ Allows you to evaluate an existing Presidio deployment through the API. [See thi

Allows you to evaluate the local Presidio-Analyzer package. Faster than the API option but requires you to have Presidio-Analyzer installed locally. [See this class for more information](presidio_evaluator/presidio_analyzer.py)
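Before running a full evaluation, a quick sanity check that the locally installed presidio-analyzer (Presidio 2) responds can look like the sketch below; the sample text is illustrative.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # loads the default spaCy-based NLP engine
results = analyzer.analyze(
    text="My name is David and my email is david@example.com", language="en"
)
for res in results:
    print(res.entity_type, res.start, res.end, res.score)
```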

#### 3. One recognizer evaluator
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/presidio_recognizer_evaluator.py)
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/models/presidio_recognizer_wrapper.py)


## 4. Modeling

### Conditional Random Fields
To train a CRF on a new dataset, see [this notebook](notebooks/models/CRF.ipynb).
To evaluate a CRF model, see the [same notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/crf_evaluator.py).
To evaluate a CRF model, see the [same notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
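For context, the sketch below shows how such a CRF is typically trained with sklearn-crfsuite on (token, POS, label) sentences, using a reduced version of the feature functions that appear in the CRF code later in this diff; the toy training data and output path are assumptions.

```python
import pickle

import sklearn_crfsuite


def token_features(sentence, i):
    """Minimal per-token features; the repo's CRF code uses a richer set."""
    word, postag, _ = sentence[i]
    return {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "postag": postag,
    }


# Toy CoNLL-style data: one sentence of (token, POS tag, label) tuples.
train_sentences = [
    [("My", "PRP$", "O"), ("name", "NN", "O"), ("is", "VBZ", "O"), ("David", "NNP", "U-PERSON")],
]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sentences]
y_train = [[label for _, _, label in s] for s in train_sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

with open("crf.pickle", "wb") as f:  # file name mirrors the default in the CRF code
    pickle.dump(crf, f)
```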

### spaCy based models
There are three ways of interacting with spaCy models:
@ -93,39 +93,6 @@ See [this notebook for creating spaCy datasets](notebooks/models/Create%20datase

#### Evaluate an existing trained model
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
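As a rough illustration of what that evaluation inspects, the sketch below runs a spaCy model over a synthetic dataset and prints predicted versus annotated entities. It assumes `read_synth_dataset` returns `InputSample` objects (as the sanity check in the data generator's main module suggests); the dataset path is hypothetical.

```python
import spacy

from presidio_evaluator.data_generator import read_synth_dataset

nlp = spacy.load("en_core_web_lg")
samples = read_synth_dataset("data/generated_test.json")  # hypothetical path

for sample in samples[:5]:
    doc = nlp(sample.full_text)
    print(sample.full_text)
    print("Annotated:", [(span.entity_type, span.entity_value) for span in sample.spans])
    print("Predicted:", [(ent.label_, ent.text) for ent in doc.ents])
```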

#### Train with pretrained embeddings
In order to train a new spaCy model from scratch with pretrained embeddings (FastText wiki news subword in this case), follow these three steps:

##### 1. Download FastText pretrained (sub) word embeddings
``` sh
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
```

##### 2. Init spaCy model with pre-trained embeddings
Using spaCy CLI:
``` sh
python -m spacy init-model en spacy_fasttext --vectors-loc wiki-news-300d-1M-subword.vec
```

##### 3. Train spaCy NER model
Using spaCy CLI:
``` sh
python -m spacy train en spacy_fasttext_100 train.json test.json --vectors spacy_fasttext --pipeline ner -n 100
```

#### Fine-tune an existing spaCy model
See [this code for retraining an existing spaCy model](models/spacy_retrain.py).
First, create pickle files for your train and test sets (see [this notebook](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb) for more information), then run a SpacyRetrainer:

```python
from models import SpacyRetrainer
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
                                 experiment_name='new_spacy_experiment',
                                 n_iter=500, dropout=0.1, aml_config=None)
spacy_retrainer.run()
```

### Flair based models
To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
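For example, the `embeddings` stack could be swapped for a different combination, as in the sketch below, which only uses the Flair classes already imported in flair_train.py; the specific embedding names are illustrative.

```python
from typing import List

from flair.embeddings import (
    FlairEmbeddings,
    StackedEmbeddings,
    TokenEmbeddings,
    WordEmbeddings,
)

# Replace the `embeddings` object in FlairTrainer.train with a different stack,
# e.g. GloVe word embeddings plus contextual Flair embeddings.
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
```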

VERSION (2 lines changed)
|
@ -1,2 +1,2 @@
|
|||
0.0
|
||||
0.0.2
|
||||
|
||||
|
|
|
@ -15,11 +15,12 @@ pool:
|
|||
vmImage: 'ubuntu-latest'
|
||||
strategy:
|
||||
matrix:
|
||||
Python36:
|
||||
python.version: '3.6'
|
||||
Python37:
|
||||
python.version: '3.7'
|
||||
|
||||
Python38:
|
||||
python.version: '3.8'
|
||||
Python39:
|
||||
python.version: '3.9'
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
|
|
|
@ -5481,7 +5481,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "SvenZimmer@fleckens.hu",
|
||||
"start_position": 39,
|
||||
"end_position": 61
|
||||
|
@ -9288,7 +9288,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "EmilySanderson@jourrapide.com",
|
||||
"start_position": 59,
|
||||
"end_position": 88
|
||||
|
@ -20492,7 +20492,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "NatalinaLucchese@superrito.com",
|
||||
"start_position": 59,
|
||||
"end_position": 89
|
||||
|
@ -25723,7 +25723,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "HannaUkkonen@dayrep.com",
|
||||
"start_position": 39,
|
||||
"end_position": 62
|
||||
|
@ -32783,7 +32783,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "yahyaeriksson@gustr.com",
|
||||
"start_position": 23,
|
||||
"end_position": 46
|
||||
|
@ -40833,7 +40833,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "VictorAndreyev@cuvox.de",
|
||||
"start_position": 23,
|
||||
"end_position": 46
|
||||
|
@ -44468,7 +44468,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "HarrisonBarnes@fleckens.hu",
|
||||
"start_position": 59,
|
||||
"end_position": 85
|
||||
|
@ -49165,7 +49165,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "MathiasEJespersen@armyspy.com",
|
||||
"start_position": 23,
|
||||
"end_position": 52
|
||||
|
@ -62644,7 +62644,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "ElishaFedorov@fleckens.hu",
|
||||
"start_position": 39,
|
||||
"end_position": 64
|
||||
|
@ -68659,7 +68659,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "HartmannAntonsson@jourrapide.com",
|
||||
"start_position": 59,
|
||||
"end_position": 91
|
||||
|
@ -72669,7 +72669,7 @@
|
|||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "EMAIL",
|
||||
"entity_type": "EMAIL_ADDRESS",
|
||||
"entity_value": "MakarMaslow@teleworm.us",
|
||||
"start_position": 39,
|
||||
"end_position": 62
|
||||
|
|
|
@ -1,10 +1,13 @@
|
|||
from typing import List
|
||||
|
||||
from flair.data import Corpus, Sentence
|
||||
from flair.datasets import ColumnCorpus
|
||||
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
|
||||
from flair.models import SequenceTagger
|
||||
from flair.trainers import ModelTrainer
|
||||
try:
|
||||
from flair.data import Corpus, Sentence
|
||||
from flair.datasets import ColumnCorpus
|
||||
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
|
||||
from flair.models import SequenceTagger
|
||||
from flair.trainers import ModelTrainer
|
||||
except ImportError:
|
||||
print("Flair is not installed")
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
|
|
@ -1,206 +0,0 @@
|
|||
import logging
|
||||
import pickle
|
||||
import random
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import spacy
|
||||
from azureml.core import Workspace, Experiment
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
from presidio_evaluator import SpacyEvaluator, InputSample
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
root = logging.getLogger()
|
||||
root.setLevel(logging.INFO)
|
||||
|
||||
handler = logging.StreamHandler(sys.stdout)
|
||||
handler.setLevel(logging.INFO)
|
||||
root.addHandler(handler)
|
||||
|
||||
|
||||
class SpacyRetrainer:
|
||||
|
||||
def __init__(self, original_model_name=None, experiment_name=None, n_iter=100, dropout=0.5,
|
||||
aml_config='config.json', output_dir='../../model-outputs', train_pickle='../data/train.pickle',
|
||||
test_pickle='../data/test.pickle'):
|
||||
self.experiment_name = experiment_name
|
||||
if aml_config:
|
||||
self.ws = Workspace.from_config(aml_config)
|
||||
self.experiment = Experiment(workspace=self.ws, name=experiment_name)
|
||||
self.aml_run = self.experiment.start_logging()
|
||||
self.has_aml = True
|
||||
else:
|
||||
self.has_aml = False
|
||||
|
||||
self.model = original_model_name
|
||||
self.n_iter = n_iter
|
||||
self.output_dir = output_dir
|
||||
self.train_file = train_pickle
|
||||
self.test_file = test_pickle
|
||||
self.dropout = dropout
|
||||
|
||||
def run(self):
|
||||
if self.has_aml:
|
||||
self.aml_run.log("model", self.model)
|
||||
self.aml_run.log("n_iter", self.n_iter)
|
||||
self.aml_run.log("train_file", self.train_file)
|
||||
self.aml_run.log("test_file", self.test_file)
|
||||
self.aml_run.log("dropout rate", self.dropout)
|
||||
model_path = self._train(self.model, self.output_dir, self.n_iter, self.train_file, self.experiment_name)
|
||||
self._score_validate(model_path, self.test_file)
|
||||
if self.has_aml:
|
||||
self.aml_run.complete()
|
||||
|
||||
def print_scores(self, split, evaluation_result):
|
||||
"""
|
||||
Logs results into experiment run.
|
||||
:param split: Name of this split. For ex 'train' or 'valid'
|
||||
:param evaluation_result: EvaluationResult containing various metrics
|
||||
:return: None. Writes to experiment runner and logs locally.
|
||||
"""
|
||||
logging.info('SPLIT: {0}. PII_precision: {1}, PII_recall: {2},'
|
||||
'Person_precision: {3}, Person_recall: {4}'. \
|
||||
format(split, evaluation_result.pii_precision, evaluation_result.pii_recall,
|
||||
evaluation_result.entity_precision_dict['PERSON'],
|
||||
evaluation_result.entity_recall_dict['PERSON']))
|
||||
if self.has_aml:
|
||||
self.aml_run.log('Precision', evaluation_result.pii_precision, split)
|
||||
self.aml_run.log('Recall', evaluation_result.pii_recall, split)
|
||||
|
||||
@staticmethod
|
||||
def _score(model, data):
|
||||
"""
|
||||
Score the model against the data
|
||||
:param model: Trained model
|
||||
:param data: Data split which is being scored.
|
||||
:return: An EvaluationResult containing various metrics
|
||||
"""
|
||||
|
||||
spacy_evaluator = SpacyEvaluator(model=model)
|
||||
|
||||
results = []
|
||||
for text, ground_truth_annotations in data:
|
||||
ground_truth_entities = ground_truth_annotations['entities']
|
||||
input_sample = InputSample.from_spacy(text, ground_truth_entities)
|
||||
results.append(spacy_evaluator.evaluate_sample(input_sample))
|
||||
|
||||
return spacy_evaluator.calculate_score(evaluation_results=results)
|
||||
|
||||
def _score_validate(self, model_path, test_data_file):
|
||||
"""
|
||||
Validation step for the model. Also prints the scores.
|
||||
:param model_path: Path to trained model.
|
||||
:param test_data_file: Data file which has the dataset for this split.
|
||||
:return: None. Prints the scores.
|
||||
"""
|
||||
with open(test_data_file, 'rb') as f:
|
||||
valid_data = pickle.load(f)
|
||||
nlp = spacy.load(model_path)
|
||||
self.print_scores('Valid', self._score(nlp, valid_data))
|
||||
|
||||
# @plac.annotations(
|
||||
# model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
# output_dir=("Optional output directory", "option", "o", Path),
|
||||
# n_iter=("Number of training iterations", "option", "n", int),
|
||||
# train_file=("File containing pickled training Spacy NER formatted data", "option", "d", Path),
|
||||
# test_file=("File containing pickled test Spacy NER formatted data", "option", "d", Path),
|
||||
# exp_name=("Name of this experiment", "option", "e")
|
||||
# )
|
||||
|
||||
def _train(self, model, output_dir, n_iter, train_file, exp_name):
|
||||
"""Load the model, set up the pipeline and train the entity recognizer."""
|
||||
nlp = self.load_or_create_empty_model(model)
|
||||
|
||||
if "ner" not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner, last=True)
|
||||
else:
|
||||
ner = nlp.get_pipe("ner")
|
||||
|
||||
with open(train_file, 'rb') as f:
|
||||
train_data = pickle.load(f)
|
||||
|
||||
# DEBUG
|
||||
train_data = train_data[:50]
|
||||
|
||||
# add labels
|
||||
for _, annotations in train_data:
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
# reset and initialize the weights randomly – but only if we're
|
||||
# training a new model
|
||||
if model is None:
|
||||
nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(train_data)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, drop=self.dropout, losses=losses, )
|
||||
logging.debug("Losses", losses)
|
||||
if self.has_aml:
|
||||
self.aml_run.log('Losses', losses['ner'])
|
||||
self.print_scores('Itn {}'.format(itn), self._score(nlp, train_data))
|
||||
|
||||
self.print_scores('Train', self._score(nlp, train_data))
|
||||
|
||||
saved_model_path = self.save_model(exp_name, nlp, output_dir)
|
||||
return saved_model_path
|
||||
|
||||
@staticmethod
|
||||
def save_model(exp_name, model, output_dir):
|
||||
"""
|
||||
Saves model to disk for later use.
|
||||
:param exp_name: Name of the running experiment. This is used as folder name for storing the model.
|
||||
:param model: Model being saved
|
||||
:param output_dir: Directory where to save the model.
|
||||
:return: Full path to saved model.
|
||||
"""
|
||||
saved_model_path = Path(output_dir, exp_name)
|
||||
if not saved_model_path.exists():
|
||||
saved_model_path.mkdir(parents=True)
|
||||
model.to_disk(saved_model_path)
|
||||
logging.info("Saved model to {}".format(output_dir))
|
||||
return saved_model_path
|
||||
|
||||
@staticmethod
|
||||
def load_model(exp_name, model_dir):
|
||||
"""
|
||||
Loads a spacy model from disk
|
||||
|
||||
:param exp_name: Name of experiment under which the model was saved
|
||||
:param model_dir: path to saved model
|
||||
:return: spacy model
|
||||
"""
|
||||
saved_model_path = Path(model_dir, exp_name)
|
||||
return spacy.load(saved_model_path)
|
||||
|
||||
@staticmethod
|
||||
def load_or_create_empty_model(model=None):
|
||||
"""
|
||||
Loads a given model or creates a blank english model.
|
||||
:param model: Optional Model to load.
|
||||
:return: Loaded or blank model.
|
||||
"""
|
||||
if model:
|
||||
nlp = spacy.load(model)
|
||||
logging.debug("Loaded model {}".format(model))
|
||||
else:
|
||||
nlp = spacy.blank("en")
|
||||
logging.debug("Created blank 'en' model")
|
||||
return nlp
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
|
||||
experiment_name='spacy_new_ontonotes28',
|
||||
n_iter=500, dropout=0.5, aml_config=None)
|
||||
spacy_retrainer.run()
|
|
@ -1,6 +1,21 @@
|
|||
from .span_to_tag import span_to_tag, tokenize
|
||||
from .data_objects import Span, InputSample, EvaluationResult, ModelError
|
||||
from .model_evaluator import ModelEvaluator
|
||||
from .spacy_evaluator import SpacyEvaluator
|
||||
from .presidio_api_evaluator import PresidioAPIEvaluator
|
||||
from .presidio_analyzer_evaluator import PresidioAnalyzerEvaluator
|
||||
from .data_objects import Span, InputSample
|
||||
from .validation import (
|
||||
split_dataset,
|
||||
split_by_template,
|
||||
get_samples_by_pattern,
|
||||
group_by_template,
|
||||
save_to_json,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"span_to_tag",
|
||||
"tokenize",
|
||||
"Span",
|
||||
"InputSample",
|
||||
"split_dataset",
|
||||
"split_by_template",
|
||||
"get_samples_by_pattern",
|
||||
"group_by_template",
|
||||
"save_to_json",
|
||||
]
|
||||
|
|
|
@ -1,97 +0,0 @@
|
|||
import pickle
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
|
||||
|
||||
class CRFEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model_pickle_path: str = "../models/crf.pickle",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True):
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model_pickle_path is None:
|
||||
raise ValueError("model_pickle_path must be supplied")
|
||||
|
||||
with open(model_pickle_path, 'rb') as f:
|
||||
self.model = pickle.load(f)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
tags = CRFEvaluator.crf_predict(sample,self.model)
|
||||
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
# translated_tags = sample.rename_from_spacy_tags(tags)
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def crf_predict(sample, model):
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
conll = sample.to_conll(translate_tags=True)
|
||||
sentence = [(di['text'], di['pos'], di['label']) for di in conll]
|
||||
features = CRFEvaluator.sent2features(sentence)
|
||||
return model.predict([features])[0]
|
||||
|
||||
@staticmethod
|
||||
def word2features(sent, i):
|
||||
word = sent[i][0]
|
||||
postag = sent[i][1]
|
||||
|
||||
features = {
|
||||
'bias': 1.0,
|
||||
'word.lower()': word.lower(),
|
||||
'word[-3:]': word[-3:],
|
||||
'word[-2:]': word[-2:],
|
||||
'word.isupper()': word.isupper(),
|
||||
'word.istitle()': word.istitle(),
|
||||
'word.isdigit()': word.isdigit(),
|
||||
'postag': postag,
|
||||
'postag[:2]': postag[:2],
|
||||
}
|
||||
if i > 0:
|
||||
word1 = sent[i - 1][0]
|
||||
postag1 = sent[i - 1][1]
|
||||
features.update({
|
||||
'-1:word.lower()': word1.lower(),
|
||||
'-1:word.istitle()': word1.istitle(),
|
||||
'-1:word.isupper()': word1.isupper(),
|
||||
'-1:postag': postag1,
|
||||
'-1:postag[:2]': postag1[:2],
|
||||
})
|
||||
else:
|
||||
features['BOS'] = True
|
||||
|
||||
if i < len(sent) - 1:
|
||||
word1 = sent[i + 1][0]
|
||||
postag1 = sent[i + 1][1]
|
||||
features.update({
|
||||
'+1:word.lower()': word1.lower(),
|
||||
'+1:word.istitle()': word1.istitle(),
|
||||
'+1:word.isupper()': word1.isupper(),
|
||||
'+1:postag': postag1,
|
||||
'+1:postag[:2]': postag1[:2],
|
||||
})
|
||||
else:
|
||||
features['EOS'] = True
|
||||
|
||||
return features
|
||||
|
||||
@staticmethod
|
||||
def sent2features(sent):
|
||||
return [CRFEvaluator.word2features(sent, i) for i in range(len(sent))]
|
||||
|
||||
@staticmethod
|
||||
def sent2labels(sent):
|
||||
return [label for token, postag, label in sent]
|
||||
|
||||
@staticmethod
|
||||
def sent2tokens(sent):
|
||||
return [token for token, postag, label in sent]
|
|
@ -1,16 +1,27 @@
|
|||
# PII dataset generator
|
||||
This data generator takes a text file with templates (e.g. `my name is [PERSON]`) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
|
||||
It also creates Spans (start and end of each entity), tokens (`spaCy` tokenizer) and tags in various schemas (BIO/IOB, IO, BILOU)
|
||||
In addition it provides some off-the-shelf features on each token, like `pos`, `dep` and `is_in_vocabulary`
|
||||
This data generator takes a text file with templates (e.g. `my name is [PERSON]`)
and creates a list of InputSamples which contain fake PII entities
instead of placeholders.
It also creates Spans (start and end of each entity), tokens (`spaCy` tokenizer)
and tags in various schemas (BIO/IOB, IO, BILOU).
In addition, it provides some off-the-shelf features on each token,
such as `pos`, `dep` and `is_in_vocabulary`.
|
||||
|
||||
The main class is `FakeDataGenerator` however the `main` module has two functions for creating and reading a fake dataset.
|
||||
During the generation process, the tool either takes fake PII from a provided CSV with a known format, and/or from extension functions which can be found in the extensions.py file.
|
||||
The main class is `FakeDataGenerator`; however, the `main` module has two functions
for creating and reading a fake dataset.
During the generation process, the tool takes fake PII from a provided CSV with
a known format and/or from extension functions which can be found
in the extensions.py file.
|
||||
|
||||
The process in high level is the following:
|
||||
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
|
||||
2. (Optional) adapt the FakeDataGenerator to support new extensions which could generate fake PII entities
|
||||
3. Generate X samples using the templates list + a fake PII dataset + extensions that add additional PII entities
|
||||
4. Split the generated dataset to train/test/validation while making sure that samples from the same template would only appear in one set
|
||||
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]` (see the sketch after this list)
|
||||
2. (Optional) adapt the FakeDataGenerator to support new extensions
|
||||
which could generate fake PII entities
|
||||
3. Generate X samples using the templates list + a fake PII dataset +
|
||||
extensions that add additional PII entities
|
||||
4. Split the generated dataset to train/test/validation while making sure
|
||||
that samples from the same template would only appear in one set
|
||||
5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
|
||||
6. Train models
|
||||
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
|
||||
|
@ -19,12 +30,15 @@ The process in high level is the following:
|
|||
|
||||
Notes:
|
||||
- For steps 5, 6, 7 see the main [README](../../README.md).
|
||||
- For a simple data generation pipeline, [see this notebook](../../notebooks/Generate data.ipynb).
|
||||
- For information on transforming a NER dataset into templates, see the notebooks in the [helper notebooks](helper%20notebooks) folder.
|
||||
- For a simple data generation pipeline,
|
||||
[see this notebook](../../notebooks/Generate data.ipynb).
|
||||
- For information on transforming a NER dataset into templates,
|
||||
see the notebooks in the [helper notebooks](helper%20notebooks) folder.
|
||||
|
||||
Example run:
|
||||
|
||||
```python
|
||||
from presidio_evaluator.data_generator import generate
|
||||
TEMPLATES_FILE = 'raw_data/templates.txt'
|
||||
OUTPUT = "generated_.txt"
|
||||
|
||||
|
@ -45,4 +59,7 @@ examples = generate(fake_pii_csv=fake_pii_csv,
|
|||
|
||||
*Copyright notice:*
|
||||
|
||||
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
|
||||
Fake Name Generator identities by the Fake Name Generator are licensed under a
|
||||
Creative Commons Attribution-Share Alike 3.0 United States License.
|
||||
Fake Name Generator and the Fake Name Generator logo
|
||||
are trademarks of Corban Works, LLC.
|
|
@ -1,8 +1,7 @@
|
|||
import random
|
||||
from typing import List, Optional
|
||||
|
||||
import re
|
||||
from collections import Counter
|
||||
from typing import List, Optional, Dict
|
||||
|
||||
import pandas as pd
|
||||
from spacy.tokens import Token
|
||||
|
@ -40,30 +39,30 @@ class FakeDataGenerator:
|
|||
labeling_scheme="BILOU",
|
||||
):
|
||||
"""
|
||||
Fake data generator.
|
||||
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
|
||||
e.g. "My name is [FIRST_NAME]"
|
||||
:param fake_pii_df:
|
||||
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
|
||||
:param templates: A list of templates
|
||||
with place holders for PII entities.
|
||||
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
|
||||
Note that in case you have multiple entities of the same type
|
||||
in a template, you should put a number on the second. For example:
|
||||
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
|
||||
More than two are currently not supported but extending this
|
||||
is straightforward.
|
||||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param include_metadata: Whether to include additional
|
||||
information in the output
|
||||
(e.g. NameSet from which the name was taken, gender, country etc.)
|
||||
:param dictionary_path: A path to a csv containing a vocabulary of
|
||||
a language, to check if a token exists in the vocabulary or not.
|
||||
:param ignore_types: set of types to ignore
|
||||
:param span_to_tag: whether to tokenize the generated samples or not
|
||||
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
|
||||
"""
|
||||
Fake data generator.
|
||||
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
|
||||
e.g. "My name is [FIRST_NAME]"
|
||||
:param fake_pii_df:
|
||||
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
|
||||
:param templates: A list of templates
|
||||
with place holders for PII entities.
|
||||
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
|
||||
Note that in case you have multiple entities of the same type
|
||||
in a template, you should put a number on the second. For example:
|
||||
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
|
||||
More than two are currently not supported but extending this
|
||||
is straightforward.
|
||||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param include_metadata: Whether to include additional
|
||||
information in the output
|
||||
(e.g. NameSet from which the name was taken, gender, country etc.)
|
||||
:param dictionary_path: A path to a csv containing a vocabulary of
|
||||
a language, to check if a token exists in the vocabulary or not.
|
||||
:param ignore_types: set of types to ignore
|
||||
:param span_to_tag: whether to tokenize the generated samples or not
|
||||
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
|
||||
"""
|
||||
if ignore_types is None:
|
||||
ignore_types = {}
|
||||
self.lower_case_ratio = lower_case_ratio
|
||||
|
@ -110,7 +109,7 @@ class FakeDataGenerator:
|
|||
"TelephoneNumber": "PHONE_NUMBER",
|
||||
"CCNumber": "CREDIT_CARD",
|
||||
"Birthday": "BIRTHDAY",
|
||||
"EmailAddress": "EMAIL",
|
||||
"EmailAddress": "EMAIL_ADDRESS",
|
||||
"StreetAddress": "FULL_ADDRESS",
|
||||
"Domain": "DOMAIN_NAME",
|
||||
"NameSet": "NAMESET",
|
||||
|
@ -143,9 +142,9 @@ class FakeDataGenerator:
|
|||
) # replace previous country which has limited options
|
||||
|
||||
# Copied entities
|
||||
if "DATE" not in self.ignore_types:
|
||||
if "DATE_TIME" not in self.ignore_types:
|
||||
if "BIRTHDAY" in df:
|
||||
df["DATE"] = df["BIRTHDAY"]
|
||||
df["DATE_TIME"] = df["BIRTHDAY"]
|
||||
else:
|
||||
print("DATE is taken from the BIRTHDAY column which is missing")
|
||||
|
||||
|
@ -165,7 +164,9 @@ class FakeDataGenerator:
|
|||
if "TITLE" not in self.ignore_types:
|
||||
print("Generating titles")
|
||||
if "GENDER" not in df:
|
||||
print("Cannot generate title without a GENDER column. Generating FEMALE_TITLE and MALE_TITLE")
|
||||
print(
|
||||
"Cannot generate title without a GENDER column. Generating FEMALE_TITLE and MALE_TITLE"
|
||||
)
|
||||
else:
|
||||
df["TITLE"] = generate_titles(df["GENDER"])
|
||||
df["FEMALE_TITLE"] = [generate_title("female") for _ in range(len(df))]
|
||||
|
@ -275,7 +276,9 @@ class FakeDataGenerator:
|
|||
|
||||
return template, templates, entities_count
|
||||
|
||||
def sample_examples(self, count, genders:List[str]=None, namesets:List[str]=None):
|
||||
def sample_examples(
|
||||
self, count, genders: List[str] = None, namesets: List[str] = None
|
||||
):
|
||||
|
||||
if self.fake_pii is None:
|
||||
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
|
||||
|
@ -305,9 +308,7 @@ class FakeDataGenerator:
|
|||
values[h] = str(fake_pii_sample_duplicated[h])
|
||||
else:
|
||||
print(
|
||||
"Warning: entity {} is in the templates but not in the PII dataset. Ignoring.".format(
|
||||
h
|
||||
)
|
||||
f"Warning: entity {h} is in the templates but not in the PII dataset. Ignoring."
|
||||
)
|
||||
values[h] = ""
|
||||
|
||||
|
@ -335,7 +336,7 @@ class FakeDataGenerator:
|
|||
yield input_sample
|
||||
|
||||
@staticmethod
|
||||
def _consolidate_names(input_sample):
|
||||
def _consolidate_names(input_sample: InputSample):
|
||||
locations = ("LOCATION", "CITY", "STATE", "COUNTRY", "ADDRESS", "STREET")
|
||||
names = ("FIRST_NAME", "LAST_NAME", "PERSON")
|
||||
|
||||
|
@ -353,7 +354,9 @@ class FakeDataGenerator:
|
|||
|
||||
input_sample.masked = masked
|
||||
|
||||
def _create_input_sample(self, original_sentence, values):
|
||||
def _create_input_sample(
|
||||
self, original_sentence: str, values: Dict[str, str]
|
||||
) -> InputSample:
|
||||
"""
|
||||
Creates an InputSample out of a template sentence
|
||||
and a dict of entity names and values
|
||||
|
@ -417,7 +420,10 @@ class FakeDataGenerator:
|
|||
|
||||
# Not creating tokens here since we're consolidating names afterwards
|
||||
return InputSample(
|
||||
sentence, original_sentence, spans, create_tags_from_span=False
|
||||
full_text=sentence,
|
||||
spans=spans,
|
||||
masked=original_sentence,
|
||||
create_tags_from_span=False,
|
||||
)
|
||||
|
||||
def _add_duplicated_entities(self, fake_pii_sample, entity_counts):
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
import datetime
|
||||
import json
|
||||
import warnings
|
||||
|
||||
import pandas as pd
|
||||
|
||||
|
@ -12,14 +13,16 @@ def read_utterances(utterances_file):
|
|||
return f.readlines()
|
||||
|
||||
|
||||
def generate(fake_pii_csv,
|
||||
utterances_file,
|
||||
output_file=None,
|
||||
num_of_examples=1000,
|
||||
dictionary_path=None,
|
||||
store_masked_text=False,
|
||||
keep_only_tagged=False,
|
||||
**kwargs):
|
||||
def generate(
|
||||
fake_pii_csv,
|
||||
utterances_file,
|
||||
output_file=None,
|
||||
num_of_examples=1000,
|
||||
dictionary_path=None,
|
||||
store_masked_text=False,
|
||||
keep_only_tagged=False,
|
||||
**kwargs
|
||||
):
|
||||
"""
|
||||
|
||||
:param fake_pii_csv: csv containing fake PII
|
||||
|
@ -34,18 +37,18 @@ def generate(fake_pii_csv,
|
|||
"""
|
||||
|
||||
if not output_file:
|
||||
raise ValueError("Please provide an output file path")
|
||||
warnings.warn("Warning: no output_file value provided.")
|
||||
|
||||
templates = read_utterances(utterances_file)
|
||||
|
||||
if keep_only_tagged:
|
||||
templates = [template for template in templates if "[" in template]
|
||||
|
||||
df = pd.read_csv(fake_pii_csv, encoding='utf-8')
|
||||
df = pd.read_csv(fake_pii_csv, encoding="utf-8")
|
||||
|
||||
generator = FakeDataGenerator(fake_pii_df=df,
|
||||
dictionary_path=dictionary_path,
|
||||
templates=templates, **kwargs)
|
||||
generator = FakeDataGenerator(
|
||||
fake_pii_df=df, dictionary_path=dictionary_path, templates=templates, **kwargs
|
||||
)
|
||||
counter = 0
|
||||
|
||||
examples = []
|
||||
|
@ -56,7 +59,7 @@ def generate(fake_pii_csv,
|
|||
|
||||
examples_json = [example.to_dict() for example in examples]
|
||||
|
||||
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
|
||||
with open("{}".format(output_file), "w+", encoding="utf-8") as f:
|
||||
json.dump(examples_json, f, ensure_ascii=False, indent=4)
|
||||
|
||||
print("generated {} examples".format(len(examples)))
|
||||
|
@ -67,6 +70,7 @@ def generate(fake_pii_csv,
|
|||
|
||||
def read_synth_dataset(filepath=None, length=None):
|
||||
import json
|
||||
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
dataset = json.load(f)
|
||||
|
||||
|
@ -84,28 +88,32 @@ if __name__ == "__main__":
|
|||
EXAMPLES = 30
|
||||
PII_FILE_SIZE = 3000
|
||||
SPAN_TO_TAG = True
|
||||
TEMPLATES_FILE = 'raw_data/templates.txt'
|
||||
TEMPLATES_FILE = "raw_data/templates.txt"
|
||||
KEEP_ONLY_TAGGED = False
|
||||
LOWER_CASE_RATIO = 0.1
|
||||
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}
|
||||
IGNORE_TYPES = {"IP_ADDRESS", "US_SSN", "URL"}
|
||||
|
||||
cur_time = datetime.date.today().strftime("%B %d %Y")
|
||||
OUTPUT = "generated_size_{}_date_{}.txt".format(EXAMPLES, cur_time)
|
||||
|
||||
fake_pii_csv = '../../presidio_evaluator/data_generator/' \
|
||||
'raw_data/FakeNameGenerator.com_{}.csv'.format(PII_FILE_SIZE)
|
||||
fake_pii_csv = (
|
||||
"../../presidio_evaluator/data_generator/"
|
||||
"raw_data/FakeNameGenerator.com_{}.csv".format(PII_FILE_SIZE)
|
||||
)
|
||||
utterances_file = TEMPLATES_FILE
|
||||
dictionary_path = None
|
||||
|
||||
examples = generate(fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=utterances_file,
|
||||
dictionary_path=dictionary_path,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=LOWER_CASE_RATIO,
|
||||
num_of_examples=EXAMPLES,
|
||||
ignore_types=IGNORE_TYPES,
|
||||
keep_only_tagged=KEEP_ONLY_TAGGED,
|
||||
span_to_tag=SPAN_TO_TAG)
|
||||
examples = generate(
|
||||
fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=utterances_file,
|
||||
dictionary_path=dictionary_path,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=LOWER_CASE_RATIO,
|
||||
num_of_examples=EXAMPLES,
|
||||
ignore_types=IGNORE_TYPES,
|
||||
keep_only_tagged=KEEP_ONLY_TAGGED,
|
||||
span_to_tag=SPAN_TO_TAG,
|
||||
)
|
||||
|
||||
# sanity
|
||||
input_samples = read_synth_dataset(OUTPUT)
|
||||
|
|
|
@ -15,24 +15,34 @@ class NationalityGenerator:
|
|||
|
||||
def get_country(self):
|
||||
## [COUNTRY]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['country'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["country"].values)
|
||||
)
|
||||
|
||||
def get_nationality(self):
|
||||
## [NATIONALITY]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['nationality'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["nationality"].values)
|
||||
)
|
||||
|
||||
def get_nation_woman(self):
|
||||
## [NATION_WOMAN]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['woman'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["woman"].values)
|
||||
)
|
||||
|
||||
def get_nation_man(self):
|
||||
## [NATION_MAN]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['man'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["man"].values)
|
||||
)
|
||||
|
||||
def get_nation_plural(self):
|
||||
## [NATION_PLURAL]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['plural'].values))
|
||||
return NationalityGenerator.capitalizeWords(
|
||||
random.choice(self.df["plural"].values)
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def capitalizeWords(s):
|
||||
return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)
|
||||
return re.sub(r"\w+", lambda m: m.group(0).capitalize(), s)
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
from typing import List, Set, Dict
|
||||
|
||||
from presidio_analyzer import RecognizerResult
|
||||
from presidio_anonymizer import AnonymizerEngine
|
||||
|
||||
from presidio_evaluator.data_generator import FakeDataGenerator
|
||||
|
||||
|
@ -13,7 +14,6 @@ class PresidioPerturb(FakeDataGenerator):
|
|||
fake_pii_df: pd.DataFrame,
|
||||
lower_case_ratio: float = 0.0,
|
||||
ignore_types: Set[str] = None,
|
||||
entity_dict: Dict[str, str] = None,
|
||||
):
|
||||
super().__init__(
|
||||
fake_pii_df=fake_pii_df,
|
||||
|
@ -29,12 +29,9 @@ class PresidioPerturb(FakeDataGenerator):
|
|||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param ignore_types: set of types to ignore
|
||||
:param entity_dict: Dictionary with mapping of entity names between Presidio and the fake_pii_df.
|
||||
For example, {"EMAIL_ADDRESS": "EMAIL"}
|
||||
"""
|
||||
|
||||
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
|
||||
self.entity_dict = entity_dict
|
||||
|
||||
def perturb(
|
||||
self,
|
||||
|
@ -56,19 +53,14 @@ class PresidioPerturb(FakeDataGenerator):
|
|||
|
||||
presidio_response = sorted(presidio_response, key=lambda resp: resp.start)
|
||||
|
||||
delta = 0
|
||||
text = original_text
|
||||
for resp in presidio_response:
|
||||
start = resp.start + delta
|
||||
end = resp.end + delta
|
||||
entity_text = original_text[start:end]
|
||||
entity_type = resp.entity_type
|
||||
if self.entity_dict:
|
||||
if entity_type in self.entity_dict:
|
||||
entity_type = self.entity_dict[entity_type]
|
||||
anonymizer_engine = AnonymizerEngine()
|
||||
anonymized_result = anonymizer_engine.anonymize(
|
||||
text=original_text, analyzer_results=presidio_response
|
||||
)
|
||||
|
||||
text = anonymized_result.text
|
||||
text = text.replace(">", "}").replace("<", "{")
|
||||
|
||||
text = f"{text[:start]}{{{entity_type}}}{text[end:]}"
|
||||
delta = len(entity_type) + 2 - len(entity_text)
|
||||
self.templates = [text]
|
||||
return [
|
||||
sample.full_text
|
||||
|
|
|
@ -24,11 +24,11 @@ I will be travelling to [COUNTRY] next week, so I need my passport to be ready b
|
|||
Who's coming to [COUNTRY] with me?
|
||||
[COUNTRY] was super fun to visit!
|
||||
Could you please email me the statement for laste month , my credit card number is [CREDIT_CARD]?
|
||||
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL]?
|
||||
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL_ADDRESS]?
|
||||
How do I change my address to [ADDRESS] for post mail?
|
||||
My name appears incorrectly on credit card statement could you please correct it to [TITLE] [PERSON]?
|
||||
card number [CREDIT_CARD] is lost, can you please send a new one to [ADDRESS] i am in [CITY] for a business trip
|
||||
Please transfer all funds from my account to this hackers' [EMAIL]
|
||||
Please transfer all funds from my account to this hackers' [EMAIL_ADDRESS]
|
||||
I can't browse to your site, keep getting address [IP_ADDRESS] blocked error
|
||||
My religion does not allow speaking to bots, they are evil and hacked by the Devil
|
||||
Excuse me, Sir bot, but I really don't like this tone
|
||||
|
@ -47,7 +47,7 @@ I would like to remove my kid [FIRST_NAME] from the will. How do I do that?
|
|||
The name in the account is not correct, please change it to [PERSON]
|
||||
Hello I moved, please update my new address is [ADDRESS]
|
||||
I need to add addresses, here they are: [ADDRESS], [ADDRESS]
|
||||
Please send my portfolio to this email [EMAIL]
|
||||
Please send my portfolio to this email [EMAIL_ADDRESS]
|
||||
Hello, this is [TITLE] [PERSON]. Who are you?
|
||||
I want to add [PERSON] as a beneficiary to my account
|
||||
I want to cancel my card [CREDIT_CARD] because I lost it
|
||||
|
@ -58,11 +58,11 @@ My nam is [FIRST_NAME]
|
|||
I'm moving out of the country, so please cancel my subscription
|
||||
My name is [PERSON] but everyone calls me [FIRST_NAME]
|
||||
Please tell me your date of birth. It's [BIRTHDAY]
|
||||
You said your email is [EMAIL]. Is that correct?
|
||||
You said your email is [EMAIL_ADDRESS]. Is that correct?
|
||||
I once lived in [ADDRESS]. I now live in [ADDRESS]
|
||||
I'd like to order a taxi to [ADDRESS]
|
||||
Please charge my credit card. Number is [CREDIT_CARD]
|
||||
What's your email? [EMAIL]
|
||||
What's your email? [EMAIL_ADDRESS]
|
||||
What's your credit card? [CREDIT_CARD]
|
||||
What's your name? [PERSON]
|
||||
What's your last name? [LAST_NAME]
|
||||
|
|
|
@ -1,8 +1,9 @@
|
|||
from typing import List, Counter, Dict
|
||||
from typing import List, Optional
|
||||
|
||||
import spacy
|
||||
import srsly
|
||||
from spacy.tokens import Token
|
||||
from spacy.training import docs_to_json
|
||||
from tqdm import tqdm
|
||||
|
||||
from presidio_evaluator import span_to_tag, tokenize
|
||||
|
@ -15,7 +16,7 @@ SPACY_PRESIDIO_ENTITIES = {
|
|||
"FAC": "LOCATION",
|
||||
"PERSON": "PERSON",
|
||||
"LOCATION": "LOCATION",
|
||||
"ORGANIZATION": "ORGANIZATION"
|
||||
"ORGANIZATION": "ORGANIZATION",
|
||||
}
|
||||
PRESIDIO_SPACY_ENTITIES = {
|
||||
"ORGANIZATION": "ORG",
|
||||
|
@ -55,8 +56,10 @@ class Span:
|
|||
"""
|
||||
|
||||
# if they do not overlap the intersection is 0
|
||||
if self.end_position < other.start_position or other.end_position < \
|
||||
self.start_position:
|
||||
if (
|
||||
self.end_position < other.start_position
|
||||
or other.end_position < self.start_position
|
||||
):
|
||||
return 0
|
||||
|
||||
# if we are accounting for entity type a diff type means intersection 0
|
||||
|
@ -65,25 +68,38 @@ class Span:
|
|||
|
||||
# otherwise the intersection is min(end) - max(start)
|
||||
return min(self.end_position, other.end_position) - max(
|
||||
self.start_position,
|
||||
other.start_position)
|
||||
self.start_position, other.start_position
|
||||
)
|
||||
|
||||
def __repr__(self):
|
||||
return "Type: {}, value: {}, start: {}, end: {}".format(
|
||||
self.entity_type, self.entity_value, self.start_position,
|
||||
self.end_position)
|
||||
return (
|
||||
f"Type: {self.entity_type}, "
|
||||
f"value: {self.entity_value}, "
|
||||
f"start: {self.start_position}, "
|
||||
f"end: {self.end_position}"
|
||||
)
|
||||
|
||||
def __eq__(self, other):
|
||||
return self.entity_type == other.entity_type \
|
||||
and self.entity_value == other.entity_value \
|
||||
and self.start_position == other.start_position \
|
||||
and self.end_position == other.end_position
|
||||
return (
|
||||
self.entity_type == other.entity_type
|
||||
and self.entity_value == other.entity_value
|
||||
and self.start_position == other.start_position
|
||||
and self.end_position == other.end_position
|
||||
)
|
||||
|
||||
def __hash__(self):
|
||||
return hash(('entity_type', self.entity_type,
|
||||
'entity_value', self.entity_value,
|
||||
'start_position', self.start_position,
|
||||
'end_position', self.end_position))
|
||||
return hash(
|
||||
(
|
||||
"entity_type",
|
||||
self.entity_type,
|
||||
"entity_value",
|
||||
self.entity_value,
|
||||
"start_position",
|
||||
self.start_position,
|
||||
"end_position",
|
||||
self.end_position,
|
||||
)
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
|
@ -108,12 +124,17 @@ class SimpleToken(object):
|
|||
A class mimicking the Spacy Token class, for serialization purposes
|
||||
"""
|
||||
|
||||
def __init__(self, text, idx, tag_=None,
|
||||
pos_=None,
|
||||
dep_=None,
|
||||
lemma_=None,
|
||||
spacy_extensions: SimpleSpacyExtensions = None,
|
||||
**kwargs):
|
||||
def __init__(
|
||||
self,
|
||||
text,
|
||||
idx,
|
||||
tag_=None,
|
||||
pos_=None,
|
||||
dep_=None,
|
||||
lemma_=None,
|
||||
spacy_extensions: SimpleSpacyExtensions = None,
|
||||
**kwargs,
|
||||
):
|
||||
self.text = text
|
||||
self.idx = idx
|
||||
self.tag_ = tag_
|
||||
|
@ -145,13 +166,15 @@ class SimpleToken(object):
|
|||
else:
|
||||
spacy_extensions = None
|
||||
|
||||
return cls(text=token.text,
|
||||
idx=token.idx,
|
||||
tag_=token.tag_,
|
||||
pos_=token.pos_,
|
||||
dep_=token.dep_,
|
||||
lemma_=token.lemma_,
|
||||
spacy_extensions=spacy_extensions)
|
||||
return cls(
|
||||
text=token.text,
|
||||
idx=token.idx,
|
||||
tag_=token.tag_,
|
||||
pos_=token.pos_,
|
||||
dep_=token.dep_,
|
||||
lemma_=token.lemma_,
|
||||
spacy_extensions=spacy_extensions,
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
return {
|
||||
|
@ -161,7 +184,7 @@ class SimpleToken(object):
|
|||
"pos_": self.pos_,
|
||||
"dep_": self.dep_,
|
||||
"lemma_": self.lemma_,
|
||||
"_": self._.to_dict()
|
||||
"_": self._.to_dict(),
|
||||
}
|
||||
|
||||
def __repr__(self):
|
||||
|
@ -170,21 +193,27 @@ class SimpleToken(object):
|
|||
@classmethod
|
||||
def from_json(cls, data):
|
||||
|
||||
if '_' in data:
|
||||
data['spacy_extensions'] = \
|
||||
SimpleSpacyExtensions(**data['_'])
|
||||
if "_" in data:
|
||||
data["spacy_extensions"] = SimpleSpacyExtensions(**data["_"])
|
||||
return cls(**data)
|
||||
|
||||
|
||||
class InputSample(object):
|
||||
|
||||
def __init__(self, full_text: str, masked: str, spans: List[Span],
|
||||
tokens=[], tags=[],
|
||||
create_tags_from_span=True, scheme="IO", metadata=None, template_id=None):
|
||||
def __init__(
|
||||
self,
|
||||
full_text: str,
|
||||
spans: Optional[List[Span]] = None,
|
||||
masked: Optional[str] = None,
|
||||
tokens: Optional[List[SimpleToken]] = None,
|
||||
tags: Optional[List[str]] = None,
|
||||
create_tags_from_span=True,
|
||||
scheme="IO",
|
||||
metadata=None,
|
||||
template_id=None,
|
||||
):
|
||||
"""
|
||||
Holds all the information needed for evaluation in the
|
||||
Hold all the information needed for evaluation in the
|
||||
presidio-evaluator framework.
|
||||
Can generate tags (BIO/BILOU/IO) based on spans
|
||||
|
||||
:param full_text: The raw text of this sample
|
||||
:param masked: Masked version of the raw text (desired output)
|
||||
|
@ -198,6 +227,10 @@ class InputSample(object):
|
|||
in the English (or other language) vocabulary
|
||||
:param template_id: Original template (utterance) of sample, in case it was generated
|
||||
"""
|
||||
if tags is None:
|
||||
tags = []
|
||||
if tokens is None:
|
||||
tokens = []
|
||||
self.full_text = full_text
|
||||
self.masked = masked
|
||||
self.spans = spans if spans else []
|
||||
|
@ -218,11 +251,12 @@ class InputSample(object):
|
|||
self.tags = tags
|
||||
|
||||
def __repr__(self):
|
||||
return "Full text: {}\n" \
|
||||
"Spans: {}\n" \
|
||||
"Tokens: {}\n" \
|
||||
"Tags: {}\n".format(self.full_text, self.spans, self.tokens,
|
||||
self.tags)
|
||||
return (
|
||||
f"Full text: {self.full_text}\n"
|
||||
f"Spans: {self.spans}\n"
|
||||
f"Tokens: {self.tokens}\n"
|
||||
f"Tags: {self.tags}\n"
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
|
||||
|
@ -230,20 +264,20 @@ class InputSample(object):
|
|||
"full_text": self.full_text,
|
||||
"masked": self.masked,
|
||||
"spans": [span.__dict__ for span in self.spans],
|
||||
"tokens": [SimpleToken.from_spacy_token(token).to_dict()
|
||||
for token in self.tokens],
|
||||
"tokens": [
|
||||
SimpleToken.from_spacy_token(token).to_dict() for token in self.tokens
|
||||
],
|
||||
"tags": self.tags,
|
||||
"template_id": self.template_id,
|
||||
"metadata": self.metadata
|
||||
"metadata": self.metadata,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
if 'spans' in data:
|
||||
data['spans'] = [Span.from_json(span) for span in data['spans']]
|
||||
if 'tokens' in data:
|
||||
data['tokens'] = [SimpleToken.from_json(val) for val in
|
||||
data['tokens']]
|
||||
if "spans" in data:
|
||||
data["spans"] = [Span.from_json(span) for span in data["spans"]]
|
||||
if "tokens" in data:
|
||||
data["tokens"] = [SimpleToken.from_json(val) for val in data["tokens"]]
|
||||
return cls(**data, create_tags_from_span=False)
|
||||
|
||||
def get_tags(self, scheme="IOB"):
|
||||
|
@ -252,33 +286,43 @@ class InputSample(object):
|
|||
tags = [span.entity_type for span in self.spans]
|
||||
tokens = tokenize(self.full_text)
|
||||
|
||||
labels = span_to_tag(scheme=scheme, text=self.full_text, tag=tags,
|
||||
start=start_indices, end=end_indices,
|
||||
tokens=tokens)
|
||||
labels = span_to_tag(
|
||||
scheme=scheme,
|
||||
text=self.full_text,
|
||||
tag=tags,
|
||||
start=start_indices,
|
||||
end=end_indices,
|
||||
tokens=tokens,
|
||||
)
|
||||
|
||||
return tokens, labels
|
||||
|
||||
def to_conll(self, translate_tags, scheme="BIO"):
|
||||
def to_conll(self, translate_tags):
|
||||
|
||||
conll = []
|
||||
for i, token in enumerate(self.tokens):
|
||||
if translate_tags:
|
||||
label = self.translate_tag(self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
label = self.translate_tag(
|
||||
self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
|
||||
)
|
||||
else:
|
||||
label = self.tags[i]
|
||||
conll.append({"text": token.text,
|
||||
"pos": token.pos_,
|
||||
"tag": token.tag_,
|
||||
"Template#": self.metadata['Template#'],
|
||||
"gender": self.metadata['Gender'],
|
||||
"country": self.metadata['Country'],
|
||||
"label": label},
|
||||
)
|
||||
conll.append(
|
||||
{
|
||||
"text": token.text,
|
||||
"pos": token.pos_,
|
||||
"tag": token.tag_,
|
||||
"Template#": self.metadata["Template#"],
|
||||
"gender": self.metadata["Gender"],
|
||||
"country": self.metadata["Country"],
|
||||
"label": label,
|
||||
},
|
||||
)
|
||||
|
||||
return conll
|
||||
|
||||
def get_template_id(self):
|
||||
return self.metadata['Template#']
|
||||
return self.metadata["Template#"]
|
||||
|
||||
@staticmethod
|
||||
def create_conll_dataset(dataset, translate_tags=True, to_bio=True):
|
||||
|
@ -291,66 +335,76 @@ class InputSample(object):
|
|||
sample.bilou_to_bio()
|
||||
conll = sample.to_conll(translate_tags=translate_tags)
|
||||
for token in conll:
|
||||
token['sentence'] = i
|
||||
token["sentence"] = i
|
||||
conlls.append(token)
|
||||
i += 1
|
||||
|
||||
return pd.DataFrame(conlls)
|
||||
|
||||
def to_spacy(self, entities=None, translate_tags=True):
|
||||
entities = [(span.start_position, span.end_position, span.entity_type)
|
||||
for span in self.spans if (entities is None) or (span.entity_type in entities)]
|
||||
entities = [
|
||||
(span.start_position, span.end_position, span.entity_type)
|
||||
for span in self.spans
|
||||
if (entities is None) or (span.entity_type in entities)
|
||||
]
|
||||
new_entities = []
|
||||
if translate_tags:
|
||||
for entity in entities:
|
||||
new_tag = self.translate_tag(entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
new_tag = self.translate_tag(
|
||||
entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
|
||||
)
|
||||
new_entities.append((entity[0], entity[1], new_tag))
|
||||
else:
|
||||
new_entities = entities
|
||||
return (self.full_text,
|
||||
{"entities": new_entities})
|
||||
return self.full_text, {"entities": new_entities}
|
||||
|
||||
@classmethod
|
||||
def from_spacy(cls, text, annotations, translate_from_spacy=True):
|
||||
spans = []
|
||||
for annotation in annotations:
|
||||
tag = cls.rename_from_spacy_tags([annotation[2]])[0] if translate_from_spacy else annotation[2]
|
||||
span = Span(tag, text[annotation[0]: annotation[1]], annotation[0], annotation[1])
|
||||
tag = (
|
||||
cls.rename_from_spacy_tags([annotation[2]])[0]
|
||||
if translate_from_spacy
|
||||
else annotation[2]
|
||||
)
|
||||
span = Span(
|
||||
tag, text[annotation[0] : annotation[1]], annotation[0], annotation[1]
|
||||
)
|
||||
spans.append(span)
|
||||
return cls(full_text=text, masked=None, spans=spans)
|
||||
|
||||
@staticmethod
|
||||
def create_spacy_dataset(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
|
||||
def create_spacy_dataset(
|
||||
dataset, entities=None, sort_by_template_id=False, translate_tags=True
|
||||
):
|
||||
def template_sort(x):
|
||||
return x.metadata['Template#']
|
||||
return x.metadata["Template#"]
|
||||
|
||||
if sort_by_template_id:
|
||||
dataset.sort(key=template_sort)
|
||||
|
||||
return [sample.to_spacy(entities=entities, translate_tags=translate_tags) for sample in dataset]
|
||||
return [
|
||||
sample.to_spacy(entities=entities, translate_tags=translate_tags)
|
||||
for sample in dataset
|
||||
]
|
||||
|
||||
def to_spacy_json(self, entities=None, translate_tags=True):
|
||||
token_dicts = []
|
||||
for i, token in enumerate(self.tokens):
|
||||
if entities:
|
||||
tag = self.tags[i] if self.tags[i][2:] in entities else 'O'
|
||||
tag = self.tags[i] if self.tags[i][2:] in entities else "O"
|
||||
else:
|
||||
tag = self.tags[i]
|
||||
|
||||
if translate_tags:
|
||||
tag = self.translate_tag(tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
token_dicts.append({
|
||||
"orth": token.text,
|
||||
"tag": token.tag_,
|
||||
"ner": tag
|
||||
})
|
||||
tag = self.translate_tag(
|
||||
tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True
|
||||
)
|
||||
token_dicts.append({"orth": token.text, "tag": token.tag_, "ner": tag})
|
||||
|
||||
spacy_json_sentence = {
|
||||
"raw": self.full_text,
|
||||
"sentences": [{
|
||||
"tokens": token_dicts
|
||||
}
|
||||
]
|
||||
"sentences": [{"tokens": token_dicts}],
|
||||
}
|
||||
|
||||
return spacy_json_sentence
|
||||
|
@ -359,29 +413,37 @@ class InputSample(object):
|
|||
doc = self.tokens
|
||||
spacy_spans = []
|
||||
for span in self.spans:
|
||||
start_token = [token.i for token in self.tokens if token.idx == span.start_position][0]
|
||||
end_token = [token.i for token in self.tokens if token.idx + len(token.text) == span.end_position][0] + 1
|
||||
spacy_span = spacy.tokens.span.Span(doc, start=start_token, end=end_token,
|
||||
label=span.entity_type)
|
||||
start_token = [
|
||||
token.i for token in self.tokens if token.idx == span.start_position
|
||||
][0]
|
||||
end_token = [
|
||||
token.i
|
||||
for token in self.tokens
|
||||
if token.idx + len(token.text) == span.end_position
|
||||
][0] + 1
|
||||
spacy_span = spacy.tokens.span.Span(
|
||||
doc, start=start_token, end=end_token, label=span.entity_type
|
||||
)
|
||||
spacy_spans.append(spacy_span)
|
||||
doc.ents = spacy_spans
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def create_spacy_json(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
|
||||
def create_spacy_json(
|
||||
dataset, entities=None, sort_by_template_id=False, translate_tags=True
|
||||
):
|
||||
def template_sort(x):
|
||||
return x.metadata['Template#']
|
||||
return x.metadata["Template#"]
|
||||
|
||||
if sort_by_template_id:
|
||||
dataset.sort(key=template_sort)
|
||||
|
||||
json_str = []
|
||||
for i, sample in tqdm(enumerate(dataset)):
|
||||
paragraph = sample.to_spacy_json(entities=entities, translate_tags=translate_tags)
|
||||
json_str.append({
|
||||
"id": i,
|
||||
"paragraphs": [paragraph]
|
||||
})
|
||||
paragraph = sample.to_spacy_json(
|
||||
entities=entities, translate_tags=translate_tags
|
||||
)
|
||||
json_str.append({"id": i, "paragraphs": [paragraph]})
|
||||
|
||||
return json_str
|
||||
|
||||
|
@ -402,10 +464,12 @@ class InputSample(object):
|
|||
|
||||
     @staticmethod
     def translate_tag(tag, dictionary, ignore_unknown):
-        has_prefix = len(tag) > 2 and tag[1] == '-'
+        has_prefix = len(tag) > 2 and tag[1] == "-"
         no_prefix = tag[2:] if has_prefix else tag
         if no_prefix in dictionary.keys():
-            return tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
+            return (
+                tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
+            )
         else:
             if ignore_unknown:
                 return "O"
|
||||
|
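For reference, a quick sketch of how this prefix-preserving translation behaves; the mapping below is a made-up example for illustration, not the actual PRESIDIO_SPACY_ENTITIES dictionary:

```python
from presidio_evaluator import InputSample

# Hypothetical mapping used only for illustration
example_mapping = {"PERSON": "PER", "LOCATION": "GPE"}

InputSample.translate_tag("B-PERSON", example_mapping, ignore_unknown=True)    # -> "B-PER"
InputSample.translate_tag("I-LOCATION", example_mapping, ignore_unknown=True)  # -> "I-GPE"
InputSample.translate_tag("U-TITLE", example_mapping, ignore_unknown=True)     # -> "O" (unknown entity)
```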
@ -416,41 +480,48 @@ class InputSample(object):
|
|||
         new_tags = []
         for tag in self.tags:
             new_tag = tag
-            has_prefix = len(tag) > 2 and tag[1] == '-'
+            has_prefix = len(tag) > 2 and tag[1] == "-"
             if has_prefix:
-                if tag[0] == 'U':
-                    new_tag = 'B' + tag[1:]
-                elif tag[0] == 'L':
-                    new_tag = 'I' + tag[1:]
+                if tag[0] == "U":
+                    new_tag = "B" + tag[1:]
+                elif tag[0] == "L":
+                    new_tag = "I" + tag[1:]
             new_tags.append(new_tag)

         self.tags = new_tags

     @staticmethod
     def rename_from_spacy_tags(spacy_tags, ignore_unknown=False):
-        return InputSample.translate_tags(spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown)
+        return InputSample.translate_tags(
+            spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown
+        )

     @staticmethod
     def rename_to_spacy_tags(tags, ignore_unknown=True):
-        return InputSample.translate_tags(tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown)
+        return InputSample.translate_tags(
+            tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown
+        )

     @staticmethod
     def write_spacy_json_from_docs(dataset, filename="spacy_output.json"):
         docs = [sample.to_spacy_doc() for sample in dataset]
-        srsly.write_json(filename, [spacy.gold.docs_to_json(docs)])
+        srsly.write_json(filename, [spacy.training.docs_to_json(docs)])

     def to_flair(self):
         for i, token in enumerate(self.tokens):
-            return "{} {} {}".format(token, token.pos_, self.tags[i])
+            return f"{token} {token.pos_} {self.tags[i]}"

-    def translate_input_sample_tags(self, dictionary=PRESIDIO_SPACY_ENTITIES, ignore_unknown=True):
-        self.tags = InputSample.translate_tags(self.tags, dictionary, ignore_unknown=ignore_unknown)
+    def translate_input_sample_tags(self, dictionary=None, ignore_unknown=True):
+        if dictionary is None:
+            dictionary = PRESIDIO_SPACY_ENTITIES
+        self.tags = InputSample.translate_tags(
+            self.tags, dictionary, ignore_unknown=ignore_unknown
+        )
         for span in self.spans:
             if span.entity_value in PRESIDIO_SPACY_ENTITIES:
                 span.entity_value = PRESIDIO_SPACY_ENTITIES[span.entity_value]
             elif ignore_unknown:
-                span.entity_value = 'O'
+                span.entity_value = "O"

     @staticmethod
     def create_flair_dataset(dataset):
|
||||
|
@ -459,83 +530,3 @@ class InputSample(object):
|
|||
flair_samples.append(sample.to_flair())
|
||||
|
||||
return flair_samples
|
||||
|
||||
|
||||
class ModelError:
|
||||
|
||||
def __init__(self, error_type, annotation, prediction, token, full_text, metadata):
|
||||
"""
|
||||
Holds information about an error a model made for analysis purposes
|
||||
:param error_type: str, e.g. FP, FN, Person->Address etc.
|
||||
:param annotation: ground truth value
|
||||
:param prediction: predicted value
|
||||
:param token: token in question
|
||||
:param full_text: full input text
|
||||
:param metadata: metadata on text from InputSample
|
||||
"""
|
||||
|
||||
self.error_type = error_type
|
||||
self.annotation = annotation
|
||||
self.prediction = prediction
|
||||
self.token = token
|
||||
self.full_text = full_text
|
||||
self.metadata = metadata
|
||||
|
||||
def __str__(self):
|
||||
return "type: {}, " \
|
||||
"Annotation = {}, " \
|
||||
"prediction = {}, " \
|
||||
"Token = {}, " \
|
||||
"Full text = {}, " \
|
||||
"Metadata = {}".format(self.error_type,
|
||||
self.annotation,
|
||||
self.prediction,
|
||||
self.token,
|
||||
self.full_text,
|
||||
self.metadata)
|
||||
|
||||
def __repr__(self):
|
||||
return r"<ModelError {{0}}>".format(self.__str__())
|
||||
|
||||
|
||||
class EvaluationResult(object):
|
||||
def __init__(self, results: Counter, model_errors: List[ModelError], text: str = None):
|
||||
"""
|
||||
Holds the output of a comparison between ground truth and predicted
|
||||
:param results: List of objects of type Counter
|
||||
with structure {(actual, predicted) : count}
|
||||
:param model_errors: List of ModelError
|
||||
:param text: sample's full text (if used for one sample)
|
||||
:type results: Counter
|
||||
:type model_errors : List[ModelError]
|
||||
:type text: object
|
||||
"""
|
||||
self.results = results
|
||||
self.model_errors = model_errors
|
||||
self.text = text
|
||||
|
||||
self.pii_recall = None
|
||||
self.pii_precision = None
|
||||
self.pii_f = None
|
||||
self.entity_recall_dict = None
|
||||
self.entity_precision_dict = None
|
||||
|
||||
def print(self):
|
||||
recall_dict = self.entity_recall_dict
|
||||
precision_dict = self.entity_precision_dict
|
||||
|
||||
recall_dict["PII"] = self.pii_recall
|
||||
precision_dict["PII"] = self.pii_precision
|
||||
|
||||
entities = recall_dict.keys()
|
||||
recall = recall_dict.values()
|
||||
precision = precision_dict.values()
|
||||
|
||||
row_format = "{:>30}{:>30.2%}{:>30.2%}"
|
||||
header_format = "{:>30}" * 3
|
||||
print(header_format.format(*("Entity", "Precision", "Recall")))
|
||||
for entity, precision, recall in zip(entities, precision, recall):
|
||||
print(row_format.format(entity, precision, recall))
|
||||
|
||||
print("PII F measure: {}".format(self.pii_f))
|
||||
|
||||
|
|
|
@ -0,0 +1,4 @@
|
|||
from .dataset_formatter import DatasetFormatter
|
||||
from .conll_formatter import CONLL2003Formatter
|
||||
|
||||
__all__ = ["DatasetFormatter", "CONLL2003Formatter"]
|
|
@ -0,0 +1,62 @@
|
|||
from pathlib import Path
|
||||
from typing import List, Optional
|
||||
|
||||
import requests
|
||||
from spacy.training import converters
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.dataset_formatters import DatasetFormatter
|
||||
|
||||
|
||||
class CONLL2003Formatter(DatasetFormatter):
|
||||
def __init__(
|
||||
self,
|
||||
files_path=Path("../data/conll2003").resolve(),
|
||||
glob_pattern: str = "*.iob",
|
||||
):
|
||||
self.files_path = files_path
|
||||
self.glob_pattern = glob_pattern
|
||||
|
||||
@staticmethod
|
||||
def download(
|
||||
local_data_path=Path("../data/conll2003").resolve(),
|
||||
conll_gh_path="https://raw.githubusercontent.com/glample/tagger/master/dataset/",
|
||||
):
|
||||
|
||||
for fold in ("eng.train", "eng.testa", "eng.testb"):
|
||||
fold_path = conll_gh_path + fold
|
||||
if not local_data_path.exists():
|
||||
local_data_path.mkdir(parents=True)
|
||||
|
||||
dataset_file = Path(local_data_path, fold)
|
||||
if dataset_file.exists():
|
||||
print("Dataset already exists, skipping download")
|
||||
return
|
||||
|
||||
response = requests.get(fold_path)
|
||||
dataset_raw = response.text
|
||||
with open(dataset_file, "w") as f:
|
||||
f.write(dataset_raw)
|
||||
print(f"Finished writing fold {fold} to {local_data_path}")
|
||||
|
||||
print("Finished downloading CoNNL2003")
|
||||
|
||||
def to_input_samples(self, fold: Optional[str] = None) -> List[InputSample]:
|
||||
files_found = False
|
||||
for i, file_path in enumerate(self.files_path.glob(self.glob_pattern)):
|
||||
if fold and fold not in file_path.name:
|
||||
continue
|
||||
|
||||
files_found = True
|
||||
with open(file_path, "r", encoding="utf-8") as file:
|
||||
text = file.readlines()
|
||||
|
||||
text = "".join(text)
|
||||
|
||||
output_docs = converters.conll_ner2json(
|
||||
input_data=text, n_sents=None, no_print=True
|
||||
)
|
||||
|
||||
# TODO: Translate to InputSample
|
||||
if not files_found:
|
||||
raise FileNotFoundError(f"No files found for pattern {self.glob_pattern}")
|
|
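A possible usage sketch for this formatter; note that to_input_samples is still a stub (see the TODO above), so its return value is not yet meaningful:

```python
from presidio_evaluator.dataset_formatters import CONLL2003Formatter

# Download the three CoNLL-2003 folds into the default local path, then read one fold
CONLL2003Formatter.download()
formatter = CONLL2003Formatter()
samples = formatter.to_input_samples(fold="train")
```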
@ -0,0 +1,14 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
|
||||
|
||||
class DatasetFormatter(ABC):
|
||||
@abstractmethod
|
||||
def to_input_samples(self) -> List[InputSample]:
|
||||
"""
|
||||
Translate a dataset structure into a list of InputSample objects, to be used by models and for evaluation
|
||||
:return:
|
||||
"""
|
||||
pass
|
|
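A new corpus can be plugged into the evaluation flow by subclassing this base class; a minimal illustrative sketch (the CSV layout here is hypothetical):

```python
from pathlib import Path
from typing import List

from presidio_evaluator import InputSample
from presidio_evaluator.dataset_formatters import DatasetFormatter


class MyCsvFormatter(DatasetFormatter):
    """Hypothetical formatter for a CSV file with text and span columns."""

    def __init__(self, path: Path):
        self.path = path

    def to_input_samples(self) -> List[InputSample]:
        samples = []
        # parse self.path row by row and build InputSample objects here
        return samples
```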
@ -0,0 +1,5 @@
|
|||
from .model_error import ModelError
|
||||
from .evaluation_result import EvaluationResult
|
||||
from .evaluator import Evaluator
|
||||
|
||||
__all__ = ["ModelError", "EvaluationResult", "Evaluator"]
|
|
@ -0,0 +1,51 @@
|
|||
from collections import Counter
|
||||
from typing import List, Optional
|
||||
from presidio_evaluator.evaluation import ModelError
|
||||
|
||||
|
||||
class EvaluationResult(object):
|
||||
def __init__(
|
||||
self,
|
||||
results: Counter,
|
||||
model_errors: Optional[List[ModelError]] = None,
|
||||
text: str = None,
|
||||
):
|
||||
"""
|
||||
Holds the output of a comparison between ground truth and predicted
|
||||
:param results: List of objects of type Counter
|
||||
with structure {(actual, predicted) : count}
|
||||
:param model_errors: List of specific model errors for further inspection
|
||||
:param text: sample's full text (if used for one sample)
|
||||
"""
|
||||
|
||||
self.results = results
|
||||
self.model_errors = model_errors
|
||||
self.text = text
|
||||
|
||||
self.pii_recall = None
|
||||
self.pii_precision = None
|
||||
self.pii_f = None
|
||||
self.entity_recall_dict = None
|
||||
self.entity_precision_dict = None
|
||||
|
||||
def print(self):
|
||||
recall_dict = self.entity_recall_dict
|
||||
precision_dict = self.entity_precision_dict
|
||||
|
||||
recall_dict["PII"] = self.pii_recall
|
||||
precision_dict["PII"] = self.pii_precision
|
||||
|
||||
entities = recall_dict.keys()
|
||||
recall = recall_dict.values()
|
||||
precision = precision_dict.values()
|
||||
|
||||
row_format = "{:>30}{:>30.2%}{:>30.2%}"
|
||||
header_format = "{:>30}" * 3
|
||||
print(header_format.format(*("Entity", "Precision", "Recall")))
|
||||
for entity, precision, recall in zip(entities, precision, recall):
|
||||
print(row_format.format(entity, precision, recall))
|
||||
|
||||
print("PII F measure: {}".format(self.pii_f))
|
||||
|
||||
def __repr__(self):
|
||||
return f"stats={self.results}"
|
|
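A short sketch of how an EvaluationResult is typically consumed once Evaluator.calculate_score (defined in the next file) has filled in the aggregate fields:

```python
# `score` is assumed to be an EvaluationResult returned by Evaluator.calculate_score
print(score.pii_precision, score.pii_recall, score.pii_f)
print(score.entity_recall_dict)   # recall per entity type
score.print()                     # formatted precision/recall table
```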
@ -0,0 +1,318 @@
|
|||
from collections import Counter
|
||||
from typing import List, Optional, Dict
|
||||
|
||||
import numpy as np
|
||||
from tqdm import tqdm
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.evaluation import EvaluationResult, ModelError
|
||||
from presidio_evaluator.models import BaseModel, PresidioAnalyzerWrapper
|
||||
|
||||
|
||||
class Evaluator:
|
||||
def __init__(
|
||||
self,
|
||||
model: BaseModel,
|
||||
verbose: bool = False,
|
||||
compare_by_io=True,
|
||||
entities_to_keep: Optional[List[str]] = None,
|
||||
):
|
||||
"""
|
||||
Evaluate a PII detection model or a Presidio analyzer / recognizer
|
||||
|
||||
:param model: Instance of a fitted model (of base type BaseModel)
|
||||
:param compare_by_io: True if comparison should be done on the entity
|
||||
level and not the sub-entity level
|
||||
:param entities_to_keep: List of entity names to focus the evaluator on (and ignore the rest).
|
||||
Default is None = all entities. If the provided model has a list of entities to keep,
|
||||
this list would be used for evaluation.
|
||||
"""
|
||||
self.model = model
|
||||
self.verbose = verbose
|
||||
self.compare_by_io = compare_by_io
|
||||
self.entities_to_keep = entities_to_keep
|
||||
if self.entities_to_keep is None and self.model.entities:
|
||||
self.entities_to_keep = self.model.entities
|
||||
|
||||
def compare(self, input_sample: InputSample, prediction: List[str]):
|
||||
|
||||
"""
|
||||
Compares ground truth tags (annotation) and predicted (prediction)
|
||||
:param input_sample: input sample containing a list of tags in the self.labeling_scheme format
:param prediction: predicted value for each token
|
||||
|
||||
"""
|
||||
annotation = input_sample.tags
|
||||
tokens = input_sample.tokens
|
||||
|
||||
if len(annotation) != len(prediction):
|
||||
print(
|
||||
"Annotation and prediction do not have the"
|
||||
"same length. Sample={}".format(input_sample)
|
||||
)
|
||||
return Counter(), []
|
||||
|
||||
results = Counter()
|
||||
mistakes = []
|
||||
|
||||
new_annotation = annotation.copy()
|
||||
|
||||
if self.compare_by_io:
|
||||
new_annotation = self._to_io(new_annotation)
|
||||
prediction = self._to_io(prediction)
|
||||
|
||||
# Ignore annotations that aren't in the list of
|
||||
# requested entities.
|
||||
if self.entities_to_keep:
|
||||
prediction = self._adjust_per_entities(prediction)
|
||||
new_annotation = self._adjust_per_entities(new_annotation)
|
||||
for i in range(0, len(new_annotation)):
|
||||
results[(new_annotation[i], prediction[i])] += 1
|
||||
|
||||
if self.verbose:
|
||||
print("Annotation:", new_annotation[i])
|
||||
print("Prediction:", prediction[i])
|
||||
print(results)
|
||||
|
||||
# check if there was an error
|
||||
is_error = new_annotation[i] != prediction[i]
|
||||
if is_error:
|
||||
if prediction[i] == "O":
|
||||
mistakes.append(
|
||||
ModelError(
|
||||
"FN",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata,
|
||||
)
|
||||
)
|
||||
elif new_annotation[i] == "O":
|
||||
mistakes.append(
|
||||
ModelError(
|
||||
"FP",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata,
|
||||
)
|
||||
)
|
||||
else:
|
||||
mistakes.append(
|
||||
ModelError(
|
||||
"Wrong entity",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata,
|
||||
)
|
||||
)
|
||||
|
||||
return results, mistakes
|
||||
|
||||
def _adjust_per_entities(self, tags):
|
||||
if self.entities_to_keep:
|
||||
return [tag if tag in self.entities_to_keep else "O" for tag in tags]
|
||||
|
||||
@staticmethod
|
||||
def _to_io(tags):
|
||||
"""
|
||||
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
|
||||
['B-PERSON','I-PERSON','L-PERSON'] is translated into
|
||||
['PERSON','PERSON','PERSON']
|
||||
:param tags: the input tags in BILOU/IOB/BIO format
|
||||
:return: a new list of IO tags
|
||||
"""
|
||||
return [tag[2:] if "-" in tag else tag for tag in tags]
|
||||
|
||||
def evaluate_sample(
|
||||
self, sample: InputSample, prediction: List[str]
|
||||
) -> EvaluationResult:
|
||||
if self.verbose:
|
||||
print("Input sentence: {}".format(sample.full_text))
|
||||
|
||||
results, mistakes = self.compare(input_sample=sample, prediction=prediction)
|
||||
return EvaluationResult(results, mistakes, sample.full_text)
|
||||
|
||||
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
|
||||
evaluation_results = []
|
||||
for sample in tqdm(dataset, desc="Evaluating {}".format(self.__class__)):
|
||||
prediction = self.model.predict(sample)
|
||||
evaluation_result = self.evaluate_sample(
|
||||
sample=sample, prediction=prediction
|
||||
)
|
||||
evaluation_results.append(evaluation_result)
|
||||
|
||||
return evaluation_results
|
||||
|
||||
@staticmethod
|
||||
def align_input_samples_to_presidio_analyzer(
|
||||
input_samples: List[InputSample],
|
||||
entities_mapping: Dict[
|
||||
str, str
|
||||
] = PresidioAnalyzerWrapper.presidio_entities_map,
|
||||
) -> List[InputSample]:
|
||||
"""
|
||||
Change input samples to conform with Presidio's entities
|
||||
:return: new list of InputSample
|
||||
"""
|
||||
|
||||
new_input_samples = input_samples.copy()
|
||||
|
||||
# A list that will contain updated input samples,
|
||||
new_list = []
|
||||
|
||||
# Iterate on all samples
|
||||
for input_sample in new_input_samples:
|
||||
contains_presidio_field = False
|
||||
new_spans = []
|
||||
# Update spans to match Presidio's entity name
|
||||
for span in input_sample.spans:
|
||||
in_presidio_field = False
|
||||
if span.entity_type in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(span.entity_type)
|
||||
span.entity_type = new_name
|
||||
contains_presidio_field = True
|
||||
|
||||
# Add to new span list, if the span contains an entity relevant to Presidio
|
||||
new_spans.append(span)
|
||||
input_sample.spans = new_spans
|
||||
|
||||
# Update tags in case this sample has relevant entities for evaluation
|
||||
if contains_presidio_field:
|
||||
for i, tag in enumerate(input_sample.tags):
|
||||
has_prefix = "-" in tag
|
||||
if has_prefix:
|
||||
prefix = tag[:2]
|
||||
clean = tag[2:]
|
||||
else:
|
||||
prefix = ""
|
||||
clean = tag
|
||||
|
||||
if clean in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(clean)
|
||||
input_sample.tags[i] = "{}{}".format(prefix, new_name)
|
||||
else:
|
||||
input_sample.tags[i] = "O"
|
||||
|
||||
new_list.append(input_sample)
|
||||
return new_list
|
||||
|
||||
def calculate_score(
|
||||
self,
|
||||
evaluation_results: List[EvaluationResult],
|
||||
entities: Optional[List[str]] = None,
|
||||
beta: float = 1,
|
||||
) -> EvaluationResult:
|
||||
"""
|
||||
Returns the pii_precision, pii_recall and f_measure either for each entity
|
||||
or for all entities (ignore_entity_type = True)
|
||||
:param evaluation_results: List of EvaluationResult
|
||||
:param entities: List of entities to calculate the score for. Default is None: all entities
:param beta: F measure beta value
|
||||
:return: EvaluationResult with precision, recall and f measures
|
||||
"""
|
||||
|
||||
# aggregate results
|
||||
all_results = sum([er.results for er in evaluation_results], Counter())
|
||||
|
||||
# compute pii_recall per entity
|
||||
entity_recall = {}
|
||||
entity_precision = {}
|
||||
if not entities:
|
||||
entities = list(set([x[0] for x in all_results.keys() if x[0] != "O"]))
|
||||
|
||||
for entity in entities:
|
||||
# all annotation of given type
|
||||
annotated = sum([all_results[x] for x in all_results if x[0] == entity])
|
||||
predicted = sum([all_results[x] for x in all_results if x[1] == entity])
|
||||
tp = all_results[(entity, entity)]
|
||||
|
||||
if annotated > 0:
|
||||
entity_recall[entity] = tp / annotated
|
||||
else:
|
||||
entity_recall[entity] = np.NaN
|
||||
|
||||
if predicted > 0:
|
||||
per_entity_tp = all_results[(entity, entity)]
|
||||
entity_precision[entity] = per_entity_tp / predicted
|
||||
else:
|
||||
entity_precision[entity] = np.NaN
|
||||
|
||||
# compute pii_precision and pii_recall
|
||||
annotated_all = sum([all_results[x] for x in all_results if x[0] != "O"])
|
||||
predicted_all = sum([all_results[x] for x in all_results if x[1] != "O"])
|
||||
if annotated_all > 0:
|
||||
pii_recall = (
|
||||
sum(
|
||||
[
|
||||
all_results[x]
|
||||
for x in all_results
|
||||
if (x[0] != "O" and x[1] != "O")
|
||||
]
|
||||
)
|
||||
/ annotated_all
|
||||
)
|
||||
else:
|
||||
pii_recall = np.NaN
|
||||
if predicted_all > 0:
|
||||
pii_precision = (
|
||||
sum(
|
||||
[
|
||||
all_results[x]
|
||||
for x in all_results
|
||||
if (x[0] != "O" and x[1] != "O")
|
||||
]
|
||||
)
|
||||
/ predicted_all
|
||||
)
|
||||
else:
|
||||
pii_precision = np.NaN
|
||||
# compute pii_f_beta-score
|
||||
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
|
||||
|
||||
# aggregate errors
|
||||
errors = []
|
||||
for res in evaluation_results:
|
||||
if res.model_errors:
|
||||
errors.extend(res.model_errors)
|
||||
|
||||
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
|
||||
evaluation_result.pii_precision = pii_precision
|
||||
evaluation_result.pii_recall = pii_recall
|
||||
evaluation_result.entity_recall_dict = entity_recall
|
||||
evaluation_result.entity_precision_dict = entity_precision
|
||||
evaluation_result.pii_f = pii_f_beta
|
||||
|
||||
return evaluation_result
|
||||
|
||||
@staticmethod
|
||||
def precision(tp: int, fp: int) -> float:
|
||||
return tp / (tp + fp + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def recall(tp: int, fn: int) -> float:
|
||||
return tp / (tp + fn + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def f_beta(precision: float, recall: float, beta: float) -> float:
|
||||
"""
|
||||
Returns the F score for precision, recall and a beta parameter
|
||||
:param precision: a float with the precision value
|
||||
:param recall: a float with the recall value
|
||||
:param beta: a float with the beta parameter of the F measure,
|
||||
which gives more or less weight to precision
|
||||
vs. recall
|
||||
:return: a float value of the f(beta) measure.
|
||||
"""
|
||||
if np.isnan(precision) or np.isnan(recall) or (precision == 0 and recall == 0):
|
||||
return np.nan
|
||||
|
||||
return ((1 + beta ** 2) * precision * recall) / (
|
||||
((beta ** 2) * precision) + recall
|
||||
)
|
|
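Tying it together, a hedged end-to-end sketch of evaluating a spaCy model with this class; the dataset path and entity list are assumptions:

```python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import SpacyModel

dataset = read_synth_dataset("../data/synth_dataset.txt")  # path is an assumption

model = SpacyModel(model_name="en_core_web_lg", entities_to_keep=["PERSON", "GPE"])
evaluator = Evaluator(model=model)

per_sample = evaluator.evaluate_all(dataset)
score = evaluator.calculate_score(per_sample, beta=2)  # beta > 1 favors recall
score.print()

# F-beta intuition: with precision=0.8, recall=0.6, beta=2:
# (1 + 4) * 0.8 * 0.6 / (4 * 0.8 + 0.6) is roughly 0.63
```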
@ -0,0 +1,174 @@
|
|||
from typing import Dict, List
|
||||
|
||||
from presidio_evaluator.data_objects import SimpleToken
|
||||
|
||||
import pandas as pd
|
||||
|
||||
|
||||
class ModelError:
|
||||
def __init__(
|
||||
self,
|
||||
error_type: str,
|
||||
annotation: str,
|
||||
prediction: str,
|
||||
token: SimpleToken,
|
||||
full_text: str,
|
||||
metadata: Dict,
|
||||
):
|
||||
"""
|
||||
Holds information about an error a model made for analysis purposes
|
||||
:param error_type: str, e.g. FP, FN, Person->Address etc.
|
||||
:param annotation: ground truth value
|
||||
:param prediction: predicted value
|
||||
:param token: token in question
|
||||
:param full_text: full input text
|
||||
:param metadata: metadata on text from InputSample
|
||||
"""
|
||||
|
||||
self.error_type = error_type
|
||||
self.annotation = annotation
|
||||
self.prediction = prediction
|
||||
self.token = token
|
||||
self.full_text = full_text
|
||||
self.metadata = metadata
|
||||
|
||||
def __str__(self):
|
||||
return (
|
||||
"type: {}, "
|
||||
"Annotation = {}, "
|
||||
"prediction = {}, "
|
||||
"Token = {}, "
|
||||
"Full text = {}, "
|
||||
"Metadata = {}".format(
|
||||
self.error_type,
|
||||
self.annotation,
|
||||
self.prediction,
|
||||
self.token,
|
||||
self.full_text,
|
||||
self.metadata,
|
||||
)
|
||||
)
|
||||
|
||||
def __repr__(self):
|
||||
return r"<ModelError {{0}}>".format(self.__str__())
|
||||
|
||||
@staticmethod
|
||||
def most_common_fp_tokens(errors: List["ModelError"], n: int = 10, entity=None):
|
||||
"""
|
||||
Print the n most common false positive tokens (tokens thought to be an entity)
|
||||
"""
|
||||
fps = ModelError.get_false_positives(errors, entity)
|
||||
|
||||
tokens = [err.token.text for err in fps]
|
||||
from collections import Counter
|
||||
|
||||
by_frequency = Counter(tokens)
|
||||
most_common = by_frequency.most_common(n)
|
||||
print("Most common false positive tokens:")
|
||||
print(most_common)
|
||||
print("Example sentence with each FP token:")
|
||||
for tok, val in most_common:
|
||||
with_tok = [err for err in fps if err.token.text == tok]
|
||||
print(with_tok[0].full_text)
|
||||
|
||||
@staticmethod
|
||||
def most_common_fn_tokens(errors: List["ModelError"], n: int = 10, entity=None):
|
||||
"""
|
||||
Print all tokens that were missed by the model, including an example of the full text in which they appear
|
||||
"""
|
||||
fns = ModelError.get_false_negatives(errors, entity)
|
||||
|
||||
fns_tokens = [err.token.text for err in fns]
|
||||
from collections import Counter
|
||||
|
||||
by_frequency_fns = Counter(fns_tokens)
|
||||
most_common_fns = by_frequency_fns.most_common(50)
|
||||
print(most_common_fns)
|
||||
for tok, val in most_common_fns:
|
||||
with_tok = [err for err in fns if err.token.text == tok]
|
||||
print(
|
||||
"Token: {}, Annotation: {}, Full text: {}".format(
|
||||
with_tok[0].token, with_tok[0].annotation, with_tok[0].full_text
|
||||
)
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def get_errors_df(
|
||||
errors=List["ModelError"], entity: List[str] = None, error_type: str = "FN"
|
||||
):
|
||||
"""
|
||||
Get ModelErrors as pd.DataFrame
|
||||
"""
|
||||
if error_type == "FN":
|
||||
filtered_errors = ModelError.get_false_negatives(errors, entity)
|
||||
elif error_type == "FP":
|
||||
filtered_errors = ModelError.get_false_positives(errors, entity)
|
||||
else:
|
||||
raise ValueError("error_type should be either FP or FN")
|
||||
|
||||
if len(filtered_errors) == 0:
|
||||
print(
|
||||
"No errors of type {} and entity {} were found".format(
|
||||
error_type, entity
|
||||
)
|
||||
)
|
||||
return None
|
||||
|
||||
errors_df = pd.DataFrame.from_records(
|
||||
[error.__dict__ for error in filtered_errors]
|
||||
)
|
||||
metadata_df = pd.DataFrame(errors_df["metadata"].tolist())
|
||||
errors_df.drop(["metadata"], axis=1, inplace=True)
|
||||
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
|
||||
return new_errors_df
|
||||
|
||||
@staticmethod
|
||||
def get_fps_dataframe(errors: List["ModelError"], entity: str = None):
|
||||
"""
|
||||
Get false positive ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelError.get_errors_df(errors, entity, error_type="FP")
|
||||
|
||||
@staticmethod
|
||||
def get_fns_dataframe(errors: List["ModelError"], entity: str = None):
|
||||
"""
|
||||
Get false negative ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelError.get_errors_df(errors, entity, error_type="FN")
|
||||
|
||||
@staticmethod
|
||||
def get_false_positives(errors: List["ModelError"], entity=None):
|
||||
"""
|
||||
Get a list of all false positive errors in the results
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
|
||||
if entity:
|
||||
return [
|
||||
model_error
|
||||
for model_error in errors
|
||||
if model_error.error_type == "FP" and model_error.prediction in entity
|
||||
]
|
||||
else:
|
||||
return [
|
||||
model_error for model_error in errors if model_error.error_type == "FP"
|
||||
]
|
||||
|
||||
@staticmethod
|
||||
def get_false_negatives(errors: List["ModelError"], entity=None):
|
||||
"""
|
||||
Get a list of all false negative errors in the results (false negatives and wrong entity detections)
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
if entity:
|
||||
return [
|
||||
model_error
|
||||
for model_error in errors
|
||||
if model_error.error_type != "FP" and model_error.annotation in entity
|
||||
]
|
||||
else:
|
||||
return [
|
||||
model_error for model_error in errors if model_error.error_type != "FP"
|
||||
]
|
|
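A brief sketch of how these helpers can be used to inspect errors after scoring; `score` is assumed to be an EvaluationResult with model_errors populated:

```python
from presidio_evaluator.evaluation import ModelError

errors = score.model_errors

ModelError.most_common_fp_tokens(errors, n=10)
fns_df = ModelError.get_fns_dataframe(errors, entity="PERSON")
if fns_df is not None:
    print(fns_df[["token", "annotation", "prediction", "full_text"]].head())
```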
@ -0,0 +1,159 @@
|
|||
"""E2E scoring pipelines for the different models"""
|
||||
|
||||
import math
|
||||
from typing import List, Optional
|
||||
|
||||
from presidio_analyzer import EntityRecognizer
|
||||
from presidio_analyzer.nlp_engine import SpacyNlpEngine
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from presidio_evaluator.evaluation import EvaluationResult, Evaluator
|
||||
from presidio_evaluator.models import (
|
||||
PresidioRecognizerWrapper,
|
||||
PresidioAnalyzerWrapper,
|
||||
BaseModel,
|
||||
)
|
||||
|
||||
|
||||
def score_model(
|
||||
model: BaseModel,
|
||||
entities_to_keep: List[str],
|
||||
input_samples: List[InputSample],
|
||||
verbose: bool = False,
|
||||
beta: float = 2.5,
|
||||
) -> EvaluationResult:
|
||||
"""
|
||||
Run data through a model and gather results and stats
|
||||
"""
|
||||
|
||||
print("Evaluating samples")
|
||||
|
||||
evaluator = Evaluator(model=model, entities_to_keep=entities_to_keep)
|
||||
evaluated_samples = evaluator.evaluate_all(input_samples)
|
||||
|
||||
print("Estimating metrics")
|
||||
evaluation_result = evaluator.calculate_score(
|
||||
evaluation_results=evaluated_samples, beta=beta
|
||||
)
|
||||
precision = evaluation_result.pii_precision
|
||||
recall = evaluation_result.pii_recall
|
||||
entity_recall = evaluation_result.entity_recall_dict
|
||||
entity_precision = evaluation_result.entity_precision_dict
|
||||
f = evaluation_result.pii_f
|
||||
errors = evaluation_result.model_errors
|
||||
#
|
||||
print(f"precision: {precision}")
|
||||
print(f"Recall: {recall}")
|
||||
print(f"F {beta}: {f}")
|
||||
print(f"Precision per entity: {entity_precision}")
|
||||
print(f"Recall per entity: {entity_recall}")
|
||||
|
||||
if verbose:
|
||||
|
||||
false_negatives = [
|
||||
str(mistake) for mistake in errors if mistake.error_type == "FN"
|
||||
]
|
||||
false_positives = [
|
||||
str(mistake) for mistake in errors if mistake.error_type == "FP"
|
||||
]
|
||||
other_mistakes = [
|
||||
str(mistake) for mistake in errors if mistake.error_type not in ["FN", "FP"]
|
||||
]
|
||||
|
||||
print("False negatives: ")
|
||||
print("\n".join(false_negatives))
|
||||
print("\n******************\n")
|
||||
|
||||
print("False positives: ")
|
||||
print("\n".join(false_positives))
|
||||
print("\n******************\n")
|
||||
|
||||
print("Other mistakes: ")
|
||||
print("\n".join(other_mistakes))
|
||||
|
||||
return evaluation_result
|
||||
|
||||
|
||||
def score_presidio_recognizer(
|
||||
recognizer: EntityRecognizer,
|
||||
entities_to_keep: List[str],
|
||||
input_samples: Optional[List[InputSample]] = None,
|
||||
labeling_scheme: str = "BILUO",
|
||||
with_nlp_artifacts: bool = False,
|
||||
verbose: bool = False,
|
||||
) -> EvaluationResult:
|
||||
"""
|
||||
Run data through one EntityRecognizer and gather results and stats
|
||||
"""
|
||||
|
||||
if not input_samples:
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../../data/synth_dataset.txt")
|
||||
else:
|
||||
input_samples = list(input_samples)
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(input_samples)
|
||||
|
||||
model = PresidioRecognizerWrapper(
|
||||
recognizer=recognizer,
|
||||
entities_to_keep=entities_to_keep,
|
||||
labeling_scheme=labeling_scheme,
|
||||
nlp_engine=SpacyNlpEngine(),
|
||||
with_nlp_artifacts=with_nlp_artifacts,
|
||||
)
|
||||
return score_model(
|
||||
model=model,
|
||||
entities_to_keep=entities_to_keep,
|
||||
input_samples=updated_samples,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
|
||||
def score_presidio_analyzer(
|
||||
input_samples: Optional[List[InputSample]] = None,
|
||||
entities_to_keep: Optional[List[str]] = None,
|
||||
labeling_scheme: str = "BILUO",
|
||||
verbose: bool = True,
|
||||
) -> EvaluationResult:
|
||||
""""""
|
||||
if not input_samples:
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../../data/synth_dataset.txt")
|
||||
else:
|
||||
input_samples = list(input_samples)
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(input_samples)
|
||||
|
||||
flatten = lambda l: [item for sublist in l for item in sublist]
|
||||
from collections import Counter
|
||||
|
||||
count_per_entity = Counter(
|
||||
[
|
||||
span.entity_type
|
||||
for span in flatten(
|
||||
[input_sample.spans for input_sample in updated_samples]
|
||||
)
|
||||
]
|
||||
)
|
||||
if verbose:
|
||||
print("Count per entity:")
|
||||
print(count_per_entity)
|
||||
analyzer = PresidioAnalyzerWrapper(
|
||||
entities_to_keep=entities_to_keep, labeling_scheme=labeling_scheme
|
||||
)
|
||||
|
||||
return score_model(
|
||||
model=analyzer,
|
||||
entities_to_keep=list(count_per_entity.keys()),
|
||||
input_samples=updated_samples,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
score_presidio_analyzer()
|
|
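These pipelines can also be called directly, e.g. to score a single predefined recognizer; the import path of the scoring helpers and the dataset path are assumptions:

```python
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.data_generator import read_synth_dataset
# Module path of the scoring helpers is an assumption
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

samples = read_synth_dataset("../data/synth_dataset.txt")
result = score_presidio_recognizer(
    recognizer=CreditCardRecognizer(),
    entities_to_keep=["CREDIT_CARD"],
    input_samples=samples,
)
print(result.pii_f)
```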
@ -1,398 +0,0 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List, Tuple, Dict
|
||||
from collections import Counter
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from presidio_evaluator import InputSample, EvaluationResult, ModelError
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
class ModelEvaluator(ABC):
|
||||
|
||||
def __init__(self, entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
use_spans: bool = False, labeling_scheme="BIO",
|
||||
compare_by_io=True):
|
||||
|
||||
"""
|
||||
Abstract class for evaluating NER models and others
|
||||
:param entities_to_keep: Which entities should be evaluated? All other
|
||||
entities are ignored. If None, none are filtered
|
||||
:param verbose: Whether to print more debug info
|
||||
:param labeling_scheme: Type of scheme used for labeling (BILOU,
|
||||
BIO/LOB or IO)
|
||||
:param compare_by_io: True if comparison should be done on the entity
|
||||
level and not the sub-entity level
|
||||
|
||||
"""
|
||||
self.entities = entities_to_keep
|
||||
self.verbose = verbose
|
||||
self.use_spans = use_spans
|
||||
self.compare_by_io = compare_by_io
|
||||
self.labeling_scheme = labeling_scheme
|
||||
|
||||
@abstractmethod
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
"""
|
||||
Abstract. Returns the predicted tokens/spans from the evaluated model
|
||||
:param sample: Sample to be evaluated
|
||||
:return: if self.use spans: list of spans
|
||||
if not self.use_spans: tags in self.labeling_scheme format
|
||||
"""
|
||||
pass
|
||||
|
||||
def compare(self, input_sample: InputSample, prediction: List[str]):
|
||||
|
||||
"""
|
||||
Compares gound truth tags (annotation) and predicted (prediction)
|
||||
:param input_sample: input sample containing list of tags with scheme
|
||||
:param prediction: predicted value for each token
|
||||
self.labeling_scheme
|
||||
|
||||
"""
|
||||
annotation = input_sample.tags
|
||||
tokens = input_sample.tokens
|
||||
|
||||
if len(annotation) != len(prediction):
|
||||
print("Annotation and prediction do not have the"
|
||||
"same length. Sample={}".format(input_sample))
|
||||
return Counter(), []
|
||||
|
||||
results = Counter()
|
||||
mistakes = []
|
||||
|
||||
new_annotation = annotation.copy()
|
||||
|
||||
if self.compare_by_io:
|
||||
new_annotation = self._to_io(new_annotation)
|
||||
prediction = self._to_io(prediction)
|
||||
|
||||
# Ignore annotations that aren't in the list of
|
||||
# requested entities.
|
||||
if self.entities:
|
||||
prediction = self._adjust_per_entities(prediction)
|
||||
new_annotation = self._adjust_per_entities(new_annotation)
|
||||
for i in range(0, len(new_annotation)):
|
||||
results[(new_annotation[i], prediction[i])] += 1
|
||||
|
||||
if self.verbose:
|
||||
print('Annotation:', new_annotation[i])
|
||||
print('Prediction:', prediction[i])
|
||||
print(results)
|
||||
|
||||
# check if there was an error
|
||||
is_error = (new_annotation[i] != prediction[i])
|
||||
if is_error:
|
||||
if prediction[i] == 'O':
|
||||
mistakes.append(ModelError("FN",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
elif new_annotation[i] == 'O':
|
||||
mistakes.append(ModelError("FP",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
else:
|
||||
mistakes.append(ModelError("Wrong entity",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
|
||||
return results, mistakes
|
||||
|
||||
def _adjust_per_entities(self, tags):
|
||||
if self.entities:
|
||||
return [tag if tag in self.entities else 'O' for tag in tags]
|
||||
|
||||
@staticmethod
|
||||
def _to_io(tags):
|
||||
"""
|
||||
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
|
||||
['B-PERSON','I-PERSON','L-PERSON'] is translated into
|
||||
['PERSON','PERSON','PERSON']
|
||||
:param tags: the input tags in BILOU/IOB/BIO format
|
||||
:return: a new list of IO tags
|
||||
"""
|
||||
return [tag[2:] if '-' in tag else tag for tag in tags]
|
||||
|
||||
def evaluate_sample(self, sample: InputSample) -> EvaluationResult:
|
||||
if self.verbose:
|
||||
print("Input sentence: {}".format(sample.full_text))
|
||||
|
||||
prediction = self.predict(sample)
|
||||
results, mistakes = self.compare(
|
||||
input_sample=sample,
|
||||
prediction=prediction)
|
||||
return EvaluationResult(results, mistakes, sample.full_text)
|
||||
|
||||
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
|
||||
evaluation_results = []
|
||||
for sample in tqdm(dataset, desc='Evaluating {}'.format(self.__class__)):
|
||||
evaluation_result = self.evaluate_sample(sample)
|
||||
evaluation_results.append(evaluation_result)
|
||||
|
||||
return evaluation_results
|
||||
|
||||
def calculate_score(self, evaluation_results: List[
|
||||
EvaluationResult], beta: float = 1) \
|
||||
-> EvaluationResult:
|
||||
"""
|
||||
Returns the pii_precision, pii_recall and f_measure either for each entity
|
||||
or for all entities (ignore_entity_type = True)
|
||||
:param evaluation_results: List of EvaluationResult
|
||||
:param beta: F measure beta value
|
||||
between different entity types, or to treat these as misclassifications
|
||||
:return: EvaluationResult with precision, recall and f measures
|
||||
"""
|
||||
|
||||
# aggregate results
|
||||
all_results = sum([er.results for er in evaluation_results], Counter())
|
||||
|
||||
# compute pii_recall per entity
|
||||
entity_recall = {}
|
||||
entity_precision = {}
|
||||
if self.entities:
|
||||
entities = self.entities
|
||||
else:
|
||||
entities = list(
|
||||
set([x[0] for x in all_results.keys() if x[0] != 'O']))
|
||||
|
||||
for entity in entities:
|
||||
# all annotation of given type
|
||||
annotated = sum(
|
||||
[all_results[x] for x in all_results if x[0] == entity])
|
||||
predicted = sum(
|
||||
[all_results[x] for x in all_results if x[1] == entity])
|
||||
tp = all_results[(entity, entity)]
|
||||
|
||||
if annotated > 0:
|
||||
entity_recall[entity] = tp / annotated
|
||||
else:
|
||||
entity_recall[entity] = np.NaN
|
||||
|
||||
if predicted > 0:
|
||||
per_entity_tp = all_results[(entity, entity)]
|
||||
entity_precision[entity] = per_entity_tp / predicted
|
||||
else:
|
||||
entity_precision[entity] = np.NaN
|
||||
|
||||
# compute pii_precision and pii_recall
|
||||
annotated_all = sum(
|
||||
[all_results[x] for x in all_results if x[0] != 'O'])
|
||||
predicted_all = sum(
|
||||
[all_results[x] for x in all_results if x[1] != 'O'])
|
||||
if annotated_all > 0:
|
||||
pii_recall = sum([all_results[x] for x in all_results if
|
||||
(x[0] != 'O' and x[1] != 'O')]) / annotated_all
|
||||
else:
|
||||
pii_recall = np.NaN
|
||||
if predicted_all > 0:
|
||||
pii_precision = sum([all_results[x] for x in all_results if
|
||||
(x[0] != 'O' and x[1] != 'O')]) / predicted_all
|
||||
else:
|
||||
pii_precision = np.NaN
|
||||
# compute pii_f_beta-score
|
||||
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
|
||||
|
||||
# aggregate errors
|
||||
errors = []
|
||||
for res in evaluation_results:
|
||||
if res.model_errors:
|
||||
errors.extend(res.model_errors)
|
||||
|
||||
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
|
||||
evaluation_result.pii_precision = pii_precision
|
||||
evaluation_result.pii_recall = pii_recall
|
||||
evaluation_result.entity_recall_dict = entity_recall
|
||||
evaluation_result.entity_precision_dict = entity_precision
|
||||
evaluation_result.pii_f = pii_f_beta
|
||||
|
||||
return evaluation_result
|
||||
|
||||
@staticmethod
|
||||
def precision(tp: int, fp: int) -> float:
|
||||
return tp / (tp + fp + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def recall(tp: int, fn: int) -> float:
|
||||
return tp / (tp + fn + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def f_beta(precision: float, recall: float, beta: float) -> float:
|
||||
"""
|
||||
Returns the F score for precision, recall and a beta parameter
|
||||
:param precision: a float with the precision value
|
||||
:param recall: a float with the recall value
|
||||
:param beta: a float with the beta parameter of the F measure,
|
||||
which gives more or less weight to precision
|
||||
vs. recall
|
||||
:return: a float value of the f(beta) measure.
|
||||
"""
|
||||
if np.isnan(precision) or np.isnan(recall) or (
|
||||
precision == 0 and recall == 0):
|
||||
return np.nan
|
||||
|
||||
return ((1 + beta ** 2) * precision * recall) / (
|
||||
((beta ** 2) * precision) + recall)
|
||||
|
||||
@staticmethod
|
||||
def align_input_samples_to_presidio_analyzer(input_samples: List[InputSample],
|
||||
entities_mapping: Dict[str, str],
|
||||
presidio_fields: List[str]=None) \
|
||||
-> List[InputSample]:
|
||||
"""
|
||||
Change input samples to conform with Presidio's entities
|
||||
:return: new list of InputSample
|
||||
"""
|
||||
|
||||
new_input_samples = input_samples.copy()
|
||||
|
||||
# Match entity names to Presidio's
|
||||
if not presidio_fields:
|
||||
presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',
|
||||
'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']
|
||||
|
||||
# A list that will contain updated input samples,
|
||||
new_list = []
|
||||
|
||||
# Iterate on all samples
|
||||
for input_sample in new_input_samples:
|
||||
contains_presidio_field = False
|
||||
new_spans = []
|
||||
# Update spans to match Presidio's entity name
|
||||
for span in input_sample.spans:
|
||||
in_presidio_field = False
|
||||
if span.entity_type in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(span.entity_type)
|
||||
span.entity_type = new_name
|
||||
contains_presidio_field = True
|
||||
|
||||
# Add to new span list, if the span contains an entity relevant to Presidio
|
||||
new_spans.append(span)
|
||||
input_sample.spans = new_spans
|
||||
|
||||
# Update tags in case this sample has relevant entities for evaluation
|
||||
if contains_presidio_field:
|
||||
for i, tag in enumerate(input_sample.tags):
|
||||
has_prefix = '-' in tag
|
||||
if has_prefix:
|
||||
prefix = tag[:2]
|
||||
clean = tag[2:]
|
||||
else:
|
||||
prefix = ""
|
||||
clean = tag
|
||||
|
||||
if clean in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(clean)
|
||||
input_sample.tags[i] = "{}{}".format(prefix, new_name)
|
||||
else:
|
||||
input_sample.tags[i] = 'O'
|
||||
|
||||
new_list.append(input_sample)
|
||||
return new_list
|
||||
|
||||
@staticmethod
|
||||
def get_false_positives(errors=List[ModelError], entity=None):
|
||||
"""
|
||||
Get a list of all false positive errors in the results
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
|
||||
if entity:
|
||||
return [model_error for model_error in errors if
|
||||
model_error.error_type == 'FP' and model_error.prediction in entity]
|
||||
else:
|
||||
return [model_error for model_error in errors if model_error.error_type == 'FP']
|
||||
|
||||
@staticmethod
|
||||
def get_false_negatives(errors=List[ModelError], entity=None):
|
||||
"""
|
||||
Get a list of all false positive negative errors in the results (False negatives and wrong entity detection)
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
if entity:
|
||||
return [model_error for model_error in errors if
|
||||
model_error.error_type != 'FP' and model_error.annotation in entity]
|
||||
else:
|
||||
return [model_error for model_error in errors if model_error.error_type != 'FP']
|
||||
|
||||
@staticmethod
|
||||
def most_common_fp_tokens(errors=List[ModelError], n: int = 10, entity=None):
|
||||
"""
|
||||
Print the n most common false positive tokens (tokens thought to be an entity)
|
||||
"""
|
||||
fps = ModelEvaluator.get_false_positives(errors, entity)
|
||||
|
||||
tokens = [err.token.text for err in fps]
|
||||
from collections import Counter
|
||||
by_frequency = Counter(tokens)
|
||||
most_common = by_frequency.most_common(n)
|
||||
print("Most common false positive tokens:")
|
||||
print(most_common)
|
||||
print("Example sentence with each FP token:")
|
||||
for tok, val in most_common:
|
||||
with_tok = [err for err in fps if err.token.text == tok]
|
||||
print(with_tok[0].full_text)
|
||||
|
||||
@staticmethod
|
||||
def most_common_fn_tokens(errors=List[ModelError], n: int = 10, entity=None):
|
||||
"""
|
||||
Print all tokens that were missed by the model, including an example of the full text in which they appear
|
||||
"""
|
||||
fns = ModelEvaluator.get_false_negatives(errors, entity)
|
||||
|
||||
fns_tokens = [err.token.text for err in fns]
|
||||
from collections import Counter
|
||||
by_frequency_fns = Counter(fns_tokens)
|
||||
most_common_fns = by_frequency_fns.most_common(50)
|
||||
print(most_common_fns)
|
||||
for tok, val in most_common_fns:
|
||||
with_tok = [err for err in fns if err.token.text == tok]
|
||||
print("Token: {}, Annotation: {}, Full text: {}".format(with_tok[0].token, with_tok[0].annotation,
|
||||
with_tok[0].full_text))
|
||||
|
||||
@staticmethod
|
||||
def get_errors_df(errors=List[ModelError], entity: List[str] = None, error_type: str = 'FN'):
|
||||
"""
|
||||
Get ModelErrors as pd.DataFrame
|
||||
"""
|
||||
if error_type == 'FN':
|
||||
filtered_errors = ModelEvaluator.get_false_negatives(errors, entity)
|
||||
elif error_type == 'FP':
|
||||
filtered_errors = ModelEvaluator.get_false_positives(errors, entity)
|
||||
else:
|
||||
raise ValueError("error_type should be either FP or FN")
|
||||
|
||||
if len(filtered_errors) == 0:
|
||||
print("No errors of type {} and entity {} were found".format(error_type,entity))
|
||||
return None
|
||||
|
||||
errors_df = pd.DataFrame.from_records([error.__dict__ for error in filtered_errors])
|
||||
metadata_df = pd.DataFrame(errors_df['metadata'].tolist())
|
||||
errors_df.drop(['metadata'], axis=1, inplace=True)
|
||||
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
|
||||
return new_errors_df
|
||||
|
||||
@staticmethod
|
||||
def get_fps_dataframe(errors=List[ModelError], entity: List[str] = None):
|
||||
"""
|
||||
Get false positive ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelEvaluator.get_errors_df(errors, entity, error_type='FP')
|
||||
|
||||
@staticmethod
|
||||
def get_fns_dataframe(errors=List[ModelError], entity: List[str] = None):
|
||||
"""
|
||||
Get false negative ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelEvaluator.get_errors_df(errors, entity, error_type='FN')
|
|
@ -0,0 +1,15 @@
|
|||
from .base_model import BaseModel
|
||||
from .crf_model import CRFModel
|
||||
from .presidio_analyzer_wrapper import PresidioAnalyzerWrapper
|
||||
from .presidio_recognizer_wrapper import PresidioRecognizerWrapper
|
||||
from .spacy_model import SpacyModel
|
||||
from .flair_model import FlairModel
|
||||
|
||||
__all__ = [
|
||||
"BaseModel",
|
||||
"CRFModel",
|
||||
"PresidioRecognizerWrapper",
|
||||
"PresidioAnalyzerWrapper",
|
||||
"SpacyModel",
|
||||
"FlairModel",
|
||||
]
|
|
@ -0,0 +1,37 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
|
||||
|
||||
class BaseModel(ABC):
|
||||
def __init__(
|
||||
self,
|
||||
labeling_scheme: str = "BILUO",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
):
|
||||
|
||||
"""
|
||||
Abstract class for evaluating NER models and others
|
||||
:param entities_to_keep: Which entities should be evaluated? All other
|
||||
entities are ignored. If None, none are filtered
|
||||
:param labeling_scheme: Used to translate (if needed)
|
||||
the prediction to a specific scheme (IO, BIO/IOB, BILUO)
|
||||
:param verbose: Whether to print more debug info
|
||||
|
||||
|
||||
"""
|
||||
self.entities = entities_to_keep
|
||||
self.labeling_scheme = labeling_scheme
|
||||
self.verbose = verbose
|
||||
|
||||
@abstractmethod
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
"""
|
||||
Abstract. Returns the predicted tokens/spans from the evaluated model
|
||||
:param sample: Sample to be evaluated
|
||||
:return: if self.use spans: list of spans
|
||||
if not self.use_spans: tags in self.labeling_scheme format
|
||||
"""
|
||||
pass
|
|
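Any tagger can participate in the evaluation by subclassing BaseModel and implementing predict; a minimal toy sketch:

```python
from typing import List

from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class EverythingIsOModel(BaseModel):
    """Toy baseline that predicts 'O' (no entity) for every token."""

    def predict(self, sample: InputSample) -> List[str]:
        return ["O" for _ in sample.tokens]
```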
@ -0,0 +1,104 @@
|
|||
import pickle
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.models import BaseModel
|
||||
|
||||
|
||||
class CRFModel(BaseModel):
|
||||
def __init__(
|
||||
self,
|
||||
model_pickle_path: str = "../models/crf.pickle",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
):
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
if model_pickle_path is None:
|
||||
raise ValueError("model_pickle_path must be supplied")
|
||||
|
||||
with open(model_pickle_path, "rb") as f:
|
||||
self.model = pickle.load(f)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
tags = CRFModel.crf_predict(sample, self.model)
|
||||
|
||||
if self.entities:
|
||||
tags = [tag for tag in tags if tag in self.entities]
|
||||
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
# translated_tags = sample.rename_from_spacy_tags(tags)
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def crf_predict(sample, model):
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
conll = sample.to_conll(translate_tags=True)
|
||||
sentence = [(di["text"], di["pos"], di["label"]) for di in conll]
|
||||
features = CRFModel.sent2features(sentence)
|
||||
return model.predict([features])[0]
|
||||
|
||||
@staticmethod
|
||||
def word2features(sent, i):
|
||||
word = sent[i][0]
|
||||
postag = sent[i][1]
|
||||
|
||||
features = {
|
||||
"bias": 1.0,
|
||||
"word.lower()": word.lower(),
|
||||
"word[-3:]": word[-3:],
|
||||
"word[-2:]": word[-2:],
|
||||
"word.isupper()": word.isupper(),
|
||||
"word.istitle()": word.istitle(),
|
||||
"word.isdigit()": word.isdigit(),
|
||||
"postag": postag,
|
||||
"postag[:2]": postag[:2],
|
||||
}
|
||||
if i > 0:
|
||||
word1 = sent[i - 1][0]
|
||||
postag1 = sent[i - 1][1]
|
||||
features.update(
|
||||
{
|
||||
"-1:word.lower()": word1.lower(),
|
||||
"-1:word.istitle()": word1.istitle(),
|
||||
"-1:word.isupper()": word1.isupper(),
|
||||
"-1:postag": postag1,
|
||||
"-1:postag[:2]": postag1[:2],
|
||||
}
|
||||
)
|
||||
else:
|
||||
features["BOS"] = True
|
||||
|
||||
if i < len(sent) - 1:
|
||||
word1 = sent[i + 1][0]
|
||||
postag1 = sent[i + 1][1]
|
||||
features.update(
|
||||
{
|
||||
"+1:word.lower()": word1.lower(),
|
||||
"+1:word.istitle()": word1.istitle(),
|
||||
"+1:word.isupper()": word1.isupper(),
|
||||
"+1:postag": postag1,
|
||||
"+1:postag[:2]": postag1[:2],
|
||||
}
|
||||
)
|
||||
else:
|
||||
features["EOS"] = True
|
||||
|
||||
return features
|
||||
|
||||
@staticmethod
|
||||
def sent2features(sent):
|
||||
return [CRFModel.word2features(sent, i) for i in range(len(sent))]
|
||||
|
||||
@staticmethod
|
||||
def sent2labels(sent):
|
||||
return [label for token, postag, label in sent]
|
||||
|
||||
@staticmethod
|
||||
def sent2tokens(sent):
|
||||
return [token for token, postag, label in sent]
|
|
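A short sketch of the feature-extraction and prediction flow; the pickle path is the default assumed above and `sample` stands for an existing InputSample:

```python
from presidio_evaluator.models import CRFModel

# (word, POS, label) triples -> per-token feature dicts for the underlying CRF
sentence = [("May", "NNP", "B-PERSON"), ("lives", "VBZ", "O"), ("here", "RB", "O")]
features = CRFModel.sent2features(sentence)
print(features[0]["word.lower()"])  # 'may'

# Load a trained model and predict tags for one sample
crf = CRFModel(model_pickle_path="../models/crf.pickle")
tags = crf.predict(sample)
```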
@ -1,42 +1,40 @@
|
|||
from typing import List
|
||||
|
||||
import spacy
|
||||
|
||||
try:
|
||||
from flair.data import Sentence, build_spacy_tokenizer
|
||||
from flair.models import SequenceTagger
|
||||
from flair.tokenization import SpacyTokenizer
|
||||
except ImportError:
|
||||
print("Flair is not installed by default")
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
import spacy
|
||||
|
||||
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.models import BaseModel
|
||||
|
||||
|
||||
class FlairEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model=None,
|
||||
model_path: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True,
|
||||
translate_to_spacy_entities=True):
|
||||
class FlairModel(BaseModel):
|
||||
def __init__(
|
||||
self,
|
||||
model=None,
|
||||
model_path: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
translate_to_spacy_entities=True,
|
||||
):
|
||||
"""
|
||||
Evaluator for Flair models
|
||||
:param model: model of type SequenceTagger
|
||||
:param model_path:
|
||||
:param entities_to_keep:
|
||||
:param verbose:
|
||||
:param labeling_scheme:
|
||||
:param compare_by_io:
|
||||
:param translate_to_spacy_entities:
|
||||
"""
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
)
|
||||
if model is None:
|
||||
if model_path is None:
|
||||
raise ValueError("Either model_path or model object must be supplied")
|
||||
|
@ -44,11 +42,15 @@ class FlairEvaluator(ModelEvaluator):
|
|||
else:
|
||||
self.model = model
|
||||
|
||||
self.spacy_tokenizer = build_spacy_tokenizer(model=spacy.blank('en'))
|
||||
self.spacy_tokenizer = SpacyTokenizer(model=spacy.load("en_core_web_lg"))
|
||||
self.translate_to_spacy_entities = translate_to_spacy_entities
|
||||
|
||||
if self.translate_to_spacy_entities:
|
||||
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
|
||||
print(
|
||||
"Translating entities using this dictionary: {}".format(
|
||||
PRESIDIO_SPACY_ENTITIES
|
||||
)
|
||||
)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.translate_to_spacy_entities:
|
||||
|
@ -59,13 +61,17 @@ class FlairEvaluator(ModelEvaluator):
|
|||
tags = self.get_tags_from_sentence(sentence)
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
|
||||
if self.entities:
|
||||
tags = [tag for tag in tags if tag in self.entities]
|
||||
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def get_tags_from_sentence(sentence):
|
||||
tags = []
|
||||
for token in sentence:
|
||||
tags.append(token.get_tag('ner').value)
|
||||
tags.append(token.get_tag("ner").value)
|
||||
|
||||
new_tags = []
|
||||
for tag in tags:
|
|
@ -0,0 +1,80 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_analyzer import AnalyzerEngine
|
||||
|
||||
from presidio_evaluator import InputSample, span_to_tag
|
||||
from presidio_evaluator.models import BaseModel
|
||||
|
||||
|
||||
class PresidioAnalyzerWrapper(BaseModel):
|
||||
def __init__(
|
||||
self,
|
||||
analyzer_engine=AnalyzerEngine(),
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme="BIO",
|
||||
score_threshold=0.4,
|
||||
):
|
||||
"""
|
||||
Evaluation wrapper for the Presidio Analyzer
|
||||
:param analyzer_engine: object of type AnalyzerEngine (from presidio-analyzer)
|
||||
"""
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
)
|
||||
self.analyzer_engine = analyzer_engine
|
||||
|
||||
self.score_threshold = score_threshold
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
|
||||
results = self.analyzer_engine.analyze(
|
||||
text=sample.full_text,
|
||||
entities=self.entities,
|
||||
language="en",
|
||||
score_threshold=self.score_threshold,
|
||||
)
|
||||
starts = []
|
||||
ends = []
|
||||
scores = []
|
||||
tags = []
|
||||
#
|
||||
for res in results:
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
|
||||
response_tags = span_to_tag(
|
||||
scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
tag=tags,
|
||||
)
|
||||
return response_tags
|
||||
|
||||
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
|
||||
presidio_entities_map = {
|
||||
"PERSON": "PERSON",
|
||||
"EMAIL_ADDRESS": "EMAIL_ADDRESS",
|
||||
"CREDIT_CARD": "CREDIT_CARD",
|
||||
"FIRST_NAME": "PERSON",
|
||||
"PHONE_NUMBER": "PHONE_NUMBER",
|
||||
"BIRTHDAY": "DATE_TIME",
|
||||
"DATE_TIME": "DATE_TIME",
|
||||
"DOMAIN": "DOMAIN",
|
||||
"CITY": "LOCATION",
|
||||
"ADDRESS": "LOCATION",
|
||||
"NATIONALITY": "LOCATION",
|
||||
"IBAN": "IBAN_CODE",
|
||||
"URL": "DOMAIN_NAME",
|
||||
"US_SSN": "US_SSN",
|
||||
"IP_ADDRESS": "IP_ADDRESS",
|
||||
"ORGANIZATION": "ORG",
|
||||
"O": "O",
|
||||
}
|
|
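A sketch of evaluating the full Presidio Analyzer through this wrapper (the dataset path is an assumption):

```python
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper

samples = read_synth_dataset("../data/synth_dataset.txt")

# Rename dataset entities (e.g. FIRST_NAME, BIRTHDAY) to Presidio's entity names
samples = Evaluator.align_input_samples_to_presidio_analyzer(samples)

model = PresidioAnalyzerWrapper(entities_to_keep=["PERSON", "PHONE_NUMBER"])
evaluator = Evaluator(model=model)
score = evaluator.calculate_score(evaluator.evaluate_all(samples), beta=2)
score.print()
```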
@ -0,0 +1,75 @@
from typing import List

from presidio_analyzer import EntityRecognizer
from presidio_analyzer.nlp_engine import NlpEngine

from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel
from presidio_evaluator.span_to_tag import span_to_tag


class PresidioRecognizerWrapper(BaseModel):

    def __init__(
        self,
        recognizer: EntityRecognizer,
        nlp_engine: NlpEngine,
        entities_to_keep: List[str] = None,
        labeling_scheme: str = "BILUO",
        with_nlp_artifacts: bool = False,
        verbose: bool = False,
    ):
        """
        Evaluator for one specific PII recognizer
        To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
        :param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
        :param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
        :param entities_to_keep: List of entity types to focus on while ignoring all the rest.
        Default=None would look at all entity types
        :param with_nlp_artifacts: Whether NLP artifacts should be obtained
        (faster if not, but some recognizers need it)
        """
        super().__init__(
            entities_to_keep=entities_to_keep,
            verbose=verbose,
            labeling_scheme=labeling_scheme,
        )
        self.with_nlp_artifacts = with_nlp_artifacts
        self.recognizer = recognizer
        self.nlp_engine = nlp_engine

    #
    def __make_nlp_artifacts(self, text: str):
        return self.nlp_engine.process_text(text, "en")

    #
    def predict(self, sample: InputSample) -> List[str]:
        nlp_artifacts = None
        if self.with_nlp_artifacts:
            nlp_artifacts = self.__make_nlp_artifacts(sample.full_text)

        results = self.recognizer.analyze(
            sample.full_text, self.entities, nlp_artifacts
        )
        starts = []
        ends = []
        tags = []
        scores = []
        for res in results:
            if not res.start:
                res.start = 0
            starts.append(res.start)
            ends.append(res.end)
            tags.append(res.entity_type)
            scores.append(res.score)
        response_tags = span_to_tag(
            scheme=self.labeling_scheme,
            text=sample.full_text,
            start=starts,
            end=ends,
            tag=tags,
            tokens=sample.tokens,
            scores=scores,
        )
        if len(sample.tags) == 0:
            sample.tags = ["0" for _ in response_tags]
        return response_tags
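
# Illustrative only, not part of this commit: a minimal sketch of scoring one predefined
# recognizer through the wrapper above. CreditCardRecognizer and SpacyNlpEngine ship with
# presidio-analyzer; the module path presidio_evaluator.models.presidio_recognizer_wrapper
# and the dataset path are assumptions for the example.
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_recognizer_wrapper import PresidioRecognizerWrapper

samples = read_synth_dataset("data/synth_dataset.txt")
model = PresidioRecognizerWrapper(
    recognizer=CreditCardRecognizer(),
    nlp_engine=SpacyNlpEngine(),
    entities_to_keep=["CREDIT_CARD"],
)

evaluator = Evaluator(model=model, entities_to_keep=["CREDIT_CARD"])
score = evaluator.calculate_score(evaluator.evaluate_all(samples), beta=2.5)
print(score.pii_precision, score.pii_recall)
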
@ -0,0 +1,55 @@
from typing import List

import spacy

from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class SpacyModel(BaseModel):
    def __init__(
        self,
        model: spacy.language.Language = None,
        model_name: str = None,
        entities_to_keep: List[str] = None,
        verbose: bool = False,
        labeling_scheme: str = "BIO",
        translate_to_spacy_entities=True,
    ):
        super().__init__(
            entities_to_keep=entities_to_keep,
            verbose=verbose,
            labeling_scheme=labeling_scheme,
        )

        if model is None:
            if model_name is None:
                raise ValueError("Either model_name or model object must be supplied")
            self.model = spacy.load(model_name)
        else:
            self.model = model

        self.translate_to_spacy_entities = translate_to_spacy_entities
        if self.translate_to_spacy_entities:
            print(
                "Translating entities using this dictionary: {}".format(
                    PRESIDIO_SPACY_ENTITIES
                )
            )

    def predict(self, sample: InputSample) -> List[str]:
        if self.translate_to_spacy_entities:
            sample.translate_input_sample_tags()

        doc = self.model(sample.full_text)
        tags = self.get_tags_from_doc(doc)
        if len(doc) != len(sample.tokens):
            print("mismatch between input tokens and new tokens")

        return tags

    @staticmethod
    def get_tags_from_doc(doc):
        tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
        return tags
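
# Illustrative only, not part of this commit: a minimal sketch of evaluating an
# off-the-shelf spaCy pipeline through the SpacyModel wrapper above. The module path
# presidio_evaluator.models.spacy_model and the dataset path are assumptions for the example.
import spacy

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.spacy_model import SpacyModel

samples = read_synth_dataset("data/synth_dataset.txt")
nlp = spacy.load("en_core_web_lg")
model = SpacyModel(model=nlp, entities_to_keep=["PERSON", "GPE", "ORG"])

evaluator = Evaluator(model=model)
score = evaluator.calculate_score(evaluator.evaluate_all(samples))
print(score.entity_precision_dict, score.entity_recall_dict)
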
@ -1,153 +0,0 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_analyzer import AnalyzerEngine
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample, span_to_tag
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
||||
|
||||
class PresidioAnalyzerEvaluator(ModelEvaluator):
|
||||
def __init__(
|
||||
self,
|
||||
analyzer=AnalyzerEngine(),
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme="BIO",
|
||||
compare_by_io=True,
|
||||
score_threshold=0.4,
|
||||
):
|
||||
"""
|
||||
Evaluation wrapper for the Presidio Analyzer
|
||||
:param analyzer: object of type AnalyzerEngine (from presidio-analyzer)
|
||||
"""
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io,
|
||||
)
|
||||
self.analyzer = analyzer
|
||||
|
||||
self.score_threshold = score_threshold
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.entities is None or len(self.entities) == 0:
|
||||
all_fields = True
|
||||
else:
|
||||
all_fields = None
|
||||
results = self.analyzer.analyze(
|
||||
text=sample.full_text,
|
||||
entities=self.entities,
|
||||
language="en",
|
||||
all_fields=all_fields,
|
||||
)
|
||||
starts = []
|
||||
ends = []
|
||||
scores = []
|
||||
tags = []
|
||||
#
|
||||
for res in results:
|
||||
#
|
||||
if res.score >= self.score_threshold:
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
#
|
||||
response_tags = span_to_tag(
|
||||
scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
tag=tags,
|
||||
)
|
||||
return response_tags
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../data/synth_dataset.txt")
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
|
||||
entities_mapping = {
|
||||
"PERSON": "PERSON",
|
||||
"EMAIL": "EMAIL_ADDRESS",
|
||||
"CREDIT_CARD": "CREDIT_CARD",
|
||||
"FIRST_NAME": "PERSON",
|
||||
"PHONE_NUMBER": "PHONE_NUMBER",
|
||||
"BIRTHDAY": "DATE_TIME",
|
||||
"DATE": "DATE_TIME",
|
||||
"DOMAIN": "DOMAIN",
|
||||
"CITY": "LOCATION",
|
||||
"ADDRESS": "LOCATION",
|
||||
"IBAN": "IBAN_CODE",
|
||||
"URL": "DOMAIN_NAME",
|
||||
"US_SSN": "US_SSN",
|
||||
"IP_ADDRESS": "IP_ADDRESS",
|
||||
"ORGANIZATION": "ORG",
|
||||
"O": "O",
|
||||
}
|
||||
|
||||
updated_samples = ModelEvaluator.align_input_samples_to_presidio_analyzer(
|
||||
input_samples, entities_mapping
|
||||
)
|
||||
|
||||
flatten = lambda l: [item for sublist in l for item in sublist]
|
||||
from collections import Counter
|
||||
|
||||
count_per_entity = Counter(
|
||||
[
|
||||
span.entity_type
|
||||
for span in flatten(
|
||||
[input_sample.spans for input_sample in updated_samples]
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
print("Evaluating samples")
|
||||
analyzer = PresidioAnalyzerEvaluator(entities_to_keep=count_per_entity.keys())
|
||||
evaluated_samples = analyzer.evaluate_all(updated_samples)
|
||||
|
||||
print("Estimating metrics")
|
||||
score = analyzer.calculate_score(evaluation_results=evaluated_samples, beta=2.5)
|
||||
precision = score.pii_precision
|
||||
recall = score.pii_recall
|
||||
entity_recall = score.entity_recall_dict
|
||||
entity_precision = score.entity_precision_dict
|
||||
f = score.pii_f
|
||||
errors = score.model_errors
|
||||
#
|
||||
print("precision: {}".format(precision))
|
||||
print("Recall: {}".format(recall))
|
||||
print("F 2.5: {}".format(f))
|
||||
print("Precision per entity: {}".format(entity_precision))
|
||||
print("Recall per entity: {}".format(entity_recall))
|
||||
#
|
||||
FN_mistakes = [str(mistake) for mistake in errors if mistake.error_type == "FN"]
|
||||
FP_mistakes = [str(mistake) for mistake in errors if mistake.error_type == "FP"]
|
||||
other_mistakes = [
|
||||
str(mistake) for mistake in errors if mistake.error_type not in ["FN", "FP"]
|
||||
]
|
||||
|
||||
fn = open("../data/fn_30000.txt", "w+", encoding="utf-8")
|
||||
fn1 = "\n".join(FN_mistakes)
|
||||
fn.write(fn1)
|
||||
fn.close()
|
||||
|
||||
fp = open("../data/fp_30000.txt", "w+", encoding="utf-8")
|
||||
fp1 = "\n".join(FP_mistakes)
|
||||
fp.write(fp1)
|
||||
fp.close()
|
||||
|
||||
mistakes_file = open("../data/mistakes_30000.txt", "w+", encoding="utf-8")
|
||||
mistakes1 = "\n".join(other_mistakes)
|
||||
mistakes_file.write(mistakes1)
|
||||
mistakes_file.close()
|
||||
|
||||
from pickle import dump
|
||||
|
||||
dump(evaluated_samples, open("../data/evaluated_samples_30000.pickle", "wb"))
|
|
@ -1,133 +0,0 @@
|
|||
import json
|
||||
from typing import List
|
||||
|
||||
import requests
|
||||
|
||||
from presidio_evaluator import InputSample, ModelEvaluator
|
||||
from presidio_evaluator.span_to_tag import span_to_tag, tokenize
|
||||
|
||||
ENDPOINT = "http://40.113.201.221:8080/api/v1/projects/test/analyze"
|
||||
|
||||
|
||||
class PresidioAPIEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self, endpoint=None, all_fields=False, entities_to_keep=None,
|
||||
verbose=False, labeling_scheme="IO", **kwargs):
|
||||
"""
|
||||
evaluator model for the presidio API as a system
|
||||
:param endpoint: url of presidio API
|
||||
:param all_fields: boolean, true if no entities filtering should take
|
||||
place
|
||||
:param entities_to_keep: list of entities to return if found
|
||||
:param labeling_scheme: BIO/IOB or BILOU
|
||||
:param verbose:
|
||||
:param kwargs:
|
||||
"""
|
||||
|
||||
if not endpoint:
|
||||
print(
|
||||
"Endpoint is missing. using default presidio API at {}".format(
|
||||
ENDPOINT))
|
||||
self.endpoint = ENDPOINT
|
||||
else:
|
||||
self.endpoint = endpoint
|
||||
|
||||
if not entities_to_keep and not all_fields:
|
||||
raise ValueError("Please provide either a list of entities or"
|
||||
"all_fields=true")
|
||||
|
||||
if all_fields:
|
||||
entities_to_keep = None
|
||||
super().__init__(verbose=verbose, entities_to_keep=entities_to_keep,
|
||||
labeling_scheme=labeling_scheme, **kwargs)
|
||||
|
||||
self.set_analyze_template(all_fields=all_fields,
|
||||
entities=entities_to_keep)
|
||||
|
||||
def predict(self, sample: InputSample):
|
||||
text = sample.full_text
|
||||
request = {"text": text,
|
||||
"analyzeTemplate": self.analyze_template
|
||||
}
|
||||
# Call presidio API
|
||||
r = requests.post(self.endpoint, json=request)
|
||||
starts = []
|
||||
ends = []
|
||||
tags = []
|
||||
|
||||
if r.status_code == 200:
|
||||
analyzer_results = json.loads(r.text)
|
||||
if self.verbose:
|
||||
print(analyzer_results)
|
||||
|
||||
if analyzer_results:
|
||||
for res in analyzer_results:
|
||||
if not res['location'].get('start'):
|
||||
res['location']['start'] = 0
|
||||
starts.append(res['location']['start'])
|
||||
ends.append(res['location']['end'])
|
||||
tags.append(res['field']['name'])
|
||||
|
||||
response_tags = span_to_tag(scheme=self.labeling_scheme,
|
||||
text=text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tag=tags)
|
||||
|
||||
elif r.status_code == 400 or r.text == "":
|
||||
if self.verbose:
|
||||
print("Status 400 received")
|
||||
response_tags = ['O' for token in sample.tokens]
|
||||
else:
|
||||
print("Error getting result from Presidio API")
|
||||
print("Request = {}".format(request))
|
||||
print("Response = {}".format(r.text))
|
||||
raise Exception(r)
|
||||
|
||||
return response_tags
|
||||
|
||||
def set_analyze_template(self, all_fields: bool, entities: List[str]):
|
||||
template = {
|
||||
"fields": [{"name": "EMAIL_ADDRESS"}, {"name": "IP_ADDRESS"},
|
||||
{"name": "US_DRIVER_LICENSE"},
|
||||
{"name": "US_ITIN"}, {"name": "US_SSN"},
|
||||
{"name": "DOMAIN_NAME"},
|
||||
{"name": "IBAN_CODE"}, {"name": "PERSON"},
|
||||
{"name": "PHONE_NUMBER"},
|
||||
{"name": "US_BANK_NUMBER"}, {"name": "CRYPTO"},
|
||||
{"name": "NRP"},
|
||||
{"name": "UK_NHS"}, {"name": "CREDIT_CARD"},
|
||||
{"name": "DATE_TIME"},
|
||||
{"name": "LOCATION"}, {"name": "US_PASSPORT"}]}
|
||||
|
||||
if all_fields:
|
||||
self.analyze_template = template
|
||||
return
|
||||
|
||||
requested_fields = []
|
||||
for entity in entities:
|
||||
for field in template['fields']:
|
||||
if entity == field['name']:
|
||||
requested_fields.append(field)
|
||||
|
||||
new_template = {'fields': requested_fields}
|
||||
|
||||
self.analyze_template = new_template
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example:
|
||||
text = "My siblings are Dan and magen"
|
||||
bilou_tags = ['O', 'O', 'O', 'U-PERSON', 'O', 'U-PERSON']
|
||||
presidio = PresidioAPIEvaluator(verbose=True, all_fields=True, compare_by_io=True)
|
||||
tokens = tokenize(text)
|
||||
s = InputSample(text, masked=None, spans=None)
|
||||
s.tokens = tokens
|
||||
s.tags = bilou_tags
|
||||
|
||||
evaluated_sample = presidio.evaluate_sample(s)
|
||||
p, r, entity_recall, f, mistakes = presidio.calculate_score([evaluated_sample])
|
||||
print("Precision = {}\n"
|
||||
"Recall = {}\n"
|
||||
"F_3 = {}\n"
|
||||
"Errors = {}".format(p, r, f, mistakes))
|
|
@ -1,88 +0,0 @@
|
|||
"""
|
||||
Presidio Analyzer not yet on PyPI, therefore it cannot be referenced explicitly
|
||||
"""
|
||||
|
||||
import math
|
||||
from typing import List, Tuple, Dict
|
||||
|
||||
from presidio_analyzer.nlp_engine import SpacyNlpEngine
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample, EvaluationResult
|
||||
from presidio_evaluator.span_to_tag import span_to_tag
|
||||
|
||||
|
||||
class PresidioRecognizerEvaluator(ModelEvaluator):
|
||||
def __init__(
|
||||
self,
|
||||
recognizer,
|
||||
nlp_engine,
|
||||
entities_to_keep=None,
|
||||
with_nlp_artifacts=False,
|
||||
verbose=False,
|
||||
compare_by_io=True,
|
||||
):
|
||||
"""
|
||||
Evaluator for one recognizer
|
||||
:param recognizer: An object of type EntityRecognizer (in presidion-analyzer)
|
||||
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
|
||||
"""
|
||||
super().__init__(
|
||||
entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
compare_by_io=compare_by_io,
|
||||
)
|
||||
self.withNlpArtifacts = with_nlp_artifacts
|
||||
self.recognizer = recognizer
|
||||
self.nlp_engine = nlp_engine
|
||||
|
||||
#
|
||||
def __make_nlp_artifacts(self, text: str):
|
||||
return self.nlp_engine.process_text(text, "en")
|
||||
|
||||
#
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
nlpArtifacts = None
|
||||
if self.withNlpArtifacts:
|
||||
nlpArtifacts = self.__make_nlp_artifacts(sample.full_text)
|
||||
results = self.recognizer.analyze(sample.full_text, self.entities, nlpArtifacts)
|
||||
starts = []
|
||||
ends = []
|
||||
tags = []
|
||||
scores = []
|
||||
for res in results:
|
||||
if not res.start:
|
||||
res.start = 0
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
response_tags = span_to_tag(
|
||||
scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tag=tags,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
io_tags_only=self.compare_by_io,
|
||||
)
|
||||
if len(sample.tags) == 0:
|
||||
sample.tags = ["0" for word in response_tags]
|
||||
return response_tags
|
||||
|
||||
|
||||
def score_presidio_recognizer(
|
||||
recognizer, entities_to_keep, input_samples, withNlpArtifacts=False
|
||||
) -> EvaluationResult:
|
||||
model = PresidioRecognizerEvaluator(
|
||||
recognizer=recognizer,
|
||||
entities_to_keep=entities_to_keep,
|
||||
nlp_engine=SpacyNlpEngine(),
|
||||
with_nlp_artifacts=withNlpArtifacts,
|
||||
)
|
||||
evaluated_samples = model.evaluate_all(input_samples[:])
|
||||
evaluation_result = model.calculate_score(evaluated_samples, beta=2.5)
|
||||
evaluation_result.print()
|
||||
if math.isnan(evaluation_result.pii_precision):
|
||||
evaluation_result.pii_precision = 0
|
||||
return evaluation_result
|
|
@ -1,52 +0,0 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
import spacy
|
||||
|
||||
from spacy.language import Language
|
||||
|
||||
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
|
||||
|
||||
|
||||
class SpacyEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model: spacy.language.Language = None,
|
||||
model_name: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True,
|
||||
translate_to_spacy_ents = True):
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model is None:
|
||||
if model_name is None:
|
||||
raise ValueError("Either model_name or model object must be supplied")
|
||||
self.model = spacy.load(model_name)
|
||||
else:
|
||||
self.model = model
|
||||
|
||||
self.translate_to_spacy_ents = translate_to_spacy_ents
|
||||
if self.translate_to_spacy_ents:
|
||||
print("Translating entites using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.translate_to_spacy_ents:
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
doc = self.model(sample.full_text)
|
||||
tags = self.get_tags_from_doc(doc)
|
||||
if len(doc) != len(sample.tokens):
|
||||
print("mismatch between input tokens and new tokens")
|
||||
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def get_tags_from_doc(doc):
|
||||
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
|
||||
return tags
|
||||
|
|
@ -1,14 +1,14 @@
from collections import namedtuple
from typing import List

import spacy
from spacy.tokens import Token

loaded_spacy = {}


def get_spacy(loaded_spacy=loaded_spacy, model_version="en_core_web_lg"):
    if model_version not in loaded_spacy:
        disable = ['vectors', 'textcat', 'ner']
        disable = ["vectors", "textcat", "ner"]
        print("loading model {}".format(model_version))
        loaded_spacy[model_version] = spacy.load(model_version, disable=disable)
    return loaded_spacy[model_version]
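
# Illustrative only, not part of this commit: get_spacy caches one spaCy pipeline per model
# name in the module-level dict above, so repeated calls reuse the already loaded object.
nlp_first = get_spacy(model_version="en_core_web_lg")   # prints "loading model en_core_web_lg"
nlp_second = get_spacy(model_version="en_core_web_lg")  # served from the cache, no reload
assert nlp_first is nlp_second
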
@ -26,7 +26,7 @@ def _get_detailed_tags(scheme, cur_tags):
    :return:
    """

    if all([tag == 'O' for tag in cur_tags]):
    if all([tag == "O" for tag in cur_tags]):
        return cur_tags

    return_tags = []

@ -52,7 +52,12 @@ def _get_detailed_tags(scheme, cur_tags):

def _sort_spans(start, end, tag, score):
    if len(start) > 0:
        tpl = [(a, b, c, d) for a, b, c, d in sorted(zip(start, end, tag, score), key=lambda pair: pair[0])]
        tpl = [
            (a, b, c, d)
            for a, b, c, d in sorted(
                zip(start, end, tag, score), key=lambda pair: pair[0]
            )
        ]
        start, end, tag, score = [[x[i] for x in tpl] for i in range(len(tpl[0]))]
    return start, end, tag, score


@ -65,8 +70,8 @@ def _handle_overlaps(start, end, tag, score):
    index = min(start)
    number_of_spans = len(start)
    i = 0
    while i < number_of_spans-1:
        for j in range(i+1,number_of_spans):
    while i < number_of_spans - 1:
        for j in range(i + 1, number_of_spans):
            # Span j intersects with span i
            if start[i] <= start[j] <= end[i]:
                # i's score is higher, remove intersecting part

@ -98,14 +103,15 @@ def _handle_overlaps(start, end, tag, score):
    return start, end, tag, score


def span_to_tag(scheme: str,
                text: str,
                start: List[int],
                end: List[int],
                tag: List[str],
                scores: List[float] = None,
                tokens: List[spacy.tokens.Token] = None,
                io_tags_only=False) -> List[str]:
def span_to_tag(
    scheme: str,
    text: str,
    start: List[int],
    end: List[int],
    tag: List[str],
    scores: List[float] = None,
    tokens: List[spacy.tokens.Token] = None,
) -> List[str]:
    """
    Turns a list of start and end values with corresponding labels, into a NER
    tagging (BILOU,BIO/IOB)

@ -116,7 +122,6 @@ def span_to_tag(scheme: str,
    :param end: list of indices where entities in the text end
    :param tag: list of entity names
    :param scores: score of tag (confidence)
    :param io_tags_only: Whether to return only I and O tags
    :return: list of strings, representing either BILOU or BIO for the input
    """


@ -141,7 +146,7 @@ def span_to_tag(scheme: str,
    if not found:
        io_tags.append("O")

    if io_tags_only or scheme == "IO":
    if scheme == "IO":
        return io_tags

    # Set tagging based on scheme (BIO/IOB or BILOU)

@ -158,7 +163,9 @@ def span_to_tag(scheme: str,
    new_return_tags = []
    for i in range(len(changes) - 1):
        new_return_tags.extend(
            _get_detailed_tags(scheme=scheme,
                               cur_tags=io_tags[changes[i]:changes[i + 1]]))
            _get_detailed_tags(
                scheme=scheme, cur_tags=io_tags[changes[i] : changes[i + 1]]
            )
        )

    return new_return_tags
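
# Illustrative only, not part of this commit: a minimal sketch of how span_to_tag turns
# character-level spans into token-level NER tags. Token boundaries come from the spaCy
# tokenizer used under the hood, so the expected output below is indicative only.
from presidio_evaluator.span_to_tag import span_to_tag

text = "My name is David"
tags = span_to_tag(
    scheme="BILUO",
    text=text,
    start=[11],        # "David" starts at character 11
    end=[16],          # and ends at character 16
    tag=["PERSON"],
    scores=[0.85],
)
# Expected (roughly): ["O", "O", "O", "U-PERSON"]
print(tags)
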
@ -7,7 +7,7 @@ import json
from presidio_evaluator import InputSample


def split_dataset(dataset : List[InputSample], ratios):
def split_dataset(dataset: List[InputSample], ratios):
    """
    Splits a provided dataset into n groups, by the Template# attribute in each sample's metadata
    :param dataset: List of InputSamples to be split

@ -23,7 +23,9 @@ def split_dataset(dataset : List[InputSample], ratios):

    for ratio in ratios:
        if 1 >= ratio > 0:
            first_templates, second_templates = split_by_template(remaining_dataset, ratio/remaining_ratio)
            first_templates, second_templates = split_by_template(
                remaining_dataset, ratio / remaining_ratio
            )
            first_split = get_samples_by_pattern(remaining_dataset, first_templates)
            second_split = get_samples_by_pattern(remaining_dataset, second_templates)
            splits.append(first_split)

@ -39,7 +41,7 @@ def group_by_template(dataset: List[InputSample]) -> Dict[str, List[InputSample]
    """
    Creates a dict of key = template ID and value = List[InputSamples] for this template id
    """
    samples_pattern_tup = [(sample.metadata["Template#"],sample) for sample in dataset]
    samples_pattern_tup = [(sample.metadata["Template#"], sample) for sample in dataset]

    group_by_template = defaultdict(list)
    for sample in samples_pattern_tup:

@ -55,7 +57,9 @@ def split_by_template(input_samples: List[InputSample], train_pct: float = 0.7):
    samples_grpd = group_by_template(input_samples)

    templates = np.array(list(samples_grpd.keys()))
    train_ind = set(random.sample(range(len(templates)), round(train_pct * len(templates))))
    train_ind = set(
        random.sample(range(len(templates)), round(train_pct * len(templates)))
    )

    test_ind = set(range(len(templates))) - train_ind


@ -75,5 +79,5 @@ def get_samples_by_pattern(input_samples, patterns_list):
def save_to_json(samples, output_file):
    examples_dict = [example.to_dict() for example in samples]

    with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
    with open("{}".format(output_file), "w+", encoding="utf-8") as f:
        json.dump(examples_dict, f, ensure_ascii=False, indent=4)
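
# Illustrative only, not part of this commit: a minimal sketch of splitting a synthetic
# dataset into train/test/validation by template, using split_dataset and save_to_json
# defined above. The dataset and output paths are examples.
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("data/synth_dataset.txt")
train, test, validation = split_dataset(samples, [0.7, 0.2, 0.1])

save_to_json(train, "data/train.json")
save_to_json(test, "data/test.json")
save_to_json(validation, "data/validation.json")
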
@ -1,18 +1,15 @@
spacy
requests==2.22.0
numpy
jupyter
pandas
tqdm
haikunator
spacy==3.0.5
numpy==1.20.2
jupyter>=1
pandas~=1.2.4
tqdm~=4.60.0
haikunator~=2.1.0
schwifty
faker
sklearn
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz
regex
#azureml
#azureml-sdk
faker~=8.1.0
scikit_learn==0.24.1
#flair
sklearn_crfsuite
pytest
presidio_analyzer
sklearn_crfsuite==0.3.6
pytest~=6.2.3
presidio_analyzer
presidio_anonymizer
requests~=2.25.1
setup.py
@ -1,4 +1,4 @@
from setuptools import setup
from setuptools import setup, find_packages
import os.path
# read the contents of the README file
from os import path

@ -7,7 +7,6 @@ this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
# print(long_description)
__version__ = ""

with open(os.path.join(this_directory, 'VERSION')) as version_file:
    __version__ = version_file.read().strip()

@ -17,16 +16,15 @@ setup(
    long_description=long_description,
    long_description_content_type='text/markdown',
    version=__version__,
    packages=['presidio_evaluator', 'presidio_evaluator.data_generator'
              ],
    packages=find_packages(exclude=["tests"]),
    url='https://www.github.com/microsoft/presidio',
    license='MIT',
    description='PII dataset generator, model evaluator for Presidio and PII data in general',
    data_files=[('presidio_evaluator/data_generator/raw_data', ['presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv', 'presidio_evaluator/data_generator/raw_data/templates.txt', 'presidio_evaluator/data_generator/raw_data/organizations.csv', 'presidio_evaluator/data_generator/raw_data/nationalities.csv'])],
    include_package_data=True,
    install_requires=[
        'spacy>=2.2.0',
        'requests==2.22.0',
        'spacy>=3.0.0',
        'requests',
        'numpy',
        'pandas',
        'tqdm>=4.32.1',
@ -12,7 +12,7 @@ def pytest_addoption(parser):
        "--runslow", action="store_true", default=False, help="run slow tests"
    )
    parser.addoption(
        "--runinconclusive", action="store_true", default=False, help="run slow tests"
        "--runinconclusive", action="store_true", default=False, help="run inconclusive tests"
    )
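
# Illustrative only, not part of this commit: a hypothetical test module showing how the
# custom pytest options above are typically consumed. It assumes conftest.py maps
# --runslow / --runinconclusive to "slow" / "inconclusive" markers in the usual pytest pattern.
import pytest


@pytest.mark.slow
def test_full_dataset_evaluation():
    # Collected but skipped unless pytest is invoked with --runslow
    assert True


@pytest.mark.inconclusive
def test_experimental_recognizer():
    # Collected but skipped unless pytest is invoked with --runinconclusive
    assert True
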
@ -1,254 +1,139 @@
|
|||
[
|
||||
{
|
||||
"full_text": "My full address is Avda. Alameda Sundheim 46",
|
||||
"full_text": "I either live on 2347 Lauzon Parkway, Windsor N9A7A2 or ",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "FULL_ADDRESS",
|
||||
"entity_value": "Avda. Alameda Sundheim 46",
|
||||
"start_position": 19,
|
||||
"end_position": 44
|
||||
"entity_type": "LOCATION",
|
||||
"entity_value": "2347 Lauzon Parkway, Windsor N9A7A2",
|
||||
"start_position": 17,
|
||||
"end_position": 52
|
||||
},
|
||||
{
|
||||
"entity_type": "LOCATION",
|
||||
"entity_value": "",
|
||||
"start_position": 56,
|
||||
"end_position": 56
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "My",
|
||||
"idx": 0,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "full",
|
||||
"idx": 3,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "amod",
|
||||
"lemma_": "full",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "address",
|
||||
"idx": 8,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "address",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 16,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "be",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Avda",
|
||||
"idx": 19,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Avda",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ".",
|
||||
"idx": 23,
|
||||
"tag_": ".",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ".",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Alameda",
|
||||
"idx": 25,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Alameda",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Sundheim",
|
||||
"idx": 33,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "Sundheim",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "46",
|
||||
"idx": 42,
|
||||
"tag_": "CD",
|
||||
"pos_": "NUM",
|
||||
"dep_": "nummod",
|
||||
"lemma_": "46",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"B-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"L-FULL_ADDRESS"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "Croatian",
|
||||
"Country": "Uganda",
|
||||
"Lowercase": false,
|
||||
"Template#": 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "You want my credit card? No problem: 4532368231815457",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "CREDIT_CARD",
|
||||
"entity_value": "4532368231815457",
|
||||
"start_position": 37,
|
||||
"end_position": 53
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "You",
|
||||
"text": "I",
|
||||
"idx": 0,
|
||||
"tag_": "PRP",
|
||||
"pos_": "PRON",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "-PRON-",
|
||||
"lemma_": "I",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "want",
|
||||
"idx": 4,
|
||||
"text": "either",
|
||||
"idx": 2,
|
||||
"tag_": "RB",
|
||||
"pos_": "ADV",
|
||||
"dep_": "advmod",
|
||||
"lemma_": "either",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "live",
|
||||
"idx": 9,
|
||||
"tag_": "VBP",
|
||||
"pos_": "VERB",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "want",
|
||||
"lemma_": "live",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "my",
|
||||
"idx": 9,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"text": "on",
|
||||
"idx": 14,
|
||||
"tag_": "IN",
|
||||
"pos_": "ADP",
|
||||
"dep_": "prep",
|
||||
"lemma_": "on",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "credit",
|
||||
"idx": 12,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "credit",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "card",
|
||||
"idx": 19,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "dobj",
|
||||
"lemma_": "card",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "?",
|
||||
"idx": 23,
|
||||
"tag_": ".",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": "?",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "No",
|
||||
"idx": 25,
|
||||
"tag_": "DT",
|
||||
"pos_": "DET",
|
||||
"dep_": "det",
|
||||
"lemma_": "no",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "problem",
|
||||
"idx": 28,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "problem",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ":",
|
||||
"idx": 35,
|
||||
"tag_": ":",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ":",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "4532368231815457",
|
||||
"idx": 37,
|
||||
"text": "2347",
|
||||
"idx": 17,
|
||||
"tag_": "CD",
|
||||
"pos_": "NUM",
|
||||
"dep_": "nummod",
|
||||
"lemma_": "2347",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Lauzon",
|
||||
"idx": 22,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Lauzon",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Parkway",
|
||||
"idx": 29,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "pobj",
|
||||
"lemma_": "Parkway",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ",",
|
||||
"idx": 36,
|
||||
"tag_": ",",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ",",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Windsor",
|
||||
"idx": 38,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Windsor",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "N9A7A2",
|
||||
"idx": 46,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "appos",
|
||||
"lemma_": "4532368231815457",
|
||||
"lemma_": "N9A7A2",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "or",
|
||||
"idx": 53,
|
||||
"tag_": "CC",
|
||||
"pos_": "CCONJ",
|
||||
"dep_": "cc",
|
||||
"lemma_": "or",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
|
@ -259,37 +144,38 @@
|
|||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-CREDIT_CARD"
|
||||
"B-LOCATION",
|
||||
"I-LOCATION",
|
||||
"I-LOCATION",
|
||||
"I-LOCATION",
|
||||
"I-LOCATION",
|
||||
"L-LOCATION",
|
||||
"O"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "female",
|
||||
"NameSet": "Czech",
|
||||
"Country": "Austria",
|
||||
"Gender": "male",
|
||||
"NameSet": "Polish",
|
||||
"Country": "Croatia",
|
||||
"Lowercase": false,
|
||||
"Template#": 7
|
||||
"Template#": 11
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "My first name is Rogelio and my last is Patrick",
|
||||
"full_text": "My accounts are and ",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "PERSON",
|
||||
"entity_value": "Rogelio",
|
||||
"start_position": 17,
|
||||
"end_position": 24
|
||||
"entity_type": "ACCOUNT_NUMBER",
|
||||
"entity_value": "",
|
||||
"start_position": 16,
|
||||
"end_position": 16
|
||||
},
|
||||
{
|
||||
"entity_type": "PERSON",
|
||||
"entity_value": "Patrick",
|
||||
"start_position": 40,
|
||||
"end_position": 47
|
||||
"entity_type": "ACCOUNT_NUMBER",
|
||||
"entity_value": "",
|
||||
"start_position": 21,
|
||||
"end_position": 21
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
|
@ -297,39 +183,28 @@
|
|||
"text": "My",
|
||||
"idx": 0,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"pos_": "PRON",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"lemma_": "my",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "first",
|
||||
"text": "accounts",
|
||||
"idx": 3,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "amod",
|
||||
"lemma_": "first",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "name",
|
||||
"idx": 9,
|
||||
"tag_": "NN",
|
||||
"tag_": "NNS",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "name",
|
||||
"lemma_": "account",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 14,
|
||||
"tag_": "VBZ",
|
||||
"text": "are",
|
||||
"idx": 12,
|
||||
"tag_": "VBP",
|
||||
"pos_": "AUX",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "be",
|
||||
|
@ -338,19 +213,19 @@
|
|||
}
|
||||
},
|
||||
{
|
||||
"text": "Rogelio",
|
||||
"idx": 17,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"text": " ",
|
||||
"idx": 16,
|
||||
"tag_": "_SP",
|
||||
"pos_": "SPACE",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Rogelio",
|
||||
"lemma_": " ",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "and",
|
||||
"idx": 25,
|
||||
"idx": 17,
|
||||
"tag_": "CC",
|
||||
"pos_": "CCONJ",
|
||||
"dep_": "cc",
|
||||
|
@ -358,47 +233,76 @@
|
|||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "Hispanic",
|
||||
"Country": "Iraq",
|
||||
"Lowercase": false,
|
||||
"Template#": 14
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "I live in Uralaane",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"text": "my",
|
||||
"idx": 29,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
"entity_type": "LOCATION",
|
||||
"entity_value": "Uralaane",
|
||||
"start_position": 10,
|
||||
"end_position": 18
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "last",
|
||||
"idx": 32,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"text": "I",
|
||||
"idx": 0,
|
||||
"tag_": "PRP",
|
||||
"pos_": "PRON",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "last",
|
||||
"lemma_": "I",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 37,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "conj",
|
||||
"lemma_": "be",
|
||||
"text": "live",
|
||||
"idx": 2,
|
||||
"tag_": "VBP",
|
||||
"pos_": "VERB",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "live",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Patrick",
|
||||
"idx": 40,
|
||||
"text": "in",
|
||||
"idx": 7,
|
||||
"tag_": "IN",
|
||||
"pos_": "ADP",
|
||||
"dep_": "prep",
|
||||
"lemma_": "in",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Uralaane",
|
||||
"idx": 10,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Patrick",
|
||||
"dep_": "pobj",
|
||||
"lemma_": "Uralaane",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
|
@ -408,21 +312,15 @@
|
|||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-PERSON",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-PERSON"
|
||||
"U-LOCATION"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "American",
|
||||
"Country": "California",
|
||||
"Gender": "female",
|
||||
"NameSet": "Chechen (Latin)",
|
||||
"Country": "United States Of America",
|
||||
"Lowercase": false,
|
||||
"Template#": 2
|
||||
"Template#": 5
|
||||
}
|
||||
}
|
||||
]
|
|
@ -1,3 +1,11 @@
from .model_mock import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, \
    MockTokensModel
from .model_mock import (
    IdentityTokensMockModel,
    FiftyFiftyIdentityTokensMockModel,
    MockTokensModel,
)

__all__ = [
    "IdentityTokensMockModel",
    "FiftyFiftyIdentityTokensMockModel",
    "MockTokensModel",
]
@ -1,14 +1,15 @@
from typing import List
from typing import List, Optional

from presidio_evaluator import InputSample, ModelEvaluator
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel


class MockTokensModel(ModelEvaluator):
class MockTokensModel(BaseModel):
    """
    Simulates a real model, returns the prediction given in the constructor
    """

    def __init__(self, prediction: List[str], entities_to_keep: List = None,
    def __init__(self, prediction: Optional[List[str]], entities_to_keep: List = None,
                 verbose: bool = False, **kwargs):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose,
                         **kwargs)

@ -18,20 +19,19 @@ class MockTokensModel(ModelEvaluator):
        return self.prediction


class IdentityTokensMockModel(ModelEvaluator):
class IdentityTokensMockModel(BaseModel):
    """
    Simulates a real model, always return the label as prediction
    """

    def __init__(self, entities_to_keep: List = None,
                 verbose: bool = False):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
    def __init__(self, verbose: bool = False):
        super().__init__(verbose=verbose)

    def predict(self, sample: InputSample) -> List[str]:
        return sample.tags


class FiftyFiftyIdentityTokensMockModel(ModelEvaluator):
class FiftyFiftyIdentityTokensMockModel(BaseModel):
    """
    Simulates a real model, returns the label or no predictions (list of 'O')
    alternately
@ -1,6 +1,6 @@
import numpy as np

from presidio_evaluator.crf_evaluator import CRFEvaluator
from presidio_evaluator.models.crf_model import CRFModel
from presidio_evaluator.data_generator import read_synth_dataset


@ -12,7 +12,7 @@ def no_test_test_crf_simple():

    model_path = os.path.abspath(os.path.join(dir_path, "..", "model-outputs/crf.pickle"))

    crf_evaluator = CRFEvaluator(model_pickle_path=model_path,entities_to_keep=['PERSON'])
    crf_evaluator = CRFModel(model_pickle_path=model_path, entities_to_keep=['PERSON'])
    evaluation_results = crf_evaluator.evaluate_all(input_samples)
    scores = crf_evaluator.calculate_score(evaluation_results)
@ -0,0 +1,298 @@
|
|||
import numpy as np
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from presidio_evaluator.evaluation import EvaluationResult, Evaluator
|
||||
from tests.mocks import (
|
||||
IdentityTokensMockModel,
|
||||
FiftyFiftyIdentityTokensMockModel,
|
||||
MockTokensModel,
|
||||
)
|
||||
|
||||
|
||||
def test_evaluator_simple():
|
||||
prediction = ["O", "O", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=["ANIMAL"])
|
||||
|
||||
evaluator = Evaluator(model=model)
|
||||
sample = InputSample(
|
||||
full_text="I am the walrus", masked="I am the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
final_evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert final_evaluation.pii_precision == 1
|
||||
assert final_evaluation.pii_recall == 1
|
||||
|
||||
|
||||
def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
|
||||
prediction = ["O", "O", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["SPACESHIP"])
|
||||
|
||||
sample = InputSample(
|
||||
full_text="I am the walrus", masked="I am the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
assert evaluated.results[("O", "O")] == 4
|
||||
|
||||
|
||||
def test_evaluate_same_entity_correct_statistics():
|
||||
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
full_text="I dog the walrus", masked="I [ANIMAL] the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluation_result = evaluator.evaluate_sample(sample, prediction)
|
||||
assert evaluation_result.results[("O", "O")] == 2
|
||||
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
|
||||
assert evaluation_result.results[("O", "ANIMAL")] == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_entities_to_keep_correct_statistics():
|
||||
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
|
||||
entities_to_keep = ["ANIMAL", "PLANT", "SPACESHIP"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=entities_to_keep)
|
||||
|
||||
sample = InputSample(
|
||||
full_text="I dog the walrus", masked="I [ANIMAL] the [ANIMAL]", spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluation_result = evaluator.evaluate_sample(sample, prediction)
|
||||
assert evaluation_result.results[("O", "O")] == 2
|
||||
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
|
||||
assert evaluation_result.results[("O", "ANIMAL")] == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
"I am the walrus amaericanus magnifico", masked=None, spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert evaluation.pii_precision == 1
|
||||
assert evaluation.pii_recall == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_partial_match_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
"I am the walrus amaericanus magnifico", masked=None, spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert evaluation.pii_precision == 1
|
||||
assert evaluation.pii_recall == 4 / 6
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["ANIMAL"])
|
||||
sample = InputSample(
|
||||
"I am the walrus amaericanus magnifico", masked=None, spans=None
|
||||
)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = evaluator.evaluate_sample(sample, prediction)
|
||||
evaluation = evaluator.calculate_score([evaluated])
|
||||
|
||||
assert np.isnan(evaluation.pii_precision)
|
||||
assert evaluation.pii_recall == 0
|
||||
|
||||
|
||||
def test_evaluate_multiple_examples_correct_statistics():
|
||||
prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON"])
|
||||
input_sample = InputSample("My name is Raphael or David", masked=None, spans=None)
|
||||
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
|
||||
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
|
||||
|
||||
evaluated = evaluator.evaluate_all(
|
||||
[input_sample, input_sample, input_sample, input_sample]
|
||||
)
|
||||
scores = evaluator.calculate_score(evaluated)
|
||||
assert scores.pii_precision == 0.5
|
||||
assert scores.pii_recall == 0.5
|
||||
|
||||
|
||||
def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
|
||||
prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
|
||||
model = MockTokensModel(prediction=prediction)
|
||||
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON", "TENNIS_PLAYER"])
|
||||
input_sample = InputSample("My name is Raphael or David", masked=None, spans=None)
|
||||
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
|
||||
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
|
||||
|
||||
evaluated = evaluator.evaluate_all(
|
||||
[input_sample, input_sample, input_sample, input_sample]
|
||||
)
|
||||
scores = evaluator.calculate_score(evaluated)
|
||||
assert scores.pii_precision == 1
|
||||
assert scores.pii_recall == 1
|
||||
|
||||
|
||||
def test_confusion_matrix_correct_metrics():
|
||||
from collections import Counter
|
||||
|
||||
evaluated = [
|
||||
EvaluationResult(
|
||||
results=Counter(
|
||||
{
|
||||
("O", "O"): 150,
|
||||
("O", "PERSON"): 30,
|
||||
("O", "COMPANY"): 30,
|
||||
("PERSON", "PERSON"): 40,
|
||||
("COMPANY", "COMPANY"): 40,
|
||||
("PERSON", "COMPANY"): 10,
|
||||
("COMPANY", "PERSON"): 10,
|
||||
("PERSON", "O"): 30,
|
||||
("COMPANY", "O"): 30,
|
||||
}
|
||||
),
|
||||
model_errors=None,
|
||||
text=None,
|
||||
)
|
||||
]
|
||||
|
||||
model = MockTokensModel(prediction=None)
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON", "COMPANY"])
|
||||
scores = evaluator.calculate_score(evaluated, beta=2.5)
|
||||
|
||||
assert scores.pii_precision == 0.625
|
||||
assert scores.pii_recall == 0.625
|
||||
assert scores.entity_recall_dict["PERSON"] == 0.5
|
||||
assert scores.entity_precision_dict["PERSON"] == 0.5
|
||||
assert scores.entity_recall_dict["COMPANY"] == 0.5
|
||||
assert scores.entity_precision_dict["COMPANY"] == 0.5
|
||||
|
||||
|
||||
def test_confusion_matrix_2_correct_metrics():
|
||||
from collections import Counter
|
||||
|
||||
evaluated = [
|
||||
EvaluationResult(
|
||||
results=Counter(
|
||||
{
|
||||
("O", "O"): 65467,
|
||||
("O", "ORG"): 4189,
|
||||
("GPE", "O"): 3370,
|
||||
("PERSON", "PERSON"): 2024,
|
||||
("GPE", "PERSON"): 1488,
|
||||
("GPE", "GPE"): 1033,
|
||||
("O", "GPE"): 964,
|
||||
("ORG", "ORG"): 914,
|
||||
("O", "PERSON"): 834,
|
||||
("GPE", "ORG"): 401,
|
||||
("PERSON", "ORG"): 35,
|
||||
("PERSON", "O"): 33,
|
||||
("ORG", "O"): 8,
|
||||
("PERSON", "GPE"): 5,
|
||||
("ORG", "PERSON"): 1,
|
||||
}
|
||||
),
|
||||
model_errors=None,
|
||||
text=None,
|
||||
)
|
||||
]
|
||||
|
||||
model = MockTokensModel(prediction=None)
|
||||
evaluator = Evaluator(model=model)
|
||||
scores = evaluator.calculate_score(evaluated, beta=2.5)
|
||||
|
||||
pii_tp = (
|
||||
evaluated[0].results[("PERSON", "PERSON")]
|
||||
+ evaluated[0].results[("ORG", "ORG")]
|
||||
+ evaluated[0].results[("GPE", "GPE")]
|
||||
+ evaluated[0].results[("ORG", "GPE")]
|
||||
+ evaluated[0].results[("ORG", "PERSON")]
|
||||
+ evaluated[0].results[("GPE", "ORG")]
|
||||
+ evaluated[0].results[("GPE", "PERSON")]
|
||||
+ evaluated[0].results[("PERSON", "GPE")]
|
||||
+ evaluated[0].results[("PERSON", "ORG")]
|
||||
)
|
||||
|
||||
pii_fp = (
|
||||
evaluated[0].results[("O", "PERSON")]
|
||||
+ evaluated[0].results[("O", "GPE")]
|
||||
+ evaluated[0].results[("O", "ORG")]
|
||||
)
|
||||
|
||||
pii_fn = (
|
||||
evaluated[0].results[("PERSON", "O")]
|
||||
+ evaluated[0].results[("GPE", "O")]
|
||||
+ evaluated[0].results[("ORG", "O")]
|
||||
)
|
||||
|
||||
assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
|
||||
assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)
|
||||
|
||||
|
||||
def test_dataset_to_metric_identity_model():
|
||||
import os
|
||||
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(
|
||||
"{}/data/generated_small.txt".format(dir_path), length=10
|
||||
)
|
||||
|
||||
model = IdentityTokensMockModel()
|
||||
evaluator = Evaluator(model=model)
|
||||
evaluation_results = evaluator.evaluate_all(input_samples)
|
||||
metrics = evaluator.calculate_score(evaluation_results)
|
||||
|
||||
assert metrics.pii_precision == 1
|
||||
assert metrics.pii_recall == 1
|
||||
|
||||
|
||||
def test_dataset_to_metric_50_50_model():
|
||||
import os
|
||||
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(
|
||||
"{}/data/generated_small.txt".format(dir_path), length=100
|
||||
)
|
||||
|
||||
# Replace 50% of the predictions with a list of "O"
|
||||
model = FiftyFiftyIdentityTokensMockModel()
|
||||
evaluator = Evaluator(model=model, entities_to_keep=["PERSON"])
|
||||
evaluation_results = evaluator.evaluate_all(input_samples)
|
||||
metrics = evaluator.calculate_score(evaluation_results)
|
||||
|
||||
print(metrics.pii_precision)
|
||||
print(metrics.pii_recall)
|
||||
print(metrics.pii_f)
|
||||
|
||||
assert metrics.pii_precision == 1
|
||||
assert metrics.pii_recall < 0.75
|
||||
assert metrics.pii_recall > 0.25
|
|
@ -1,15 +1,18 @@
import pytest

from presidio_evaluator.evaluation import Evaluator

try:
    from flair.models import SequenceTagger
except:
    ImportError("Flair is not installed by default")

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.flair_evaluator import FlairEvaluator
from presidio_evaluator.models.flair_model import FlairModel

import numpy as np


# no-unit because flair is not a dependency by default
@pytest.mark.skip(reason="Flair not installed by default")
def test_flair_simple():

@ -22,9 +25,10 @@ def test_flair_simple():

    model = SequenceTagger.load("ner-ontonotes-fast")  # .load('ner')

    flair_evaluator = FlairEvaluator(model=model, entities_to_keep=["PERSON"])
    evaluation_results = flair_evaluator.evaluate_all(input_samples)
    scores = flair_evaluator.calculate_score(evaluation_results)
    flair_model = FlairModel(model=model, entities_to_keep=["PERSON"])
    evaluator = Evaluator(model=flair_model)
    evaluation_results = evaluator.evaluate_all(input_samples)
    scores = evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(
        scores.pii_precision, scores.entity_precision_dict["PERSON"]
@@ -1,271 +0,0 @@
import numpy as np
import pytest

from presidio_evaluator import InputSample, EvaluationResult
from presidio_evaluator.data_generator import read_synth_dataset
from tests.mocks import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, MockTokensModel


def test_evaluator_simple():
    prediction = ["O", "O", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample(full_text="I am the walrus",
                         masked="I am the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    final_evaluation = model.calculate_score(
        [evaluated])

    assert final_evaluation.pii_precision == 1
    assert final_evaluation.pii_recall == 1


def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
    prediction = ["O", "O", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction,
                            entities_to_keep=['SPACESHIP'])

    sample = InputSample(full_text="I am the walrus",
                         masked="I am the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    assert evaluated.results[("O", "O")] == 4


def test_evaluate_same_entity_correct_statistics():
    prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample(full_text="I dog the walrus",
                         masked="I [ANIMAL] the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluation_result = model.evaluate_sample(sample)
    assert evaluation_result.results[("O", "O")] == 2
    assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
    assert evaluation_result.results[("O", "ANIMAL")] == 1


def test_evaluate_multiple_entities_to_keep_correct_statistics():
    prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, labeling_scheme='BIO',
                            entities_to_keep=['ANIMAL', 'PLANT', 'SPACESHIP'])
    sample = InputSample(full_text="I dog the walrus",
                         masked="I [ANIMAL] the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluation_result = model.evaluate_sample(sample)
    assert evaluation_result.results[("O", "O")] == 2
    assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
    assert evaluation_result.results[("O", "ANIMAL")] == 1


def test_evaluate_multiple_tokens_correct_statistics():
    prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the",
                     "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O",
                   "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score(
        [evaluated])

    assert evaluation.pii_precision == 1
    assert evaluation.pii_recall == 1


def test_evaluate_multiple_tokens_partial_match_correct_statistics():
    prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score(
        [evaluated])

    assert evaluation.pii_precision == 1
    assert evaluation.pii_recall == 4 / 6


def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
    prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])

    sample = InputSample("I am the walrus amaericanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score(
        [evaluated])

    assert np.isnan(evaluation.pii_precision)
    assert evaluation.pii_recall == 0


def test_evaluate_multiple_examples_correct_statistics():
    prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
    model = MockTokensModel(prediction=prediction,
                            labeling_scheme='BILOU',
                            entities_to_keep=['PERSON'])
    input_sample = InputSample("My name is Raphael or David", masked=None,
                               spans=None)
    input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
    input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]

    evaluated = model.evaluate_all(
        [input_sample, input_sample, input_sample, input_sample])
    scores = model.calculate_score(
        evaluated)
    assert scores.pii_precision == 0.5
    assert scores.pii_recall == 0.5


def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
    prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
    model = MockTokensModel(prediction=prediction,
                            labeling_scheme='BILOU',
                            entities_to_keep=['PERSON', 'TENNIS_PLAYER'])
    input_sample = InputSample("My name is Raphael or David", masked=None,
                               spans=None)
    input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
    input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]

    evaluated = model.evaluate_all(
        [input_sample, input_sample, input_sample, input_sample])
    scores = model.calculate_score(evaluated)
    assert scores.pii_precision == 1
    assert scores.pii_recall == 1


def test_confusion_matrix_correct_metrics():
    from collections import Counter

    evaluated = [EvaluationResult(results=Counter({
        ('O', 'O'): 150,
        ('O', 'PERSON'): 30,
        ('O', 'COMPANY'): 30,
        ('PERSON', 'PERSON'): 40,
        ('COMPANY', 'COMPANY'): 40,
        ('PERSON', 'COMPANY'): 10,
        ('COMPANY', 'PERSON'): 10,
        ('PERSON', 'O'): 30,
        ('COMPANY', 'O'): 30}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None,
                            entities_to_keep=['PERSON', 'COMPANY'])

    scores = model.calculate_score(evaluated, beta=2.5)

    assert scores.pii_precision == 0.625
    assert scores.pii_recall == 0.625
    assert scores.entity_recall_dict['PERSON'] == 0.5
    assert scores.entity_precision_dict['PERSON'] == 0.5
    assert scores.entity_recall_dict['COMPANY'] == 0.5
    assert scores.entity_precision_dict['COMPANY'] == 0.5


def test_confusion_matrix_2_correct_metrics():
    from collections import Counter

    evaluated = [EvaluationResult(results=Counter(
        {('O', 'O'): 65467,
         ('O', 'ORG'): 4189,
         ('GPE', 'O'): 3370,
         ('PERSON', 'PERSON'): 2024,
         ('GPE', 'PERSON'): 1488,
         ('GPE', 'GPE'): 1033,
         ('O', 'GPE'): 964,
         ('ORG', 'ORG'): 914,
         ('O', 'PERSON'): 834,
         ('GPE', 'ORG'): 401,
         ('PERSON', 'ORG'): 35,
         ('PERSON', 'O'): 33,
         ('ORG', 'O'): 8,
         ('PERSON', 'GPE'): 5,
         ('ORG', 'PERSON'): 1}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None)

    scores = model.calculate_score(evaluated, beta=2.5)

    pii_tp = evaluated[0].results[('PERSON', 'PERSON')] + \
             evaluated[0].results[('ORG', 'ORG')] + \
             evaluated[0].results[('GPE', 'GPE')] + \
             evaluated[0].results[('ORG', 'GPE')] + \
             evaluated[0].results[('ORG', 'PERSON')] + \
             evaluated[0].results[('GPE', 'ORG')] + \
             evaluated[0].results[('GPE', 'PERSON')] + \
             evaluated[0].results[('PERSON', 'GPE')] + \
             evaluated[0].results[('PERSON', 'ORG')]

    pii_fp = evaluated[0].results[('O', 'PERSON')] + \
             evaluated[0].results[('O', 'GPE')] + \
             evaluated[0].results[('O', 'ORG')]

    pii_fn = evaluated[0].results[('PERSON', 'O')] + \
             evaluated[0].results[('GPE', 'O')] + \
             evaluated[0].results[('ORG', 'O')]

    assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
    assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)


def test_dataset_to_metric_identity_model():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        "{}/data/generated_small.txt".format(dir_path), length=10)

    model = IdentityTokensMockModel()

    evaluation_results = model.evaluate_all(input_samples)
    metrics = model.calculate_score(
        evaluation_results)

    assert metrics.pii_precision == 1
    assert metrics.pii_recall == 1


def test_dataset_to_metric_50_50_model():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        "{}/data/generated_small.txt".format(dir_path), length=100)

    # Replace 50% of the predictions with a list of "O"
    model = FiftyFiftyIdentityTokensMockModel(entities_to_keep='PERSON')

    evaluation_results = model.evaluate_all(input_samples)
    metrics = model.calculate_score(
        evaluation_results)

    print(metrics.pii_precision)
    print(metrics.pii_recall)
    print(metrics.pii_f)

    assert metrics.pii_precision == 1
    assert metrics.pii_recall < 0.75
    assert metrics.pii_recall > 0.25
@@ -2,27 +2,8 @@ import pytest

from presidio_evaluator import InputSample, Span
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_analyzer_evaluator import PresidioAnalyzerEvaluator

# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
entities_mapping = {
    "PERSON": "PERSON",
    "EMAIL": "EMAIL_ADDRESS",
    "CREDIT_CARD": "CREDIT_CARD",
    "FIRST_NAME": "PERSON",
    "PHONE_NUMBER": "PHONE_NUMBER",
    "BIRTHDAY": "DATE_TIME",
    "DATE": "DATE_TIME",
    "DOMAIN": "DOMAIN",
    "CITY": "LOCATION",
    "ADDRESS": "LOCATION",
    "IBAN": "IBAN_CODE",
    "URL": "DOMAIN_NAME",
    "US_SSN": "US_SSN",
    "IP_ADDRESS": "IP_ADDRESS",
    "ORGANIZATION": "ORG",
    "O": "O",
}
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_analyzer_wrapper import PresidioAnalyzerWrapper


class GeneratedTextTestCase:

@@ -54,8 +35,7 @@ analyzer_test_generate_text_testdata = [


def test_analyzer_simple_input():
    model = PresidioAnalyzerEvaluator(entities_to_keep=["PERSON"])

    model = PresidioAnalyzerWrapper(entities_to_keep=["PERSON"])
    sample = InputSample(
        full_text="My name is Mike",
        masked="My name is [PERSON]",

@@ -63,8 +43,11 @@ def test_analyzer_simple_input():
        create_tags_from_span=True,
    )

    evaluated = model.evaluate_sample(sample)
    metrics = model.calculate_score([evaluated])
    prediction = model.predict(sample)
    evaluator = Evaluator(model=model)

    evaluated = evaluator.evaluate_sample(sample, prediction)
    metrics = evaluator.calculate_score([evaluated])

    assert metrics.pii_precision == 1
    assert metrics.pii_recall == 1

@@ -89,13 +72,14 @@ def test_analyzer_with_generated_text(test_input, acceptance_threshold):
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(test_input.format(dir_path))

    updated_samples = PresidioAnalyzerEvaluator.align_input_samples_to_presidio_analyzer(
        input_samples=input_samples, entities_mapping=entities_mapping
    updated_samples = Evaluator.align_input_samples_to_presidio_analyzer(
        input_samples=input_samples, entities_mapping=PresidioAnalyzerWrapper.presidio_entities_map
    )

    analyzer = PresidioAnalyzerEvaluator()
    evaluated_samples = analyzer.evaluate_all(updated_samples)
    scores = analyzer.calculate_score(evaluation_results=evaluated_samples)
    analyzer = PresidioAnalyzerWrapper()
    evaluator = Evaluator(model=analyzer)
    evaluated_samples = evaluator.evaluate_all(updated_samples)
    scores = evaluator.calculate_score(evaluation_results=evaluated_samples)

    assert acceptance_threshold <= scores.pii_precision
    assert acceptance_threshold <= scores.pii_recall
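The same pattern applies when scoring a local presidio-analyzer install: dataset entities are first mapped to Presidio's entity names via `Evaluator.align_input_samples_to_presidio_analyzer` and the wrapper's built-in `presidio_entities_map`, then evaluated through `Evaluator`. A minimal sketch, assuming presidio-analyzer is installed and `data/generated_small.txt` exists locally:

```python
# Sketch of evaluating a local presidio-analyzer install with the updated API.
# Assumptions: presidio-analyzer is installed and "data/generated_small.txt" exists.
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.presidio_analyzer_wrapper import PresidioAnalyzerWrapper

input_samples = read_synth_dataset("data/generated_small.txt")

# Map dataset entity names (e.g. FIRST_NAME, BIRTHDAY) to Presidio's entity names
aligned_samples = Evaluator.align_input_samples_to_presidio_analyzer(
    input_samples=input_samples,
    entities_mapping=PresidioAnalyzerWrapper.presidio_entities_map,
)

analyzer = PresidioAnalyzerWrapper()
evaluator = Evaluator(model=analyzer)
results = evaluator.evaluate_all(aligned_samples)
scores = evaluator.calculate_score(evaluation_results=results)
print(scores.pii_precision, scores.pii_recall)
```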
@@ -8,19 +8,16 @@ import pandas as pd


@pytest.mark.parametrize(
    # fmt: off
    "text, entity1, entity2, start1, end1, start2, end2",
    [
        (
            "Hi I live in South Africa and my name is Toma",
            "LOCATION",
            "PERSON",
            13,
            25,
            41,
            45,
            "LOCATION", "PERSON", 13, 25, 41, 45,
        ),
        ("Africa is my continent, James", "LOCATION", "PERSON", 0, 6, 24, 29,),
    ],
    # fmt: on
)
def test_presidio_perturb_two_entities(
    text, entity1, entity2, start1, end1, start2, end2

@@ -51,15 +48,13 @@ def test_entity_translation():
        RecognizerResult(entity_type="EMAIL_ADDRESS", start=12, end=27, score=0.5)
    ]

    presidio_perturb = PresidioPerturb(
        fake_pii_df=get_mock_fake_df(), entity_dict={"EMAIL_ADDRESS": "EMAIL"}
    )
    presidio_perturb = PresidioPerturb(fake_pii_df=get_mock_fake_df())
    fake_df = presidio_perturb.fake_pii
    perturbations = presidio_perturb.perturb(
        original_text=text, presidio_response=presidio_response, count=1
    )

    assert fake_df["EMAIL"].str.lower()[0] in perturbations[0]
    assert fake_df["EMAIL_ADDRESS"].str.lower()[0] in perturbations[0]


def test_subset_perturbation():

@@ -76,7 +71,7 @@ def test_subset_perturbation():
            "NameSet": ["Hebrew", "English"],
        }
    )
    ignore_types = ("DATE", "LOCATION", "ADDRESS", "GENDER")
    ignore_types = {"DATE", "LOCATION", "ADDRESS", "GENDER"}

    presidio_perturb = PresidioPerturb(fake_pii_df=fake_df, ignore_types=ignore_types)
@@ -1,8 +1,8 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_recognizer_evaluator import score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest

from presidio_analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

# test case parameters for tests with dataset which was previously generated.
class GeneratedTextTestCase:

@@ -13,8 +13,12 @@ class GeneratedTextTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.test_input, self.acceptance_threshold,
                            id=self.test_name, marks=self.marks)
        return pytest.param(
            self.test_input,
            self.acceptance_threshold,
            id=self.test_name,
            marks=self.marks,
        )


# generated-text test cases

@@ -24,35 +28,39 @@ cc_test_generate_text_testdata = [
        test_name="small-set",
        test_input="{}/data/generated_small.txt",
        acceptance_threshold=1,
        marks=pytest.mark.none
        marks=pytest.mark.none,
    ),
    # large set fixture which expects all type results. marked as "slow"
    GeneratedTextTestCase(
        test_name="large_set",
        test_input="{}/data/generated_large.txt",
        acceptance_threshold=1,
        marks=pytest.mark.slow
    )
        marks=pytest.mark.slow,
    ),
]


# credit card recognizer tests on generated data
@pytest.mark.parametrize("test_input,acceptance_threshold",
                         [testcase.to_pytest_param()
                          for testcase in cc_test_generate_text_testdata])
@pytest.mark.parametrize(
    "test_input,acceptance_threshold",
    [testcase.to_pytest_param() for testcase in cc_test_generate_text_testdata],
)
def test_credit_card_recognizer_with_generated_text(test_input, acceptance_threshold):
    """
        Test credit card recognizer with a generated dataset text file
        :param test_input: input text file location
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test credit card recognizer with a generated dataset text file
    :param test_input: input text file location
    :param acceptance_threshold: minimim precision/recall
    allowed for tests to pass
    """

    # read test input from generated file
    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        test_input.format(dir_path))
    input_samples = read_synth_dataset(test_input.format(dir_path))
    scores = score_presidio_recognizer(
        CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
        recognizer=CreditCardRecognizer(),
        entities_to_keep=["CREDIT_CARD"],
        input_samples=input_samples,
    )
    assert acceptance_threshold <= scores.pii_f
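`score_presidio_recognizer` now lives in `presidio_evaluator.evaluation.scorers` and is called with keyword arguments, as the updated test shows. A minimal end-to-end sketch, assuming presidio-analyzer is installed and `data/generated_small.txt` exists locally:

```python
# Sketch of scoring a single recognizer with the relocated scorer function.
# Assumptions: presidio-analyzer is installed and "data/generated_small.txt" exists.
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

input_samples = read_synth_dataset("data/generated_small.txt")

scores = score_presidio_recognizer(
    recognizer=CreditCardRecognizer(),
    entities_to_keep=["CREDIT_CARD"],
    input_samples=input_samples,
)
print(scores.pii_f)
```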
@@ -1,15 +1,25 @@
from presidio_evaluator.data_generator import generate
from presidio_evaluator.presidio_recognizer_evaluator import \
    score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pytest
import numpy as np

from presidio_analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer


# test case parameters for tests with dataset generated from a template and csv values
class TemplateTextTestCase:
    def __init__(self, test_name, pii_csv, utterances, dictionary_path,
                 num_of_examples, acceptance_threshold, marks):
    """
    Test case parameters for tests with dataset generated from a template and csv values
    """
    def __init__(
        self,
        test_name,
        pii_csv,
        utterances,
        dictionary_path,
        num_of_examples,
        acceptance_threshold,
        marks,
    ):
        self.test_name = test_name
        self.pii_csv = pii_csv
        self.utterances = utterances

@@ -19,9 +29,15 @@ class TemplateTextTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.pii_csv, self.utterances, self.dictionary_path,
                            self.num_of_examples, self.acceptance_threshold,
                            id=self.test_name, marks=self.marks)
        return pytest.param(
            self.pii_csv,
            self.utterances,
            self.dictionary_path,
            self.num_of_examples,
            self.acceptance_threshold,
            id=self.test_name,
            marks=self.marks,
        )


# template-dataset test cases

@@ -34,46 +50,52 @@ cc_test_template_testdata = [
        dictionary_path="{}/data/Dictionary_test.csv",
        num_of_examples=100,
        acceptance_threshold=0.9,
        marks=pytest.mark.slow
        marks=pytest.mark.slow,
    )
]


# credit card recognizer tests on template-generates data
@pytest.mark.parametrize("pii_csv, "
                         "utterances, "
                         "dictionary_path, "
                         "num_of_examples, "
                         "acceptance_threshold",
                         [testcase.to_pytest_param()
                          for testcase in cc_test_template_testdata])
def test_credit_card_recognizer_with_template(pii_csv, utterances,
                                              dictionary_path,
                                              num_of_examples,
                                              acceptance_threshold):
@pytest.mark.parametrize(
    "pii_csv, "
    "utterances, "
    "dictionary_path, "
    "num_of_examples, "
    "acceptance_threshold",
    [testcase.to_pytest_param() for testcase in cc_test_template_testdata],
)
def test_credit_card_recognizer_with_template(
    pii_csv, utterances, dictionary_path, num_of_examples, acceptance_threshold
):
    """
        Test credit card recognizer with a dataset generated from
        template and a CSV values file
        :param pii_csv: input csv file location
        :param utterances: template file location
        :param dictionary_path: dictionary/vocabulary file location
        :param num_of_examples: number of samples to be used from dataset
        to test
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test credit card recognizer with a dataset generated from
    template and a CSV values file
    :param pii_csv: input csv file location
    :param utterances: template file location
    :param dictionary_path: dictionary/vocabulary file location
    :param num_of_examples: number of samples to be used from dataset
    to test
    :param acceptance_threshold: minimum precision/recall
    allowed for tests to pass
    """

    # read template and CSV files
    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))

    input_samples = generate(fake_pii_csv=pii_csv.format(dir_path),
                             utterances_file=utterances.format(dir_path),
                             dictionary_path=dictionary_path.format(dir_path),
                             lower_case_ratio=0.5,
                             num_of_examples=num_of_examples)
    input_samples = generate(
        fake_pii_csv=pii_csv.format(dir_path),
        utterances_file=utterances.format(dir_path),
        dictionary_path=dictionary_path.format(dir_path),
        lower_case_ratio=0.5,
        num_of_examples=num_of_examples,
    )

    scores = score_presidio_recognizer(
        CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
        recognizer=CreditCardRecognizer(),
        entities_to_keep=["CREDIT_CARD"],
        input_samples=input_samples,
    )
    if not np.isnan(scores.pii_f):
        assert acceptance_threshold <= scores.pii_f
@@ -1,18 +1,32 @@
from presidio_evaluator.data_generator import FakeDataGenerator
from presidio_evaluator.presidio_recognizer_evaluator import \
    score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer
import pandas as pd
import pytest
import numpy as np

from presidio_analyzer import Pattern, PatternRecognizer

# test case parameters for tests with dataset generated from a template and
# two csv value files, one containing the common-entities and another one with custom entities

class PatternRecognizerTestCase:
    def __init__(self, test_name, entity_name, pattern, score, pii_csv, ext_csv,
                 utterances, dictionary_path, num_of_examples, acceptance_threshold,
                 max_mistakes_number, marks):
    """
    Test case parameters for tests with dataset generated from a template and
    two csv value files, one containing the common-entities and another one with custom entities.
    """
    def __init__(
        self,
        test_name,
        entity_name,
        pattern,
        score,
        pii_csv,
        ext_csv,
        utterances,
        dictionary_path,
        num_of_examples,
        acceptance_threshold,
        max_mistakes_number,
        marks,
    ):
        self.test_name = test_name
        self.entity_name = entity_name
        self.pattern = pattern

@@ -27,12 +41,20 @@ class PatternRecognizerTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.pii_csv, self.ext_csv, self.utterances,
                            self.dictionary_path,
                            self.entity_name, self.pattern, self.score,
                            self.num_of_examples, self.acceptance_threshold,
                            self.max_mistakes_number, id=self.test_name,
                            marks=self.marks)
        return pytest.param(
            self.pii_csv,
            self.ext_csv,
            self.utterances,
            self.dictionary_path,
            self.entity_name,
            self.pattern,
            self.score,
            self.num_of_examples,
            self.acceptance_threshold,
            self.max_mistakes_number,
            id=self.test_name,
            marks=self.marks,
        )


# template-dataset test cases

@@ -42,7 +64,7 @@ rocket_test_template_testdata = [
    PatternRecognizerTestCase(
        test_name="rocket-no-errors",
        entity_name="ROCKET",
        pattern=r'\W*(rocket)\W*',
        pattern=r"\W*(rocket)\W*",
        score=0.8,
        pii_csv="{}/data/FakeNameGenerator.com_100.csv",
        ext_csv="{}/data/FakeRocketGenerator.csv",

@@ -51,14 +73,14 @@ rocket_test_template_testdata = [
        num_of_examples=100,
        acceptance_threshold=1,
        max_mistakes_number=0,
        marks=pytest.mark.slow
        marks=pytest.mark.slow,
    ),
    # large dataset fixture. marked as slow
    # all input is correct, test is conclusive
    PatternRecognizerTestCase(
        test_name="rocket-all-errors",
        entity_name="ROCKET",
        pattern=r'\W*(rocket)\W*',
        pattern=r"\W*(rocket)\W*",
        score=0.8,
        pii_csv="{}/data/FakeNameGenerator.com_100.csv",
        ext_csv="{}/data/FakeRocketErrorsGenerator.csv",

@@ -67,14 +89,14 @@ rocket_test_template_testdata = [
        num_of_examples=100,
        acceptance_threshold=0,
        max_mistakes_number=100,
        marks=pytest.mark.slow
        marks=pytest.mark.slow,
    ),
    # large dataset fixture. marked as slow
    # some input is correct some is not, test is inconclusive
    PatternRecognizerTestCase(
        test_name="rocket-some-errors",
        entity_name="ROCKET",
        pattern=r'\W*(rocket)\W*',
        pattern=r"\W*(rocket)\W*",
        score=0.8,
        pii_csv="{}/data/FakeNameGenerator.com_100.csv",
        ext_csv="{}/data/FakeRocket50PercentErrorsGenerator.csv",

@@ -83,8 +105,8 @@ rocket_test_template_testdata = [
        num_of_examples=100,
        acceptance_threshold=0.3,
        max_mistakes_number=70,
        marks=[pytest.mark.slow, pytest.mark.inconclusive]
    )
        marks=[pytest.mark.slow, pytest.mark.inconclusive],
    ),
]


@@ -92,30 +114,39 @@ rocket_test_template_testdata = [
    "pii_csv, ext_csv, utterances, dictionary_path, "
    "entity_name, pattern, score, num_of_examples, "
    "acceptance_threshold, max_mistakes_number",
    [testcase.to_pytest_param()
     for testcase in rocket_test_template_testdata])
def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
                            entity_name, pattern,
                            score, num_of_examples, acceptance_threshold,
                            max_mistakes_number):
    [testcase.to_pytest_param() for testcase in rocket_test_template_testdata],
)
def test_pattern_recognizer(
    pii_csv,
    ext_csv,
    utterances,
    dictionary_path,
    entity_name,
    pattern,
    score,
    num_of_examples,
    acceptance_threshold,
    max_mistakes_number,
):
    """
        Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
        and another CSV values file with a custom entity
        :param pii_csv: input csv file location with the common entities
        :param ext_csv: input csv file location with custom entities
        :param utterances: template file location
        :param dictionary_path: vocabulary/dictionary file location
        :param entity_name: custom entity name
        :param pattern: recognizer pattern
        :param num_of_examples: number of samples to be used from dataset to test
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
    and another CSV values file with a custom entity
    :param pii_csv: input csv file location with the common entities
    :param ext_csv: input csv file location with custom entities
    :param utterances: template file location
    :param dictionary_path: vocabulary/dictionary file location
    :param entity_name: custom entity name
    :param pattern: recognizer pattern
    :param num_of_examples: number of samples to be used from dataset to test
    :param acceptance_threshold: minimum precision/recall
    allowed for tests to pass
    """

    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))
    dfpii = pd.read_csv(pii_csv.format(dir_path), encoding='utf-8')
    dfext = pd.read_csv(ext_csv.format(dir_path), encoding='utf-8')
    dfpii = pd.read_csv(pii_csv.format(dir_path), encoding="utf-8")
    dfext = pd.read_csv(ext_csv.format(dir_path), encoding="utf-8")
    dictionary_path = dictionary_path.format(dir_path)
    ext_column_name = dfext.columns[0]

@@ -127,18 +158,23 @@ def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
    dfpii[ext_column_name] = [get_from_ext(i) for i in range(0, dfpii.shape[0])]

    # generate examples
    generator = FakeDataGenerator(fake_pii_csv_file=dfpii,
                                  utterances_file=utterances.format(dir_path),
                                  dictionary_path=dictionary_path)
    generator = FakeDataGenerator(
        fake_pii_df=dfpii,
        templates=utterances.format(dir_path),
        dictionary_path=dictionary_path,
    )
    examples = generator.sample_examples(num_of_examples)

    pattern = Pattern("test pattern", pattern, score)
    pattern_recognizer = PatternRecognizer(entity_name,
                                           name="test recognizer",
                                           patterns=[pattern])
    pattern_recognizer = PatternRecognizer(
        entity_name, name="test recognizer", patterns=[pattern]
    )

    scores = score_presidio_recognizer(
        pattern_recognizer, [entity_name], examples)
        recognizer=pattern_recognizer,
        entities_to_keep=[entity_name],
        input_samples=examples,
    )
    if not np.isnan(scores.pii_f):
        assert acceptance_threshold <= scores.pii_f
        assert max_mistakes_number >= len(scores.model_errors)
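The pattern-recognizer test above combines template-based data generation with an ad-hoc `PatternRecognizer`. A simplified sketch of that flow under stated assumptions: the CSV, template and dictionary file paths are illustrative, and `pii_f` may be NaN if the custom entity never appears in the generated samples, hence the same guard the test uses:

```python
# Sketch of generating template-based samples and scoring a custom pattern recognizer.
# Assumptions: the CSV, template and dictionary files exist locally and follow the
# formats used by this repo's tests; the regex and entity name are illustrative.
import numpy as np
import pandas as pd
from presidio_analyzer import Pattern, PatternRecognizer

from presidio_evaluator.data_generator import FakeDataGenerator
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

fake_pii_df = pd.read_csv("data/FakeNameGenerator.com_100.csv", encoding="utf-8")

generator = FakeDataGenerator(
    fake_pii_df=fake_pii_df,
    templates="data/templates.txt",  # illustrative template file name
    dictionary_path="data/Dictionary_test.csv",
)
examples = generator.sample_examples(100)

pattern = Pattern("rocket pattern", r"\W*(rocket)\W*", 0.8)
rocket_recognizer = PatternRecognizer(
    "ROCKET", name="rocket recognizer", patterns=[pattern]
)

scores = score_presidio_recognizer(
    recognizer=rocket_recognizer,
    entities_to_keep=["ROCKET"],
    input_samples=examples,
)
if not np.isnan(scores.pii_f):
    print(scores.pii_f)
```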
@@ -1,5 +1,6 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.spacy_evaluator import SpacyEvaluator
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models.spacy_model import SpacyModel
import numpy as np


@@ -8,9 +9,10 @@ def test_spacy_simple():
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
    evaluation_results = spacy_evaluator.evaluate_all(input_samples)
    scores = spacy_evaluator.calculate_score(evaluation_results)
    spacy_model = SpacyModel(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
    evaluator = Evaluator(model=spacy_model)
    evaluation_results = evaluator.evaluate_all(input_samples)
    scores = evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
@@ -1,12 +1,14 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.presidio_recognizer_evaluator import \
    score_presidio_recognizer
from presidio_evaluator.evaluation.scorers import score_presidio_recognizer

import pytest
from presidio_analyzer.predefined_recognizers.spacy_recognizer import SpacyRecognizer

# test case parameters for tests with dataset which was previously generated.

class GeneratedTextTestCase:
    """
    Test case parameters for tests with dataset which was previously generated.
    """
    def __init__(self, test_name, test_input, acceptance_threshold, marks):
        self.test_name = test_name
        self.test_input = test_input

@@ -14,8 +16,12 @@ class GeneratedTextTestCase:
        self.marks = marks

    def to_pytest_param(self):
        return pytest.param(self.test_input, self.acceptance_threshold,
                            id=self.test_name, marks=self.marks)
        return pytest.param(
            self.test_input,
            self.acceptance_threshold,
            id=self.test_name,
            marks=self.marks,
        )


# generated-text test cases

@@ -25,35 +31,37 @@ cc_test_generate_text_testdata = [
        test_name="small-set",
        test_input="{}/data/generated_small.txt",
        acceptance_threshold=0.5,
        marks=pytest.mark.inconclusive
        marks=pytest.mark.inconclusive,
    ),
    # large dataset - test is slow and inconclusive
    GeneratedTextTestCase(
        test_name="large-set",
        test_input="{}/data/generated_large.txt",
        acceptance_threshold=0.5,
        marks=pytest.mark.slow
    )
        marks=pytest.mark.slow,
    ),
]


# credit card recognizer tests on generated data
@pytest.mark.parametrize("test_input,acceptance_threshold",
                         [testcase.to_pytest_param() for testcase in
                          cc_test_generate_text_testdata])
@pytest.mark.parametrize(
    "test_input,acceptance_threshold",
    [testcase.to_pytest_param() for testcase in cc_test_generate_text_testdata],
)
def test_spacy_recognizer_with_generated_text(test_input, acceptance_threshold):
    """
        Test spacy recognizer with a generated dataset text file
        :param test_input: input text file location
        :param acceptance_threshold: minimim precision/recall
        allowed for tests to pass
    Test spacy recognizer with a generated dataset text file
    :param test_input: input text file location
    :param acceptance_threshold: minimim precision/recall
    allowed for tests to pass
    """

    # read test input from generated file
    import os

    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        test_input.format(dir_path))
    input_samples = read_synth_dataset(test_input.format(dir_path))
    scores = score_presidio_recognizer(
        SpacyRecognizer(), ['PERSON'], input_samples, True)
        SpacyRecognizer(), ["PERSON"], input_samples, with_nlp_artifacts=True
    )
    assert acceptance_threshold <= scores.pii_f
@@ -2,8 +2,9 @@ from presidio_evaluator import span_to_tag

BILOU_SCHEME = "BILOU"
BIO_SCHEME = "BIO"
IO_SCHEME = "IO"


# fmt: off
def test_span_to_bio_multiple_tokens():
    text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
    start = 14

@@ -166,8 +167,7 @@ def test_overlapping_entities_first_ends_in_mid_second():
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'US_PHONE_NUMBER', 'US_PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(IO_SCHEME, text, start, end, tag, scores)
    assert io == expected


@@ -180,8 +180,7 @@ def test_overlapping_entities_second_embedded_in_first_with_lower_score():
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'PHONE_NUMBER',
                'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
    assert io == expected


@@ -194,8 +193,7 @@ def test_overlapping_entities_second_embedded_in_first_has_higher_score():
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
    assert io == expected


@@ -207,6 +205,6 @@ def test_overlapping_entities_pyramid():
    tag = ["A1", "B2", "C3"]
    expected = ['O', 'O', 'O', 'O', 'O', 'A1', 'B2', 'C3', 'B2',
                'A1', 'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    io = span_to_tag(scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores)
    assert io == expected
# fmt: on
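The `span_to_tag` change replaces the old `io_tags_only` flag with an explicit IO scheme argument. A minimal sketch of the new call; the text, offsets and scores below are illustrative:

```python
# Sketch of the updated span_to_tag call: the IO scheme replaces io_tags_only=True.
# The text, span offsets and scores are illustrative examples.
from presidio_evaluator import span_to_tag

IO_SCHEME = "IO"

text = "My name is Raphael"
start = [11]          # character offset where the PERSON span starts
end = [18]            # character offset where the PERSON span ends
tag = ["PERSON"]
scores = [1.0]

io_tags = span_to_tag(
    scheme=IO_SCHEME, text=text, start=start, end=end, tag=tag, scores=scores
)
print(io_tags)  # one tag per token, e.g. ['O', 'O', 'O', 'PERSON']
```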