black and flake8-ing the entire code

omri374 2022-01-20 00:04:18 +02:00
Parent c9151c64c2
Commit 5655404a2c
27 changed files with 278 additions and 186 deletions

README.md
View file

@ -1,8 +1,10 @@
# Presidio-research
This package features data-science related tasks for developing new recognizers for [Presidio](https://github.com/microsoft/presidio).
It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models
This package features data-science related tasks for developing new recognizers for
[Presidio](https://github.com/microsoft/presidio).
It is used for the evaluation of the entire system,
as well as for evaluating specific PII recognizers or PII detection models.
In addition, it contains a fake data generator which creates fake sentences based on templates and fake PII.
## Who should use it?
- Anyone interested in **developing or evaluating a PII detection model**, an existing Presidio instance or a Presidio PII recognizer.
@ -42,72 +44,100 @@ Note that some dependencies (such as Flair and Stanza) are not installed to redu
See [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.
The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
The data generation process receives a file with templates,
e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values.
Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
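In code, the generation flow looks roughly like this (a minimal sketch modeled on the data generator's simple example; see its README for the full flow):
```python
from presidio_evaluator.data_generator import PresidioDataGenerator

sentence_templates = [
    "My name is {{name}}",
    "Please send it to {{address}}",
]

data_generator = PresidioDataGenerator()
fake_records = data_generator.generate_fake_data(
    templates=sentence_templates, n_samples=10
)

# Each generated record holds the fake text and the spans of the injected PII
for record in fake_records:
    print(record.fake, record.spans)
```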
- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
- For information on data generation/augmentation,
see the data generator [README](presidio_evaluator/data_generator/README.md).
- For an example for running the generation process, see [this notebook](notebooks/data%20generation/Generate%20data.ipynb).
- For an example of running the generation process,
see [this notebook](notebooks/1_Generate_data.ipynb).
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
Note that the generation process might not work off-the-shelf as we are not sharing the fake PII datasets and templates used in this analysis, due to copyright and other restrictions.
- For an understanding of the underlying fake PII data used,
see this [exploratory data analysis notebook](notebooks/2_PII_EDA.ipynb).
Once data is generated, it could be split into train/test/validation sets while ensuring that each template only exists in one set. See [this notebook for more details](notebooks/Split%20by%20pattern%20%23.ipynb).
Once data is generated, it can be split into train/test/validation sets
while ensuring that each template only exists in one set.
See [this notebook for more details](notebooks/3_Split_by_pattern_%23.ipynb).
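A hedged sketch of that split (not the notebook's exact code): group samples by `template_id`, assuming each `InputSample` carries the id of the template it was generated from, so that no template crosses sets.
```python
import random
from collections import defaultdict


def split_by_template(dataset, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split a dataset so that every template lands in exactly one subset."""
    by_template = defaultdict(list)
    for sample in dataset:
        by_template[sample.template_id].append(sample)

    template_ids = list(by_template)
    random.Random(seed).shuffle(template_ids)

    cut1 = int(ratios[0] * len(template_ids))
    cut2 = int((ratios[0] + ratios[1]) * len(template_ids))
    groups = (template_ids[:cut1], template_ids[cut1:cut2], template_ids[cut2:])
    return [[s for tid in ids for s in by_template[tid]] for ids in groups]
```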
## 2. Data representation
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [data_objects.py](presidio_evaluator/data_objects.py).
In order to standardize the process,
we use specific data objects that hold all the information needed for generating,
analyzing, modeling and evaluating data and models. Specifically,
see [data_objects.py](presidio_evaluator/data_objects.py).
## 3. Recognizer evaluation
The standardized structure, `List[InputSample]`, can be translated into different formats:
- CONLL
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [Evaluator](presidio_evaluator/evaluation/evaluator.py) class. It provides a structured way of evaluating models and recognizers.
```
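For example, mirroring the pattern used in the evaluation notebooks (a minimal sketch; it assumes a default Presidio `AnalyzerEngine` can be instantiated locally):
```python
from copy import deepcopy

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioAnalyzerWrapper

dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")

model = PresidioAnalyzerWrapper()
evaluator = Evaluator(model=model)
evaluation_results = evaluator.evaluate_all(deepcopy(dataset))
results = evaluator.calculate_score(evaluation_results)
print(results)  # aggregated precision/recall
```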
### Ready model / engine wrappers
- spaCy v3
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
```
Some evaluators were developed for analysis and reference. These include:
- Flair
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)
```
#### Presidio analyzer evaluation
- json
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")
```
Allows you to evaluate an existing Presidio instance. [See this notebook for details](notebooks/Evaluate%20Presidio%20Analyzer.ipynb).
## 3. PII models evaluation
#### One recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model,
or a specific PII recognizer for precision and recall.
Evaluate one specific recognizer for precision and recall.
Similar to the analyzer evaluation, just focusing on one type of PII recognizer.
See [presidio_recognizer_wrapper.py](presidio_evaluator/models/presidio_recognizer_wrapper.py).
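For instance, wrapping a single built-in recognizer could look like the following (a hedged sketch; the exact constructor arguments are documented in the wrapper itself):
```python
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator
from presidio_evaluator.models import PresidioRecognizerWrapper

dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")

model = PresidioRecognizerWrapper(
    recognizer=CreditCardRecognizer(),
    nlp_engine=NlpEngineProvider().create_engine(),
    entities_to_keep=["CREDIT_CARD"],
)
evaluator = Evaluator(model=model)
results = evaluator.calculate_score(evaluator.evaluate_all(dataset))
```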
#### Conditional Random Fields
### Examples:
- [Evaluate Presidio](notebooks/4_Evaluate_Presidio_Analyzer.ipynb)
- [Evaluate spaCy models](notebooks/models/Evaluate%20spacy%20models.ipynb)
- [Evaluate Stanza models](notebooks/models/Evaluate%20stanza%20models.ipynb)
- [Evaluate CRF models](notebooks/models/Evaluate%20crf%20models.ipynb)
- [Evaluate Flair models](notebooks/models/Evaluate%20flair%20models.ipynb)
To train a CRF on a new dataset, see [this notebook](notebooks/models/Train CRF.ipynb).
To evaluate a CRF model, see the [same notebook](notebooks/models/Train CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
#### spaCy based models
## 4. Training PII detection models
There are three ways of interacting with spaCy models:
### CRF
1. Evaluate an existing trained model
2. Train with pretrained embeddings
3. Fine-tune an existing spaCy model
To train a vanilla CRF on a new dataset, see [this notebook](notebooks/models/Train%20CRF.ipynb). To evaluate, see [this notebook](notebooks/models/Evaluate%20CRF%20models.ipynb).
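Under the hood the CRF is a `sklearn_crfsuite` model (the package is listed in `setup.py`); the training step itself is roughly the following toy sketch, with real feature extraction left to the notebook and to `CRFModel`:
```python
import sklearn_crfsuite

# Toy example: one sentence as a list of per-token feature dicts plus BIO tags.
X_train = [[{"word.lower()": "my"}, {"word.lower()": "name"},
            {"word.lower()": "is"}, {"word.lower()": "john"}]]
y_train = [["O", "O", "O", "B-PERSON"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```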
Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
### spaCy
##### Evaluate an existing spaCy model
To train a new spaCy model, first save the dataset in a spaCy format:
```python
# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
```
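From there, training can proceed with spaCy v3's config-driven CLI, e.g. `python -m spacy train config.cfg --paths.train dataset.spacy --paths.dev dataset.spacy` (see the spaCy documentation for generating a config file).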
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
To evaluate, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb)
#### Flair based models
To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
To train a Flair model, run:
### Flair
- To train Flair models, see this [helper class](presidio_evaluator/models/flair_train.py) or this snippet:
```python
from presidio_evaluator.models import FlairTrainer
train_samples = "../data/generated_train.json"
test_samples = "../data/generated_test.json"
val_samples = "../data/generated_validation.json"
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
@ -116,9 +146,8 @@ corpus = trainer.read_corpus("")
trainer.train(corpus)
```
To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).
# For more information
## For more information
- [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)
- [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)

View file

@ -1,2 +1,2 @@
0.0.2
0.1.0

View file

@ -27,9 +27,9 @@
"\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.max_columns', None) \n",
"pd.set_option('display.max_rows', None) \n",
"pd.set_option('display.max_colwidth', None)\n",
"pd.set_option(\"display.max_columns\", None)\n",
"pd.set_option(\"display.max_rows\", None)\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2"
@ -82,12 +82,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -153,8 +157,8 @@
"metadata": {},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -238,7 +242,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors,entity=['PHONE_NUMBER'])"
"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"PHONE_NUMBER\"])"
]
},
{
@ -259,7 +263,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in errors]"
"[print(error, \"\\n\") for error in errors]"
]
}
],

View file

@ -31,9 +31,9 @@
"\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.max_columns', None) \n",
"pd.set_option('display.max_rows', None) \n",
"pd.set_option('display.max_colwidth', None)\n",
"pd.set_option(\"display.max_columns\", None)\n",
"pd.set_option(\"display.max_rows\", None)\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2"
@ -98,12 +98,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -160,7 +164,7 @@
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_path}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_path}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -197,8 +201,8 @@
},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -290,7 +294,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors, entity=['GPE'])"
"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"GPE\"])"
]
},
{
@ -311,7 +315,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in errors]"
"[print(error, \"\\n\") for error in errors]"
]
},
{

View file

@ -27,9 +27,9 @@
"\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.max_columns', None) \n",
"pd.set_option('display.max_rows', None) \n",
"pd.set_option('display.max_colwidth', None)\n",
"pd.set_option(\"display.max_columns\", None)\n",
"pd.set_option(\"display.max_rows\", None)\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2"
@ -51,7 +51,9 @@
"outputs": [],
"source": [
"dataset_name = \"synth_dataset_v2.json\"\n",
"dataset = InputSample.read_dataset_json(Path(Path.cwd().parent.parent, \"data\", dataset_name))\n",
"dataset = InputSample.read_dataset_json(\n",
" Path(Path.cwd().parent.parent, \"data\", dataset_name)\n",
")\n",
"print(len(dataset))"
]
},
@ -82,12 +84,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -109,7 +115,13 @@
"flair_ner_fast = \"ner-english-fast\"\n",
"flair_ontonotes_fast = \"ner-english-ontonotes-fast\"\n",
"flair_ontonotes_large = \"ner-english-ontonotes-large\"\n",
"models = [flair_ner, flair_ner_fast, flair_ontonotes_fast ,flair_ner_fast, flair_ontonotes_large]"
"models = [\n",
" flair_ner,\n",
" flair_ner_fast,\n",
" flair_ontonotes_fast,\n",
" flair_ner_fast,\n",
" flair_ontonotes_large,\n",
"]"
]
},
{
@ -138,7 +150,7 @@
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -171,8 +183,8 @@
"metadata": {},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -265,7 +277,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors, entity=['GPE'])"
"fns_df = ModelError.get_fns_dataframe(errors, entity=[\"GPE\"])"
]
},
{
@ -286,7 +298,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in errors]"
"[print(error, \"\\n\") for error in errors]"
]
},
{

View file

@ -136,15 +136,13 @@
" print(f\"Evaluating model {model_name}\")\n",
"\n",
" nlp = spacy.load(model_name)\n",
" model = SpacyModel(\n",
" model=nlp, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"]\n",
" )\n",
" model = SpacyModel(model=nlp, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"])\n",
" evaluator = Evaluator(model=model)\n",
" evaluation_results = evaluator.evaluate_all(deepcopy(dataset))\n",
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -200,8 +198,8 @@
},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -272,7 +270,6 @@
},
"outputs": [],
"source": [
"\n",
"ModelError.most_common_fn_tokens(errors, n=50, entity=[\"PERSON\"])"
]
},

View file

@ -52,7 +52,9 @@
"outputs": [],
"source": [
"dataset_name = \"synth_dataset_v2.json\"\n",
"dataset = InputSample.read_dataset_json(Path(Path.cwd().parent.parent, \"data\", dataset_name))\n",
"dataset = InputSample.read_dataset_json(\n",
" Path(Path.cwd().parent.parent, \"data\", dataset_name)\n",
")\n",
"print(len(dataset))"
]
},
@ -83,12 +85,16 @@
"print(dataset[1])\n",
"\n",
"print(\"\\nMin and max number of tokens in dataset:\")\n",
"print(f\"Min: {min([len(sample.tokens) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\")\n",
"print(\n",
" f\"Min: {min([len(sample.tokens) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.tokens) for sample in dataset])}\"\n",
")\n",
"\n",
"print(\"\\nMin and max sentence length in dataset:\")\n",
"print(f\"Min: {min([len(sample.full_text) for sample in dataset])}, \" \\\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\")"
"print(\n",
" f\"Min: {min([len(sample.full_text) for sample in dataset])}, \"\n",
" f\"Max: {max([len(sample.full_text) for sample in dataset])}\"\n",
")"
]
},
{
@ -128,14 +134,16 @@
" experiment = get_experiment_tracker()\n",
" print(\"-----------------------------------\")\n",
" print(f\"Evaluating model {model_name}\")\n",
" \n",
" model = StanzaModel(model_name=model_name, entities_to_keep=['PERSON', 'GPE', 'ORG', 'NORP'])\n",
"\n",
" model = StanzaModel(\n",
" model_name=model_name, entities_to_keep=[\"PERSON\", \"GPE\", \"ORG\", \"NORP\"]\n",
" )\n",
" evaluator = Evaluator(model=model)\n",
" evaluation_results = evaluator.evaluate_all(deepcopy(dataset))\n",
" results = evaluator.calculate_score(evaluation_results)\n",
"\n",
" # update params tracking\n",
" params = {\"dataset_name\":dataset_name, \"model_name\": model_name}\n",
" params = {\"dataset_name\": dataset_name, \"model_name\": model_name}\n",
" params.update(model.to_log())\n",
" experiment.log_parameters(params)\n",
" experiment.log_dataset_hash(dataset)\n",
@ -168,8 +176,8 @@
"metadata": {},
"outputs": [],
"source": [
"sent = 'I am taiwanese but I live in Cambodia.'\n",
"#sent = input(\"Enter sentence: \")\n",
"sent = \"I am taiwanese but I live in Cambodia.\"\n",
"# sent = input(\"Enter sentence: \")\n",
"model.predict(InputSample(full_text=sent))"
]
},
@ -243,7 +251,7 @@
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelError.get_fns_dataframe(errors=results.model_errors,entity=['GPE'])"
"fns_df = ModelError.get_fns_dataframe(errors=results.model_errors, entity=[\"GPE\"])"
]
},
{
@ -264,7 +272,7 @@
"outputs": [],
"source": [
"print(\"All errors:\\n\")\n",
"[print(error,\"\\n\") for error in results.model_errors]"
"[print(error, \"\\n\") for error in results.model_errors]"
]
},
{

View file

@ -9,22 +9,30 @@ for model training and evaluation.
There are two main scenarios for using the Presidio Data Generator:
1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates (see [this file](raw_data/templates.txt) for example)
1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates
(see [this file](raw_data/templates.txt) for an example)
2. Augment an existing labeled dataset with additional fake values.
In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates, and then scenario 1 is applied.
In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates,
and then scenario 1 is applied.
## Process
This generator heavily relies on the [Faker package](https://www.github.com/joke2k/faker) with a few differences:
1. `PresidioDataGenerator` returns not only fake text, but all the spans in which fake entities appear in the text
1. `PresidioDataGenerator` returns not only fake text, but also the spans in which fake entities appear in the text.
2. Faker samples each value independently. In many cases we would want to keep the semantic dependency between two values. For example, for the template `My name is {{name}} and my email is {{email}}`, we would prefer a result which has the name within the email address, such as `My name is Mike and my email is mike1243@gmail.com`. For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented. It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record (if possible).
2. `Faker` samples each value independently.
In many cases we would want to keep the semantic dependency between two values.
For example, for the template `My name is {{name}} and my email is {{email}}`,
we would prefer a result which has the name within the email address,
such as `My name is Mike and my email is mike1243@gmail.com`.
For this functionality, a new `RecordGenerator` (based on Faker's `Generator` class) is implemented.
It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record (if possible).
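A hedged sketch of that behavior (the record values below are made up for illustration):
```python
import pandas as pd

from presidio_evaluator.data_generator.faker_extensions import RecordsFaker

# Each row is one coherent fake identity; values in a row belong together
records = pd.DataFrame(
    [
        {"name": "Mike Smith", "email": "mike1243@gmail.com"},
        {"name": "Dana Levy", "email": "dana.levy@example.org"},
    ]
)
faker = RecordsFaker(records=records)

# Sampled values should now favor coming from the same record
print(faker.name(), faker.email())
```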
## Example
For a full example, see the [Generation Data Notebook](../../notebooks/1_Generate_data.ipynb).
For a full example, see the [Generate Data Notebook](../../notebooks/1_Generate_data.ipynb).
Simple example:
@ -57,44 +65,18 @@ The process in high level is the following:
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) adapt the FakeDataGenerator to support new extensions
which could generate fake PII entities
3. Generate X samples using the templates list + a fake PII dataset +
extensions that add additional PII entities
2. (Optional) add new Faker providers to the `PresidioDataGenerator` to support types of PII not returned by Faker
3. Generate samples using the templates list
4. Split the generated dataset into train/test/validation while making sure
that samples from the same template only appear in one set
5. Adapt datasets for the various models (spaCy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
7. Evaluate using one of the [evaluation notebooks](../../notebooks/models)
Notes:
- For steps 5, 6, 7 see the main [README](../../README.md).
- For a simple data generation pipeline,
[see this notebook](../../notebooks/data%20generation/Generate%20data.ipynb).
- For information on transforming a NER dataset into templates,
see the notebooks in the [helper notebooks](../../notebooks/data%20generation) folder.
Example run:
```python
from presidio_evaluator.data_generator import generate
TEMPLATES_FILE = 'raw_data/templates.txt'
OUTPUT = "generated_.txt"
## Should be downloaded from FakeNameGenerator
fake_pii_csv = 'raw_data/FakeNameGenerator.csv'
examples = generate(fake_pii_csv=fake_pii_csv,
                    utterances_file=TEMPLATES_FILE,
                    dictionary_path=None,
                    output_file=OUTPUT,
                    lower_case_ratio=0.1,
                    num_of_examples=100,
                    ignore_types={"IP_ADDRESS", 'US_SSN', 'URL'},
                    keep_only_tagged=False,
                    span_to_tag=True)
```
*Copyright notice:*

View file

@ -9,7 +9,7 @@ from .providers import (
IpAddressProvider,
AddressProviderNew,
PhoneNumberProviderNew,
AgeProvider
AgeProvider,
)
__all__ = [
@ -24,5 +24,5 @@ __all__ = [
"AddressProviderNew",
"PhoneNumberProviderNew",
"AgeProvider",
"RecordsFaker"
"RecordsFaker",
]

View file

@ -7,7 +7,6 @@ from presidio_evaluator.data_generator.faker_extensions import RecordGenerator
class RecordsFaker(Faker):
def __init__(self, records: Union[pd.DataFrame, List[Dict]], **kwargs):
if isinstance(records, pd.DataFrame):
records = records.to_dict(orient="records")

View file

@ -281,7 +281,7 @@ class PresidioDataGenerator:
"prefix_female",
"prefix_male",
"last_name_female",
"last_name_male"
"last_name_male",
],
),
axis=1,

View file

@ -5,7 +5,7 @@ from typing import List, Optional, Union, Dict, Any, Tuple
import pandas as pd
import spacy
from spacy import Language
from spacy.tokens import Token, Doc, DocBin
from spacy.tokens import Doc, DocBin
from spacy.training import iob_to_biluo
from tqdm import tqdm
@ -137,14 +137,14 @@ class InputSample(object):
:param full_text: The raw text of this sample
:param masked: Masked/Templated version of the raw text
:param spans: List of spans for entities
:param create_tags_from_span: True if tags (tokens+taks) should be added
:param create_tags_from_span: True if tags (tokens+tags) should be added
:param scheme: IO, BIO/IOB or BILOU. Only applicable if span_to_tag=True
:param tokens: spaCy Doc object
:param tags: list of strings representing the label for each token,
given the scheme
:param metadata: A dictionary of additional metadata on the sample,
in the English (or other language) vocabulary
:param template_id: Original template (utterance) of sample, in case it was generated
:param template_id: Original template (utterance) of sample, in case it was generated # noqa
"""
if tags is None:
tags = []
@ -532,7 +532,7 @@ class InputSample(object):
span.entity_value = "O"
@staticmethod
def create_flair_dataset(dataset):
def create_flair_dataset(dataset: List["InputSample"]) -> List[str]:
flair_samples = []
for sample in dataset:
flair_samples.append(sample.to_flair())

View file

@ -143,7 +143,9 @@ class Evaluator:
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
evaluation_results = []
if self.model.entity_mapping:
print(f"Mapping entity values using this dictionary: {self.model.entity_mapping}")
print(
f"Mapping entity values using this dictionary: {self.model.entity_mapping}"
)
for sample in tqdm(dataset, desc=f"Evaluating {self.model.__class__}"):
# Align tag values to the ones expected by the model

View file

@ -40,7 +40,7 @@ class ExperimentTracker:
labels=List[str],
):
self.confusion_matrix = matrix
self.labels=labels
self.labels = labels
def start(self):
pass

View file

@ -1,3 +1,4 @@
"""Helper scripts for calling different NER models."""
from .base_model import BaseModel
from .crf_model import CRFModel
from .presidio_analyzer_wrapper import PresidioAnalyzerWrapper

View file

@ -16,7 +16,7 @@ class CRFModel(BaseModel):
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
if model_pickle_path is None:

View file

@ -16,6 +16,15 @@ from presidio_evaluator.models import BaseModel
class FlairModel(BaseModel):
"""
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path: Path to a stored Flair model, used if `model` is not provided
:param entities_to_keep: List of entity types to focus on
:param verbose: Whether to print verbose output
:param entity_mapping: Mapping between input dataset entities
and model expected entity types
"""
def __init__(
self,
model=None,
@ -24,14 +33,7 @@ class FlairModel(BaseModel):
verbose: bool = False,
entity_mapping: Dict[str, str] = PRESIDIO_SPACY_ENTITIES,
):
"""
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path:
:param entities_to_keep:
:param verbose:
and model expected entity types
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,

View file

@ -1,5 +1,7 @@
from typing import List
import pandas as pd
try:
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
@ -21,11 +23,20 @@ from os import path
class FlairTrainer:
"""
Helper class for training Flair models
"""
@staticmethod
def to_flair_row(text, pos, label):
def to_flair_row(text: str, pos: str, label: str) -> str:
"""
Turn text, part of speech and label into one row.
:return: str
"""
return "{} {} {}".format(text, pos, label)
def to_flair(self, df, outfile="flair_train.txt"):
def to_flair(self, df: pd.DataFrame, outfile: str = "flair_train.txt") -> None:
"""Translate a pd.DataFrame to a flair dataset."""
sentence = 0
flair = []
for row in df.itertuples():
@ -43,10 +54,19 @@ class FlairTrainer:
def create_flair_corpus(
self, train_samples_path, test_samples_path, val_samples_path
):
"""
Create a flair Corpus object and save it to train, test and validation files.
:param train_samples_path: Path to train samples
:param test_samples_path: Path to test samples
:param val_samples_path: Path to validation samples
:return:
"""
if not path.exists("flair_train.txt"):
train_samples = InputSample.read_dataset_json(train_samples_path)
train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
print(f"Kept {len(train_tagged)} train samples after removal of non-tagged samples")
print(
f"Kept {len(train_tagged)} train samples after removal of non-tagged samples"
)
train_data = InputSample.create_conll_dataset(train_tagged)
self.to_flair(train_data, outfile="flair_train.txt")
@ -61,7 +81,12 @@ class FlairTrainer:
self.to_flair(val_data, outfile="flair_val.txt")
@staticmethod
def read_corpus(data_folder):
def read_corpus(data_folder: str):
"""
Read Flair Corpus object.
:param data_folder: Path with files
:return: Corpus object
"""
columns = {0: "text", 1: "pos", 2: "ner"}
corpus = ColumnCorpus(
data_folder,
@ -73,7 +98,12 @@ class FlairTrainer:
return corpus
@staticmethod
def train(corpus):
def train(corpus: Corpus):
"""
Train a Flair model
:param corpus: Corpus object
:return:
"""
print(corpus)
# 2. what tag do we want to predict?

View file

@ -15,7 +15,7 @@ class PresidioAnalyzerWrapper(BaseModel):
labeling_scheme: str = "BIO",
score_threshold: float = 0.4,
language: str = "en",
entity_mapping:Optional[Dict[str,str]]=None
entity_mapping: Optional[Dict[str, str]] = None,
):
"""
Evaluation wrapper for the Presidio Analyzer
@ -25,7 +25,7 @@ class PresidioAnalyzerWrapper(BaseModel):
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
self.score_threshold = score_threshold
self.language = language

View file

@ -9,6 +9,17 @@ from presidio_evaluator.span_to_tag import span_to_tag
class PresidioRecognizerWrapper(BaseModel):
"""
Class wrapper for one specific PII recognizer
To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
:param entities_to_keep: List of entity types to focus on while ignoring all the rest.
Default=None would look at all entity types
:param with_nlp_artifacts: Whether NLP artifacts should be obtained
(faster if not, but some recognizers need it)
"""
def __init__(
self,
recognizer: EntityRecognizer,
@ -19,21 +30,12 @@ class PresidioRecognizerWrapper(BaseModel):
entity_mapping: Optional[Dict[str, str]] = None,
verbose: bool = False,
):
"""
Evaluator for one specific PII recognizer
To evaluate the entire set of recognizers, refer to PresidioAnalyzerWrapper
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
:param entities_to_keep: List of entity types to focus on while ignoring all the rest.
Default=None would look at all entity types
:param with_nlp_artifacts: Whether NLP artifacts should be obtained
(faster if not, but some recognizers need it)
"""
super().__init__(
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
self.with_nlp_artifacts = with_nlp_artifacts
self.recognizer = recognizer

View file

@ -21,7 +21,7 @@ class SpacyModel(BaseModel):
entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
entity_mapping=entity_mapping
entity_mapping=entity_mapping,
)
if model is None:
@ -32,14 +32,19 @@ class SpacyModel(BaseModel):
self.model = model
def predict(self, sample: InputSample) -> List[str]:
"""
Predict a list of tags for an input sample.
:param sample: InputSample
:return: list of tags
"""
doc = self.model(sample.full_text)
tags = self.get_tags_from_doc(doc)
tags = self._get_tags_from_doc(doc)
if len(doc) != len(sample.tokens):
print("mismatch between input tokens and new tokens")
return tags
@staticmethod
def get_tags_from_doc(doc):
def _get_tags_from_doc(doc):
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
return tags

View file

@ -14,6 +14,17 @@ from presidio_evaluator.models import SpacyModel
class StanzaModel(SpacyModel):
"""
Class wrapping Stanza models, using spacy_stanza.
:param model: spaCy Language object representing a stanza model
:param model_name: Name of model, e.g. "en"
:param entities_to_keep: List of entities to predict on
:param verbose: Whether to print more
:param labeling_scheme: Whether to return IO, BIO or BILUO tags
:param entity_mapping: Mapping between input dataset entities and entities expected by the model
"""
def __init__(
self,
model: spacy.language.Language = None,
@ -23,6 +34,7 @@ class StanzaModel(SpacyModel):
labeling_scheme: str = "BIO",
entity_mapping: Optional[Dict[str, str]] = PRESIDIO_SPACY_ENTITIES,
):
if not model and not model_name:
raise ValueError("Either model_name or model object must be supplied")
if not model:
@ -40,6 +52,12 @@ class StanzaModel(SpacyModel):
)
def predict(self, sample: InputSample) -> List[str]:
"""
Predict the tags using a stanza model.
:param sample: InputSample with text
:return: list of tags
"""
doc = self.model(sample.full_text)
if doc.ents:
@ -48,7 +66,8 @@ class StanzaModel(SpacyModel):
)
# Stanza tokens might not be consistent with spaCy's tokens.
# Use spacy tokenization and not stanza to maintain consistency with other models:
# Use spacy tokenization and not stanza
# to maintain consistency with other models:
if not sample.tokens:
sample.tokens = tokenize(sample.full_text)

View file

@ -20,12 +20,12 @@ setup(
packages=find_packages(exclude=["tests"]),
url="https://www.github.com/microsoft/presidio-research",
license="MIT",
description="PII dataset generator, model evaluator for Presidio and PII data in general",
description="PII dataset generator, model evaluator for Presidio and PII data in general", # noqa
data_files=[
(
"presidio_evaluator/data_generator/raw_data",
[
"presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv",
"presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv", # noqa
"presidio_evaluator/data_generator/raw_data/templates.txt",
"presidio_evaluator/data_generator/raw_data/organizations.csv",
"presidio_evaluator/data_generator/raw_data/nationalities.csv",
@ -46,6 +46,6 @@ setup(
"schwifty",
"faker",
"sklearn_crfsuite",
"python-dotenv"
"python-dotenv",
],
)

View file

@ -69,7 +69,7 @@ def test_to_spacy_file_and_back(small_dataset):
output_path="dataset.spacy",
translate_tags=False,
spacy_pipeline=spacy_pipeline,
alignment_mode = "strict"
alignment_mode="strict",
)
db = DocBin()

View file

@ -23,7 +23,6 @@ def test_flair_simple():
os.path.join(dir_path, "data/generated_small.json")
)
flair_model = FlairModel(model_path="ner", entities_to_keep=["PERSON"])
evaluator = Evaluator(model=flair_model)
evaluation_results = evaluator.evaluate_all(input_samples)

View file

@ -55,10 +55,7 @@ cc_test_template_testdata = [
# credit card recognizer tests on template-generates data
@pytest.mark.parametrize(
"pii_csv, "
"utterances, "
"num_of_examples, "
"acceptance_threshold",
"pii_csv, " "utterances, " "num_of_examples, " "acceptance_threshold",
[testcase.to_pytest_param() for testcase in cc_test_template_testdata],
)
def test_credit_card_recognizer_with_template(

View file

@ -77,7 +77,7 @@ rocket_test_template_testdata = [
utterances="{}/data/rocket_example_sentences.txt",
num_of_examples=100,
acceptance_threshold=0,
max_mistakes_number=100
max_mistakes_number=100,
),
PatternRecognizerTestCase(
test_name="rocket-some-errors",