@@ -1,14 +1,132 @@
# Presidio-evaluator
This package features data-science related tasks for developing new recognizers for Presidio.
It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models
# Contributing
## Who should use it?
Anyone interested in evaluating an existing Presidio instance or a specific PII recognizer, or in developing new models or logic for detecting PII, could leverage the preexisting work in this package.
Additionally, anyone interested in generating new data based on previous datasets (e.g. to increase the coverage of entity values) for Named Entity Recognition models could leverage the data generator contained in this package.
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
## What's in this package?
This project has adopted the [Microsoft Open Source Code of Conduct](
For more information see the [Code of Conduct FAQ]( or
contact []( with any additional questions or comments.
1. **Data generator** for PII recognizers and NER models
2. **Data representation layer** for data generation, modeling and analysis
3. Multiple **Model/Recognizer evaluation** files (e.g. for Spacy, Flair, CRF, Presidio API, Presidio Analyzer python package, specific Presidio recognizers)
4. **Training and modeling code** for multiple models
5. Helper functions for **results analysis**
## 1. Data generation
See [Data Generator README](/presidio_evaluator/data_generator/ for more details.
The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
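
The generation step described above can be illustrated with a minimal sketch. It is not the package's generator: `fill_template`, `bio_tags`, the whitespace tokenizer and the dictionary holding the fake PII record are all assumptions made for illustration, and only the IOB tagging variant is shown.

```python
import re

# Placeholders look like [FIRST_NAME]; this pattern is an assumption for the sketch.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, record):
    """Replace each placeholder with a value and record its character span."""
    parts, spans = [], []
    cursor = 0   # read position in the template
    length = 0   # length of the sentence built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        value = record[match.group(1)]
        start = length + len(literal)
        spans.append({"entity_type": match.group(1), "entity_value": value,
                      "start": start, "end": start + len(value)})
        parts.extend([literal, value])
        length = start + len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


def bio_tags(text, spans):
    """Whitespace-tokenize the text and tag tokens by character overlap with spans."""
    tokens, tags = [], []
    for token in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            if token.start() < span["end"] and token.end() > span["start"]:
                prefix = "B" if token.start() <= span["start"] else "I"
                tag = "{}-{}".format(prefix, span["entity_type"])
                break
        tokens.append(token.group())
        tags.append(tag)
    return tokens, tags


text, spans = fill_template("My name is [FIRST_NAME]", {"FIRST_NAME": "David"})
tokens, tags = bio_tags(text, spans)
print(text, spans, list(zip(tokens, tags)))
```

Tracking spans while the sentence is assembled avoids searching for the inserted value afterwards, which would be ambiguous when the same value appears twice in one sentence.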
- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/
- For an example of running the generation process, see [this notebook](notebooks/Generate%20data.ipynb).
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
Note that the generation process might not work off-the-shelf, as we are not sharing the fake PII datasets and templates used in this analysis due to copyright and other restrictions.
Once data is generated, it could be split into train/test/validation sets while ensuring that each template only exists in one set. See [this notebook for more details](notebooks/Split%20by%20pattern%20%23.ipynb).
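
A template-aware split can be sketched as follows. This is an illustration rather than the notebook's code: the `template_id` key, the `split_by_template` name and the 70/15/15 ratios are hypothetical. The idea is to shuffle and partition the template ids first, then assign every sample to the set that owns its template.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split samples into train/test/validation so no template spans two sets."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["template_id"]].append(sample)

    # Partition the template ids, not the individual samples
    template_ids = sorted(groups)
    random.Random(seed).shuffle(template_ids)
    n_train = int(len(template_ids) * ratios[0])
    n_test = int(len(template_ids) * ratios[1])
    buckets = {
        "train": template_ids[:n_train],
        "test": template_ids[n_train:n_train + n_test],
        "validation": template_ids[n_train + n_test:],
    }
    return {name: [s for tid in ids for s in groups[tid]]
            for name, ids in buckets.items()}


samples = [{"template_id": t, "text": "sample {} of template {}".format(i, t)}
           for t in range(10) for i in range(3)]
split = split_by_template(samples)
print({name: len(items) for name, items in split.items()})
```

Because the ratios apply to templates, the resulting sample counts only approximate the requested proportions when templates produce different numbers of sentences.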
## 2. Data representation
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [](presidio_evaluator/
## 3. Recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [ModelEvaluator](presidio_evaluator/ class. It provides a structured way of evaluating models and recognizers.
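
As a rough illustration of the metrics involved, the sketch below computes token-level precision and recall for a binary PII / not-PII decision. It is not the `ModelEvaluator` implementation; the function name and the use of `"O"` as the non-PII tag are assumptions.

```python
def pii_precision_recall(true_tags, predicted_tags, outside="O"):
    """Token-level precision and recall, treating any non-"O" tag as PII."""
    tp = fp = fn = 0
    for truth, prediction in zip(true_tags, predicted_tags):
        truth_is_pii = truth != outside
        predicted_is_pii = prediction != outside
        if truth_is_pii and predicted_is_pii:
            tp += 1
        elif predicted_is_pii:
            fp += 1
        elif truth_is_pii:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = ["O", "PERSON", "PERSON", "O", "US_SSN"]
predicted = ["O", "PERSON", "O", "PERSON", "US_SSN"]
print(pii_precision_recall(truth, predicted))
```

In this binary view a token tagged with the wrong entity type still counts as a true positive; per-entity precision and recall would compare the tag values themselves.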
### Ready evaluators
Some evaluators were developed for analysis and references. These include:
#### 1. Presidio API evaluator
Allows you to evaluate an existing Presidio deployment through the API. [See this notebook for details](notebooks/Evaluate%20Presidio-API.ipynb).
#### 2. Presidio analyzer evaluator
Allows you to evaluate the local Presidio-Analyzer package. Faster than the API option but requires you to have Presidio-Analyzer installed locally. [See this class for more information](presidio_evaluator/
#### 3. One recognizer evaluator
Evaluate one specific recognizer for precision and recall. See [](presidio_evaluator/
## 4. Modeling
### Conditional Random Fields
To train a CRF on a new dataset, see [this notebook](notebooks/models/CRF).
To evaluate a CRF model, see [this notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/
### spaCy based models
There are three ways of interacting with spaCy models:
1. Evaluate an existing trained model
2. Train with pretrained embeddings
3. Fine tune an existing spaCy model
Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
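
For orientation, the retraining code in this repository iterates over `(text, annotations)` pairs in which `annotations["entities"]` holds the labeled spans, in the spaCy 2.x style of `(start_char, end_char, label)` tuples. The sketch below shows one way to produce that shape from span dictionaries; the `to_spacy_example` name and the dictionary keys are hypothetical and do not come from the notebook.

```python
def to_spacy_example(text, spans):
    """Convert character spans into a spaCy 2.x style training tuple."""
    entities = [(span["start"], span["end"], span["entity_type"])
                for span in sorted(spans, key=lambda s: s["start"])]
    return text, {"entities": entities}


example = to_spacy_example(
    "My name is David",
    [{"start": 11, "end": 16, "entity_type": "PERSON"}],
)
print(example)
```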
#### Evaluate an existing trained model
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
#### Train with pretrained embeddings
In order to train a new spaCy model from scratch with pretrained embeddings (FastText wiki news subword in this case), follow these three steps:
##### 1. Download FastText pretrained (sub) word embeddings
``` sh
```
##### 2. Init spaCy model with pre-trained embeddings
Using spaCy CLI:
``` sh
python -m spacy init-model en spacy_fasttext --vectors-loc wiki-news-300d-1M-subword.vec
```
##### 3. Train spaCy NER model
Using spaCy CLI:
``` sh
python -m spacy train en spacy_fasttext_100 train.json test.json --vectors spacy_fasttext --pipeline ner -n 100
```
#### Fine-tune an existing spaCy model
See [this code for retraining an existing spaCy model](models/
First, create train and test pickle files for your train and test sets. See [this notebook](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb) for more information. Then run a SpacyRetrainer:
```python
from models import SpacyRetrainer
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
                                 n_iter=500, dropout=0.1, aml_config=None)
```
### Flair based models
To train a new model, see the [FlairTrainer](presidio_evaluator/models/ object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
To train a Flair model, run:
```python
from models import FlairTrainer
train_samples = "../data/generated_train.json"
test_samples = "../data/generated_test.json"
val_samples = "../data/generated_validation.json"
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
corpus = trainer.read_corpus("")
```
To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).
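
The corpus files that `create_flair_corpus` writes use a column layout: one token per line with its `text`, `pos` and `ner` columns separated by spaces, and an empty line between sentences. The following is a minimal sketch of producing that layout from in-memory sentences; `to_column_format` and the tuple-based input are assumptions for illustration, not part of `FlairTrainer`.

```python
def to_column_format(sentences):
    """Render sentences as 'text pos label' rows with a blank line after each."""
    lines = []
    for sentence in sentences:
        for text, pos, label in sentence:
            lines.append("{} {} {}".format(text, pos, label))
        lines.append("")  # an empty line marks the sentence boundary
    return "\n".join(lines)


sentences = [
    [("I", "PRON", "O"), ("am", "VERB", "O"), ("David", "PROPN", "B-PERSON")],
    [("Hello", "INTJ", "O")],
]
print(to_column_format(sentences))
```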
Copyright notice:
Fake Name Generator identities by the [Fake Name Generator](
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License]( Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.

VERSION Normal file
@@ -0,0 +1,2 @@

data/synth_dataset.txt Normal file

File diff suppressed because it is too large

models/ Normal file
@@ -0,0 +1,2 @@
from .spacy_retrain import SpacyRetrainer
from .flair_train import FlairTrainer

models/ Normal file
@@ -0,0 +1,123 @@
from typing import List
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset
from os import path
class FlairTrainer:
    def to_flair_row(self, text, pos, label):
        # One token per row: "<text> <pos> <label>"
        return "{} {} {}".format(text, pos, label)

    def to_flair(self, df, outfile="flair_train.txt"):
        sentence = 0
        flair = []
        for row in df.itertuples():
            if row.sentence != sentence:
                sentence += 1
                # An empty row separates sentences in the column format
                flair.append("")
            flair.append(self.to_flair_row(row.text, row.pos, row.label))
        if outfile:
            with open(outfile, "w", encoding="utf-8") as f:
                for item in flair:
                    f.write("{}\n".format(item))

    def create_flair_corpus(self, train_samples_path, test_samples_path, val_samples_path):
        # Each file is only created if it does not exist yet
        if not path.exists("flair_train.txt"):
            train_samples = read_synth_dataset(train_samples_path)
            train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
            print("Kept {} train samples after removal of non-tagged samples".format(len(train_tagged)))
            train_data = InputSample.create_conll_dataset(train_tagged)
            self.to_flair(train_data, outfile="flair_train.txt")
        if not path.exists("flair_test.txt"):
            test_samples = read_synth_dataset(test_samples_path)
            test_data = InputSample.create_conll_dataset(test_samples)
            self.to_flair(test_data, outfile="flair_test.txt")
        if not path.exists("flair_val.txt"):
            val_samples = read_synth_dataset(val_samples_path)
            val_data = InputSample.create_conll_dataset(val_samples)
            self.to_flair(val_data, outfile="flair_val.txt")

    @staticmethod
    def read_corpus(data_folder) -> Corpus:
        columns = {0: 'text', 1: 'pos', 2: 'ner'}
        # File names match the files written by create_flair_corpus
        corpus: Corpus = ColumnCorpus(data_folder, columns, train_file='flair_train.txt',
                                      test_file='flair_test.txt', dev_file='flair_val.txt')
        return corpus
def train(corpus):
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
# 6. initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
checkpoint = 'resources/taggers/presidio-ner/'
# trainer = ModelTrainer.load_checkpoint(checkpoint, corpus)
sentence = Sentence('I am from Jerusalem')
# run NER over sentence
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
if __name__ == "__main__":
    train_samples = "../data/generated_train_November 12 2019.json"
    test_samples = "../data/generated_test_November 12 2019.json"
    val_samples = "../data/generated_validation_November 12 2019.json"
    trainer = FlairTrainer()
    trainer.create_flair_corpus(train_samples, test_samples, val_samples)
    corpus = trainer.read_corpus("")

models/ Normal file
@@ -0,0 +1,206 @@
import logging
import pickle
import random
import sys
from pathlib import Path
import spacy
from azureml.core import Workspace, Experiment
from spacy.util import minibatch, compounding
from presidio_evaluator import SpacyEvaluator, InputSample
root = logging.getLogger()
handler = logging.StreamHandler(sys.stdout)
class SpacyRetrainer:
    def __init__(self, original_model_name=None, experiment_name=None, n_iter=100, dropout=0.5,
                 aml_config='config.json', output_dir='../../model-outputs', train_pickle='../data/train.pickle',
                 test_pickle='../data/test.pickle'):  # default assumed by analogy with train_pickle
        self.experiment_name = experiment_name
        if aml_config:
            # Connect to the Azure ML workspace and start logging an experiment run
            self.ws = Workspace.from_config(aml_config)
            self.experiment = Experiment(self.ws, name=experiment_name)
            self.aml_run = self.experiment.start_logging()
            self.has_aml = True
        else:
            self.has_aml = False
        self.model = original_model_name
        self.n_iter = n_iter
        self.output_dir = output_dir
        self.train_file = train_pickle
        self.test_file = test_pickle
        self.dropout = dropout

    def run(self):
        if self.has_aml:
            self.aml_run.log("model", self.model)
            self.aml_run.log("n_iter", self.n_iter)
            self.aml_run.log("train_file", self.train_file)
            self.aml_run.log("test_file", self.test_file)
            self.aml_run.log("dropout rate", self.dropout)
        model_path = self._train(self.model, self.output_dir, self.n_iter, self.train_file, self.experiment_name)
        self._score_validate(model_path, self.test_file)
        if self.has_aml:
            # Mark the Azure ML run as finished
            self.aml_run.complete()
def print_scores(self, split, evaluation_result):
Logs results into experiment run.
:param split: Name of this split. For ex 'train' or 'valid'
:param evaluation_result: EvaluationResult containing various metrics
:return: None. Writes to experiment runner and logs locally.
"""'SPLIT: {0}. PII_precision: {1}, PII_recall: {2},'
'Person_precision: {3}, Person_recall: {4}'. \
format(split, evaluation_result.pii_precision, evaluation_result.pii_recall,
if self.has_aml:
self.aml_run.log('Precision', evaluation_result.pii_precision, split)
self.aml_run.log('Recall', evaluation_result.pii_recall, split)
def _score(model, data):
Score the model against the data
:param model: Trained model
:param data: Data split which is being scored.
:return: An EvaluationResult containing various metrics
spacy_evaluator = SpacyEvaluator(model=model)
results = []
for text, ground_truth_annotations in data:
ground_truth_entities = ground_truth_annotations['entities']
input_sample = InputSample.from_spacy(text, ground_truth_entities)
return spacy_evaluator.calculate_score(evaluation_results=results)
def _score_validate(self, model_path, test_data_file):
Validation step for the model. Also prints the scores.
:param model_path: Path to trained model.
:param test_data_file: Data file which has the dataset for this split.
:return: None. Prints the scores.
with open(test_data_file, 'rb') as f:
valid_data = pickle.load(f)
nlp = spacy.load(model_path)
self.print_scores('Valid', self._score(nlp, valid_data))
# @plac.annotations(
# model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
# output_dir=("Optional output directory", "option", "o", Path),
# n_iter=("Number of training iterations", "option", "n", int),
# train_file=("File containing pickled training Spacy NER formatted data", "option", "d", Path),
# test_file=("File containing pickled test Spacy NER formatted data", "option", "d", Path),
# exp_name=("Name of this experiment", "option", "e")
# )
    def _train(self, model, output_dir, n_iter, train_file, exp_name):
        """Load the model, set up the pipeline and train the entity recognizer."""
        nlp = self.load_or_create_empty_model(model)
        # Add the built-in NER component if it is missing, otherwise reuse the existing one
        if "ner" not in nlp.pipe_names:
            ner = nlp.create_pipe("ner")
            nlp.add_pipe(ner, last=True)
        else:
            ner = nlp.get_pipe("ner")
        with open(train_file, 'rb') as f:
            train_data = pickle.load(f)
        train_data = train_data[:50]
        # add labels
        for _, annotations in train_data:
            for ent in annotations.get("entities"):
                # spaCy 2.x entity annotations are (start_char, end_char, label) tuples
                ner.add_label(ent[2])
        # get names of other pipes to disable them during training
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
        with nlp.disable_pipes(*other_pipes):  # only train NER
            # reset and initialize the weights randomly – but only if we're
            # training a new model
            if model is None:
                nlp.begin_training()
            for itn in range(n_iter):
                # shuffle the examples on every iteration
                random.shuffle(train_data)
                losses = {}
                # batch up the examples using spaCy's minibatch
                batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
                for batch in batches:
                    texts, annotations = zip(*batch)
                    nlp.update(texts, annotations, drop=self.dropout, losses=losses)
                logging.debug("Losses: %s", losses)
                if self.has_aml:
                    self.aml_run.log('Losses', losses['ner'])
                self.print_scores('Itn {}'.format(itn), self._score(nlp, train_data))
        self.print_scores('Train', self._score(nlp, train_data))
        saved_model_path = self.save_model(exp_name, nlp, output_dir)
        return saved_model_path

    @staticmethod
    def save_model(exp_name, model, output_dir):
        """
        Saves model to disk for later use.
        :param exp_name: Name of the running experiment. This is used as folder name for storing the model.
        :param model: Model being saved
        :param output_dir: Directory where to save the model.
        :return: Full path to saved model.
        """
        saved_model_path = Path(output_dir, exp_name)
        if not saved_model_path.exists():
            saved_model_path.mkdir(parents=True)
        model.to_disk(saved_model_path)
        logging.info("Saved model to {}".format(output_dir))
        return saved_model_path

    @staticmethod
    def load_model(exp_name, model_dir):
        """
        Loads a spacy model from disk
        :param exp_name: Name of experiment under which the model was saved
        :param model_dir: path to saved model
        :return: spacy model
        """
        saved_model_path = Path(model_dir, exp_name)
        return spacy.load(saved_model_path)

    @staticmethod
    def load_or_create_empty_model(model=None):
        """
        Loads a given model or creates a blank english model.
        :param model: Optional Model to load.
        :return: Loaded or blank model.
        """
        if model:
            nlp = spacy.load(model)
            logging.debug("Loaded model {}".format(model))
        else:
            nlp = spacy.blank("en")
            logging.debug("Created blank 'en' model")
        return nlp
if __name__ == "__main__":
    spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
                                     n_iter=500, dropout=0.5, aml_config=None)

models/ Normal file
@@ -0,0 +1,151 @@
# coding: utf-8
Example of a Streamlit app for an interactive spaCy model visualizer. You can
either download the script, or point streamlit run to the raw URL of this
file. For more details, see
pip install streamlit
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download de_core_news_sm
streamlit run
from __future__ import unicode_literals
import streamlit as st
import spacy
from spacy import displacy
import pandas as pd
SPACY_MODEL_NAMES = ["en_core_web_lg", "spacy_new_ontonotes28","spacy_ft_100/model-final"]
DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
def load_model(name):
    return spacy.load(name)


def process_text(model_name, text):
    nlp = load_model(model_name)
    return nlp(text)
st.sidebar.title("Interactive spaCy visualizer")
Process text with [spaCy]( models and visualize named entities,
dependencies and more. Uses spaCy's built-in
[displaCy]( visualizer under the hood.
spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES)
model_load_state = st.info(f"Loading model '{spacy_model}'...")
nlp = load_model(spacy_model)
text = st.text_area("Text to analyze", DEFAULT_TEXT)
doc = process_text(spacy_model, text)
if "parser" in nlp.pipe_names:
    st.header("Dependency Parse & Part-of-speech tags")
    st.sidebar.header("Dependency Parse")
    split_sents = st.sidebar.checkbox("Split sentences", value=True)
    collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True)
    collapse_phrases = st.sidebar.checkbox("Collapse phrases")
    compact = st.sidebar.checkbox("Compact mode")
    options = {
        "collapse_punct": collapse_punct,
        "collapse_phrases": collapse_phrases,
        "compact": compact,
    }
    docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
    for sent in docs:
        html = displacy.render(sent, options=options)
        # Double newlines seem to mess with the rendering
        html = html.replace("\n\n", "\n")
        if split_sents and len(docs) > 1:
            st.markdown(f"> {sent.text}")
        st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
if "ner" in nlp.pipe_names:
st.header("Named Entities")
st.sidebar.header("Named Entities")
label_set = nlp.get_pipe("ner").labels
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
html = displacy.render(doc, style="ent", options={"ents": labels})
# Newlines seem to mess with the rendering
html = html.replace("\n", " ")
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
attrs = ["text", "label_", "start", "end", "start_char", "end_char"]
if "entity_linker" in nlp.pipe_names:
data = [
[str(getattr(ent, attr)) for attr in attrs]
for ent in doc.ents
if ent.label_ in labels
df = pd.DataFrame(data, columns=attrs)
if "textcat" in nlp.pipe_names:
st.header("Text Classification")
st.markdown(f"> {text}")
df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score"))
vector_size = nlp.meta.get("vectors", {}).get("width", 0)
if vector_size:
st.header("Vectors & Similarity")
text1 = st.text_input("Text or word 1", "apple")
text2 = st.text_input("Text or word 2", "orange")
doc1 = process_text(spacy_model, text1)
doc2 = process_text(spacy_model, text2)
similarity = doc1.similarity(doc2)
if similarity > 0.5:
st.header("Token attributes")
if st.button("Show token attributes"):
attrs = [
data = [[str(getattr(token, attr)) for attr in attrs] for token in doc]
df = pd.DataFrame(data, columns=attrs)
st.header("JSON Doc")
if st.button("Show JSON Doc"):
st.header("JSON model meta")
if st.button("Show JSON model meta"):

@@ -0,0 +1,245 @@
"cells": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"from presidio_evaluator import ModelEvaluator\n",
"from collections import Counter\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate your Presidio instance via the Presidio API"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A. Read dataset for evaluation"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_samples = read_synth_dataset(\"../data/synth_dataset.txt\")\n",
"print(\"Read {} samples\".format(len(input_samples)))"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### B. Descriptive statistics"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flatten = lambda l: [item for sublist in l for item in sublist]\n",
"count_per_entity = Counter([span.entity_type for span in flatten([input_sample.spans for input_sample in input_samples])])\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### C. Match the dataset's entity names with Presidio's entity names"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity\n",
"entities_mapping = {\n",
" 'PERSON': 'PERSON',\n",
" # 'BIRTHDAY': 'DATE_TIME',\n",
" # 'DATE': 'DATE_TIME',\n",
" 'DOMAIN': 'DOMAIN',\n",
" # 'CITY': 'LOCATION',\n",
" # 'ADDRESS': 'LOCATION',\n",
" 'IBAN': 'IBAN_CODE',\n",
" # 'URL': 'DOMAIN_NAME',\n",
" 'US_SSN': 'US_SSN',\n",
" 'O': 'O'\n",
"new_list = ModelEvaluator.align_input_samples_to_presidio_analyzer(input_samples,\n",
" entities_mapping,\n",
" presidio_fields)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### D. Recalculate statistics on updated dataset"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## recheck counter\n",
"count_per_entity_new = Counter([span.entity_type for span in flatten([input_sample.spans for input_sample in new_list])])\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### E. Run the presidio-evaluator framework with Presidio's API as the 'model' at test"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator import PresidioAPIEvaluator\n",
"presidio = PresidioAPIEvaluator(entities_to_keep=list(count_per_entity_new.keys()),endpoint=MY_PRESIDIO_ENDPOINT)\n",
"evaluted_samples = presidio.evaluate_all(new_list[:100])"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### F. Extract statistics\n",
"- Precision, recall and F measure are calculated based on a PII/Not PII binary classification per token.\n",
"- Specific entity recall and precision are calculated on the specific PII entity level."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluation_result = presidio.calculate_score(evaluted_samples)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"#### G. Analyze wrong predictions"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"errors = evaluation_result.model_errors"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')\n",
"metadata": {
@@ -0,0 +1,226 @@
"cells": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"from tqdm import tqdm_notebook as tqdm\n",
"from presidio_evaluator.data_generator.main import generate,read_synth_dataset\n",
"import datetime\n",
"import json"
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate fake PII data using Presidio's data generator"
"cell_type": "markdown",
"metadata": {},
"source": [
"Presidio's data generator allows you to generate a synthetic dataset with two prerequisites:\n",
"1. A fake PII csv (We used\n",
"2. A text file with template sentences or paragraphs. In this file, each PII entity placeholder is written in brackets. The name of the PII entity should be one of the columns in the fake PII csv file.\n",
"The generator creates fake sentences based on the provided fake PII csv AND a list of [extension functions](../presidio_evaluator/data_generator/ and a few additional 3rd party libraries like `Faker`, and `haikunator`.\n",
"For example:\n",
"1. **A fake PII csv**:\n",
"| David | Brown | |\n",
"| Mel | Brown | |\n",
"2. **Templates**:\n",
"My name is [FIRST_NAME]\n",
"You can email me at [EMAIL]. Thanks, [FIRST_NAME]\n",
"What's your last name? It's [LAST_NAME]\n",
"Every time I see you falling I get down on my knees and pray\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate files\n",
"Based on these two prerequisites, a requested number of examples and an output file name:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"EXAMPLES = 100\n",
"SPAN_TO_TAG = True #Whether to create tokens + token labels (tags)\n",
"TEMPLATES_FILE = '../presidio_evaluator/data_generator/' \\\n",
" 'raw_data/ontonotes_based_templates.txt'\n",
"cur_time = datetime.date.today().strftime(\"%B %d %Y\")\n",
"OUTPUT = \"generated_size_{}_date_{}.txt\".format(EXAMPLES, cur_time)\n",
"fake_pii_csv = '../presidio_evaluator/data_generator/' \\\n",
" 'raw_data/FakeNameGenerator.com_100.csv'\n",
"utterances_file = TEMPLATES_FILE\n",
"dictionary_path = '../presidio_evaluator/data_generator/' \\\n",
" 'raw_data/Dictionary.csv'\n",
"examples = generate(fake_pii_csv=fake_pii_csv,\n",
" utterances_file=utterances_file,\n",
" dictionary_path=dictionary_path,\n",
" output_file=OUTPUT,\n",
" lower_case_ratio=LOWER_CASE_RATIO,\n",
" num_of_examples=EXAMPLES,\n",
" ignore_types=IGNORE_TYPES,\n",
" keep_only_tagged=KEEP_ONLY_TAGGED,\n",
" span_to_tag=SPAN_TO_TAG)"
"cell_type": "markdown",
"metadata": {},
"source": [
"To read a dataset file into the InputSample format, use `read_synth_dataset`:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"input_samples = read_synth_dataset(OUTPUT)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"The full structure of each input_sample is the following. It includes different feature values per token as calculated by Spacy"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Verify randomness of dataset"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"from collections import Counter\n",
"count_per_template_id = Counter([sample.metadata['Template#'] for sample in input_samples])\n",
"for key in sorted(count_per_template_id):\n",
" print(\"{}: {}\".format(key,count_per_template_id[key]))\n",
" \n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Transform to the CONLL structure:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"conll = InputSample.create_conll_dataset(input_samples)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Copyright notice:\n",
"Data generated for evaluation was created using Fake Name Generator.\n",
"Fake Name Generator identities by the [Fake Name Generator]( \n",
"are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License]( Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.\n"
notebooks/PII EDA.ipynb Normal file
@@ -0,0 +1,274 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fake PII data: Exploratory data analysis\n",
"This notebook is used to verify the different fake entities before and after the creation of a synthetic dataset / augmented dataset. First part looks at the generation details and stats, second part evaluates the created synthetic dataset after it has been generated."
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from presidio_evaluator.data_generator.extensions import generate_iban, generate_ip_addresses, generate_SSNs, \\\n",
" generate_company_names, generate_url, generate_roles, generate_titles, generate_nationality, generate_nation_man, \\\n",
" generate_nation_woman, generate_nation_plural, generate_title\n",
"from presidio_evaluator.data_generator import FakeDataGenerator, read_synth_dataset\n",
"from collections import Counter\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Evaluate generation logic and the fake PII bank used during generation"
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_100000.csv\",encoding=\"utf-8\")"
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"generator = FakeDataGenerator(fake_pii_df=df, \n",
" templates=None, \n",
" dictionary_path=None,\n",
" ignore_types={\"IP_ADDRESS\", 'US_SSN', 'URL','ADDRESS'})"
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pii_df = generator.prep_fake_pii(df)"
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"for (name, series) in pii_df.items():\n",
" print(name)\n",
" print(\"Unique values: {}\".format(len(series.unique())))\n",
" print(series.value_counts())\n",
" print(\"\\n**************\\n\")"
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from wordcloud import WordCloud\n",
"def series_to_wordcloud(series):\n",
" freqs = series.value_counts()\n",
" wordcloud = WordCloud(background_color='white',width=800,height=400).generate_from_frequencies(freqs)\n",
" fig = plt.figure(figsize=(16, 8))\n",
"    plt.suptitle(\"{} word cloud\".format(series.name))\n",
" plt.imshow(wordcloud, interpolation='bilinear')\n",
" plt.axis(\"off\")"
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Evaluate different entities in the synthetic dataset after creation"
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"synth = read_synth_dataset(\"../data/generated_train_November 12 2019.json\")"
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"sentences_only = [(sample.full_text,sample.metadata) for sample in synth]"
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"print(\"Proportions of female vs. male based samples:\")\n",
"Counter([sentence[1]['Gender'] for sentence in sentences_only])"
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"print(\"Proportion of lower case samples:\")\n",
"Counter([sentence[1]['Lowercase'] for sentence in sentences_only])"
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"print(\"Proportion of nameset across samples:\")\n",
"Counter([sentence[1]['NameSet'] for sentence in sentences_only])"
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def get_entity_values_from_sample(sample,entity_types):\n",
" name_entities = [span.entity_value for span in sample.spans if span.entity_type in entity_types]\n",
" return name_entities\n",
" \n",
"names = [get_entity_values_from_sample(sample,['PERSON','FIRST_NAME','LAST_NAME']) for sample in synth]\n",
"names = [item for sublist in names for item in sublist]\n",
"series_to_wordcloud(pd.Series(names,name='PERSON, FIRST_NAME, LAST_NAME'))"
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"countries = [get_entity_values_from_sample(sample,['LOCATION']) for sample in synth]\n",
"countries = [item for sublist in countries for item in sublist]\n",
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"orgs = [get_entity_values_from_sample(sample,['ORGANIZATION']) for sample in synth]\n",
"orgs = [item for sublist in orgs for item in sublist]\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []

@ -0,0 +1,166 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train/Test/Validation split of input samples. \n",
"This notebook shows how train/test/split is being made on a List[InputSample]\n",
"This is different for the normal split since we don't want sentences generated from the same pattern to be in more than one set. (Applicable only if the dataset was generated from templates)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"from presidio_evaluator.validation import split_dataset, save_to_json\n",
"%reload_ext autoreload"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATE_DATE = \"November 12 2019\"\n",
"SIZE = 80000"
"cell_type": "markdown",
"metadata": {},
"source": [
"Load full dataset"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_samples = read_synth_dataset(\"../presidio_evaluator/data_generator/generated_size_{}_date_{}.txt\".format(SIZE, DATE_DATE))\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Split to train/test/dev"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"TRAIN_TEST_VAL_RATIOS = [0.7,0.2,0.1]\n",
"train, test, validation = split_dataset(all_samples,TRAIN_TEST_VAL_RATIOS)\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"Train/Test only (no validation)\n"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#TRAIN_TEST_RATIOS = [0.7,0.3]\n",
"#train,test = split_dataset(all_sampleTRAIN_TEST_RATIOSEST_RATIOS)"
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the different sets to files"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert len(train) + len(test) + len(validation) == len(all_samples)"
notebooks/models/CRF.ipynb Normal file

@ -0,0 +1,308 @@
"cells": [
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
"source": [
"CRF trainer using the sklearn_crfsuite package (Python wrapper for CRFSuite):"
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"import sklearn_crfsuite\n",
"from sklearn_crfsuite import metrics\n",
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.crf_evaluator import CRFEvaluator\n",
"from presidio_evaluator.data_generator import read_synth_dataset"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATA_DATE = \"November 12 2019\""
"cell_type": "markdown",
"metadata": {},
"source": [
"Source a dataset to use for training / testing:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true,
"name": "#%%\n"
"outputs": [],
"source": [
"train_samples = read_synth_dataset(\"../../data/generated_train_{}.json\".format(DATA_DATE))\n",
"test_samples = read_synth_dataset(\"../../data/generated_test_{}.json\".format(DATA_DATE))"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
"outputs": [],
"source": [
"train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]\n",
"print(\"Kept {} train samples after removal of non-tagged samples\".format(len(train_tagged)))\n",
"train_data = InputSample.create_conll_dataset(train_tagged)\n",
"test_data = InputSample.create_conll_dataset(test_samples)\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Turn every sentence into a list of lists (list of tokens + pos + label)\n",
"test_sents=test_data.groupby('sentence')[['text','pos','label']].apply(lambda x: x.values.tolist())\n",
"train_sents=train_data.groupby('sentence')[['text','pos','label']].apply(lambda x: x.values.tolist())\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"Create features for CRF"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = [CRFEvaluator.sent2features(s) for s in train_sents]\n",
"y_train = [CRFEvaluator.sent2labels(s) for s in train_sents]\n",
"X_test = [CRFEvaluator.sent2features(s) for s in test_sents]\n",
"y_test = [CRFEvaluator.sent2labels(s) for s in test_sents]"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crf = sklearn_crfsuite.CRF(\n",
" algorithm='lbfgs',\n",
" c1=0.1,\n",
" c2=0.1,\n",
" max_iterations=100,\n",
" all_possible_transitions=True\n",
", y_train)"
"cell_type": "markdown",
"metadata": {},
"source": [
"Save trained model to pickle"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open(\"../../models/crf.pickle\",'wb') as f:\n",
" data = pickle.dump(crf, f,protocol=pickle.HIGHEST_PROTOCOL)\n",
" "
"cell_type": "markdown",
"metadata": {},
"source": [
"Open saved model"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"../../models/crf.pickle\", 'rb') as f:\n",
" crf = pickle.load(f)"
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract info and predictions from model"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = list(crf.classes_)\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred = crf.predict(X_test)\n",
"metrics.flat_f1_score(y_test, y_pred,\n",
" average='weighted', labels=labels)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## predict one:\n",
"y_5_pred = crf.predict([X_test[5]])\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# group B and I results\n",
"sorted_labels = sorted(\n",
" labels,\n",
" key=lambda name: (name[1:], name[0])\n",
" y_test, y_pred, labels=sorted_labels, digits=3\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Model explainability"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_transitions(trans_features):\n",
" for (label_from, label_to), weight in trans_features:\n",
" print(\"%-6s -> %-7s %0.6f\" % (label_from, label_to, weight))\n",
"print(\"Top likely transitions:\")\n",
"print(\"\\nTop unlikely transitions:\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_state_features(state_features):\n",
" for (attr, label), weight in state_features:\n",
" print(\"%0.6f %-8s %s\" % (weight, label, attr))\n",
"print(\"Top positive:\")\n",
"print(\"\\nTop negative:\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []

@ -0,0 +1,315 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spacy dataset creation"
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
"source": [
"This notebook takes train and test datasets (of type `List[InputSample]`)\n",
"and transforms them into two structures consumed by Spacy:\n",
"1. Spacy JSON (see\n",
"2. Spacy Pickle files (of structure `[(full_text,\"entities\":[(start, end, type),(...))]`. \n",
"See more details here:\n",
"JSON is used for Spacy's CLI trainer. \n",
"Pickle is used for fine-tuning using the logic in [../models/](../models/"
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"%reload_ext autoreload"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
"outputs": [],
"source": [
"DATA_DATE = 'November 12 2019'"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
"outputs": [],
"source": [
"data_path = \"../data/generated_{}_{}.json\"\n",
"train_samples = read_synth_dataset(data_path.format(\"train\",DATA_DATE))\n",
"print(\"Read {} samples\".format(len(train_samples)))"
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
"source": [
"For training, keep only sentences with entities:"
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"train_tagged = [sample for sample in train_samples if len(sample.spans)>0]\n",
"print(\"Kept {} samples after removal of non-tagged samples\".format(len(train_tagged)))"
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
"source": [
"Evaluate training set's entities"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
"outputs": [],
"source": [
"print(\"Entities found in training set:\")\n",
"entities = []\n",
"for sample in train_tagged:\n",
" entities.extend([tag for tag in sample.tags])\n",
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
"source": [
"Create Spacy dataset (option 2)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"import pickle\n",
"spacy_train = InputSample.create_spacy_dataset(train_tagged)\n"
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"entities_spacy = [x[1]['entities'] for x in spacy_train]\n",
"entities_spacy_flat = []\n",
"for samp in entities_spacy:\n",
" for ent in samp:\n",
" entities_spacy_flat.append(ent[2])\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Create Spacy dataset (option 1: JSON)"
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"spacy_train_json = InputSample.create_spacy_json(train_tagged)"
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick evaluation of samples"
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"[sample[0] for sample in spacy_train[:100]]"
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Dump training set to pickle and json respectively"
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"import json\n",
"with open(\"../data/train.pickle\", 'wb') as handle:\n",
" pickle.dump(spacy_train,handle, protocol=pickle.HIGHEST_PROTOCOL)\n",
"with open(\"../data/train.json\",\"w\") as f:\n",
" json.dump(spacy_train_json,f)\n",
" "
"cell_type": "markdown",
"metadata": {},
"source": [
"Create JSON and pickle files for test dataset"
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
"print(\"Read {} samples\".format(len(test_samples)))"
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"spacy_test = InputSample.create_spacy_dataset(test_samples)\n",
"spacy_test_json = InputSample.create_spacy_json(test_samples)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Dump test set to pickle and json respectively"
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open(\"../data/test.pickle\", 'wb') as handle:\n",
" pickle.dump(spacy_test,handle, protocol=pickle.HIGHEST_PROTOCOL)\n",
" \n",
"with open(\"../data/test.json\",\"w\") as f:\n",
" json.dump(spacy_test_json,f)\n",
" "
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []

@ -0,0 +1,380 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate CRF models for person names, orgs and locations using the Presidio Evaluator framework\n",
"Data = `generated_test_November 12 2019`"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from tqdm import tqdm_notebook as tqdm\n",
"import logging\n",
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"import spacy\n",
"import pandas as pd\n",
"import pickle\n",
"pd.set_option('display.width', 10000)\n",
"pd.set_option('display.max_colwidth', -1)\n",
"%reload_ext autoreload\n",
"%autoreload 2\n",
"DATA_DATE = 'November 12 2019'"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"data_path = \"../../data/generated_{}_{}.json\""
"cell_type": "markdown",
"metadata": {},
"source": [
"Select data for evaluation:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#test_samples = read_synth_dataset(data_path.format(\"test\", DATA_DATE))\n",
"val_samples = read_synth_dataset(data_path.format(\"validation\", DATA_DATE))\n",
"#synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
"#conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
"DATASET = val_samples"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from collections import Counter\n",
"entity_counter = Counter()\n",
"for sample in DATASET:\n",
" for tag in sample.tags:\n",
" entity_counter[tag]+=1"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#max length sentence\n",
"max([len(sample.tokens) for sample in DATASET])\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"Select models for evaluation:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"crf_vanilla = \"../../model-outputs/crf.pickle\"\n",
" \n",
"models = [crf_vanilla]"
"cell_type": "markdown",
"metadata": {},
"source": [
"Run evaluation on all models:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from presidio_evaluator.crf_evaluator import CRFEvaluator\n",
"for model in models:\n",
" print(\"-----------------------------------\")\n",
" print(\"Evaluating model {}\".format(model))\n",
" crf_evaluator = CRFEvaluator(model_pickle_path=model)\n",
" evaluation_results = crf_evaluator.evaluate_all(DATASET)\n",
" scores = crf_evaluator.calculate_score(evaluation_results)\n",
" \n",
" print(\"Confusion matrix:\")\n",
" print(scores.results)\n",
" print(\"Precision and recall\")\n",
" scores.print()"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Custom evaluation of the model"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try out the model\n",
"def sent_to_features(model_path,sent):\n",
" \"\"\"\n",
" Translates a sentence into a prediction using a saved CRF model\n",
" \"\"\"\n",
" \n",
" with open(model_path, 'rb') as f:\n",
" model = pickle.load(f)\n",
" \n",
" tokenizer = spacy.blank('en')\n",
" tokens = tokenizer(sent)\n",
" tags = ['O' for token in tokens] # Placeholder: Not used but required. \n",
" metadata = {'Template#':1,'Gender':'1','Country':'2'} #Placeholder: Not used but required\n",
" input_sample = InputSample(full_text=sent,masked=\"\",spans=None,tokens=tokens,tags=tags,metadata=metadata,create_tags_from_span=False,)\n",
" return CRFEvaluator.crf_predict(input_sample, model)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SENTENCE = \"Michael is American\"\n",
"sent_to_features(model_path=crf_vanilla, sent=SENTENCE)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False positives"
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Most false positive tokens:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"errors = scores.model_errors\n",
"from presidio_evaluator import ModelEvaluator\n",
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"2. review false positives for entity 'PERSON'"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False negative examples"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
"cell_type": "markdown",
"metadata": {},
"source": [
"More FN analysis"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []

@ -0,0 +1,326 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate Flair models for person names, orgs and locations using the Presidio Evaluator framework\n",
"Data = `generated_test_November 12 2019`"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"%reload_ext autoreload\n",
"%autoreload 2"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"DATA_DATE = \"November 12 2019\"\n",
"data_path = \"../../data/generated_{}_{}.json\""
"cell_type": "markdown",
"metadata": {},
"source": [
"Select data for evaluation:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
"#val_samples = read_synth_dataset(data_path.format(\"validation\",DATA_DATE))\n",
"#synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
"DATASET = conll_samples"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from collections import Counter\n",
"entity_counter = Counter()\n",
"for sample in DATASET:\n",
" for tag in sample.tags:\n",
" entity_counter[tag]+=1"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#max length sentence\n",
"max([len(sample.tokens) for sample in DATASET])"
"cell_type": "markdown",
"metadata": {},
"source": [
"Select models for evaluation:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"flair_ner = 'ner'\n",
"flair_ner_fast = 'ner-fast'\n",
"flair_ontonotes = 'ner-ontonotes-fast'\n",
"flair_bert_embeddings = '../../models/presidio-ner/'\n",
"glove_flair_embeddings = '../../models/presidio-ner/'\n",
"models = [glove_flair_embeddings]\n",
"#models = [flair_bert_embeddings, glove_flair_embeddings, flair_ner,flair_ner_fast,flair_ontonotes ]"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
"outputs": [],
"source": [
"from presidio_evaluator.flair_evaluator import FlairEvaluator\n",
"for model in models:\n",
" print(\"-----------------------------------\")\n",
" print(\"Evaluating model {}\".format(model))\n",
" flair_evaluator = FlairEvaluator(model_path=model)\n",
" evaluation_results = flair_evaluator.evaluate_all(DATASET)\n",
" scores = flair_evaluator.calculate_score(evaluation_results)\n",
" \n",
" \n",
" print(\"Confusion matrix:\")\n",
" print(scores.results)\n",
" print(\"Precision and recall\")\n",
" scores.print()\n",
" errors = scores.model_errors\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"Custom evaluation"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False positives"
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Most false positive tokens:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"errors = scores.model_errors\n",
"from presidio_evaluator import ModelEvaluator\n",
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']\n"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"2. False negative examples"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
"cell_type": "markdown",
"metadata": {},
"source": [
"More FN analysis"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
"outputs": [],
"source": [

@ -0,0 +1,404 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate Spacy models for person names, orgs and locations using the Presidio Evaluator framework\n",
"Data = `generated_test_November 12 2019`"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"import spacy\n",
"from presidio_evaluator import ModelEvaluator\n",
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"%reload_ext autoreload\n",
"%autoreload 2"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATA_DATE = \"November 12 2019\""
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#!pip freeze | grep en_core_web_lg\n",
"!pip freeze | findstr en-core-web-lg\n",
"!pip freeze | findstr spacy"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"data_path = \"../../data/generated_{}_{}.json\""
"cell_type": "markdown",
"metadata": {},
"source": [
"Select data for evaluation:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"# test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
"# print(len(test_samples))\n",
"# val_samples = read_synth_dataset(data_path.format(\"validation\",DATA_DATE))\n",
"# print(len(val_samples))\n",
"# synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
"# print(len(synth_samples))"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
"DATASET = conll_samples"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"from collections import Counter\n",
"entity_counter = Counter()\n",
"for sample in DATASET:\n",
" for span in sample.spans:\n",
" entity_counter[span.entity_type]+=1"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#max length sentence\n",
"max([len(sample.tokens) for sample in DATASET])"
"cell_type": "markdown",
"metadata": {},
"source": [
"Select models for evaluation:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"models = []\n",
"en_core_web_lg = r\"en_core_web_lg\"\n",
"spacy_new_ontonotes28 = r\"C:\\Users\\ommendel\\OneDrive - Microsoft\\Projects\\presidio\\Presidio-internal\\presidio-evaluator\\models\\spacy_new_ontonotes28\"\n",
"spacy_ft_100 = r\"C:\\Users\\ommendel\\OneDrive - Microsoft\\Projects\\presidio\\Presidio-internal\\presidio-evaluator\\models\\spacy_ft_100\\model-final\"\n",
"models = [en_core_web_lg, spacy_new_ontonotes28, spacy_ft_100]"
"cell_type": "markdown",
"metadata": {},
"source": [
"Run evaluation on all models:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
"outputs": [],
"source": [
"from presidio_evaluator.spacy_evaluator import SpacyEvaluator\n",
"for model in models:\n",
" print(\"-----------------------------------\")\n",
" print(\"Evaluating model {}\".format(model))\n",
" nlp = spacy.load(model)\n",
" spacy_evaluator = SpacyEvaluator(model=nlp,entities_to_keep=['PERSON','GPE','ORG'])\n",
" evaluation_results = spacy_evaluator.evaluate_all(DATASET)\n",
" scores = spacy_evaluator.calculate_score(evaluation_results)\n",
" \n",
" print(\"Confusion matrix:\")\n",
" print(scores.results)\n",
" print(\"Precision and recall\")\n",
" scores.print()\n",
" errors = scores.model_errors"
"cell_type": "markdown",
"metadata": {},
"source": [
"Custom evaluation"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#evaluate custom sentences\n",
"nlp = spacy.load(spacy_ft_100)\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results analysis"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"#sent = input(\"Enter sentence: \")\n",
"sent = 'David is talking loudly'\n",
"doc = nlp(sent)\n",
"for ent in doc.ents:\n",
" print(\"Entity = {} value = {}\".format(ent.label_,ent.text))"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False positives"
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Most false positive tokens:"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='LOCATION')\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"2. False negative examples"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"errors = scores.model_errors\n",
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
"cell_type": "markdown",
"metadata": {},
"source": [
"More FN analysis"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='GPE')"
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"[print(error,\"\\n\") for error in errors]"

@ -0,0 +1,5 @@
from .span_to_tag import span_to_tag, tokenize
from .data_objects import Span, InputSample, EvaluationResult, ModelError
from .model_evaluator import ModelEvaluator
from .spacy_evaluator import SpacyEvaluator
from .presidio_api_evaluator import PresidioAPIEvaluator


@ -0,0 +1,97 @@
import pickle
from typing import List
from presidio_evaluator import ModelEvaluator, InputSample
class CRFEvaluator(ModelEvaluator):

    def __init__(self,
                 model_pickle_path: str = "../models/crf.pickle",
                 entities_to_keep: List[str] = None,
                 verbose: bool = False,
                 labeling_scheme: str = "BIO",
                 compare_by_io: bool = True):
        # Assumption: ModelEvaluator.__init__ accepts these keyword arguments
        super().__init__(entities_to_keep=entities_to_keep,
                         verbose=verbose,
                         labeling_scheme=labeling_scheme,
                         compare_by_io=compare_by_io)

        if model_pickle_path is None:
            raise ValueError("model_pickle_path must be supplied")

        # Load the trained CRF model from disk
        with open(model_pickle_path, 'rb') as f:
            self.model = pickle.load(f)

    def predict(self, sample: InputSample) -> List[str]:
        tags = CRFEvaluator.crf_predict(sample, self.model)

        if len(tags) != len(sample.tokens):
            print("mismatch between previous tokens and new tokens")
        # translated_tags = sample.rename_from_spacy_tags(tags)
        return tags

    @staticmethod
    def crf_predict(sample, model):
        conll = sample.to_conll(translate_tags=True)
        sentence = [(di['text'], di['pos'], di['label']) for di in conll]
        features = CRFEvaluator.sent2features(sentence)
        return model.predict([features])[0]

    @staticmethod
    def word2features(sent, i):
        word = sent[i][0]
        postag = sent[i][1]

        # Features of the current token
        features = {
            'bias': 1.0,
            'word.lower()': word.lower(),
            'word[-3:]': word[-3:],
            'word[-2:]': word[-2:],
            'word.isupper()': word.isupper(),
            'word.istitle()': word.istitle(),
            'word.isdigit()': word.isdigit(),
            'postag': postag,
            'postag[:2]': postag[:2],
        }
        # Features of the previous token, or a beginning-of-sentence marker
        if i > 0:
            word1 = sent[i - 1][0]
            postag1 = sent[i - 1][1]
            features.update({
                '-1:word.lower()': word1.lower(),
                '-1:word.istitle()': word1.istitle(),
                '-1:word.isupper()': word1.isupper(),
                '-1:postag': postag1,
                '-1:postag[:2]': postag1[:2],
            })
        else:
            features['BOS'] = True

        # Features of the next token, or an end-of-sentence marker
        if i < len(sent) - 1:
            word1 = sent[i + 1][0]
            postag1 = sent[i + 1][1]
            features.update({
                '+1:word.lower()': word1.lower(),
                '+1:word.istitle()': word1.istitle(),
                '+1:word.isupper()': word1.isupper(),
                '+1:postag': postag1,
                '+1:postag[:2]': postag1[:2],
            })
        else:
            features['EOS'] = True

        return features

    @staticmethod
    def sent2features(sent):
        return [CRFEvaluator.word2features(sent, i) for i in range(len(sent))]

    @staticmethod
    def sent2labels(sent):
        return [label for token, postag, label in sent]

    @staticmethod
    def sent2tokens(sent):
        return [token for token, postag, label in sent]


@ -0,0 +1,48 @@
# PII dataset generator
This data generator takes a text file with templates (e.g. `my name is [PERSON]`) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
It also creates Spans (the start and end of each entity), tokens (using the `spaCy` tokenizer) and tags in various schemes (BIO/IOB, IO, BILOU).
In addition, it provides some off-the-shelf features for each token, such as `pos`, `dep` and `is_in_vocabulary`.
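
To make the tagging schemes concrete, here is a minimal sketch of turning character spans into per-token tags. It assumes plain whitespace tokenization instead of the `spaCy` tokenizer, and the function name `spans_to_tags` and the `(start, end, entity_type)` tuple layout are hypothetical, chosen only for illustration; they are not the package's API.

```python
from typing import List, Tuple


def spans_to_tags(text: str,
                  spans: List[Tuple[int, int, str]],
                  scheme: str = "BIO") -> Tuple[List[str], List[str]]:
    """Whitespace-tokenize `text` and tag each token from character spans.

    `spans` holds hypothetical (start, end, entity_type) tuples with an
    exclusive end offset. A token is tagged only if it lies fully inside a span.
    """
    tokens: List[str] = []
    owners: List[object] = []  # index of the covering span, or None
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        owner = None
        for idx, (s, e, _) in enumerate(spans):
            if start >= s and end <= e:
                owner = idx
                break
        tokens.append(token)
        owners.append(owner)

    tags: List[str] = []
    for i, owner in enumerate(owners):
        if owner is None:
            tags.append("O")
            continue
        entity = spans[owner][2]
        first = i == 0 or owners[i - 1] != owner
        last = i == len(owners) - 1 or owners[i + 1] != owner
        if scheme == "IO":
            prefix = "I"
        elif scheme == "BIO":
            prefix = "B" if first else "I"
        elif scheme == "BILOU":
            if first and last:
                prefix = "U"
            elif first:
                prefix = "B"
            elif last:
                prefix = "L"
            else:
                prefix = "I"
        else:
            raise ValueError("Unknown scheme: {}".format(scheme))
        tags.append("{}-{}".format(prefix, entity))
    return tokens, tags


tokens, tags = spans_to_tags("My name is John Smith", [(11, 21, "PERSON")])
print(list(zip(tokens, tags)))
```

With `scheme="BILOU"`, the same sentence ends in `B-PERSON`, `L-PERSON`, and a single-token entity gets a `U-` prefix.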
The main class is `FakeDataGenerator`; in addition, the `main` module has two functions for creating and reading a fake dataset.
During the generation process, the tool takes fake PII from a provided CSV with a known format and/or from extension functions which can be found in the file.
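
The core idea of filling a template and recording where each entity lands can be sketched as follows. This is a simplified illustration, not the `FakeDataGenerator` implementation: the helper name `fill_template` and the `(start, end, entity_type)` span tuples are assumptions made for this example.

```python
import re
from typing import Dict, List, Tuple

# Placeholders look like [FIRST_NAME] or [LOCATION2]
PLACEHOLDER = re.compile(r"\[([A-Z_0-9]+)\]")


def fill_template(template: str,
                  values: Dict[str, str]) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Replace each [ENTITY] placeholder with its value and return the
    resulting sentence plus (start, end, entity_type) character spans."""
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0   # position in the template
    out_len = 0  # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)

        entity = match.group(1)
        value = values[entity]
        spans.append((out_len, out_len + len(value), entity))
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


sentence, spans = fill_template(
    "My name is [FIRST_NAME] and I live in [CITY]",
    {"FIRST_NAME": "Dana", "CITY": "Oslo"},
)
print(sentence)
print(spans)
```

Computing the spans while the sentence is assembled avoids searching for the inserted values afterwards, which would be ambiguous when a value also appears in the template text.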
At a high level, the process is the following:
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) adapt the FakeDataGenerator to support new extensions which could generate fake PII entities
3. Generate X samples using the templates list + a fake PII dataset + extensions that add additional PII entities
4. Split the generated dataset into train/test/validation sets while making sure that samples from the same template appear in only one set
5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
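
Step 4 can be illustrated with a small sketch that assigns whole templates, rather than individual samples, to each set. It assumes every sample carries the index of the template it was generated from; the function name `split_by_template` and the default ratios are hypothetical and only serve the example.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_template(template_ids: Sequence[int],
                      ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
                      seed: int = 0) -> Dict[str, List[int]]:
    """Return sample indices per set so that all samples sharing a
    template id end up in exactly one of train/test/validation."""
    unique_templates = sorted(set(template_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_templates)

    n_train = int(len(unique_templates) * ratios[0])
    n_test = int(len(unique_templates) * ratios[1])

    # Decide the set of each template first ...
    set_of_template = {}
    for i, template_id in enumerate(unique_templates):
        if i < n_train:
            set_of_template[template_id] = "train"
        elif i < n_train + n_test:
            set_of_template[template_id] = "test"
        else:
            set_of_template[template_id] = "validation"

    # ... then route every sample according to its template.
    splits: Dict[str, List[int]] = {"train": [], "test": [], "validation": []}
    for sample_index, template_id in enumerate(template_ids):
        splits[set_of_template[template_id]].append(sample_index)
    return splits


# 100 samples generated from 10 templates
template_ids = [i % 10 for i in range(100)]
splits = split_by_template(template_ids)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting on templates means the resulting set sizes only approximate the requested ratios, since templates may not produce the same number of samples.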
- For steps 5, 6, 7 see the main [README](../../
- For a simple data generation pipeline, [see this notebook](../../notebooks/Generate data.ipynb).
- For information on transforming a NER dataset into templates, see the notebooks in the [helper notebooks](helper%20notebooks) folder.
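
As a rough illustration of step 1, a BIO-tagged sentence can be collapsed into a template by replacing each entity with a single placeholder. This is a simplified sketch under the assumption that the input is a list of `(word, tag)` pairs; the helper name `to_template` is hypothetical, and the helper notebooks apply many additional clean-up rules.

```python
from typing import List, Tuple


def to_template(tagged_tokens: List[Tuple[str, str]]) -> str:
    """Turn (word, BIO tag) pairs into a template string.

    A 'B-X' token becomes the placeholder [X]; the following 'I-X'
    tokens are dropped because they belong to the same entity.
    """
    parts: List[str] = []
    for word, tag in tagged_tokens:
        if tag == "O":
            parts.append(word)
        elif tag.startswith("B-"):
            parts.append("[" + tag[2:] + "]")
        # 'I-' tokens continue the previous entity and are skipped
    return " ".join(parts)


sentence = [("My", "O"), ("name", "O"), ("is", "O"),
            ("John", "B-PERSON"), ("Smith", "I-PERSON")]
print(to_template(sentence))  # My name is [PERSON]
```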
Example run:
TEMPLATES_FILE = 'raw_data/templates.txt'
OUTPUT = "generated_.txt"
## Should be downloaded from FakeNameGenerator
fake_pii_csv = 'raw_data/FakeNameGenerator.csv'
examples = generate(fake_pii_csv=fake_pii_csv,
ignore_types={"IP_ADDRESS", 'US_SSN', 'URL'},
*Copyright notice:*
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.


@ -0,0 +1,2 @@
from .generator import FakeDataGenerator
from .main import generate, read_synth_dataset


@ -0,0 +1,124 @@
import random
import pandas as pd
from faker import Faker
from haikunator import Haikunator
from presidio_evaluator.data_generator.nationality_generator import NationalityGenerator
from presidio_evaluator.data_generator.org_name_generator import OrgNameGenerator
fake = Faker()
haikunator = Haikunator()
IP_V4_RATIO = 0.8
org_name_generator = OrgNameGenerator()
nationality_generator = NationalityGenerator()
def generate_url(domain: pd.Series):
    def generate_url_postfix():
        length = random.randint(4, 8)
        delim = "/" if random.random() > 0.5 else ""
        # Assumption: `length` was meant to set the random token length
        postfix = haikunator.haikunate(delimiter=delim,
                                       token_length=length)
        return postfix

    def generate_url_prefix():
        rand = random.random()
        if rand < 0.3:
            return "http://"
        elif rand < 0.6:
            return "http://www."
        return ""

    def concat_url(prefix, domain, postfix):
        return "{}{}/{}".format(prefix, domain, postfix)

    return domain.apply(lambda x: concat_url(generate_url_prefix(), x.lower(), generate_url_postfix()))
# urls = []
# for index, value in domain.items():
# url = "{}{}/{}".format(generate_url_prefix(), value.lower(), generate_url_postfix())
# urls.append(url)
# return urls
def generate_SSNs(length):
return [fake.ssn() for _ in range(length)]
def generate_iban(country: pd.Series):
    def generate_one_iban(cntry):
        from schwifty.iban import _get_iban_spec, code_length, IBAN
        try:
            spec = _get_iban_spec(cntry)
            bank_code_length = code_length(spec, 'bank_code')
            branch_code_length = code_length(spec, 'branch_code')
            bank_and_branch_code_length = bank_code_length + branch_code_length
            account_code_length = code_length(spec, 'account_code')
            # Use integer powers: random.randint requires integer bounds
            bank_code = random.randint(1, 10 ** bank_and_branch_code_length - 1)
            account_code = random.randint(1, 10 ** account_code_length - 1)
            iban = IBAN.generate(cntry, str(bank_code), str(account_code))
            return iban.formatted
        except ValueError:
            # Failed to generate an IBAN; fall back to a fixed value
            return "IL270126100000000544211"

    return country.apply(generate_one_iban)
def generate_company_names(length):
return [org_name_generator.get_organization() for _ in range(length)]
def generate_ip_addresses(length):
    def generate_one():
        # IP_V4_RATIO is the share of IPv4 addresses, so draw IPv4 when below it
        v = 4 if random.random() < IP_V4_RATIO else 6
        return fake.ipv4() if v == 4 else fake.ipv6()

    return [generate_one() for _ in range(length)]
def generate_title(gender=None):
    MALE_TITLES = ['Mr.', 'Dr.', 'Professor.', 'Eng.', 'Prof.', 'Doctor.']
    FEMALE_TITLES = ['Mrs.', 'Ms.', 'Miss', 'Dr.', 'Professor.', 'Eng.', 'Prof.', 'Doctor']
    if gender is None:
        # Assumption: pick a gender at random instead of failing on None
        gender = random.choice(['male', 'female'])
    if gender.lower() == 'male':
        return random.choices(MALE_TITLES, weights=[0.7, 0.1, 0.05, 0.05, 0.05, 0.05])[0]
    return random.choices(FEMALE_TITLES, weights=[0.3, 0.25, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05])[0]
def generate_titles(gender: pd.Series):
return gender.apply(generate_title)
def generate_roles(length):
roles = ['President', 'Vice-president', 'Chief of staff', 'Chief Architect', 'CEO', 'CFO', 'Engineer', 'Accountant',
'Attorney', 'Scientist', 'Journalist', 'Operator', 'CIO', "Chief Information Officer", "General Manager",
"Manager", "Chief Executive Officer", 'Actuary', 'Secretary', 'Prime minister', 'Minister', 'Director']
return [random.choice(roles) for _ in range(length)]
def generate_nationality(length):
return [nationality_generator.get_nationality() for _ in range(length)]
def generate_country(length):
return [nationality_generator.get_country() for _ in range(length)]
def generate_nation_woman(length):
return [nationality_generator.get_nation_woman() for _ in range(length)]
def generate_nation_man(length):
return [nationality_generator.get_nation_man() for _ in range(length)]
def generate_nation_plural(length):
return [nationality_generator.get_nation_plural() for _ in range(length)]


@ -0,0 +1,343 @@
import random
from typing import List
import re
from collections import Counter
import pandas as pd
from spacy.tokens import Token
from tqdm import tqdm
from presidio_evaluator import Span, InputSample
from presidio_evaluator.data_generator.extensions import generate_iban, generate_ip_addresses, generate_SSNs, \
generate_company_names, generate_url, generate_roles, generate_titles, generate_nationality, generate_nation_man, \
generate_nation_woman, generate_nation_plural, generate_title, generate_country
class FakeDataGenerator:
    def __init__(self, fake_pii_df: pd.DataFrame, templates: List[str],
                 lower_case_ratio: float = 0.5, include_metadata=True,
                 dictionary_path: str = None,
                 ignore_types=None, span_to_tag=True, labeling_scheme="BILOU"):
        """
        Fake data generator.
        Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
        e.g. "My name is [FIRST_NAME]"
        :param fake_pii_df:
        A pd.DataFrame with a predefined set of PII entities as columns created using
        :param templates: A list of templates
        with placeholders for PII entities.
        For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
        Note that in case you have multiple entities of the same type
        in a template, you should put a number on the second. For example:
        "I'm changing my name from [FIRST_NAME] to [FIRST_NAME2]".
        More than two are currently not supported but extending this
        is straightforward.
        :param lower_case_ratio: Fraction of generated samples that are
        converted to lower case
        :param include_metadata: Whether to include additional
        information in the output
        (e.g. NameSet from which the name was taken, gender, country etc.)
        :param dictionary_path: A path to a csv containing a vocabulary of
        a language, to check if a token exists in the vocabulary or not.
        :param ignore_types: set of types to ignore
        :param span_to_tag: whether to tokenize the generated samples or not
        :param labeling_scheme: labeling scheme (BILOU, BIO, IO)
        """
        if ignore_types is None:
            ignore_types = set()
        self.lower_case_ratio = lower_case_ratio
        self.include_metadata = include_metadata
        self.ignore_types = ignore_types
        if dictionary_path:
            vocab_df = pd.read_csv(dictionary_path, sep=',')
            self.vocabulary_words = set(vocab_df['WORD'].values.tolist())
        else:
            print("Warning: Dictionary path not provided. "
                  "Feature `is_in_vocabulary` will be set to False for all samples")
            self.vocabulary_words = []

        if templates:
            self.templates = self.prep_templates(templates)
        else:
            print("Warning: templates not provided")
            self.templates = None

        self.original_pii_df = fake_pii_df
        self.fake_pii = None
        self.span_to_tag = span_to_tag
        self.labeling_scheme = labeling_scheme
def get_is_in_vocabulary(self, token):
return token.text.lower() in self.vocabulary_words
def prep_fake_pii(self, df):
print("Preparing fake PII data for ingestion")
# define new column names
column_names = {"Surname": "LAST_NAME", "GivenName": "FIRST_NAME",
"Title": "TITLE", "Gender": "GENDER",
"City": "CITY", "ZipCode": "ZIP",
"CountryFull": "COUNTRY",
"Occupation": "OCCUPATION",
"TelephoneNumber": "PHONE_NUMBER",
"CCNumber": "CREDIT_CARD", "Birthday": "BIRTHDAY",
"EmailAddress": "EMAIL",
"StreetAddress": "FULL_ADDRESS",
"Domain": "DOMAIN_NAME"}
# Remove brackets as they interfere with the process
def remove_brackets(series):
    if series.dtype == object or series.dtype == str:
        # regex=False: "[" and "]" must be treated as literal characters
        series = series.str.replace("[", "(", regex=False)
        series = series.str.replace("]", ")", regex=False)
    return series
df = df.apply(remove_brackets, axis=0)
# change column names
column_names = {key: value for (key, value) in column_names.items() if value not in self.ignore_types}
df.rename(columns=column_names, inplace=True)
df["PERSON"] = df["FIRST_NAME"] + " " + df["LAST_NAME"]
df['COUNTRY'] = generate_country(len(df)) # replace previous country which has limited options
# Copied entities
df["DATE"] = df["BIRTHDAY"]
df['LOCATION'] = df[random.choice(["CITY", "COUNTRY"])].str.title()
df['LOCATION'] = self.reshuffle_entity(df['LOCATION']) # Reshuffle to not have the same location and country
if 'ADDRESS' not in self.ignore_types:
# title and role
if 'ROLE' not in self.ignore_types:
print("Generating roles")
df['ROLE'] = generate_roles(length=len(df))
if 'TITLE' not in self.ignore_types:
print("Generating titles")
df['TITLE'] = generate_titles(df['GENDER'])
df['FEMALE_TITLE'] = [generate_title('female') for _ in range(len(df))]
df['MALE_TITLE'] = [generate_title('male') for _ in range(len(df))]
if 'NATIONALITY' not in self.ignore_types:
print("Generating nationalities")
df['NATIONALITY'] = generate_nationality(len(df))
df['NATION_MAN'] = generate_nation_man(len(df))
df['NATION_WOMAN'] = generate_nation_woman(len(df))
df['NATION_PLURAL'] = generate_nation_plural(len(df))
if 'IBAN' not in self.ignore_types:
print("Generating IBANs")
df['IBAN'] = generate_iban(df['COUNTRY']) # "IL270126100000000544211"
if 'IP_ADDRESS' not in self.ignore_types:
print("Generating IP addresses")
df['IP_ADDRESS'] = generate_ip_addresses(len(df))
if 'US_SSN' not in self.ignore_types:
print("Generating SSN numbers")
df['US_SSN'] = generate_SSNs(len(df))
if 'URL' not in self.ignore_types:
print("Generating URLs")
df['URL'] = generate_url(df['DOMAIN_NAME'])
if 'ORGANIZATION' not in self.ignore_types:
print("Generating company names")
df['ORG'] = generate_company_names(len(df))
df['ORGANIZATION'] = df[random.choice(["Company", "ORG"])].str.title()
print("Finished preparing fake PII data")
return df
    def address_parts(self, df):
        # extract street no, street and full address
        print("Generating address parts")
        if 'STREET_NO' not in self.ignore_types:
            # Assumption: the regex call lost in this extract was re.search
            df["STREET_NO"] = df["FULL_ADDRESS"].map(
                lambda r: re.search(r"([\d]+)", r).group(1))
        if 'STREET' not in self.ignore_types:
            df["STREET"] = df["FULL_ADDRESS"].map(
                lambda r: re.search(r"[\d]+(.*)", r).group(1))
        if 'ADDRESS' not in self.ignore_types:
            df["ADDRESS"] = df.apply(
                lambda r: "{0}, {2} {1}".format(r["FULL_ADDRESS"],
                                                r["ZIP"].replace(" ", ""),
                                                r["CITY"]), axis=1)
    @staticmethod
    def get_additional_entity(df, entity):
        return df.sample(1).iloc[0][entity]

    @staticmethod
    def reshuffle_entity(series):
        shuffled = series.sample(frac=1)
        shuffled.reset_index(inplace=True, drop=True)
        return shuffled

    @staticmethod
    def prep_templates(raw_templates):
        print("Preparing sample sentences for ingestion")
        # Todo: introduce typos
        templates = [l.strip().replace("[", "{").replace("]", "}") for l in
                     raw_templates]
        return templates

    @staticmethod
    def get_template_entities(template):
        templates = []
        entities_count = Counter()
        for m in re.finditer(r"\{([A-Z_0-9]+)\}", template):
            ent = m.groups()[0]
            start, end = m.span()
            entities_count[ent] += 1
            if entities_count.get(ent) == 1:
                templates.append(ent)
            else:
                # Add an index to all additional entities of this type (LOCATION2, LOCATION3 etc.)
                templates.append(ent + str(entities_count[ent]))

        for entity, count in entities_count.items():
            while count > 1:
                template = template.replace("{" + entity + "}", "{" + entity + str(count) + "}", 1)
                count -= 1
        return template, templates, entities_count
    def sample_examples(self, count):
        # `not df` is ambiguous for a DataFrame, so compare against None
        if self.fake_pii is None:
            self.fake_pii = self.prep_fake_pii(self.original_pii_df)

        for _ in tqdm(range(count)):
            template_sentence_index = random.choice(range(len(self.templates)))
            original_sentence = self.templates[template_sentence_index]
            fake_pii_sample = self.fake_pii.sample(1).iloc[0]
            # Find entities to be replaced + add running index for multiple entities of the same type
            original_sentence, replacements, entity_counts = self.get_template_entities(original_sentence)
            # Get additional fake entries in case of multiple entities of the same type
            fake_pii_sample_duplicated = self.add_duplicated_entities(fake_pii_sample, entity_counts)
            # Fill in fake entities for each template slot
            values = {}
            for h in replacements:
                if h in fake_pii_sample_duplicated:
                    values[h] = str(fake_pii_sample_duplicated[h])
                else:
                    print("Warning: entity {} is in the templates but not in the PII dataset. Ignoring.".format(h))
                    values[h] = ''
            # Create a new InputSample combining template with fake PII data
            input_sample = self.create_input_sample(original_sentence, values)
            if self.include_metadata:
                metadata = {"Gender": fake_pii_sample['GENDER'],
                            "NameSet": fake_pii_sample['NameSet'],
                            "Country": fake_pii_sample['COUNTRY'],
                            "Lowercase": input_sample.full_text.islower(),
                            "Template#": template_sentence_index
                            }
                input_sample.metadata = metadata
            # Creating tokens only after entities consolidation
            if self.span_to_tag:
                tokens, tags = input_sample.get_tags(scheme=self.labeling_scheme)
                input_sample.tokens = tokens
                input_sample.tags = tags
            yield input_sample
def consolidate_names(input_sample):
for span in input_sample.spans:
if span.entity_type in names:
span.entity_type = 'PERSON'
elif span.entity_type in locations:
span.entity_type = "LOCATION"
masked = input_sample.masked
for location in locations:
masked = masked.replace("[" + location + "]", "[LOCATION]")
for name in names:
masked = masked.replace("[" + name + "]", "[PERSON]")
input_sample.masked = masked
    def create_input_sample(self, original_sentence, values):
        """
        Creates an InputSample out of a template sentence
        and a dict of entity names and values
        :param original_sentence: template (e.g. My name is [FIRST_NAME])
        :param values: Key = entity name, value = entity value
        (e.g. {"TITLE":"Mr."})
        :return: an InputSample
        """
        sentence = original_sentence
spans = []
to_lower = random.random() < self.lower_case_ratio
i = 0
# replaces placeholders with values and retrieve indices
while i < len(sentence):
entity_start = re.search("{", sentence, flags=0)
if entity_start:
entity_start = entity_start.start()
entity_end = re.search("}", sentence[entity_start:],
flags=0).start() + entity_start
entity = sentence[entity_start + 1:entity_end]
entity_value = values[entity]
entity_value = entity_value.strip()
# Remove duplicate entity indices:
entity = ''.join(i for i in entity if not i.isdigit())
entity_value_len = len(entity_value)
sentence = sentence[:entity_start] + entity_value + sentence[
entity_end + 1:]
# replace a with an if
if ((sentence[entity_start - 2: entity_start].lower() == "a " and entity_start == 2)
or (sentence[entity_start - 3: entity_start].lower() == " a ")) \
and entity_value[0].lower() in ['a', 'e', 'i', 'o', 'u']:
sentence = sentence[:entity_start - 1] + "n " + sentence[entity_start:]
entity_start = entity_start + 1
if to_lower:
entity_value = entity_value.lower()
end_position=entity_start + entity_value_len))
i = entity_start + entity_value_len
if to_lower:
sentence = sentence.lower()
# Not creating tokens here since we're consolidating names afterwards
return InputSample(sentence, original_sentence, spans,
def add_duplicated_entities(self, fake_pii_sample, entity_counts):
for entity, ent_count in entity_counts.items():
while ent_count > 1:
fake_pii_sample[entity + str(ent_count)] = self.get_additional_entity(self.fake_pii, entity)
ent_count -= 1
return fake_pii_sample

"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook loads the CONLL2003 dataset using DeepPavlov and creates templates (utterances with placeholders) that a PII synthetic data generator can use to create new sentences.\n",
"The notebook additionally introduces two new entities: TITLE and ROLE, in order to overcome cases like \"UK David Scott called his wife\", where the original sentence is \"UK Prime Minister Boris Johnson called his wife\" as \"Prime Minister\" was originally tagged as PER in the original dataset. Same logic goes for titles, like Mr., Mrs., Ms."
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.options.display.max_rows = 4000\n",
"pd.set_option('display.max_colwidth', None)\n",
"from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader"
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"is_executing": false
"outputs": [],
"source": [
"reader = Conll2003DatasetReader()\n",
"dataset = reader.read(data_path=\"../../data\",dataset_name='conll2003')\n",
"#Note: make sure you haven't downloaded something else with this function before, \n",
"# as it will not download a new dataset (even if your previous download was for a different dataset)"
"cell_type": "markdown",
"metadata": {},
"source": [
"### To pandas + add sentence_idx"
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"new_dataset = [list(zip(a,b)) for a,b in dataset['train']]\n",
"df_list = []\n",
"sentence_id = 0\n",
"for sentence in new_dataset:\n",
" \n",
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
" df[\"sentence_idx\"] = sentence_id\n",
" sentence_id+=1\n",
" df_list.append(df)\n",
"ner_dataset = pd.concat(df_list)\n"
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"sentences = ner_dataset.groupby('sentence_idx')['word'].apply(lambda x: \" \".join(x))"
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example sentence:"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Unique entities\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace tokenization replacements"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['word'] = ner_dataset['word']\\\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# helper columns:\n",
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove unneeded (non PII) entities:"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_unwanted_tags(x):\n",
"    if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
"        return 'O'\n",
"    else:\n",
"        return x\n",
"ner_dataset['tag'] = ner_dataset['tag'].apply(remove_unwanted_tags)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove PERSON tags if preceding word is 'the' (e.g. the Bush administration)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
"def remove_tag_if_the_person(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove tag from 's (Joe Wilson's cat)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_apostrophe_after_tag(row):\n",
"    if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
"        return 'O'\n",
"    return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_apostrophe_after_tag,axis=1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Re-tag words from dictionaries (countries, nationalities, roles, titles)"
"cell_type": "markdown",
"metadata": {},
"source": [
"Nationalities and countries:"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"algeria\" in nationalities['country'].values"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['metadata'] = None\n",
"def get_nationality_as_metadata(row):\n",
" if row['word'].lower() in nationalities['country'].values:\n",
" return 'COUNTRY'\n",
" elif row['word'].lower() in nationalities['nationality'].values:\n",
" return 'NATIONALITY'\n",
" elif row['word'].lower() in nationalities['man'].values:\n",
" return 'NATION_MAN'\n",
" elif row['word'].lower() in nationalities['woman'].values:\n",
" return 'NATION_WOMAN'\n",
" elif row['word'].lower() in nationalities['plural'].values:\n",
" return 'NATION_PLURAL'\n",
" return row['metadata']\n",
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
"def update_tag_based_on_metadata(row):\n",
" if row['metadata'] is not None:\n",
" return \"B-\"+row['metadata']\n",
" else:\n",
" return row['tag']\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Titles"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MALE_TITLES = ['mr', 'dr', 'professor', 'eng','prof','doctor']\n",
"FEMALE_TITLES = ['mrs', 'ms', 'miss', 'dr', 'professor', 'eng', 'prof','doctor']\n",
"def get_title_as_metadata(row):\n",
" if row['word'].lower() in MALE_TITLES:\n",
" return 'MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES:\n",
" return 'FEMALE_TITLE'\n",
" return row['metadata']\n",
"def update_title_tag_if_missing(row):\n",
" if row['word'].lower() in MALE_TITLES and row['tag']=='O':\n",
" return 'B-MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES and row['tag']=='O':\n",
" return 'B-FEMALE_TITLE'\n",
" else:\n",
" return row['tag']\n",
"ner_dataset['metadata'] = ner_dataset.apply(get_title_as_metadata,axis=1)\n",
"ner_dataset['tag'] = ner_dataset.apply(update_title_tag_if_missing,axis=1)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove 'the' from 'the NORP' if NORP is not in nationalities list."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_the_norp(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove sentences with adjacent different entities (e.g. calling from New York Larry King)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
"ner_dataset['next-entity'] = ner_dataset['entity'].shift(-1)\n",
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
"sentences_to_remove = ner_dataset[adjacent_idc]['sentence_idx'].values\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Update tag for discovered metadata values (e.g. nationalities)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['tag'] = ner_dataset.apply(update_tag_based_on_metadata, axis=1)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create templates based on NER dataset\n",
"Here we create the actual templates and handle multiple edge cases that would otherwise produce odd template sentences. Note that a manual pass over the templates dataset is still required after this step."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"class SentenceGetter(object):\n",
" \n",
" def __init__(self, dataset):\n",
" self.n_sent = 1\n",
" self.dataset = dataset\n",
" self.empty = False\n",
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
" s[\"tag\"].values.tolist())]\n",
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
" self.sentences = [s for s in self.grouped]\n",
" \n",
" def get_next(self):\n",
" try:\n",
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
" self.n_sent += 1\n",
" return s\n",
" except:\n",
" return None\n",
" \n",
" @staticmethod \n",
" def cleanse_template(template, ents):\n",
" # Remove whitespace before certain punctuation marks\n",
" template = re.sub(r'\\s([?,:.!](?:|$))+', r'\\1', template)\n",
" \n",
" # Remove whitespaces within double quotes\n",
" template = re.sub('\\\"\\s*([^\\\"]*?)\\s*\\\"', r'\"\\1\"', template) \n",
" \n",
" # Remove whitespaces within quotes\n",
" template = re.sub(\"\\'\\s*([^\\']*?)\\s*\\'\", r\"'\\1'\", template) \n",
" \n",
" # Remove whitespaces within parentheses\n",
" template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
" \n",
" for ent in ents:\n",
" duplicates = \"[{}] [{}]\".format(ent,ent)\n",
" template = template.replace(duplicates,\"[{}]\".format(ent))\n",
" \n",
" \n",
" # Replace additional weird templates:\n",
" to_replace = {\n",
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
" \"[LOCATION] said\" : \"[PERSON] said\",\n",
" \"the [COUNTRY]\" : \"[COUNTRY]\",\n",
" \" 's \":\"'s\",\n",
" \"] 's \":\"]'s \",\n",
" \"] 's,\":\"]'s,\",\n",
" \"] 's.\":\"]'s.\",\n",
" \" n't\" : \"n't\",\n",
" \"/?\":\"?\",\n",
" \"%u\":\"u\",\n",
" \"%m\":\"m\",\n",
" \"%e\":\"e\", \n",
" \"%h\":\"h\", \n",
" \"%a\":\"a\",\n",
" \" %\":\"%\",\n",
" \" ?\":\"?\",\n",
" \" /?\":\"?\",\n",
" \" ' .\":\"'.\",\n",
" \"[ \":\"(\",\n",
" \" ]\":\")\",\n",
" \"[PERSON] -- [PERSON]\":\"[PERSON]\",\n",
" \"Jews\" : \"[NATIONALITY]\",\n",
" \"Chinese\" : \"[NATIONALITY]\",\n",
" \"Dutch\" : \"[NATIONALITY]\",\n",
" \"[LOCATION], [LOCATION]\":\"[LOCATION]\",\n",
" }\n",
" \n",
" for weird in to_replace.keys():\n",
" #if weird in template:\n",
" # print(\"Weird sentence\",template)\n",
" template = template.replace(weird,to_replace[weird])\n",
" \n",
" template = template.replace(\" -- \",\" - \")\n",
" \n",
" #Ignore templates that are incomplete\n",
" if \"/-\" in template:\n",
" template = \"\"\n",
" \n",
" #Ignore templates that have numbers after the end or start of the entity\n",
" if len(re.findall(r\"\\]\\s[0-9]\",template)) > 0:\n",
" template = \"\"\n",
" \n",
" if len(re.findall(r\"[0-9]\\s\\[\",template)) > 0:\n",
" template = \"\"\n",
" \n",
" if len(re.findall(r\"[0-9].\\s\\[\",template)) > 0:\n",
" template = \"\"\n",
" \n",
" \n",
" if \"[PERSON] ([COUNTRY])\" in template:\n",
" template = \"\"\n",
" if \"[PERSON] ([LOCATION])\" in template:\n",
" template = \"\"\n",
" \n",
" if template.count('\"') == 1:\n",
" template = template.replace('\"','')\n",
" return template\n",
" \n",
" @staticmethod \n",
" def get_template(grouped,entity_name_replace_dict):\n",
" template = \"\"\n",
" i=0\n",
" cur_index = 0\n",
" ents = []\n",
" for token in grouped:\n",
"            # remove brackets as they interfere with the data generation process\n",
"            token_text = token[0].replace(\"[\", \"(\").replace(\"]\",\")\")\n",
"            token_text = token_text.replace(\"{\", \"(\").replace(\"}\",\")\")\n",
" token_tag = token[1]\n",
" token_entity = token_tag[2:] if len(token_tag)>1 else token_tag\n",
" \n",
" if token_entity == 'O':\n",
" template += \" \" + token_text\n",
" elif 'B-' in token_tag and token_entity not in TAGS_TO_IGNORE:\n",
" #print(\"found entity: {}\".format(token_entity))\n",
" ent = entity_name_replace_dict[token_entity]\n",
" ents.append(ent)\n",
" \n",
" template += \" [\" + ent + \"]\"\n",
" #print(\"template: \",template)\n",
" \n",
" template = SentenceGetter.cleanse_template(template, ents)\n",
" \n",
" return template.strip()\n",
" \n",
"getter = SentenceGetter(ner_dataset)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ENTITIES_DICTIONARY = {\n",
"    \"PER\":\"PERSON\",\n",
"    \"GPE\":\"COUNTRY\",\n",
"    \"NORP\":\"LOCATION\",\n",
"    \"LOC\":\"LOCATION\",\n",
"    \"ORG\":\"ORGANIZATION\",\n",
"    \"MALE_TITLE\":\"MALE_TITLE\",\n",
"    \"COUNTRY\":\"COUNTRY\",\n",
"    \"NATION_MAN\":\"NATION_MAN\",\n",
"}\n",
"sentences = getter.sentences\n",
"sent_id = 445\n",
"print(\"template:\", getter.get_template(sentences[sent_id],entity_name_replace_dict=ENTITIES_DICTIONARY))"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_templates = [getter.get_template(sentence,entity_name_replace_dict=ENTITIES_DICTIONARY) for sentence in sentences]"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"original length of templates: {}\".format(len(all_templates)))\n",
"all_templates = list(set(all_templates))\n",
"print(\"length after duplicates removal: {}\".format(len(all_templates)))"
"cell_type": "markdown",
"metadata": {},
"source": [
"Save templates to file:"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"../raw_data/conll_based_templates.txt\",\"w+\",encoding='utf-8') as f:\n",
" for template in all_templates:\n",
" f.write(\"%s\\n\" % template) "
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate new examples based on this dataset: \n",
"This notebook takes the ner dataset from the previous link, and creates templates (utterances with placeholders) for a PII synthetic data generator to use in order to create new sentences.\n",
"Note that due to the nature of the tagging, there might be weird output sentences. For example:\n",
"- The same entity shows multiple times in sentence: \"I travel from Argentina to Argentina\"\n",
"- Bad grammer due to the lack of inflection and changes to nouns due to context: \"*The statement said no Denmark or India-led troops were killed*\" instead of \"*The statement said no Danish or Indian led troops were killed*\"\n",
"- Unrealistic sentences due to change in entities: \"Prime minister Lebron James enters the government building in Kuala Lumpur\"\n",
"The notebook additionally introduces two new entities: TITLE and ROLE, in order to overcome cases like \"UK David Scott called his wife\", where the original sentence is \"UK Prime Minister Boris Johnson called his wife\" as \"Prime Minister\" was originally tagged as PER in the original dataset. Same logic goes for titles, like Mr., Mrs., Ms."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#First, Download ner.csv from\n",
"ner_dataset = pd.read_csv(\"ner.csv\",encoding = \"ISO-8859-1\", error_bad_lines=False)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset = ner_dataset.drop_duplicates()\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Example sentence:"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### New entities - Title and Role\n",
"- **Title**: Mr., Mrs., Professor, Doctor, ...\n",
"- **Role**: President, Secretary General, U.N. Secretary, ..."
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick exploratory analysis of frequencies:\n",
"- First PER token\n",
"- Second PER token\n",
"- First and second PER token\n",
"- One before and first tokens of PER"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Evaluate words before I-per\n",
"bper = ner_dataset[ner_dataset['tag']=='B-per']\n",
"bper_tokens = bper['word']\n",
"prev_bper_token = bper['prev-word']\n",
"next_bper_token = bper['next-word']\n",
"two_prev_tokens = zip(prev_bper_token, bper_tokens)\n",
"two_next_tokens = zip(bper_tokens, next_bper_token)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"print(\"20 most common PER token frequencies:\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"20 most common previous and first PER token frequencies:\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"20 most common first and second PER token frequencies:\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Lists of titles and roles to update as ttl, rol\n",
"TITLES = ['Mr.','Ms.','Mrs.']\n",
"ROLES = ['President','General','Senator','Secretary-General','Minister','General']\n",
"BIGRAMS_ROLES = [('Prime','Minister'),('prime','minister'),('U.S.','President'),\n",
" ('Venezuelan', 'President'),('Vice','President'), ('Foreign', 'Minister'),\n",
" ('U.S.','Secretary'),('U.N.','Secretary'),('Defence','Secretary')]\n"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Update title and per for most common cases\n",
"def fix_bigram_title(df, row,index,first='Prime',second='Minister',tag='ttl'):\n",
" if row['word'] == first and row['next-word'] == second and 'per' in row['tag']:\n",
" df.loc[index,'tag'] = 'B-{}'.format(tag)\n",
" elif row['word'] == second and row['prev-word'] == first and 'per' in row['tag']:\n",
" df.loc[index,'tag'] = 'I-{}'.format(tag)\n",
" elif row['tag']== 'I-per' and row['prev-word'] == second and 'per' in row['tag']:\n",
" df.loc[index,'tag'] = 'B-per'\n",
"def fix_unigram_title(df, prev_row,prev_index, row , index, title='President',tag='ttl'):\n",
" #print(row)\n",
" if prev_row['word'] == title and prev_row['tag'] == 'B-per' and row['tag']=='I-per':\n",
" df.loc[prev_index,'tag']='B-{}'.format(tag)\n",
" df.loc[index,'tag'] = 'B-per'\n",
"prev_row = None\n",
"prev_index = None\n",
"for index, row in ner_dataset.iterrows():\n",
" # Handle 'Prime Minister'\n",
" for bigram in BIGRAMS_ROLES:\n",
" fix_bigram_title(ner_dataset,row,index,bigram[0],bigram[1],'rol')\n",
" if prev_row is not None:\n",
" for title in TITLES:\n",
" fix_unigram_title(df=ner_dataset,prev_row=prev_row,prev_index=prev_index,row=row,index=index,title=title,tag='ttl')\n",
" for role in ROLES:\n",
" fix_unigram_title(ner_dataset,prev_row,prev_index,row,index,role,'rol')\n",
" prev_row = row\n",
" prev_index = index"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# keep only relevant columns\n",
"dataset = ner_dataset[['sentence_idx','word','tag']]"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.to_csv(\"../../../datasets/ner_with_titles.csv\",encoding = \"ISO-8859-1\")"
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create templates base on NER dataset"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"class SentenceGetter(object):\n",
" \n",
" def __init__(self, dataset):\n",
" self.n_sent = 1\n",
" self.dataset = dataset\n",
" self.empty = False\n",
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
" s[\"tag\"].values.tolist())]\n",
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
" self.sentences = [s for s in self.grouped]\n",
" \n",
" def get_next(self):\n",
" try:\n",
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
" self.n_sent += 1\n",
" return s\n",
" except:\n",
" return None\n",
" \n",
" @staticmethod \n",
" def get_template(grouped,entity_name_replace_dict=None):\n",
" TAGS_TO_IGNORE = ['nat','eve','art','tim']\n",
" template = \"\"\n",
" i=0\n",
" cur_index = 0\n",
" ents = []\n",
" for token in grouped:\n",
" token_text = token[0].replace(\"[\", \"\").replace(\"]\",\"\")\n",
" token_tag = token[1]\n",
" if token_tag == 'O':\n",
" template += \" \" + token_text\n",
" elif 'B-' in token_tag and token_tag[2:] not in TAGS_TO_IGNORE:\n",
" if entity_name_replace_dict:\n",
" ent = entity_name_replace_dict[token[1][2:]]\n",
" else:\n",
" ent = token_tag[2:]\n",
" ents.append(ent)\n",
" template += \" [\" + ent + \"]\"\n",
" template = re.sub(r'\\s([?,\\':.!\"](?:|$))+', r'\\1', template)\n",
" \n",
" for ent in ents:\n",
" weird = \"[{}] [{}]\".format(ent,ent)\n",
" template = template.replace(weird,\"[{}]\".format(ent))\n",
" \n",
" #remove additional weird combinations:\n",
" \n",
" to_replace = {\n",
" \"[COUNTRY] [ROLE] [PERSON]\": \"[ROLE] [PERSON]\",\n",
" \"[COUNTRY] [ROLE]\" : \"[ROLE]\",\n",
" \"[COUNTRY] [LOCATION]\" : \"[LOCATION]\",\n",
" \"[LOCATION] [COUNTRY]\": \"[LOCATION]\",\n",
" \"[PERSON] [COUNTRY]\" : \"[PERSON]\",\n",
" \"[PERSON] [LOCATION]\" : \"[PERSON]\",\n",
" \"[COUNTRY] [PERSON]\" : \"[PERSON]\",\n",
" \"[LOCATION] [PERSON]\" : \"[PERSON]\"],\n",
" \"[PERSON] [ORGANIZATION]\" : \"[PERSON]\",\n",
" \"[ORGANIZATION] [PERSON]\" : \"[PERSON]\",\n",
" \"[PERSON] [PERSON]\": \"[PERSON]\",\n",
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
" \"[LOCATION] said\" : \"[PERSON] said\"\n",
" \n",
" \n",
" }\n",
" \n",
" for weird in to_replace.keys():\n",
" template = template.replace(weird,to_replace[weird])\n",
" \n",
" return template.strip()\n",
" \n",
"getter = SentenceGetter(dataset)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ENTITIES_DICTIONARY = {\"per\":\"PERSON\",\"gpe\":\"COUNTRY\",\"geo\":\"LOCATION\",\"org\":\"ORGANIZATION\",'ttl':'TITLE','rol':'ROLE'}\n",
"sentences = getter.sentences\n",
"print(\"template:\", getter.get_template(sentences[12],entity_name_replace_dict=ENTITIES_DICTIONARY))"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_templates = [SentenceGetter.get_template(sentence, ENTITIES_DICTIONARY) for sentence in sentences]\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save to file\n",
"with open(\"../../presidio_evaluator/data_generator/raw_data/new_templates2.txt\",\"w+\", encoding = \"ISO-8859-1\") as f:\n",
" for template in new_templates:\n",
" f.write(\"%s\\n\" % template)\n",
" "
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook takes the ontonoes ner dataset, and creates templates (utterances with placeholders) for a PII synthetic data generator to use in order to create new sentences.\n",
"The notebook additionally introduces two new entities: TITLE and ROLE, in order to overcome cases like \"UK David Scott called his wife\", where the original sentence is \"UK Prime Minister Boris Johnson called his wife\" as \"Prime Minister\" was originally tagged as PER in the original dataset. Same logic goes for titles, like Mr., Mrs., Ms."
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.options.display.max_rows = 4000\n",
"pd.set_option('display.max_colwidth', -1)"
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"## Download OntoNotes data\n",
"ontonotes = \"\""
"cell_type": "markdown",
"metadata": {},
"source": [
"### To pandas + add sentence_idx"
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df_list = []\n",
"sentence_id = 0\n",
"for sentence in ontonotes:\n",
" \n",
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
" df[\"sentence_idx\"] = sentence_id\n",
" sentence_id+=1\n",
" df_list.append(df)\n",
"ner_dataset = pd.concat(df_list)\n",
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"sentences = ner_dataset.groupby('sentence_idx')['word'].apply(lambda x: \" \".join(x))"
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example sentence:"
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Unique entities\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace tokenization replacements"
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['word'] = ner_dataset['word']\\\n",
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# helper columns:\n",
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove unneeded (non PII) entities:"
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"def remote_unwanted_tags(x):\n",
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
" return 'O'\n",
" else:\n",
" return x\n",
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove PERSON tags if preceding word is 'the' (e.g. the Bush administration)"
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
"def remove_tag_if_the_person(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove tag from 's (Joe Wilson's cat)"
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_apostraphe_after_tag(row):\n",
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Re-tag words from dictionaries (countries, nationalities, roles, titles)"
"cell_type": "markdown",
"metadata": {},
"source": [
"Nationalities and countries:"
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"\"algeria\" in nationalities['country'].values"
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['metadata'] = None\n",
"def get_nationality_as_metadata(row):\n",
" if row['word'].lower() in nationalities['country'].values:\n",
" return 'COUNTRY'\n",
" elif row['word'].lower() in nationalities['nationality'].values:\n",
" return 'NATIONALITY'\n",
" elif row['word'].lower() in nationalities['man'].values:\n",
" return 'NATION_MAN'\n",
" elif row['word'].lower() in nationalities['woman'].values:\n",
" return 'NATION_WOMAN'\n",
" elif row['word'].lower() in nationalities['plural'].values:\n",
" return 'NATION_PLURAL'\n",
" return row['metadata']\n",
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
"def update_tag_based_on_metadata(row):\n",
" if row['tag'] != 'O' and row['metadata'] is not None:\n",
" return row['tag'][:2] + row['metadata']\n",
" else:\n",
" return row['tag']\n",
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Titles"
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"MALE_TITLES = ['mr', 'dr', 'professor', 'eng','prof','doctor']\n",
"FEMALE_TITLES = ['mrs', 'ms', 'miss', 'dr', 'professor', 'eng', 'prof','doctor']\n",
"def get_title_as_metadata(row):\n",
" if row['word'].lower() in MALE_TITLES:\n",
" return 'MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES:\n",
" return 'FEMALE_TITLE'\n",
" return row['metadata']\n",
"def update_title_tag_if_missing(row):\n",
" if row['word'].lower() in MALE_TITLES and row['tag']=='O':\n",
" return 'B-MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES and row['tag']=='O':\n",
" return 'B-FEMALE_TITLE'\n",
" else:\n",
" return row['tag']\n",
"ner_dataset['metadata'] = ner_dataset.apply(get_title_as_metadata,axis=1)\n",
"ner_dataset['tag'] = ner_dataset.apply(update_title_tag_if_missing,axis=1)"
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove 'the' from 'the NORP' if NORP is not in nationalities list."
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_the_norp(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove sentences with adjacent different entities (e.g calling from New York Larry King)"
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
"sentences_to_remove = ner_dataset[adjacent_idc]['sentence_idx'].values\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Update tag for discovered metadata values (eg. nationalities)"
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['tag'] = ner_dataset.apply(update_tag_based_on_metadata, axis=1)"
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create templates base on NER dataset"
"cell_type": "code",
"execution_count": 331,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"class SentenceGetter(object):\n",
" \n",
" def __init__(self, dataset):\n",
" self.n_sent = 1\n",
" self.dataset = dataset\n",
" self.empty = False\n",
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
" s[\"tag\"].values.tolist())]\n",
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
" self.sentences = [s for s in self.grouped]\n",
" \n",
" def get_next(self):\n",
" try:\n",
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
" self.n_sent += 1\n",
" return s\n",
" except:\n",
" return None\n",
" \n",
" @staticmethod \n",
" def cleanse_template(template, ents):\n",
" # Remove whitespace before certain punctuation marks\n",
" template = re.sub(r'\\s([?,:.!](?:|$))+', r'\\1', template)\n",
" \n",
" # Remove whitespaces within double quotes\n",
" template = re.sub('\\\"\\s*([^\\\"]*?)\\s*\\\"', r'\"\\1\"', template) \n",
" \n",
" # Remove whitespaces within quotes\n",
" template = re.sub(\"\\'\\s*([^\\']*?)\\s*\\'\", r\"'\\1'\", template) \n",
" \n",
" # Remove whitespaces within parentheses\n",
" template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
" \n",
" for ent in ents:\n",
" duplicates = \"[{}] [{}]\".format(ent,ent)\n",
" template = template.replace(duplicates,\"[{}]\".format(ent))\n",
" \n",
" \n",
" # Replace additional weird templates:\n",
" to_replace = {\n",
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
" \"[LOCATION] said\" : \"[PERSON] said\",\n",
" \"the [COUNTRY]\" : \"[COUNTRY]\",\n",
" \" 's \":\"'s\",\n",
" \"] 's \":\"]'s \",\n",
" \"] 's,\":\"]'s,\",\n",
" \"] 's.\":\"]'s.\",\n",
" \" n't\" : \"n't\",\n",
" \"/?\":\"?\",\n",
" \"%u\":\"u\",\n",
" \"%m\":\"m\",\n",
" \"%e\":\"e\", \n",
" \"%h\":\"h\", \n",
" \"%a\":\"a\",\n",
" \" %\":\"%\",\n",
" \" ?\":\"?\",\n",
" \" /?\":\"?\",\n",
" \" ' .\":\"'.\",\n",
" \"[ \":\"(\",\n",
" \" ]\":\")\",\n",
" \"[PERSON] -- [PERSON]\":\"[PERSON]\",\n",
" \"Jews\" : \"[NATIONALITY]\",\n",
" \"Chinese\" : \"[NATIONALITY]\",\n",
" \"Dutch\" : \"[NATIONALITY]\",\n",
" }\n",
" \n",
" for weird in to_replace.keys():\n",
" #if weird in template:\n",
" # print(\"Weird sentence\",template)\n",
" template = template.replace(weird,to_replace[weird])\n",
" \n",
" template = template.replace(\" -- \",\" - \")\n",
" \n",
" #Ignore templates that are incomplete\n",
" if \"/-\" in template:\n",
" template = \"\"\n",
" \n",
" if template.count('\"') == 1:\n",
" template = template.replace('\"','')\n",
" return template\n",
" \n",
" @staticmethod \n",
" def get_template(grouped,entity_name_replace_dict):\n",
" template = \"\"\n",
" i=0\n",
" cur_index = 0\n",
" ents = []\n",
" for token in grouped:\n",
" # remove brackets as they interefere with the data generation process\n",
" token_text = token[0].replace(\"[\", \"(\").replace(\"]\",\")\")\n",
" token_text = token[0].replace(\"{\", \"(\").replace(\"}\",\")\")\n",
" token_tag = token[1]\n",
" token_entity = token_tag[2:] if len(token_tag)>1 else token_tag\n",
" \n",
" if token_entity == 'O':\n",
" template += \" \" + token_text\n",
" elif 'B-' in token_tag and token_entity not in TAGS_TO_IGNORE:\n",
" #print(\"found entity: {}\".format(token_entity))\n",
" ent = entity_name_replace_dict[token_entity]\n",
" ents.append(ent)\n",
" \n",
" template += \" [\" + ent + \"]\"\n",
" #print(\"template: \",template)\n",
" \n",
" template = SentenceGetter.cleanse_template(template, ents)\n",
" \n",
" return template.strip()\n",
" \n",
"getter = SentenceGetter(ner_dataset)"
"cell_type": "code",
"execution_count": 321,
"metadata": {},
"outputs": [],
"source": [
" \"GPE\":\"COUNTRY\",\n",
" \"NORP\":\"LOCATION\",\n",
" \"LOC\":\"LOCATION\",\n",
" \"ORG\":\"ORGANIZATION\",\n",
" \"MALE_TITLE\":\"MALE_TITLE\",\n",
" \"COUNTRY\":\"COUNTRY\",\n",
" \"NATION_MAN\":\"NATION_MAN\",\n",
" \n",
"sentences = getter.sentences\n",
"sent_id = 445\n",
"print(\"template:\", getter.get_template(sentences[sent_id],entity_name_replace_dict=ENTITIES_DICTIONARY))"
"cell_type": "code",
"execution_count": 322,
"metadata": {},
"outputs": [],
"source": [
"all_templates = [getter.get_template(sentence,entity_name_replace_dict=ENTITIES_DICTIONARY) for sentence in sentences]"
"cell_type": "code",
"execution_count": 323,
"metadata": {},
"outputs": [],
"source": [
"print(\"original length of templates: {}\".format(len(all_templates)))\n",
"all_templates = list(set(all_templates))\n",
"print(\"length after duplicates removal: {}\".format(len(all_templates)))"
"cell_type": "code",
"execution_count": 324,
"metadata": {},
"outputs": [],
"source": [
"# save to file\n",
"with open(\"../raw_data/ontonotes_based_templates.txt\",\"w+\",encoding='utf-8') as f:\n",
" for template in all_templates:\n",
" f.write(\"%s\\n\" % template)\n",
" "
"cell_type": "code",
"execution_count": 330,
"metadata": {},
"outputs": [],
"source": [
"template = \"[NATIONALITY]'s[MALE_TITLE]'\"\n",
"template = getter.cleanse_template(template,[])\n",
"#template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
"cell_type": "code",
"execution_count": 326,
"metadata": {},
"outputs": [],
"source": [
"if template.count(\"'\")==1:\n",
" print(True)\n",
" template = template.replace(\"'\",'')"
"cell_type": "code",
"execution_count": 327,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Exploratory data analysis on the OntoNotes dataset, to gain insights towards the templating of the dataset"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.options.display.max_rows = 4000\n",
"pd.set_option('display.max_colwidth', -1)"
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"conll = \"\" # Download CoNLL-2003\n",
"df_list = []\n",
"sentence_id = 0\n",
"for sentence in conll:\n",
" \n",
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
" df[\"sentence_idx\"] = sentence_id\n",
" sentence_id+=1\n",
" df_list.append(df)\n",
"ner_dataset = pd.concat(df_list)\n",
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def remote_unwanted_tags(x):\n",
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
" return 'O'\n",
" else:\n",
" return x\n",
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"sentences = ner_dataset.groupby('sentence_idx')['word'].transform(lambda x: ' '.join(x)).unique().tolist()"
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"with open(\"raw_sentences.txt\",\"w\",encoding=\"utf8\") as f:\n",
" for item in sentences:\n",
" f.write(\"{}\\n\".format(item))"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Number of labels per tag"
"cell_type": "code",
"execution_count": 261,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": 264,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['word'] = ner_dataset['word'].replace('-LRB-',')')\\\n",
"cell_type": "code",
"execution_count": 265,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Add lead and lag words and tags to dataset_no_punct"
"cell_type": "code",
"execution_count": 267,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"punct = [c for c in string.punctuation]\n",
"dataset_no_punct = ner_dataset[~ner_dataset.word.str.strip().isin(punct)]\n",
"dataset_no_punct['prev-word'] = dataset_no_punct.word.shift(1)\n",
"dataset_no_punct['prev-prev-word'] = dataset_no_punct['word'].shift(2)\n",
"dataset_no_punct['next-word'] = dataset_no_punct['word'].shift(-1)\n",
"dataset_no_punct['prev-tag'] = dataset_no_punct['tag'].shift(1)\n",
"dataset_no_punct['next-tag'] = dataset_no_punct['tag'].shift(-1)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Add features for easier manipulation"
"cell_type": "code",
"execution_count": 268,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Gather statistics on the first person token"
"cell_type": "code",
"execution_count": 269,
"metadata": {},
"outputs": [],
"source": [
"bper = dataset_no_punct[dataset_no_punct['tag']=='B-PERSON']"
"cell_type": "code",
"execution_count": 270,
"metadata": {},
"outputs": [],
"source": [
"# histogram of B-PERSON tokens\n",
"from collections import Counter\n",
"cell_type": "code",
"execution_count": 271,
"metadata": {},
"outputs": [],
"source": [
"prev_bper_token = bper['prev-word'].str.lower()\n",
"cell_type": "code",
"execution_count": 272,
"metadata": {},
"outputs": [],
"source": [
"prev_prev_bper_token = bper['prev-prev-word']\n",
"two_prev_tokens = zip(prev_prev_bper_token.str.lower(), prev_bper_token.str.lower())\n",
"cell_type": "code",
"execution_count": 273,
"metadata": {},
"outputs": [],
"source": [
"# find \"the\" followed by B-PERSON\n",
"the_PERSON = ner_dataset[(ner_dataset['prev-word'].str.lower()==\"the\") & (ner_dataset['tag']=='B-PERSON')]\n",
"print(the_PERSON['prev-word']+\" \"+the_PERSON['word']+\" \"+the_PERSON['next-word']+\" \"+the_PERSON['next-next-word'].values)"
"cell_type": "code",
"execution_count": 296,
"metadata": {},
"outputs": [],
"source": [
"## add metadata for nationalities (to differentiate between America, Americans and US citizen)\n",
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
"ner_dataset['metadata'] = None\n",
"def get_nationality_as_metadata(row):\n",
" if row['word'].lower() in nationalities['country'].values:\n",
" return 'COUNTRY'\n",
" elif row['word'].lower() in nationalities['nationality'].values:\n",
" return 'NATIONALITY'\n",
" elif row['word'].lower() in nationalities['man'].values:\n",
" return 'NATION_MAN'\n",
" elif row['word'].lower() in nationalities['woman'].values:\n",
" return 'NATION_WOMAN'\n",
" return row['metadata']\n",
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)"
"cell_type": "code",
"execution_count": 297,
"metadata": {},
"outputs": [],
"source": [
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
"def remove_tag_if_the_person(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" return row['tag']\n",
"def remove_tag_if_the_norp(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
"cell_type": "code",
"execution_count": 299,
"metadata": {},
"outputs": [],
"source": [
"# find \"the\" followed by B-NORP\n",
"the_NORP = ner_dataset[(ner_dataset['prev-word'].str.lower()==\"the\") & (ner_dataset['tag']=='B-NORP')]\n",
"print(the_NORP['prev-word']+\" \"+the_NORP['word']+\" \"+the_NORP['next-word']+\" \"+the_NORP['next-next-word'].values + \" (\" + the_NORP['metadata'] + \")\")"
"cell_type": "code",
"execution_count": 276,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_apostraphe_after_tag(row):\n",
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_apostraphe_after_tag,axis=1)"
"cell_type": "code",
"execution_count": 277,
"metadata": {},
"outputs": [],
"source": [
"sentences_with_president=ner_dataset[ner_dataset['word'].str.lower() == 'president']['sentence_idx']\n",
"cell_type": "code",
"execution_count": 279,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Adjacent tags"
"cell_type": "code",
"execution_count": 281,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
"cell_type": "code",
"execution_count": 286,
"metadata": {},
"outputs": [],
"source": [
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
"print(\"sentences with duplicate different entities: \",str(len(ner_dataset[adjacent_idc])))\n",
"cell_type": "code",
"execution_count": 289,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"NORP values"
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [],
"source": [
"norp_values = ner_dataset[ner_dataset['entity']=='NORP']['word']\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### The country?"
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [],
"source": [
"the_X_idx = (ner_dataset['prev-word']=='the') & (ner_dataset['tag'] != 'O')\n",
"the_X_sentences = ner_dataset[the_X_idx]['sentence_idx']\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
import datetime
import json
import pandas as pd
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import FakeDataGenerator
def read_utterances(utterances_file):
with open(utterances_file) as f:
return f.readlines()
def generate(fake_pii_csv,
:param fake_pii_csv: csv containing fake PII
:param utterances_file: txt file containing template sentences
:param output_file: filepath for json or csv output
:param num_of_examples: number of examples to generate
:param dictionary_path: path to vocabulary file
:param store_masked_text: Whether to remove or keep masked version of text
:param keep_only_tagged: Ignore utterances with no entity
(e.g. Remove: 'I went to the shop today', Keep: '[PERSON] went to the shop today')
:return: list of generated InputSamples
if not output_file:
raise ValueError("Please provide an output file path")
templates = read_utterances(utterances_file)
if keep_only_tagged:
templates = [template for template in templates if "[" in template]
df = pd.read_csv(fake_pii_csv, encoding='utf-8')
generator = FakeDataGenerator(fake_pii_df=df,
templates=templates, **kwargs)
counter = 0
examples = []
for example in generator.sample_examples(num_of_examples):
if not store_masked_text:
example.masked = None
examples_json = [example.to_dict() for example in examples]
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
json.dump(examples_json, f, ensure_ascii=False, indent=4)
print("generated {} examples".format(len(examples)))
print("Finished creating generated dataset. File location:{}".format(output_file))
return examples
def read_synth_dataset(filepath=None, length=None):
import json
with open(filepath, "r", encoding="utf-8") as f:
dataset = json.load(f)
if length:
dataset = dataset[:length]
input_samples = [InputSample.from_json(row) for row in dataset]
return input_samples
if __name__ == "__main__":
TEMPLATES_FILE = 'raw_data/templates.txt'
cur_time = datetime.date.today().strftime("%B %d %Y")
OUTPUT = "generated_size_{}_date_{}.txt".format(EXAMPLES, cur_time)
fake_pii_csv = '../../presidio_evaluator/data_generator/' \
utterances_file = TEMPLATES_FILE
dictionary_path = None
examples = generate(fake_pii_csv=fake_pii_csv,
# sanity
input_samples = read_synth_dataset(OUTPUT)
for sample in input_samples:
if len(sample.tags) != len(sample.tokens):
print("ERROR during generation. sample: {}".format(sample))

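The `keep_only_tagged` option in `generate` keeps only the templates that contain at least one placeholder. Below is a minimal sketch of that filter, assuming placeholders are written as `[ENTITY]`. The helper name `filter_tagged_templates` is hypothetical and not part of the package, and the stripping of trailing newlines is an addition that the original filter does not perform.

```python
from typing import List


def filter_tagged_templates(templates: List[str]) -> List[str]:
    """Keep only templates that contain an opening bracket."""
    # Assumption: strip the newlines left by readlines() and drop empty lines.
    cleaned = [t.strip() for t in templates if t.strip()]
    # Same test as the filter in generate(): a "[" marks a placeholder.
    return [t for t in cleaned if "[" in t]


templates = [
    "I went to the shop today\n",
    "[PERSON] went to the shop today\n",
]
print(filter_tagged_templates(templates))
```

A plain substring test is cheap but would also keep a line with a stray `[` that is not a real placeholder, so a stricter pattern may be worth considering.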
import random
import os
from pathlib import Path
import pandas as pd
import re
class NationalityGenerator:
def __init__(self, company_name_file_path="raw_data/nationalities.csv"):
dir_path = os.path.dirname(os.path.realpath(__file__))
file_path = Path(dir_path, company_name_file_path)
df = pd.read_csv(str(file_path))
self.df = df
def get_country(self):
return NationalityGenerator.capitalizeWords(random.choice(self.df['country'].values))
def get_nationality(self):
return NationalityGenerator.capitalizeWords(random.choice(self.df['nationality'].values))
def get_nation_woman(self):
return NationalityGenerator.capitalizeWords(random.choice(self.df['woman'].values))
def get_nation_man(self):
return NationalityGenerator.capitalizeWords(random.choice(self.df['man'].values))
def get_nation_plural(self):
return NationalityGenerator.capitalizeWords(random.choice(self.df['plural'].values))
def capitalizeWords(s):
return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)

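The `capitalizeWords` helper above relies on `re.sub` with a callable replacement. Below is a small standalone sketch of that technique, assuming the callback capitalizes each matched word. The function name `capitalize_words` is illustrative only.

```python
import re


def capitalize_words(s: str) -> str:
    """Capitalize every run of word characters in s."""
    # re.sub calls the lambda once per match and inserts its return value.
    return re.sub(r"\w+", lambda m: m.group(0).capitalize(), s)


print(capitalize_words("united states of america"))
```

Note that `str.capitalize` also lowercases the rest of each word, so an acronym such as `USA` would come out as `Usa`.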

@ -0,0 +1,16 @@
import random
import os
from pathlib import Path
class OrgNameGenerator:
def __init__(self, company_name_file_path="raw_data/organizations.csv"):
self.companies = []
dir_path = os.path.dirname(os.path.realpath(__file__))
file_path = Path(dir_path, company_name_file_path)
with open(str(file_path)) as file:
self.companies = file.read().splitlines()
def get_organization(self):
return random.choice(self.companies)

Seeds of peace
The Bill & Melinda Gates Foundation
Sky News
I want to increase limit on my card # [CREDIT_CARD] for certain duration of time. is it possible?
My credit card [CREDIT_CARD] has been lost, Can I request you to block it.
Need to change billing date of my card [CREDIT_CARD]
I want to upadte my primary and secondary address to same: [ADDRESS]
In case of my child's account, we need to add [PERSON] as guardian
Are there any charges applied for money transfer from [IBAN] to other bank accounts
Are there any charges applied to widraw money from ATM with the card [CREDIT_CARD]
Not getting bank documents on my addres. Can you please validate the following [ADDRESS]
Please update billing addrress with [ADDRESS] for this card: [CREDIT_CARD]
Need to see last 10 transaction of card [CREDIT_CARD]
I have lost my card [CREDIT_CARD]. Could you please block my credit card ASAP ? , My name is [PERSON].
My card [CREDIT_CARD] is expiring this month. Please let me know process to it's extend validity.
I have done an online order but didn't get any message on my registered [PHONE_NUMBER]. Could you please look into it ?
What is procedure to redeem points won on credit card [CREDIT_CARD] transactions ?
My card [CREDIT_CARD] expires soon - when will I get a new one?
How do I check my balance on my credit card?
Could I change the payment due date of my credit card?
How can I request a new credit card pin ?
Can I withdraw cash using my card [CREDIT_CARD] at aTM center ?
How do I change the address linked to my credit card to [ADDRESS]?
How do I open my credit card statement?
I'm originally from [COUNTRY]
I will be travelling to [COUNTRY] next week, so I need my passport to be ready by then
Who's coming to [COUNTRY] with me?
[COUNTRY] was super fun to visit!
Could you please email me the statement for laste month , my credit card number is [CREDIT_CARD]?
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL]?
How do I change my address to [ADDRESS] for post mail?
My name appears incorrectly on credit card statement could you please correct it to [TITLE] [PERSON]?
card number [CREDIT_CARD] is lost, can you please send a new one to [ADDRESS] i am in [CITY] for a business trip
Please transfer all funds from my account to this hackers' [EMAIL]
I can't browse to your site, keep getting address [IP_ADDRESS] blocked error
My religion does not allow speaking to bots, they are evil and hacked by the Devil
Excuse me, Sir bot, but I really don't like this tone
Please have the manager call me at [PHONE_NUMBER] I'd like to join accounts with ms. [FIRST_NAME]
Inject SELECT * FROM Users WHERE clinet_ip = ?%//!%20\|[IP_ADDRESS]|%20/
[FIRST_NAME], can I please speak to your boss?
May I request to have the statement sent to [ADDRESS]?
Will my account stay active? It's under my partner's name [PERSON]
What are my options?
Bot: Where would you like this to be sent to? User: [ADDRESS]
Bot: What's the name on the account? User: [PERSON]
I would like to stop receiving messages to [PHONE_NUMBER]
I would like to remove my kid [FIRST_NAME] from the will. How do I do that?
The name in the account is not correct, please change it to [PERSON]
Hello I moved, please update my new address is [ADDRESS]
I need to add addresses, here they are: [ADDRESS], [ADDRESS]
Please send my portfolio to this email [EMAIL]
Hello, this is [TITLE] [PERSON]. Who are you?
I want to add [PERSON] as a beneficiary to my account
I want to cancel my card [CREDIT_CARD] because I lost it
Please block card no [CREDIT_CARD]
What is the limit for card [CREDIT_CARD]?
Can someone call me on [PHONE_NUMBER]? I have some questions about opening an account.
My nam is [FIRST_NAME]
I'm moving out of the country, so please cancel my subscription
My name is [PERSON] but everyone calls me [FIRST_NAME]
Please tell me your date of birth. It's [BIRTHDAY]
You said your email is [EMAIL]. Is that correct?
I once lived in [ADDRESS]. I now live in [ADDRESS]
I'd like to order a taxi to [ADDRESS]
Please charge my credit card. Number is [CREDIT_CARD]
What's your email? [EMAIL]
What's your credit card? [CREDIT_CARD]
What's your name? [PERSON]
What's your last name? [LAST_NAME]
How can we reach you? You can call [PHONE_NUMBER]
I'd like it to be sent to [ADDRESS]
Meet me at [ADDRESS]
So where are we meeting? There's this nice new Thai place downtown. Cool, what's the address? Oh do they serve vegan stuff? It's in [ADDRESS]
Hi [FIRST_NAME], I'm contacting you about a problem I have with sending a wire transfer using this IBAN [IBAN]
She was born on [BIRTHDAY]. Her maiden name is [LAST_NAME]
Sometimes people call me [FIRST_NAME]
Maybe it's under [PERSON]
It's like that since [BIRTHDAY]
Just posted a photo [URL]
My website is [URL]
I've shared files with you [URL]
[PERSON] from [ORGANIZATION] is the keynote speaker
The address of [ORGANIZATION] is [ADDRESS]
His social security number is [US_SSN]
Here's my SSN: [US_SSN]
[FIRST_NAME] is a very sympathetic person. He's also a good listener
[FIRST_NAME] is very reliable. You can always depend on him.
Why is [FIRST_NAME] so impulsive?
[PERSON] will be talking in the conference
have you heard [PERSON] speak yet?
Have you been to a [PERSON] concert before?
I'm so jealous! said [FIRST_NAME] to [FIRST_NAME]
The true gender of [FIRST_NAME] has been under debate for years, but the riff and building energy is a rock masterpiece regardless.
For my take on Mr. [LAST_NAME], see Guilty Pleasures: 5 Musicians Of The 70s You're Supposed To Hate (But Secretly Love)
Unlike the [LAST_NAME] novel, it's not about necrophilia. What it is about, I suppose is anyone's guess. A brilliant piece of baroque pop.
One of the most depressing songs on the list. He's injured from the waist down from [COUNTRY], but [FIRST_NAME] just has to get laid. Don't go to town, [FIRST_NAME]!
Is there a better crafted pop song on this list? [LAST_NAME] and [LAST_NAME] were precision engineers.
C'mon, sing it with me: "You picked a fine time to leave me [FIRST_NAME], four hungry children and a crop in the field..."
A tribute to [PERSON] – sadly, she wasn't impressed.
When they weren't singing about Hobbits, satanic felines and interstellar journeys, they were singing about the verses from [PERSON]'s Cautionary Tales. Is there a better example of unbridled creativity than early [LAST_NAME]?
A great song made even greater by a mandolin coda (not by [PERSON]).
[PERSON] listed his top 20 songs for Entertainment Weekly and had the balls to list this song at #15. (What did he put at #1 you ask? Answer:"Tube Snake Boogie" by [PERSON] – go figure)
From the film American Graffiti (also features [PERSON]). What's not to love?
You can tell [FIRST_NAME] was a huge [PERSON] fan. Written when he was only 14.
This song by ex-Zombie [LAST_NAME] is a perfect example of why you shouldn't concentrate on the order of this list. An argument could be made that this should be at number one, and I wouldn't argue with it.
The title refers to [STREET] Street in [CITY]. It was on this street that many of the clubs where Metallica first played were situated. "Battery is found in me" shows that these early shows on [STREET] Street were important to them. Battery is where "lunacy finds you" and you "smash through the boundaries."
Blink-182 pay tribute here to the [COUNTRY]. Producer [PERSON] explained to Fuse TV: "We all liked the idea of writing a song about our state, where we live and love. To me it's the most beautiful place in the world, this song was us giving credit to how lucky we are to have lived here and grown up here, raising families here, the whole thing."
It may be too that [LAST_NAME] was influenced by an earlier song, "Carry Me Back To [COUNTRY]," which was arranged and sung by [PERSON] in 1847 (though [LAST_NAME]'s song was actually about a boat!).
The [PERSON] version recorded for [ORGANIZATION] became the first celebrity recording by a classical musician to sell one million copies. The song was awarded the seventh gold disc ever granted.
In [COUNTRY] they have company songs, musical expressions of employee loyalty sung by salarymen. Unfortunately, as regular RR commenter [PERSON] points out, "most are horrible".
"The big three" of The Big Three Killed My Baby are the car manufacturers that dominate the economy of the White Stripes' home city [CITY]: [ORGANIZATION], [ORGANIZATION] and [ORGANIZATION]. "Don't feed me planned obsolescence," says [PERSON] in an uncharacteristically political song, lamenting the demise of the unions in the 60s.
[ORGANIZATION] songwriter [PERSON] employs corporate lingo in the first verse of his [ORGANIZATION] Resignation Letter
Mission Statement: This non-profit founded by radio executives "serves as an advocate for the value of music" and "supports its songwriters, composers and publishers by taking care of an important aspect of their careers – getting paid," according to the [ORGANIZATION] website. They offer blanket music licenses to businesses and organizations that allow them to play nearly 13 million musical works.
The [ORGANIZATION] Orchestra was founded in 1929. Since then, the TSO has grown from a volunteer community orchestra to a fully professional orchestra serving Southern [COUNTRY]
Celebrating its 10th year in [CITY], [ORGANIZATION] is a 501(c)3 that invites songwriters from around the world to Texas to share the universal language of music in collaborations designed to bridge cultures, build friendships and cultivate peace.
[ORGANIZATION] is the brainchild of our 3 founders: [PERSON], [PERSON] and [PERSON]. The idea was born (on the beach) while they were constructing a website to be the basis of another start-up idea.
[ORGANIZATION] is an [NATIONALITY] multinational investment bank and financial services company
Zoolander is a 2001 American action-comedy film directed by [PERSON] and starring [LAST_NAME]
During the 1990s, [ORGANIZATION] invested heavily in new microprocessor designs fostering the rapid growth of the computer industry.
On 29 March 2017, the [NATIONALITY] government formally began the process of withdrawal by invoking Article 50 of the Treaty on European Union
[FIRST_NAME] shouted at [FIRST_NAME]: "What are you doing here?"
[LAST_NAME] spent a year at [ORGANIZATION] as the assistant to [PERSON], and the following year at [ORGANIZATION] in [CITY], which later became [ORGANIZATION] in 1965.
[LAST_NAME] began writing as a teenager, publishing her first story, "The Dimensions of a Shadow", in 1950 while studying English and journalism at the University of [CITY].

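The templates above use `[ENTITY]` placeholders that the data generator replaces with fake values. Below is a minimal sketch of that substitution, which also records the character span of each inserted value. It is not the package's `FakeDataGenerator`. The function `fill_template`, the `PLACEHOLDER` pattern and the `fake_values` dictionary are assumptions made for illustration. The span keys mirror the fields of the `Span` class in this diff.

```python
import random
import re
from typing import Dict, List, Tuple

# Assumption: placeholders are upper-case names in square brackets.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template: str, fake_values: Dict[str, List[str]],
                  rng: random.Random) -> Tuple[str, List[dict]]:
    """Replace each placeholder and record where the value landed."""
    parts = []
    spans = []
    cursor = 0   # position in the template
    out_len = 0  # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        entity = match.group(1)
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        value = rng.choice(fake_values[entity])  # KeyError if entity is unknown
        spans.append({"entity_type": entity, "entity_value": value,
                      "start_position": out_len,
                      "end_position": out_len + len(value)})
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


fake = {"PERSON": ["Jane Doe"], "COUNTRY": ["Chile"]}
text, spans = fill_template("My name is [PERSON] and I am from [COUNTRY]",
                            fake, random.Random(0))
print(text)
for span in spans:
    print(span, "->", text[span["start_position"]:span["end_position"]])
```

Tracking the output length while building the text keeps the spans correct even when a fake value is longer or shorter than the placeholder it replaces.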
from typing import List, Counter, Dict
import spacy
import srsly
from spacy.tokens import Token
from tqdm import tqdm
from presidio_evaluator import span_to_tag, tokenize
"CITY": "GPE",
"GPE": "GPE",
"ORG": "ORG",
class Span:
Holds information about the start, end, type and value
of an entity in a text
def __init__(self, entity_type, entity_value, start_position, end_position):
self.entity_type = entity_type
self.entity_value = entity_value
self.start_position = start_position
self.end_position = end_position
def intersect(self, other, ignore_entity_type: bool):
Checks if self intersects with a different Span
:return: If intersecting, returns the number of
intersecting characters.
If not, returns 0
# if they do not overlap the intersection is 0
if self.end_position < other.start_position or other.end_position < \
return 0
# if we are accounting for entity type a diff type means intersection 0
if not ignore_entity_type and (self.entity_type != other.entity_type):
return 0
# otherwise the intersection is min(end) - max(start)
return min(self.end_position, other.end_position) - max(
def __repr__(self):
return "Type: {}, value: {}, start: {}, end: {}".format(
self.entity_type, self.entity_value, self.start_position,
def __eq__(self, other):
return self.entity_type == other.entity_type \
and self.entity_value == other.entity_value \
and self.start_position == other.start_position \
and self.end_position == other.end_position
def __hash__(self):
return hash(('entity_type', self.entity_type,
'entity_value', self.entity_value,
'start_position', self.start_position,
'end_position', self.end_position))
def from_json(cls, data):
return cls(**data)
class SimpleSpacyExtensions(object):
def __init__(self, **kwargs):
Serialization of Spacy Token extensions.
:param kwargs: dictionary of spacy extensions and their values
def to_dict(self):
return self.__dict__
class SimpleToken(object):
A class mimicking the Spacy Token class, for serialization purposes
def __init__(self, text, idx, tag_=None,
spacy_extensions: SimpleSpacyExtensions = None,
self.text = text
self.idx = idx
self.tag_ = tag_
self.pos_ = pos_
self.dep_ = dep_
self.lemma_ = lemma_
# serialization for Spacy extensions:
if spacy_extensions is None:
self._ = SimpleSpacyExtensions()
self._ = spacy_extensions
self.params = kwargs
def from_spacy_token(cls, token):
if isinstance(token, SimpleToken):
return token
elif isinstance(token, Token):
if token._ and token._._extensions:
extensions = list(token._.token_extensions.keys())
extension_values = {}
for extension in extensions:
extension_values[extension] = token._.__getattr__(extension)
spacy_extensions = SimpleSpacyExtensions(**extension_values)
spacy_extensions = None
return cls(text=token.text,
def to_dict(self):
return {
"text": self.text,
"idx": self.idx,
"tag_": self.tag_,
"pos_": self.pos_,
"dep_": self.dep_,
"lemma_": self.lemma_,
"_": self._.to_dict()
def __repr__(self):
return self.text
def from_json(cls, data):
if '_' in data:
data['spacy_extensions'] = \
return cls(**data)
class InputSample(object):
def __init__(self, full_text: str, masked: str, spans: List[Span],
tokens=[], tags=[],
create_tags_from_span=True, scheme="IO", metadata=None, template_id=None):
Holds all the information needed for evaluation in the
presidio-evaluator framework.
Can generate tags (BIO/BILOU/IO) based on spans
:param full_text: The raw text of this sample
:param masked: Masked version of the raw text (desired output)
:param spans: List of spans for entities
:param create_tags_from_span: True if tags (tokens+tags) should be added
:param scheme: IO, BIO/IOB or BILOU. Only applicable if create_tags_from_span=True
:param tokens: list of items of type SimpleToken
:param tags: list of strings representing the label for each token,
given the scheme
:param metadata: A dictionary of additional metadata on the sample,
in the English (or other language) vocabulary
:param template_id: Original template (utterance) of sample, in case it was generated
self.full_text = full_text
self.masked = masked
self.spans = spans if spans else []
self.metadata = metadata
# generated samples have a template from which they were generated
if not template_id and self.metadata:
self.template_id = self.metadata.get("Template#")
self.template_id = template_id
if create_tags_from_span:
tokens, tags = self.get_tags(scheme)
self.tokens = tokens
self.tags = tags
self.tokens = tokens
self.tags = tags
def __repr__(self):
return "Full text: {}\n" \
"Spans: {}\n" \
"Tokens: {}\n" \
"Tags: {}\n".format(self.full_text, self.spans, self.tokens,
def to_dict(self):
return {
"full_text": self.full_text,
"masked": self.masked,
"spans": [span.__dict__ for span in self.spans],
"tokens": [SimpleToken.from_spacy_token(token).to_dict()
for token in self.tokens],
"tags": self.tags,
"template_id": self.template_id,
"metadata": self.metadata
def from_json(cls, data):
if 'spans' in data:
data['spans'] = [Span.from_json(span) for span in data['spans']]
if 'tokens' in data:
data['tokens'] = [SimpleToken.from_json(val) for val in
return cls(**data, create_tags_from_span=False)
def get_tags(self, scheme="IOB"):
start_indices = [span.start_position for span in self.spans]
end_indices = [span.end_position for span in self.spans]
tags = [span.entity_type for span in self.spans]
tokens = tokenize(self.full_text)
labels = span_to_tag(scheme=scheme, text=self.full_text, tag=tags,
start=start_indices, end=end_indices,
return tokens, labels
def to_conll(self, translate_tags, scheme="BIO"):
conll = []
for i, token in enumerate(self.tokens):
if translate_tags:
label = self.translate_tag(self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
label = self.tags[i]
conll.append({"text": token.text,
"pos": token.pos_,
"tag": token.tag_,
"Template#": self.metadata['Template#'],
"gender": self.metadata['Gender'],
"country": self.metadata['Country'],
"label": label},
return conll
def get_template_id(self):
return self.metadata['Template#']
def create_conll_dataset(dataset, translate_tags=True, to_bio=True):
import pandas as pd
conlls = []
i = 0
for sample in dataset:
if to_bio:
conll = sample.to_conll(translate_tags=translate_tags)
for token in conll:
token['sentence'] = i
i += 1
return pd.DataFrame(conlls)
def to_spacy(self, entities=None, translate_tags=True):
entities = [(span.start_position, span.end_position, span.entity_type)
for span in self.spans if (entities is None) or (span.entity_type in entities)]
new_entities = []
if translate_tags:
for entity in entities:
new_tag = self.translate_tag(entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
new_entities.append((entity[0], entity[1], new_tag))
new_entities = entities
return (self.full_text,
{"entities": new_entities})
def from_spacy(cls, text, annotations, translate_from_spacy=True):
spans = []
for annotation in annotations:
tag = cls.rename_from_spacy_tags([annotation[2]])[0] if translate_from_spacy else annotation[2]
span = Span(tag, text[annotation[0]: annotation[1]], annotation[0], annotation[1])
return cls(full_text=text, masked=None, spans=spans)
def create_spacy_dataset(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
def template_sort(x):
return x.metadata['Template#']
if sort_by_template_id:
return [sample.to_spacy(entities=entities, translate_tags=translate_tags) for sample in dataset]
def to_spacy_json(self, entities=None, translate_tags=True):
token_dicts = []
for i, token in enumerate(self.tokens):
if entities:
tag = self.tags[i] if self.tags[i][2:] in entities else 'O'
tag = self.tags[i]
if translate_tags:
tag = self.translate_tag(tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
"orth": token.text,
"tag": token.tag_,
"ner": tag
spacy_json_sentence = {
"raw": self.full_text,
"sentences": [{
"tokens": token_dicts
return spacy_json_sentence
def to_spacy_doc(self):
doc = self.tokens
spacy_spans = []
for span in self.spans:
start_token = [token.i for token in self.tokens if token.idx == span.start_position][0]
end_token = [token.i for token in self.tokens if token.idx + len(token.text) == span.end_position][0] + 1
spacy_span = spacy.tokens.span.Span(doc, start=start_token, end=end_token,
doc.ents = spacy_spans
return doc
def create_spacy_json(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
def template_sort(x):
return x.metadata['Template#']
if sort_by_template_id:
json_str = []
for i, sample in tqdm(enumerate(dataset)):
paragraph = sample.to_spacy_json(entities=entities, translate_tags=translate_tags)
"id": i,
"paragraphs": [paragraph]
return json_str
def translate_tags(tags, dictionary, ignore_unknown):
Translates entity types from one set to another
:param tags: list of entities to translate, e.g. ["LOCATION","O","PERSON"]
:param dictionary: Dictionary of old tags to new tags
:param ignore_unknown: Whether to put "O" when word not in dictionary or keep old entity type
:return: list of translated entities
new_tags = []
for tag in tags:
new_tags.append(InputSample.translate_tag(tag, dictionary, ignore_unknown))
return new_tags
def translate_tag(tag, dictionary, ignore_unknown):
has_prefix = len(tag) > 2 and tag[1] == '-'
no_prefix = tag[2:] if has_prefix else tag
if no_prefix in dictionary.keys():
return tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
if ignore_unknown:
return "O"
return tag
def bilou_to_bio(self):
new_tags = []
for tag in self.tags:
new_tag = tag
has_prefix = len(tag) > 2 and tag[1] == '-'
if has_prefix:
if tag[0] == 'U':
new_tag = 'B' + tag[1:]
elif tag[0] == 'L':
new_tag = 'I' + tag[1:]
self.tags = new_tags
def rename_from_spacy_tags(spacy_tags, ignore_unknown=False):
return InputSample.translate_tags(spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown)
def rename_to_spacy_tags(tags, ignore_unknown=True):
return InputSample.translate_tags(tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown)
def write_spacy_json_from_docs(dataset, filename="spacy_output.json"):
docs = [sample.to_spacy_doc() for sample in dataset]
srsly.write_json(filename, [])
def to_flair(self):
for i, token in enumerate(self.tokens):
return "{} {} {}".format(token, token.pos_, self.tags[i])
def translate_input_sample_tags(self, dictionary=PRESIDIO_SPACY_ENTITIES, ignore_unknown=True):
self.tags = InputSample.translate_tags(self.tags, dictionary, ignore_unknown=ignore_unknown)
for span in self.spans:
if span.entity_type in dictionary:
span.entity_type = dictionary[span.entity_type]
elif ignore_unknown:
span.entity_type = 'O'
def create_flair_dataset(dataset):
flair_samples = []
for sample in dataset:
return flair_samples
class ModelError:
def __init__(self, error_type, annotation, prediction, token, full_text, metadata):
Holds information about an error a model made for analysis purposes
:param error_type: str, e.g. FP, FN, Person->Address etc.
:param annotation: ground truth value
:param prediction: predicted value
:param token: token in question
:param full_text: full input text
:param metadata: metadata on text from InputSample
self.error_type = error_type
self.annotation = annotation
self.prediction = prediction
self.token = token
self.full_text = full_text
self.metadata = metadata
def __str__(self):
return "type: {}, " \
"Annotation = {}, " \
"prediction = {}, " \
"Token = {}, " \
"Full text = {}, " \
"Metadata = {}".format(self.error_type,
def __repr__(self):
return "<ModelError {0}>".format(self.__str__())
class EvaluationResult(object):
def __init__(self, results: Counter, model_errors: List[ModelError], text: str = None):
Holds the output of a comparison between ground truth and predicted
:param results: List of objects of type Counter
with structure {(actual, predicted) : count}
:param model_errors: List of ModelError
:param text: sample's full text (if used for one sample)
:type results: Counter
:type model_errors : List[ModelError]
:type text: object
self.results = results
self.model_errors = model_errors
self.text = text
self.pii_recall = None
self.pii_precision = None
self.pii_f = None
self.entity_recall_dict = None
self.entity_precision_dict = None
def print(self):
recall_dict = self.entity_recall_dict
precision_dict = self.entity_precision_dict
recall_dict["PII"] = self.pii_recall
precision_dict["PII"] = self.pii_precision
entities = recall_dict.keys()
recall = recall_dict.values()
precision = precision_dict.values()
row_format = "{:>30}{:>30.2%}{:>30.2%}"
header_format = "{:>30}" * 3
print(header_format.format(*("Entity", "Precision", "Recall")))
for entity, precision, recall in zip(entities, precision, recall):
print(row_format.format(entity, precision, recall))
print("PII F measure: {}".format(self.pii_f))

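Two of the tag utilities in `InputSample` above, `translate_tag` and `bilou_to_bio`, are easier to follow as standalone functions. The sketch below restates their logic outside the class, as an illustration and not the package's code. The `mapping` dictionary is an illustrative subset of the entity mapping.

```python
from typing import Dict, List


def translate_tag(tag: str, dictionary: Dict[str, str],
                  ignore_unknown: bool) -> str:
    """Map one tag to a new entity name, keeping any scheme prefix."""
    has_prefix = len(tag) > 2 and tag[1] == "-"
    no_prefix = tag[2:] if has_prefix else tag
    if no_prefix in dictionary:
        new_entity = dictionary[no_prefix]
        return tag[:2] + new_entity if has_prefix else new_entity
    # Unknown entities become "O" or are kept, depending on the flag.
    return "O" if ignore_unknown else tag


def bilou_to_bio(tags: List[str]) -> List[str]:
    """Convert BILOU tags to BIO: U- becomes B-, L- becomes I-."""
    new_tags = []
    for tag in tags:
        if len(tag) > 2 and tag[1] == "-":
            if tag[0] == "U":
                tag = "B" + tag[1:]
            elif tag[0] == "L":
                tag = "I" + tag[1:]
        new_tags.append(tag)
    return new_tags


mapping = {"CITY": "GPE", "ORG": "ORG"}  # illustrative subset
print([translate_tag(t, mapping, ignore_unknown=True)
       for t in ["B-CITY", "O", "U-PERSON"]])
print(bilou_to_bio(["B-PERSON", "I-PERSON", "L-PERSON", "O", "U-GPE"]))
```

With `ignore_unknown=True`, any entity missing from the mapping collapses to `O`, which is useful when the target model knows only a subset of the entity types.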
from typing import List
from flair.data import Sentence, build_spacy_tokenizer
from flair.models import SequenceTagger
except ImportError:
print("Flair is not installed by default")
from presidio_evaluator import ModelEvaluator, InputSample
import spacy
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
class FlairEvaluator(ModelEvaluator):
def __init__(self,
model_path: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True,
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path:
:param entities_to_keep:
:param verbose:
:param labeling_scheme:
:param compare_by_io:
:param translate_to_spacy_entities:
if model is None:
if model_path is None:
raise ValueError("Either model_path or model object must be supplied")
self.model = SequenceTagger.load(model_path)
self.model = model
self.spacy_tokenizer = build_spacy_tokenizer(model=spacy.blank('en'))
self.translate_to_spacy_entities = translate_to_spacy_entities
if self.translate_to_spacy_entities:
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_entities:
sentence = Sentence(text=sample.full_text, use_tokenizer=self.spacy_tokenizer)
tags = self.get_tags_from_sentence(sentence)
if len(tags) != len(sample.tokens):
print("mismatch between previous tokens and new tokens")
return tags
def get_tags_from_sentence(sentence):
tags = []
for token in sentence:
new_tags = []
for tag in tags:
new_tags.append("PERSON" if tag == "PER" else tag)
return new_tags

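The `ModelEvaluator` class in this diff aggregates token-level results into a `Counter` keyed by `(annotation, prediction)` pairs and derives PII-level precision, recall and an F-beta score from it. Below is a minimal sketch of that computation. The function names are illustrative, and the F-beta denominator follows the standard definition and is not copied from the source.

```python
import math
from collections import Counter
from typing import Tuple


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; NaN when it is undefined."""
    if math.isnan(precision) or math.isnan(recall) or (
            precision == 0 and recall == 0):
        return float("nan")
    return ((1 + beta ** 2) * precision * recall) / (
        (beta ** 2 * precision) + recall)


def pii_scores(results: Counter, beta: float = 1.0) -> Tuple[float, float, float]:
    """PII-level precision, recall and F-beta from (actual, predicted) counts."""
    annotated = sum(c for (actual, _), c in results.items() if actual != "O")
    predicted = sum(c for (_, pred), c in results.items() if pred != "O")
    # A hit is any token tagged as PII in both, whatever the entity type.
    hits = sum(c for (actual, pred), c in results.items()
               if actual != "O" and pred != "O")
    recall = hits / annotated if annotated > 0 else float("nan")
    precision = hits / predicted if predicted > 0 else float("nan")
    return precision, recall, f_beta(precision, recall, beta)


results = Counter({("PERSON", "PERSON"): 8, ("PERSON", "O"): 2,
                   ("O", "PERSON"): 2, ("O", "O"): 88})
print(pii_scores(results))
```

A `beta` above 1 weights recall more heavily, which is often the safer choice for PII detection, where a missed entity usually costs more than a false alarm.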

@ -0,0 +1,398 @@
from abc import ABC, abstractmethod
from typing import List, Tuple, Dict
from collections import Counter
import numpy as np
import pandas as pd
from presidio_evaluator import InputSample, EvaluationResult, ModelError
from tqdm import tqdm
class ModelEvaluator(ABC):
def __init__(self, entities_to_keep: List[str] = None,
verbose: bool = False,
use_spans: bool = False, labeling_scheme="BIO",
Abstract class for evaluating NER models and others
:param entities_to_keep: Which entities should be evaluated? All other
entities are ignored. If None, none are filtered
:param verbose: Whether to print more debug info
:param labeling_scheme: Type of scheme used for labeling (BILOU,
:param compare_by_io: True if comparison should be done on the entity
level and not the sub-entity level
self.entities = entities_to_keep
self.verbose = verbose
self.use_spans = use_spans
self.compare_by_io = compare_by_io
self.labeling_scheme = labeling_scheme
def predict(self, sample: InputSample) -> List[str]:
Abstract. Returns the predicted tokens/spans from the evaluated model
:param sample: Sample to be evaluated
:return: if self.use_spans: list of spans
if not self.use_spans: tags in self.labeling_scheme format
def compare(self, input_sample: InputSample, prediction: List[str]):
Compares ground truth tags (annotation) and predicted (prediction)
:param input_sample: input sample containing list of tags with scheme
:param prediction: predicted value for each token
annotation = input_sample.tags
tokens = input_sample.tokens
if len(annotation) != len(prediction):
print("Annotation and prediction do not have the "
"same length. Sample={}".format(input_sample))
return Counter(), []
results = Counter()
mistakes = []
new_annotation = annotation.copy()
if self.compare_by_io:
new_annotation = self._to_io(new_annotation)
prediction = self._to_io(prediction)
# Ignore annotations that aren't in the list of
# requested entities.
if self.entities:
prediction = self._adjust_per_entities(prediction)
new_annotation = self._adjust_per_entities(new_annotation)
for i in range(0, len(new_annotation)):
results[(new_annotation[i], prediction[i])] += 1
if self.verbose:
print('Annotation:', new_annotation[i])
print('Prediction:', prediction[i])
# check if there was an error
is_error = (new_annotation[i] != prediction[i])
if is_error:
if prediction[i] == 'O':
elif new_annotation[i] == 'O':
mistakes.append(ModelError("Wrong entity",
return results, mistakes
def _adjust_per_entities(self, tags):
if self.entities:
return [tag if tag in self.entities else 'O' for tag in tags]
def _to_io(tags):
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
['B-PERSON','I-PERSON','L-PERSON'] is translated into
:param tags: the input tags in BILOU/IOB/BIO format
:return: a new list of IO tags
return [tag[2:] if '-' in tag else tag for tag in tags]
def evaluate_sample(self, sample: InputSample) -> EvaluationResult:
if self.verbose:
print("Input sentence: {}".format(sample.full_text))
prediction = self.predict(sample)
results, mistakes = self.compare(input_sample=sample, prediction=prediction)
return EvaluationResult(results, mistakes, sample.full_text)
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
evaluation_results = []
for sample in tqdm(dataset, desc='Evaluating {}'.format(self.__class__)):
evaluation_result = self.evaluate_sample(sample)
return evaluation_results
def calculate_score(self, evaluation_results: List[
EvaluationResult], beta: float = 1) \
-> EvaluationResult:
Returns the pii_precision, pii_recall and f_measure either for each entity
or for all entities (ignore_entity_type = True)
:param evaluation_results: List of EvaluationResult
:param beta: F measure beta value
between different entity types, or to treat these as misclassifications
:return: EvaluationResult with precision, recall and f measures
# aggregate results
all_results = sum([er.results for er in evaluation_results], Counter())
# compute pii_recall per entity
entity_recall = {}
entity_precision = {}
if self.entities:
entities = self.entities
entities = list(
set([x[0] for x in all_results.keys() if x[0] != 'O']))
for entity in entities:
# all annotation of given type
annotated = sum(
[all_results[x] for x in all_results if x[0] == entity])
predicted = sum(
[all_results[x] for x in all_results if x[1] == entity])
tp = all_results[(entity, entity)]
if annotated > 0:
    entity_recall[entity] = tp / annotated
else:
    entity_recall[entity] = np.nan
if predicted > 0:
    per_entity_tp = all_results[(entity, entity)]
    entity_precision[entity] = per_entity_tp / predicted
else:
    entity_precision[entity] = np.nan
# compute pii_precision and pii_recall
annotated_all = sum(
[all_results[x] for x in all_results if x[0] != 'O'])
predicted_all = sum(
[all_results[x] for x in all_results if x[1] != 'O'])
if annotated_all > 0:
    pii_recall = sum([all_results[x] for x in all_results if
                      (x[0] != 'O' and x[1] != 'O')]) / annotated_all
else:
    pii_recall = np.nan
if predicted_all > 0:
    pii_precision = sum([all_results[x] for x in all_results if
                         (x[0] != 'O' and x[1] != 'O')]) / predicted_all
else:
    pii_precision = np.nan
# compute pii_f_beta-score
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
# aggregate errors
errors = []
for res in evaluation_results:
    if res.model_errors:
        errors.extend(res.model_errors)
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
evaluation_result.pii_precision = pii_precision
evaluation_result.pii_recall = pii_recall
evaluation_result.entity_recall_dict = entity_recall
evaluation_result.entity_precision_dict = entity_precision
evaluation_result.pii_f = pii_f_beta
return evaluation_result
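The aggregation in `calculate_score` can be illustrated with a small standalone sketch. It assumes, as the code above suggests, that results are token counts keyed by `(annotated_tag, predicted_tag)`; the counts themselves are made up for illustration:

```python
from collections import Counter

# Hypothetical token-level counts keyed by (annotated_tag, predicted_tag)
all_results = Counter({
    ('PERSON', 'PERSON'): 8,    # correctly detected PERSON tokens
    ('PERSON', 'O'): 2,         # missed PERSON tokens (false negatives)
    ('O', 'PERSON'): 1,         # non-PII tokens flagged as PERSON (false positives)
    ('LOCATION', 'PERSON'): 1,  # PII detected, but with the wrong entity type
    ('O', 'O'): 50,             # correctly ignored tokens
})

# PII-level metrics: any non-'O' prediction on a non-'O' annotation counts as a hit
annotated_all = sum(v for (ann, _), v in all_results.items() if ann != 'O')
predicted_all = sum(v for (_, pred), v in all_results.items() if pred != 'O')
pii_hits = sum(v for (ann, pred), v in all_results.items()
               if ann != 'O' and pred != 'O')
pii_recall = pii_hits / annotated_all
pii_precision = pii_hits / predicted_all

# Entity-level recall: only exact entity matches count as true positives
person_annotated = sum(v for (ann, _), v in all_results.items() if ann == 'PERSON')
person_recall = all_results[('PERSON', 'PERSON')] / person_annotated

print(pii_precision, pii_recall, person_recall)
```

Note the difference between the two levels: the `('LOCATION', 'PERSON')` token counts toward PII recall and precision, because some PII was detected, but it would not count as a true positive for either entity's own recall.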
def precision(tp: int, fp: int) -> float:
return tp / (tp + fp + 1e-100)
def recall(tp: int, fn: int) -> float:
return tp / (tp + fn + 1e-100)
def f_beta(precision: float, recall: float, beta: float) -> float:
"""
Returns the F score for precision, recall and a beta parameter
:param precision: a float with the precision value
:param recall: a float with the recall value
:param beta: a float with the beta parameter of the F measure,
which gives more or less weight to precision vs. recall
:return: a float value of the f(beta) measure.
"""
if np.isnan(precision) or np.isnan(recall) or (
precision == 0 and recall == 0):
return np.nan
return ((1 + beta ** 2) * precision * recall) / (
((beta ** 2) * precision) + recall)
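A worked example of the F-beta formula used above, as a standalone sketch that relies on `math` instead of NumPy. The precision and recall values are illustrative only:

```python
import math


def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; undefined (NaN) when either input is NaN or both are zero."""
    if math.isnan(precision) or math.isnan(recall) or (precision == 0 and recall == 0):
        return math.nan
    return ((1 + beta ** 2) * precision * recall) / ((beta ** 2) * precision + recall)


# beta = 1 is the harmonic mean of precision and recall
f1 = f_beta(0.5, 1.0, beta=1)   # (2 * 0.5 * 1.0) / (0.5 + 1.0)
# beta = 2 weights recall more heavily, so the high recall lifts the score
f2 = f_beta(0.5, 1.0, beta=2)   # (5 * 0.5 * 1.0) / (4 * 0.5 + 1.0)
print(f1, f2)
```

Values of beta above 1 favor recall and values below 1 favor precision, which is why the evaluation scripts in this package use a beta larger than 1: a missed PII token is usually costlier than a false alarm.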
def align_input_samples_to_presidio_analyzer(input_samples: List[InputSample],
entities_mapping: Dict[str, str],
presidio_fields: List[str]=None) \
-> List[InputSample]:
Change input samples to conform with Presidio's entities
:return: new list of InputSample
new_input_samples = input_samples.copy()
# Match entity names to Presidio's
if not presidio_fields:
# A list that will contain updated input samples,
new_list = []
# Iterate on all samples
for input_sample in new_input_samples:
contains_presidio_field = False
new_spans = []
# Update spans to match Presidio's entity name
for span in input_sample.spans:
in_presidio_field = False
if span.entity_type in entities_mapping.keys():
new_name = entities_mapping.get(span.entity_type)
span.entity_type = new_name
contains_presidio_field = True
# Add to new span list, if the span contains an entity relevant to Presidio
input_sample.spans = new_spans
# Update tags in case this sample has relevant entities for evaluation
if contains_presidio_field:
for i, tag in enumerate(input_sample.tags):
has_prefix = '-' in tag
if has_prefix:
    prefix = tag[:2]
    clean = tag[2:]
else:
    prefix = ""
    clean = tag
if clean in entities_mapping.keys():
    new_name = entities_mapping.get(clean)
    input_sample.tags[i] = "{}{}".format(prefix, new_name)
else:
    input_sample.tags[i] = 'O'
return new_list
def get_false_positives(errors: List[ModelError], entity=None):
"""Get a list of all false positive errors in the results"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [model_error for model_error in errors if
model_error.error_type == 'FP' and model_error.prediction in entity]
return [model_error for model_error in errors if model_error.error_type == 'FP']
def get_false_negatives(errors: List[ModelError], entity=None):
"""Get a list of all false negative errors in the results (false negatives and wrong entity detection)"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [model_error for model_error in errors if
model_error.error_type != 'FP' and model_error.annotation in entity]
return [model_error for model_error in errors if model_error.error_type != 'FP']
def most_common_fp_tokens(errors: List[ModelError], n: int = 10, entity=None):
"""Print the n most common false positive tokens (tokens thought to be an entity)"""
fps = ModelEvaluator.get_false_positives(errors, entity)
tokens = [err.token.text for err in fps]
from collections import Counter
by_frequency = Counter(tokens)
most_common = by_frequency.most_common(n)
print("Most common false positive tokens:")
print("Example sentence with each FP token:")
for tok, val in most_common:
with_tok = [err for err in fps if err.token.text == tok]
def most_common_fn_tokens(errors: List[ModelError], n: int = 10, entity=None):
"""Print the most common tokens that were missed by the model, including an example of the full text in which they appear"""
fns = ModelEvaluator.get_false_negatives(errors, entity)
fns_tokens = [err.token.text for err in fns]
from collections import Counter
by_frequency_fns = Counter(fns_tokens)
most_common_fns = by_frequency_fns.most_common(n)
for tok, val in most_common_fns:
with_tok = [err for err in fns if err.token.text == tok]
print("Token: {}, Annotation: {}, Full text: {}".format(with_tok[0].token, with_tok[0].annotation,
def get_errors_df(errors: List[ModelError], entity: List[str] = None, error_type: str = 'FN'):
"""Get ModelErrors as pd.DataFrame"""
if error_type == 'FN':
    filtered_errors = ModelEvaluator.get_false_negatives(errors, entity)
elif error_type == 'FP':
    filtered_errors = ModelEvaluator.get_false_positives(errors, entity)
else:
    raise ValueError("error_type should be either FP or FN")
if len(filtered_errors) == 0:
print("No errors of type {} and entity {} were found".format(error_type,entity))
return None
errors_df = pd.DataFrame.from_records([error.__dict__ for error in filtered_errors])
metadata_df = pd.DataFrame(errors_df['metadata'].tolist())
errors_df.drop(['metadata'], axis=1, inplace=True)
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
return new_errors_df
def get_fps_dataframe(errors: List[ModelError], entity: List[str] = None):
"""Get false positive ModelErrors as pd.DataFrame"""
return ModelEvaluator.get_errors_df(errors, entity, error_type='FP')
def get_fns_dataframe(errors: List[ModelError], entity: List[str] = None):
"""Get false negative ModelErrors as pd.DataFrame"""
return ModelEvaluator.get_errors_df(errors, entity, error_type='FN')

Presidio Analyzer not yet on PyPI, cannot explicitly reference it
from typing import List, Dict
from presidio_evaluator import ModelEvaluator, InputSample, span_to_tag
from presidio_evaluator.data_generator import read_synth_dataset
class PresidioAnalyzer(ModelEvaluator):
def __init__(self, analyzer,
entities_to_keep: List[str] = None,
verbose: bool = False,
Evaluation wrapper for the Presidio Analyzer
:param analyzer: object of type AnalyzerEngine (from presidio-analyzer)
self.analyzer = analyzer
self.score_threshold = score_threshold
def predict(self, sample: InputSample) -> List[str]:
if self.entities is None or len(self.entities) == 0:
    all_fields = True
else:
    all_fields = None
results = self.analyzer.analyze(sample.full_text, self.entities,
language='en', all_fields=all_fields)
starts = []
ends = []
scores = []
tags = []
for res in results:
if res.score >= self.score_threshold:
response_tags = span_to_tag(scheme=self.labeling_scheme,
return response_tags
if __name__ == "__main__":
print("Reading dataset")
input_samples = read_synth_dataset("../data/generated_size_30000_date_July 24 2019.txt")
print("Preparing dataset by aligning entity names to Presidio's entity names")
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
entities_mapping = {
'O': 'O'
updated_samples = ModelEvaluator.align_input_samples_to_presidio_analyzer(input_samples,
flatten = lambda l: [item for sublist in l for item in sublist]
from collections import Counter
count_per_entity = Counter(
[span.entity_type for span in flatten([input_sample.spans for input_sample in updated_samples])])
print("Evaluating samples")
analyzer = PresidioAnalyzer(entities_to_keep=count_per_entity.keys())
evaluated_samples = analyzer.evaluate_all(updated_samples)
print("Estimating metrics")
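# calculate_score returns an EvaluationResult; read the metrics from its attributes
score = analyzer.calculate_score(evaluation_results=evaluated_samples, beta=2.5)
precision, recall = score.pii_precision, score.pii_recall
entity_recall, entity_precision = score.entity_recall_dict, score.entity_precision_dict
f, errors = score.pii_f, score.model_errors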
print("precision: {}".format(precision))
print("Recall: {}".format(recall))
print("F 2.5: {}".format(f))
print("Precision per entity: {}".format(entity_precision))
print("Recall per entity: {}".format(entity_recall))
FN_mistakes = [mistake for mistake in flatten(errors) if mistake[0:2] == 'FN']
FP_mistakes = [mistake for mistake in flatten(errors) if mistake[0:2] == 'FP']
other_mistakes = [mistake for mistake in flatten(errors) if "Wrong entity" in mistake]
fn = open('../data/fn_30000.txt', 'w+', encoding='utf-8')
fn1 = '\n'.join(FN_mistakes)
fp = open('../data/fp_30000.txt', 'w+', encoding='utf-8')
fp1 = '\n'.join(FP_mistakes)
mistakes_file = open('../data/mistakes_30000.txt', 'w+', encoding='utf-8')
mistakes1 = '\n'.join(other_mistakes)
from pickle import dump
dump(evaluated_samples, open("../data/evaluated_samples_30000.pickle", "wb"))


@ -0,0 +1,133 @@
import json
from typing import List
import requests
from presidio_evaluator import InputSample, ModelEvaluator
from presidio_evaluator.span_to_tag import span_to_tag, tokenize
class PresidioAPIEvaluator(ModelEvaluator):
def __init__(self, endpoint=None, all_fields=False, entities_to_keep=None,
verbose=False, labeling_scheme="IO", **kwargs):
"""
Evaluator model for the Presidio API as a system
:param endpoint: URL of the Presidio API
:param all_fields: boolean, true if no entity filtering should take place
:param entities_to_keep: list of entities to return if found
:param labeling_scheme: BILOU, BIO/IOB or IO
:param verbose: whether to print intermediate results
:param kwargs: additional arguments passed to ModelEvaluator
"""
if not endpoint:
    print("Endpoint is missing. Using default Presidio API at {}".format(ENDPOINT))
    self.endpoint = ENDPOINT
else:
    self.endpoint = endpoint
if not entities_to_keep and not all_fields:
raise ValueError("Please provide either a list of entities or"
if all_fields:
entities_to_keep = None
super().__init__(verbose=verbose, entities_to_keep=entities_to_keep,
labeling_scheme=labeling_scheme, **kwargs)
def predict(self, sample: InputSample):
text = sample.full_text
request = {"text": text,
           "analyzeTemplate": self.analyze_template
           }
# Call presidio API
r = requests.post(self.endpoint, json=request)
starts = []
ends = []
tags = []
if r.status_code == 200:
analyzer_results = json.loads(r.text)
if self.verbose:
if analyzer_results:
for res in analyzer_results:
if not res['location'].get('start'):
res['location']['start'] = 0
response_tags = span_to_tag(scheme=self.labeling_scheme,
elif r.status_code == 400 or r.text == "":
    if self.verbose:
        print("Status 400 received")
    response_tags = ['O' for token in sample.tokens]
else:
    print("Error getting result from Presidio API")
    print("Request = {}".format(request))
    print("Response = {}".format(r.text))
    raise Exception(r)
return response_tags
def set_analyze_template(self, all_fields: bool, entities: List[str]):
template = {
"fields": [{"name": "EMAIL_ADDRESS"}, {"name": "IP_ADDRESS"},
{"name": "US_DRIVER_LICENSE"},
{"name": "US_ITIN"}, {"name": "US_SSN"},
{"name": "DOMAIN_NAME"},
{"name": "IBAN_CODE"}, {"name": "PERSON"},
{"name": "PHONE_NUMBER"},
{"name": "US_BANK_NUMBER"}, {"name": "CRYPTO"},
{"name": "NRP"},
{"name": "UK_NHS"}, {"name": "CREDIT_CARD"},
{"name": "DATE_TIME"},
{"name": "LOCATION"}, {"name": "US_PASSPORT"}]}
if all_fields:
    self.analyze_template = template
else:
    # keep only the template fields that match the requested entities
    requested_fields = []
    for entity in entities:
        for field in template['fields']:
            if entity == field['name']:
                requested_fields.append(field)
    new_template = {'fields': requested_fields}
    self.analyze_template = new_template
if __name__ == "__main__":
# Example:
text = "My siblings are Dan and magen"
bilou_tags = ['O', 'O', 'O', 'U-PERSON', 'O', 'U-PERSON']
presidio = PresidioAPIEvaluator(verbose=True, all_fields=True, compare_by_io=True)
tokens = tokenize(text)
s = InputSample(text, masked=None, spans=None)
s.tokens = tokens
s.tags = bilou_tags
evaluated_sample = presidio.evaluate_sample(s)
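# calculate_score returns an EvaluationResult and uses beta=1 by default
score = presidio.calculate_score([evaluated_sample])
p, r, f, mistakes = score.pii_precision, score.pii_recall, score.pii_f, score.model_errors
print("Precision = {}\n"
      "Recall = {}\n"
      "F_1 = {}\n"
      "Errors = {}".format(p, r, f, mistakes))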

Presidio Analyzer not yet on PyPI, therefore it cannot be referenced explicitly
import math
from typing import List, Tuple, Dict
from presidio_evaluator import ModelEvaluator, InputSample
from presidio_evaluator.span_to_tag import span_to_tag
class PresidioRecognizerEvaluator(ModelEvaluator):
def __init__(self, recognizer, nlp_engine, entities_to_keep=None,
with_nlp_artifacts=False, verbose=False, compare_by_io=True,
Evaluator for one recognizer
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
verbose=verbose, compare_by_io=compare_by_io)
self.withNlpArtifacts = with_nlp_artifacts
self.recognizer = recognizer
self.nlp_engine = nlp_engine
def __make_nlp_artifacts(self, text: str):
return self.nlp_engine.process_text(text, 'en')
def predict(self, sample: InputSample) -> List[str]:
nlpArtifacts = None
if self.withNlpArtifacts:
nlpArtifacts = self.__make_nlp_artifacts(sample.full_text)
results = self.recognizer.analyze(sample.full_text, self.entities,
starts = []
ends = []
tags = []
scores = []
for res in results:
if not res.start:
res.start = 0
response_tags = span_to_tag(scheme=self.labeling_scheme,
if len(sample.tags) == 0:
sample.tags = ['O' for word in response_tags]
return response_tags
def score_presidio_recognizer(recognizer, entities_to_keep, input_samples,
withNlpArtifacts=False) \
-> Tuple[Dict[str, float], Dict[str, float], Dict[str, float], Dict[
str, float], Dict[str, float], List[str]]:
model = PresidioRecognizerEvaluator(recognizer=recognizer,
evaluated_samples = model.evaluate_all(input_samples[:])
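# calculate_score returns an EvaluationResult; read the metrics from its attributes
score = model.calculate_score(evaluated_samples, beta=2.5)
precision, recall = score.pii_precision, score.pii_recall
ent_recall, ent_precision = score.entity_recall_dict, score.entity_precision_dict
fscore, mistakes = score.pii_f, score.model_errors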
print("p={precision}, r={recall},f={f},"
"entity recall={ent},entity precision={prec}".format(
if math.isnan(precision):
precision = 0
return precision, recall, ent_recall, ent_precision, fscore, mistakes


@ -0,0 +1,52 @@
from typing import List
from presidio_evaluator import ModelEvaluator, InputSample
import spacy
from spacy.language import Language
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
class SpacyEvaluator(ModelEvaluator):
def __init__(self,
model: spacy.language.Language = None,
model_name: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True,
translate_to_spacy_ents = True):
if model is None:
    if model_name is None:
        raise ValueError("Either model_name or model object must be supplied")
    self.model = spacy.load(model_name)
else:
    self.model = model
self.translate_to_spacy_ents = translate_to_spacy_ents
if self.translate_to_spacy_ents:
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_ents:
doc = self.model(sample.full_text)
tags = self.get_tags_from_doc(doc)
if len(doc) != len(sample.tokens):
print("mismatch between input tokens and new tokens")
return tags
def get_tags_from_doc(doc):
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
return tags

from collections import namedtuple
from typing import List
import spacy
loaded_spacy = {}
def get_spacy(loaded_spacy=loaded_spacy, model_version="en_core_web_lg"):
if model_version not in loaded_spacy:
disable = ['vectors', 'textcat', 'ner']
print("loading model {}".format(model_version))
loaded_spacy[model_version] = spacy.load(model_version, disable=disable)
return loaded_spacy[model_version]
def tokenize(text, model_version="en_core_web_lg"):
return get_spacy(model_version=model_version)(text)
def _get_detailed_tags(scheme, cur_tags):
Replaces IO tags (e.g. PERSON PERSON) with IOB/BIO/BILOU tags
:param cur_tags:
:param scheme:
if all([tag == 'O' for tag in cur_tags]):
return cur_tags
return_tags = []
if len(cur_tags) == 1:
if scheme == "BILOU":
elif len(cur_tags) > 0:
tg = cur_tags[0]
for j in range(0, len(cur_tags)):
if j == 0:
elif j == len(cur_tags) - 1:
if scheme == "BILOU":
return return_tags
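Because several branches of `_get_detailed_tags` are elided above, the following is a standalone sketch of the same idea rather than the package's implementation: it takes one run of identical IO tags and adds BIO or BILOU prefixes. The function name `detail_tags` is hypothetical.

```python
from typing import List


def detail_tags(scheme: str, run: List[str]) -> List[str]:
    """Add BIO or BILOU prefixes to one run of identical IO tags."""
    if not run or run[0] == 'O':
        # tokens outside any entity never get a prefix
        return list(run)
    entity = run[0]
    if scheme == "BILOU":
        if len(run) == 1:
            # a single-token entity is a Unit
            return ["U-" + entity]
        # Begin, zero or more Inside, then Last
        return ["B-" + entity] + ["I-" + entity] * (len(run) - 2) + ["L-" + entity]
    # BIO/IOB: the first token begins the entity, the rest are inside it
    return ["B-" + entity] + ["I-" + entity] * (len(run) - 1)


print(detail_tags("BILOU", ["PERSON", "PERSON", "PERSON"]))
print(detail_tags("BIO", ["PERSON", "PERSON"]))
```

The sketch assumes the caller has already split the tag sequence into runs of the same tag, which is what the `changes` list in `span_to_tag` appears to be used for.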
def _sort_spans(start, end, tag, score):
if len(start) > 0:
tpl = [(a, b, c, d) for a, b, c, d in sorted(zip(start, end, tag, score), key=lambda pair: pair[0])]
start, end, tag, score = [[x[i] for x in tpl] for i in range(len(tpl[0]))]
return start, end, tag, score
def _handle_overlaps(start, end, tag, score):
start, end, tag, score = _sort_spans(start, end, tag, score)
if len(start) == 0:
return start, end, tag, score
max_end = max(end)
index = min(start)
number_of_spans = len(start)
i = 0
while i < number_of_spans-1:
for j in range(i+1,number_of_spans):
# Span j intersects with span i
if start[i] <= start[j] <= end[i]:
    # i's score is higher, remove intersecting part
    if score[i] > score[j]:
        # j is contained within i but has lower score, remove
        if end[j] <= end[i]:
            score[j] = 0
        # else, j continues after i ended:
        else:
            start[j] = end[i] + 1
    # j's score is higher, break i
    else:
        # If i finishes after j ended, split i
        if end[j] < end[i]:
            # create new span at the end, keeping the parallel lists aligned
            start.append(end[j] + 1)
            end.append(end[i])
            tag.append(tag[i])
            score.append(score[i])
            number_of_spans += 1
            # truncate the current i to end at start(j)
            end[i] = start[j] - 1
        # else, i finishes before j ended. truncate i
        else:
            end[i] = start[j] - 1
i += 1
start, end, tag, score = _sort_spans(start, end, tag, score)
return start, end, tag, score
def span_to_tag(scheme: str,
text: str,
start: List[int],
end: List[int],
tag: List[str],
scores: List[float] = None,
tokens: List[spacy.tokens.Token] = None,
io_tags_only=False) -> List[str]:
"""
Turns a list of start and end values with corresponding labels into a NER
tagging (BILOU, BIO/IOB)
:param scheme: labeling scheme, either BILOU, BIO/IOB or IO
:param text: input text
:param tokens: text tokenized to tokens
:param start: list of indices where entities in the text start
:param end: list of indices where entities in the text end
:param tag: list of entity names
:param scores: score of tag (confidence)
:param io_tags_only: Whether to return only I and O tags
:return: list of strings, representing either BILOU or BIO for the input
"""
if not scores:
# assume all scores are of equal weight
scores = [0.5 for _ in start]
start, end, tag, scores = _handle_overlaps(start, end, tag, scores)
if not tokens:
tokens = tokenize(text)
io_tags = []
for token in tokens:
found = False
for span_index in range(0, len(start)):
if start[span_index] <= token.idx < end[span_index]:
found = True
if not found:
if io_tags_only or scheme == "IO":
return io_tags
# Set tagging based on scheme (BIO/IOB or BILOU)
current_tag = ""
span_index = 0
changes = []
for io_tag in io_tags:
if io_tag != current_tag:
span_index += 1
current_tag = io_tag
new_return_tags = []
for i in range(len(changes) - 1):
cur_tags=io_tags[changes[i]:changes[i + 1]]))
return new_return_tags
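The first stage of `span_to_tag`, mapping character spans to one IO tag per token, can be sketched without spaCy. This is a simplified illustration under stated assumptions: tokens are whitespace-delimited (the package tokenizes with spaCy), spans do not overlap, and `spans_to_io` is a hypothetical name.

```python
import re
from typing import List, Tuple


def spans_to_io(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag each whitespace-delimited token with the entity whose span covers its start offset."""
    tags = []
    for match in re.finditer(r"\S+", text):
        idx = match.start()  # plays the role of spaCy's token.idx
        label = "O"
        for start, end, entity in spans:
            if start <= idx < end:
                label = entity
                break
        tags.append(label)
    return tags


text = "My name is Dan Smith"
# characters 11..19 cover "Dan Smith"; the end offset is exclusive
tags = spans_to_io(text, [(11, 20, "PERSON")])
print(tags)
```

Only the token's start offset is tested against the span, mirroring the `start[span_index] <= token.idx < end[span_index]` check above, so a token that begins inside a span is tagged even if it extends past the span's end.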

from collections import defaultdict
import random
import numpy as np
from typing import List, Dict
import json
from presidio_evaluator import InputSample
def split_dataset(dataset : List[InputSample], ratios):
"""
Splits a provided dataset into n groups, by the Template# attribute in each sample's metadata
:param dataset: List of InputSamples to be split
:param ratios: list of ratios that sum to 1. The length of the list is the number of splits returned,
e.g. [0.7,0.2,0.1] for train, test, validation
"""
splits = []
remaining_dataset = dataset
remaining_ratio = 1.0
if sum(ratios) > 1 or sum(ratios) < 0.999:
raise ValueError("Ratios should sum to 1 and be in (0,1]")
for ratio in ratios:
    if 1 >= ratio > 0:
        first_templates, second_templates = split_by_template(remaining_dataset, ratio/remaining_ratio)
        first_split = get_samples_by_pattern(remaining_dataset, first_templates)
        second_split = get_samples_by_pattern(remaining_dataset, second_templates)
        splits.append(first_split)
        remaining_dataset = second_split
        remaining_ratio -= ratio
    else:
        raise ValueError("Ratio needs to be in (0,1]")
return tuple(splits)
def group_by_template(dataset: List[InputSample]) -> Dict[str, List[InputSample]]:
"""Creates a dict of key = template ID and value = List[InputSample] for this template ID"""
samples_pattern_tup = [(sample.metadata["Template#"], sample) for sample in dataset]
group_by_template = defaultdict(list)
for template_id, sample in samples_pattern_tup:
    group_by_template[template_id].append(sample)
return group_by_template
def split_by_template(input_samples: List[InputSample], train_pct: float = 0.7):
"""Splits a dataset of type List[InputSample] into a tuple of train template IDs and test template IDs"""
samples_grpd = group_by_template(input_samples)
templates = np.array(list(samples_grpd.keys()))
train_ind = set(random.sample(range(len(templates)), round(train_pct * len(templates))))
test_ind = set(range(len(templates))) - train_ind
return templates[list(train_ind)], templates[list(test_ind)]
def get_samples_by_pattern(input_samples, patterns_list):
samples_grpd = group_by_template(input_samples)
dataset = []
for pattern in patterns_list:
    dataset.extend(samples_grpd[pattern])
return dataset
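The reason for splitting by template rather than by sample is to keep every sample generated from one template on the same side of the split, so the test set measures generalization to unseen sentence patterns. A standalone sketch of that idea, using plain dicts with a hypothetical `template_id` key in place of `InputSample.metadata["Template#"]` and a seeded generator for reproducibility:

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_group(samples: List[dict], train_pct: float,
                   seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Split samples so that all samples sharing a template land in the same split."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        groups[sample["template_id"]].append(sample)

    template_ids = sorted(groups)
    rng = random.Random(seed)
    # sample template IDs, not individual samples
    train_ids = set(rng.sample(template_ids, round(train_pct * len(template_ids))))

    train = [s for t in template_ids if t in train_ids for s in groups[t]]
    test = [s for t in template_ids if t not in train_ids for s in groups[t]]
    return train, test


# 50 samples generated from 10 templates, 5 samples per template
samples = [{"template_id": str(i % 10), "text": "sample {}".format(i)} for i in range(50)]
train, test = split_by_group(samples, train_pct=0.7)
print(len(train), len(test))
```

The ratio applies to templates, not samples, so the resulting sample counts match the requested ratio only when templates have similar numbers of samples.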
def save_to_json(samples, output_file):
examples_dict = [example.to_dict() for example in samples]
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
json.dump(examples_dict, f, ensure_ascii=False, indent=4)

pytest.ini Normal file
@ -0,0 +1,17 @@

@ -0,0 +1,38 @@
from setuptools import setup
import os.path
# read the contents of the README file
from os import path
this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, ''), encoding='utf-8') as f:
long_description =
# print(long_description)
__version__ = ""
with open(os.path.join(this_directory, 'VERSION')) as version_file:
__version__ =
packages=['presidio_evaluator', 'presidio_evaluator.data_generator'
description='PII dataset generator, model evaluator for Presidio and PII data in general',

tests/ Normal file

tests/ Normal file

@ -0,0 +1,31 @@
import pytest
# pytest configuration file
# the configuration allows 3 kinds of tests:
# * unmarked tests run on every pytest execution
# * tests with large datasets or long running times are marked as "slow" and have to be run with pytest --runslow
# * tests with inconclusive results are marked as "inconclusive" and have to be run with pytest --runinconclusive
# * tests can be both slow and inconclusive and have to be run with pytest --runslow --runinconclusive
def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )
    parser.addoption(
        "--runinconclusive", action="store_true", default=False, help="run inconclusive tests"
    )
def pytest_collection_modifyitems(items, config):
    if not config.getoption("--runslow"):
        skip_slow = pytest.mark.skip(reason="need --runslow option to run")
        for item in items:
            if "slow" in item.keywords:
                item.add_marker(skip_slow)
    if not config.getoption("--runinconclusive"):
        skip_inconclusive = pytest.mark.skip(reason="need --runinconclusive option to run")
        for item in items:
            if "inconclusive" in item.keywords:
                item.add_marker(skip_inconclusive)


@ -0,0 +1,18 @@
a 1,()
a b c,()
a cappella,()
a fortiori,()
a mensa et thoro,()
a posteriori,()
a priori,()
aaron's rod,()
2 a ()
3 a- ()
4 a 1 ()
5 a b c ()
6 a cappella ()
7 a fortiori ()
8 a mensa et thoro ()
9 a posteriori ()
10 a priori ()
11 aam (n.)
12 aard-vark (n.)
13 aard-wolf (n.)
14 aaronic (a.)
15 aaronical (a.)
16 aaron's rod ()
17 ab (n.)
18 ab- ()

1,female,Czech,Mrs.,Marie,J,Hamanová,"P.O. Box 255",Kangerlussuaq,QE,Qeqqata,3910,GL,Greenland,,Wasco1982,eiZookooB7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","84 23 30",299,Kubíková,3/29/1982,37,Aries,MasterCard,5545634085461876,511,1/2020,,"1Z 789 686 82 8979 914 6",6945116246,34746079,Purple,"Surveillance officer","Simple Solutions","1995 Zastava 65",,O+,217.6,98.9,"5' 5""",164,6781b04d-7b5f-4c1a-bceb-b953e6ef70d7,77.377518,-67.015569
2,female,French,Ms.,Patricia,G,Desrosiers,"Avenida Noruega 42","Vila Real",VR,"Vila Real",5000-047,PT,Portugal,,Fultses,eb6soCha4ae,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","21 259 903 5696",351,Daviau,2/28/1956,63,Pisces,MasterCard,5317250628844522,874,3/2022,,"1Z V38 747 73 7311 832 9",7398998399,18093674,Blue,"Vascular technologist","Formula Gray","2006 Lexus GS",,O+,118.1,53.7,"5' 0""",152,2b2e7e1a-855f-4089-a570-c0af2381a6d6,41.274541,-7.876658
3,female,American,Ms.,Debra,O,Neal,"1659 Hoog St",Brakpan,GA,Gauteng,1553,ZA,"South Africa",,Cognoy,sha3Sohzee,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/ Yowser/2.5 Safari/537.36","082 490 1693",27,Barrett,6/11/1957,62,Gemini,Visa,4916429195104076,315,5/2020,5706114632083,"1Z 061 1E5 71 3400 427 4",6186449862,58702271,Blue,"Information architect librarian",Dahlkemper's,"1993 Honda Prelude",,A+,120.1,54.6,"5' 4""",162,2ef83f4c-3102-4f79-839d-c75bf6a06f0a,-26.22096,28.283398
4,male,French,Mr.,Peverell,C,Racine,"183 Epimenidou Street",Limassol,LI,Limassol,3041,CY,"Cyprus (Anglicized)",,Restlys,Aekie7ohs,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","25 470375",357,Grondin,6/14/1962,57,Gemini,Visa,4485421519226702,653,5/2023,,"1Z F44 91V 14 3570 491 2",0850016444,52534088,Blue,"Desk clerk",Quickbiz,"2008 Infiniti G35",,B+,142.1,64.6,"5' 9""",174,bfb4be71-3710-4ffa-baaf-5af6aa4b339e,41.30296,-72.989066
5,female,Slovenian,Mrs.,Iolanda,S,Tratnik,"Karu põik 61",Pärnu,PR,Pärnumaa,80098,EE,Estonia,,Trely1962,jeiziejohH3ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","445 6271",372,Korbun,1/23/1962,57,Aquarius,Visa,4532820383285186,893,4/2024,,"1Z 060 418 64 7516 574 4",1178606881,74806227,Purple,"Production assistant","Dubrow's Cafeteria","2007 Fiat Idea",,O+,141.5,64.3,"5' 3""",160,0cbb7bf3-466f-4df6-bda3-9c9fe7bfc5c1,58.293395,24.434851
6,male,Italian,Mr.,Domenico,D,Pisano,"Via Pisanelli 104",Traversara,RA,Ravenna,48020,IT,Italy,,Hatelt,lohhee8Zah,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36","0312 0828589",39,Conti,6/1/1979,40,Gemini,Visa,4532872142737056,237,6/2023,WK48391724,"1Z 175 1F5 29 1963 168 1",7448393148,31617424,Blue,"Professional scout",Littler's,"1998 Nissan Serena",,O+,247.5,112.5,"6' 0""",182,f4feeb24-e3b1-4d99-9c71-e8c6a95762fe,44.588081,12.055283
7,male,Greenland,Mr.,Pavia,A,Rosing,"29 Wattle St","King William's Town",EC,"Eastern Cape",5601,ZA,"South Africa",,Thattere,aiCheed7tie,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/ Yowser/2.5 Safari/537.36","082 692 3461",27,Lennert,5/5/1937,82,Taurus,Visa,4539980160229196,256,10/2020,3705057790082,"1Z 507 770 52 3012 473 1",1256867146,65720899,Green,"Chemical engineer","Pup 'N' Taco","2003 Peugeot Partner",,O-,192.1,87.3,"6' 0""",182,b6b75cf9-dfbf-424d-a03c-90cdd859e9eb,-32.787712,27.343649
8,female,French,Mrs.,Ormazd,M,Jomphe,"Mattenstrasse 108",Sissach,,,4450,CH,Switzerland,,Deace1999,oochui5Eboe5T,"Mozilla/5.0 (Windows NT 6.1; rv:66.0) Gecko/20100101 Firefox/66.0","061 947 83 90",41,Busson,1/14/1999,20,Capricorn,Visa,4556603638439886,691,6/2024,,"1Z 091 192 83 9348 168 6",4380386435,24628087,Purple,"Clinical psychologist","Linens 'n Things","1996 Plymouth Neon",,O+,115.3,52.4,"5' 1""",154,e5858bdb-9173-4991-9857-4e09b61e4e16,47.520557,7.863831
9,male,Norwegian,Mr.,Severin,L,Akhtar,"251 Charilaou Trikoupi Str.",Pigenia,NI,Nicosia,2962,CY,"Cyprus (Anglicized)",,Heremer,cieCipua8L,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 586625",357,Mathisen,4/30/1960,59,Taurus,MasterCard,5230940651584482,785,10/2022,,"1Z W81 228 52 4912 032 1",8897778249,55031915,Green,"Pump operator","Fragrant Flower Lawn Services","2005 Dodge Nitro",,B+,155.3,70.6,"6' 0""",182,64383596-6dc8-4b77-9476-c1a8ef23ffc6,41.335894,-72.908321
10,female,Greenland,Mrs.,Margrethe,H,Kristiansen,"94 boulevard Amiral Courbet",ORLÉANS,CE,Centre,45100,FR,France,,Theirturavid,Aiv4ohwae,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",,33,Berthelsen,12/13/1979,40,Sagittarius,MasterCard,5306020150102745,915,12/2024,"2791269679323 49","1Z 987 E42 01 7982 218 2",0231937615,18687876,Purple,"Systems software engineer","Independent Wealth Management","2012 Porsche 911",,A+,200.2,91.0,"5' 3""",159,ae7b1a56-0d6d-46ba-895e-75ee10482858,47.850047,1.875252
11,female,Hispanic,Mrs.,Myrna,G,Feliciano,"Männi 12",Mustoja,LV,Lääne-Virumaa,45429,EE,Estonia,,Wilthe84,Chu6shiRees,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","329 3803",372,Cortés,1/23/1984,35,Aquarius,MasterCard,5402842306504596,788,7/2020,,"1Z 599 666 46 6430 018 3",8678427166,90687114,Blue,"Radiologic technician",Monmax,"2003 Mitsubishi Lancer",,O+,212.7,96.7,"5' 5""",164,4e8b5c6c-0b04-43c1-ad5b-2fc92365455c,59.638357,26.059683
12,male,Czech,Mr.,Michal,E,Horký,"Algade 33",Guldborg,SJ,"Region Sjælland",4862,DK,Denmark,,Fiect1941,oxep7Aev,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",28-64-27-85,45,Siváková,6/19/1941,78,Gemini,MasterCard,5108116586316493,376,2/2021,190641-4941,"1Z 85E W86 50 8027 647 6",4547979244,81431928,Orange,"Dental assistant",Pointers,"1995 Daihatsu Rocky",,B+,165.0,75.0,"5' 10""",177,a2aa0138-07d3-41e3-b3b7-455d9854e31f,54.815396,11.760822
13,male,French,Mr.,Donat,M,Lespérance,"96 rue de Penthièvre",PONTOISE,IL,Île-de-France,95000,FR,France,,Sirep1950,re8ZieK4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",,33,Dodier,11/12/1950,69,Scorpio,Visa,4556904288472270,512,9/2022,"1501143313127 93","1Z 070 9Y5 64 7265 236 0",1770634719,11974448,Black,Neurosonographer,"American Appliance","2009 Kia Cerato",,A+,180.8,82.2,"5' 7""",170,78b497ed-6d7d-4e5f-8150-1155abf9716e,48.977559,1.976986
14,female,"Japanese (Anglicized)",Ms.,Yuuka,M,Shimasaki,"Mjövattnet 1",NYLAND,,,"870 52",SE,Sweden,,Coun1976,ohleaT4ae,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",0613-9040212,46,Kimura,2/11/1976,43,Aquarius,MasterCard,5457403889440023,903,5/2021,760211-4105,"1Z 975 450 29 2316 562 4",8652144021,72590241,Blue,"Credit checker",Elek-Tek,"2006 Ford Territory",,A-,151.8,69.0,"5' 5""",165,df1a2d57-31d8-4a71-8ad6-cb687ee250d4,62.773416,17.853904
15,male,Swedish,Mr.,Wiktor,H,Ek,"Norðurbraut 27",Reykjavík,,,112,IS,Iceland,,Boally,aigo2OoPhoi,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","434 6815",354,Göransson,5/4/1945,74,Taurus,Visa,4532063379896779,489,4/2024,,"1Z 278 965 48 6106 268 1",1158883056,79465382,White,"Legal secretary","Handy Andy Home Improvement Center","2014 Jaguar XF",,O+,227.0,103.2,"5' 10""",178,d39f58d7-7bb7-4f77-9956-9b505f8a4cc8,64.187422,-21.93344
16,female,Slovenian,Ms.,Polona,H,Ranković,"Õli 68",Himmiste,PL,Põlvamaa,64204,EE,Estonia,,Whou1985,eeZae5oech,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36","798 9719",372,Orehek,7/18/1985,34,Cancer,MasterCard,5433975988486550,723,7/2022,,"1Z A77 0E9 59 0580 345 8",0984960146,75114603,Blue,"Personal trainer","Budget Tapes & Records","1995 Lancia Delta",,A+,160.2,72.8,"5' 5""",165,c8e80b15-7ff4-4b21-bebf-5737d8133bdc,57.990467,27.141455
17,male,Scottish,Mr.,Ivan,M,King,"Pachergasse 64",BÜSCHENDORF,ST,Styria,8786,AT,Austria,,Anempon,XooJoh0se5sh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0660 475 13 89",43,Watson,9/15/1994,25,Virgo,MasterCard,5257015834586726,714,12/2020,,"1Z 37A 329 60 9892 939 6",1176881962,86098320,Green,"Placement counselor",Opticomp,"2000 Citroen C 15",,A+,130.2,59.2,"5' 7""",170,58aec14a-fc35-4e19-a61f-b8f170a9ec7d,47.492881,14.377639
18,female,Finnish,Mrs.,Nelma,M,Grönholm,"Rostsestraat 222",Froidchapelle,WLX,Luxembourg,6440,BE,Belgium,,Obect1946,aeJ6OhneiF3t,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0479 50 54 03",32,Pelkonen,9/15/1946,73,Virgo,MasterCard,5208559023644291,187,12/2021,,"1Z 416 A35 89 5065 644 8",4484163657,51232002,Green,"Claims adjuster","Starship Tapes & Records","2010 Land Rover Defender",,A+,216.5,98.4,"5' 7""",169,2f575438-1e23-4123-ab60-990725e8c08b,50.072755,4.344408
19,female,Hungarian,Ms.,Tünde,F,Hoffmann,"Via Nazario Sauro 112","Cusano Milanino",MI,Milano,20095,IT,Italy,,Preacces,aob4eiteiL,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","0352 9353380",39,Bagi,11/1/1951,68,Scorpio,MasterCard,5559688562190559,258,7/2023,DA75119938,"1Z 959 98A 67 7929 896 3",2123939613,78515387,Orange,"Apparel worker","Hugh M. Woods","2005 BMW X5",,O+,186.1,84.6,"4' 11""",151,0a14766e-42f5-4943-ae10-884bcadf43f8,45.644863,9.128014
20,female,Finnish,Ms.,Riitta,N,Hahl,"ul. Elbląska 97",Olsztyn,,,10-672,PL,Poland,,Thill1954,Aiwar2ooh1,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","67 788 02 83",48,Linna,3/2/1954,65,Pisces,MasterCard,5150663209146952,542,2/2021,54030268860,"1Z 626 853 00 4461 590 4",4657047492,90091015,Purple,"Medical secretary",Edwards,"2012 Toyota Prius",,O+,187.7,85.3,"5' 7""",170,e7825dc5-9fdf-478c-ba72-ab25879699f1,53.768852,20.536572
21,male,Dutch,Mr.,Harwin,R,Galesloot,"Glynitveien 218",SKI,,,1400,NO,Norway,,Saffive,wohHaix5fa,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","914 54 925",47,Ramakers,9/28/1994,25,Libra,MasterCard,5580997921941047,961,4/2022,,"1Z 761 410 55 6702 728 6",9467674600,08790710,Blue,"Electric motor repairer","Buena Vista Realty Service","2002 ZAZ Slavuta",,B+,217.1,98.7,"5' 6""",167,4029654d-5c8f-407e-b003-72bde9f593a1,59.814132,10.87172
22,male,Hispanic,Mr.,Azarías,A,Segovia,"Via Francesco Girardi 49","Carmignano Di Brenta",PD,Padova,35010,IT,Italy,,Portalime,duteiVeev1,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","0394 9130281",39,Nava,12/24/1949,70,Capricorn,MasterCard,5146736492498053,173,1/2022,SZ30479384,"1Z 931 W28 91 2882 876 3",2747014433,22062098,Blue,"Administrative office manager","Total Serve","2005 BMW 325",,B+,140.4,63.8,"6' 0""",184,916cbe50-55b2-4a11-acf4-b8d8d9cc9668,45.432966,12.000719
23,male,Hungarian,Mr.,Adelbert,A,Kuncz,"Turjaška 115","Rečica ob Savinji",,,3332,SI,Slovenia,,Firseten,woh3ejoo2Ei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",031-365-314,386,Pethô,10/17/1976,43,Libra,MasterCard,5431971800958886,663,4/2022,,"1Z 5V7 161 85 3863 932 0",7652480699,67841291,Blue,"Extruding, forming, pressing, and compacting machine setter","Red Owl","2002 Citroen C-Airdream",,O+,193.6,88.0,"6' 1""",186,6653d52f-870f-4956-8b1a-c1463007f387,46.357704,14.932894
24,female,England/Wales,Ms.,Charlie,S,Campbell,"Rue du Château 414",Limont,WLG,Liège,4357,BE,Belgium,,Lithatinquir,Kae4aetah,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0489 33 97 26",32,Tucker,10/16/1987,32,Libra,MasterCard,5268537479220623,072,1/2023,,"1Z 3V4 354 38 1677 342 8",0119239854,24858321,Purple,"Dry-cleaning worker","Red Robin Stores","2004 Audi S4",,O+,119.5,54.3,"5' 8""",173,f3a846e6-ec1c-4ccc-bf51-f66bb64ff559,50.616062,5.247283
25,female,American,Ms.,Thelma,K,Mitchell,"Väike-Laagri 80",Orissaare,SA,Saaremaa,94691,EE,Estonia,,Trind1979,eeThooy3ieph,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","455 6186",372,Rumbaugh,5/23/1979,40,Gemini,MasterCard,5252527082023181,249,9/2024,,"1Z 001 306 31 4745 481 7",4024903278,57842179,Black,"Camera repairer","Omni Superstore","2000 Ford Artic",,A+,118.4,53.8,"5' 6""",168,6e2490c7-4a17-4ce8-9a80-4b82551d3180,58.540748,23.059808
26,male,Brazil,Mr.,Davi,G,Santos,"Rákóczi út 66.",Barnag,VE,Veszprém,8291,HU,Hungary,,Cumeneamord,AeYeRie2doo,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","(88) 158-170",36,Goncalves,3/15/1974,45,Pisces,Visa,4539342451489007,234,9/2021,,"1Z 598 735 11 9286 476 3",9550539865,84997161,Blue,"Clinical manager","Star Merchant Services","2011 Alfa Romeo Giulietta",,A+,156.9,71.3,"5' 11""",180,95501fdd-a13f-43fd-981c-0dced4dd23e6,47.04597,17.745945
27,male,England/Wales,Mr.,Jonathan,A,Conway,"Rua Doutor Afrânio Junqueira 1460","São Paulo",SP,"São Paulo",04581-040,BR,Brazil,,Bessed,Egaez2Vuo,"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko","(11) 7113-8192",55,Humphries,1/20/1938,81,Aquarius,Visa,4916317241919037,680,3/2023,308.271.618-08,"1Z 02V 42E 21 2992 331 4",3494650897,26033304,Blue,"Electrical drafter","Pro Yard Services","2006 Dodge Caravan",,B+,170.9,77.7,"6' 0""",183,0eb93cfe-1260-4caa-99d5-e0f31f73b1e9,-23.594567,-46.709971
28,male,French,Mr.,Guy,A,Migneault,"90 Petworth Rd",DUNSINNAN,,,"PH2 5HL",GB,"United Kingdom",,Enut1960,nahming4Oo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","077 5138 5842",44,Labrecque,4/28/1960,59,Taurus,MasterCard,5581802812256860,692,3/2022,"ZT 01 87 75","1Z 96V 661 27 4962 061 4",8863556519,92044686,Blue,"Technical trainer","Rogers Peet","1999 Fiat Siena",,B+,176.7,80.3,"5' 7""",171,eca54fd7-3576-426c-b4f9-b617a3662931,56.025747,-3.640577
29,male,Hispanic,Dr.,Breogan,J,Orosco,"Ηλίου 64",ΛΑΡΝΑΚΑ,LA,Λάρνακα,6031,CY,"Cyprus (Greek)",,Priback,Quoo9choo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","97 687579",357,Ceballos,9/19/1957,62,Virgo,MasterCard,5567650882732569,392,10/2022,,"1Z A69 196 36 1803 521 0",6314780611,79148788,Green,"Machinery maintenance mechanic","Jack Lang","2010 Hyundai i30",,A+,155.1,70.5,"5' 10""",179,8dda77bb-d75c-40ef-b1c0-6381d305b053,41.348113,-72.957665
30,female,Czech,Mrs.,Jaroslava,M,Kindlová,"22 Rue de Sidi Bou Zid",Zouarine,33,"Governorate Kef",7170,TN,Tunisia,,Tepen1939,thahShee7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763","78 427 062",216,Langová,11/28/1939,80,Sagittarius,Visa,4532368231815457,583,12/2022,,"1Z 349 036 22 9992 262 9",5257019048,20372734,Black,"Copy editor",Anthony's,"2005 Nissan Altima",,B+,108.7,49.4,"5' 4""",163,94e73608-33dd-4f7a-a86d-9575aa2e63e6,37.219459,9.881268
31,male,Croatian,Mr.,Stjepan,A,Perković,"Escuadro 26","Castelló de Rugat",V,Valencia,46841,ES,Spain,,Spastry,uLiH7iech3,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","779 021 982",34,Topić,3/18/1978,41,Pisces,Visa,4539449132248031,692,6/2023,,"1Z V92 2E8 34 9201 447 0",3497779171,87502347,Orange,Geoscientist,Monmax,"2010 Chrysler PT Cruiser",,A+,175.8,79.9,"5' 10""",178,e49f858d-b218-4ce7-89ae-62599b3bb276,38.927331,-0.375157
32,male,Croatian,Mr.,Stanko,T,Crnić,"Avda. Alameda Sundheim 46",Benasque,HU,Huesca,22440,ES,Spain,,Waskents,Piu4theeg1ae,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","793 358 347",34,Jozić,9/13/1974,45,Virgo,Visa,4929607743905830,463,2/2021,,"1Z 981 F24 67 9469 260 4",6183936244,94973462,Blue,"Studio camera operator","Wealthy Ideas","2004 Isuzu Axiom",,B+,185.2,84.2,"6' 1""",186,6ee36d07-1397-43c8-86a4-455bce264fbc,42.623207,0.475571
33,female,Russian,Ms.,Marianne,I,Zhdanova,"18 Rue de bayrout","Cite Badrani",61,"Governorate Sfax",3083,TN,Tunisia,,Thock1968,uRai6thoh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","74 849 807",216,,11/30/1968,51,Sagittarius,MasterCard,5223193173825541,622,7/2020,,"1Z 60W 422 22 2116 147 5",6606006372,32841170,Blue,"Geographic information specialist","Star Interior Design","1996 Dodge Caravan",,B+,168.3,76.5,"5' 7""",169,eff7c2c8-85f4-45dc-8a4d-60770f4eda42,35.319852,9.785865
34,female,Hungarian,Ms.,Ferike,G,Jónás,"Brucker Bundesstrasse 31",FÜRLING,NO,"Lower Austria",4152,AT,Austria,,Ofigaill49,ni8ooy1Thee,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763","0681 563 12 72",43,Tolnay,4/12/1949,70,Aries,Visa,4532628381402038,969,6/2021,,"1Z 099 5Y5 30 8995 126 3",7242712846,73346379,Red,"Computer systems administrator","ABCO Foods","1999 Land Rover Discovery",,O+,134.6,61.2,"5' 5""",166,4687baf3-0486-4466-a63b-d0dff8903249,48.633035,13.984742
35,female,Czech,Ms.,Lenka,T,Mizerová,"Rhinstrasse 91",München,BY,"Freistaat Bayern",80975,DE,Germany,,Daudgessed,keer4Ceej9j,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","089 60 25 65",49,Brožová,1/14/1944,75,Capricorn,Visa,4916250570933685,939,6/2023,,"1Z 580 A90 83 6948 520 2",9291535188,57982233,Red,Maid,AdventureSports!,"2000 Toyota Sienna",,O+,184.1,83.7,"5' 1""",154,efb875ed-74dc-4ea3-b155-4a48cb61d187,48.189892,11.502201
36,female,Icelandic,Mrs.,Eyþóra,P,Runólfsdóttir,"Ditscheinergasse 80",HUNDSHAGEN,OO,"Upper Austria",4773,AT,Austria,,Expregiat,Beph8ieX,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","0699 830 58 07",43,,11/22/1960,59,Sagittarius,MasterCard,5426319483823638,482,11/2024,,"1Z 448 19V 41 8418 729 7",2455440003,04422500,Purple,"Building cleaning worker","Matrix Architectural Service","2010 BMW 650",,O-,130.7,59.4,"5' 6""",168,02a29393-6fdc-4120-adf0-568906c8c111,48.266115,13.568714
37,female,French,Mrs.,Rive,T,Lépicier,"144 Souniou Ave.",Menogeia,LA,Larnaca,7578,CY,"Cyprus (Anglicized)",,Carray,ieJij3no,"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0","24 102884",357,Lanteigne,4/15/1983,36,Aries,MasterCard,5494902843118711,376,8/2024,,"1Z W04 891 69 0373 898 8",7751332568,91800864,Yellow,"Payroll and benefits specialist","Lechters Housewares","2006 Nissan Pathfinder",,A+,119.7,54.4,"5' 4""",163,17afbc1b-4d95-4583-8602-9680b4fd7c5c,41.311392,-72.829123
38,male,Danish,Mr.,Marcus,O,Paulsen,"Plattenstrasse 57",Räterschen,,,8352,CH,Switzerland,,Mesee1943,Fi7eiva8Ah,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","044 347 47 26",41,Simonsen,10/8/1943,76,Libra,MasterCard,5378357034830932,713,8/2021,,"1Z 387 E31 19 5962 225 9",2639724434,59408085,Black,"Management development specialist","Parts and Pieces","2006 BMW M3",,B+,207.5,94.3,"5' 10""",177,43ae4a6b-e1ca-4d5e-b6cd-3732697c9c71,47.488972,8.868299
39,female,"Chechen (Latin)",Mrs.,Zeliha,I,Sultygov,"Rookopli 96",Uralaane,VG,Valgamaa,68712,EE,Estonia,,Dary1953,nohgief2A,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","763 5734",372,Desheriyev,7/27/1953,66,Leo,Visa,4539696753097085,200,11/2021,,"1Z E17 641 44 8337 404 1",5174706823,32841274,Red,"Mental health assistant","Reliable Investments","2000 Opel Signum",,O+,119.9,54.5,"5' 3""",160,68c2ef2c-9990-41bc-be94-afa07e6e2379,58.070181,26.064252
40,female,Russian,Mrs.,Ilona,B,Pirogova,"Αγ. Ανδρέα 130","ΒΑΣΑ ΚΟΙΛΑΝΙΟΥ",LI,Λεμεσός,4771,CY,"Cyprus (Greek)",,Hatiere,Ahsha4Ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","25 750307",357,,12/5/1970,49,Sagittarius,Visa,4539731441846112,542,4/2024,,"1Z 008 309 69 5575 108 1",3971097056,06463075,Purple,"Land acquisition manager","Wickes Furniture","1992 Ford Taurus",,O-,135.1,61.4,"5' 1""",156,0c5899a0-ce9c-43a7-aba1-f279893620f9,41.295272,-72.961282
41,female,Croatian,Ms.,Aleksandra,K,Petković,"Binzmühlestrasse 30","San Bernardino",,,6565,CH,Switzerland,,Ramessanies1994,vie2quai7Ie8,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","091 808 37 22",41,Bašić,6/8/1994,25,Gemini,MasterCard,5560110586075796,991,3/2023,,"1Z 11E 310 61 4037 919 3",0620594797,44658507,Purple,"Personal banker",Ejecta,"2005 Kia Amanti",,A+,110.2,50.1,"5' 4""",163,cc560302-0d00-410c-9629-2e68bb4ef864,46.506804,9.159102
42,male,American,Mr.,Rogelio,A,Patrick,"Πλ Καραισκάκη 128","ΑΓΙΟΣ ΘΕΟΔΩΡΟΣ ΣΟΛΕΑΣ",NI,Λευκωσία,2823,CY,"Cyprus (Greek)",,Whortin1952,Yu7mah2z,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","95 432561",357,Thacker,10/25/1952,67,Scorpio,Visa,4716135824324942,242,9/2021,,"1Z 14A 327 32 1181 648 4",9342714384,12993510,Blue,"Typesetting machine tender","Vibrant Man","2001 Toyota MR2",,B+,170.3,77.4,"5' 11""",180,76a566cc-be59-4327-862e-312da09e0c42,41.353523,-72.965839
43,female,American,Mrs.,Evelyn,R,Tucker,"Kringlan 66",Reykjavík,,,107,IS,Iceland,,Arresplet,deiT0ahyu,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36","450 3756",354,Burton,9/25/1986,33,Libra,MasterCard,5592761939548814,873,9/2022,,"1Z 263 919 45 6552 555 7",5057305433,86266508,Purple,"Aquaculture farmer",Weatherill's,"2004 Mitsubishi Galant",,O+,105.8,48.1,"5' 3""",159,716b5321-34bf-4514-8bca-fce5c482d8c3,64.159592,-21.928397
44,male,Icelandic,Mr.,Þorkell,H,Hallbjörnsson,"Školní 296","Kaplice 1",JC,"Jihoceský kraj","382 41",CZ,"Czech Republic",,Dessesid,eeXahew1ui,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","772 616 930",420,,12/19/1957,62,Sagittarius,MasterCard,5534480983249093,443,6/2023,,"1Z 34E 320 47 9554 749 9",1825424609,81224507,Blue,"Financial aid director","Grand Union","2004 Ford Explorer",,A-,236.9,107.7,"6' 1""",186,393472fc-3454-4ba8-af3a-e1f7f626cfee,48.691433,14.516696
45,male,Greenland,Mr.,Jan,H,Geisler,"Bayerhamerstrasse 79",GLAUBENDORF,NO,"Lower Austria",3704,AT,Austria,,Subjecould,eepooz6U,"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko","0699 456 17 84",43,Lange,10/25/2000,19,Scorpio,MasterCard,5305776196130476,904,6/2020,,"1Z 084 34A 51 9322 259 5",7785136902,52035400,Blue,"Fire prevention specialist",Mikrotechnic,"2005 Bizzarrini BZ-2001",,B+,203.1,92.3,"5' 9""",175,92630214-ba49-47e4-8717-93e4bba4262e,48.560667,15.917015
46,female,Norwegian,Mrs.,Caroline,M,Landmark,"Via Tasso 21",Perugia,PG,Perugia,06122,IT,Italy,,Sweves,ooNee0iechoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","0378 8718408",39,Benjaminsen,9/26/1975,44,Libra,MasterCard,5248190326919222,805,6/2024,PR74491787,"1Z 2Y4 773 67 4365 263 8",7709648575,99170582,Green,"Allopathic physician","Castro Convertibles","2012 Dodge Durango",,A-,187.7,85.3,"5' 6""",168,dde3a962-10d8-4092-b9df-6da79f89f383,43.072973,12.459411
47,female,Swedish,Ms.,Lena,M,Andersson,"Parkring 7",STEINPARZ,OO,"Upper Austria",4730,AT,Austria,,Freen1978,aeVaiHohy7,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","0650 858 08 11",43,Holm,10/24/1978,41,Scorpio,MasterCard,5451268671996177,795,8/2023,,"1Z 683 821 70 2253 409 0",7986278354,59684998,Blue,"Chemical technician","Wholesale Club, Inc.","2007 Kia Carnival",,AB+,180.2,81.9,"5' 1""",154,8d7f4c08-ee33-4024-9474-0beed004df45,48.234026,13.824536
48,male,Danish,Mr.,Elias,A,Jepsen,"Βασιλέως Αλεξάνδρου 195",ΦΑΡΜΑΚΑΣ,NI,Λευκωσία,2620,CY,"Cyprus (Greek)",,Thenim,uki0Zae7l,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 228011",357,Olsen,9/10/1967,52,Virgo,Visa,4929867576889614,699,4/2023,,"1Z 348 2Y4 34 0337 091 2",3256227540,51352382,Blue,"Court, municipal, and license clerk","Golden's Distributors","2011 Volvo XC70",,A+,214.5,97.5,"5' 9""",174,f06e06a7-59b0-4052-9928-92df476d7753,41.326352,-72.962624
49,male,French,Mr.,Honoré,N,Beaudouin,"13 Faubourg Saint Honoré",PAU,IL,Île-de-France,64000,FR,France,,Slise1955,Zei7phaeSuutu,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",,33,Daoust,11/2/1955,64,Scorpio,Visa,4539798879618651,321,11/2020,"1551124428495 35","1Z 424 792 46 6757 249 6",9796303410,07585755,Blue,"Dairy scientist","York Steak House","1999 GAZ 3111",,O+,236.1,107.3,"5' 9""",175,00a9f1f4-bec6-4dda-ac22-97e2860b1662,43.241847,-0.41343
50,male,American,Mr.,Richard,K,Martinez,"Via Zannoni 49","Tiarno Di Sopra",TN,Trento,38060,IT,Italy,,Himmest,Feitee9ien,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","0328 4921229",39,Dickenson,9/2/1985,34,Virgo,MasterCard,5258225275434802,882,2/2023,PJ53725800,"1Z 4V9 5Y7 22 3010 519 9",5935776326,45852443,Blue,Logistician,Macroserve,"1995 Fiat Bravo",,O+,205.9,93.6,"5' 9""",176,fbe7a3e7-ace4-4495-b6e6-0c2cb4abcc88,45.964528,10.759331
51,male,"Chechen (Latin)",Mr.,Salambek,T,Melikov,"1678 Dorp St",Claremont,WC,"Western Cape",7740,ZA,"South Africa",,Brint1956,ehahCh1xai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","083 792 9726",27,Gairbekov,7/13/1956,63,Cancer,MasterCard,5284109963636027,465,12/2023,5607139788084,"1Z 558 445 48 7417 922 3",1631262867,63160277,Green,"Telephone operator","Kinney Shoes","1998 Chevrolet Trans Sport",,O+,211.9,96.3,"5' 8""",172,b1628f36-9fdd-487d-8297-a35ee6d72ebf,-33.89536,18.479041
52,female,Greenland,Mrs.,Mette,K,Olsen,"ul. Karpacka 69",Bydgoszcz,,,85-164,PL,Poland,,Liffew,queiy6ooGh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","88 165 40 96",48,Jeremiassen,1/30/1935,84,Aquarius,MasterCard,5353410735290150,175,3/2023,35013096720,"1Z 129 156 25 6468 002 5",2379843087,66415820,Yellow,Lawyer,MagnaSolution,"2005 Peugeot 107",,O-,117.0,53.2,"5' 0""",152,01d98231-8e58-4e17-9c9a-bc5aa388928a,53.068638,18.093529
53,male,Russian,Mr.,Spartacus,N,Ignatieff,"Bahnhofstrasse 57",Glovelier,,,2855,CH,Switzerland,,Imeting1968,yi2Eep8gieh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","032 803 90 31",41,,6/16/1968,51,Gemini,MasterCard,5399586162423418,375,3/2024,,"1Z 632 482 77 2949 860 1",2465813266,58937027,Blue,"Sales worker supervisor",Schweggmanns,"2008 Mazda 5",,O+,152.9,69.5,"5' 8""",173,6365915b-b99c-426c-ae8e-698f846d3f03,47.232383,7.22689
54,male,Brazil,Mr.,Kauã,S,Cardoso,"P.O. Box 194",Upernavik,QA,Qaasuitsup,3962,GL,Greenland,,Searlitnot,caiT4reN,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 63 21",299,Castro,3/9/2000,19,Pisces,MasterCard,5301206781704786,476,5/2024,,"1Z 346 5Y6 16 7677 108 7",7407791874,63200812,Blue,"Press secretary","Dave Cooks","2002 Hyundai Elantra",,A+,155.5,70.7,"5' 10""",179,081371c1-479e-4055-95af-3110e72fc11a,72.786922,-56.131948
55,female,Brazil,Ms.,Fernanda,P,Cavalcanti,"Via degli Aldobrandeschi 3",Jelsi,CB,Campobasso,86015,IT,Italy,,Knour1941,ahChohqu4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0327 9982793",39,Souza,11/30/1941,78,Sagittarius,Visa,4929971103746071,969,10/2023,CI48765311,"1Z 223 435 25 6742 103 3",6973121025,59054247,Blue,"Budget analyst","Hughes & Hatcher","2001 Volkswagen Lupo",,B+,106.3,48.3,"5' 5""",165,c75dc0e6-fe45-431f-8907-6e58db479a3d,41.444028,14.707643
56,female,Hungarian,Ms.,Mónika,Z,Göröncsér,"Nábřežní 243","Spálené Porící",PL,"Plzenský kraj","335 61",CZ,"Czech Republic",,Thenetiong,quohwae5Quoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","376 147 284",420,Szôts,6/11/1946,73,Gemini,Visa,4916007260260864,417,6/2024,,"1Z A43 364 39 5822 708 0",8027400869,93626602,Black,"Precision printing worker","Plunkett Home Furnishings","2001 Bugatti EB 118",,B+,146.3,66.5,"5' 4""",162,13189ec1-db42-4f8c-b74e-e7a45d33a237,49.629,13.606864
57,female,Czech,Mrs.,Zuzana,M,Kozáková,"Via del Pontiere 101","Birgi Aerostazione",TP,Trapani,91020,IT,Italy,,Dinectich,mai5eiXaexai,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","0391 7843193",39,Minarčíková,1/21/1953,66,Aquarius,MasterCard,5164065137771907,924,5/2020,AT69882067,"1Z 736 067 39 5591 664 0",0937144872,77392590,Red,"Industrial engineering technician",Romp,"1998 Isuzu VX-02",,A+,123.0,55.9,"5' 3""",161,c414c613-db2b-4f1a-8bf1-a91e518afb85,37.610445,12.42306
58,female,Russian,Mrs.,Eugene,R,Bykova,"Via Goffredo Mameli 149","Poggiovalle Di Borgorose",RI,Rieti,02020,IT,Italy,,Ingentersed1943,aeN6eenul5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134","0366 9434948",39,,6/29/1943,76,Cancer,Visa,4556039653807659,972,5/2022,KX92160250,"1Z 975 21Y 29 8927 431 6",6861257852,04061254,Brown,"Executive secretary","Kinney Shoes","1994 Plymouth Voyager",,A+,195.8,89.0,"4' 11""",150,7eb2e374-3e73-41d3-95b6-f9bb93993872,42.079608,12.989079
59,female,Icelandic,Ms.,Nanna,S,Hallmundsdóttir,"85 Gimblett Street",Richmond,,Invercargill,9810,NZ,"New Zealand",,Wifflife1964,Deez4ooGi0,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","(022) 7929-466",64,,4/15/1964,55,Aries,MasterCard,5504023231705718,236,9/2020,,"1Z 668 366 39 7132 126 6",7750395260,67884791,Black,"Photographic process worker",Megatronic,"1999 Dodge Avenger",,O+,170.1,77.3,"5' 0""",153,05974695-4733-4bd3-b1ac-bffb56992160,-46.301185,168.423853
60,female,Polish,Ms.,Halina,C,Zielinska,"1 Gloucester Road",CLACHANDHU,,,"PA68 7QD",GB,"United Kingdom",,Corsome74,Vaem2keeV6,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","077 5149 6820",44,Zielinska,8/7/1974,45,Leo,Visa,4539422686560762,956,3/2022,"TB 10 69 23","1Z 189 A93 08 3744 015 7",0022083989,01394977,Blue,"Surgical technician",Monit,"2011 Renault Grand Scenic",,AB+,208.3,94.7,"5' 7""",171,44fd6c97-052a-42a8-b2d2-fb8b3fc70ba8,55.954988,-5.872046
61,female,"Japanese (Anglicized)",Ms.,Hatsuho,K,Yoneda,"Dalmatinova 35",Žabnica,,,4209,SI,Slovenia,,Therl1988,ahteel2maeSh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",051-632-354,386,Mikami,9/17/1988,31,Virgo,MasterCard,5104564324299139,201,4/2021,,"1Z 486 074 10 6995 124 2",8337173450,63167043,Purple,Treasurer,Practi-Plan,"2003 MG ZT",,O+,134.4,61.1,"5' 8""",173,e51aa8e2-9fc6-43e7-8488-168ca87b75d7,46.108258,14.328302
62,male,Croatian,Mr.,Gojislav,V,Jukić,"Bahnhofstrasse 96",Gorgier,,,2023,CH,Switzerland,,Tinguen,GaiXa3ai,"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36","032 304 33 66",41,Pavić,9/21/1999,20,Virgo,MasterCard,5165115278847765,074,12/2022,,"1Z 385 466 21 1758 512 0",6790588061,83119520,Red,"Extractive metallurgical engineer","Value Giant","2011 Lexus LFA",,B+,155.5,70.7,"6' 1""",186,0165faab-d065-4a58-a543-057c540a6863,46.955216,6.830252
63,female,England/Wales,Mrs.,Megan,T,Swift,"Postbox 23",Maniitsoq,QE,Qeqqata,3912,GL,Greenland,,Thersevere,ieGh5huoK6,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","81 32 04",299,Poole,4/30/1964,55,Taurus,Visa,4929574688538812,336,4/2024,,"1Z E38 W63 04 0063 263 0",6213800707,38860322,Green,"Support service manager","Little Folk Shops","2007 Chevrolet Optra",,B+,168.7,76.7,"5' 4""",163,394b47d6-e3cb-4869-b4b5-c42a81ef9b00,65.395922,-52.878832
64,male,Slovenian,Mr.,"Milan Franc",S,Košelnik,"Na Výsluní 272",Primda,PL,"Plzenský kraj","348 06",CZ,"Czech Republic",,Lospay67,queeN0ies,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","733 162 319",420,Mankoč,1/14/1967,52,Capricorn,MasterCard,5453102419958199,879,12/2021,,"1Z A16 354 25 3838 551 9",0820470346,05905803,Brown,"Clinical laboratory technologist","Singer Lumber","2003 Holden UTE",,A+,214.5,97.5,"5' 9""",175,7234b959-896e-4335-8745-c5716b1c7638,49.619319,12.730847
65,female,German,Ms.,Johanna,P,Maurer,"Kaisergasse 64",KURZENKIRCHEN,OO,"Upper Austria",4770,AT,Austria,,Reptaked1981,ohwae5Tee,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","0688 992 50 51",43,Schuhmacher,7/12/1981,38,Cancer,Visa,4916529470563530,481,4/2020,,"1Z Y23 375 24 5962 121 9",0306589986,74875435,Purple,"Elevator repairer","Solution Answers","2008 Rover Streetwise",,AB+,186.6,84.8,"5' 5""",164,30ce10a0-2ea3-4e24-871a-f72cb3beedfc,48.439595,13.547802
66,male,Norwegian,Mr.,Teodor,K,Aune,"Gl. Sygehusvej 153",Narsaq,KU,Kujalleq,3921,GL,Greenland,,Dickent,ooph5leiG,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","66 28 45",299,Arntzen,7/18/1998,21,Cancer,MasterCard,5296939640241254,624,2/2024,,"1Z 216 Y97 87 6791 863 9",0054598944,41175224,Blue,"Private household cook","Builders Emporium","2008 Renault Laguna",,O+,239.1,108.7,"5' 9""",174,0562be22-b239-4dbc-a0f7-d64679ae153f,60.827346,-46.022413
67,male,Dutch,Mr.,Abderrahman,I,Kempers,"Hjellestadnipen 66",HJELLESTAD,,,5259,NO,Norway,,Ancery,eeC4tien9,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","460 73 493",47,Cuperus,7/17/1977,42,Cancer,MasterCard,5358951722758050,004,9/2024,,"1Z Y78 20V 88 9025 826 2",1986315622,50825375,Blue,"Time clerk","Carrols Restaurant Group","2008 Dacia Sandero",,O+,182.4,82.9,"5' 6""",168,49c9f2d1-5e74-4ef9-8c9c-c5e6338e1f6d,60.280666,5.157123
68,male,French,Mr.,Nicolas,R,Lebrun,"95 Burton Avenue",Okoia,,Wanganui,4500,NZ,"New Zealand",,Wroing,iR2rahpaim2a,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","(027) 0336-972",64,Bondy,3/3/1964,55,Pisces,MasterCard,5167661122227231,094,5/2021,,"1Z 418 1A1 09 5878 510 1",4757008829,69457336,Blue,"Corporate accountant",Playworld,"1997 Citroen Rally Raid",,O+,209.4,95.2,"5' 10""",178,851fc065-0061-4754-b807-421e4242b5ba,-39.863379,174.967351
69,male,Slovenian,Mr.,"Ivan Martin",J,Bugarski,"Breivangvegen 38",TROMSØ,,,9010,NO,Norway,,Dercy1937,iey2Xoh8o,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","448 63 713",47,Riboli,11/14/1937,82,Scorpio,MasterCard,5364317183005716,388,12/2022,,"1Z 291 4Y2 69 2883 563 7",1469869904,37602111,Red,"Office clerk",Quickbiz,"2000 Chrysler Grand Voyager",,O-,174.2,79.2,"5' 10""",179,ca8717a7-1eba-4f93-9f1d-970b2fa1a45e,69.651262,18.958466
70,female,Czech,Ms.,Jarmila,M,Chloupková,"729 Albert St",Germiston,GA,Gauteng,1419,ZA,"South Africa",,Wountim81,yie8Cees,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","084 256 2607",27,Poláčková,2/5/1981,38,Aquarius,Visa,4716715994153559,960,3/2024,8102054742081,"1Z 454 A20 14 2101 291 2",8866810206,44986764,Red,"Industrial-organizational psychologist","Wells & Wade","1996 Mini MK VI",,B+,140.6,63.9,"5' 8""",172,940ec903-7fa0-421a-a1c5-7620bea7f7e0,-26.161314,28.133482
71,male,Italian,Dr.,Manlio,M,Capon,"Lützelflühstrasse 122",Wil,,,5300,CH,Switzerland,,Theyear,ieThuo1fei,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","062 380 34 69",41,Folliero,9/17/1947,72,Virgo,Visa,4929441842722544,930,4/2021,,"1Z 15W 644 50 5740 843 7",5183466840,07258789,Black,"Billing and posting clerk","House Of Denmark","1997 Oldsmobile Eighty-Eight",,B+,162.1,73.7,"5' 10""",179,0e92196a-3502-41c6-83bc-9a3b43c49317,47.510722,8.303196
72,female,"Japanese (Anglicized)",Ms.,Tomomi,Y,Nishiyama,"Rua do Arenque 1634",Goiânia,GO,Goiás,74343-040,BR,Brazil,,Mille1991,aeneThoh6x,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","(62) 9976-7986",55,Ishizaki,8/23/1991,28,Virgo,Visa,4716240309647674,486,9/2020,406.537.117-19,"1Z 01V 07A 25 5957 147 4",4689360184,38520562,Purple,"Oxy-gas cutter","Lechters Housewares","2011 Chevrolet HHR",,B+,213.2,96.9,"5' 2""",158,5141a9c2-cd45-4813-9cbe-12d630eefce4,-16.687195,-49.226261
73,female,Russian,Dr.,Esther,R,Kalinina,"Τρικάλων 248",ΛΕΥΚΩΣΙΑ,NI,Λευκωσία,1687,CY,"Cyprus (Greek)",,Hics1952,euG0Aiqu2,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","22 723018",357,,7/5/1952,67,Cancer,MasterCard,5563120249912803,542,11/2024,,"1Z 734 93Y 11 9330 585 6",6006786087,26151161,Blue,"Ambulatory care nurse","Waccamaw Pottery","1999 MCC Smart",,A+,133.5,60.7,"5' 7""",170,642dbeb6-defe-4c89-bc0a-5c64ae807dcb,41.266749,-72.834759
74,male,Icelandic,Mr.,Boði,L,Zóphoníasson,"Árpád fejedelem útja 3.",Budapest,BU,Budapest,1184,HU,Hungary,,Inart1990,nugheiZ0eig5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","(1) 941-2250",36,,6/12/1990,29,Gemini,MasterCard,5199467250444016,134,12/2020,,"1Z 563 9F0 30 4262 753 9",9294260413,82490560,Orange,"Mail processing machine operator","Bell Markets","1997 Lada Natacha",,B+,144.5,65.7,"5' 9""",174,a3912b2f-c2dc-4ff7-ab81-af1e047108c5,47.515035,19.146851
75,female,England/Wales,Mrs.,Grace,G,Boyle,"62 Mavrokordatou Street",Foinikaria,LI,Limassol,4530,CY,"Cyprus (Anglicized)",,Fortat81,cahBoot3eH,"Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","25 880993",357,Thompson,9/27/1981,38,Libra,MasterCard,5597431608392937,582,5/2021,,"1Z 791 957 61 6161 657 2",2596874442,13926446,Blue,"Forensic technician","id Boutiques","1993 Bristol Beaufighter",,AB+,181.7,82.6,"5' 8""",173,149314e4-80a7-493a-8ece-9b6f0890fd5d,41.321303,-72.986114
76,female,England/Wales,Ms.,Naomi,S,Ryan,"Λ. Μιχαλακοπούλου 160",ΕΓΚΩΜΗ,NI,Λευκωσία,2417,CY,"Cyprus (Greek)",,Fien1988,Gaitha4Ei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","22 498317",357,Clayton,11/21/1988,31,Scorpio,MasterCard,5283115033422307,714,12/2023,,"1Z 587 192 06 3768 866 1",7494134303,25065703,Blue,"Sewer pipe cleaner","Monk Real Estate Service","1995 BMW Dinan",,O+,205.0,93.2,"5' 2""",158,65219a9f-2c32-459a-98d2-7a7332e0f52f,41.385016,-72.962431
77,male,Polish,Mr.,Szymon,B,Walczak,"2347 Lauzon Parkway",Windsor,ON,Ontario,"N9A 7A2",CA,Canada,,Dintep,oog4aize7Ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",519-566-3375,1,Pawłowska,9/18/1958,61,Virgo,MasterCard,5342019646824967,284,11/2023,"727 633 539","1Z 883 8V5 80 2897 856 7",2176879008,20843357,Silver,Neurosonographer,"De Pinna","2009 Nissan Frontier",,B+,222.2,101.0,"5' 10""",179,b2e01ab2-c265-42eb-90b8-e26a6361eed4,42.423583,-82.942171
78,female,Hungarian,Ms.,Mercédesz,S,Szôllôssy,"Atamaria 86","Fornelos de Montes",PO,Pontevedra,36847,ES,Spain,,Musere,Eif1ce0ee,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36","641 572 459",34,Hofmann,1/15/1944,75,Capricorn,MasterCard,5535165808125664,767,3/2023,,"1Z 681 2E4 44 0770 552 8",5631978558,99957155,Red,"Case management aide","White Hen Pantry","2000 Noble M12",,O+,213.6,97.1,"5' 7""",169,a5f515e0-310f-47ab-bc0c-a71bac531a1e,42.263983,-8.431245
79,male,Norwegian,Mr.,Edgar,E,Andreassen,"Zistelweg 32",UNTERLAND,SZ,Salzburg,5661,AT,Austria,,Waakis2000,Iejeiz1oodei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0664 701 04 17",43,Dybvik,6/29/2000,19,Cancer,Visa,4716078791994463,776,11/2020,,"1Z 311 159 63 7486 723 7",5685893521,99981816,Black,"Speech pathologist",Peaches,"2001 Pontiac Grand Am",,A-,127.8,58.1,"5' 8""",172,15337913-059c-4dd1-9feb-5dc426abe8c7,47.203021,12.910163
80,male,Slovenian,Mr.,Šemsudin,M,Vrhovski,"ul. Dawida Jana 124",Wrocław,,,50-527,PL,Poland,,Gother,keeLaz9lee0,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.132","67 534 85 44",48,Pataki,10/13/1955,64,Libra,Visa,4716877835558592,363,3/2022,55101320290,"1Z E40 449 57 5657 736 1",2373733334,21925106,Blue,"ABE teacher","Integra Wealth Planners","2015 BMW X5 M",,B+,197.6,89.8,"5' 11""",180,5c11fc04-45f8-4989-801f-4102ff38d376,51.112923,17.027289
81,female,Finnish,Mrs.,Satu,A,Waltari,"2071 Maryland Avenue",Pinellas,FL,Florida,34624,US,"United States",,Stittair,jal6oNgoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:66.0) Gecko/20100101 Firefox/66.0",727-538-7059,1,Viitala,9/16/1995,24,Virgo,Visa,4556890465838575,158,12/2024,591-28-5104,"1Z 534 941 77 8508 193 2",5257097378,69898015,Yellow,"Soil scientist","White Hen Pantry","2003 Daihatsu Terios",,O+,141.0,64.1,"5' 5""",164,37de7e34-2624-444a-978d-b1b758fbc993,27.864456,-82.748032
82,male,German,Mr.,Matthias,S,Himmel,"Degnehøjvej 45",Silkeborg,MI,"Region Midtjylland",8600,DK,Denmark,,Barted,Jemu5poosoo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15",30-62-84-08,45,Trommler,8/23/1983,36,Virgo,MasterCard,5289947968601628,320,12/2024,230883-1143,"1Z 006 174 60 6563 087 1",3945260717,87205650,Blue,"Mental health social worker","Superior Appraisals","1998 Alpina B 12",,A+,143.4,65.2,"5' 10""",179,116742f3-f65e-45f1-a917-d13ad1db7bd4,56.199078,9.447827
83,female,Danish,Ms.,Mia,A,Frederiksen,"ul. Zuchów 65","Dąbrowa Górnicza",,,41-303,PL,Poland,,Fance1958,buY5faij,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0","53 459 91 54",48,Lauritsen,6/18/1958,61,Gemini,MasterCard,5454007072610160,208,4/2023,58061866242,"1Z 501 697 49 5209 014 8",9285893233,21439381,Blue,"Fine arts photographer","Coon Chicken Inn","2008 SSC Aero",,O+,181.1,82.3,"5' 2""",158,8969b475-9dce-4173-b060-32da08dbbf0d,50.417075,19.133549
84,male,Swedish,Mr.,Jesper,N,Lund,"Põllu 59",Kähu,VG,Valgamaa,68506,EE,Estonia,,Planstim,AeBeiNii0,"Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","763 2200",372,Lundgren,5/6/1938,81,Taurus,Visa,4539748641306150,938,3/2024,,"1Z 484 548 09 5749 331 5",6980674650,69979073,White,Photogrammetrist,"Expo Superstore","1997 Panoz AIV",,A+,178.0,80.9,"5' 10""",178,32b3dfb9-2eaa-4af2-a012-2a044f866550,57.915676,26.169326
85,male,Icelandic,Mr.,Guðgeir,S,Bergsveinsson,"Rue du Centre 320",Marke,VWV,"West Flanders",8510,BE,Belgium,,Ressen,phukieGae9c,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","0493 28 62 88",32,,4/2/1994,25,Aries,MasterCard,5432615789205137,688,11/2021,,"1Z 684 8A0 47 6831 298 9",0387412870,16497840,Orange,"Forging machine tender","Crafts & More","1994 Mitsubishi Sigma",,A+,165.9,75.4,"5' 8""",172,1016b5ad-56e3-4fb2-ba90-dc1f43694493,50.73779,3.22707
86,male,Icelandic,Mr.,Esjar,S,Sturluson,"Hauptstrasse 75","PUCH BEI HALLEIN",SZ,Salzburg,5412,AT,Austria,,Conetund,taeWeF2Eeph4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0699 465 17 25",43,,3/24/1970,49,Aries,MasterCard,5400451109415331,492,9/2021,,"1Z 807 0E7 80 7325 125 1",3810913760,52784081,Blue,"Dietetic technician","Endicott Johnson","2002 Smart ForFour",,B+,171.6,78.0,"6' 0""",182,b63ece12-4e7f-4348-bce8-1d8d5dc31dff,47.741555,13.137162
87,male,England/Wales,Mr.,Zak,M,Leonard,"27 Stroud Rd",OCHTERTYRE,,,"PH7 6LF",GB,"United Kingdom",,Surn1940,ohteeF5RaeM,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","078 3687 4061",44,Henry,4/4/1940,79,Aries,MasterCard,5336964062411492,178,11/2023,"KY 97 49 93 A","1Z 77F 43A 87 5585 068 2",5973298171,66698464,Blue,"Heat treating equipment tender","The Independent Planners","1996 Mitsubishi Verada",,O+,221.1,100.5,"5' 7""",171,d667dde7-082e-4480-98d9-5bdd383eb187,56.07981,-4.643057
88,female,"Chechen (Latin)",Mrs.,Ezinet,B,Umkhayev,"216 Karaiskaki Sq",Ineia,PA,Paphos,8704,CY,"Cyprus (Anglicized)",,Mothen1991,IeH2ceebae,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","97 696060",357,Masaev,3/27/1991,28,Aries,Visa,4929316208182816,160,4/2021,,"1Z 77V 114 81 0072 703 9",4908986837,97732729,Purple,"Quality assurance inspector","Chief Auto Parts","2005 Jaguar XKR",,B+,218.9,99.5,"5' 9""",175,7cb11d28-cbd3-49c2-be74-e6dcdca65cb4,41.270842,-72.883851
89,female,Russian,Mrs.,Lucia,V,Voronina,"75 Sale-Heyfield Road",KONGWAK,VIC,Victoria,3951,AU,Australia,,Riets1976,aY3ohbe8ai,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","(03) 5371 4059",61,,7/25/1976,43,Leo,MasterCard,5459182457974252,335,5/2024,,"1Z 618 731 57 5565 866 4",4199244189,20580381,Blue,"Eligibility interviewer","William Wanamaker & Sons","2012 Tata Indica",,O+,176.7,80.3,"5' 7""",169,bf8e68c3-2842-49da-8c4d-1ed4712a3852,-38.465215,145.830079
90,male,Slovenian,Mr.,Milorad,S,Musić,"Välja 61",Mustahamba,VR,Võrumaa,66258,EE,Estonia,,Entils,oan8Eiyoaz,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","789 0750",372,Flach,12/1/1968,51,Sagittarius,MasterCard,5211243470404849,924,2/2022,,"1Z F99 556 56 0740 270 3",0643392658,82292582,Blue,Housekeeper,"Cougar Investment","2000 Buick Rendezvous",,A+,244.4,111.1,"6' 1""",185,ea727fc4-6f40-412c-9574-b372a6aef26f,57.820634,26.970835
91,female,"Japanese (Anglicized)",Dr.,Chisaki,M,Fujimura,"1956 Uitsig St",Grahamstad,EC,"Eastern Cape",6139,ZA,"South Africa",,Rewhe1979,iayiQu9ahsie,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","082 875 2166",27,Wakabayashi,12/22/1979,40,Capricorn,Visa,4485068737963325,600,10/2020,7912221956187,"1Z 480 79V 08 4325 733 4",5381640942,65242007,Brown,"Extruding and drawing machine setters","Alert Alarm Company","2005 Porsche Cayenne",,O+,216.5,98.4,"5' 2""",157,b1a8327e-794e-4685-ab8b-43d20c08ed68,-33.370929,26.578978
92,female,Russian,Mrs.,Inessa,D,Samoylova,"Bachloh 60",WATZING,OO,"Upper Austria",4673,AT,Austria,,Rolong,moot9aQu2d,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","0664 396 27 04",43,,5/11/1956,63,Taurus,Visa,4556264124085418,064,5/2024,,"1Z 596 V55 76 1512 629 4",1323820614,72364512,Purple,Patternmaker,"Erb Lumber","2009 Honda CR-V",,O+,135.3,61.5,"5' 3""",161,13f411d7-0754-4233-a902-37a68ce4bb45,48.118598,13.654018
93,female,England/Wales,Ms.,Elise,C,Pearson,"215 Andrew Street",Monaco,,Nelson,7011,NZ,"New Zealand",,Norly1997,eeNgoes7aez,"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0","(027) 7329-039",64,Wilson,5/12/1997,22,Taurus,Visa,4929452838531450,902,4/2022,,"1Z 245 516 58 7073 695 7",9911299106,38307168,Blue,"Activity specialist","The Independent Planners","1995 Nissan President",,O+,162.1,73.7,"5' 9""",175,962cba25-bb3d-4b3b-8aca-e4689ee69dd5,-41.333626,173.307741
94,male,Norwegian,Mr.,Herman,A,Johansen,"1324 Mosman Rd","Alexander Bay",NC,"Northern Cape",8294,ZA,"South Africa",,Thinde,loaH6shiemoh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","083 779 9214",27,Smestad,7/29/1947,72,Leo,Visa,4929709070861949,812,9/2023,4707295736082,"1Z 2E1 Y32 51 4242 127 7",3751331085,25346448,Blue,"Cost estimator","Brown Derby","2014 Audi SQ5",,O+,182.8,83.1,"5' 10""",179,377c5af3-3ab2-405b-bc80-1ddd4f06ecca,-28.511777,16.410349
95,male,Russian,Mr.,Armen,D,Balabanov,"Bavorovská 788",Stachy,JC,"Jihoceský kraj","384 73",CZ,"Czech Republic",,Gery1975,Gaeg6uchoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","606 972 932",420,,1/11/1975,44,Capricorn,Visa,4532025806629404,529,1/2021,,"1Z 245 330 29 3529 731 4",5526134624,86584996,Orange,Rigger,"Sun Foods","1999 Isuzu VX-02",,A+,199.8,90.8,"5' 6""",168,ec3e15ac-78df-4844-a8e3-c6c491c6dd39,49.090909,13.642637
96,male,Russian,Mr.,Evdokim,Y,Bazarov,"Reykjarhóli 70",Fljót,,,570,IS,Iceland,,Deet1996,aiRubie9Poqu,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","413 4270",354,,10/5/1996,23,Libra,MasterCard,5413518495816218,546,9/2024,,"1Z 900 9A2 84 8206 503 2",0362518360,38724903,White,"Insurance investigator","Modern Realty","1996 ZAZ Wagon",,O+,212.1,96.4,"5' 8""",173,9112f339-d232-4fa7-a1f3-2e74571fd00a,66.154544,-17.801351
97,female,Hungarian,Mrs.,Agoti,B,Gyarmaty,"793 Buena Vista Avenue",Corvallis,OR,Oregon,97330,US,"United States",,Eage1963,pheiSha1aqu,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15",541-714-1388,1,Cseh,12/7/1963,56,Sagittarius,MasterCard,5328604768229802,989,8/2024,543-24-6755,"1Z 238 019 37 1904 563 8",2038177985,21602780,Blue,"Licensed clinical social worker","Circuit Design","2001 Suzuki Covie",,A+,140.8,64.0,"5' 4""",163,b1f4eed7-ff6b-4671-bf5d-a04c6a7b4beb,44.597298,-123.334112
98,male,England/Wales,Mr.,John,K,Carpenter,"Tavcarjeva 22",Senovo,,,8281,SI,Slovenia,,Foris1988,el6xoh7Qu,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",070-783-977,386,Wheeler,1/10/1988,31,Capricorn,MasterCard,5174982341006037,269,3/2020,,"1Z E95 341 79 8897 978 9",0309942560,54166967,Blue,"Marketing coordinator","Balanced Fortune","1992 Mazda AZ-1",,O+,226.4,102.9,"5' 8""",173,d2754fd9-f1c9-47cd-b6e1-7c8d5c0eec30,46.102339,15.464625
99,female,Hispanic,Mrs.,Maha,A,Cazares,"Reyes Católicos 75","Chiclana de la Frontera",CA,Cádiz,11130,ES,Spain,,Martrust57,Ohqu6achie,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","624 412 511",34,Méndez,9/25/1957,62,Libra,Visa,4485545084297530,898,6/2020,,"1Z 508 603 87 4474 636 9",1330099115,99991665,Blue,"Diesel train engineer","Sew-Fro Fabrics","2001 Alfa Romeo GTV",,A+,213.2,96.9,"5' 8""",172,d768e30b-a6de-4977-8bb8-0c8432acce44,36.447765,-6.204969
100,female,American,Mrs.,Patricia,J,Nevels,"72 Acheron Road",BUNDALAGUAH,VIC,Victoria,3851,AU,Australia,,Butimis1962,eekea5Thoo,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","(03) 5301 7984",61,Thomas,7/30/1962,57,Leo,Visa,4716717759727577,065,9/2024,,"1Z 449 366 30 8287 656 2",3560472157,65535722,Blue,"Identification clerk","PriceRite Warehouse Club","2005 Infiniti QX56",,A-,209.9,95.4,"5' 3""",161,6156ce11-c2a6-4266-bb4d-f47b06292e4e,-38.111213,147.271178
2 1 female Czech Mrs. Marie J Hamanová P.O. Box 255 Kangerlussuaq QE Qeqqata 3910 GL Greenland Wasco1982 eiZookooB7 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 84 23 30 299 Kubíková 3/29/1982 37 Aries MasterCard 5545634085461876 511 1/2020 1Z 789 686 82 8979 914 6 6945116246 34746079 Purple Surveillance officer Simple Solutions 1995 Zastava 65 O+ 217.6 98.9 5' 5" 164 6781b04d-7b5f-4c1a-bceb-b953e6ef70d7 77.377518 -67.015569
3 2 female French Ms. Patricia G Desrosiers Avenida Noruega 42 Vila Real VR Vila Real 5000-047 PT Portugal Fultses eb6soCha4ae Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 21 259 903 5696 351 Daviau 2/28/1956 63 Pisces MasterCard 5317250628844522 874 3/2022 1Z V38 747 73 7311 832 9 7398998399 18093674 Blue Vascular technologist Formula Gray 2006 Lexus GS O+ 118.1 53.7 5' 0" 152 2b2e7e1a-855f-4089-a570-c0af2381a6d6 41.274541 -7.876658
4 3 female American Ms. Debra O Neal 1659 Hoog St Brakpan GA Gauteng 1553 ZA South Africa Cognoy sha3Sohzee Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/ Yowser/2.5 Safari/537.36 082 490 1693 27 Barrett 6/11/1957 62 Gemini Visa 4916429195104076 315 5/2020 5706114632083 1Z 061 1E5 71 3400 427 4 6186449862 58702271 Blue Information architect librarian Dahlkemper's 1993 Honda Prelude A+ 120.1 54.6 5' 4" 162 2ef83f4c-3102-4f79-839d-c75bf6a06f0a -26.22096 28.283398
5 4 male French Mr. Peverell C Racine 183 Epimenidou Street Limassol LI Limassol 3041 CY Cyprus (Anglicized) Restlys Aekie7ohs Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 25 470375 357 Grondin 6/14/1962 57 Gemini Visa 4485421519226702 653 5/2023 1Z F44 91V 14 3570 491 2 0850016444 52534088 Blue Desk clerk Quickbiz 2008 Infiniti G35 B+ 142.1 64.6 5' 9" 174 bfb4be71-3710-4ffa-baaf-5af6aa4b339e 41.30296 -72.989066
6 5 female Slovenian Mrs. Iolanda S Tratnik Karu põik 61 Pärnu PR Pärnumaa 80098 EE Estonia Trely1962 jeiziejohH3ai Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 445 6271 372 Korbun 1/23/1962 57 Aquarius Visa 4532820383285186 893 4/2024 1Z 060 418 64 7516 574 4 1178606881 74806227 Purple Production assistant Dubrow's Cafeteria 2007 Fiat Idea O+ 141.5 64.3 5' 3" 160 0cbb7bf3-466f-4df6-bda3-9c9fe7bfc5c1 58.293395 24.434851
7 6 male Italian Mr. Domenico D Pisano Via Pisanelli 104 Traversara RA Ravenna 48020 IT Italy Hatelt lohhee8Zah Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36 0312 0828589 39 Conti 6/1/1979 40 Gemini Visa 4532872142737056 237 6/2023 WK48391724 1Z 175 1F5 29 1963 168 1 7448393148 31617424 Blue Professional scout Littler's 1998 Nissan Serena O+ 247.5 112.5 6' 0" 182 f4feeb24-e3b1-4d99-9c71-e8c6a95762fe 44.588081 12.055283
8 7 male Greenland Mr. Pavia A Rosing 29 Wattle St King William's Town EC Eastern Cape 5601 ZA South Africa Thattere aiCheed7tie Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/ Yowser/2.5 Safari/537.36 082 692 3461 27 Lennert 5/5/1937 82 Taurus Visa 4539980160229196 256 10/2020 3705057790082 1Z 507 770 52 3012 473 1 1256867146 65720899 Green Chemical engineer Pup 'N' Taco 2003 Peugeot Partner O- 192.1 87.3 6' 0" 182 b6b75cf9-dfbf-424d-a03c-90cdd859e9eb -32.787712 27.343649
9 8 female French Mrs. Ormazd M Jomphe Mattenstrasse 108 Sissach 4450 CH Switzerland Deace1999 oochui5Eboe5T Mozilla/5.0 (Windows NT 6.1; rv:66.0) Gecko/20100101 Firefox/66.0 061 947 83 90 41 Busson 1/14/1999 20 Capricorn Visa 4556603638439886 691 6/2024 1Z 091 192 83 9348 168 6 4380386435 24628087 Purple Clinical psychologist Linens 'n Things 1996 Plymouth Neon O+ 115.3 52.4 5' 1" 154 e5858bdb-9173-4991-9857-4e09b61e4e16 47.520557 7.863831
10 9 male Norwegian Mr. Severin L Akhtar 251 Charilaou Trikoupi Str. Pigenia NI Nicosia 2962 CY Cyprus (Anglicized) Heremer cieCipua8L Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 96 586625 357 Mathisen 4/30/1960 59 Taurus MasterCard 5230940651584482 785 10/2022 1Z W81 228 52 4912 032 1 8897778249 55031915 Green Pump operator Fragrant Flower Lawn Services 2005 Dodge Nitro B+ 155.3 70.6 6' 0" 182 64383596-6dc8-4b77-9476-c1a8ef23ffc6 41.335894 -72.908321
11 10 female Greenland Mrs. Margrethe H Kristiansen 94 boulevard Amiral Courbet ORLÉANS CE Centre 45100 FR France Theirturavid Aiv4ohwae Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 33 Berthelsen 12/13/1979 40 Sagittarius MasterCard 5306020150102745 915 12/2024 2791269679323 49 1Z 987 E42 01 7982 218 2 0231937615 18687876 Purple Systems software engineer Independent Wealth Management 2012 Porsche 911 A+ 200.2 91.0 5' 3" 159 ae7b1a56-0d6d-46ba-895e-75ee10482858 47.850047 1.875252
12 11 female Hispanic Mrs. Myrna G Feliciano Männi 12 Mustoja LV Lääne-Virumaa 45429 EE Estonia Wilthe84 Chu6shiRees Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 329 3803 372 Cortés 1/23/1984 35 Aquarius MasterCard 5402842306504596 788 7/2020 1Z 599 666 46 6430 018 3 8678427166 90687114 Blue Radiologic technician Monmax 2003 Mitsubishi Lancer O+ 212.7 96.7 5' 5" 164 4e8b5c6c-0b04-43c1-ad5b-2fc92365455c 59.638357 26.059683
13 12 male Czech Mr. Michal E Horký Algade 33 Guldborg SJ Region Sjælland 4862 DK Denmark Fiect1941 oxep7Aev Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0 28-64-27-85 45 Siváková 6/19/1941 78 Gemini MasterCard 5108116586316493 376 2/2021 190641-4941 1Z 85E W86 50 8027 647 6 4547979244 81431928 Orange Dental assistant Pointers 1995 Daihatsu Rocky B+ 165.0 75.0 5' 10" 177 a2aa0138-07d3-41e3-b3b7-455d9854e31f 54.815396 11.760822
14 13 male French Mr. Donat M Lespérance 96 rue de Penthièvre PONTOISE IL Île-de-France 95000 FR France Sirep1950 re8ZieK4 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36 33 Dodier 11/12/1950 69 Scorpio Visa 4556904288472270 512 9/2022 1501143313127 93 1Z 070 9Y5 64 7265 236 0 1770634719 11974448 Black Neurosonographer American Appliance 2009 Kia Cerato A+ 180.8 82.2 5' 7" 170 78b497ed-6d7d-4e5f-8150-1155abf9716e 48.977559 1.976986
15 14 female Japanese (Anglicized) Ms. Yuuka M Shimasaki Mjövattnet 1 NYLAND 870 52 SE Sweden Coun1976 ohleaT4ae Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 0613-9040212 46 Kimura 2/11/1976 43 Aquarius MasterCard 5457403889440023 903 5/2021 760211-4105 1Z 975 450 29 2316 562 4 8652144021 72590241 Blue Credit checker Elek-Tek 2006 Ford Territory A- 151.8 69.0 5' 5" 165 df1a2d57-31d8-4a71-8ad6-cb687ee250d4 62.773416 17.853904
16 15 male Swedish Mr. Wiktor H Ek Norðurbraut 27 Reykjavík 112 IS Iceland Boally aigo2OoPhoi Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 434 6815 354 Göransson 5/4/1945 74 Taurus Visa 4532063379896779 489 4/2024 1Z 278 965 48 6106 268 1 1158883056 79465382 White Legal secretary Handy Andy Home Improvement Center 2014 Jaguar XF O+ 227.0 103.2 5' 10" 178 d39f58d7-7bb7-4f77-9956-9b505f8a4cc8 64.187422 -21.93344
17 16 female Slovenian Ms. Polona H Ranković Õli 68 Himmiste PL Põlvamaa 64204 EE Estonia Whou1985 eeZae5oech Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 798 9719 372 Orehek 7/18/1985 34 Cancer MasterCard 5433975988486550 723 7/2022 1Z A77 0E9 59 0580 345 8 0984960146 75114603 Blue Personal trainer Budget Tapes & Records 1995 Lancia Delta A+ 160.2 72.8 5' 5" 165 c8e80b15-7ff4-4b21-bebf-5737d8133bdc 57.990467 27.141455
18 17 male Scottish Mr. Ivan M King Pachergasse 64 BÜSCHENDORF ST Styria 8786 AT Austria Anempon XooJoh0se5sh Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 0660 475 13 89 43 Watson 9/15/1994 25 Virgo MasterCard 5257015834586726 714 12/2020 1Z 37A 329 60 9892 939 6 1176881962 86098320 Green Placement counselor Opticomp 2000 Citroen C 15 A+ 130.2 59.2 5' 7" 170 58aec14a-fc35-4e19-a61f-b8f170a9ec7d 47.492881 14.377639
19 18 female Finnish Mrs. Nelma M Grönholm Rostsestraat 222 Froidchapelle WLX Luxembourg 6440 BE Belgium Obect1946 aeJ6OhneiF3t Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 0479 50 54 03 32 Pelkonen 9/15/1946 73 Virgo MasterCard 5208559023644291 187 12/2021 1Z 416 A35 89 5065 644 8 4484163657 51232002 Green Claims adjuster Starship Tapes & Records 2010 Land Rover Defender A+ 216.5 98.4 5' 7" 169 2f575438-1e23-4123-ab60-990725e8c08b 50.072755 4.344408
20 19 female Hungarian Ms. Tünde F Hoffmann Via Nazario Sauro 112 Cusano Milanino MI Milano 20095 IT Italy Preacces aob4eiteiL Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 0352 9353380 39 Bagi 11/1/1951 68 Scorpio MasterCard 5559688562190559 258 7/2023 DA75119938 1Z 959 98A 67 7929 896 3 2123939613 78515387 Orange Apparel worker Hugh M. Woods 2005 BMW X5 O+ 186.1 84.6 4' 11" 151 0a14766e-42f5-4943-ae10-884bcadf43f8 45.644863 9.128014
21 20 female Finnish Ms. Riitta N Hahl ul. Elbląska 97 Olsztyn 10-672 PL Poland Thill1954 Aiwar2ooh1 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 67 788 02 83 48 Linna 3/2/1954 65 Pisces MasterCard 5150663209146952 542 2/2021 54030268860 1Z 626 853 00 4461 590 4 4657047492 90091015 Purple Medical secretary Edwards 2012 Toyota Prius O+ 187.7 85.3 5' 7" 170 e7825dc5-9fdf-478c-ba72-ab25879699f1 53.768852 20.536572
22 21 male Dutch Mr. Harwin R Galesloot Glynitveien 218 SKI 1400 NO Norway Saffive wohHaix5fa Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 914 54 925 47 Ramakers 9/28/1994 25 Libra MasterCard 5580997921941047 961 4/2022 1Z 761 410 55 6702 728 6 9467674600 08790710 Blue Electric motor repairer Buena Vista Realty Service 2002 ZAZ Slavuta B+ 217.1 98.7 5' 6" 167 4029654d-5c8f-407e-b003-72bde9f593a1 59.814132 10.87172
23 22 male Hispanic Mr. Azarías A Segovia Via Francesco Girardi 49 Carmignano Di Brenta PD Padova 35010 IT Italy Portalime duteiVeev1 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 0394 9130281 39 Nava 12/24/1949 70 Capricorn MasterCard 5146736492498053 173 1/2022 SZ30479384 1Z 931 W28 91 2882 876 3 2747014433 22062098 Blue Administrative office manager Total Serve 2005 BMW 325 B+ 140.4 63.8 6' 0" 184 916cbe50-55b2-4a11-acf4-b8d8d9cc9668 45.432966 12.000719
24 23 male Hungarian Mr. Adelbert A Kuncz Turjaška 115 Rečica ob Savinji 3332 SI Slovenia Firseten woh3ejoo2Ei Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 031-365-314 386 Pethô 10/17/1976 43 Libra MasterCard 5431971800958886 663 4/2022 1Z 5V7 161 85 3863 932 0 7652480699 67841291 Blue Extruding, forming, pressing, and compacting machine setter Red Owl 2002 Citroen C-Airdream O+ 193.6 88.0 6' 1" 186 6653d52f-870f-4956-8b1a-c1463007f387 46.357704 14.932894
25 24 female England/Wales Ms. Charlie S Campbell Rue du Château 414 Limont WLG Liège 4357 BE Belgium Lithatinquir Kae4aetah Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 0489 33 97 26 32 Tucker 10/16/1987 32 Libra MasterCard 5268537479220623 072 1/2023 1Z 3V4 354 38 1677 342 8 0119239854 24858321 Purple Dry-cleaning worker Red Robin Stores 2004 Audi S4 O+ 119.5 54.3 5' 8" 173 f3a846e6-ec1c-4ccc-bf51-f66bb64ff559 50.616062 5.247283
26 25 female American Ms. Thelma K Mitchell Väike-Laagri 80 Orissaare SA Saaremaa 94691 EE Estonia Trind1979 eeThooy3ieph Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 455 6186 372 Rumbaugh 5/23/1979 40 Gemini MasterCard 5252527082023181 249 9/2024 1Z 001 306 31 4745 481 7 4024903278 57842179 Black Camera repairer Omni Superstore 2000 Ford Artic A+ 118.4 53.8 5' 6" 168 6e2490c7-4a17-4ce8-9a80-4b82551d3180 58.540748 23.059808
27 26 male Brazil Mr. Davi G Santos Rákóczi út 66. Barnag VE Veszprém 8291 HU Hungary Cumeneamord AeYeRie2doo Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 (88) 158-170 36 Goncalves 3/15/1974 45 Pisces Visa 4539342451489007 234 9/2021 1Z 598 735 11 9286 476 3 9550539865 84997161 Blue Clinical manager Star Merchant Services 2011 Alfa Romeo Giulietta A+ 156.9 71.3 5' 11" 180 95501fdd-a13f-43fd-981c-0dced4dd23e6 47.04597 17.745945
28 27 male England/Wales Mr. Jonathan A Conway Rua Doutor Afrânio Junqueira 1460 São Paulo SP São Paulo 04581-040 BR Brazil Bessed Egaez2Vuo Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko (11) 7113-8192 55 Humphries 1/20/1938 81 Aquarius Visa 4916317241919037 680 3/2023 308.271.618-08 1Z 02V 42E 21 2992 331 4 3494650897 26033304 Blue Electrical drafter Pro Yard Services 2006 Dodge Caravan B+ 170.9 77.7 6' 0" 183 0eb93cfe-1260-4caa-99d5-e0f31f73b1e9 -23.594567 -46.709971
29 28 male French Mr. Guy A Migneault 90 Petworth Rd DUNSINNAN PH2 5HL GB United Kingdom Enut1960 nahming4Oo Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 077 5138 5842 44 Labrecque 4/28/1960 59 Taurus MasterCard 5581802812256860 692 3/2022 ZT 01 87 75 1Z 96V 661 27 4962 061 4 8863556519 92044686 Blue Technical trainer Rogers Peet 1999 Fiat Siena B+ 176.7 80.3 5' 7" 171 eca54fd7-3576-426c-b4f9-b617a3662931 56.025747 -3.640577
30 29 male Hispanic Dr. Breogan J Orosco Ηλίου 64 ΛΑΡΝΑΚΑ LA Λάρνακα 6031 CY Cyprus (Greek) Priback Quoo9choo Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 97 687579 357 Ceballos 9/19/1957 62 Virgo MasterCard 5567650882732569 392 10/2022 1Z A69 196 36 1803 521 0 6314780611 79148788 Green Machinery maintenance mechanic Jack Lang 2010 Hyundai i30 A+ 155.1 70.5 5' 10" 179 8dda77bb-d75c-40ef-b1c0-6381d305b053 41.348113 -72.957665
31 30 female Czech Mrs. Jaroslava M Kindlová 22 Rue de Sidi Bou Zid Zouarine 33 Governorate Kef 7170 TN Tunisia Tepen1939 thahShee7 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763 78 427 062 216 Langová 11/28/1939 80 Sagittarius Visa 4532368231815457 583 12/2022 1Z 349 036 22 9992 262 9 5257019048 20372734 Black Copy editor Anthony's 2005 Nissan Altima B+ 108.7 49.4 5' 4" 163 94e73608-33dd-4f7a-a86d-9575aa2e63e6 37.219459 9.881268
32 31 male Croatian Mr. Stjepan A Perković Escuadro 26 Castelló de Rugat V Valencia 46841 ES Spain Spastry uLiH7iech3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 779 021 982 34 Topić 3/18/1978 41 Pisces Visa 4539449132248031 692 6/2023 1Z V92 2E8 34 9201 447 0 3497779171 87502347 Orange Geoscientist Monmax 2010 Chrysler PT Cruiser A+ 175.8 79.9 5' 10" 178 e49f858d-b218-4ce7-89ae-62599b3bb276 38.927331 -0.375157
33 32 male Croatian Mr. Stanko T Crnić Avda. Alameda Sundheim 46 Benasque HU Huesca 22440 ES Spain Waskents Piu4theeg1ae Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0 793 358 347 34 Jozić 9/13/1974 45 Virgo Visa 4929607743905830 463 2/2021 1Z 981 F24 67 9469 260 4 6183936244 94973462 Blue Studio camera operator Wealthy Ideas 2004 Isuzu Axiom B+ 185.2 84.2 6' 1" 186 6ee36d07-1397-43c8-86a4-455bce264fbc 42.623207 0.475571
34 33 female Russian Ms. Marianne I Zhdanova 18 Rue de bayrout Cite Badrani 61 Governorate Sfax 3083 TN Tunisia Thock1968 uRai6thoh Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 74 849 807 216 11/30/1968 51 Sagittarius MasterCard 5223193173825541 622 7/2020 1Z 60W 422 22 2116 147 5 6606006372 32841170 Blue Geographic information specialist Star Interior Design 1996 Dodge Caravan B+ 168.3 76.5 5' 7" 169 eff7c2c8-85f4-45dc-8a4d-60770f4eda42 35.319852 9.785865
35 34 female Hungarian Ms. Ferike G Jónás Brucker Bundesstrasse 31 FÜRLING NO Lower Austria 4152 AT Austria Ofigaill49 ni8ooy1Thee Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763 0681 563 12 72 43 Tolnay 4/12/1949 70 Aries Visa 4532628381402038 969 6/2021 1Z 099 5Y5 30 8995 126 3 7242712846 73346379 Red Computer systems administrator ABCO Foods 1999 Land Rover Discovery O+ 134.6 61.2 5' 5" 166 4687baf3-0486-4466-a63b-d0dff8903249 48.633035 13.984742
36 35 female Czech Ms. Lenka T Mizerová Rhinstrasse 91 München BY Freistaat Bayern 80975 DE Germany Daudgessed keer4Ceej9j Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 089 60 25 65 49 Brožová 1/14/1944 75 Capricorn Visa 4916250570933685 939 6/2023 1Z 580 A90 83 6948 520 2 9291535188 57982233 Red Maid AdventureSports! 2000 Toyota Sienna O+ 184.1 83.7 5' 1" 154 efb875ed-74dc-4ea3-b155-4a48cb61d187 48.189892 11.502201
37 36 female Icelandic Mrs. Eyþóra P Runólfsdóttir Ditscheinergasse 80 HUNDSHAGEN OO Upper Austria 4773 AT Austria Expregiat Beph8ieX Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0 0699 830 58 07 43 11/22/1960 59 Sagittarius MasterCard 5426319483823638 482 11/2024 1Z 448 19V 41 8418 729 7 2455440003 04422500 Purple Building cleaning worker Matrix Architectural Service 2010 BMW 650 O- 130.7 59.4 5' 6" 168 02a29393-6fdc-4120-adf0-568906c8c111 48.266115 13.568714
38 37 female French Mrs. Rive T Lépicier 144 Souniou Ave. Menogeia LA Larnaca 7578 CY Cyprus (Anglicized) Carray ieJij3no Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0 24 102884 357 Lanteigne 4/15/1983 36 Aries MasterCard 5494902843118711 376 8/2024 1Z W04 891 69 0373 898 8 7751332568 91800864 Yellow Payroll and benefits specialist Lechters Housewares 2006 Nissan Pathfinder A+ 119.7 54.4 5' 4" 163 17afbc1b-4d95-4583-8602-9680b4fd7c5c 41.311392 -72.829123
39 38 male Danish Mr. Marcus O Paulsen Plattenstrasse 57 Räterschen 8352 CH Switzerland Mesee1943 Fi7eiva8Ah Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 044 347 47 26 41 Simonsen 10/8/1943 76 Libra MasterCard 5378357034830932 713 8/2021 1Z 387 E31 19 5962 225 9 2639724434 59408085 Black Management development specialist Parts and Pieces 2006 BMW M3 B+ 207.5 94.3 5' 10" 177 43ae4a6b-e1ca-4d5e-b6cd-3732697c9c71 47.488972 8.868299
40 39 female Chechen (Latin) Mrs. Zeliha I Sultygov Rookopli 96 Uralaane VG Valgamaa 68712 EE Estonia Dary1953 nohgief2A Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0 763 5734 372 Desheriyev 7/27/1953 66 Leo Visa 4539696753097085 200 11/2021 1Z E17 641 44 8337 404 1 5174706823 32841274 Red Mental health assistant Reliable Investments 2000 Opel Signum O+ 119.9 54.5 5' 3" 160 68c2ef2c-9990-41bc-be94-afa07e6e2379 58.070181 26.064252
41 40 female Russian Mrs. Ilona B Pirogova Αγ. Ανδρέα 130 ΒΑΣΑ ΚΟΙΛΑΝΙΟΥ LI Λεμεσός 4771 CY Cyprus (Greek) Hatiere Ahsha4Ai Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 25 750307 357 12/5/1970 49 Sagittarius Visa 4539731441846112 542 4/2024 1Z 008 309 69 5575 108 1 3971097056 06463075 Purple Land acquisition manager Wickes Furniture 1992 Ford Taurus O- 135.1 61.4 5' 1" 156 0c5899a0-ce9c-43a7-aba1-f279893620f9 41.295272 -72.961282
42 41 female Croatian Ms. Aleksandra K Petković Binzmühlestrasse 30 San Bernardino 6565 CH Switzerland Ramessanies1994 vie2quai7Ie8 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36 091 808 37 22 41 Bašić 6/8/1994 25 Gemini MasterCard 5560110586075796 991 3/2023 1Z 11E 310 61 4037 919 3 0620594797 44658507 Purple Personal banker Ejecta 2005 Kia Amanti A+ 110.2 50.1 5' 4" 163 cc560302-0d00-410c-9629-2e68bb4ef864 46.506804 9.159102
43 42 male American Mr. Rogelio A Patrick Πλ Καραισκάκη 128 ΑΓΙΟΣ ΘΕΟ∆ΩΡΟΣ ΣΟΛΕΑΣ NI Λευκωσία 2823 CY Cyprus (Greek) Whortin1952 Yu7mah2z Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 95 432561 357 Thacker 10/25/1952 67 Scorpio Visa 4716135824324942 242 9/2021 1Z 14A 327 32 1181 648 4 9342714384 12993510 Blue Typesetting machine tender Vibrant Man 2001 Toyota MR2 B+ 170.3 77.4 5' 11" 180 76a566cc-be59-4327-862e-312da09e0c42 41.353523 -72.965839
44 43 female American Mrs. Evelyn R Tucker Kringlan 66 Reykjavík 107 IS Iceland Arresplet deiT0ahyu Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36 450 3756 354 Burton 9/25/1986 33 Libra MasterCard 5592761939548814 873 9/2022 1Z 263 919 45 6552 555 7 5057305433 86266508 Purple Aquaculture farmer Weatherill's 2004 Mitsubishi Galant O+ 105.8 48.1 5' 3" 159 716b5321-34bf-4514-8bca-fce5c482d8c3 64.159592 -21.928397
45 44 male Icelandic Mr. Þorkell H Hallbjörnsson Školní 296 Kaplice 1 JC Jihoceský kraj 382 41 CZ Czech Republic Dessesid eeXahew1ui Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 772 616 930 420 12/19/1957 62 Sagittarius MasterCard 5534480983249093 443 6/2023 1Z 34E 320 47 9554 749 9 1825424609 81224507 Blue Financial aid director Grand Union 2004 Ford Explorer A- 236.9 107.7 6' 1" 186 393472fc-3454-4ba8-af3a-e1f7f626cfee 48.691433 14.516696
46 45 male Greenland Mr. Jan H Geisler Bayerhamerstrasse 79 GLAUBENDORF NO Lower Austria 3704 AT Austria Subjecould eepooz6U Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko 0699 456 17 84 43 Lange 10/25/2000 19 Scorpio MasterCard 5305776196130476 904 6/2020 1Z 084 34A 51 9322 259 5 7785136902 52035400 Blue Fire prevention specialist Mikrotechnic 2005 Bizzarrini BZ-2001 B+ 203.1 92.3 5' 9" 175 92630214-ba49-47e4-8717-93e4bba4262e 48.560667 15.917015
47 46 female Norwegian Mrs. Caroline M Landmark Via Tasso 21 Perugia PG Perugia 06122 IT Italy Sweves ooNee0iechoh Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 0378 8718408 39 Benjaminsen 9/26/1975 44 Libra MasterCard 5248190326919222 805 6/2024 PR74491787 1Z 2Y4 773 67 4365 263 8 7709648575 99170582 Green Allopathic physician Castro Convertibles 2012 Dodge Durango A- 187.7 85.3 5' 6" 168 dde3a962-10d8-4092-b9df-6da79f89f383 43.072973 12.459411
48 47 female Swedish Ms. Lena M Andersson Parkring 7 STEINPARZ OO Upper Austria 4730 AT Austria Freen1978 aeVaiHohy7 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0 0650 858 08 11 43 Holm 10/24/1978 41 Scorpio MasterCard 5451268671996177 795 8/2023 1Z 683 821 70 2253 409 0 7986278354 59684998 Blue Chemical technician Wholesale Club, Inc. 2007 Kia Carnival AB+ 180.2 81.9 5' 1" 154 8d7f4c08-ee33-4024-9474-0beed004df45 48.234026 13.824536
49 48 male Danish Mr. Elias A Jepsen Βασιλέως Αλεξάνδρου 195 ΦΑΡΜΑΚΑΣ NI Λευκωσία 2620 CY Cyprus (Greek) Thenim uki0Zae7l Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 96 228011 357 Olsen 9/10/1967 52 Virgo Visa 4929867576889614 699 4/2023 1Z 348 2Y4 34 0337 091 2 3256227540 51352382 Blue Court, municipal, and license clerk Golden's Distributors 2011 Volvo XC70 A+ 214.5 97.5 5' 9" 174 f06e06a7-59b0-4052-9928-92df476d7753 41.326352 -72.962624
50 49 male French Mr. Honoré N Beaudouin 13 Faubourg Saint Honoré PAU IL Île-de-France 64000 FR France Slise1955 Zei7phaeSuutu Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 33 Daoust 11/2/1955 64 Scorpio Visa 4539798879618651 321 11/2020 1551124428495 35 1Z 424 792 46 6757 249 6 9796303410 07585755 Blue Dairy scientist York Steak House 1999 GAZ 3111 O+ 236.1 107.3 5' 9" 175 00a9f1f4-bec6-4dda-ac22-97e2860b1662 43.241847 -0.41343
51 50 male American Mr. Richard K Martinez Via Zannoni 49 Tiarno Di Sopra TN Trento 38060 IT Italy Himmest Feitee9ien Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36 0328 4921229 39 Dickenson 9/2/1985 34 Virgo MasterCard 5258225275434802 882 2/2023 PJ53725800 1Z 4V9 5Y7 22 3010 519 9 5935776326 45852443 Blue Logistician Macroserve 1995 Fiat Bravo O+ 205.9 93.6 5' 9" 176 fbe7a3e7-ace4-4495-b6e6-0c2cb4abcc88 45.964528 10.759331
52 51 male Chechen (Latin) Mr. Salambek T Melikov 1678 Dorp St Claremont WC Western Cape 7740 ZA South Africa Brint1956 ehahCh1xai Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 083 792 9726 27 Gairbekov 7/13/1956 63 Cancer MasterCard 5284109963636027 465 12/2023 5607139788084 1Z 558 445 48 7417 922 3 1631262867 63160277 Green Telephone operator Kinney Shoes 1998 Chevrolet Trans Sport O+ 211.9 96.3 5' 8" 172 b1628f36-9fdd-487d-8297-a35ee6d72ebf -33.89536 18.479041
53 52 female Greenland Mrs. Mette K Olsen ul. Karpacka 69 Bydgoszcz 85-164 PL Poland Liffew queiy6ooGh Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 88 165 40 96 48 Jeremiassen 1/30/1935 84 Aquarius MasterCard 5353410735290150 175 3/2023 35013096720 1Z 129 156 25 6468 002 5 2379843087 66415820 Yellow Lawyer MagnaSolution 2005 Peugeot 107 O- 117.0 53.2 5' 0" 152 01d98231-8e58-4e17-9c9a-bc5aa388928a 53.068638 18.093529
54 53 male Russian Mr. Spartacus N Ignatieff Bahnhofstrasse 57 Glovelier 2855 CH Switzerland Imeting1968 yi2Eep8gieh Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 032 803 90 31 41 6/16/1968 51 Gemini MasterCard 5399586162423418 375 3/2024 1Z 632 482 77 2949 860 1 2465813266 58937027 Blue Sales worker supervisor Schweggmanns 2008 Mazda 5 O+ 152.9 69.5 5' 8" 173 6365915b-b99c-426c-ae8e-698f846d3f03 47.232383 7.22689
55 54 male Brazil Mr. Kauã S Cardoso P.O. Box 194 Upernavik QA Qaasuitsup 3962 GL Greenland Searlitnot caiT4reN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 96 63 21 299 Castro 3/9/2000 19 Pisces MasterCard 5301206781704786 476 5/2024 1Z 346 5Y6 16 7677 108 7 7407791874 63200812 Blue Press secretary Dave Cooks 2002 Hyundai Elantra A+ 155.5 70.7 5' 10" 179 081371c1-479e-4055-95af-3110e72fc11a 72.786922 -56.131948
56 55 female Brazil Ms. Fernanda P Cavalcanti Via degli Aldobrandeschi 3 Jelsi CB Campobasso 86015 IT Italy Knour1941 ahChohqu4 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 0327 9982793 39 Souza 11/30/1941 78 Sagittarius Visa 4929971103746071 969 10/2023 CI48765311 1Z 223 435 25 6742 103 3 6973121025 59054247 Blue Budget analyst Hughes & Hatcher 2001 Volkswagen Lupo B+ 106.3 48.3 5' 5" 165 c75dc0e6-fe45-431f-8907-6e58db479a3d 41.444028 14.707643
57 56 female Hungarian Ms. Mónika Z Göröncsér Nábřežní 243 Spálené Porící PL Plzenský kraj 335 61 CZ Czech Republic Thenetiong quohwae5Quoh Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 376 147 284 420 Szôts 6/11/1946 73 Gemini Visa 4916007260260864 417 6/2024 1Z A43 364 39 5822 708 0 8027400869 93626602 Black Precision printing worker Plunkett Home Furnishings 2001 Bugatti EB 118 B+ 146.3 66.5 5' 4" 162 13189ec1-db42-4f8c-b74e-e7a45d33a237 49.629 13.606864
58 57 female Czech Mrs. Zuzana M Kozáková Via del Pontiere 101 Birgi Aerostazione TP Trapani 91020 IT Italy Dinectich mai5eiXaexai Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 0391 7843193 39 Minarčíková 1/21/1953 66 Aquarius MasterCard 5164065137771907 924 5/2020 AT69882067 1Z 736 067 39 5591 664 0 0937144872 77392590 Red Industrial engineering technician Romp 1998 Isuzu VX-02 A+ 123.0 55.9 5' 3" 161 c414c613-db2b-4f1a-8bf1-a91e518afb85 37.610445 12.42306
59 58 female Russian Mrs. Eugene R Bykova Via Goffredo Mameli 149 Poggiovalle Di Borgorose RI Rieti 02020 IT Italy Ingentersed1943 aeN6eenul5 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134 0366 9434948 39 6/29/1943 76 Cancer Visa 4556039653807659 972 5/2022 KX92160250 1Z 975 21Y 29 8927 431 6 6861257852 04061254 Brown Executive secretary Kinney Shoes 1994 Plymouth Voyager A+ 195.8 89.0 4' 11" 150 7eb2e374-3e73-41d3-95b6-f9bb93993872 42.079608 12.989079
60 59 female Icelandic Ms. Nanna S Hallmundsdóttir 85 Gimblett Street Richmond Invercargill 9810 NZ New Zealand Wifflife1964 Deez4ooGi0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 (022) 7929-466 64 4/15/1964 55 Aries MasterCard 5504023231705718 236 9/2020 1Z 668 366 39 7132 126 6 7750395260 67884791 Black Photographic process worker Megatronic 1999 Dodge Avenger O+ 170.1 77.3 5' 0" 153 05974695-4733-4bd3-b1ac-bffb56992160 -46.301185 168.423853
61 60 female Polish Ms. Halina C Zielinska 1 Gloucester Road CLACHANDHU PA68 7QD GB United Kingdom Corsome74 Vaem2keeV6 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36 077 5149 6820 44 Zielinska 8/7/1974 45 Leo Visa 4539422686560762 956 3/2022 TB 10 69 23 1Z 189 A93 08 3744 015 7 0022083989 01394977 Blue Surgical technician Monit 2011 Renault Grand Scenic AB+ 208.3 94.7 5' 7" 171 44fd6c97-052a-42a8-b2d2-fb8b3fc70ba8 55.954988 -5.872046
62 61 female Japanese (Anglicized) Ms. Hatsuho K Yoneda Dalmatinova 35 Žabnica 4209 SI Slovenia Therl1988 ahteel2maeSh Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134 051-632-354 386 Mikami 9/17/1988 31 Virgo MasterCard 5104564324299139 201 4/2021 1Z 486 074 10 6995 124 2 8337173450 63167043 Purple Treasurer Practi-Plan 2003 MG ZT O+ 134.4 61.1 5' 8" 173 e51aa8e2-9fc6-43e7-8488-168ca87b75d7 46.108258 14.328302
63 62 male Croatian Mr. Gojislav V Jukić Bahnhofstrasse 96 Gorgier 2023 CH Switzerland Tinguen GaiXa3ai Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 032 304 33 66 41 Pavić 9/21/1999 20 Virgo MasterCard 5165115278847765 074 12/2022 1Z 385 466 21 1758 512 0 6790588061 83119520 Red Extractive metallurgical engineer Value Giant 2011 Lexus LFA B+ 155.5 70.7 6' 1" 186 0165faab-d065-4a58-a543-057c540a6863 46.955216 6.830252
64 63 female England/Wales Mrs. Megan T Swift Postbox 23 Maniitsoq QE Qeqqata 3912 GL Greenland Thersevere ieGh5huoK6 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0 81 32 04 299 Poole 4/30/1964 55 Taurus Visa 4929574688538812 336 4/2024 1Z E38 W63 04 0063 263 0 6213800707 38860322 Green Support service manager Little Folk Shops 2007 Chevrolet Optra B+ 168.7 76.7 5' 4" 163 394b47d6-e3cb-4869-b4b5-c42a81ef9b00 65.395922 -52.878832
65 64 male Slovenian Mr. Milan Franc S Košelnik Na Výsluní 272 Primda PL Plzenský kraj 348 06 CZ Czech Republic Lospay67 queeN0ies Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 733 162 319 420 Mankoč 1/14/1967 52 Capricorn MasterCard 5453102419958199 879 12/2021 1Z A16 354 25 3838 551 9 0820470346 05905803 Brown Clinical laboratory technologist Singer Lumber 2003 Holden UTE A+ 214.5 97.5 5' 9" 175 7234b959-896e-4335-8745-c5716b1c7638 49.619319 12.730847
66 65 female German Ms. Johanna P Maurer Kaisergasse 64 KURZENKIRCHEN OO Upper Austria 4770 AT Austria Reptaked1981 ohwae5Tee Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 0688 992 50 51 43 Schuhmacher 7/12/1981 38 Cancer Visa 4916529470563530 481 4/2020 1Z Y23 375 24 5962 121 9 0306589986 74875435 Purple Elevator repairer Solution Answers 2008 Rover Streetwise AB+ 186.6 84.8 5' 5" 164 30ce10a0-2ea3-4e24-871a-f72cb3beedfc 48.439595 13.547802
67 66 male Norwegian Mr. Teodor K Aune Gl. Sygehusvej 153 Narsaq KU Kujalleq 3921 GL Greenland Dickent ooph5leiG Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 66 28 45 299 Arntzen 7/18/1998 21 Cancer MasterCard 5296939640241254 624 2/2024 1Z 216 Y97 87 6791 863 9 0054598944 41175224 Blue Private household cook Builders Emporium 2008 Renault Laguna O+ 239.1 108.7 5' 9" 174 0562be22-b239-4dbc-a0f7-d64679ae153f 60.827346 -46.022413
68 67 male Dutch Mr. Abderrahman I Kempers Hjellestadnipen 66 HJELLESTAD 5259 NO Norway Ancery eeC4tien9 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 460 73 493 47 Cuperus 7/17/1977 42 Cancer MasterCard 5358951722758050 004 9/2024 1Z Y78 20V 88 9025 826 2 1986315622 50825375 Blue Time clerk Carrols Restaurant Group 2008 Dacia Sandero O+ 182.4 82.9 5' 6" 168 49c9f2d1-5e74-4ef9-8c9c-c5e6338e1f6d 60.280666 5.157123
69 68 male French Mr. Nicolas R Lebrun 95 Burton Avenue Okoia Wanganui 4500 NZ New Zealand Wroing iR2rahpaim2a Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 (027) 0336-972 64 Bondy 3/3/1964 55 Pisces MasterCard 5167661122227231 094 5/2021 1Z 418 1A1 09 5878 510 1 4757008829 69457336 Blue Corporate accountant Playworld 1997 Citroen Rally Raid O+ 209.4 95.2 5' 10" 178 851fc065-0061-4754-b807-421e4242b5ba -39.863379 174.967351
70 69 male Slovenian Mr. Ivan Martin J Bugarski Breivangvegen 38 TROMSØ 9010 NO Norway Dercy1937 iey2Xoh8o Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 448 63 713 47 Riboli 11/14/1937 82 Scorpio MasterCard 5364317183005716 388 12/2022 1Z 291 4Y2 69 2883 563 7 1469869904 37602111 Red Office clerk Quickbiz 2000 Chrysler Grand Voyager O- 174.2 79.2 5' 10" 179 ca8717a7-1eba-4f93-9f1d-970b2fa1a45e 69.651262 18.958466
71 70 female Czech Ms. Jarmila M Chloupková 729 Albert St Germiston GA Gauteng 1419 ZA South Africa Wountim81 yie8Cees Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 084 256 2607 27 Poláčková 2/5/1981 38 Aquarius Visa 4716715994153559 960 3/2024 8102054742081 1Z 454 A20 14 2101 291 2 8866810206 44986764 Red Industrial-organizational psychologist Wells & Wade 1996 Mini MK VI B+ 140.6 63.9 5' 8" 172 940ec903-7fa0-421a-a1c5-7620bea7f7e0 -26.161314 28.133482
72 71 male Italian Dr. Manlio M Capon Lützelflühstrasse 122 Wil 5300 CH Switzerland Theyear ieThuo1fei Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 062 380 34 69 41 Folliero 9/17/1947 72 Virgo Visa 4929441842722544 930 4/2021 1Z 15W 644 50 5740 843 7 5183466840 07258789 Black Billing and posting clerk House Of Denmark 1997 Oldsmobile Eighty-Eight B+ 162.1 73.7 5' 10" 179 0e92196a-3502-41c6-83bc-9a3b43c49317 47.510722 8.303196
73 72 female Japanese (Anglicized) Ms. Tomomi Y Nishiyama Rua do Arenque 1634 Goiânia GO Goiás 74343-040 BR Brazil Mille1991 aeneThoh6x Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 (62) 9976-7986 55 Ishizaki 8/23/1991 28 Virgo Visa 4716240309647674 486 9/2020 406.537.117-19 1Z 01V 07A 25 5957 147 4 4689360184 38520562 Purple Oxy-gas cutter Lechters Housewares 2011 Chevrolet HHR B+ 213.2 96.9 5' 2" 158 5141a9c2-cd45-4813-9cbe-12d630eefce4 -16.687195 -49.226261
74 73 female Russian Dr. Esther R Kalinina Τρικάλων 248 ΛΕΥΚΩΣΙΑ NI Λευκωσία 1687 CY Cyprus (Greek) Hics1952 euG0Aiqu2 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 22 723018 357 7/5/1952 67 Cancer MasterCard 5563120249912803 542 11/2024 1Z 734 93Y 11 9330 585 6 6006786087 26151161 Blue Ambulatory care nurse Waccamaw Pottery 1999 MCC Smart A+ 133.5 60.7 5' 7" 170 642dbeb6-defe-4c89-bc0a-5c64ae807dcb 41.266749 -72.834759
75 74 male Icelandic Mr. Boði L Zóphoníasson Árpád fejedelem útja 3. Budapest BU Budapest 1184 HU Hungary Inart1990 nugheiZ0eig5 Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 (1) 941-2250 36 6/12/1990 29 Gemini MasterCard 5199467250444016 134 12/2020 1Z 563 9F0 30 4262 753 9 9294260413 82490560 Orange Mail processing machine operator Bell Markets 1997 Lada Natacha B+ 144.5 65.7 5' 9" 174 a3912b2f-c2dc-4ff7-ab81-af1e047108c5 47.515035 19.146851
76 75 female England/Wales Mrs. Grace G Boyle 62 Mavrokordatou Street Foinikaria LI Limassol 4530 CY Cyprus (Anglicized) Fortat81 cahBoot3eH Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0 25 880993 357 Thompson 9/27/1981 38 Libra MasterCard 5597431608392937 582 5/2021 1Z 791 957 61 6161 657 2 2596874442 13926446 Blue Forensic technician id Boutiques 1993 Bristol Beaufighter AB+ 181.7 82.6 5' 8" 173 149314e4-80a7-493a-8ece-9b6f0890fd5d 41.321303 -72.986114
77 76 female England/Wales Ms. Naomi S Ryan Λ. Μιχαλακοπούλου 160 ΕΓΚΩΜΗ NI Λευκωσία 2417 CY Cyprus (Greek) Fien1988 Gaitha4Ei Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 22 498317 357 Clayton 11/21/1988 31 Scorpio MasterCard 5283115033422307 714 12/2023 1Z 587 192 06 3768 866 1 7494134303 25065703 Blue Sewer pipe cleaner Monk Real Estate Service 1995 BMW Dinan O+ 205.0 93.2 5' 2" 158 65219a9f-2c32-459a-98d2-7a7332e0f52f 41.385016 -72.962431
78 77 male Polish Mr. Szymon B Walczak 2347 Lauzon Parkway Windsor ON Ontario N9A 7A2 CA Canada Dintep oog4aize7Ai Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 519-566-3375 1 Pawłowska 9/18/1958 61 Virgo MasterCard 5342019646824967 284 11/2023 727 633 539 1Z 883 8V5 80 2897 856 7 2176879008 20843357 Silver Neurosonographer De Pinna 2009 Nissan Frontier B+ 222.2 101.0 5' 10" 179 b2e01ab2-c265-42eb-90b8-e26a6361eed4 42.423583 -82.942171
79 78 female Hungarian Ms. Mercédesz S Szôllôssy Atamaria 86 Fornelos de Montes PO Pontevedra 36847 ES Spain Musere Eif1ce0ee Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36 641 572 459 34 Hofmann 1/15/1944 75 Capricorn MasterCard 5535165808125664 767 3/2023 1Z 681 2E4 44 0770 552 8 5631978558 99957155 Red Case management aide White Hen Pantry 2000 Noble M12 O+ 213.6 97.1 5' 7" 169 a5f515e0-310f-47ab-bc0c-a71bac531a1e 42.263983 -8.431245
80 79 male Norwegian Mr. Edgar E Andreassen Zistelweg 32 UNTERLAND SZ Salzburg 5661 AT Austria Waakis2000 Iejeiz1oodei Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 0664 701 04 17 43 Dybvik 6/29/2000 19 Cancer Visa 4716078791994463 776 11/2020 1Z 311 159 63 7486 723 7 5685893521 99981816 Black Speech pathologist Peaches 2001 Pontiac Grand Am A- 127.8 58.1 5' 8" 172 15337913-059c-4dd1-9feb-5dc426abe8c7 47.203021 12.910163
81 80 male Slovenian Mr. Šemsudin M Vrhovski ul. Dawida Jana 124 Wrocław 50-527 PL Poland Gother keeLaz9lee0 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.132 67 534 85 44 48 Pataki 10/13/1955 64 Libra Visa 4716877835558592 363 3/2022 55101320290 1Z E40 449 57 5657 736 1 2373733334 21925106 Blue ABE teacher Integra Wealth Planners 2015 BMW X5 M B+ 197.6 89.8 5' 11" 180 5c11fc04-45f8-4989-801f-4102ff38d376 51.112923 17.027289
82 81 female Finnish Mrs. Satu A Waltari 2071 Maryland Avenue Pinellas FL Florida 34624 US United States Stittair jal6oNgoh Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:66.0) Gecko/20100101 Firefox/66.0 727-538-7059 1 Viitala 9/16/1995 24 Virgo Visa 4556890465838575 158 12/2024 591-28-5104 1Z 534 941 77 8508 193 2 5257097378 69898015 Yellow Soil scientist White Hen Pantry 2003 Daihatsu Terios O+ 141.0 64.1 5' 5" 164 37de7e34-2624-444a-978d-b1b758fbc993 27.864456 -82.748032
83 82 male German Mr. Matthias S Himmel Degnehøjvej 45 Silkeborg MI Region Midtjylland 8600 DK Denmark Barted Jemu5poosoo Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 30-62-84-08 45 Trommler 8/23/1983 36 Virgo MasterCard 5289947968601628 320 12/2024 230883-1143 1Z 006 174 60 6563 087 1 3945260717 87205650 Blue Mental health social worker Superior Appraisals 1998 Alpina B 12 A+ 143.4 65.2 5' 10" 179 116742f3-f65e-45f1-a917-d13ad1db7bd4 56.199078 9.447827
84 83 female Danish Ms. Mia A Frederiksen ul. Zuchów 65 Dąbrowa Górnicza 41-303 PL Poland Fance1958 buY5faij Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 53 459 91 54 48 Lauritsen 6/18/1958 61 Gemini MasterCard 5454007072610160 208 4/2023 58061866242 1Z 501 697 49 5209 014 8 9285893233 21439381 Blue Fine arts photographer Coon Chicken Inn 2008 SSC Aero O+ 181.1 82.3 5' 2" 158 8969b475-9dce-4173-b060-32da08dbbf0d 50.417075 19.133549
85 84 male Swedish Mr. Jesper N Lund Põllu 59 Kähu VG Valgamaa 68506 EE Estonia Planstim AeBeiNii0 Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0 763 2200 372 Lundgren 5/6/1938 81 Taurus Visa 4539748641306150 938 3/2024 1Z 484 548 09 5749 331 5 6980674650 69979073 White Photogrammetrist Expo Superstore 1997 Panoz AIV A+ 178.0 80.9 5' 10" 178 32b3dfb9-2eaa-4af2-a012-2a044f866550 57.915676 26.169326
86 85 male Icelandic Mr. Guðgeir S Bergsveinsson Rue du Centre 320 Marke VWV West Flanders 8510 BE Belgium Ressen phukieGae9c Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 0493 28 62 88 32 4/2/1994 25 Aries MasterCard 5432615789205137 688 11/2021 1Z 684 8A0 47 6831 298 9 0387412870 16497840 Orange Forging machine tender Crafts & More 1994 Mitsubishi Sigma A+ 165.9 75.4 5' 8" 172 1016b5ad-56e3-4fb2-ba90-dc1f43694493 50.73779 3.22707
87 86 male Icelandic Mr. Esjar S Sturluson Hauptstrasse 75 PUCH BEI HALLEIN SZ Salzburg 5412 AT Austria Conetund taeWeF2Eeph4 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 0699 465 17 25 43 3/24/1970 49 Aries MasterCard 5400451109415331 492 9/2021 1Z 807 0E7 80 7325 125 1 3810913760 52784081 Blue Dietetic technician Endicott Johnson 2002 Smart ForFour B+ 171.6 78.0 6' 0" 182 b63ece12-4e7f-4348-bce8-1d8d5dc31dff 47.741555 13.137162
88 87 male England/Wales Mr. Zak M Leonard 27 Stroud Rd OCHTERTYRE PH7 6LF GB United Kingdom Surn1940 ohteeF5RaeM Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 078 3687 4061 44 Henry 4/4/1940 79 Aries MasterCard 5336964062411492 178 11/2023 KY 97 49 93 A 1Z 77F 43A 87 5585 068 2 5973298171 66698464 Blue Heat treating equipment tender The Independent Planners 1996 Mitsubishi Verada O+ 221.1 100.5 5' 7" 171 d667dde7-082e-4480-98d9-5bdd383eb187 56.07981 -4.643057
89 88 female Chechen (Latin) Mrs. Ezinet B Umkhayev 216 Karaiskaki Sq Ineia PA Paphos 8704 CY Cyprus (Anglicized) Mothen1991 IeH2ceebae Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 97 696060 357 Masaev 3/27/1991 28 Aries Visa 4929316208182816 160 4/2021 1Z 77V 114 81 0072 703 9 4908986837 97732729 Purple Quality assurance inspector Chief Auto Parts 2005 Jaguar XKR B+ 218.9 99.5 5' 9" 175 7cb11d28-cbd3-49c2-be74-e6dcdca65cb4 41.270842 -72.883851
90 89 female Russian Mrs. Lucia V Voronina 75 Sale-Heyfield Road KONGWAK VIC Victoria 3951 AU Australia Riets1976 aY3ohbe8ai Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0 (03) 5371 4059 61 7/25/1976 43 Leo MasterCard 5459182457974252 335 5/2024 1Z 618 731 57 5565 866 4 4199244189 20580381 Blue Eligibility interviewer William Wanamaker & Sons 2012 Tata Indica O+ 176.7 80.3 5' 7" 169 bf8e68c3-2842-49da-8c4d-1ed4712a3852 -38.465215 145.830079
91 90 male Slovenian Mr. Milorad S Musić Välja 61 Mustahamba VR Võrumaa 66258 EE Estonia Entils oan8Eiyoaz Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 789 0750 372 Flach 12/1/1968 51 Sagittarius MasterCard 5211243470404849 924 2/2022 1Z F99 556 56 0740 270 3 0643392658 82292582 Blue Housekeeper Cougar Investment 2000 Buick Rendezvous A+ 244.4 111.1 6' 1" 185 ea727fc4-6f40-412c-9574-b372a6aef26f 57.820634 26.970835
92 91 female Japanese (Anglicized) Dr. Chisaki M Fujimura 1956 Uitsig St Grahamstad EC Eastern Cape 6139 ZA South Africa Rewhe1979 iayiQu9ahsie Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 082 875 2166 27 Wakabayashi 12/22/1979 40 Capricorn Visa 4485068737963325 600 10/2020 7912221956187 1Z 480 79V 08 4325 733 4 5381640942 65242007 Brown Extruding and drawing machine setters Alert Alarm Company 2005 Porsche Cayenne O+ 216.5 98.4 5' 2" 157 b1a8327e-794e-4685-ab8b-43d20c08ed68 -33.370929 26.578978
93 92 female Russian Mrs. Inessa D Samoylova Bachloh 60 WATZING OO Upper Austria 4673 AT Austria Rolong moot9aQu2d Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0 0664 396 27 04 43 5/11/1956 63 Taurus Visa 4556264124085418 064 5/2024 1Z 596 V55 76 1512 629 4 1323820614 72364512 Purple Patternmaker Erb Lumber 2009 Honda CR-V O+ 135.3 61.5 5' 3" 161 13f411d7-0754-4233-a902-37a68ce4bb45 48.118598 13.654018
94 93 female England/Wales Ms. Elise C Pearson 215 Andrew Street Monaco Nelson 7011 NZ New Zealand Norly1997 eeNgoes7aez Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0 (027) 7329-039 64 Wilson 5/12/1997 22 Taurus Visa 4929452838531450 902 4/2022 1Z 245 516 58 7073 695 7 9911299106 38307168 Blue Activity specialist The Independent Planners 1995 Nissan President O+ 162.1 73.7 5' 9" 175 962cba25-bb3d-4b3b-8aca-e4689ee69dd5 -41.333626 173.307741
95 94 male Norwegian Mr. Herman A Johansen 1324 Mosman Rd Alexander Bay NC Northern Cape 8294 ZA South Africa Thinde loaH6shiemoh Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 083 779 9214 27 Smestad 7/29/1947 72 Leo Visa 4929709070861949 812 9/2023 4707295736082 1Z 2E1 Y32 51 4242 127 7 3751331085 25346448 Blue Cost estimator Brown Derby 2014 Audi SQ5 O+ 182.8 83.1 5' 10" 179 377c5af3-3ab2-405b-bc80-1ddd4f06ecca -28.511777 16.410349
96 95 male Russian Mr. Armen D Balabanov Bavorovská 788 Stachy JC Jihoceský kraj 384 73 CZ Czech Republic Gery1975 Gaeg6uchoh Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 606 972 932 420 1/11/1975 44 Capricorn Visa 4532025806629404 529 1/2021 1Z 245 330 29 3529 731 4 5526134624 86584996 Orange Rigger Sun Foods 1999 Isuzu VX-02 A+ 199.8 90.8 5' 6" 168 ec3e15ac-78df-4844-a8e3-c6c491c6dd39 49.090909 13.642637
97 96 male Russian Mr. Evdokim Y Bazarov Reykjarhóli 70 Fljót 570 IS Iceland Deet1996 aiRubie9Poqu Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 413 4270 354 10/5/1996 23 Libra MasterCard 5413518495816218 546 9/2024 1Z 900 9A2 84 8206 503 2 0362518360 38724903 White Insurance investigator Modern Realty 1996 ZAZ Wagon O+ 212.1 96.4 5' 8" 173 9112f339-d232-4fa7-a1f3-2e74571fd00a 66.154544 -17.801351
98 97 female Hungarian Mrs. Agoti B Gyarmaty 793 Buena Vista Avenue Corvallis OR Oregon 97330 US United States Eage1963 pheiSha1aqu Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15 541-714-1388 1 Cseh 12/7/1963 56 Sagittarius MasterCard 5328604768229802 989 8/2024 543-24-6755 1Z 238 019 37 1904 563 8 2038177985 21602780 Blue Licensed clinical social worker Circuit Design 2001 Suzuki Covie A+ 140.8 64.0 5' 4" 163 b1f4eed7-ff6b-4671-bf5d-a04c6a7b4beb 44.597298 -123.334112
99 98 male England/Wales Mr. John K Carpenter Tavcarjeva 22 Senovo 8281 SI Slovenia Foris1988 el6xoh7Qu Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0 070-783-977 386 Wheeler 1/10/1988 31 Capricorn MasterCard 5174982341006037 269 3/2020 1Z E95 341 79 8897 978 9 0309942560 54166967 Blue Marketing coordinator Balanced Fortune 1992 Mazda AZ-1 O+ 226.4 102.9 5' 8" 173 d2754fd9-f1c9-47cd-b6e1-7c8d5c0eec30 46.102339 15.464625
100 99 female Hispanic Mrs. Maha A Cazares Reyes Católicos 75 Chiclana de la Frontera CA Cádiz 11130 ES Spain Martrust57 Ohqu6achie Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 624 412 511 34 Méndez 9/25/1957 62 Libra Visa 4485545084297530 898 6/2020 1Z 508 603 87 4474 636 9 1330099115 99991665 Blue Diesel train engineer Sew-Fro Fabrics 2001 Alfa Romeo GTV A+ 213.2 96.9 5' 8" 172 d768e30b-a6de-4977-8bb8-0c8432acce44 36.447765 -6.204969
101 100 female American Mrs. Patricia J Nevels 72 Acheron Road BUNDALAGUAH VIC Victoria 3851 AU Australia Butimis1962 eekea5Thoo Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 (03) 5301 7984 61 Thomas 7/30/1962 57 Leo Visa 4716717759727577 065 9/2024 1Z 449 366 30 8287 656 2 3560472157 65535722 Blue Identification clerk PriceRite Warehouse Club 2005 Infiniti QX56 A- 209.9 95.4 5' 3" 161 6156ce11-c2a6-4266-bb4d-f47b06292e4e -38.111213 147.271178

2 rocket
3 racket

2 irocketiere
3 rock
4 pocket
5 racket

View file

@ -0,0 +1,4 @@
2 Rocket
3 Rocket
4 rocket

File diff not shown because it is too large.

File diff not shown because it is too large.

View file

@ -0,0 +1,4 @@
My name is [FIRST_NAME] [LAST_NAME] and I fly a [ROCKET]
The customer's name is [LAST_NAME], [FIRST_NAME] where is my [ROCKET]
The customer's name is [FIRST_NAME] [ROCKET]

@ -0,0 +1,15 @@
My email is [EMAIL]
My address is [ADDRESS]
My first name is [FIRST_NAME] and my last is [LAST_NAME]
My name is [PERSON]
My zip is [ZIP]
I live in [CITY]
Here's my phone number: [PHONE_NUMBER]
You want my credit card? No problem: [CREDIT_CARD]
I was born on [BIRTHDAY]
My full address is [FULL_ADDRESS]
My kids are [PERSON] and [PERSON2]
I either live on [ADDRESS] or [ADDRESS2]
Our last names are [LAST_NAME] and [LAST_NAME2]
My first name is [FIRST_NAME] and [FIRST_NAME2]
My accounts are [ACCOUNT_NUMBER] and [ACCOUNT_NUMBER2]
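
The templates above pair naturally with the span records that follow them in this diff (`full_text`, `entity_value`, `start_position`, `end_position`). The block below is a minimal sketch of how a bracketed placeholder could be filled while recording character offsets. It is not the package's actual generator: `fill_template` and `PLACEHOLDER` are hypothetical names, and the sketch keeps the placeholder name as the entity type, whereas the generated samples map, for example, `FIRST_NAME` to `PERSON`.

```python
import re

# Hypothetical pattern: upper-case placeholder names, optionally ending in a
# digit (e.g. [PERSON2]), wrapped in square brackets.
PLACEHOLDER = re.compile(r"\[([A-Z_]+\d*)\]")


def fill_template(template, values):
    """Fill each [PLACEHOLDER] from `values` and record where each value lands.

    Returns the filled text and a list of span dicts whose keys mirror the
    fields seen in the generated samples.
    """
    parts = []   # pieces of the output text, joined at the end
    spans = []   # one record per substituted placeholder
    cursor = 0   # read position in the template
    length = 0   # number of characters written to the output so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        value = values[match.group(1)]
        spans.append({
            "entity_type": match.group(1),
            "entity_value": value,
            "start_position": length,
            "end_position": length + len(value),
        })
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


text, spans = fill_template(
    "My full address is [FULL_ADDRESS]",
    {"FULL_ADDRESS": "Avda. Alameda Sundheim 46"},
)
```

With the address value taken from the first sample record, this yields a start position of 19 and an end position of 44, the same offsets as in that record. Offsets are computed on the output text rather than the template, so a template with several placeholders still produces correct spans.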

View file

@ -0,0 +1,428 @@
"full_text": "My full address is Avda. Alameda Sundheim 46",
"masked": null,
"spans": [
"entity_type": "FULL_ADDRESS",
"entity_value": "Avda. Alameda Sundheim 46",
"start_position": 19,
"end_position": 44
"tokens": [
"text": "My",
"idx": 0,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
"text": "full",
"idx": 3,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "amod",
"lemma_": "full",
"_": {
"is_in_vocabulary": false
"text": "address",
"idx": 8,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "nsubj",
"lemma_": "address",
"_": {
"is_in_vocabulary": false
"text": "is",
"idx": 16,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "ROOT",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
"text": "Avda",
"idx": 19,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Avda",
"_": {
"is_in_vocabulary": false
"text": ".",
"idx": 23,
"tag_": ".",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ".",
"_": {
"is_in_vocabulary": false
"text": "Alameda",
"idx": 25,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "compound",
"lemma_": "Alameda",
"_": {
"is_in_vocabulary": false
"text": "Sundheim",
"idx": 33,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "ROOT",
"lemma_": "Sundheim",
"_": {
"is_in_vocabulary": false
"text": "46",
"idx": 42,
"tag_": "CD",
"pos_": "NUM",
"dep_": "nummod",
"lemma_": "46",
"_": {
"is_in_vocabulary": false
"tags": [
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "Croatian",
"Country": "Uganda",
"Lowercase": false,
"Template#": 9
"full_text": "You want my credit card? No problem: 4532368231815457",
"masked": null,
"spans": [
"entity_type": "CREDIT_CARD",
"entity_value": "4532368231815457",
"start_position": 37,
"end_position": 53
"tokens": [
"text": "You",
"idx": 0,
"tag_": "PRP",
"pos_": "PRON",
"dep_": "nsubj",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
"text": "want",
"idx": 4,
"tag_": "VBP",
"pos_": "VERB",
"dep_": "ROOT",
"lemma_": "want",
"_": {
"is_in_vocabulary": false
"text": "my",
"idx": 9,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
"text": "credit",
"idx": 12,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "compound",
"lemma_": "credit",
"_": {
"is_in_vocabulary": false
"text": "card",
"idx": 19,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "dobj",
"lemma_": "card",
"_": {
"is_in_vocabulary": false
"text": "?",
"idx": 23,
"tag_": ".",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": "?",
"_": {
"is_in_vocabulary": false
"text": "No",
"idx": 25,
"tag_": "DT",
"pos_": "DET",
"dep_": "det",
"lemma_": "no",
"_": {
"is_in_vocabulary": false
"text": "problem",
"idx": 28,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "ROOT",
"lemma_": "problem",
"_": {
"is_in_vocabulary": false
"text": ":",
"idx": 35,
"tag_": ":",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ":",
"_": {
"is_in_vocabulary": false
"text": "4532368231815457",
"idx": 37,
"tag_": "CD",
"pos_": "NUM",
"dep_": "appos",
"lemma_": "4532368231815457",
"_": {
"is_in_vocabulary": false
"tags": [
"template_id": null,
"metadata": {
"Gender": "female",
"NameSet": "Czech",
"Country": "Austria",
"Lowercase": false,
"Template#": 7
"full_text": "My first name is Rogelio and my last is Patrick",
"masked": null,
"spans": [
"entity_type": "PERSON",
"entity_value": "Rogelio",
"start_position": 17,
"end_position": 24
"entity_type": "PERSON",
"entity_value": "Patrick",
"start_position": 40,
"end_position": 47
"tokens": [
"text": "My",
"idx": 0,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
"text": "first",
"idx": 3,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "amod",
"lemma_": "first",
"_": {
"is_in_vocabulary": false
"text": "name",
"idx": 9,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "nsubj",
"lemma_": "name",
"_": {
"is_in_vocabulary": false
"text": "is",
"idx": 14,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "ROOT",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
"text": "Rogelio",
"idx": 17,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Rogelio",
"_": {
"is_in_vocabulary": false
"text": "and",
"idx": 25,
"tag_": "CC",
"pos_": "CCONJ",
"dep_": "cc",
"lemma_": "and",
"_": {
"is_in_vocabulary": false
"text": "my",
"idx": 29,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
"text": "last",
"idx": 32,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "nsubj",
"lemma_": "last",
"_": {
"is_in_vocabulary": false
"text": "is",
"idx": 37,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "conj",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
"text": "Patrick",
"idx": 40,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Patrick",
"_": {
"is_in_vocabulary": false
"tags": [
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "American",
"Country": "California",
"Lowercase": false,
"Template#": 2

@ -0,0 +1,3 @@
from .model_mock import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, \
    MockTokensModel


@ -0,0 +1,50 @@
from typing import List

from presidio_evaluator import InputSample, ModelEvaluator


class MockTokensModel(ModelEvaluator):
    """
    Simulates a real model, returns the prediction given in the constructor
    """

    def __init__(self, prediction: List[str], entities_to_keep: List = None,
                 verbose: bool = False, **kwargs):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose,
                         **kwargs)
        self.prediction = prediction

    def predict(self, sample: InputSample) -> List[str]:
        return self.prediction


class IdentityTokensMockModel(ModelEvaluator):
    """
    Simulates a real model, always returns the label as the prediction
    """

    def __init__(self, entities_to_keep: List = None,
                 verbose: bool = False):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)

    def predict(self, sample: InputSample) -> List[str]:
        return sample.tags


class FiftyFiftyIdentityTokensMockModel(ModelEvaluator):
    """
    Simulates a real model, returns the label on every second call
    and no predictions (a list of 'O') otherwise
    """

    def __init__(self, entities_to_keep: List = None,
                 verbose: bool = False):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
        self.counter = 0

    def predict(self, sample: InputSample) -> List[str]:
        self.counter += 1
        if self.counter % 2 == 0:
            return sample.tags
        return ["O" for _ in sample.tags]
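The alternating mock above can be illustrated without the package. The following is a minimal sketch, assuming a hypothetical `Sample` dataclass in place of `InputSample` and a hand-rolled `token_recall` helper in place of `ModelEvaluator`; neither name comes from `presidio_evaluator`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical stand-in for InputSample: tokens and gold tags only."""
    tokens: List[str]
    tags: List[str]


class AlternatingMockModel:
    """Returns the gold tags on every second call and all 'O' otherwise."""

    def __init__(self) -> None:
        self.counter = 0

    def predict(self, sample: Sample) -> List[str]:
        self.counter += 1
        if self.counter % 2 == 0:
            return sample.tags
        return ["O"] * len(sample.tags)


def token_recall(model: AlternatingMockModel, samples: List[Sample]) -> float:
    """Share of gold entity tokens for which the model predicted any entity."""
    hits = 0
    total = 0
    for sample in samples:
        predicted = model.predict(sample)
        for gold, pred in zip(sample.tags, predicted):
            if gold != "O":
                total += 1
                if pred != "O":
                    hits += 1
    return hits / total


samples = [
    Sample(tokens=["My", "name", "is", "Dana"], tags=["O", "O", "O", "U-PERSON"])
    for _ in range(10)
]
recall = token_recall(AlternatingMockModel(), samples)
print(recall)
```

With an even number of identical samples the recall lands at exactly one half, which is why the corresponding dataset test only checks that recall falls within a band around 0.5.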


@ -0,0 +1,22 @@
import numpy as np

from presidio_evaluator.crf_evaluator import CRFEvaluator
from presidio_evaluator.data_generator import read_synth_dataset


# no_test since the CRF model is not supplied with the package
def no_test_test_crf_simple():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    model_path = os.path.abspath(os.path.join(dir_path, "..", "model-outputs/crf.pickle"))

    crf_evaluator = CRFEvaluator(model_pickle_path=model_path, entities_to_keep=['PERSON'])
    evaluation_results = crf_evaluator.evaluate_all(input_samples)
    scores = crf_evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
    assert scores.pii_recall > 0
    assert scores.pii_precision > 0

from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset


def test_to_conll():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    conll = InputSample.create_conll_dataset(input_samples)

    sentences = conll['sentence'].unique()
    assert len(sentences) == len(input_samples)


def test_to_spacy_all_entities():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    spacy_ver = InputSample.create_spacy_dataset(input_samples)

    assert len(spacy_ver) == len(input_samples)


def test_to_spacy_all_entities_specific_entities():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    spacy_ver = InputSample.create_spacy_dataset(input_samples, entities=['PERSON'])

    # Keep only the samples that still carry at least one entity
    spacy_ver_with_labels = [sample for sample in spacy_ver if len(sample[1]['entities'])]

    assert len(spacy_ver_with_labels) < len(input_samples)
    assert len(spacy_ver_with_labels) > 0


def test_to_spacy_json():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    spacy_ver = InputSample.create_spacy_json(input_samples)

    assert len(spacy_ver) == len(input_samples)
    assert 'id' in spacy_ver[0]
    assert 'paragraphs' in spacy_ver[0]
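The tests above index each converted sample as `sample[1]['entities']`, which matches the tuple layout `(text, {"entities": [(start, end, label), ...]})`. Below is a minimal sketch of such a conversion with an optional entity filter. The function name `to_spacy_tuples` and the plain-dict sample layout are assumptions made for illustration; this is not the implementation behind `InputSample.create_spacy_dataset`.

```python
from typing import Dict, List, Optional, Tuple

SpacyExample = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def to_spacy_tuples(samples: List[dict],
                    entities: Optional[List[str]] = None) -> List[SpacyExample]:
    """Convert span-annotated samples to (text, {"entities": [...]}) tuples.

    When `entities` is given, spans of any other type are dropped, so a
    sample may end up with an empty entity list.
    """
    dataset = []
    for sample in samples:
        kept = [
            (span["start_position"], span["end_position"], span["entity_type"])
            for span in sample["spans"]
            if entities is None or span["entity_type"] in entities
        ]
        dataset.append((sample["full_text"], {"entities": kept}))
    return dataset


# Two samples shaped like the records in the generated dataset
samples = [
    {
        "full_text": "You want my credit card? No problem: 4532368231815457",
        "spans": [{"entity_type": "CREDIT_CARD",
                   "entity_value": "4532368231815457",
                   "start_position": 37, "end_position": 53}],
    },
    {
        "full_text": "My first name is Rogelio and my last is Patrick",
        "spans": [{"entity_type": "PERSON", "entity_value": "Rogelio",
                   "start_position": 17, "end_position": 24},
                  {"entity_type": "PERSON", "entity_value": "Patrick",
                   "start_position": 40, "end_position": 47}],
    },
]

all_entities = to_spacy_tuples(samples)
person_only = to_spacy_tuples(samples, entities=["PERSON"])
with_labels = [example for example in person_only if example[1]["entities"]]
print(len(all_entities), len(with_labels))
```

Filtering keeps the dataset length unchanged and only empties the entity lists, which is why the test counts labeled samples rather than comparing dataset sizes.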

try:
    from flair.models import SequenceTagger
except ImportError:
    print("Flair is not installed by default")

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.flair_evaluator import FlairEvaluator

import numpy as np


# no-unit because flair is not a dependency by default
def no_unit_test_flair_simple():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    model = SequenceTagger.load('ner-ontonotes-fast')  # .load('ner')

    flair_evaluator = FlairEvaluator(model=model, entities_to_keep=['PERSON'])
    evaluation_results = flair_evaluator.evaluate_all(input_samples)
    scores = flair_evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
    assert scores.pii_recall > 0
    assert scores.pii_precision > 0

@ -0,0 +1,121 @@
from presidio_evaluator.data_generator import generate, read_synth_dataset, FakeDataGenerator
def get_fake_generator(template, fake_pii_df):
class MockFakeGenerator(FakeDataGenerator):
Mock class that doesn't add to the fake PII DF so you could inject entities yourself.
def __init__(self, **kwargs):
def prep_fake_pii(self, df):
return df
return MockFakeGenerator(templates=[template],
def test_generator_correct_output():
OUTPUT = "generated_test.txt"
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
fake_pii_csv = "{}/data/FakeNameGenerator.com_100.csv".format(dir_path)
utterances_file = "{}/data/templates.txt".format(dir_path)
dictionary = "{}/data/Dictionary_test.csv".format(dir_path)
input_samples = read_synth_dataset(OUTPUT)
for sample in input_samples:
assert len(sample.tags) == len(sample.tokens)
def test_a_turned_to_an():
    fake_pii_df = get_mock_fake_df(GENDER="Ale")
    template = "I am a [GENDER] living in [COUNTRY]"
    bracket_location = template.find("[")
    fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
                                        template=template)
    examples = list(fake_generator.sample_examples(1))
    assert " an " in examples[0].full_text
    # entity location updated: "a" became "an", shifting the span by one
    assert examples[0].spans[0].start_position == bracket_location + 1


def test_a_not_turning_into_an():
    fake_pii_df = get_mock_fake_df(GENDER="Male")
    template = "I am a [GENDER] living in [COUNTRY]"
    previous_bracket = template.find("[")
    fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
                                        template=template)
    examples = list(fake_generator.sample_examples(1))
    assert " an " not in examples[0].full_text
    assert examples[0].spans[0].start_position == previous_bracket


def test_A_turning_into_An():
    fake_pii_df = get_mock_fake_df(GENDER="ale")
    template = "A [GENDER] living in [COUNTRY]"
    previous_bracket = template.find("[")
    fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
                                        template=template)
    examples = list(fake_generator.sample_examples(1))
    assert "An " in examples[0].full_text
    assert examples[0].spans[0].start_position == previous_bracket + 1
def get_mock_fake_df(**kwargs):
    # Named fake_pii rather than dict to avoid shadowing the builtin
    fake_pii = {
        "Number": 1,
        "Gender": "Male",
        "NameSet": "English",
        "Title": "Mr.",
        "GivenName": "Dondo",
        "MiddleInitial": "N",
        "Surname": "Mondo",
        "StreetAddress": "Where I live 15",
        "City": "Amsterdam",
        "State": "",
        "StateFull": "",
        "ZipCode": "12345",
        "Country": "Netherlands",
        "CountryFull": "Netherlands",
        "EmailAddress": "",
        "Username": "Dondo12",
        "Password": "123456",
        "TelephoneNumber": "+1412391",
        "TelephoneCountryCode": "14",
        "MothersMaiden": "",
        "Birthday": "15 Aug 1966",
        "Age": "200",
        "CCType": "astercard",
        "CCNumber": "12371832821",
        "CVV2": "123",
        "CCExpires": "19-19",
        "NationalID": "14124",
        "Occupation": "Hunter",
        "Company": "Lolo and sons",
        "Domain": ""}
    # Assumed step (missing from this excerpt): callers pass entity values
    # such as GENDER="Ale", which only take effect if merged into the row.
    fake_pii.update(kwargs)
    import pandas as pd
    fake_pii_df = pd.DataFrame(fake_pii, index=[0])
    return fake_pii_df
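The three article tests above check two things: that "a" becomes "an" before a value starting with a vowel, and that the entity's start offset moves by one character when it does. The sketch below shows that bookkeeping in isolation. The helper `fill_placeholder` is hypothetical and is not the generator's own code; it handles only a single placeholder and only the articles "a" and "A".

```python
from typing import Tuple

VOWELS = "aeiou"


def fill_placeholder(template: str, placeholder: str, value: str) -> Tuple[str, int]:
    """Substitute `value` for `placeholder`.

    Returns the resulting text and the start offset of `value` in it.
    """
    start = template.find(placeholder)
    if start < 0:
        raise ValueError(f"{placeholder} not found in template")

    prefix = template[:start]
    suffix = template[start + len(placeholder):]

    starts_with_vowel = bool(value) and value[0].lower() in VOWELS
    # Only a standalone article directly before the placeholder is rewritten
    article_before = prefix in ("a ", "A ") or prefix.endswith((" a ", " A "))
    if starts_with_vowel and article_before:
        # "a " -> "an ": the prefix grows by one character, so the span shifts too
        prefix = prefix[:-1] + "n "

    return prefix + value + suffix, len(prefix)


template = "I am a [GENDER] living in [COUNTRY]"
text, start = fill_placeholder(template, "[GENDER]", "Ale")
print(text, start)
```

Returning the offset together with the text keeps the span consistent with the rewritten article, which is the property the `start_position` assertions verify.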

import numpy as np
import pytest
from presidio_evaluator import InputSample, EvaluationResult
from presidio_evaluator.data_generator import read_synth_dataset
from tests.mocks import IdentityTokensMockModel, \
FiftyFiftyIdentityTokensMockModel, MockTokensModel
def test_evaluator_simple():
prediction = ["O", "O", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample(full_text="I am the walrus",
masked="I am the [ANIMAL]",
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluated = model.evaluate_sample(sample)
final_evaluation = model.calculate_score(
assert final_evaluation.pii_precision == 1
assert final_evaluation.pii_recall == 1
def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
prediction = ["O", "O", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction,
sample = InputSample(full_text="I am the walrus",
masked="I am the [ANIMAL]",
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluated = model.evaluate_sample(sample)
assert evaluated.results[("O", "O")] == 4
def test_evaluate_same_entity_correct_statistics():
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample(full_text="I dog the walrus",
masked="I [ANIMAL] the [ANIMAL]",
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluation_result = model.evaluate_sample(sample)
assert evaluation_result.results[("O", "O")] == 2
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
assert evaluation_result.results[("O", "ANIMAL")] == 1
def test_evaluate_multiple_entities_to_keep_correct_statistics():
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
model = MockTokensModel(prediction=prediction, labeling_scheme='BIO',
entities_to_keep=['ANIMAL', 'PLANT', 'SPACESHIP'])
sample = InputSample(full_text="I dog the walrus",
masked="I [ANIMAL] the [ANIMAL]",
sample.tokens = ["I", "am", "the", "walrus"]
sample.tags = ["O", "O", "O", "U-ANIMAL"]
evaluation_result = model.evaluate_sample(sample)
assert evaluation_result.results[("O", "O")] == 2
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
assert evaluation_result.results[("O", "ANIMAL")] == 1
def test_evaluate_multiple_tokens_correct_statistics():
prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample("I am the walrus americanus magnifico", masked=None,
sample.tokens = ["I", "am", "the",
"walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O",
evaluated = model.evaluate_sample(sample)
evaluation = model.calculate_score(
assert evaluation.pii_precision == 1
assert evaluation.pii_recall == 1
def test_evaluate_multiple_tokens_partial_match_correct_statistics():
prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample("I am the walrus americanus magnifico", masked=None,
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = model.evaluate_sample(sample)
evaluation = model.calculate_score(
assert evaluation.pii_precision == 1
assert evaluation.pii_recall == 4 / 6
def test_evaluate_multiple_tokens_no_match_match_correct_statistics():
prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
sample = InputSample("I am the walrus americanus magnifico", masked=None,
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
evaluated = model.evaluate_sample(sample)
evaluation = model.calculate_score(
assert np.isnan(evaluation.pii_precision)
assert evaluation.pii_recall == 0
def test_evaluate_multiple_examples_correct_statistics():
prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
model = MockTokensModel(prediction=prediction,
input_sample = InputSample("My name is Raphael or David", masked=None,
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
evaluated = model.evaluate_all(
[input_sample, input_sample, input_sample, input_sample])
scores = model.calculate_score(
assert scores.pii_precision == 0.5
assert scores.pii_recall == 0.5
def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
model = MockTokensModel(prediction=prediction,
entities_to_keep=['PERSON', 'TENNIS_PLAYER'])
input_sample = InputSample("My name is Raphael or David", masked=None,
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
evaluated = model.evaluate_all(
[input_sample, input_sample, input_sample, input_sample])
scores = model.calculate_score(evaluated)
assert scores.pii_precision == 1
assert scores.pii_recall == 1
def test_confusion_matrix_correct_metrics():
    from collections import Counter
    # Keys are (annotated, predicted) pairs
    evaluated = [EvaluationResult(results=Counter({
        ('O', 'O'): 150,
        ('O', 'PERSON'): 30,
        ('O', 'COMPANY'): 30,
        ('PERSON', 'PERSON'): 40,
        # restored entry: implied by the COMPANY scores asserted below
        ('COMPANY', 'COMPANY'): 40,
        ('PERSON', 'COMPANY'): 10,
        ('COMPANY', 'PERSON'): 10,
        ('PERSON', 'O'): 30,
        ('COMPANY', 'O'): 30}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None,
                            entities_to_keep=['PERSON', 'COMPANY'])

    scores = model.calculate_score(evaluated, beta=2.5)

    assert scores.pii_precision == 0.625
    assert scores.pii_recall == 0.625
    assert scores.entity_recall_dict['PERSON'] == 0.5
    assert scores.entity_precision_dict['PERSON'] == 0.5
    assert scores.entity_recall_dict['COMPANY'] == 0.5
    assert scores.entity_precision_dict['COMPANY'] == 0.5
def test_confusion_matrix_2_correct_metrics():
    from collections import Counter
    evaluated = [EvaluationResult(results=Counter(
        {('O', 'O'): 65467,
         ('O', 'ORG'): 4189,
         ('GPE', 'O'): 3370,
         ('PERSON', 'PERSON'): 2024,
         ('GPE', 'PERSON'): 1488,
         ('GPE', 'GPE'): 1033,
         ('O', 'GPE'): 964,
         ('ORG', 'ORG'): 914,
         ('O', 'PERSON'): 834,
         ('GPE', 'ORG'): 401,
         ('PERSON', 'ORG'): 35,
         ('PERSON', 'O'): 33,
         ('ORG', 'O'): 8,
         ('PERSON', 'GPE'): 5,
         ('ORG', 'PERSON'): 1}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None)
    scores = model.calculate_score(evaluated, beta=2.5)

    # Any entity predicted where any entity was annotated counts as a PII hit.
    # ('ORG', 'GPE') is absent from the Counter and therefore contributes 0.
    pii_tp = evaluated[0].results[('PERSON', 'PERSON')] + \
        evaluated[0].results[('ORG', 'ORG')] + \
        evaluated[0].results[('GPE', 'GPE')] + \
        evaluated[0].results[('ORG', 'GPE')] + \
        evaluated[0].results[('ORG', 'PERSON')] + \
        evaluated[0].results[('GPE', 'ORG')] + \
        evaluated[0].results[('GPE', 'PERSON')] + \
        evaluated[0].results[('PERSON', 'GPE')] + \
        evaluated[0].results[('PERSON', 'ORG')]

    pii_fp = evaluated[0].results[('O', 'PERSON')] + \
        evaluated[0].results[('O', 'GPE')] + \
        evaluated[0].results[('O', 'ORG')]

    pii_fn = evaluated[0].results[('PERSON', 'O')] + \
        evaluated[0].results[('GPE', 'O')] + \
        evaluated[0].results[('ORG', 'O')]

    assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
    assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)
def test_dataset_to_metric_identity_model():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
"{}/data/generated_small.txt".format(dir_path), length=10)
model = IdentityTokensMockModel()
evaluation_results = model.evaluate_all(input_samples)
metrics = model.calculate_score(
assert metrics.pii_precision == 1
assert metrics.pii_recall == 1
def test_dataset_to_metric_50_50_model():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(
"{}/data/generated_small.txt".format(dir_path), length=100)
# Replace 50% of the predictions with a list of "O"
model = FiftyFiftyIdentityTokensMockModel(entities_to_keep=['PERSON'])
evaluation_results = model.evaluate_all(input_samples)
metrics = model.calculate_score(
assert metrics.pii_precision == 1
assert metrics.pii_recall < 0.75
assert metrics.pii_recall > 0.25
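The confusion-matrix tests in this file rely on a Counter keyed by `(annotated, predicted)` pairs. Here is a minimal sketch of how PII-level and per-entity scores follow from such a Counter. The helper names (`pii_scores`, `entity_scores`, `f_beta`) are illustrative and not the package's API, the F-beta formula is the textbook one and may differ from what `calculate_score` computes, and the `('COMPANY', 'COMPANY')` count of 40 is an assumption implied by the COMPANY scores the test asserts.

```python
from collections import Counter
from typing import Tuple


def _ratio(numerator: float, denominator: float) -> float:
    """Division that yields NaN when there is nothing to divide by."""
    return numerator / denominator if denominator else float("nan")


def pii_scores(results: Counter, outside: str = "O") -> Tuple[float, float]:
    """PII-level precision and recall: entity types are not distinguished."""
    tp = fp = fn = 0
    for (annotated, predicted), count in results.items():
        if annotated != outside and predicted != outside:
            tp += count  # some entity predicted where some entity was annotated
        elif annotated == outside and predicted != outside:
            fp += count  # entity predicted on a non-entity token
        elif annotated != outside and predicted == outside:
            fn += count  # annotated entity token missed entirely
    return _ratio(tp, tp + fp), _ratio(tp, tp + fn)


def entity_scores(results: Counter, entity: str) -> Tuple[float, float]:
    """Precision and recall for one entity type."""
    tp = results[(entity, entity)]
    predicted_total = sum(c for (_, p), c in results.items() if p == entity)
    annotated_total = sum(c for (a, _), c in results.items() if a == entity)
    return _ratio(tp, predicted_total), _ratio(tp, annotated_total)


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Textbook F-beta: beta > 1 weighs recall more heavily than precision."""
    b2 = beta ** 2
    return _ratio((1 + b2) * precision * recall, b2 * precision + recall)


results = Counter({
    ("O", "O"): 150,
    ("O", "PERSON"): 30,
    ("O", "COMPANY"): 30,
    ("PERSON", "PERSON"): 40,
    ("COMPANY", "COMPANY"): 40,  # assumed, see the note above
    ("PERSON", "COMPANY"): 10,
    ("COMPANY", "PERSON"): 10,
    ("PERSON", "O"): 30,
    ("COMPANY", "O"): 30,
})

precision, recall = pii_scores(results)
person_precision, person_recall = entity_scores(results, "PERSON")
print(precision, recall, person_precision, person_recall)
print(f_beta(precision, recall, beta=2.5))
```

Returning NaN for an empty denominator mirrors the test that expects `pii_precision` to be NaN when the model predicts none of the kept entities.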


@ -0,0 +1,80 @@
"""Presidio Analyzer not yet on PyPI, ignoring temporarily"""
# import pytest
# from presidio_evaluator import InputSample, Span
# from presidio_evaluator.data_generator import read_synth_dataset
# from presidio_evaluator.presidio_analyzer import PresidioAnalyzer
# class GeneratedTextTestCase:
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
# self.test_name = test_name
# self.test_input = test_input
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
# def to_pytest_param(self):
# return pytest.param(self.test_input, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
# # generated-text test cases
# analyzer_test_generate_text_testdata = [
# # small set fixture which expects all results.
# GeneratedTextTestCase(
# test_name="small-set",
# test_input="{}/data/generated_small.txt",
# acceptance_threshold=0.3,
# marks=pytest.mark.none
# )
# ]
# @pytest.mark.skip(reason="Presidio analyzer not on PyPI")
# def test_analyzer_simple_input():
# model = PresidioAnalyzer(entities_to_keep=['PERSON'])
# sample = InputSample(full_text="My name is Mike",
# masked="My name is [PERSON]",
# spans=[Span('PERSON', 'Mike', 10, 14)],
# create_tags_from_span=True)
# evaluated = model.evaluate_sample(sample)
# metrics = model.calculate_score(
# [evaluated])
# assert metrics.pii_precision == 1
# assert metrics.pii_recall == 1
# # analyzer tests on generated data
# @pytest.mark.skip(reason="Presidio analyzer not on PyPI")
# @pytest.mark.parametrize("test_input,acceptance_threshold",
# [testcase.to_pytest_param() for testcase in
# analyzer_test_generate_text_testdata])
# def test_analyzer_with_generated_text(test_input, acceptance_threshold):
# """
# Test analyzer with a generated dataset text file
# :param test_input: input text file location
# :param acceptance_threshold: minimum precision/recall
# allowed for tests to pass
# """
# # read test input from generated file
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# input_samples = read_synth_dataset(
# test_input.format(dir_path))
# updated_samples = PresidioAnalyzer. \
# align_input_samples_to_presidio_analyzer(input_samples)
# analyzer = PresidioAnalyzer()
# evaluated_samples = analyzer.evaluate_all(updated_samples)
# scores = analyzer.calculate_score(evaluation_results=evaluated_samples)
# assert acceptance_threshold <= scores.pii_precision
# assert acceptance_threshold <= scores.pii_recall

"""Presidio Analyzer not yet on PyPI, ignoring temporarily"""
# from presidio_evaluator.data_generator import read_synth_dataset
# from presidio_evaluator.presidio_recognizer_evaluator import score_presidio_recognizer
# import pytest
# from analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
# # test case parameters for tests with dataset which was previously generated.
# class GeneratedTextTestCase:
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
# self.test_name = test_name
# self.test_input = test_input
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
# def to_pytest_param(self):
# return pytest.param(self.test_input, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
# # generated-text test cases
# cc_test_generate_text_testdata = [
# # small set fixture which expects all type results.
# GeneratedTextTestCase(
# test_name="small-set",
# test_input="{}/data/generated_small.txt",
# acceptance_threshold=1,
# marks=pytest.mark.none
# ),
# # large set fixture which expects all type results. marked as "slow"
# GeneratedTextTestCase(
# test_name="large_set",
# test_input="{}/data/generated_large.txt",
# acceptance_threshold=1,
# marks=pytest.mark.slow
# )
# ]
# # credit card recognizer tests on generated data
# @pytest.mark.parametrize("test_input,acceptance_threshold",
# [testcase.to_pytest_param()
# for testcase in cc_test_generate_text_testdata])
# def test_credit_card_recognizer_with_generated_text(test_input, acceptance_threshold):
# """
# Test credit card recognizer with a generated dataset text file
# :param test_input: input text file location
# :param acceptance_threshold: minimum precision/recall
# allowed for tests to pass
# """
# # read test input from generated file
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# input_samples = read_synth_dataset(
# test_input.format(dir_path))
# scores = score_presidio_recognizer(
# CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
# assert acceptance_threshold <= scores.pii_f

"""Presidio Analyzer not yet on PyPI, ignoring temporarily"""
# from presidio_evaluator.data_generator import generate
# from presidio_evaluator.presidio_recognizer_evaluator import \
# score_presidio_recognizer
# import pytest
# import numpy as np
# from analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
# # test case parameters for tests with dataset generated from a template and csv values
# class TemplateTextTestCase:
# def __init__(self, test_name, pii_csv, utterances, dictionary_path,
# num_of_examples, acceptance_threshold, marks):
# self.test_name = test_name
# self.pii_csv = pii_csv
# self.utterances = utterances
# self.dictionary_path = dictionary_path
# self.num_of_examples = num_of_examples
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
# def to_pytest_param(self):
# return pytest.param(self.pii_csv, self.utterances, self.dictionary_path,
# self.num_of_examples, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
# # template-dataset test cases
# cc_test_template_testdata = [
# # large dataset fixture. marked as slow
# TemplateTextTestCase(
# test_name="fake-names-100",
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# utterances="{}/data/templates.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=0.9,
# marks=pytest.mark.slow
# )
# ]
# # credit card recognizer tests on template-generated data
# @pytest.mark.parametrize("pii_csv, "
# "utterances, "
# "dictionary_path, "
# "num_of_examples, "
# "acceptance_threshold",
# [testcase.to_pytest_param()
# for testcase in cc_test_template_testdata])
# def test_credit_card_recognizer_with_template(pii_csv, utterances,
# dictionary_path,
# num_of_examples,
# acceptance_threshold):
# """
# Test credit card recognizer with a dataset generated from
# template and a CSV values file
# :param pii_csv: input csv file location
# :param utterances: template file location
# :param dictionary_path: dictionary/vocabulary file location
# :param num_of_examples: number of samples to be used from dataset
# to test
# :param acceptance_threshold: minimum precision/recall
# allowed for tests to pass
# """
# # read template and CSV files
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# input_samples = generate(fake_pii_csv=pii_csv.format(dir_path),
# utterances_file=utterances.format(dir_path),
# dictionary_path=dictionary_path.format(dir_path),
# lower_case_ratio=0.5,
# num_of_examples=num_of_examples)
# scores = score_presidio_recognizer(
# CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
# if not np.isnan(scores.pii_f):
# assert acceptance_threshold <= scores.pii_f

"""Presidio Analyzer not yet on PyPI, ignoring temporarily"""
# from presidio_evaluator.data_generator import FakeDataGenerator
# from presidio_evaluator.presidio_recognizer_evaluator import \
# score_presidio_recognizer
# import pandas as pd
# import pytest
# import numpy as np
# from analyzer import Pattern, PatternRecognizer
# # test case parameters for tests with dataset generated from a template and
# # two csv value files, one containing the common-entities and another one with custom entities
# class PatternRecognizerTestCase:
# def __init__(self, test_name, entity_name, pattern, score, pii_csv, ext_csv,
# utterances, dictionary_path, num_of_examples, acceptance_threshold,
# max_mistakes_number, marks):
# self.test_name = test_name
# self.entity_name = entity_name
# self.pattern = pattern
# self.score = score
# self.pii_csv = pii_csv
# self.ext_csv = ext_csv
# self.utterances = utterances
# self.dictionary_path = dictionary_path
# self.num_of_examples = num_of_examples
# self.acceptance_threshold = acceptance_threshold
# self.max_mistakes_number = max_mistakes_number
# self.marks = marks
# def to_pytest_param(self):
# return pytest.param(self.pii_csv, self.ext_csv, self.utterances,
# self.dictionary_path,
# self.entity_name, self.pattern, self.score,
# self.num_of_examples, self.acceptance_threshold,
# self.max_mistakes_number, id=self.test_name,
# marks=self.marks)
# # template-dataset test cases
# rocket_test_template_testdata = [
# # large dataset fixture. marked as slow.
# # all input is correct, test is conclusive
# PatternRecognizerTestCase(
# test_name="rocket-no-errors",
# entity_name="ROCKET",
# pattern=r'\W*(rocket)\W*',
# score=0.8,
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# ext_csv="{}/data/FakeRocketGenerator.csv",
# utterances="{}/data/rocket_example_sentences.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=1,
# max_mistakes_number=0,
# marks=pytest.mark.slow
# ),
# # large dataset fixture. marked as slow
# # all input is incorrect, test is conclusive
# PatternRecognizerTestCase(
# test_name="rocket-all-errors",
# entity_name="ROCKET",
# pattern=r'\W*(rocket)\W*',
# score=0.8,
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# ext_csv="{}/data/FakeRocketErrorsGenerator.csv",
# utterances="{}/data/rocket_example_sentences.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=0,
# max_mistakes_number=100,
# marks=pytest.mark.slow
# ),
# # large dataset fixture. marked as slow
# # some input is correct some is not, test is inconclusive
# PatternRecognizerTestCase(
# test_name="rocket-some-errors",
# entity_name="ROCKET",
# pattern=r'\W*(rocket)\W*',
# score=0.8,
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# ext_csv="{}/data/FakeRocket50PercentErrorsGenerator.csv",
# utterances="{}/data/rocket_example_sentences.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=0.3,
# max_mistakes_number=70,
# marks=[pytest.mark.slow, pytest.mark.inconclusive]
# )
# ]
# @pytest.mark.parametrize(
# "pii_csv, ext_csv, utterances, dictionary_path, "
# "entity_name, pattern, score, num_of_examples, "
# "acceptance_threshold, max_mistakes_number",
# [testcase.to_pytest_param()
# for testcase in rocket_test_template_testdata])
# def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
# entity_name, pattern,
# score, num_of_examples, acceptance_threshold,
# max_mistakes_number):
# """
# Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
# and another CSV values file with a custom entity
# :param pii_csv: input csv file location with the common entities
# :param ext_csv: input csv file location with custom entities
# :param utterances: template file location
# :param dictionary_path: vocabulary/dictionary file location
# :param entity_name: custom entity name
# :param pattern: recognizer pattern
# :param num_of_examples: number of samples to be used from dataset to test
# :param acceptance_threshold: minimum precision/recall
# allowed for tests to pass
# """
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# dfpii = pd.read_csv(pii_csv.format(dir_path), encoding='utf-8')
# dfext = pd.read_csv(ext_csv.format(dir_path), encoding='utf-8')
# dictionary_path = dictionary_path.format(dir_path)
# ext_column_name = dfext.columns[0]
# def get_from_ext(i):
# index = i % dfext.shape[0]
# return dfext.iat[index, 0]
# # extend pii with ext data
# dfpii[ext_column_name] = [get_from_ext(i) for i in range(0, dfpii.shape[0])]
# # generate examples
# generator = FakeDataGenerator(fake_pii_csv_file=dfpii,
# utterances_file=utterances.format(dir_path),
# dictionary_path=dictionary_path)
# examples = generator.sample_examples(num_of_examples)
# pattern = Pattern("test pattern", pattern, score)
# pattern_recognizer = PatternRecognizer(entity_name,
# name="test recognizer",
# patterns=[pattern])
# scores = score_presidio_recognizer(
# pattern_recognizer, [entity_name], examples)
# if not np.isnan(scores.pii_f):
# assert acceptance_threshold <= scores.pii_f
# assert max_mistakes_number >= len(scores.model_errors)

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.spacy_evaluator import SpacyEvaluator
import numpy as np
def test_spacy_simple():
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
evaluation_results = spacy_evaluator.evaluate_all(input_samples)
scores = spacy_evaluator.calculate_score(evaluation_results)
np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
assert scores.pii_recall > 0
assert scores.pii_precision > 0


@ -0,0 +1,63 @@
"""Presidio Analyzer not yet on PyPI, ignoring temporarily"""
# from presidio_evaluator.data_generator import read_synth_dataset
# from presidio_evaluator.presidio_recognizer_evaluator import \
#     score_presidio_recognizer
# import pytest
# from analyzer.predefined_recognizers.spacy_recognizer import SpacyRecognizer
#
#
# # test case parameters for tests with dataset which was previously generated.
# class GeneratedTextTestCase:
#     def __init__(self, test_name, test_input, acceptance_threshold, marks):
#         self.test_name = test_name
#         self.test_input = test_input
#         self.acceptance_threshold = acceptance_threshold
#         self.marks = marks
#
#     def to_pytest_param(self):
#         return pytest.param(self.test_input, self.acceptance_threshold,
#                             id=self.test_name, marks=self.marks)
#
#
# # generated-text test cases
# cc_test_generate_text_testdata = [
#     # small dataset, inconclusive results
#     GeneratedTextTestCase(
#         test_name="small-set",
#         test_input="{}/data/generated_small.txt",
#         acceptance_threshold=0.5,
#         marks=pytest.mark.inconclusive
#     ),
#     # large dataset - test is slow and inconclusive
#     GeneratedTextTestCase(
#         test_name="large-set",
#         test_input="{}/data/generated_large.txt",
#         acceptance_threshold=0.5,
#         marks=pytest.mark.slow
#     )
# ]
#
#
# # spaCy recognizer tests on generated data
# @pytest.mark.parametrize("test_input,acceptance_threshold",
#                          [testcase.to_pytest_param() for testcase in
#                           cc_test_generate_text_testdata])
# def test_spacy_recognizer_with_generated_text(test_input, acceptance_threshold):
#     """
#     Test spacy recognizer with a generated dataset text file
#     :param test_input: input text file location
#     :param acceptance_threshold: minimum precision/recall
#      allowed for tests to pass
#     """
#     # read test input from generated file
#     import os
#     dir_path = os.path.dirname(os.path.realpath(__file__))
#     input_samples = read_synth_dataset(
#         test_input.format(dir_path))
#     scores = score_presidio_recognizer(
#         SpacyRecognizer(), ['PERSON'], input_samples, True)
#     assert acceptance_threshold <= scores.pii_f

@ -0,0 +1,212 @@
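from presidio_evaluator import span_to_tag

# Assumption: the scheme constants are not defined anywhere in this view;
# the string values below are inferred from the scheme names used by the tests.
BILOU_SCHEME = "BILOU"
BIO_SCHEME = "BIO"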


def test_span_to_bio_multiple_tokens():
    text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
    start = 14
    end = 38
    tag = "ADDRESS"
    bio = span_to_tag(BIO_SCHEME, text, [start], [end], [tag])
    expected = ['O', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS',
                'I-ADDRESS', 'I-ADDRESS', 'I-ADDRESS', 'O', 'O', 'O', 'O', 'O']
    assert bio == expected


def test_span_to_bio_single_at_end():
    text = "My name is Josh"
    start = 11
    end = 15
    tag = "NAME"
    bio = span_to_tag(BIO_SCHEME, text, [start], [end], [tag])
    expected = ['O', 'O', 'O', 'I-NAME']
    assert bio == expected


def test_span_to_bilou_multiple_tokens():
    text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
    start = 14
    end = 38
    tag = "ADDRESS"
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], [tag])
    expected = ['O', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS',
                'I-ADDRESS', 'I-ADDRESS', 'L-ADDRESS', 'O', 'O', 'O', 'O', 'O']
    assert bilou == expected


def test_span_to_bilou_adjacent_entities():
    text = "Mr. Tree"
    start1 = 0
    end1 = 2
    start2 = 4
    end2 = 8
    start = [start1, start2]
    end = [end1, end2]
    tag = ["TITLE", "NAME"]
    bilou = span_to_tag(BILOU_SCHEME, text, start, end, tag)
    expected = ['U-TITLE', 'U-NAME']
    assert bilou == expected


def test_span_to_bilou_single_at_end():
    text = "My name is Josh"
    start = 11
    end = 15
    tag = "NAME"
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], [tag])
    expected = ['O', 'O', 'O', 'U-NAME']
    assert bilou == expected


def test_span_to_bilou_multiple_entities():
    text = "My name is Josh or David"
    start1 = 11
    end1 = 15
    start2 = 19
    end2 = 26
    start = [start1, start2]
    end = [end1, end2]
    tag = ["NAME", "NAME"]
    bilou = span_to_tag(BILOU_SCHEME, text, start, end, tag)
    expected = ['O', 'O', 'O', 'U-NAME', 'O', 'U-NAME']
    assert bilou == expected


def test_span_to_bio_multiple_entities():
    text = "My name is Josh or David"
    start1 = 11
    end1 = 15
    start2 = 19
    end2 = 26
    start = [start1, start2]
    end = [end1, end2]
    tag = ["NAME", "NAME"]
    bio = span_to_tag(scheme=BIO_SCHEME, text=text, start=start,
                      end=end, tag=tag)
    expected = ['O', 'O', 'O', 'I-NAME', 'O', 'I-NAME']
    assert bio == expected


def test_span_to_bio_specific_input():
    text = "Someone stole my credit card. The number is 5277716201469117 and " \
           "the my name is Mary Anguiano"
    start = 80
    end = 93
    expected = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
                'O', 'O', 'B-PERSON', 'I-PERSON']
    tag = ["PERSON"]
    bio = span_to_tag(BIO_SCHEME, text, [start], [end], tag)
    assert bio == expected


def test_span_to_bilou_specific_input():
    text = "Someone stole my credit card. The number is 5277716201469117 and " \
           "the my name is Mary Anguiano"
    start = 80
    end = 93
    expected = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
                'O', 'O', 'B-PERSON', 'L-PERSON']
    tag = ["PERSON"]
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], tag)
    assert bilou == expected


def test_span_to_bilou_adjacent_identical_entities():
    text = "May I get access to Jessica Gump's account?"
    start = 20
    end = 32
    expected = ['O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O']
    tag = ["PERSON"]
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], tag)
    assert bilou == expected


def test_overlapping_entities_first_ends_in_mid_second():
    text = "My new phone number is 1 705 774 8720. Thanks, man"
    start = [22, 25]
    end = [37, 37]
    scores = [0.6, 0.6]
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
    assert io == expected


def test_overlapping_entities_second_embedded_in_first_with_lower_score():
    text = "My new phone number is 1 705 774 8720. Thanks, man"
    start = [22, 25]
    end = [37, 33]
    scores = [0.6, 0.5]
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
    assert io == expected


def test_overlapping_entities_second_embedded_in_first_has_higher_score():
    text = "My new phone number is 1 705 774 8720. Thanks, man"
    start = [23, 25]
    end = [37, 28]
    scores = [0.6, 0.7]
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
    assert io == expected


def test_overlapping_entities_pyramid():
    text = "My new phone number is 1 705 999 774 8720. Thanks, cya"
    start = [23, 25, 29]
    end = [41, 36, 32]
    scores = [0.6, 0.7, 0.8]
    tag = ["A1", "B2", "C3"]
    expected = ['O', 'O', 'O', 'O', 'O', 'A1', 'B2', 'C3', 'B2',
                'A1', 'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
    assert io == expected
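
The tests above exercise the conversion of character spans into per-token BIO and BILOU tags. As a rough illustration of that idea, here is a minimal sketch that does not use `presidio_evaluator` at all. It assumes plain whitespace tokenization (the package's own tokenization splits punctuation differently), and the helper names `whitespace_tokens` and `spans_to_tags` are hypothetical. It also follows the conventional BIO rule that every entity starts with `B-`, whereas the tests above expect `I-` for a single-token entity under BIO, so its output is not interchangeable with `span_to_tag`.

```python
from typing import List, Sequence, Tuple


def whitespace_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) triples using whitespace splitting."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens


def spans_to_tags(text: str, starts: Sequence[int], ends: Sequence[int],
                  labels: Sequence[str], scheme: str = "BILOU") -> List[str]:
    """Tag every token that overlaps a character span [start, end)."""
    tokens = whitespace_tokens(text)
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in zip(starts, ends, labels):
        # Indices of the tokens that overlap this span.
        idx = [i for i, (_, tok_start, tok_end) in enumerate(tokens)
               if tok_start < span_end and tok_end > span_start]
        if not idx:
            continue
        if scheme == "BIO":
            for n, i in enumerate(idx):
                tags[i] = ("B-" if n == 0 else "I-") + label
        elif scheme == "BILOU":
            if len(idx) == 1:
                tags[idx[0]] = "U-" + label  # unit-length entity
            else:
                tags[idx[0]] = "B-" + label  # beginning
                for i in idx[1:-1]:
                    tags[i] = "I-" + label   # inside
                tags[idx[-1]] = "L-" + label  # last
        else:
            raise ValueError(f"unsupported scheme: {scheme}")
    return tags


if __name__ == "__main__":
    print(spans_to_tags("My name is Josh or David",
                        [11, 19], [15, 24], ["NAME", "NAME"]))
    # ['O', 'O', 'O', 'U-NAME', 'O', 'U-NAME']
```

The sketch ignores overlapping spans and scores; later spans simply overwrite earlier ones.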

@ -0,0 +1,98 @@
import pytest

from presidio_evaluator import InputSample
from presidio_evaluator.validation import split_by_template, get_samples_by_pattern, split_dataset


def get_mock_dataset():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample5 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample6 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample7 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    sample8 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    return [sample1, sample2, sample3, sample4, sample5, sample6, sample7, sample8]


def test_split_by_template():
    dataset = get_mock_dataset()
    train_templates, test_templates = split_by_template(dataset, 0.5)
    assert len(train_templates) == 2
    assert len(test_templates) == 2


def test_get_samples_by_pattern():
    dataset = get_mock_dataset()
    train_templates, test_templates = split_by_template(dataset, 0.5)
    train_samples = get_samples_by_pattern(dataset, train_templates)
    test_samples = get_samples_by_pattern(dataset, test_templates)

    dataset_templates = {sample.metadata['Template#'] for sample in dataset}
    train_samples_templates = {sample.metadata['Template#'] for sample in train_samples}
    test_samples_templates = {sample.metadata['Template#'] for sample in test_samples}

    assert len(train_samples) + len(test_samples) == len(dataset)
    assert dataset_templates == train_samples_templates | test_samples_templates
    assert train_samples_templates & test_samples_templates == set()
    assert train_samples_templates == set(train_templates)
    assert test_samples_templates == set(test_templates)


def test_split_dataset_two_sets():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    train, test = split_dataset([sample1, sample2, sample3, sample4], [0.5, 0.5])
    assert len(train) == 2
    assert len(test) == 2


def test_split_dataset_four_sets():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    dataset = [sample1, sample2, sample3, sample4]
    train, test, val, dev = split_dataset(dataset, [0.25, 0.25, 0.25, 0.25])
    assert len(train) == 1
    assert len(test) == 1
    assert len(val) == 1
    assert len(dev) == 1

    # make sure all original template IDs are in the new sets
    original_keys = {1, 2, 3, 4}
    t1 = {sample.metadata['Template#'] for sample in train}
    t2 = {sample.metadata['Template#'] for sample in test}
    t3 = {sample.metadata['Template#'] for sample in dev}
    t4 = {sample.metadata['Template#'] for sample in val}
    assert original_keys == t1 | t2 | t3 | t4


def test_split_dataset_test_with_0_ratio():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    dataset = [sample1, sample2, sample3, sample4]
    with pytest.raises(ValueError):
        train, test, zero = split_dataset(dataset, [0.5, 0.5, 0])


def test_split_dataset_test_with_smallish_ratio():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    dataset = [sample1, sample2, sample3, sample4]
    train, test, zero = split_dataset(dataset, [0.5, 0.4999995, 0.0000005])
    assert len(train) == 2
    assert len(test) == 2
    assert len(zero) == 0
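
The split tests above check that a dataset is divided by template ID, so that no template ends up in more than one set. The following is a minimal sketch of one way such a split could work; it is not the implementation in `presidio_evaluator.validation`. It assumes samples are plain dicts with a hypothetical `"template_id"` key, and the function name `split_by_template_id` is likewise made up for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def split_by_template_id(samples: Sequence[dict], ratios: Sequence[float],
                         seed: int = 0) -> List[List[dict]]:
    """Split samples into len(ratios) sets without sharing templates across sets."""
    if any(r <= 0 for r in ratios):
        raise ValueError("all ratios must be positive")
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")

    # Group samples by template so that a template is assigned as a whole.
    by_template: Dict[int, List[dict]] = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)

    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    # Cut the shuffled template list at the cumulative ratio boundaries.
    splits, start, cumulative = [], 0, 0.0
    for ratio in ratios:
        cumulative += ratio
        stop = round(cumulative * len(template_ids))
        chosen = template_ids[start:stop]
        splits.append([s for t in chosen for s in by_template[t]])
        start = stop
    return splits


if __name__ == "__main__":
    data = [{"text": "Hi there", "template_id": i} for i in (1, 2, 3, 4)]
    train, test = split_by_template_id(data, [0.5, 0.5])
    print(len(train), len(test))  # 2 2
```

Rounding the cumulative boundaries, rather than each ratio separately, keeps the set sizes summing to the number of templates, and a very small ratio can yield an empty set, as in the last test above.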