This commit is contained in:
omri374 2020-01-06 22:59:12 +02:00
Parent d93c57420d
Commit a25510b8bc
73 changed files with 170703 additions and 35 deletions

109
.gitignore vendored

@ -20,8 +20,6 @@ parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
@ -40,14 +38,12 @@ pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
@ -59,7 +55,6 @@ coverage.xml
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
@ -77,26 +72,11 @@ target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
# celery beat schedule file
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
@ -122,8 +102,87 @@ venv.bak/
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# Binaries for programs and plugins
*.exe
*.exe~
*.dll
*.so
*.dylib
# Test binary, build with `go test -c`
*.test
# Output of the go coverage tool, specifically when used with LiteIDE
*.out
.vscode
debug.test
debug
/bin/
*/rootfs/*
/tests/testdata/*-generated.*
.DS_Store
vendor/
*.db
#pycharm
.idea/*
.idea
#Ignore thumbnails created by Windows
Thumbs.db
#Ignore files built by Visual Studio
*.obj
*.pdb
*.user
*.aps
*.pch
*.vspscc
*_i.c
*_p.c
*.ncb
*.suo
*.tlb
*.tlh
*.bak
*.cache
*.ilk
[Bb]in
[Dd]ebug*/
*.lib
*.sbr
obj/
[Rr]elease*/
_ReSharper*/
[Tt]est[Rr]esult*
.vs/
#Nuget packages folder
packages/
#R
# History files
.Rhistory
.Rapp.history
# Session Data files
.RData
# Example code in package build process
*-Ex.R
# Output files from R CMD build
/*.tar.gz
# Output files from R CMD check
/*.Rcheck/
# RStudio files
.Rproj.user/
.Rproj.user
model-outputs/
datasets/
/model-outputs/

138
README.md

@ -1,14 +1,132 @@
# Presidio-evaluator
This package features data-science related tasks for developing new recognizers for Presidio.
It is used for evaluating the entire Presidio system, as well as specific PII recognizers or PII detection models.
## Who should use it?
Anyone interested in evaluating an existing Presidio instance, a specific PII recognizer, or in developing new models or logic for detecting PII can leverage the preexisting work in this package.
Additionally, anyone interested in generating new data based on previous datasets (e.g. to increase the coverage of entity values) for Named Entity Recognition models can leverage the data generator contained in this package.
## What's in this package?
1. **Data generator** for PII recognizers and NER models
2. **Data representation layer** for data generation, modeling and analysis
3. Multiple **Model/Recognizer evaluation** files (e.g. for spaCy, Flair, CRF, Presidio API, Presidio Analyzer python package, specific Presidio recognizers)
4. **Training and modeling code** for multiple models
5. Helper functions for **results analysis**
## 1. Data generation
See [Data Generator README](/presidio_evaluator/data_generator/README.md) for more details.
The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
- For an example of running the generation process, see [this notebook](notebooks/Generate%20data.ipynb).
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
Note that the generation process might not work off the shelf, as we are not sharing the fake PII datasets and templates used in this analysis due to copyright and other restrictions.
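The following is a minimal sketch of the generation call, mirroring the parameters used in the Generate data notebook; the input paths are placeholders for your own fake PII CSV, templates file and dictionary:
```python
from presidio_evaluator.data_generator.main import generate, read_synth_dataset

# Placeholder paths: a fake-PII CSV and a templates file with [ENTITY] placeholders
examples = generate(fake_pii_csv="raw_data/FakeNameGenerator.com_100.csv",
                    utterances_file="raw_data/templates.txt",
                    dictionary_path="raw_data/Dictionary.csv",
                    output_file="generated_samples.txt",
                    lower_case_ratio=0.1,                       # ratio of samples to lowercase
                    num_of_examples=100,
                    ignore_types={"IP_ADDRESS", "US_SSN", "URL"},
                    keep_only_tagged=False,
                    span_to_tag=True)                           # also create tokens + token-level tags

# Read the generated file back into a List[InputSample]
input_samples = read_synth_dataset("generated_samples.txt")
```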
Once data is generated, it could be split into train/test/validation sets while ensuring that each template only exists in one set. See [this notebook for more details](notebooks/Split%20by%20pattern%20%23.ipynb).
## 2. Data representation
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [data_objects.py](presidio_evaluator/data_objects.py).
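As a rough illustration (a sketch based on how these objects are used across the notebooks, not the full API), an `InputSample` exposes the raw text, character-level spans and token-level tags:
```python
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("data/synth_dataset.txt")
sample = samples[0]

print(sample.full_text)                        # the raw sentence
print(sample.metadata)                         # e.g. template id, gender, name set
for span in sample.spans:                      # character-level PII annotations
    print(span.entity_type, span.entity_value)
print(list(zip(sample.tokens, sample.tags)))   # token-level labels (IO/IOB/BILOU)
```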
## 3. Recognizer evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer, in terms of precision and recall.
The main logic lies in the [ModelEvaluator](presidio_evaluator/model_evaluator.py) class. It provides a structured way of evaluating models and recognizers.
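All evaluators follow the same flow: call `evaluate_all` on a list of `InputSample`s and aggregate the results with `calculate_score`. A sketch, following the Presidio API evaluation notebook in this repo (the endpoint and entity list are placeholders):
```python
from presidio_evaluator import ModelEvaluator, PresidioAPIEvaluator
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("data/synth_dataset.txt")
presidio = PresidioAPIEvaluator(entities_to_keep=["PERSON", "LOCATION"],
                                endpoint="http://<my-presidio>/api/v1/projects/test/analyze")

evaluation_results = presidio.evaluate_all(samples[:100])
score = presidio.calculate_score(evaluation_results)
score.print()                       # precision/recall per entity and for PII overall

# Drill down into model mistakes
errors = score.model_errors
ModelEvaluator.most_common_fp_tokens(errors, n=5)
```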
### Ready evaluators
Some evaluators were developed for analysis and reference. These include:
#### 1. Presidio API evaluator
Allows you to evaluate an existing Presidio deployment through the API. [See this notebook for details](notebooks/Evaluate%20Presidio-API.ipynb).
#### 2. Presidio analyzer evaluator
Allows you to evaluate the local Presidio-Analyzer package. It is faster than the API option, but requires Presidio-Analyzer to be installed locally. [See this class for more information](presidio_evaluator/presidio_analyzer.py).
#### 3. One recognizer evaluator
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/presidio_recognizer_evaluator.py)
## 4. Modeling
### Conditional Random Fields
To train a CRF on a new dataset, see [this notebook](notebooks/models/CRF.ipynb).
To evaluate a CRF model, see [this notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/crf_evaluator.py).
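In code, the CRF evaluation flow looks roughly like this (a sketch based on the evaluation notebooks; the dataset and pickle paths are placeholders):
```python
from presidio_evaluator.crf_evaluator import CRFEvaluator
from presidio_evaluator.data_generator import read_synth_dataset

test_samples = read_synth_dataset("data/generated_test.json")              # placeholder path
crf_evaluator = CRFEvaluator(model_pickle_path="model-outputs/crf.pickle")  # placeholder path
evaluation_results = crf_evaluator.evaluate_all(test_samples)
scores = crf_evaluator.calculate_score(evaluation_results)
print(scores.results)   # confusion matrix between gold and predicted tags
```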
### spaCy based models
There are three ways of interacting with spaCy models:
1. Evaluate an existing trained model
2. Train with pretrained embeddings
3. Fine tune an existing spaCy model
Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
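In short, the notebook turns a `List[InputSample]` into the two formats spaCy consumes; a condensed sketch (paths are placeholders):
```python
import json
import pickle

from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset

train_samples = read_synth_dataset("data/generated_train.json")   # placeholder path
train_tagged = [s for s in train_samples if len(s.spans) > 0]     # keep sentences with entities

spacy_train = InputSample.create_spacy_dataset(train_tagged)      # [(text, {"entities": [...]}), ...]
spacy_train_json = InputSample.create_spacy_json(train_tagged)    # spaCy CLI training format

with open("data/train.pickle", "wb") as handle:
    pickle.dump(spacy_train, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open("data/train.json", "w") as f:
    json.dump(spacy_train_json, f)
```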
#### Evaluate an existing trained model
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
#### Train with pretrained embeddings
In order to train a new spaCy model from scratch with pretrained embeddings (FastText wiki news subword in this case), follow these three steps:
##### 1. Download FastText pretrained (sub) word embeddings
``` sh
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
```
##### 2. Init spaCy model with pre-trained embeddings
Using spaCy CLI:
``` sh
python -m spacy init-model en spacy_fasttext --vectors-loc wiki-news-300d-1M-subword.vec
```
##### 3. Train spaCy NER model
Using spaCy CLI:
``` sh
python -m spacy train en spacy_fasttext_100 train.json test.json --vectors spacy_fasttext --pipeline ner -n 100
```
#### Fine-tune an existing spaCy model
See [this code for retraining an existing spaCy model](models/spacy_retrain.py).
First, create train and test pickle files for your train and test sets (see [this notebook](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb) for more information). Then run a `SpacyRetrainer`:
```python
from models import SpacyRetrainer
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
experiment_name='new_spacy_experiment',
n_iter=500, dropout=0.1, aml_config=None)
spacy_retrainer.run()
```
### Flair based models
To train a new model, see the [FlairTrainer](models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
To train a Flair model, run:
```python
from models import FlairTrainer
train_samples = "../data/generated_train.json"
test_samples = "../data/generated_test.json"
val_samples = "../data/generated_validation.json"
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
corpus = trainer.read_corpus("")
trainer.train(corpus)
```
To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).
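To tag new text with a trained Flair model (e.g. the one written by `FlairTrainer` to `resources/taggers/presidio-ner`), the standard Flair API can be used. A sketch, assuming the trained model file exists at that path:
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Path produced by FlairTrainer.train(); adjust to your own output location
tagger = SequenceTagger.load("resources/taggers/presidio-ner/final-model.pt")

sentence = Sentence("My name is David Brown and I live in Jerusalem")
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity)
```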
# Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
Copyright notice:
Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/)
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.

2
VERSION Normal file

@ -0,0 +1,2 @@
0.0

74124
data/synth_dataset.txt Normal file

File diff suppressed because it is too large

2
models/__init__.py Normal file

@ -0,0 +1,2 @@
from .spacy_retrain import SpacyRetrainer
from .flair_train import FlairTrainer

123
models/flair_train.py Normal file

@ -0,0 +1,123 @@
from typing import List
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset
from os import path
class FlairTrainer:
def to_flair_row(self, text, pos, label):
return "{} {} {}".format(text, pos, label)
def to_flair(self, df, outfile="flair_train.txt"):
sentence = 0
flair = []
for row in df.itertuples():
if row.sentence != sentence:
sentence += 1
flair.append("")  # blank line separates sentences in the CoNLL-style file
flair.append(self.to_flair_row(row.text, row.pos, row.label))
if outfile:
with open(outfile, "w", encoding="utf-8") as f:
for item in flair:
f.write("{}\n".format(item))
def create_flair_corpus(self, train_samples_path, test_samples_path, val_samples_path):
if not path.exists("flair_train.txt"):
train_samples = read_synth_dataset(train_samples_path)
train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
print("Kept {} train samples after removal of non-tagged samples".format(len(train_tagged)))
train_data = InputSample.create_conll_dataset(train_tagged)
self.to_flair(train_data, outfile="flair_train.txt")
if not path.exists("flair_test.txt"):
test_samples = read_synth_dataset(test_samples_path)
test_data = InputSample.create_conll_dataset(test_samples)
self.to_flair(test_data, outfile="flair_test.txt")
if not path.exists("flair_val.txt"):
val_samples = read_synth_dataset(val_samples_path)
val_data = InputSample.create_conll_dataset(val_samples)
self.to_flair(val_data, outfile="flair_val.txt")
@staticmethod
def read_corpus(data_folder) -> Corpus:
columns = {0: 'text', 1: 'pos', 2: 'ner'}
corpus: Corpus = ColumnCorpus(data_folder, columns,
train_file='flair_train.txt',
test_file='flair_val.txt',
dev_file='flair_test.txt')
return corpus
@staticmethod
def train(corpus):
print(corpus)
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)
# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
WordEmbeddings('glove'),
FlairEmbeddings('news-forward'),
FlairEmbeddings('news-backward')
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
embeddings=embeddings,
tag_dictionary=tag_dictionary,
tag_type=tag_type,
use_crf=True)
# 6. initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
checkpoint = 'resources/taggers/presidio-ner/checkpoint.pt'
# trainer = ModelTrainer.load_checkpoint(checkpoint, corpus)
trainer.train('resources/taggers/presidio-ner',
learning_rate=0.1,
mini_batch_size=32,
max_epochs=150,
checkpoint=True)
sentence = Sentence('I am from Jerusalem')
# run NER over sentence
tagger.predict(sentence)
print(sentence)
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
print(entity)
if __name__ == "__main__":
train_samples = "../data/generated_train_November 12 2019.json"
test_samples = "../data/generated_test_November 12 2019.json"
val_samples = "../data/generated_validation_November 12 2019.json"
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
corpus = trainer.read_corpus("")
trainer.train(corpus)

206
models/spacy_retrain.py Normal file

@ -0,0 +1,206 @@
import logging
import pickle
import random
import sys
from pathlib import Path
import spacy
from azureml.core import Workspace, Experiment
from spacy.util import minibatch, compounding
from presidio_evaluator import SpacyEvaluator, InputSample
logging.basicConfig(level=logging.INFO)
root = logging.getLogger()
root.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
root.addHandler(handler)
class SpacyRetrainer:
def __init__(self, original_model_name=None, experiment_name=None, n_iter=100, dropout=0.5,
aml_config='config.json', output_dir='../../model-outputs', train_pickle='../data/train.pickle',
test_pickle='../data/test.pickle'):
self.experiment_name = experiment_name
if aml_config:
self.ws = Workspace.from_config(aml_config)
self.experiment = Experiment(workspace=self.ws, name=experiment_name)
self.aml_run = self.experiment.start_logging()
self.has_aml = True
else:
self.has_aml = False
self.model = original_model_name
self.n_iter = n_iter
self.output_dir = output_dir
self.train_file = train_pickle
self.test_file = test_pickle
self.dropout = dropout
def run(self):
if self.has_aml:
self.aml_run.log("model", self.model)
self.aml_run.log("n_iter", self.n_iter)
self.aml_run.log("train_file", self.train_file)
self.aml_run.log("test_file", self.test_file)
self.aml_run.log("dropout rate", self.dropout)
model_path = self._train(self.model, self.output_dir, self.n_iter, self.train_file, self.experiment_name)
self._score_validate(model_path, self.test_file)
if self.has_aml:
self.aml_run.complete()
def print_scores(self, split, evaluation_result):
"""
Logs results into experiment run.
:param split: Name of this split. For ex 'train' or 'valid'
:param evaluation_result: EvaluationResult containing various metrics
:return: None. Writes to experiment runner and logs locally.
"""
logging.info('SPLIT: {0}. PII_precision: {1}, PII_recall: {2},'
'Person_precision: {3}, Person_recall: {4}'. \
format(split, evaluation_result.pii_precision, evaluation_result.pii_recall,
evaluation_result.entity_precision_dict['PERSON'],
evaluation_result.entity_recall_dict['PERSON']))
if self.has_aml:
self.aml_run.log('Precision', evaluation_result.pii_precision, split)
self.aml_run.log('Recall', evaluation_result.pii_recall, split)
@staticmethod
def _score(model, data):
"""
Score the model against the data
:param model: Trained model
:param data: Data split which is being scored.
:return: An EvaluationResult containing various metrics
"""
spacy_evaluator = SpacyEvaluator(model=model)
results = []
for text, ground_truth_annotations in data:
ground_truth_entities = ground_truth_annotations['entities']
input_sample = InputSample.from_spacy(text, ground_truth_entities)
results.append(spacy_evaluator.evaluate_sample(input_sample))
return spacy_evaluator.calculate_score(evaluation_results=results)
def _score_validate(self, model_path, test_data_file):
"""
Validation step for the model. Also prints the scores.
:param model_path: Path to trained model.
:param test_data_file: Data file which has the dataset for this split.
:return: None. Prints the scores.
"""
with open(test_data_file, 'rb') as f:
valid_data = pickle.load(f)
nlp = spacy.load(model_path)
self.print_scores('Valid', self._score(nlp, valid_data))
# @plac.annotations(
# model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
# output_dir=("Optional output directory", "option", "o", Path),
# n_iter=("Number of training iterations", "option", "n", int),
# train_file=("File containing pickled training Spacy NER formatted data", "option", "d", Path),
# test_file=("File containing pickled test Spacy NER formatted data", "option", "d", Path),
# exp_name=("Name of this experiment", "option", "e")
# )
def _train(self, model, output_dir, n_iter, train_file, exp_name):
"""Load the model, set up the pipeline and train the entity recognizer."""
nlp = self.load_or_create_empty_model(model)
if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
else:
ner = nlp.get_pipe("ner")
with open(train_file, 'rb') as f:
train_data = pickle.load(f)
# DEBUG: uncomment to train on a small subset while debugging
# train_data = train_data[:50]
# add labels
for _, annotations in train_data:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes): # only train NER
# reset and initialize the weights randomly – but only if we're
# training a new model
if model is None:
nlp.begin_training()
for itn in range(n_iter):
random.shuffle(train_data)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, drop=self.dropout, losses=losses, )
logging.debug("Losses", losses)
if self.has_aml:
self.aml_run.log('Losses', losses['ner'])
self.print_scores('Itn {}'.format(itn), self._score(nlp, train_data))
self.print_scores('Train', self._score(nlp, train_data))
saved_model_path = self.save_model(exp_name, nlp, output_dir)
return saved_model_path
@staticmethod
def save_model(exp_name, model, output_dir):
"""
Saves model to disk for later use.
:param exp_name: Name of the running experiment. This is used as folder name for storing the model.
:param model: Model being saved
:param output_dir: Directory where to save the model.
:return: Full path to saved model.
"""
saved_model_path = Path(output_dir, exp_name)
if not saved_model_path.exists():
saved_model_path.mkdir(parents=True)
model.to_disk(saved_model_path)
logging.info("Saved model to {}".format(output_dir))
return saved_model_path
@staticmethod
def load_model(exp_name, model_dir):
"""
Loads a spacy model from disk
:param exp_name: Name of experiment under which the model was saved
:param model_dir: path to saved model
:return: spacy model
"""
saved_model_path = Path(model_dir, exp_name)
return spacy.load(saved_model_path)
@staticmethod
def load_or_create_empty_model(model=None):
"""
Loads a given model or creates a blank english model.
:param model: Optional Model to load.
:return: Loaded or blank model.
"""
if model:
nlp = spacy.load(model)
logging.debug("Loaded model {}".format(model))
else:
nlp = spacy.blank("en")
logging.debug("Created blank 'en' model")
return nlp
if __name__ == "__main__":
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
experiment_name='spacy_new_ontonotes28',
n_iter=500, dropout=0.5, aml_config=None)
spacy_retrainer.run()

151
models/spacy_streamlit.py Normal file

@ -0,0 +1,151 @@
# coding: utf-8
"""
Example of a Streamlit app for an interactive spaCy model visualizer. You can
either download the script, or point streamlit run to the raw URL of this
file. For more details, see https://streamlit.io.
Installation:
pip install streamlit
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download de_core_news_sm
Usage:
streamlit run streamlit_spacy.py
"""
from __future__ import unicode_literals
import streamlit as st
import spacy
from spacy import displacy
import pandas as pd
SPACY_MODEL_NAMES = ["en_core_web_lg", "spacy_new_ontonotes28","spacy_ft_100/model-final"]
DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
@st.cache(allow_output_mutation=True)
def load_model(name):
return spacy.load(name)
@st.cache(allow_output_mutation=True)
def process_text(model_name, text):
nlp = load_model(model_name)
return nlp(text)
st.sidebar.title("Interactive spaCy visualizer")
st.sidebar.markdown(
"""
Process text with [spaCy](https://spacy.io) models and visualize named entities,
dependencies and more. Uses spaCy's built-in
[displaCy](http://spacy.io/usage/visualizers) visualizer under the hood.
"""
)
spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES)
model_load_state = st.info(f"Loading model '{spacy_model}'...")
nlp = load_model(spacy_model)
model_load_state.empty()
text = st.text_area("Text to analyze", DEFAULT_TEXT)
doc = process_text(spacy_model, text)
if "parser" in nlp.pipe_names:
st.header("Dependency Parse & Part-of-speech tags")
st.sidebar.header("Dependency Parse")
split_sents = st.sidebar.checkbox("Split sentences", value=True)
collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True)
collapse_phrases = st.sidebar.checkbox("Collapse phrases")
compact = st.sidebar.checkbox("Compact mode")
options = {
"collapse_punct": collapse_punct,
"collapse_phrases": collapse_phrases,
"compact": compact,
}
docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
for sent in docs:
html = displacy.render(sent, options=options)
# Double newlines seem to mess with the rendering
html = html.replace("\n\n", "\n")
if split_sents and len(docs) > 1:
st.markdown(f"> {sent.text}")
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
if "ner" in nlp.pipe_names:
st.header("Named Entities")
st.sidebar.header("Named Entities")
label_set = nlp.get_pipe("ner").labels
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
html = displacy.render(doc, style="ent", options={"ents": labels})
# Newlines seem to mess with the rendering
html = html.replace("\n", " ")
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
attrs = ["text", "label_", "start", "end", "start_char", "end_char"]
if "entity_linker" in nlp.pipe_names:
attrs.append("kb_id_")
data = [
[str(getattr(ent, attr)) for attr in attrs]
for ent in doc.ents
if ent.label_ in labels
]
df = pd.DataFrame(data, columns=attrs)
st.dataframe(df)
if "textcat" in nlp.pipe_names:
st.header("Text Classification")
st.markdown(f"> {text}")
df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score"))
st.dataframe(df)
vector_size = nlp.meta.get("vectors", {}).get("width", 0)
if vector_size:
st.header("Vectors & Similarity")
st.code(nlp.meta["vectors"])
text1 = st.text_input("Text or word 1", "apple")
text2 = st.text_input("Text or word 2", "orange")
doc1 = process_text(spacy_model, text1)
doc2 = process_text(spacy_model, text2)
similarity = doc1.similarity(doc2)
if similarity > 0.5:
st.success(similarity)
else:
st.error(similarity)
st.header("Token attributes")
if st.button("Show token attributes"):
attrs = [
"idx",
"text",
"lemma_",
"pos_",
"tag_",
"dep_",
"head",
"ent_type_",
"ent_iob_",
"shape_",
"is_alpha",
"is_ascii",
"is_digit",
"is_punct",
"like_num",
]
data = [[str(getattr(token, attr)) for attr in attrs] for token in doc]
df = pd.DataFrame(data, columns=attrs)
st.dataframe(df)
st.header("JSON Doc")
if st.button("Show JSON Doc"):
st.json(doc.to_json())
st.header("JSON model meta")
if st.button("Show JSON model meta"):
st.json(nlp.meta)

245
notebooks/Evaluate Presidio-API.ipynb Normal file

@ -0,0 +1,245 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"from presidio_evaluator import ModelEvaluator\n",
"from collections import Counter\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"MY_PRESIDIO_ENDPOINT = \"http://presidio-api.westeurope.cloudapp.azure.com/api/v1/projects/test/analyze\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate your Presidio instance via the Presidio API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A. Read dataset for evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_samples = read_synth_dataset(\"../data/synth_dataset.txt\")\n",
"print(\"Read {} samples\".format(len(input_samples)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### B. Descriptive statistics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flatten = lambda l: [item for sublist in l for item in sublist]\n",
"\n",
"count_per_entity = Counter([span.entity_type for span in flatten([input_sample.spans for input_sample in input_samples])])\n",
"count_per_entity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### C. Match the dataset's entity names with Presidio's entity names"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity\n",
"entities_mapping = {\n",
" 'PERSON': 'PERSON',\n",
" 'EMAIL': 'EMAIL_ADDRESS',\n",
" 'CREDIT_CARD': 'CREDIT_CARD',\n",
" 'FIRST_NAME': 'PERSON',\n",
" 'PHONE_NUMBER': 'PHONE_NUMBER',\n",
" 'LOCATION':'LOCATION',\n",
" # 'BIRTHDAY': 'DATE_TIME',\n",
" # 'DATE': 'DATE_TIME',\n",
" 'DOMAIN': 'DOMAIN',\n",
" # 'CITY': 'LOCATION',\n",
" # 'ADDRESS': 'LOCATION',\n",
" 'IBAN': 'IBAN_CODE',\n",
" # 'URL': 'DOMAIN_NAME',\n",
" 'US_SSN': 'US_SSN',\n",
" 'IP_ADDRESS': 'IP_ADDRESS',\n",
" # 'ORGANIZATION':'ORG'\n",
" 'O': 'O'\n",
"}\n",
"presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',\n",
" 'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']\n",
"\n",
"new_list = ModelEvaluator.align_input_samples_to_presidio_analyzer(input_samples,\n",
" entities_mapping,\n",
" presidio_fields)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### D. Recalculate statistics on updated dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## recheck counter\n",
"count_per_entity_new = Counter([span.entity_type for span in flatten([input_sample.spans for input_sample in new_list])])\n",
"count_per_entity_new"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### E. Run the presidio-evaluator framework with Presidio's API as the 'model' at test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator import PresidioAPIEvaluator\n",
"presidio = PresidioAPIEvaluator(entities_to_keep=list(count_per_entity_new.keys()),endpoint=MY_PRESIDIO_ENDPOINT)\n",
"evaluted_samples = presidio.evaluate_all(new_list[:100])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### F. Extract statistics\n",
"- Presicion, recall and F measure are calculated based on a PII/Not PII binary classification per token.\n",
"- Specific entity recall and precision are calculated on the specific PII entity level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluation_result = presidio.calculate_score(evaluted_samples)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluation_result.print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### G. Analyze wrong predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"errors = evaluation_result.model_errors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ModelEvaluator.most_common_fp_tokens(errors,n=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
"fps_df[['full_text','token','prediction']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')\n",
"fns_df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

226
notebooks/Generate data.ipynb Normal file

@ -0,0 +1,226 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from tqdm import tqdm_notebook as tqdm\n",
"from presidio_evaluator.data_generator.main import generate,read_synth_dataset\n",
"\n",
"import datetime\n",
"import json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate fake PII data using Presidio's data generator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Presidio's data generator allows you to generate a synthetic dataset with two preriquisites:\n",
"1. A fake PII csv (We used https://www.fakenamegenerator.com/)\n",
"2. A text file with template sentences or paragraphs. In this file, each PII entity placeholder is written in brackets. The name of the PII entity should be one of the columns in the fake PII csv file.\n",
"\n",
"The generator creates fake sentences based on the provided fake PII csv AND a list of [extension functions](../presidio_evaluator/data_generator/extensions.py) and a few additional 3rd party libraries like `Faker`, and `haikunator`.\n",
"\n",
"\n",
"For example:\n",
"1. **A fake PII csv**:\n",
"\n",
"| FIRST_NAME | LAST_NAME | EMAIL |\n",
"|-------------|-------------|-----------|\n",
"| David | Brown | david.brown@jobhop.com |\n",
"| Mel | Brown | melb@hobjob.com |\n",
"\n",
"\n",
"2. **Templates**:\n",
"\n",
"My name is [FIRST_NAME]\n",
"\n",
"You can email me at [EMAIL]. Thanks, [FIRST_NAME]\n",
"\n",
"What's your last name? It's [LAST_NAME]\n",
"\n",
"Every time I see you falling I get down on my knees and pray\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate files\n",
"Based on these two prerequisites, a requested number of examples and an output file name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"EXAMPLES = 100\n",
"SPAN_TO_TAG = True #Whether to create tokens + token labels (tags)\n",
"TEMPLATES_FILE = '../presidio_evaluator/data_generator/' \\\n",
" 'raw_data/ontonotes_based_templates.txt'\n",
"KEEP_ONLY_TAGGED = False\n",
"LOWER_CASE_RATIO = 0.1\n",
"IGNORE_TYPES = {\"IP_ADDRESS\", 'US_SSN', 'URL'}\n",
"\n",
"OUTPUT = \"generated_size_{}_date_{}.txt\".format(EXAMPLES, cur_time)\n",
"\n",
"cur_time = datetime.date.today().strftime(\"%B %d %Y\")\n",
"fake_pii_csv = '../presidio_evaluator/data_generator/' \\\n",
" 'raw_data/FakeNameGenerator.com_100.csv'\n",
"utterances_file = TEMPLATES_FILE\n",
"dictionary_path = '../presidio_evaluator/data_generator/' \\\n",
" 'raw_data/Dictionary.csv'\n",
"\n",
"examples = generate(fake_pii_csv=fake_pii_csv,\n",
" utterances_file=utterances_file,\n",
" dictionary_path=dictionary_path,\n",
" output_file=OUTPUT,\n",
" lower_case_ratio=LOWER_CASE_RATIO,\n",
" num_of_examples=EXAMPLES,\n",
" ignore_types=IGNORE_TYPES,\n",
" keep_only_tagged=KEEP_ONLY_TAGGED,\n",
" span_to_tag=SPAN_TO_TAG)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To read a dataset file into the InputSample format, use `read_synth_dataset`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"input_samples = read_synth_dataset(OUTPUT)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"input_samples[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full structure of each input_sample is the following. It includes different feature values per token as calculated by Spacy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"input_samples[0].to_dict()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Verify randomness of dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from collections import Counter\n",
"count_per_template_id = Counter([sample.metadata['Template#'] for sample in input_samples])\n",
"for key in sorted(count_per_template_id):\n",
" print(\"{}: {}\".format(key,count_per_template_id[key]))\n",
" \n",
"print(sum(count_per_template_id.values()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Transform to the CONLL structure:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"\n",
"conll = InputSample.create_conll_dataset(input_samples)\n",
"conll.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Copyright notice:\n",
"\n",
"\n",
"Data generated for evaluation was created using Fake Name Generator.\n",
"\n",
"Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/) \n",
"are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

274
notebooks/PII EDA.ipynb Normal file

@ -0,0 +1,274 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fake PII data: Exploratory data analysis\n",
"\n",
"This notebook is used to verify the different fake entities before and after the creation of a synthetic dataset / augmented dataset. First part looks at the generation details and stats, second part evaluates the created synthetic dataset after it has been generated."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from presidio_evaluator.data_generator.extensions import generate_iban, generate_ip_addresses, generate_SSNs, \\\n",
" generate_company_names, generate_url, generate_roles, generate_titles, generate_nationality, generate_nation_man, \\\n",
" generate_nation_woman, generate_nation_plural, generate_title\n",
"\n",
"from presidio_evaluator.data_generator import FakeDataGenerator, read_synth_dataset\n",
"\n",
"from collections import Counter\n",
"\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Evaluate generation logic and the fake PII bank used during generation"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_100000.csv\",encoding=\"utf-8\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"generator = FakeDataGenerator(fake_pii_df=df, \n",
" templates=None, \n",
" dictionary_path=None,\n",
" ignore_types={\"IP_ADDRESS\", 'US_SSN', 'URL','ADDRESS'})"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pii_df = generator.prep_fake_pii(df)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"for (name, series) in pii_df.iteritems():\n",
" print(name)\n",
" print(\"Unique values: {}\".format(len(series.unique())))\n",
" print(series.value_counts())\n",
" print(\"\\n**************\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from wordcloud import WordCloud\n",
"\n",
"def series_to_wordcloud(series):\n",
" freqs = series.value_counts()\n",
" wordcloud = WordCloud(background_color='white',width=800,height=400).generate_from_frequencies(freqs)\n",
" fig = plt.figure(figsize=(16, 8))\n",
" plt.suptitle(\"{} word cloud\".format(series.name))\n",
" plt.imshow(wordcloud, interpolation='bilinear')\n",
" plt.axis(\"off\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"series_to_wordcloud(df.FIRST_NAME)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"series_to_wordcloud(df.LAST_NAME)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"series_to_wordcloud(df.COUNTRY)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"series_to_wordcloud(df.ORGANIZATION)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"series_to_wordcloud(df.CITY)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Evaluate different entities in the synthetic dataset after creation"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"synth = read_synth_dataset(\"../data/generated_train_November 12 2019.json\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"sentences_only = [(sample.full_text,sample.metadata) for sample in synth]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"sentences_only[2]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"print(\"Proportions of female vs. male based samples:\")\n",
"Counter([sentence[1]['Gender'] for sentence in sentences_only])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"print(\"Proportion of lower case samples:\")\n",
"Counter([sentence[1]['Lowercase'] for sentence in sentences_only])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"print(\"Proportion of nameset across samples:\")\n",
"Counter([sentence[1]['NameSet'] for sentence in sentences_only])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def get_entity_values_from_sample(sample,entity_types):\n",
" name_entities = [span.entity_value for span in sample.spans if span.entity_type in entity_types]\n",
" return name_entities\n",
" \n",
"names = [get_entity_values_from_sample(sample,['PERSON','FIRST_NAME','LAST_NAME']) for sample in synth]\n",
"names = [item for sublist in names for item in sublist]\n",
"series_to_wordcloud(pd.Series(names,name='PERSON, FIRST_NAME, LAST_NAME'))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"countries = [get_entity_values_from_sample(sample,['LOCATION']) for sample in synth]\n",
"countries = [item for sublist in countries for item in sublist]\n",
"series_to_wordcloud(pd.Series(countries,name='LOCATION'))"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"orgs = [get_entity_values_from_sample(sample,['ORGANIZATION']) for sample in synth]\n",
"orgs = [item for sublist in orgs for item in sublist]\n",
"series_to_wordcloud(pd.Series(orgs,name='ORGANIZATION'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

166
notebooks/Split by pattern #.ipynb Normal file

@ -0,0 +1,166 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train/Test/Validation split of input samples. \n",
"This notebook shows how train/test/split is being made on a List[InputSample]\n",
"\n",
"This is different for the normal split since we don't want sentences generated from the same pattern to be in more than one set. (Applicable only if the dataset was generated from templates)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"from presidio_evaluator.validation import split_dataset, save_to_json\n",
"\n",
"%reload_ext autoreload"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATE_DATE = \"November 12 2019\"\n",
"SIZE = 80000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load full dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_samples = read_synth_dataset(\"../presidio_evaluator/data_generator/generated_size_{}_date_{}.txt\".format(SIZE, DATE_DATE))\n",
"print(len(all_samples))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split to train/test/dev"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"TRAIN_TEST_VAL_RATIOS = [0.7,0.2,0.1]\n",
"\n",
"train, test, validation = split_dataset(all_samples,TRAIN_TEST_VAL_RATIOS)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train/Test only (no validation)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"#TRAIN_TEST_RATIOS = [0.7,0.3]\n",
"#train,test = split_dataset(all_sampleTRAIN_TEST_RATIOSEST_RATIOS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the different sets to files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"save_to_json(train,\"../data/train_{}.json\".format(DATE_DATE))\n",
"save_to_json(test,\"../data/test_{}.json\".format(DATE_DATE))\n",
"save_to_json(validation,\"../data/1validation_{}.json\".format(DATE_DATE))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(len(train))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(len(test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(len(validation))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert len(train) + len(test) + len(validation) == len(all_samples)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

308
notebooks/models/CRF.ipynb Normal file

@ -0,0 +1,308 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"CRF trainer using the sklearn_crfsuite package (Python wrapper for CRFSuite): https://sklearn-crfsuite.readthedocs.io/en/latest/"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"import sklearn_crfsuite\n",
"from sklearn_crfsuite import metrics\n",
"\n",
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.crf_evaluator import CRFEvaluator\n",
"from presidio_evaluator.data_generator import read_synth_dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATA_DATE = \"November 12 2019\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Source a dataset to use for training / testing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"train_samples = read_synth_dataset(\"../../data/generated_train_{}.json\".format(DATA_DATE))\n",
"test_samples = read_synth_dataset(\"../../data/generated_test_{}.json\".format(DATA_DATE))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]\n",
"print(\"Kept {} train samples after removal of non-tagged samples\".format(len(train_tagged)))\n",
"train_data = InputSample.create_conll_dataset(train_tagged)\n",
"\n",
"test_data = InputSample.create_conll_dataset(test_samples)\n",
"test_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Turn every sentence into a list of lists (list of tokens + pos + label)\n",
"test_sents=test_data.groupby('sentence')[['text','pos','label']].apply(lambda x: x.values.tolist())\n",
"train_sents=train_data.groupby('sentence')[['text','pos','label']].apply(lambda x: x.values.tolist())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create features for CRF"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"CRFEvaluator.sent2features(train_sents[0])[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"X_train = [CRFEvaluator.sent2features(s) for s in train_sents]\n",
"y_train = [CRFEvaluator.sent2labels(s) for s in train_sents]\n",
"\n",
"X_test = [CRFEvaluator.sent2features(s) for s in test_sents]\n",
"y_test = [CRFEvaluator.sent2labels(s) for s in test_sents]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"crf = sklearn_crfsuite.CRF(\n",
" algorithm='lbfgs',\n",
" c1=0.1,\n",
" c2=0.1,\n",
" max_iterations=100,\n",
" all_possible_transitions=True\n",
")\n",
"crf.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save trained model to pickle"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open(\"../../models/crf.pickle\",'wb') as f:\n",
" data = pickle.dump(crf, f,protocol=pickle.HIGHEST_PROTOCOL)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open saved model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"../../models/crf.pickle\", 'rb') as f:\n",
" crf = pickle.load(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract info and predictions from model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = list(crf.classes_)\n",
"labels.remove('O')\n",
"labels"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred = crf.predict(X_test)\n",
"metrics.flat_f1_score(y_test, y_pred,\n",
" average='weighted', labels=labels)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## predict one:\n",
"y_5_pred = crf.predict([X_test[5]])\n",
"y_5_pred[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# group B and I results\n",
"sorted_labels = sorted(\n",
" labels,\n",
" key=lambda name: (name[1:], name[0])\n",
")\n",
"print(metrics.flat_classification_report(\n",
" y_test, y_pred, labels=sorted_labels, digits=3\n",
"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model explainability"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_transitions(trans_features):\n",
" for (label_from, label_to), weight in trans_features:\n",
" print(\"%-6s -> %-7s %0.6f\" % (label_from, label_to, weight))\n",
"\n",
"print(\"Top likely transitions:\")\n",
"print_transitions(Counter(crf.transition_features_).most_common(20))\n",
"\n",
"print(\"\\nTop unlikely transitions:\")\n",
"print_transitions(Counter(crf.transition_features_).most_common()[-20:])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_state_features(state_features):\n",
" for (attr, label), weight in state_features:\n",
" print(\"%0.6f %-8s %s\" % (weight, label, attr))\n",
"\n",
"print(\"Top positive:\")\n",
"print_state_features(Counter(crf.state_features_).most_common(30))\n",
"\n",
"print(\"\\nTop negative:\")\n",
"print_state_features(Counter(crf.state_features_).most_common()[-30:])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}

315
notebooks/models/Create datasets for Spacy training.ipynb Normal file

@ -0,0 +1,315 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spacy dataset creation"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This notebook takes train and test datasets (of type `List[InputSample]`)\n",
"and transforms them into two structures consumed by Spacy:\n",
"1. Spacy JSON (see https://spacy.io/api/annotation#json-input)\n",
"2. Spacy Pickle files (of structure `[(full_text,\"entities\":[(start, end, type),(...))]`. \n",
"See more details here: https://spacy.io/api/annotation#json-input)\n",
"\n",
"JSON is used for Spacy's CLI trainer. \n",
"Pickle is used for fine-tuning using the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"%reload_ext autoreload"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"DATA_DATE = 'November 12 2019'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"data_path = \"../data/generated_{}_{}.json\"\n",
"\n",
"train_samples = read_synth_dataset(data_path.format(\"train\",DATA_DATE))\n",
"print(\"Read {} samples\".format(len(train_samples)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For training, keep only sentences with entities:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"train_tagged = [sample for sample in train_samples if len(sample.spans)>0]\n",
"print(\"Kept {} samples after removal of non-tagged samples\".format(len(train_tagged)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Evaluate training set's entities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"print(\"Entities found in training set:\")\n",
"entities = []\n",
"for sample in train_tagged:\n",
" entities.extend([tag for tag in sample.tags])\n",
"set(entities)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Create Spacy dataset (option 2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"import pickle\n",
"\n",
"spacy_train = InputSample.create_spacy_dataset(train_tagged)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"entities_spacy = [x[1]['entities'] for x in spacy_train]\n",
"entities_spacy\n",
"entities_spacy_flat = []\n",
"for samp in entities_spacy:\n",
" for ent in samp:\n",
" entities_spacy_flat.append(ent[2])\n",
"set(entities_spacy_flat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create Spacy dataset (option 1: JSON)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from presidio_evaluator import InputSample\n",
"spacy_train_json = InputSample.create_spacy_json(train_tagged)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick evaluation of samples"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"[sample[0] for sample in spacy_train[:100]]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"spacy_train_json[0]['paragraphs'][0]['sentences']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dump training set to pickle and json respectively"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"import json\n",
"with open(\"../data/train.pickle\", 'wb') as handle:\n",
" pickle.dump(spacy_train,handle, protocol=pickle.HIGHEST_PROTOCOL)\n",
"\n",
"with open(\"../data/train.json\",\"w\") as f:\n",
" json.dump(spacy_train_json,f)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create JSON and pickle files for test dataset"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
"print(\"Read {} samples\".format(len(test_samples)))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"spacy_test = InputSample.create_spacy_dataset(test_samples)\n",
"spacy_test_json = InputSample.create_spacy_json(test_samples)\n",
"print(spacy_test[14])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dump test set to pickle and json respectively"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open(\"../data/test.pickle\", 'wb') as handle:\n",
" pickle.dump(spacy_test,handle, protocol=pickle.HIGHEST_PROTOCOL)\n",
" \n",
"with open(\"../data/test.json\",\"w\") as f:\n",
" json.dump(spacy_test_json,f)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,380 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate CRF models for person names, orgs and locations using the Presidio Evaluator framework\n",
"\n",
"Data = `generated_test_November 12 2019`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from tqdm import tqdm_notebook as tqdm\n",
"import logging\n",
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"import spacy\n",
"import pandas as pd\n",
"import pickle\n",
"\n",
"pd.set_option('display.width', 10000)\n",
"pd.set_option('display.max_colwidth', -1)\n",
"\n",
"\n",
"%reload_ext autoreload\n",
"%autoreload 2\n",
"\n",
"DATA_DATE = 'November 12 2019'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"data_path = \"../../data/generated_{}_{}.json\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select data for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#test_samples = read_synth_dataset(data_path.format(\"test\", DATA_DATE))\n",
"#print(len(test_samples))\n",
"\n",
"val_samples = read_synth_dataset(data_path.format(\"validation\", DATA_DATE))\n",
"#print(len(val_samples))\n",
"\n",
"#synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
"#print(len(synth_samples))\n",
"\n",
"#conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
"\n",
"DATASET = val_samples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from collections import Counter\n",
"entity_counter = Counter()\n",
"for sample in DATASET:\n",
" for tag in sample.tags:\n",
" entity_counter[tag]+=1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"entity_counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"DATASET[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#max length sentence\n",
"max([len(sample.tokens) for sample in DATASET])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select models for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"crf_vanilla = \"../../model-outputs/crf.pickle\"\n",
" \n",
"models = [crf_vanilla]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run evaluation on all models:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from presidio_evaluator.crf_evaluator import CRFEvaluator\n",
"\n",
"for model in models:\n",
" print(\"-----------------------------------\")\n",
" print(\"Evaluating model {}\".format(model))\n",
" crf_evaluator = CRFEvaluator(model_pickle_path=model)\n",
" evaluation_results = crf_evaluator.evaluate_all(DATASET)\n",
" scores = crf_evaluator.calculate_score(evaluation_results)\n",
" \n",
" print(\"Confusion matrix:\")\n",
" print(scores.results)\n",
"\n",
" print(\"Precision and recall\")\n",
" scores.print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Custom evaluation of the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try out the model\n",
"def sent_to_features(model_path,sent):\n",
" \"\"\"\n",
" Translates a sentence into a prediction using a saved CRF model\n",
" \"\"\"\n",
" \n",
" with open(model_path, 'rb') as f:\n",
" model = pickle.load(f)\n",
" \n",
" tokenizer = spacy.blank('en')\n",
" tokens = tokenizer(sent)\n",
" tags = ['O' for token in tokens] # Placeholder: Not used but required. \n",
" metadata = {'Template#':1,'Gender':'1','Country':'2'} #Placeholder: Not used but required\n",
" input_sample = InputSample(full_text=sent,masked=\"\",spans=None,tokens=tokens,tags=tags,metadata=metadata,create_tags_from_span=False,)\n",
"\n",
" return CRFEvaluator.crf_predict(input_sample, model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SENTENCE = \"Michael is American\"\n",
"\n",
"sent_to_features(model_path=crf_vanilla, sent=SENTENCE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False positives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Most false positive tokens:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"errors = scores.model_errors\n",
"\n",
"from presidio_evaluator import ModelEvaluator\n",
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. review false positives for entity 'PERSON'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
"fps_df[['full_text','token','prediction']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False negative examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More FN analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fns_df[['full_text','token','annotation','prediction']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,326 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate Flair models for person names, orgs and locations using the Presidio Evaluator framework\n",
"\n",
"Data = `generated_test_November 12 2019`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"%reload_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"DATA_DATE = \"November 12 2019\"\n",
"data_path = \"../../data/generated_{}_{}.json\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select data for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
"#print(len(test_samples))\n",
"\n",
"#val_samples = read_synth_dataset(data_path.format(\"validation\",DATA_DATE))\n",
"#print(len(val_samples))\n",
"\n",
"#synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
"#print(len(synth_samples))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
"\n",
"DATASET = conll_samples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from collections import Counter\n",
"entity_counter = Counter()\n",
"for sample in DATASET:\n",
" for tag in sample.tags:\n",
" entity_counter[tag]+=1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"entity_counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"DATASET[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#max length sentence\n",
"max([len(sample.tokens) for sample in DATASET])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select models for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"flair_ner = 'ner'\n",
"flair_ner_fast = 'ner-fast'\n",
"flair_ontonotes = 'ner-ontonotes-fast'\n",
"flair_bert_embeddings = '../../models/presidio-ner/flair-bert-embeddings.pt'\n",
"glove_flair_embeddings = '../../models/presidio-ner/flair-embeddings.pt'\n",
"models = [glove_flair_embeddings]\n",
"#models = [flair_bert_embeddings, glove_flair_embeddings, flair_ner,flair_ner_fast,flair_ontonotes ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from presidio_evaluator.flair_evaluator import FlairEvaluator\n",
"\n",
"for model in models:\n",
" print(\"-----------------------------------\")\n",
" print(\"Evaluating model {}\".format(model))\n",
" flair_evaluator = FlairEvaluator(model_path=model)\n",
" evaluation_results = flair_evaluator.evaluate_all(DATASET)\n",
" scores = flair_evaluator.calculate_score(evaluation_results)\n",
" \n",
" \n",
" print(\"Confusion matrix:\")\n",
" print(scores.results)\n",
"\n",
" print(\"Precision and recall\")\n",
" scores.print()\n",
" errors = scores.model_errors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Custom evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False positives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Most false positive tokens:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"errors = scores.model_errors\n",
"\n",
"from presidio_evaluator import ModelEvaluator\n",
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
"fps_df[['full_text','token','prediction']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. False negative examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More FN analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"fns_df[['full_text','token','annotation','prediction']]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,404 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate Spacy models for person names, orgs and locations using the Presidio Evaluator framework\n",
"\n",
"Data = `generated_test_November 12 2019`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"from presidio_evaluator import ModelEvaluator\n",
"from presidio_evaluator.data_generator import read_synth_dataset\n",
"%reload_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATA_DATE = \"November 12 2019\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#!pip freeze | grep en_core_web_lg\n",
"!pip freeze | findstr en-core-web-lg\n",
"!pip freeze | findstr spacy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"data_path = \"../../data/generated_{}_{}.json\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select data for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"# test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
"# print(len(test_samples))\n",
"\n",
"# val_samples = read_synth_dataset(data_path.format(\"validation\",DATA_DATE))\n",
"# print(len(val_samples))\n",
"\n",
"# synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
"# print(len(synth_samples))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
"\n",
"DATASET = conll_samples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from collections import Counter\n",
"entity_counter = Counter()\n",
"for sample in DATASET:\n",
" for span in sample.spans:\n",
" entity_counter[span.entity_type]+=1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"entity_counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"DATASET[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#max length sentence\n",
"max([len(sample.tokens) for sample in DATASET])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select models for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"models = []\n",
"\n",
"en_core_web_lg = r\"en_core_web_lg\"\n",
"spacy_new_ontonotes28 = r\"C:\\Users\\ommendel\\OneDrive - Microsoft\\Projects\\presidio\\Presidio-internal\\presidio-evaluator\\models\\spacy_new_ontonotes28\"\n",
"\n",
"spacy_ft_100 = r\"C:\\Users\\ommendel\\OneDrive - Microsoft\\Projects\\presidio\\Presidio-internal\\presidio-evaluator\\models\\spacy_ft_100\\model-final\"\n",
"\n",
"models = [en_core_web_lg, spacy_new_ontonotes28, spacy_ft_100]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run evaluation on all models:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from presidio_evaluator.spacy_evaluator import SpacyEvaluator\n",
"\n",
"for model in models:\n",
" print(\"-----------------------------------\")\n",
" print(\"Evaluating model {}\".format(model))\n",
" nlp = spacy.load(model)\n",
" spacy_evaluator = SpacyEvaluator(model=nlp,entities_to_keep=['PERSON','GPE','ORG'])\n",
" evaluation_results = spacy_evaluator.evaluate_all(DATASET)\n",
" scores = spacy_evaluator.calculate_score(evaluation_results)\n",
" \n",
" print(\"Confusion matrix:\")\n",
" print(scores.results)\n",
"\n",
" print(\"Precision and recall\")\n",
" scores.print()\n",
" errors = scores.model_errors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Custom evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#evaluate custom sentences\n",
"nlp = spacy.load(spacy_ft_100)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"#sent = input(\"Enter sentence: \")\n",
"sent = 'David is talking loudly'\n",
"doc = nlp(sent)\n",
"for ent in doc.ents:\n",
" print(\"Entity = {} value = {}\".format(ent.label_,ent.text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### False positives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Most false positive tokens:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='LOCATION')\n",
"fps_df[['full_text','token','prediction']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. False negative examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"errors = scores.model_errors\n",
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More FN analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='GPE')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"fns_df[['full_text','token','annotation','prediction']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"[print(error,\"\\n\") for error in errors]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,5 @@
from .span_to_tag import span_to_tag, tokenize
from .data_objects import Span, InputSample, EvaluationResult, ModelError
from .model_evaluator import ModelEvaluator
from .spacy_evaluator import SpacyEvaluator
from .presidio_api_evaluator import PresidioAPIEvaluator


@ -0,0 +1,97 @@
import pickle
from typing import List
from presidio_evaluator import ModelEvaluator, InputSample
class CRFEvaluator(ModelEvaluator):
def __init__(self,
model_pickle_path: str = "../models/crf.pickle",
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True):
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
if model_pickle_path is None:
raise ValueError("model_pickle_path must be supplied")
with open(model_pickle_path, 'rb') as f:
self.model = pickle.load(f)
def predict(self, sample: InputSample) -> List[str]:
tags = CRFEvaluator.crf_predict(sample,self.model)
if len(tags) != len(sample.tokens):
print("mismatch between previous tokens and new tokens")
# translated_tags = sample.rename_from_spacy_tags(tags)
return tags
@staticmethod
def crf_predict(sample, model):
sample.translate_input_sample_tags()
conll = sample.to_conll(translate_tags=True)
sentence = [(di['text'], di['pos'], di['label']) for di in conll]
features = CRFEvaluator.sent2features(sentence)
return model.predict([features])[0]
@staticmethod
def word2features(sent, i):
word = sent[i][0]
postag = sent[i][1]
features = {
'bias': 1.0,
'word.lower()': word.lower(),
'word[-3:]': word[-3:],
'word[-2:]': word[-2:],
'word.isupper()': word.isupper(),
'word.istitle()': word.istitle(),
'word.isdigit()': word.isdigit(),
'postag': postag,
'postag[:2]': postag[:2],
}
if i > 0:
word1 = sent[i - 1][0]
postag1 = sent[i - 1][1]
features.update({
'-1:word.lower()': word1.lower(),
'-1:word.istitle()': word1.istitle(),
'-1:word.isupper()': word1.isupper(),
'-1:postag': postag1,
'-1:postag[:2]': postag1[:2],
})
else:
features['BOS'] = True
if i < len(sent) - 1:
word1 = sent[i + 1][0]
postag1 = sent[i + 1][1]
features.update({
'+1:word.lower()': word1.lower(),
'+1:word.istitle()': word1.istitle(),
'+1:word.isupper()': word1.isupper(),
'+1:postag': postag1,
'+1:postag[:2]': postag1[:2],
})
else:
features['EOS'] = True
return features
@staticmethod
def sent2features(sent):
return [CRFEvaluator.word2features(sent, i) for i in range(len(sent))]
@staticmethod
def sent2labels(sent):
return [label for token, postag, label in sent]
@staticmethod
def sent2tokens(sent):
return [token for token, postag, label in sent]


@ -0,0 +1,48 @@
# PII dataset generator
This data generator takes a text file with templates (e.g. `my name is [PERSON]`) and creates a list of InputSamples which contain fake PII entities in place of the placeholders.
It also creates Spans (the start and end index of each entity), tokens (using the `spaCy` tokenizer) and tags in various schemes (BIO/IOB, IO, BILOU).
In addition, it provides some off-the-shelf features for each token, such as `pos`, `dep` and `is_in_vocabulary`.
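For illustration, a templates file is a plain text file with one template per line, where the bracketed entity names are the placeholders the generator fills in. The entity names available depend on the fake PII CSV and on the extension functions; the lines below are hypothetical examples:

```
My name is [PERSON] and I live in [ADDRESS]
Please call me at [PHONE_NUMBER] or send an email to [EMAIL]
```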
The main class is `FakeDataGenerator`; in addition, the `main` module has two functions for creating and reading a fake dataset.
During the generation process, the tool takes fake PII values from a provided CSV with a known format and/or from the extension functions found in the `extensions.py` file.
At a high level, the process is the following:
1. Translate an NER dataset (e.g. CoNLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) adapt the FakeDataGenerator to support new extensions which could generate fake PII entities
3. Generate X samples using the templates list + a fake PII dataset + extensions that add additional PII entities
4. Split the generated dataset into train/test/validation sets while making sure that samples generated from the same template appear in only one set (see the sketch after this list)
5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using the evaluation notebooks and the Presidio Evaluator framework
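A minimal sketch of step 4, assuming each generated `InputSample` carries the `Template#` metadata field added by `FakeDataGenerator` (the helper below is illustrative and not part of the package's API):

```python
import random
from collections import defaultdict

def split_by_template(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split samples into train/test/validation so that all samples
    originating from the same template end up in the same set."""
    by_template = defaultdict(list)
    for sample in samples:
        # Group samples by the template they were generated from
        by_template[sample.metadata["Template#"]].append(sample)

    template_ids = list(by_template.keys())
    random.Random(seed).shuffle(template_ids)

    n = len(template_ids)
    train_end = int(ratios[0] * n)
    test_end = train_end + int(ratios[1] * n)

    def collect(ids):
        return [s for t in ids for s in by_template[t]]

    return (collect(template_ids[:train_end]),
            collect(template_ids[train_end:test_end]),
            collect(template_ids[test_end:]))
```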
Notes:
- For steps 5, 6, 7 see the main [README](../../README.md).
- For a simple data generation pipeline, [see this notebook](../../notebooks/Generate data.ipynb).
- For information on transforming an NER dataset into templates, see the notebooks in the [helper notebooks](helper%20notebooks) folder.
Example run:
```python
TEMPLATES_FILE = 'raw_data/templates.txt'
OUTPUT = "generated_.txt"
## Should be downloaded from FakeNameGenerator
fake_pii_csv = 'raw_data/FakeNameGenerator.csv'
examples = generate(fake_pii_csv=fake_pii_csv,
utterances_file=TEMPLATES_FILE,
dictionary_path=None,
output_file=OUTPUT,
lower_case_ratio=0.1,
num_of_examples=100,
ignore_types={"IP_ADDRESS", 'US_SSN', 'URL'},
keep_only_tagged=False,
span_to_tag=True)
```
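To read a previously generated file back into a list of `InputSample` objects (as the evaluation notebooks do), something along these lines should work; the file name simply matches the `OUTPUT` variable from the example above:

```python
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("generated_.txt")
print("Read {} samples".format(len(samples)))
print(samples[0])  # an InputSample with full_text, spans, tokens and tags
```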
*Copyright notice:*
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.


@ -0,0 +1,2 @@
from .generator import FakeDataGenerator
from .main import generate, read_synth_dataset


@ -0,0 +1,124 @@
import random
import pandas as pd
from faker import Faker
from haikunator import Haikunator
from presidio_evaluator.data_generator.nationality_generator import NationalityGenerator
from presidio_evaluator.data_generator.org_name_generator import OrgNameGenerator
fake = Faker()
haikunator = Haikunator()
IP_V4_RATIO = 0.8
org_name_generator = OrgNameGenerator()
nationality_generator = NationalityGenerator()
def generate_url(domain: pd.Series):
def generate_url_postfix():
length = random.randint(4, 8)
delim = "/" if random.random() > 0.5 else ""
postfix = haikunator.haikunate(delimiter=delim,
token_chars='abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',
token_length=length)
return postfix
def generate_url_prefix():
rand = random.random()
if rand < 0.3:
return "http://"
elif rand < 0.6:
return "http://www."
else:
return ""
def concat_url(prefix, domain, postfix):
return "{}{}/{}".format(prefix, domain, postfix)
return domain.apply(lambda x: concat_url(generate_url_prefix(), x.lower(), generate_url_postfix()))
#
# urls = []
# for index, value in domain.items():
# url = "{}{}/{}".format(generate_url_prefix(), value.lower(), generate_url_postfix())
# urls.append(url)
#
# return urls
def generate_SSNs(length):
return [fake.ssn() for _ in range(length)]
def generate_iban(country: pd.Series):
def generate_one_iban(cntry):
try:
from schwifty.iban import _get_iban_spec, code_length, IBAN
import math
spec = _get_iban_spec(cntry)
bank_code_length = code_length(spec, 'bank_code')
branch_code_length = code_length(spec, 'branch_code')
bank_and_branch_code_length = bank_code_length + branch_code_length
account_code_length = code_length(spec, 'account_code')
bank_code = random.randint(1, math.pow(10, bank_and_branch_code_length) - 1)
account_code = random.randint(1, math.pow(10, account_code_length) - 1)
iban = IBAN.generate(cntry, str(bank_code), str(account_code))
return iban.formatted
except ValueError as err:
## Failed to generate IBAN
return "IL270126100000000544211"
return country.apply(generate_one_iban)
def generate_company_names(length):
return [org_name_generator.get_organization() for _ in range(length)]
def generate_ip_addresses(length):
def generate_one():
v = 4 if random.random() > IP_V4_RATIO else 6
return fake.ipv4() if v == 4 else fake.ipv6()
return [generate_one() for _ in range(length)]
def generate_title(gender=None):
MALE_TITLES = ['Mr.', 'Dr.', 'Professor.', 'Eng.', 'Prof.', 'Doctor.']
FEMALE_TITLES = ['Mrs.', 'Ms.', 'Miss', 'Dr.', 'Professor.', 'Eng.', 'Prof.', 'Doctor']
if gender.lower() == 'male':
return random.choices(MALE_TITLES, weights=[0.7, 0.1, 0.05, 0.05, 0.05, 0.05])[0]
else:
return random.choices(FEMALE_TITLES, weights=[0.3, 0.25, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05])[0]
def generate_titles(gender: pd.Series):
return gender.apply(generate_title)
def generate_roles(length):
roles = ['President', 'Vice-president', 'Chief of staff', 'Chief Architect', 'CEO', 'CFO', 'Engineer', 'Accountant',
'Attorney', 'Scientist', 'Journalist', 'Operator', 'CIO', "Chief Information Officer", "General Manager",
"Manager", "Chief Executive Officer", 'Actuary', 'Secretary', 'Prime minister', 'Minister', 'Director']
return [random.choice(roles) for _ in range(length)]
def generate_nationality(length):
return [nationality_generator.get_nationality() for _ in range(length)]
def generate_country(length):
return [nationality_generator.get_country() for _ in range(length)]
def generate_nation_woman(length):
return [nationality_generator.get_nation_woman() for _ in range(length)]
def generate_nation_man(length):
return [nationality_generator.get_nation_man() for _ in range(length)]
def generate_nation_plural(length):
return [nationality_generator.get_nation_plural() for _ in range(length)]


@ -0,0 +1,343 @@
import random
from typing import List
import re
from collections import Counter
import pandas as pd
from spacy.tokens import Token
from tqdm import tqdm
from presidio_evaluator import Span, InputSample
from presidio_evaluator.data_generator.extensions import generate_iban, generate_ip_addresses, generate_SSNs, \
generate_company_names, generate_url, generate_roles, generate_titles, generate_nationality, generate_nation_man, \
generate_nation_woman, generate_nation_plural, generate_title, generate_country
class FakeDataGenerator:
def __init__(self, fake_pii_df: pd.DataFrame, templates: List[str],
lower_case_ratio: float = 0.5, include_metadata=True,
dictionary_path: str = None,
ignore_types=None, span_to_tag=True, labeling_scheme="BILOU"):
"""
Fake data generator.
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
e.g. "My name is [FIRST_NAME]"
:param fake_pii_df:
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
:param templates: A list of templates
with place holders for PII entities.
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
Note that in case you have multiple entities of the same type
in a template, you should put a number on the second. For example:
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2].
More than two are currently not supported but extending this
is straightforward.
:param lower_case_ratio: Ratio of samples that will be
converted to lowercase
:param include_metadata: Whether to include additional
information in the output
(e.g. NameSet from which the name was taken, gender, country etc.)
:param dictionary_path: A path to a csv containing a vocabulary of
a language, to check if a token exists in the vocabulary or not.
:param ignore_types: set of types to ignore
:param span_to_tag: whether to tokenize the generated samples or not
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
"""
if ignore_types is None:
ignore_types = {}
self.lower_case_ratio = lower_case_ratio
self.include_metadata = include_metadata
self.ignore_types = ignore_types
if dictionary_path:
vocab_df = pd.read_csv(dictionary_path, sep=',')
self.vocabulary_words = set(vocab_df['WORD'].values.tolist())
else:
print("Warning: Dictionary path not provided. "
"Feature `is_in_vocabulary` will be set to False for all samples")
self.vocabulary_words = []
Token.set_extension("is_in_vocabulary",
getter=self.get_is_in_vocabulary,
force=True)
if templates:
self.templates = self.prep_templates(templates)
else:
print("Warning: templates not provided")
self.templates = None
self.original_pii_df = fake_pii_df
self.fake_pii = None
self.span_to_tag = span_to_tag
self.labeling_scheme = labeling_scheme
def get_is_in_vocabulary(self, token):
return token.text.lower() in self.vocabulary_words
def prep_fake_pii(self, df):
print("Preparing fake PII data for ingestion")
# define new column names
column_names = {"Surname": "LAST_NAME", "GivenName": "FIRST_NAME",
"Title": "TITLE", "Gender": "GENDER",
"City": "CITY", "ZipCode": "ZIP",
"CountryFull": "COUNTRY",
"Occupation": "OCCUPTAION",
"TelephoneNumber": "PHONE_NUMBER",
"CCNumber": "CREDIT_CARD", "Birthday": "BIRTHDAY",
"EmailAddress": "EMAIL",
"StreetAddress": "FULL_ADDRESS",
"Domain": "DOMAIN_NAME"}
# Remove brackets as they interfere with the process
def remove_brackets(series):
if series.dtype == object or series.dtype == str:
series = series.str.replace("[", "(")
series = series.str.replace("]", ")")
return series
df = df.apply(remove_brackets, axis=0)
# change column names
column_names = {key: value for (key, value) in column_names.items() if value not in self.ignore_types}
df.rename(columns=column_names, inplace=True)
# define PERSON as FIRST_NAME + LAST_NAME
df["PERSON"] = df["FIRST_NAME"] + " " + df["LAST_NAME"]
df['COUNTRY'] = generate_country(len(df)) # replace previous country which has limited options
# Copied entities
df["DATE"] = df["BIRTHDAY"]
df['LOCATION'] = df[random.choice(["CITY", "COUNTRY"])].str.title()
df['LOCATION'] = self.reshuffle_entity(df['LOCATION']) # Reshuffle to not have the same location and country
if 'ADDRESS' not in self.ignore_types:
self.address_parts(df)
# title and role
if 'ROLE' not in self.ignore_types:
print("Generating roles")
df['ROLE'] = generate_roles(length=len(df))
if 'TITLE' not in self.ignore_types:
print("Generating titles")
df['TITLE'] = generate_titles(df['GENDER'])
df['FEMALE_TITLE'] = [generate_title('female') for _ in range(len(df))]
df['MALE_TITLE'] = [generate_title('male') for _ in range(len(df))]
if 'NATIONALITY' not in self.ignore_types:
print("Generating nationalities")
df['NATIONALITY'] = generate_nationality(len(df))
df['NATION_MAN'] = generate_nation_man(len(df))
df['NATION_WOMAN'] = generate_nation_woman(len(df))
df['NATION_PLURAL'] = generate_nation_plural(len(df))
if 'IBAN' not in self.ignore_types:
print("Generating IBANs")
df['IBAN'] = generate_iban(df['COUNTRY']) # "IL270126100000000544211"
if 'IP_ADDRESS' not in self.ignore_types:
print("Generating IP addresses")
df['IP_ADDRESS'] = generate_ip_addresses(len(df))
if 'US_SSN' not in self.ignore_types:
print("Generating SSN numbers")
df['US_SSN'] = generate_SSNs(len(df))
if 'URL' not in self.ignore_types:
print("Generating URLs")
df['URL'] = generate_url(df['DOMAIN_NAME'])
if 'ORGANIZATION' not in self.ignore_types:
print("Generating company names")
df['ORG'] = generate_company_names(len(df))
df['ORGANIZATION'] = df[random.choice(["Company", "ORG"])].str.title()
print("Finished preparing fake PII data")
return df
def address_parts(self, df):
# extract street no, street and full address
print("Generating address parts")
if 'STREET_NO' not in self.ignore_types:
df["STREET_NO"] = df["FULL_ADDRESS"].map(
lambda r: re.search(r"([\d]+)", r).group(1))
if 'STREET' not in self.ignore_types:
df["STREET"] = df["FULL_ADDRESS"].map(
lambda r: re.search(r"[\d]+(.*)", r).group(1))
if 'ADDRESS' not in self.ignore_types:
df["ADDRESS"] = df.apply(
lambda r: "{0}, {2} {1}".format(r["FULL_ADDRESS"],
r["ZIP"].replace(" ", ""),
r["CITY"]), axis=1)
@staticmethod
def get_additional_entity(df, entity):
return df.sample(1).iloc[0][entity]
@staticmethod
def reshuffle_entity(series):
shuffled = series.sample(frac=1)
shuffled.reset_index(inplace=True, drop=True)
return shuffled
@staticmethod
def prep_templates(raw_templates):
print("Preparing sample sentences for ingestion")
# Todo: introduce typos
templates = [l.strip().replace("[", "{").replace("]", "}") for l in
raw_templates]
return templates
@staticmethod
def get_template_entities(template):
templates = []
entities_count = Counter()
for m in re.finditer(r"\{([A-Z_0-9]+)\}", template):
ent = m.groups()[0]
start, end = m.span()
entities_count[ent] += 1
if entities_count.get(ent) == 1:
templates.append(ent)
else:
# Add an index to all additional entities of this type (LOCATION2, LOCATION3 etc.)
templates.append(ent + str(entities_count[ent]))
for entity, count in entities_count.items():
while count > 1:
template = template.replace("{" + entity + "}", "{" + entity + str(count) + "}", 1)
count -= 1
return template, templates, entities_count
def sample_examples(self, count):
if self.fake_pii is None:
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
for _ in tqdm(range(count)):
template_sentence_index = random.choice(range(len(self.templates)))
original_sentence = self.templates[template_sentence_index]
fake_pii_sample = self.fake_pii.sample(1).iloc[0]
# Find entities to be replaced + add running index for multiple entities of the same type
original_sentence, replacements, entity_counts = self.get_template_entities(original_sentence)
# Get additional fake entries in case of multiple entities of the same type
fake_pii_sample_duplicated = self.add_duplicated_entities(fake_pii_sample, entity_counts)
# Fill in fake entities for each template slot
values = {}
for h in replacements:
if h in fake_pii_sample_duplicated:
values[h] = str(fake_pii_sample_duplicated[h])
else:
print("Warning: entity {} is in the templates but not in the PII dataset. Ignoring.".format(h))
values[h] = ''
# Create a new InputSample combining template with fake PII data
input_sample = self.create_input_sample(original_sentence, values)
if self.include_metadata:
metadata = {"Gender": fake_pii_sample['GENDER'],
"NameSet": fake_pii_sample['NameSet'],
"Country": fake_pii_sample['COUNTRY'],
"Lowercase": input_sample.full_text.islower(),
"Template#": template_sentence_index
}
input_sample.metadata = metadata
self.consolidate_names(input_sample)
# Creating tokens only after entities consolidation
if self.span_to_tag:
tokens, tags = input_sample.get_tags(scheme=self.labeling_scheme)
input_sample.tokens = tokens
input_sample.tags = tags
yield input_sample
@staticmethod
def consolidate_names(input_sample):
locations = ("LOCATION", "CITY", "STATE", "COUNTRY", "ADDRESS", "STREET")
names = ("FIRST_NAME", "LAST_NAME", "PERSON")
for span in input_sample.spans:
if span.entity_type in names:
span.entity_type = 'PERSON'
elif span.entity_type in locations:
span.entity_type = "LOCATION"
masked = input_sample.masked
for location in locations:
masked = masked.replace("[" + location + "]", "[LOCATION]")
for name in names:
masked = masked.replace("[" + name + "]", "[PERSON]")
input_sample.masked = masked
def create_input_sample(self, original_sentence, values):
"""
Creates an InputSample out of a template sentence
and a dict of entity names and values
:param original_sentence: template (e.g. My name is [FIRST_NAME})
:param values: Key = entity name, value = entity value
(e.g. {"TITLE":"Mr."})
:return: a list of InputSamples
"""
sentence = original_sentence
spans = []
to_lower = random.random() < self.lower_case_ratio
i = 0
# replaces placeholders with values and retrieve indices
while i < len(sentence):
entity_start = re.search("{", sentence, flags=0)
if entity_start:
entity_start = entity_start.start()
else:
break
entity_end = re.search("}", sentence[entity_start:],
flags=0).start() + entity_start
entity = sentence[entity_start + 1:entity_end]
entity_value = values[entity]
entity_value = entity_value.strip()
# Remove duplicate entity indices:
entity = ''.join(i for i in entity if not i.isdigit())
entity_value_len = len(entity_value)
sentence = sentence[:entity_start] + entity_value + sentence[
entity_end + 1:]
# replace a with an if
if ((sentence[entity_start - 2: entity_start].lower() == "a " and entity_start == 2)
or (sentence[entity_start - 3: entity_start].lower() == " a ")) \
and entity_value[0].lower() in ['a', 'e', 'i', 'o', 'u']:
sentence = sentence[:entity_start - 1] + "n " + sentence[entity_start:]
entity_start = entity_start + 1
if to_lower:
entity_value = entity_value.lower()
spans.append(Span(entity_type=entity,
entity_value=entity_value,
start_position=entity_start,
end_position=entity_start + entity_value_len))
i = entity_start + entity_value_len
if to_lower:
sentence = sentence.lower()
# Not creating tokens here since we're consolidating names afterwards
return InputSample(sentence, original_sentence, spans,
create_tags_from_span=False)
def add_duplicated_entities(self, fake_pii_sample, entity_counts):
for entity, ent_count in entity_counts.items():
while ent_count > 1:
fake_pii_sample[entity + str(ent_count)] = self.get_additional_entity(self.fake_pii, entity)
ent_count -= 1
return fake_pii_sample


@ -0,0 +1,661 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook takes the CONLL2003 dataset using deepavlov, and creates templates (utterances with placeholders) for a PII synthetic data generator to use in order to create new sentences.\n",
"\n",
"The notebook additionally introduces two new entities: TITLE and ROLE, in order to overcome cases like \"UK David Scott called his wife\", where the original sentence is \"UK Prime Minister Boris Johnson called his wife\" as \"Prime Minister\" was originally tagged as PER in the original dataset. Same logic goes for titles, like Mr., Mrs., Ms."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.options.display.max_rows = 4000\n",
"pd.set_option('display.max_colwidth', -1)\n",
"from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"reader = Conll2003DatasetReader()\n",
"dataset = reader.read(data_path =\"../../data\",dataset_name='conll2003')\n",
"#Note: make sure you haven't downloaded something else with this function before, \n",
"# as it will not download a new dataset (even if your previous download was for a different dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To pandas + add sentence_idx"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"new_dataset = [list(zip(a,b)) for a,b in dataset['train']]\n",
"df_list = []\n",
"sentence_id = 0\n",
"for sentence in new_dataset:\n",
" \n",
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
" df[\"sentence_idx\"] = sentence_id\n",
" sentence_id+=1\n",
" df_list.append(df)\n",
"ner_dataset = pd.concat(df_list)\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==12]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"sentences = ner_dataset.groupby('sentence_idx')['word'].apply(lambda x: \" \".join(x))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"print(sentences[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example sentence:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==3]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Unique entities\n",
"ner_dataset['tag'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace tokenization replacements"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['word'] = ner_dataset['word']\\\n",
".replace('-LRB-','(')\\\n",
".replace('-RRB-',')')\\\n",
".replace('-LCB-','(')\\\n",
".replace('-RCB-',')')\\\n",
".replace('``','\"')\\\n",
".replace(\"''\",'\"')\\\n",
".replace('/.','.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# helper columns:\n",
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove unneeded (non PII) entities:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"TAGS_TO_IGNORE = ['CARDINAL','FAC','LAW','LANGUAGE','MISC','TIME','DATE','ORDINAL','EVENT','QUANTITY','WORK_OF_ART','MONEY','PRODUCT','PERCENT']\n",
"def remote_unwanted_tags(x):\n",
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
" return 'O'\n",
" else:\n",
" return x\n",
"\n",
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
"ner_dataset[ner_dataset['sentence_idx']==3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove PERSON tags if preceding word is 'the' (e.g. the Bush administration)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
"\n",
"def remove_tag_if_the_person(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" return row['tag']\n",
"\n",
"ner_dataset['prev-word']=ner_dataset['prev-word'].astype('str')\n",
"ner_dataset['prev-prev-word']=ner_dataset['prev-prev-word'].astype('str')\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove tag from 's (Joe Wilson's cat)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_apostraphe_after_tag(row):\n",
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Re-tag words from dictionaries (countries, nationalities, roles, titles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nationalities and countries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
"nationalities.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"algeria\" in nationalities['country'].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"ner_dataset['metadata'] = None\n",
"\n",
"def get_nationality_as_metadata(row):\n",
" if row['word'].lower() in nationalities['country'].values:\n",
" return 'COUNTRY'\n",
" elif row['word'].lower() in nationalities['nationality'].values:\n",
" return 'NATIONALITY'\n",
" elif row['word'].lower() in nationalities['man'].values:\n",
" return 'NATION_MAN'\n",
" elif row['word'].lower() in nationalities['woman'].values:\n",
" return 'NATION_WOMAN'\n",
" elif row['word'].lower() in nationalities['plural'].values:\n",
" return 'NATION_PLURAL'\n",
" return row['metadata']\n",
"\n",
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
"\n",
"def update_tag_based_on_metadata(row):\n",
" if row['metadata'] is not None:\n",
" return \"B-\"+row['metadata']\n",
" else:\n",
" return row['tag']\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Titles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MALE_TITLES = ['mr', 'dr', 'professor', 'eng','prof','doctor']\n",
"FEMALE_TITLES = ['mrs', 'ms', 'miss', 'dr', 'professor', 'eng', 'prof','doctor']\n",
"\n",
"def get_title_as_metadata(row):\n",
" if row['word'].lower() in MALE_TITLES:\n",
" return 'MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES:\n",
" return 'FEMALE_TITLE'\n",
" return row['metadata']\n",
"\n",
"\n",
"def update_title_tag_if_missing(row):\n",
" if row['word'].lower() in MALE_TITLES and row['tag']=='O':\n",
" return 'B-MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES and row['tag']=='O':\n",
" return 'B-FEMALE_TITLE'\n",
" else:\n",
" return row['tag']\n",
"\n",
"ner_dataset['metadata'] = ner_dataset.apply(get_title_as_metadata,axis=1)\n",
"ner_dataset['tag'] = ner_dataset.apply(update_title_tag_if_missing,axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==18]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove 'the' from 'the NORP' if NORP is not in nationalities list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_the_norp(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove sentences with adjacent different entities (e.g calling from New York Larry King)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
"ner_dataset['next-entity']=ner_dataset['next-tag'].str[2:]\n",
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
"sentences_to_remove = ner_dataset[adjacent_idc]['sentence_idx'].values\n",
"sentences_to_remove\n",
"\n",
"ner_dataset=ner_dataset[~ner_dataset['sentence_idx'].isin(sentences_to_remove)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Update tag for discovered metadata values (eg. nationalities)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['tag'] = ner_dataset.apply(update_tag_based_on_metadata, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create templates base on NER dataset\n",
"Here we create the actual templates + handle multiple weird cases that should cause the template sentences to be weird. Note that a manual run over the templates dataset is still required after this step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"class SentenceGetter(object):\n",
" \n",
" def __init__(self, dataset):\n",
" self.n_sent = 1\n",
" self.dataset = dataset\n",
" self.empty = False\n",
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
" s[\"tag\"].values.tolist())]\n",
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
" self.sentences = [s for s in self.grouped]\n",
" \n",
" def get_next(self):\n",
" try:\n",
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
" self.n_sent += 1\n",
" return s\n",
" except:\n",
" return None\n",
" \n",
" @staticmethod \n",
" def cleanse_template(template, ents):\n",
" # Remove whitespace before certain punctuation marks\n",
" template = re.sub(r'\\s([?,:.!](?:|$))+', r'\\1', template)\n",
" \n",
" # Remove whitespaces within double quotes\n",
" template = re.sub('\\\"\\s*([^\\\"]*?)\\s*\\\"', r'\"\\1\"', template) \n",
" \n",
" # Remove whitespaces within quotes\n",
" template = re.sub(\"\\'\\s*([^\\']*?)\\s*\\'\", r\"'\\1'\", template) \n",
" \n",
" # Remove whitespaces within parentheses\n",
" template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
" \n",
" for ent in ents:\n",
" #Turn PERSON PERSON into PERSON\n",
" duplicates = \"[{}] [{}]\".format(ent,ent)\n",
" template = template.replace(duplicates,\"[{}]\".format(ent))\n",
" \n",
" \n",
" # Replace additional weird templates:\n",
" to_replace = {\n",
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
" \"[LOCATION] said\" : \"[PERSON] said\",\n",
" \"[ORGANIZATION] of [ORGANIZATION]\" : \"[ORGANIZATION]\",\n",
" \"the [COUNTRY]\" : \"[COUNTRY]\",\n",
" \" 's \":\"'s\",\n",
" \"] 's \":\"]'s \",\n",
" \"] 's,\":\"]'s,\",\n",
" \"] 's.\":\"]'s.\",\n",
" \" n't\" : \"n't\",\n",
" \"/?\":\"?\",\n",
" \"%u\":\"u\",\n",
" \"%m\":\"m\",\n",
" \"%e\":\"e\", \n",
" \"%h\":\"h\", \n",
" \"%a\":\"a\",\n",
" \" %\":\"%\",\n",
" \" ?\":\"?\",\n",
" \" /?\":\"?\",\n",
" \" ' .\":\"'.\",\n",
" \"[ \":\"(\",\n",
" \" ]\":\")\",\n",
" \"[PERSON] -- [PERSON]\":\"[PERSON]\",\n",
" \"[COUNTRY] -- [ORGANIZATION]\":\"[ORGANIZATION]\",\n",
" \"Jews\" : \"[NATIONALITY]\",\n",
" \"Chinese\" : \"[NATIONALITY]\",\n",
" \"Dutch\" : \"[NATIONALITY]\",\n",
" \"[LOCATION], [LOCATION]\":\"[LOCATION]\",\n",
" \"[LOCATION] [ORGANIZATION]\":\"[ORGANIZATION]\"\n",
" }\n",
" \n",
" for weird in to_replace.keys():\n",
" #if weird in template:\n",
" # print(\"Weird sentence\",template)\n",
" template = template.replace(weird,to_replace[weird])\n",
" \n",
" template = template.replace(\" -- \",\" - \")\n",
" \n",
" #Ignore templates that are incomplete\n",
" if \"/-\" in template:\n",
" template = \"\"\n",
" \n",
" #Ignore templates that have numbers after the end or start of the entity\n",
" if len(re.findall(r\"\\]\\s[0-9]\",template)) > 0:\n",
" template = \"\"\n",
" \n",
" if len(re.findall(r\"[0-9]\\s\\[\",template)) > 0:\n",
" template = \"\"\n",
" \n",
" if len(re.findall(r\"[0-9].\\s\\[\",template)) > 0:\n",
" template = \"\"\n",
" \n",
" \n",
" if \"[PERSON] ([COUNTRY])\" in template:\n",
" template = \"\"\n",
" if \"[PERSON] ([LOCATION])\" in template:\n",
" template = \"\"\n",
" \n",
" if template.count('\"') == 1:\n",
" template = template.replace('\"','')\n",
"\n",
" return template\n",
" \n",
" @staticmethod \n",
" def get_template(grouped,entity_name_replace_dict):\n",
" template = \"\"\n",
" i=0\n",
" cur_index = 0\n",
" ents = []\n",
" for token in grouped:\n",
" # remove brackets as they interefere with the data generation process\n",
" token_text = token[0].replace(\"[\", \"(\").replace(\"]\",\")\")\n",
" token_text = token[0].replace(\"{\", \"(\").replace(\"}\",\")\")\n",
" token_tag = token[1]\n",
" token_entity = token_tag[2:] if len(token_tag)>1 else token_tag\n",
" \n",
" if token_entity == 'O':\n",
" template += \" \" + token_text\n",
" elif 'B-' in token_tag and token_entity not in TAGS_TO_IGNORE:\n",
" #print(\"found entity: {}\".format(token_entity))\n",
" ent = entity_name_replace_dict[token_entity]\n",
" ents.append(ent)\n",
" \n",
" template += \" [\" + ent + \"]\"\n",
" #print(\"template: \",template)\n",
" \n",
" template = SentenceGetter.cleanse_template(template, ents)\n",
" \n",
" return template.strip()\n",
" \n",
"getter = SentenceGetter(ner_dataset)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ENTITIES_DICTIONARY = {\"PERSON\":\"PERSON\",\n",
" \"PER\":\"PERSON\",\n",
" \"GPE\":\"COUNTRY\",\n",
" \"NORP\":\"LOCATION\",\n",
" \"LOC\":\"LOCATION\",\n",
" \"ORG\":\"ORGANIZATION\",\n",
" \"MALE_TITLE\":\"MALE_TITLE\",\n",
" \"FEMALE_TITLE\":\"FEMALE_TITLE\",\n",
" \"COUNTRY\":\"COUNTRY\",\n",
" \"NATIONALITY\":\"NATIONALITY\",\n",
" \"NATION_WOMAN\":\"NATION_WOMAN\",\n",
" \"NATION_MAN\":\"NATION_MAN\",\n",
" \"NATION_PLURAL\":\"NATION_PLURAL\"}\n",
"\n",
"sentences = getter.sentences\n",
"\n",
"sent_id = 445\n",
"\n",
"print(\"original:\",sentences[sent_id])\n",
"print(\"template:\", getter.get_template(sentences[sent_id],entity_name_replace_dict=ENTITIES_DICTIONARY))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_templates = [getter.get_template(sentence,entity_name_replace_dict=ENTITIES_DICTIONARY) for sentence in sentences]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"original length of templates: {}\".format(len(all_templates)))\n",
"all_templates = list(set(all_templates))\n",
"print(\"length after duplicates removal: {}\".format(len(all_templates)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save templates to file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"../raw_data/conll_based_templates.txt\",\"w+\",encoding='utf-8') as f:\n",
" for template in all_templates:\n",
" f.write(\"%s\\n\" % template) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,396 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate new examples based on this dataset: \n",
"https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus\n",
"\n",
"This notebook takes the ner dataset from the previous link, and creates templates (utterances with placeholders) for a PII synthetic data generator to use in order to create new sentences.\n",
"Note that due to the nature of the tagging, there might be weird output sentences. For example:\n",
"\n",
"- The same entity shows multiple times in sentence: \"I travel from Argentina to Argentina\"\n",
"- Bad grammer due to the lack of inflection and changes to nouns due to context: \"*The statement said no Denmark or India-led troops were killed*\" instead of \"*The statement said no Danish or Indian led troops were killed*\"\n",
"- Unrealistic sentences due to change in entities: \"Prime minister Lebron James enters the government building in Kuala Lumpur\"\n",
"\n",
"\n",
"The notebook additionally introduces two new entities: TITLE and ROLE, in order to overcome cases like \"UK David Scott called his wife\", where the original sentence is \"UK Prime Minister Boris Johnson called his wife\" as \"Prime Minister\" was originally tagged as PER in the original dataset. Same logic goes for titles, like Mr., Mrs., Ms."
]
},
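{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal, self-contained sketch of the idea (added for illustration; the toy sentence and mapping below are made up): each sentence is a list of (word, tag) tuples, 'O' tokens are copied as-is, and the first token of every entity span is replaced by a placeholder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: tagged tokens -> template with placeholders\n",
"toy_sentence = [('UK', 'B-geo'), ('Prime', 'B-rol'), ('Minister', 'I-rol'),\n",
"                ('Boris', 'B-per'), ('Johnson', 'I-per'), ('called', 'O'),\n",
"                ('his', 'O'), ('wife', 'O')]\n",
"toy_replace_dict = {'per': 'PERSON', 'geo': 'LOCATION', 'rol': 'ROLE'}\n",
"\n",
"toy_template = ''\n",
"for word, tag in toy_sentence:\n",
"    if tag == 'O':\n",
"        toy_template += ' ' + word\n",
"    elif tag.startswith('B-'):\n",
"        toy_template += ' [' + toy_replace_dict[tag[2:]] + ']'\n",
"print(toy_template.strip())  # [LOCATION] [ROLE] [PERSON] called his wife"
]
},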
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#First, Download ner.csv from https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus\n",
"ner_dataset = pd.read_csv(\"ner.csv\",encoding = \"ISO-8859-1\", error_bad_lines=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(ner_dataset)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset = ner_dataset.drop_duplicates()\n",
"len(ner_dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example sentence:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==13][['sentence_idx','word','tag','prev-word','prev-prev-word','next-word']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### New entities - Title and Role\n",
"\n",
"- **Title**: Mr., Mrs., Professor, Doctor, ...\n",
"- **Role**: President, Secretary General, U.N. Secretary, ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick exploratory analysis of frequencies:\n",
"- First PER token\n",
"- Second PER token\n",
"- First and second PER token\n",
"- One before and first tokens of PER"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Evaluate words before I-per\n",
"bper = ner_dataset[ner_dataset['tag']=='B-per']\n",
"bper_tokens = bper['word']\n",
"prev_bper_token = bper['prev-word']\n",
"next_bper_token = bper['next-word']\n",
"two_prev_tokens = zip(prev_bper_token, bper_tokens)\n",
"two_next_tokens = zip(bper_tokens, next_bper_token)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"print(\"20 most common PER token frequencies:\")\n",
"Counter(bper_tokens).most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"20 most common previous and first PER token frequencies:\")\n",
"Counter(two_prev_tokens).most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"20 most common first and second PER token frequencies:\")\n",
"Counter(two_next_tokens).most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Lists of titles and roles to update as ttl, rol\n",
"TITLES = ['Mr.','Ms.','Mrs.']\n",
"ROLES = ['President','General','Senator','Secretary-General','Minister','General']\n",
"BIGRAMS_ROLES = [('Prime','Minister'),('prime','minister'),('U.S.','President'),\n",
" ('Venezuelan', 'President'),('Vice','President'), ('Foreign', 'Minister'),\n",
" ('U.S.','Secretary'),('U.N.','Secretary'),('Defence','Secretary')]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Update title and per for most common cases\n",
"\n",
"def fix_bigram_title(df, row,index,first='Prime',second='Minister',tag='ttl'):\n",
" if row['word'] == first and row['next-word'] == second and 'per' in row['tag']:\n",
" df.loc[index,'tag'] = 'B-{}'.format(tag)\n",
" elif row['word'] == second and row['prev-word'] == first and 'per' in row['tag']:\n",
" df.loc[index,'tag'] = 'I-{}'.format(tag)\n",
" elif row['tag']== 'I-per' and row['prev-word'] == second and 'per' in row['tag']:\n",
" df.loc[index,'tag'] = 'B-per'\n",
"\n",
"def fix_unigram_title(df, prev_row,prev_index, row , index, title='President',tag='ttl'):\n",
" #print(row)\n",
" if prev_row['word'] == title and prev_row['tag'] == 'B-per' and row['tag']=='I-per':\n",
" df.loc[prev_index,'tag']='B-{}'.format(tag)\n",
" df.loc[index,'tag'] = 'B-per'\n",
"\n",
"prev_row = None\n",
"prev_index = None\n",
"for index, row in ner_dataset.iterrows():\n",
" # Handle 'Prime Minister'\n",
" for bigram in BIGRAMS_ROLES:\n",
" fix_bigram_title(ner_dataset,row,index,bigram[0],bigram[1],'rol')\n",
"\n",
" if prev_row is not None:\n",
" for title in TITLES:\n",
" fix_unigram_title(df=ner_dataset,prev_row=prev_row,prev_index=prev_index,row=row,index=index,title=title,tag='ttl')\n",
" for role in ROLES:\n",
" fix_unigram_title(ner_dataset,prev_row,prev_index,row,index,role,'rol')\n",
"\n",
" prev_row = row\n",
" prev_index = index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==13][['sentence_idx','word','tag','prev-word','prev-prev-word','next-word']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# keep only relevant columns\n",
"dataset = ner_dataset[['sentence_idx','word','tag']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.to_csv(\"../../../datasets/ner_with_titles.csv\",encoding = \"ISO-8859-1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create templates base on NER dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"class SentenceGetter(object):\n",
" \n",
" def __init__(self, dataset):\n",
" self.n_sent = 1\n",
" self.dataset = dataset\n",
" self.empty = False\n",
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
" s[\"tag\"].values.tolist())]\n",
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
" self.sentences = [s for s in self.grouped]\n",
" \n",
" def get_next(self):\n",
" try:\n",
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
" self.n_sent += 1\n",
" return s\n",
" except:\n",
" return None\n",
" \n",
" @staticmethod \n",
" def get_template(grouped,entity_name_replace_dict=None):\n",
" TAGS_TO_IGNORE = ['nat','eve','art','tim']\n",
" template = \"\"\n",
" i=0\n",
" cur_index = 0\n",
" ents = []\n",
" for token in grouped:\n",
" token_text = token[0].replace(\"[\", \"\").replace(\"]\",\"\")\n",
" token_tag = token[1]\n",
" if token_tag == 'O':\n",
" template += \" \" + token_text\n",
" elif 'B-' in token_tag and token_tag[2:] not in TAGS_TO_IGNORE:\n",
" if entity_name_replace_dict:\n",
" ent = entity_name_replace_dict[token[1][2:]]\n",
" else:\n",
" ent = token_tag[2:]\n",
" ents.append(ent)\n",
" template += \" [\" + ent + \"]\"\n",
" template = re.sub(r'\\s([?,\\':.!\"](?:|$))+', r'\\1', template)\n",
" \n",
" for ent in ents:\n",
" weird = \"[{}] [{}]\".format(ent,ent)\n",
" template = template.replace(weird,\"[{}]\".format(ent))\n",
" \n",
" #remove additional weird combinations:\n",
" \n",
" to_replace = {\n",
" \"[COUNTRY] [ROLE] [PERSON]\": \"[ROLE] [PERSON]\",\n",
" \"[COUNTRY] [ROLE]\" : \"[ROLE]\",\n",
" \"[ORGANIZATION] [ROLE] [PERSON]\" : \"[ORGANIZATION]'s [ROLE] [PERSON]\",\n",
" \"[COUNTRY] [LOCATION]\" : \"[LOCATION]\",\n",
" \"[LOCATION] [COUNTRY]\": \"[LOCATION]\",\n",
" \"[PERSON] [COUNTRY]\" : \"[PERSON]\",\n",
" \"[PERSON] [LOCATION]\" : \"[PERSON]\",\n",
" \"[COUNTRY] [PERSON]\" : \"[PERSON]\",\n",
" \"[LOCATION] [PERSON]\" : \"[PERSON]\"],\n",
" \"The [ORGANIZATION]\" : \"[ORGANIZATION]\"\n",
" \"[PERSON] [ORGANIZATION]\" : \"[PERSON]\",\n",
" \"of [ORGANIZATION] [PERSON]\" : \"of [ORGANIZATION], [PERSON]\",\n",
" \"[ORGANIZATION] [PERSON]\" : \"[PERSON]\",\n",
" \"[PERSON] [PERSON]\": \"[PERSON]\",\n",
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
" \"[LOCATION] said\" : \"[PERSON] said\"\n",
" \n",
" \n",
" }\n",
" \n",
" for weird in to_replace.keys():\n",
" template = template.replace(weird,to_replace[weird])\n",
" \n",
" return template.strip()\n",
" \n",
"getter = SentenceGetter(dataset)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ENTITIES_DICTIONARY = {\"per\":\"PERSON\",\"gpe\":\"COUNTRY\",\"geo\":\"LOCATION\",\"org\":\"ORGANIZATION\",'ttl':'TITLE','rol':'ROLE'}\n",
"\n",
"sentences = getter.sentences\n",
"print(\"original:\",sentences[12])\n",
"print(\"template:\", getter.get_template(sentences[12],entity_name_replace_dict=ENTITIES_DICTIONARY))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_templates = [SentenceGetter.get_template(sentence, ENTITIES_DICTIONARY) for sentence in sentences]\n",
"new_templates[:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save to file\n",
"\n",
"with open(\"../../presidio_evaluator/data_generator/raw_data/new_templates2.txt\",\"w+\", encoding = \"ISO-8859-1\") as f:\n",
" for template in new_templates:\n",
" f.write(\"%s\\n\" % template)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,664 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook takes the ontonoes ner dataset, and creates templates (utterances with placeholders) for a PII synthetic data generator to use in order to create new sentences.\n",
"\n",
"The notebook additionally introduces two new entities: TITLE and ROLE, in order to overcome cases like \"UK David Scott called his wife\", where the original sentence is \"UK Prime Minister Boris Johnson called his wife\" as \"Prime Minister\" was originally tagged as PER in the original dataset. Same logic goes for titles, like Mr., Mrs., Ms."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.options.display.max_rows = 4000\n",
"pd.set_option('display.max_colwidth', -1)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"## Download OntoNotes data\n",
"ontonotes = \"\""
]
},
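{
"cell_type": "markdown",
"metadata": {},
"source": [
"The download/parsing step above is left empty. As a sketch (an assumption, not the original loader): the next cells expect `ontonotes` to be an iterable of sentences, each sentence being a list of (word, tag) tuples. A hypothetical reader for a two-column \"token tag\" file with blank lines between sentences could look like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the reader function and the file path are illustrative assumptions.\n",
"def read_token_tag_file(path):\n",
"    sentences, current = [], []\n",
"    with open(path, encoding='utf-8') as f:\n",
"        for line in f:\n",
"            line = line.strip()\n",
"            if not line:\n",
"                if current:\n",
"                    sentences.append(current)\n",
"                    current = []\n",
"            else:\n",
"                word, tag = line.split()[:2]\n",
"                current.append((word, tag))\n",
"    if current:\n",
"        sentences.append(current)\n",
"    return sentences\n",
"\n",
"# ontonotes = read_token_tag_file('ontonotes_token_tag.txt')  # hypothetical file"
]
},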
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To pandas + add sentence_idx"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df_list = []\n",
"sentence_id = 0\n",
"for sentence in ontonotes:\n",
" \n",
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
" df[\"sentence_idx\"] = sentence_id\n",
" sentence_id+=1\n",
" df_list.append(df)\n",
"ner_dataset = pd.concat(df_list)\n",
"ner_dataset.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"sentences = ner_dataset.groupby('sentence_idx')['word'].apply(lambda x: \" \".join(x))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"print(sentences[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example sentence:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==3]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Unique entities\n",
"ner_dataset['tag'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace tokenization replacements"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['word'] = ner_dataset['word']\\\n",
".replace('-LRB-','(')\\\n",
".replace('-RRB-',')')\\\n",
".replace('-LCB-','(')\\\n",
".replace('-RCB-',')')\\\n",
".replace('``','\"')\\\n",
".replace(\"''\",'\"')\\\n",
".replace('/.','.')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# helper columns:\n",
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove unneeded (non PII) entities:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"TAGS_TO_IGNORE = ['CARDINAL','FAC','LAW','LANGUAGE','TIME','DATE','ORDINAL','EVENT','QUANTITY','WORK_OF_ART','MONEY','PRODUCT','PERCENT']\n",
"def remote_unwanted_tags(x):\n",
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
" return 'O'\n",
" else:\n",
" return x\n",
"\n",
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
"ner_dataset[ner_dataset['sentence_idx']==3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove PERSON tags if preceding word is 'the' (e.g. the Bush administration)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
"\n",
"def remove_tag_if_the_person(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" return row['tag']\n",
"\n",
"ner_dataset['prev-word']=ner_dataset['prev-word'].astype('str')\n",
"ner_dataset['prev-prev-word']=ner_dataset['prev-prev-word'].astype('str')\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove tag from 's (Joe Wilson's cat)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_apostraphe_after_tag(row):\n",
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Re-tag words from dictionaries (countries, nationalities, roles, titles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nationalities and countries:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
"nationalities.head()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"\"algeria\" in nationalities['country'].values"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"\n",
"ner_dataset['metadata'] = None\n",
"\n",
"def get_nationality_as_metadata(row):\n",
" if row['word'].lower() in nationalities['country'].values:\n",
" return 'COUNTRY'\n",
" elif row['word'].lower() in nationalities['nationality'].values:\n",
" return 'NATIONALITY'\n",
" elif row['word'].lower() in nationalities['man'].values:\n",
" return 'NATION_MAN'\n",
" elif row['word'].lower() in nationalities['woman'].values:\n",
" return 'NATION_WOMAN'\n",
" elif row['word'].lower() in nationalities['plural'].values:\n",
" return 'NATION_PLURAL'\n",
" return row['metadata']\n",
"\n",
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
"\n",
"def update_tag_based_on_metadata(row):\n",
" if row['tag'] != 'O' and row['metadata'] is not None:\n",
" return row['tag'][:2] + row['metadata']\n",
" else:\n",
" return row['tag']\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Titles"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"MALE_TITLES = ['mr', 'dr', 'professor', 'eng','prof','doctor']\n",
"FEMALE_TITLES = ['mrs', 'ms', 'miss', 'dr', 'professor', 'eng', 'prof','doctor']\n",
"\n",
"def get_title_as_metadata(row):\n",
" if row['word'].lower() in MALE_TITLES:\n",
" return 'MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES:\n",
" return 'FEMALE_TITLE'\n",
" return row['metadata']\n",
"\n",
"\n",
"def update_title_tag_if_missing(row):\n",
" if row['word'].lower() in MALE_TITLES and row['tag']=='O':\n",
" return 'B-MALE_TITLE'\n",
" elif row['word'].lower() in FEMALE_TITLES and row['tag']=='O':\n",
" return 'B-FEMALE_TITLE'\n",
" else:\n",
" return row['tag']\n",
"\n",
"ner_dataset['metadata'] = ner_dataset.apply(get_title_as_metadata,axis=1)\n",
"ner_dataset['tag'] = ner_dataset.apply(update_title_tag_if_missing,axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==18]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove 'the' from 'the NORP' if NORP is not in nationalities list."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_the_norp(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove sentences with adjacent different entities (e.g calling from New York Larry King)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
"ner_dataset['next-entity']=ner_dataset['next-tag'].str[2:]\n",
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
"sentences_to_remove = ner_dataset[adjacent_idc]['sentence_idx'].values\n",
"sentences_to_remove\n",
"\n",
"ner_dataset=ner_dataset[~ner_dataset['sentence_idx'].isin(sentences_to_remove)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Update tag for discovered metadata values (eg. nationalities)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['tag'] = ner_dataset.apply(update_tag_based_on_metadata, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create templates base on NER dataset"
]
},
{
"cell_type": "code",
"execution_count": 331,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"class SentenceGetter(object):\n",
" \n",
" def __init__(self, dataset):\n",
" self.n_sent = 1\n",
" self.dataset = dataset\n",
" self.empty = False\n",
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
" s[\"tag\"].values.tolist())]\n",
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
" self.sentences = [s for s in self.grouped]\n",
" \n",
" def get_next(self):\n",
" try:\n",
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
" self.n_sent += 1\n",
" return s\n",
" except:\n",
" return None\n",
" \n",
" @staticmethod \n",
" def cleanse_template(template, ents):\n",
" # Remove whitespace before certain punctuation marks\n",
" template = re.sub(r'\\s([?,:.!](?:|$))+', r'\\1', template)\n",
" \n",
" # Remove whitespaces within double quotes\n",
" template = re.sub('\\\"\\s*([^\\\"]*?)\\s*\\\"', r'\"\\1\"', template) \n",
" \n",
" # Remove whitespaces within quotes\n",
" template = re.sub(\"\\'\\s*([^\\']*?)\\s*\\'\", r\"'\\1'\", template) \n",
" \n",
" # Remove whitespaces within parentheses\n",
" template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
" \n",
" for ent in ents:\n",
" #Turn PERSON PERSON into PERSON\n",
" duplicates = \"[{}] [{}]\".format(ent,ent)\n",
" template = template.replace(duplicates,\"[{}]\".format(ent))\n",
" \n",
" \n",
" # Replace additional weird templates:\n",
" to_replace = {\n",
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
" \"[LOCATION] said\" : \"[PERSON] said\",\n",
" \"[ORGANIZATION] of [ORGANIZATION]\" : \"[ORGANIZATION]\",\n",
" \"the [COUNTRY]\" : \"[COUNTRY]\",\n",
" \" 's \":\"'s\",\n",
" \"] 's \":\"]'s \",\n",
" \"] 's,\":\"]'s,\",\n",
" \"] 's.\":\"]'s.\",\n",
" \" n't\" : \"n't\",\n",
" \"/?\":\"?\",\n",
" \"%u\":\"u\",\n",
" \"%m\":\"m\",\n",
" \"%e\":\"e\", \n",
" \"%h\":\"h\", \n",
" \"%a\":\"a\",\n",
" \" %\":\"%\",\n",
" \" ?\":\"?\",\n",
" \" /?\":\"?\",\n",
" \" ' .\":\"'.\",\n",
" \"[ \":\"(\",\n",
" \" ]\":\")\",\n",
" \"[PERSON] -- [PERSON]\":\"[PERSON]\",\n",
" \"[COUNTRY] -- [ORGANIZATION]\":\"[ORGANIZATION]\",\n",
" \"Jews\" : \"[NATIONALITY]\",\n",
" \"Chinese\" : \"[NATIONALITY]\",\n",
" \"Dutch\" : \"[NATIONALITY]\",\n",
" \"[LOCATION], [LOCATION]\":\"[LOCATION]\"\n",
" }\n",
" \n",
" for weird in to_replace.keys():\n",
" #if weird in template:\n",
" # print(\"Weird sentence\",template)\n",
" template = template.replace(weird,to_replace[weird])\n",
" \n",
" template = template.replace(\" -- \",\" - \")\n",
" \n",
" #Ignore templates that are incomplete\n",
" if \"/-\" in template:\n",
" template = \"\"\n",
" \n",
" if template.count('\"') == 1:\n",
" template = template.replace('\"','')\n",
"\n",
" return template\n",
" \n",
" @staticmethod \n",
" def get_template(grouped,entity_name_replace_dict):\n",
" template = \"\"\n",
" i=0\n",
" cur_index = 0\n",
" ents = []\n",
" for token in grouped:\n",
" # remove brackets as they interefere with the data generation process\n",
" token_text = token[0].replace(\"[\", \"(\").replace(\"]\",\")\")\n",
" token_text = token[0].replace(\"{\", \"(\").replace(\"}\",\")\")\n",
" token_tag = token[1]\n",
" token_entity = token_tag[2:] if len(token_tag)>1 else token_tag\n",
" \n",
" if token_entity == 'O':\n",
" template += \" \" + token_text\n",
" elif 'B-' in token_tag and token_entity not in TAGS_TO_IGNORE:\n",
" #print(\"found entity: {}\".format(token_entity))\n",
" ent = entity_name_replace_dict[token_entity]\n",
" ents.append(ent)\n",
" \n",
" template += \" [\" + ent + \"]\"\n",
" #print(\"template: \",template)\n",
" \n",
" template = SentenceGetter.cleanse_template(template, ents)\n",
" \n",
" return template.strip()\n",
" \n",
"getter = SentenceGetter(ner_dataset)"
]
},
{
"cell_type": "code",
"execution_count": 321,
"metadata": {},
"outputs": [],
"source": [
"ENTITIES_DICTIONARY = {\"PERSON\":\"PERSON\",\n",
" \"GPE\":\"COUNTRY\",\n",
" \"NORP\":\"LOCATION\",\n",
" \"LOC\":\"LOCATION\",\n",
" \"ORG\":\"ORGANIZATION\",\n",
" \"MALE_TITLE\":\"MALE_TITLE\",\n",
" \"FEMALE_TITLE\":\"FEMALE_TITLE\",\n",
" \"COUNTRY\":\"COUNTRY\",\n",
" \"NATIONALITY\":\"NATIONALITY\",\n",
" \"NATION_WOMAN\":\"NATION_WOMAN\",\n",
" \"NATION_MAN\":\"NATION_MAN\",\n",
" \"NATION_PLURAL\":\"NATION_PLURAL\"}\n",
" \n",
"\n",
"\n",
"sentences = getter.sentences\n",
"\n",
"sent_id = 445\n",
"\n",
"print(\"original:\",sentences[sent_id])\n",
"print(\"template:\", getter.get_template(sentences[sent_id],entity_name_replace_dict=ENTITIES_DICTIONARY))"
]
},
{
"cell_type": "code",
"execution_count": 322,
"metadata": {},
"outputs": [],
"source": [
"all_templates = [getter.get_template(sentence,entity_name_replace_dict=ENTITIES_DICTIONARY) for sentence in sentences]"
]
},
{
"cell_type": "code",
"execution_count": 323,
"metadata": {},
"outputs": [],
"source": [
"print(\"original length of templates: {}\".format(len(all_templates)))\n",
"all_templates = list(set(all_templates))\n",
"print(\"length after duplicates removal: {}\".format(len(all_templates)))"
]
},
{
"cell_type": "code",
"execution_count": 324,
"metadata": {},
"outputs": [],
"source": [
"# save to file\n",
"\n",
"with open(\"../raw_data/ontonotes_based_templates.txt\",\"w+\",encoding='utf-8') as f:\n",
" for template in all_templates:\n",
" f.write(\"%s\\n\" % template)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 330,
"metadata": {},
"outputs": [],
"source": [
"template = \"[NATIONALITY]'s[MALE_TITLE]'\"\n",
"\n",
"template = getter.cleanse_template(template,[])\n",
"#template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
"template"
]
},
{
"cell_type": "code",
"execution_count": 326,
"metadata": {},
"outputs": [],
"source": [
"if template.count(\"'\")==1:\n",
" print(True)\n",
" template = template.replace(\"'\",'')"
]
},
{
"cell_type": "code",
"execution_count": 327,
"metadata": {},
"outputs": [],
"source": [
"template"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,436 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exploratory data analysis on the OntoNotes dataset, to gain insights towards the templating of the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.options.display.max_rows = 4000\n",
"pd.set_option('display.max_colwidth', -1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"conll = \"\" # Download CoNLL-2003\n",
"\n",
"df_list = []\n",
"sentence_id = 0\n",
"for sentence in conll:\n",
" \n",
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
" df[\"sentence_idx\"] = sentence_id\n",
" sentence_id+=1\n",
" df_list.append(df)\n",
"ner_dataset = pd.concat(df_list)\n",
"ner_dataset.head(10)"
]
},
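{
"cell_type": "markdown",
"metadata": {},
"source": [
"The download step in the cell above is left empty. As an illustrative assumption (not the original loader), the loop above expects `conll` to be an iterable of sentences, each a list of (word, tag) tuples with OntoNotes-style tags:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: a made-up example of the expected structure\n",
"conll_example = [\n",
"    [('U.S.', 'B-GPE'), ('President', 'O'), ('George', 'B-PERSON'),\n",
"     ('Bush', 'I-PERSON'), ('spoke', 'O'), ('.', 'O')]\n",
"]\n",
"pd.DataFrame(conll_example[0], columns=['word', 'tag'])"
]
},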
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"TAGS_TO_IGNORE = ['CARDINAL','FAC','LAW','LANGUAGE','TIME','DATE','ORDINAL','EVENT','QUANTITY','WORK_OF_ART','MONEY','PRODUCT','PERCENT']\n",
"def remote_unwanted_tags(x):\n",
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
" return 'O'\n",
" else:\n",
" return x\n",
"\n",
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
"ner_dataset[ner_dataset['sentence_idx']==3]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"sentences = ner_dataset.groupby('sentence_idx')['word'].transform(lambda x: ' '.join(x)).unique().tolist()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"len(sentences)\n",
"#print(sentences[:5])\n",
"with open(\"raw_sentences.txt\",\"w\",encoding=\"utf8\") as f:\n",
" for item in sentences:\n",
" f.write(\"{}\\n\".format(item))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Number of labels per tag"
]
},
{
"cell_type": "code",
"execution_count": 261,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset.groupby('tag')['tag'].count()"
]
},
{
"cell_type": "code",
"execution_count": 264,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['word'] = ner_dataset['word'].replace('-LRB-',')')\\\n",
".replace('-RRB-',')')\\\n",
".replace('``',\"\\\"\")\\\n",
".replace(\"''\",'\"')\\\n",
".replace('/.','.')"
]
},
{
"cell_type": "code",
"execution_count": 265,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"Counter(ner_dataset['word']).most_common(30)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Add lead and lag words and tags to dataset_no_punct"
]
},
{
"cell_type": "code",
"execution_count": 267,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"punct = [c for c in string.punctuation]\n",
"punct.extend([\"--\",\"''\",\"/.\"])\n",
"print(punct)\n",
"dataset_no_punct = ner_dataset[~ner_dataset.word.str.strip().isin(punct)]\n",
"dataset_no_punct['prev-word'] = dataset_no_punct.word.shift(1)\n",
"dataset_no_punct['prev-prev-word'] = dataset_no_punct['word'].shift(2)\n",
"dataset_no_punct['next-word'] = dataset_no_punct['word'].shift(-1)\n",
"dataset_no_punct['prev-tag'] = dataset_no_punct['tag'].shift(1)\n",
"dataset_no_punct['next-tag'] = dataset_no_punct['tag'].shift(-1)\n",
"dataset_no_punct.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Add features for easier manipulation"
]
},
{
"cell_type": "code",
"execution_count": 268,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Gather statistics on the first person token"
]
},
{
"cell_type": "code",
"execution_count": 269,
"metadata": {},
"outputs": [],
"source": [
"bper = dataset_no_punct[dataset_no_punct['tag']=='B-PERSON']"
]
},
{
"cell_type": "code",
"execution_count": 270,
"metadata": {},
"outputs": [],
"source": [
"# histogram of B-PERSON tokens\n",
"from collections import Counter\n",
"Counter(bper['word']).most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": 271,
"metadata": {},
"outputs": [],
"source": [
"prev_bper_token = bper['prev-word'].str.lower()\n",
"Counter(prev_bper_token).most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": 272,
"metadata": {},
"outputs": [],
"source": [
"prev_prev_bper_token = bper['prev-prev-word']\n",
"two_prev_tokens = zip(prev_prev_bper_token.str.lower(), prev_bper_token.str.lower())\n",
"Counter(two_prev_tokens).most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": 273,
"metadata": {},
"outputs": [],
"source": [
"# find \"the\" followed by B-PERSON\n",
"the_PERSON = ner_dataset[(ner_dataset['prev-word'].str.lower()==\"the\") & (ner_dataset['tag']=='B-PERSON')]\n",
"print(the_PERSON['prev-word']+\" \"+the_PERSON['word']+\" \"+the_PERSON['next-word']+\" \"+the_PERSON['next-next-word'].values)"
]
},
{
"cell_type": "code",
"execution_count": 296,
"metadata": {},
"outputs": [],
"source": [
"## add metadata for nationalities (to differentiate between America, Americans and US citizen)\n",
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
"nationalities.head()\n",
"\n",
"ner_dataset['metadata'] = None\n",
"\n",
"def get_nationality_as_metadata(row):\n",
" if row['word'].lower() in nationalities['country'].values:\n",
" return 'COUNTRY'\n",
" elif row['word'].lower() in nationalities['nationality'].values:\n",
" return 'NATIONALITY'\n",
" elif row['word'].lower() in nationalities['man'].values:\n",
" return 'NATION_MAN'\n",
" elif row['word'].lower() in nationalities['woman'].values:\n",
" return 'NATION_WOMAN'\n",
" return row['metadata']\n",
"\n",
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
"\n",
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 297,
"metadata": {},
"outputs": [],
"source": [
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
"\n",
"def remove_tag_if_the_person(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
" return 'O'\n",
" return row['tag']\n",
"\n",
"def remove_tag_if_the_norp(row):\n",
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
" return 'O'\n",
" return row['tag']\n",
"\n",
"ner_dataset['prev-word']=ner_dataset['prev-word'].astype('str')\n",
"ner_dataset['prev-prev-word']=ner_dataset['prev-prev-word'].astype('str')\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 299,
"metadata": {},
"outputs": [],
"source": [
"# find \"the\" followed by B-NORP\n",
"the_NORP = ner_dataset[(ner_dataset['prev-word'].str.lower()==\"the\") & (ner_dataset['tag']=='B-NORP')]\n",
"print(the_NORP['prev-word']+\" \"+the_NORP['word']+\" \"+the_NORP['next-word']+\" \"+the_NORP['next-next-word'].values + \" (\" + the_NORP['metadata'] + \")\")"
]
},
{
"cell_type": "code",
"execution_count": 276,
"metadata": {},
"outputs": [],
"source": [
"def remove_tag_if_apostraphe_after_tag(row):\n",
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
" return 'O'\n",
" return row['tag']\n",
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_apostraphe_after_tag,axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {},
"outputs": [],
"source": [
"sentences_with_president=ner_dataset[ner_dataset['word'].str.lower() == 'president']['sentence_idx']\n",
"ner_dataset[ner_dataset['sentence_idx']==sentences_with_president.iloc[0]]"
]
},
{
"cell_type": "code",
"execution_count": 279,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['tag']=='B-PERSON']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Adjacent tags"
]
},
{
"cell_type": "code",
"execution_count": 281,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
"ner_dataset['next-entity']=ner_dataset['next-tag'].str[2:]\n"
]
},
{
"cell_type": "code",
"execution_count": 286,
"metadata": {},
"outputs": [],
"source": [
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
"print(\"sentences with duplicate different entities: \",str(len(ner_dataset[adjacent_idc])))\n",
"ner_dataset[adjacent_idc]['sentence_idx']\n"
]
},
{
"cell_type": "code",
"execution_count": 289,
"metadata": {},
"outputs": [],
"source": [
"ner_dataset[ner_dataset['sentence_idx']==8759]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NORP values"
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [],
"source": [
"norp_values = ner_dataset[ner_dataset['entity']=='NORP']['word']\n",
"Counter(norp_values).most_common(50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The country?"
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [],
"source": [
"the_X_idx = (ner_dataset['prev-word']=='the') & (ner_dataset['tag'] != 'O')\n",
"the_X_sentences = ner_dataset[the_X_idx]['sentence_idx']\n",
"the_X_sentences.values[0]\n",
"ner_dataset[ner_dataset['sentence_idx']==the_X_sentences.values[0]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,116 @@
import datetime
import json
import pandas as pd
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import FakeDataGenerator
def read_utterances(utterances_file):
with open(utterances_file) as f:
return f.readlines()
def generate(fake_pii_csv,
utterances_file,
output_file=None,
num_of_examples=1000,
dictionary_path=None,
store_masked_text=False,
keep_only_tagged=False,
**kwargs):
"""
:param fake_pii_csv: csv containing fake PII
:param utterances_file: txt file containing template sentences
:param output_file: filepath for json or csv output
:param num_of_examples: number of examples to generate
:param dictionary_path: path to vocabulary file
:param store_masked_text: Whether to remove or keep masked version of text
:param keep_only_tagged: Ignore utterances with no entity
(e.g. Remove: 'I went to the shop today', Keep: '[PERSON] went to the shop today')
:return: list of generated InputSamples
"""
if not output_file:
raise ValueError("Please provide an output file path")
templates = read_utterances(utterances_file)
if keep_only_tagged:
templates = [template for template in templates if "[" in template]
df = pd.read_csv(fake_pii_csv, encoding='utf-8')
generator = FakeDataGenerator(fake_pii_df=df,
dictionary_path=dictionary_path,
templates=templates, **kwargs)
counter = 0
examples = []
for example in generator.sample_examples(num_of_examples):
if not store_masked_text:
example.masked = None
examples.append(example)
examples_json = [example.to_dict() for example in examples]
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
json.dump(examples_json, f, ensure_ascii=False, indent=4)
print("generated {} examples".format(len(examples)))
print("Finished creating generated dataset. File location:{}".format(output_file))
return examples
def read_synth_dataset(filepath=None, length=None):
import json
with open(filepath, "r", encoding="utf-8") as f:
dataset = json.load(f)
if length:
dataset = dataset[:length]
input_samples = [InputSample.from_json(row) for row in dataset]
return input_samples
if __name__ == "__main__":
# PARAMS:
EXAMPLES = 30
PII_FILE_SIZE = 3000
SPAN_TO_TAG = True
TEMPLATES_FILE = 'raw_data/templates.txt'
KEEP_ONLY_TAGGED = False
LOWER_CASE_RATIO = 0.1
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}
cur_time = datetime.date.today().strftime("%B %d %Y")
OUTPUT = "generated_size_{}_date_{}.txt".format(EXAMPLES, cur_time)
fake_pii_csv = '../../presidio_evaluator/data_generator/' \
'raw_data/FakeNameGenerator.com_{}.csv'.format(PII_FILE_SIZE)
utterances_file = TEMPLATES_FILE
dictionary_path = None
examples = generate(fake_pii_csv=fake_pii_csv,
utterances_file=utterances_file,
dictionary_path=dictionary_path,
output_file=OUTPUT,
lower_case_ratio=LOWER_CASE_RATIO,
num_of_examples=EXAMPLES,
ignore_types=IGNORE_TYPES,
keep_only_tagged=KEEP_ONLY_TAGGED,
span_to_tag=SPAN_TO_TAG)
# sanity
input_samples = read_synth_dataset(OUTPUT)
for sample in input_samples:
if len(sample.tags) != len(sample.tokens):
print("ERROR during generation. sample: {}".format(sample))
print(input_samples[:10])

View file

@ -0,0 +1,38 @@
import random
import os
from pathlib import Path
import pandas as pd
import re
class NationalityGenerator:
    def __init__(self, company_name_file_path="raw_data/nationalities.csv"):
        dir_path = os.path.dirname(os.path.realpath(__file__))
        file_path = Path(dir_path, company_name_file_path)
        df = pd.read_csv(str(file_path))
        self.df = df

    def get_country(self):
        ## [COUNTRY]
        return NationalityGenerator.capitalizeWords(random.choice(self.df['country'].values))

    def get_nationality(self):
        ## [NATIONALITY]
        return NationalityGenerator.capitalizeWords(random.choice(self.df['nationality'].values))

    def get_nation_woman(self):
        ## [NATION_WOMAN]
        return NationalityGenerator.capitalizeWords(random.choice(self.df['woman'].values))

    def get_nation_man(self):
        ## [NATION_MAN]
        return NationalityGenerator.capitalizeWords(random.choice(self.df['man'].values))

    def get_nation_plural(self):
        ## [NATION_PLURAL]
        return NationalityGenerator.capitalizeWords(random.choice(self.df['plural'].values))

    @staticmethod
    def capitalizeWords(s):
        return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)
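
# Illustrative usage (added as a sketch, not part of the original module); assumes
# raw_data/nationalities.csv is available next to this file, as in the constructor default.
if __name__ == "__main__":
    gen = NationalityGenerator()
    print(gen.get_country())       # e.g. "France"
    print(gen.get_nationality())   # e.g. "French"
    print(gen.get_nation_woman())  # e.g. "Frenchwoman"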

View file

@ -0,0 +1,16 @@
import random
import os
from pathlib import Path
class OrgNameGenerator:
    def __init__(self, company_name_file_path="raw_data/organizations.csv"):
        self.companies = []
        dir_path = os.path.dirname(os.path.realpath(__file__))
        file_path = Path(dir_path, company_name_file_path)
        with open(str(file_path)) as file:
            self.companies = file.read().splitlines()

    def get_organization(self):
        return random.choice(self.companies)
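
# Illustrative usage (added as a sketch, not part of the original module); assumes
# raw_data/organizations.csv is available next to this file, as in the constructor default.
if __name__ == "__main__":
    gen = OrgNameGenerator()
    print(gen.get_organization())  # prints one random organization name from the list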

The diff for this file is not shown because of its large size. Load diff

View file

@ -0,0 +1,203 @@
country,nationality,man,woman,plural
algeria,algerian,algerian,algerian,algerians
andorra,andorran,andorran,andorran,andorrans
angola,angolan,angolan,angolan,angolans
argentina,argentinian,argentinian,argentinian,argentinians
armenia,armenian,armenian,armenian,armenians
australia,australian,australian,australian,australians
austria,austrian,austrian,austrian,austrians
azerbaijan,azerbaijani,azerbaijani,azerbaijani,azerbaijanis
bahamas,bahamian,bahamian,bahamian,bahamians
bahrain,bahraini,bahraini,bahraini,bahrainis
bangladesh,bangladeshi,bangladeshi,bangladeshi,bangladeshis
barbados,barbadian,barbadian,barbadian,barbadians
belarus,belarusian,belarusian,belarusian,belarusians
belgium,belgian,belgian,belgian,belgians
belize,belizian,belizian,belizian,belizians
benin,beninese,beninese,beninese,benineses
bhutan,bhutanese,bhutanese,bhutanese,bhutaneses
bolivia,bolivian,bolivian,bolivian,bolivians
bosnia-herzegovina,bosnian,bosnian,bosnian,bosnians
botswana,botswanan,tswana,tswana,tswanas
brazil,brazilian,brazilian,brazilian,brazilians
britain,british,briton,briton,britons
brunei,bruneian,bruneian,bruneian,bruneians
bulgaria,bulgarian,bulgarian,bulgarian,bulgarians
burkina,burkinese,burkinese,burkinese,burkineses
burma (or myanmar),burmese,burmese,burmese,burmeses
burundi,burundian,burundian,burundian,burundians
cambodia,cambodian,cambodian,cambodian,cambodians
cameroon,cameroonian,cameroonian,cameroonian,cameroonians
canada,canadian,canadian,canadian,canadians
cape verde islands,cape verdean,cape verdean,cape verdean,cape verdeans
chad,chadian,chadian,chadian,chadians
chile,chilean,chilean,chilean,chileans
china,chinese,chinese,chinese,chineses
colombia,colombian,colombian,colombian,colombians
congo,congolese,congolese,congolese,congoleses
costa rica,costa rican,costrican,costrican,costricans
croatia,croatian,croatian,croatian,croatians
cuba,cuban,cuban,cuban,cubans
cyprus,cypriot,cypriot,cypriot,cypriots
czech republic,czech,czech,czech,czechs
denmark,danish,dane,dane,danes
djibouti,djiboutian,djiboutian,djiboutian,djiboutians
dominica,dominican,dominican,dominican,dominicans
dominican republic,dominican,dominican,dominican,dominicans
ecuador,ecuadorean,ecuadorean,ecuadorean,ecuadoreans
egypt,egyptian,egyptian,egyptian,egyptians
el salvador,salvadorean,salvadorean,salvadorean,salvadoreans
england,english,englishman,englishwoman,englishmans
eritrea,eritrean,eritrean,eritrean,eritreans
estonia,estonian,estonian,estonian,estonians
ethiopia,ethiopian,ethiopian,ethiopian,ethiopians
fiji,fijian,fijian,fijian,fijians
finland,finnish,finn,finn,finns
france,french,frenchman,frenchwoman,frenchmans
gabon,gabonese,gabonese,gabonese,gaboneses
gambia,gambian,gambian,gambian,gambians
georgia,georgian,georgian,georgian,georgians
germany,german,german,german,germans
ghana,ghanaian,ghanaian,ghanaian,ghanaians
greece,greek,greek,greek,greeks
grenada,grenadian,grenadian,grenadian,grenadians
guatemala,guatemalan,guatemalan,guatemalan,guatemalans
guinea,guinean,guinean,guinean,guineans
guyana,guyanese,guyanese,guyanese,guyaneses
haiti,haitian,haitian,haitian,haitians
holland,dutch,dutchman,dutchwoman,dutchmans
netherlands,dutch,dutchman,dutchwoman,dutchmans
honduras,honduran,honduran,honduran,hondurans
hungary,hungarian,hungarian,hungarian,hungarians
iceland,icelandic,icelander,icelander,icelanders
india,indian,indian,indian,indians
indonesia,indonesian,indonesian,indonesian,indonesians
iran,iranian,iranian,iranian,iranians
iraq,iraqi,iraqi,iraqi,iraqis
ireland,irish,irishman,irishwoman,irishmans
republic of ireland,irish,irishman,irishwoman,irishmans
israel,israeli,israeli,israeli,israelis
italy,italian,italian,italian,italians
jamaica,jamaican,jamaican,jamaican,jamaicans
japan,japanese,japanese,japanese,japaneses
jordan,jordanian,jordanian,jordanian,jordanians
kazakhstan,kazakh,kazakh,kazakh,kazakhs
kenya,kenyan,kenyan,kenyan,kenyans
kuwait,kuwaiti,kuwaiti,kuwaiti,kuwaitis
laos,laotian,laotian,laotian,laotians
latvia,latvian,latvian,latvian,latvians
lebanon,lebanese,lebanese,lebanese,lebaneses
liberia,liberian,liberian,liberian,liberians
libya,libyan,libyan,libyan,libyans
liechtenstein,liechtensteiner,liechtensteiner,liechtensteiner,liechtensteiners
lithuania,lithuanian,lithuanian,lithuanian,lithuanians
luxembourg,luxembourger,luxembourger,luxembourger,luxembourgers
macedonia,macedonian,macedonian,macedonian,macedonians
madagascar,madagascan,malagasy,malagasy,malagasys
malawi,malawian,malawian,malawian,malawians
malaysia,malaysian,malaysian,malaysian,malaysians
maldives,maldivian,maldivian,maldivian,maldivians
mali,malian,malian,malian,malians
malta,maltese,maltese,maltese,malteses
mauritania,mauritanian,mauritanian,mauritanian,mauritanians
mauritius,mauritian,mauritian,mauritian,mauritians
mexico,mexican,mexican,mexican,mexicans
moldova,moldovan,moldovan,moldovan,moldovans
monaco,monégasque,monacan,monacan,monacans
mongolia,mongolian,mongolian,mongolian,mongolians
montenegro,montenegrin,montenegrin,montenegrin,montenegrins
morocco,moroccan,moroccan,moroccan,moroccans
mozambique,mozambican,mozambican,mozambican,mozambicans
namibia,namibian,namibian,namibian,namibians
nepal,nepalese,nepalese,nepalese,nepaleses
new zealand,new zealand,new zealander,new zealander,new zealanders
nicaragua,nicaraguan,nicaraguan,nicaraguan,nicaraguans
niger,nigerien,nigerien,nigerien,nigeriens
nigeria,nigerian,nigerian,nigerian,nigerians
north korea,north korean,north korean,north korean,north koreans
norway,norwegian,norwegian,norwegian,norwegians
oman,omani,omani,omani,omanis
pakistan,pakistani,pakistani,pakistani,pakistanis
panama,panamanian,panamanian,panamanian,panamanians
papua new guinea,papua new guinean,papunew guinean,papunew guinean,papunew guineans
paraguay,paraguayan,paraguayan,paraguayan,paraguayans
peru,peruvian,peruvian,peruvian,peruvians
the philippines,philippine,filipino,filipino,filipinos
poland,polish,pole,pole,poles
portugal,portuguese,portuguese,portuguese,portugueses
qatar,qatari,qatari,qatari,qataris
romania,romanian,romanian,romanian,romanians
russia,russian,russian,russian,russians
rwanda,rwandan,rwandan,rwandan,rwandans
saudi arabia,saudi arabian,saudi,saudi,saudis
scotland,scottish,scot,scot,scots
senegal,senegalese,senegalese,senegalese,senegaleses
serbia,serbian,serbian,serbian,serbians
seychelles,seychellois,seychellois,seychellois,seychellois
sierra leone,sierra leonian,sierrleonian,sierrleonian,sierrleonians
singapore,singaporean,singaporean,singaporean,singaporeans
slovakia,slovak,slovak,slovak,slovaks
slovenia,slovenian,slovenian,slovenian,slovenians
solomon islands,solomon islander,solomon islander,solomon islander,solomon islanders
somalia,somali,somali,somali,somalis
south africa,south african,south african,south african,south africans
south korea,south korean,south korean,south korean,south koreans
spain,spanish,spaniard,spaniard,spaniards
sri lanka,sri lankan,sri lankan,sri lankan,sri lankans
sudan,sudanese,sudanese,sudanese,sudaneses
suriname,surinamese,surinamese,surinamese,surinameses
swaziland,swazi,swazi,swazi,swazis
sweden,swedish,swede,swede,swedes
switzerland,swiss,swiss,swiss,swiss
syria,syrian,syrian,syrian,syrians
taiwan,taiwanese,taiwanese,taiwanese,taiwanese
tajikistan,tajik,tajik,tajik,tajiks
tanzania,tanzanian,tanzanian,tanzanian,tanzanians
thailand,thai,thai,thai,thais
togo,togolese,togolese,togolese,togoleses
trinidad and tobago,trinidadian,trinidadian,trinidadian,trinidadians
tunisia,tunisian,tunisian,tunisian,tunisians
turkey,turkish,turk,turk,turks
turkmenistan,turkmen,turkmen,turkmen,turkmens
tuvali,tuvaluan,tuvaluan,tuvaluan,tuvaluans
uganda,ugandan,ugandan,ugandan,ugandans
ukraine,ukrainian,ukrainian,ukrainian,ukrainians
united arab emirates (uae),emirati,emirati,emirati,emiratis
united arab emirates,emirati,emirati,emirati,emiratis
uae,emirati,emirati,emirati,emiratis
united kingdom,british,briton,briton,britons
england,british,briton,briton,britons
uk,british,briton,briton,britons
united states of america (usa),american,american,american,american
united states of america,american,american,american,american
usa,american,american,american,american
us,american,us citizen,us citizen,us citizens
u.s.a,american,us citizen,us citizen,us citizens
uruguay,uruguayan,uruguayan,uruguayan,uruguayans
uzbekistan,uzbek,uzbek,uzbek,uzbeks
vanuata,vanuatuan,vanuatuan,vanuatuan,vanuatuans
vatican city,vatican,vatican,vatican,vaticans
venezuela,venezuelan,venezuelan,venezuelan,venezuelans
vietnam,vietnamese,vietnamese,vietnamese,vietnameses
wales,welsh,welshman,welshwoman,welshmans
western samoa,western samoan,western samoan,western samoan,western samoans
yemen,yemeni,yemeni,yemeni,yemenis
yugoslavia,yugoslav,yugoslav,yugoslav,yugoslavs
zaire,zairean,zairean,zairean,zaireans
zambia,zambian,zambian,zambian,zambians
zimbabwe,zimbabwean,zimbabwean,zimbabwean,zimbabweans
europe,european,european,european,europeans
america,american,american,american,americans
asia,asian,asian,asian,asians
africa,african,african,african,africans
middle east,middle-eastern,middle-eastern,middle-eastern,middle-easterns
middle-east,middle-eastern,middle-eastern,middle-eastern,middle-easterns
south-america,south-american,south-american,south-american,south-americans
north-american,north-american,north-american,north-american,north-americans
california,californian,californian,californian,californians
new-york,new-yorker,new-yorker,new-yorker,new-yorkers
palestine,palestenian,palestenian,palestenian,palestenians
sunni,sunni,sunni,sunni,sunnis
israel,jewish,jewish,jewish,jews
israel,jew,jew,jew,jews
kurdistan,kurd,kurd,kurd,kurds
1 country nationality man woman plural
2 algeria algerian algerian algerian algerians
3 andorra andorran andorran andorran andorrans
4 angola angolan angolan angolan angolans
5 argentina argentinian argentinian argentinian argentinians
6 armenia armenian armenian armenian armenians
7 australia australian australian australian australians
8 austria austrian austrian austrian austrians
9 azerbaijan azerbaijani azerbaijani azerbaijani azerbaijanis
10 bahamas bahamian bahamian bahamian bahamians
11 bahrain bahraini bahraini bahraini bahrainis
12 bangladesh bangladeshi bangladeshi bangladeshi bangladeshis
13 barbados barbadian barbadian barbadian barbadians
14 belarus belarusian belarusian belarusian belarusians
15 belgium belgian belgian belgian belgians
16 belize belizian belizian belizian belizians
17 benin beninese beninese beninese benineses
18 bhutan bhutanese bhutanese bhutanese bhutaneses
19 bolivia bolivian bolivian bolivian bolivians
20 bosnia-herzegovina bosnian bosnian bosnian bosnians
21 botswana botswanan tswana tswana tswanas
22 brazil brazilian brazilian brazilian brazilians
23 britain british briton briton britons
24 brunei bruneian bruneian bruneian bruneians
25 bulgaria bulgarian bulgarian bulgarian bulgarians
26 burkina burkinese burkinese burkinese burkineses
27 burma (or myanmar) burmese burmese burmese burmeses
28 burundi burundian burundian burundian burundians
29 cambodia cambodian cambodian cambodian cambodians
30 cameroon cameroonian cameroonian cameroonian cameroonians
31 canada canadian canadian canadian canadians
32 cape verde islands cape verdean cape verdean cape verdean cape verdeans
33 chad chadian chadian chadian chadians
34 chile chilean chilean chilean chileans
35 china chinese chinese chinese chineses
36 colombia colombian colombian colombian colombians
37 congo congolese congolese congolese congoleses
38 costa rica costa rican costrican costrican costricans
39 croatia croatian croatian croatian croatians
40 cuba cuban cuban cuban cubans
41 cyprus cypriot cypriot cypriot cypriots
42 czech republic czech czech czech czechs
43 denmark danish dane dane danes
44 djibouti djiboutian djiboutian djiboutian djiboutians
45 dominica dominican dominican dominican dominicans
46 dominican republic dominican dominican dominican dominicans
47 ecuador ecuadorean ecuadorean ecuadorean ecuadoreans
48 egypt egyptian egyptian egyptian egyptians
49 el salvador salvadorean salvadorean salvadorean salvadoreans
50 england english englishman englishwoman englishmans
51 eritrea eritrean eritrean eritrean eritreans
52 estonia estonian estonian estonian estonians
53 ethiopia ethiopian ethiopian ethiopian ethiopians
54 fiji fijian fijian fijian fijians
55 finland finnish finn finn finns
56 france french frenchman frenchwoman frenchmans
57 gabon gabonese gabonese gabonese gaboneses
58 gambia gambian gambian gambian gambians
59 georgia georgian georgian georgian georgians
60 germany german german german germans
61 ghana ghanaian ghanaian ghanaian ghanaians
62 greece greek greek greek greeks
63 grenada grenadian grenadian grenadian grenadians
64 guatemala guatemalan guatemalan guatemalan guatemalans
65 guinea guinean guinean guinean guineans
66 guyana guyanese guyanese guyanese guyaneses
67 haiti haitian haitian haitian haitians
68 holland dutch dutchman dutchwoman dutchmans
69 netherlands dutch dutchman dutchwoman dutchmans
70 honduras honduran honduran honduran hondurans
71 hungary hungarian hungarian hungarian hungarians
72 iceland icelandic icelander icelander icelanders
73 india indian indian indian indians
74 indonesia indonesian indonesian indonesian indonesians
75 iran iranian iranian iranian iranians
76 iraq iraqi iraqi iraqi iraqis
77 ireland irish irishman irishwoman irishmans
78 republic of ireland irish irishman irishwoman irishmans
79 israel israeli israeli israeli israelis
80 italy italian italian italian italians
81 jamaica jamaican jamaican jamaican jamaicans
82 japan japanese japanese japanese japaneses
83 jordan jordanian jordanian jordanian jordanians
84 kazakhstan kazakh kazakh kazakh kazakhs
85 kenya kenyan kenyan kenyan kenyans
86 kuwait kuwaiti kuwaiti kuwaiti kuwaitis
87 laos laotian laotian laotian laotians
88 latvia latvian latvian latvian latvians
89 lebanon lebanese lebanese lebanese lebaneses
90 liberia liberian liberian liberian liberians
91 libya libyan libyan libyan libyans
92 liechtenstein liechtensteiner liechtensteiner liechtensteiner liechtensteiners
93 lithuania lithuanian lithuanian lithuanian lithuanians
94 luxembourg luxembourger luxembourger luxembourger luxembourgers
95 macedonia macedonian macedonian macedonian macedonians
96 madagascar madagascan malagasy malagasy malagasys
97 malawi malawian malawian malawian malawians
98 malaysia malaysian malaysian malaysian malaysians
99 maldives maldivian maldivian maldivian maldivians
100 mali malian malian malian malians
101 malta maltese maltese maltese malteses
102 mauritania mauritanian mauritanian mauritanian mauritanians
103 mauritius mauritian mauritian mauritian mauritians
104 mexico mexican mexican mexican mexicans
105 moldova moldovan moldovan moldovan moldovans
106 monaco monégasque monacan monacan monacans
107 mongolia mongolian mongolian mongolian mongolians
108 montenegro montenegrin montenegrin montenegrin montenegrins
109 morocco moroccan moroccan moroccan moroccans
110 mozambique mozambican mozambican mozambican mozambicans
111 namibia namibian namibian namibian namibians
112 nepal nepalese nepalese nepalese nepaleses
113 new zealand new zealand new zealander new zealander new zealanders
114 nicaragua nicaraguan nicaraguan nicaraguan nicaraguans
115 niger nigerien nigerien nigerien nigeriens
116 nigeria nigerian nigerian nigerian nigerians
117 north korea north korean north korean north korean north koreans
118 norway norwegian norwegian norwegian norwegians
119 oman omani omani omani omanis
120 pakistan pakistani pakistani pakistani pakistanis
121 panama panamanian panamanian panamanian panamanians
122 papua new guinea papua new guinean papunew guinean papunew guinean papunew guineans
123 paraguay paraguayan paraguayan paraguayan paraguayans
124 peru peruvian peruvian peruvian peruvians
125 the philippines philippine filipino filipino filipinos
126 poland polish pole pole poles
127 portugal portuguese portuguese portuguese portugueses
128 qatar qatari qatari qatari qataris
129 romania romanian romanian romanian romanians
130 russia russian russian russian russians
131 rwanda rwandan rwandan rwandan rwandans
132 saudi arabia saudi arabian saudi saudi saudis
133 scotland scottish scot scot scots
134 senegal senegalese senegalese senegalese senegaleses
135 serbia serbian serbian serbian serbians
136 seychelles seychellois seychellois seychellois seychellois
137 sierra leone sierra leonian sierrleonian sierrleonian sierrleonians
138 singapore singaporean singaporean singaporean singaporeans
139 slovakia slovak slovak slovak slovaks
140 slovenia slovenian slovenian slovenian slovenians
141 solomon islands solomon islander solomon islander solomon islander solomon islanders
142 somalia somali somali somali somalis
143 south africa south african south african south african south africans
144 south korea south korean south korean south korean south koreans
145 spain spanish spaniard spaniard spaniards
146 sri lanka sri lankan sri lankan sri lankan sri lankans
147 sudan sudanese sudanese sudanese sudaneses
148 suriname surinamese surinamese surinamese surinameses
149 swaziland swazi swazi swazi swazis
150 sweden swedish swede swede swedes
151 switzerland swiss swiss swiss swiss
152 syria syrian syrian syrian syrians
153 taiwan taiwanese taiwanese taiwanese taiwanese
154 tajikistan tajik tajik tajik tajiks
155 tanzania tanzanian tanzanian tanzanian tanzanians
156 thailand thai thai thai thais
157 togo togolese togolese togolese togoleses
158 trinidad and tobago trinidadian trinidadian trinidadian trinidadians
159 tunisia tunisian tunisian tunisian tunisians
160 turkey turkish turk turk turks
161 turkmenistan turkmen turkmen turkmen turkmens
162 tuvali tuvaluan tuvaluan tuvaluan tuvaluans
163 uganda ugandan ugandan ugandan ugandans
164 ukraine ukrainian ukrainian ukrainian ukrainians
165 united arab emirates (uae) emirati emirati emirati emiratis
166 united arab emirates emirati emirati emirati emiratis
167 uae emirati emirati emirati emiratis
168 united kingdom british briton briton britons
169 england british briton briton britons
170 uk british briton briton britons
171 united states of america (usa) american american american american
172 united states of america american american american american
173 usa american american american american
174 us american us citizen us citizen us citizens
175 u.s.a american us citizen us citizen us citizens
176 uruguay uruguayan uruguayan uruguayan uruguayans
177 uzbekistan uzbek uzbek uzbek uzbeks
178 vanuata vanuatuan vanuatuan vanuatuan vanuatuans
179 vatican city vatican vatican vatican vaticans
180 venezuela venezuelan venezuelan venezuelan venezuelans
181 vietnam vietnamese vietnamese vietnamese vietnameses
182 wales welsh welshman welshwoman welshmans
183 western samoa western samoan western samoan western samoan western samoans
184 yemen yemeni yemeni yemeni yemenis
185 yugoslavia yugoslav yugoslav yugoslav yugoslavs
186 zaire zairean zairean zairean zaireans
187 zambia zambian zambian zambian zambians
188 zimbabwe zimbabwean zimbabwean zimbabwean zimbabweans
189 europe european european european europeans
190 america american american american americans
191 asia asian asian asian asians
192 africa african african african africans
193 middle east middle-eastern middle-eastern middle-eastern middle-easterns
194 middle-east middle-eastern middle-eastern middle-eastern middle-easterns
195 south-america south-american south-american south-american south-americans
196 north-american north-american north-american north-american north-americans
197 california californian californian californian californians
198 new-york new-yorker new-yorker new-yorker new-yorkers
199 palestine palestenian palestenian palestenian palestenians
200 sunni sunni sunni sunni sunnis
201 israel jewish jewish jewish jews
202 israel jew jew jew jews
203 kurdistan kurd kurd kurd kurds
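The rows above give, for each country or region, an adjectival form plus masculine, feminine and plural demonyms; these back the NATIONALITY, NATION_MAN, NATION_WOMAN and NATION_PLURAL entities referenced later in the code. A minimal loading sketch, assuming the data is stored as a CSV named nationalities.csv with columns country, nationality, man, woman, plural (the file name and column names are assumptions for illustration only):

import pandas as pd

# Assumed file name and column names -- adjust to the actual dataset layout.
nationalities = pd.read_csv("nationalities.csv",
                            names=["country", "nationality", "man", "woman", "plural"])
print(nationalities[nationalities["country"] == "denmark"])
# Expected to show: denmark | danish | dane | dane | danes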


@ -0,0 +1,659 @@
3 Round Stones Inc
48 Factoring Inc
5Psolutions
Abt Associates
Accela
Accenture
Accuweather
Acxiom
Adaptive
Adobe Digital Government
Aidin
Alarmcom
Allianz
Allied Van Lines
Allstate Insurance Group
Alltuition
Altova
Amazon Web Services
American Red Ball Movers
Amida Technology Solutions
Analytica
Apextech LLC
Appallicious
Aquicore
Archimedes Inc
Areavibes Inc
Arpin Van Lines
Arrive Labs
Asc Partners
Asset4
Atlas Van Lines
Atsite
Aunt Bertha Inc
Aureus Sciences Now Part Of Elsevier
Autogrid Systems
Avalara
Avvo
Ayasdi
Azavea
Balefire Global
Barchart
Be Informed
Bekins
Berkery Noyes Mandasoft
Berkshire Hathaway
Betterlesson
Billguard
Bing
Biovia
Bizvizz
Blackrock
Bloomberg
Booz Allen Hamilton
Boston Consulting Group
Boundless
Bridgewater
Brightscope
Buildfax
Buildingeye
Buildzoom
Business And Legal Resources
Business Monitor International
Calcbench Inc
Cambridge Information Group
Cambridge Semantics
Can Capital
Canon
Capital Cube
Cappex
Captricity
Careset Systems
Caresetcom
Carfax
Caspio
Castle Biosciences
Cb Insights
Ceiba Solutions
Center For Responsive Politics
Cerner
Certara
CGI
Charles River Associates
Charles Schwab Corp.
Chemical Abstracts Service
Child Care Desk
Chubb
Citigroup
Cityscan
Citysourced
Civic Impulse LLC
Civic Insight
Civinomics
Civis Analytics
Clean Power Finance
Clearhealthcosts
Clearstory Data
Climate Corporation
Clinicast
Cloudmade
Cloudspyre
Code For America
Coden
Collective Ip
College Abacus An Ecmc Initiative
College Board
Compared Care
Compendia Bioscience Life Technologies
Compliance And Risks
Computer Packages Inc
Connectdot LLC
Connectedu
Connotate
Construction Monitor LLC
Consumer Reports
Coolclimate
Copyright Clearance Center
Corelogic
Costquest
Credit Karma
Credit Sesame
Crowdanalytix
Dabo Health
Datalogix
Datamade
Datamarket
Datamyne
Dataweave
Deloitte
Demystdata
Department Of Better Technology
Development Seed
Docket Alarm Inc
Dow Jones Co
Dun Bradstreet
Earth Networks
Earthobserver App
Earthquake Alert
Eat Shop Sleep
Ecodesk
Einstitutional
Embark
Emc
Energy Points Inc
Energy Solutions Forum
Enervee Corporation
Enigmaio
Ensco
Environmental Data Resources
Epsilon
Equal Pay For Women
Equifax
Equilar
Ernst Young Llp
Escholar LLC
Esri
Estately
Everyday Health
Evidera
Experian
Expert Health Data Programming Inc
Exversion
Ezxbrl
Factset
Factual
Farmers
Farmlogs
Fastcase
Fidelity Investments
Findthebestcom
First Fuel Software
Firstpoint Inc
Fitch
Flightaware
Flightstats
Flightview
Foodtech Connect
Forrester Research
Foursquare
Fujitsu
Funding Circle
Futureadvisor
Fuzion Apps Inc
Gallup
Galorath Incorporated
Garmin
Genability
Genospace
Geofeedia
Geolytics
Geoscape
Getraised
Github
Glassy Media
Golden Helix
Goodguide
Google Maps
Google Public Data Explorer
Government Transaction Services
Govini
Govtribe
Govzilla Inc
Gradiant Research LLC
Graebel Van Lines
Graematter Inc
Granicus
Greatschools
Guidestar
H3 Biomedicine
Harris Corporation
Hdscores Inc
Headlight
Healthgrades
Healthline
Healthmap
Healthpocket Inc
Hellowallet
Here
Honest Buildings
Hopstop
Housefax
Hows My Offer
Ibm
Ideas42
Ifactor Consulting
Ifi Claims Patent Services
Imedicare
Impact Forecasting Aon
Impaq International
Intuit
Importio
Ims Health
Incadence
Indoors
Infocommerce Group
Informatica
Innocentive
Innography
Innovest Systems
Inovalon
Inrix Traffic
Intelius
Intermap Technologies
Investormill
Iodine
Iphix
Irecycle
Itriage
Ives Group Inc
Iw Financial
Jj Keller
Jp Morgan Chase
Junar Inc
Junyo
Jurispect
Kaiser Permanante
Karmadata
Keychain Logistics Corp.
Kidadmit Inc
Kimono Labs
Kld Research
Knoema
Knowledge Agency
Kpmg
Kroll Bond Ratings Agency
Kyruus
Lawdragon
Legal Science Partners
Legcyte
Legination Inc
Legistorm
Lenddo
Lending Club
Level One Technologies
Lexisnexis
Liberty Mutual Insurance Cos
Lilly Open Innovation Drug Discovery
Liquid Robotics
Locavore
Logixdata LLC
Loopnet
Loqate Inc
Loseitcom
Loveland Technologies
Lucid
Lumesis Inc
Mango Transit
Mapbox
Maponics
Mapquest
Marinexplore Inc
Marketsense
Marlin Associates
Marlin Alter And Associates
Mcgraw Hill Financial
Mckinsey
Medwatcher
Mercaris
Merrill Corp.
Merrill Lynch
Metlife
Mhealthcoach
Microbilt Corporation
Microsoft Corporation
Mint
Moodys
Morgan Stanley
Morningstar Inc.
Mozio
Muckrockcom
Munetrix
Municode
National Van Lines
Nationwide Mutual Insurance Company
Nautilytics
Navico
Nera Economic Consulting
Nerdwallet
New Media Parents
Next Step Living
Nextbus
Ngap Incorporated
Nielsen
Noesis
Nonprofitmetrics
North American Van Lines
Noveda Technologies
Nucivic
Numedii
Oliver Wyman
Ondeck
Onstar
Ontodia Inc.
Onvia
Open Data Nation
Opencounter
Opengov
Openplans
Opportunityspace Inc.
Optensity
Optigov
Optuminsight
Orlin Research
Osisoft
Otc Markets
Outline
Oversight Systems
Overture Technologies
Owler
Palantir Technologies
Panjiva
Parsons Brinckerhoff
Patentlyo
Patientslikeme
Pave
Paxata
Payscale Inc.
Peerj
People Power
Persint
Personal Democracy Media
Personal Inc.
Personalis
Petersons
Pev4Mecom
Pixia Corp.
Placeilivecom
Planetecosystems
Plotwatt
Plusu
Policymap
Politify
Poncho App
Popvox
Porch
Possibilityu
Poweradvocate
Practice Fusion
Predilytics
Pricewaterhousecoopers Pwc
Programmableweb
Progressive Insurance Group
Propeller Health
Propublica
Publicengines
Pya Analytics
Qado Energy Inc.
Quandl
Quertle
Quid
R R Donnelley
Rand Corporation
Rand Mcnally
Rank And Filed
Ranku
Rapid Cycle Solutions
Realtorcom
Recargo
Recipal
Redfin
Redlaser
Reed Elsevier
Rei Systems
Relationship Science
Remi
Retroficiency
Revaluate
Revelstone
Rezolve Group
Rivet Software
Roadify Transit
Robinson Yu
Russell Investments
Sage Bionetworks
SAP
Sap
Sas
Scale Unlimited
Science Exchange
Seabourne
Seeclickfix
Sigfig
Simple Energy
Simpletuition
Slashdb
Smart Utility Systems
Smartasset
Smartprocure
Smartronix
Snapsense
Social Explorer
Social Health Insights
Socialeffort Inc.
Socrata
Solar Census
Solarlist
Sophic Systems Alliance
Sp Capital Iq
Spacecurve
Speso Health
Spikes Cavell Analytic Inc.
Splunk
Spokeo
Spotcrime
Spotherocom
Stamen Design
Standard And Poors
State Farm Insurance
Sterling Infosystems
Stevens Worldwide Van Lines
Stillwater Supercomputing Inc.
Stocksmart
Stormpulse
Streamlink Software
Streetcred Software Inc.
Streeteasy
Suddath
Symcat
Synthicity
T Rowe Price
Tableau Software
Tagnifi
Telenav
Tendril
Teradata
The Advisory Board Company
The Bridgespan Group
The Docgraph Journal
The Govtech Fund
The Schork Report
The Vanguard Group
Think Computer Corporation
Thinknum
Thomson Reuters
Topcoder
Towerdata
Transparagov
Transunion
Trialtrove
Trialx
Trintech
Truecar
Trulia
Trustedid
Tuvalabs
Uber
Unigo LLC
United Mayflower
Urban Airship
Urban Mapping Inc.
Us Green Data
Us News Schools
Usaa Group
Ussearch
Verdafero
Vimo
BioFlower
MysticWeb
DeepOntoscomy
6Sigma
Visualdod LLC
Vital Axiom Niinja
Vitalchek
Vitals
Vizzuality
Votizen
Walk Score
Watersmart Software
Wattzon
Way Better Patents
Weather Channel
Weather Decision Technologies
Weather Underground
Webfilings
Webitects
Webmd
Weight Watchers
Wemakeitsafer
Wheaton World Wide Moving
Whitby Group
Wolfram Research
Wolters Kluwer
Workhands
Xatori
Xcential
Xdayta
Xignite
Yahoo
Yei Healthcare
Yelp
Yourmapper
Zillow
Zocdoc
Zonability
Zoner
Zurich Insurance Risk Room
Smith'S
H&M
Ministry Of Defence
Ministry Of Agriculture
NSA
23 And Me
E&Y
Ortiz LLC
Hill Inc.
Underwood Group
White, Nelson and Townsend
Lester-Smith
Rosales-Mcguire
Johnson, Wallace and Santos
Macdonald-Clark
Scott Group
Mills, Smith and Lopez
Vazquez-Riggs
Marshall, Hernandez and Simpson
Mayer-Watkins
Smith Ltd.
Parker, Williams and Hill
Jones, Mitchell and Williams
Aguilar LLC
Thomas, Holt and Myers
Mendoza-Thompson
Johnson Inc.
Rose, Turner and Thompson
Weeks-Rivas
Frost LLC
Henderson, Hicks and Brown
Davis, Reynolds and Williamson
Taylor-Jones
Glover, Ruiz and Armstrong
Nguyen-Johnson
Hubbard-Thomas
Jones, Smith and Davis
Hawkins, Richardson and Santana
Butler-Peters
Barnett, Melton and Garcia
Valentine-Murray
Weeks, Smith and Jones
Green Inc.
Fernandez Inc.
Mclaughlin Ltd.
Drake PLC
Conway Inc.
Becker-Shaffer
Hopkins, Marshall and Bruce
Ramsey-Johnson
Bennett-Howell
Smith, Roberts and Turner
Allen, Mitchell and Jones
Davis Ltd.
Gray, Hawkins and Williamson
Freeman, Ho and Hoffman
Clark, Romero and Hall
Williams LLC
Hubbard, Fox and Gillespie
Valenzuela, King and Acosta
Sharp, Lynn and Jones
Williams, Morgan and Lynch
Watson, Jones and Wright
Mayo-Walters
Smith-Lawrence
Vasquez-Rivas
Davidson, Holmes and Rodriguez
Thomas and Sons
Weber-Santana
Evans-Bonilla
Larsen Ltd.
Brown-Weaver
Ford LLC
Rogers-Baxter
White, Willis and Hoffman
Hamilton, Diaz and Contreras
Hoover, Morris and Johnson
Lopez-Lang
Rivera, Patel and Guerra
Garcia-Smith
Brown-Oneal
Young-Stokes
Garcia-Roberson
Evans-Miller
Perry-Sullivan
Hinton LLC
Kelly-Green
Powers-Garcia
Ellis-Ingram
Huber LLC
Baker, Moody and Williams
Carr-Schaefer
Coleman Group
Underwood-Brown
Mccarthy-Hill
Wolf-Carpenter
Graham, Ochoa and Vasquez
Shepherd Ltd.
Michael Inc.
Cantrell Ltd.
Fritz-Armstrong
Miller Ltd.
Lopez, Santos and Coleman
Craig, Palmer and Quinn
Sanders-Gill
Rodriguez, West and Lynch
Olsen, Mitchell and Jackson
Owens, Duran and Oneal
Thomas-James
Moore LLC
Green Group
Lara-Cruz
Crawford PL
U.N
NATO
Seeds of peace
The Bill & Melinda Gates Foundation
AppleSeeds
U.N.
CNN
CBS
BBC
SKY
Sky News


@ -0,0 +1,126 @@
I want to increase limit on my card # [CREDIT_CARD] for certain duration of time. is it possible?
My credit card [CREDIT_CARD] has been lost, Can I request you to block it.
Need to change billing date of my card [CREDIT_CARD]
I want to upadte my primary and secondary address to same: [ADDRESS]
In case of my child's account, we need to add [PERSON] as guardian
Are there any charges applied for money transfer from [IBAN] to other bank accounts
Are there any charges applied to widraw money from ATM with the card [CREDIT_CARD]
Not getting bank documents on my addres. Can you please validate the following [ADDRESS]
Please update billing addrress with [ADDRESS] for this card: [CREDIT_CARD]
Need to see last 10 transaction of card [CREDIT_CARD]
I have lost my card [CREDIT_CARD]. Could you please block my credit card ASAP ? , My name is [PERSON].
My card [CREDIT_CARD] is expiring this month. Please let me know process to it's extend validity.
I have done an online order but didn't get any message on my registered [PHONE_NUMBER]. Could you please look into it ?
What is procedure to redeem points won on credit card [CREDIT_CARD] transactions ?
My card [CREDIT_CARD] expires soon – when will I get a new one?
How do I check my balance on my credit card?
Could I change the payment due date of my credit card?
How can I request a new credit card pin ?
Can I withdraw cash using my card [CREDIT_CARD] at aTM center ?
How do I change the address linked to my credit card to [ADDRESS]?
How do I open my credit card statement?
I'm originally from [COUNTRY]
I will be travelling to [COUNTRY] next week, so I need my passport to be ready by then
Who's coming to [COUNTRY] with me?
[COUNTRY] was super fun to visit!
Could you please email me the statement for laste month , my credit card number is [CREDIT_CARD]?
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL]?
How do I change my address to [ADDRESS] for post mail?
My name appears incorrectly on credit card statement could you please correct it to [TITLE] [PERSON]?
card number [CREDIT_CARD] is lost, can you please send a new one to [ADDRESS] i am in [CITY] for a business trip
Please transfer all funds from my account to this hackers' [EMAIL]
I can't browse to your site, keep getting address [IP_ADDRESS] blocked error
My religion does not allow speaking to bots, they are evil and hacked by the Devil
Excuse me, Sir bot, but I really don't like this tone
WHAT ??? I DONT KNOW WHAT TO PRESS NEXT!!! ? !! ?!
Please have the manager call me at [PHONE_NUMBER] I'd like to join accounts with ms. [FIRST_NAME]
Inject SELECT * FROM Users WHERE clinet_ip = ?%//!%20\|[IP_ADDRESS]|%20/
[FIRST_NAME], can I please speak to your boss?
May I request to have the statement sent to [ADDRESS]?
Will my account stay active? It's under my partner's name [PERSON]
What are my options?
Bot: Where would you like this to be sent to? User: [ADDRESS]
Bot: What's the name on the account? User: [PERSON]
I would like to stop receiving messages to [PHONE_NUMBER]
CAN I SPEAK TO A REAL PERSON?!?!
I would like to remove my kid [FIRST_NAME] from the will. How do I do that?
The name in the account is not correct, please change it to [PERSON]
Hello I moved, please update my new address is [ADDRESS]
I need to add addresses, here they are: [ADDRESS], [ADDRESS]
Please send my portfolio to this email [EMAIL]
Hello, this is [TITLE] [PERSON]. Who are you?
I want to add [PERSON] as a beneficiary to my account
I want to cancel my card [CREDIT_CARD] because I lost it
Please block card no [CREDIT_CARD]
What is the limit for card [CREDIT_CARD]?
Can someone call me on [PHONE_NUMBER]? I have some questions about opening an account.
My nam is [FIRST_NAME]
I'm moving out of the country, so please cancel my subscription
My name is [PERSON] but everyone calls me [FIRST_NAME]
Please tell me your date of birth. It's [BIRTHDAY]
You said your email is [EMAIL]. Is that correct?
I once lived in [ADDRESS]. I now live in [ADDRESS]
I'd like to order a taxi to [ADDRESS]
Please charge my credit card. Number is [CREDIT_CARD]
What's your email? [EMAIL]
What's your credit card? [CREDIT_CARD]
What's your name? [PERSON]
What's your last name? [LAST_NAME]
How can we reach you? You can call [PHONE_NUMBER]
I'd like it to be sent to [ADDRESS]
Meet me at [ADDRESS]
So where are we meeting? There's this nice new Thai place downtown. Cool, what's the address? Oh do they serve vegan stuff? It's in [ADDRESS]
Hi [FIRST_NAME], I'm contacting you about a problem I have with sending a wire transfer using this IBAN [IBAN]
She was born on [BIRTHDAY]. Her maiden name is [LAST_NAME]
Sometimes people call me [FIRST_NAME]
Maybe it's under [PERSON]
It's like that since [BIRTHDAY]
Just posted a photo [URL]
My website is [URL]
My IBAN is [IBAN]
I've shared files with you [URL]
I work for [ORGANIZATION]
[PERSON] from [ORGANIZATION] is the keynote speaker
[FIRST_NAME] is from [ORGANIZATION]
The address of [ORGANIZATION] is [ADDRESS]
His social security number is [US_SSN]
Here's my SSN: [US_SSN]
[FIRST_NAME] is a very sympathetic person. He's also a good listener
[FIRST_NAME] is very reliable. You can always depend on him.
Why is [FIRST_NAME] so impulsive?
[PERSON] will be talking in the conference
have you heard [PERSON] speak yet?
Have you been to a [PERSON] concert before?
I'm so jealous! said [FIRST_NAME] to [FIRST_NAME]
The true gender of [FIRST_NAME] has been under debate for years, but the riff and building energy is a rock masterpiece regardless.
For my take on Mr. [LAST_NAME], see Guilty Pleasures: 5 Musicians Of The 70s You're Supposed To Hate (But Secretly Love)
Unlike the [LAST_NAME] novel, it's not about necrophilia. What it is about, I suppose is anyone's guess. A brilliant piece of baroque pop.
One of the most depressing songs on the list. He's injured from the waist down from [COUNTRY], but [FIRST_NAME] just has to get laid. Don't go to town, [FIRST_NAME]!
Is there a better crafted pop song on this list? [LAST_NAME] and [LAST_NAME] were precision engineers.
C'mon, sing it with me: "You picked a fine time to leave me [FIRST_NAME], four hungry children and a crop in the field..."
A tribute to [PERSON] – sadly, she wasn't impressed.
When they weren't singing about Hobbits, satanic felines and interstellar journeys, they were singing about the verses from [PERSON]'s Cautionary Tales. Is there a better example of unbridled creativity than early [LAST_NAME]?
A great song made even greater by a mandolin coda (not by [PERSON]).
[PERSON] listed his top 20 songs for Entertainment Weekly and had the balls to list this song at #15. (What did he put at #1 you ask? Answer:"Tube Snake Boogie" by [PERSON] – go figure)
From the film American graffiti (also features [PERSON]. What's not to love?
You can tell [FIRST_NAME] was a huge [PERSON] fan. Written when he was only 14.
This song by ex-Zombie [LAST_NAME] is a perfect example of why you shouldn't concentrate on the order of this list. An argument could be made that this should be at number one, and I wouldn't argue with it.
The title refers to [STREET] Street in [CITY]. It was on this street that many of the clubs where Metallica first played were situated. "Battery is found in me" shows that these early shows on [STREET] Street were important to them. Battery is where "lunacy finds you" and you "smash through the boundaries."
Blink-182 pay tribute here to the [COUNTRY]. Producer [PERSON] explained to Fuse TV: "We all liked the idea of writing a song about our state, where we live and love. To me it's the most beautiful place in the world, this song was us giving credit to how lucky we are to have lived here and grown up here, raising families here, the whole thing."
It may be too that [LAST_NAME] was influenced by an earlier song, "Carry Me Back To [COUNTRY]," which was arranged and sung by [PERSON] in 1847 (though [LAST_NAME]'s song was actually about a boat!).
The [PERSON] version recorded for [ORGANIZATION] became the first celebrity recording by a classical musician to sell one million copies. The song was awarded the seventh gold disc ever granted.
In [COUNTRY]] they have company songs, musical expressions of employee loyalty sung by salarymen. Unfortunately, as regular RR commenter [PERSON] points out, "most are horrible".
"The big three" of The Big Three Killed My Baby are the car manufacturers that dominate the economy of the White Stripes' home city [CITY]: [ORGANIZATION], [ORGANIZATION] and [ORGANIZATION]. "Don't feed me planned obsolescence," says [PERSON] in an uncharacteristically political song, lamenting the demise of the unions in the 60s.
[ORGANIZATION] songwriter [PERSON] employs corporate lingo in the first verse of his [ORGANIZATION] Resignation Letter
Mission Statement: This non-profit founded by radio executives "serves as an advocate for the value of music" and "supports its songwriters, composers and publishers by taking care of an important aspect of their careers – getting paid," according to the [ORGANIZATION] website. They offer blanket music licenses to businesses and organizations that allow them to play nearly 13 million musical works.
The [ORGANIZATION] Orchestra was founded in 1929. Since then, the TSO has grown from a volunteer community orchestra to a fully professional orchestra serving Southern [COUNTRY]
Celebrating its 10th year in [CITY], [ORGANIZATION] is a 501(c)3 that invites songwriters from around the world to Texas to share the universal language of music in collaborations designed to bridge cultures, build friendships and cultivate peace.
[ORGANIZATION] is the brainchild of our 3 founders: [PERSON], [PERSON] and [PERSON]. The idea was born (on the beach) while they were constructing a website to be the basis of another start-up idea.
[ORGANIZATION] is an [NATIONALITY] multinational investment bank and financial services company
Zoolander is a 2001 American action-comedy film directed by [PERSON] and starring [LAST_NAME]
During the 1990s, [ORGANIZATION] invested heavily in new microprocessor designs fostering the rapid growth of the computer industry.
On 29 March 2017, the [NATIONALITY] government formally began the process of withdrawal by invoking Article 50 of the Treaty on European Union
[FIRST_NAME] shouted at [FIRST_NAME]: "What are you doing here?"
[LAST_NAME] spent a year at [ORGANIZATION] as the assistant to [PERSON], and the following year at [ORGANIZATION] in [CITY], which later became [ORGANIZATION] in 1965.
[LAST_NAME] began writing as a teenager, publishing her first story, "The Dimensions of a Shadow", in 1950 while studying English and journalism at the University of [CITY].
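Each line above is an utterance template; bracketed placeholders such as [PERSON], [ADDRESS] or [CREDIT_CARD] mark where synthetic PII values are injected, and the character offsets of the injected values become the entity spans used for evaluation. A rough substitution sketch, with made-up values purely for illustration (this is not the project's actual generator and it handles only one occurrence per placeholder):

template = "My name is [PERSON] and I live in [ADDRESS]"
values = {"PERSON": "Jane Doe", "ADDRESS": "12 Main St"}  # illustrative values only
text, spans = template, []
for entity, value in values.items():
    placeholder = "[{}]".format(entity)
    start = text.find(placeholder)
    text = text.replace(placeholder, value, 1)
    spans.append((entity, value, start, start + len(value)))
# text  -> "My name is Jane Doe and I live in 12 Main St"
# spans -> [("PERSON", "Jane Doe", 11, 19), ("ADDRESS", "12 Main St", 34, 44)]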


@ -0,0 +1,541 @@
from typing import List, Counter, Dict
import spacy
import srsly
from spacy.tokens import Token
from tqdm import tqdm
from presidio_evaluator import span_to_tag, tokenize
SPACY_PRESIDIO_ENTITIES = {
"ORG": "ORGANIZATION",
"NORP": "ORGANIZATION",
"GPE": "LOCATION",
"LOC": "LOCATION",
"FAC": "LOCATION",
"PERSON": "PERSON",
"LOCATION": "LOCATION",
"ORGANIZATION": "ORGANIZATION"
}
PRESIDIO_SPACY_ENTITIES = {
"ORGANIZATION": "ORG",
"COUNTRY": "GPE",
"CITY": "GPE",
"LOCATION": "GPE",
"PERSON": "PERSON",
"FIRST_NAME": "PERSON",
"LAST_NAME": "PERSON",
"NATION_MAN": "GPE",
"NATION_WOMAN": "GPE",
"NATION_PLURAL": "GPE",
"NATIONALITY": "GPE",
"GPE": "GPE",
"ORG": "ORG",
}
class Span:
"""
Holds information about the start, end, type and value
of an entity in a text
"""
def __init__(self, entity_type, entity_value, start_position, end_position):
self.entity_type = entity_type
self.entity_value = entity_value
self.start_position = start_position
self.end_position = end_position
def intersect(self, other, ignore_entity_type: bool):
"""
Checks if self intersects with a different Span
:return: If intersecting, returns the number of
intersecting characters.
If not, returns 0
"""
# if they do not overlap the intersection is 0
if self.end_position < other.start_position or other.end_position < \
self.start_position:
return 0
# if we are accounting for entity type a diff type means intersection 0
if not ignore_entity_type and (self.entity_type != other.entity_type):
return 0
# otherwise the intersection is min(end) - max(start)
return min(self.end_position, other.end_position) - max(
self.start_position,
other.start_position)
def __repr__(self):
return "Type: {}, value: {}, start: {}, end: {}".format(
self.entity_type, self.entity_value, self.start_position,
self.end_position)
def __eq__(self, other):
return self.entity_type == other.entity_type \
and self.entity_value == other.entity_value \
and self.start_position == other.start_position \
and self.end_position == other.end_position
def __hash__(self):
return hash(('entity_type', self.entity_type,
'entity_value', self.entity_value,
'start_position', self.start_position,
'end_position', self.end_position))
@classmethod
def from_json(cls, data):
return cls(**data)
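# Illustrative example (not part of the class): intersect returns the number of
# overlapping characters, and entity types must match unless ignore_entity_type=True.
# a = Span("PERSON", "Jane Doe", 11, 19)
# b = Span("PERSON", "Doe", 16, 19)
# a.intersect(b, ignore_entity_type=False)  # -> 3
# c = Span("LOCATION", "Doe", 16, 19)
# a.intersect(c, ignore_entity_type=False)  # -> 0 (different entity types)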
class SimpleSpacyExtensions(object):
def __init__(self, **kwargs):
"""
Serialization of Spacy Token extensions.
see https://spacy.io/api/token#set_extension
:param kwargs: dictionary of spacy extensions and their values
"""
self.__dict__.update(kwargs)
def to_dict(self):
return self.__dict__
class SimpleToken(object):
"""
A class mimicking the Spacy Token class, for serialization purposes
"""
def __init__(self, text, idx, tag_=None,
pos_=None,
dep_=None,
lemma_=None,
spacy_extensions: SimpleSpacyExtensions = None,
**kwargs):
self.text = text
self.idx = idx
self.tag_ = tag_
self.pos_ = pos_
self.dep_ = dep_
self.lemma_ = lemma_
# serialization for Spacy extensions:
if spacy_extensions is None:
self._ = SimpleSpacyExtensions()
else:
self._ = spacy_extensions
self.params = kwargs
@classmethod
def from_spacy_token(cls, token):
if isinstance(token, SimpleToken):
return token
elif isinstance(token, Token):
if token._ and token._._extensions:
extensions = list(token._.token_extensions.keys())
extension_values = {}
for extension in extensions:
extension_values[extension] = token._.__getattr__(extension)
spacy_extensions = SimpleSpacyExtensions(**extension_values)
else:
spacy_extensions = None
return cls(text=token.text,
idx=token.idx,
tag_=token.tag_,
pos_=token.pos_,
dep_=token.dep_,
lemma_=token.lemma_,
spacy_extensions=spacy_extensions)
def to_dict(self):
return {
"text": self.text,
"idx": self.idx,
"tag_": self.tag_,
"pos_": self.pos_,
"dep_": self.dep_,
"lemma_": self.lemma_,
"_": self._.to_dict()
}
def __repr__(self):
return self.text
@classmethod
def from_json(cls, data):
if '_' in data:
data['spacy_extensions'] = \
SimpleSpacyExtensions(**data['_'])
return cls(**data)
class InputSample(object):
def __init__(self, full_text: str, masked: str, spans: List[Span],
tokens=[], tags=[],
create_tags_from_span=True, scheme="IO", metadata=None, template_id=None):
"""
Holds all the information needed for evaluation in the
presidio-evaluator framework.
Can generate tags (BIO/BILOU/IO) based on spans
:param full_text: The raw text of this sample
:param masked: Masked version of the raw text (desired output)
:param spans: List of spans for entities
:param create_tags_from_span: True if tags (tokens+tags) should be added
:param scheme: IO, BIO/IOB or BILOU. Only applicable if create_tags_from_span=True
:param tokens: list of items of type SimpleToken
:param tags: list of strings representing the label for each token,
given the scheme
:param metadata: A dictionary of additional metadata on the sample,
in the English (or other language) vocabulary
:param template_id: Original template (utterance) of sample, in case it was generated
"""
self.full_text = full_text
self.masked = masked
self.spans = spans if spans else []
self.metadata = metadata
# generated samples have a template from which they were generated
if not template_id and self.metadata:
self.template_id = self.metadata.get("Template#")
else:
self.template_id = template_id
if create_tags_from_span:
tokens, tags = self.get_tags(scheme)
self.tokens = tokens
self.tags = tags
else:
self.tokens = tokens
self.tags = tags
def __repr__(self):
return "Full text: {}\n" \
"Spans: {}\n" \
"Tokens: {}\n" \
"Tags: {}\n".format(self.full_text, self.spans, self.tokens,
self.tags)
def to_dict(self):
return {
"full_text": self.full_text,
"masked": self.masked,
"spans": [span.__dict__ for span in self.spans],
"tokens": [SimpleToken.from_spacy_token(token).to_dict()
for token in self.tokens],
"tags": self.tags,
"template_id": self.template_id,
"metadata": self.metadata
}
@classmethod
def from_json(cls, data):
if 'spans' in data:
data['spans'] = [Span.from_json(span) for span in data['spans']]
if 'tokens' in data:
data['tokens'] = [SimpleToken.from_json(val) for val in
data['tokens']]
return cls(**data, create_tags_from_span=False)
def get_tags(self, scheme="IOB"):
start_indices = [span.start_position for span in self.spans]
end_indices = [span.end_position for span in self.spans]
tags = [span.entity_type for span in self.spans]
tokens = tokenize(self.full_text)
labels = span_to_tag(scheme=scheme, text=self.full_text, tag=tags,
start=start_indices, end=end_indices,
tokens=tokens)
return tokens, labels
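# Illustrative example (IO scheme): for full_text "My name is Jane" with a single
# PERSON span over "Jane" (start=11, end=15), get_tags("IO") is expected to return
# tokens ["My", "name", "is", "Jane"] and labels ["O", "O", "O", "PERSON"].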
def to_conll(self, translate_tags, scheme="BIO"):
conll = []
for i, token in enumerate(self.tokens):
if translate_tags:
label = self.translate_tag(self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
else:
label = self.tags[i]
conll.append({"text": token.text,
"pos": token.pos_,
"tag": token.tag_,
"Template#": self.metadata['Template#'],
"gender": self.metadata['Gender'],
"country": self.metadata['Country'],
"label": label},
)
return conll
def get_template_id(self):
return self.metadata['Template#']
@staticmethod
def create_conll_dataset(dataset, translate_tags=True, to_bio=True):
import pandas as pd
conlls = []
i = 0
for sample in dataset:
if to_bio:
sample.bilou_to_bio()
conll = sample.to_conll(translate_tags=translate_tags)
for token in conll:
token['sentence'] = i
conlls.append(token)
i += 1
return pd.DataFrame(conlls)
def to_spacy(self, entities=None, translate_tags=True):
entities = [(span.start_position, span.end_position, span.entity_type)
for span in self.spans if (entities is None) or (span.entity_type in entities)]
new_entities = []
if translate_tags:
for entity in entities:
new_tag = self.translate_tag(entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
new_entities.append((entity[0], entity[1], new_tag))
else:
new_entities = entities
return (self.full_text,
{"entities": new_entities})
@classmethod
def from_spacy(cls, text, annotations, translate_from_spacy=True):
spans = []
for annotation in annotations:
tag = cls.rename_from_spacy_tags([annotation[2]])[0] if translate_from_spacy else annotation[2]
span = Span(tag, text[annotation[0]: annotation[1]], annotation[0], annotation[1])
spans.append(span)
return cls(full_text=text, masked=None, spans=spans)
@staticmethod
def create_spacy_dataset(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
def template_sort(x):
return x.metadata['Template#']
if sort_by_template_id:
dataset.sort(key=template_sort)
return [sample.to_spacy(entities=entities, translate_tags=translate_tags) for sample in dataset]
def to_spacy_json(self, entities=None, translate_tags=True):
token_dicts = []
for i, token in enumerate(self.tokens):
if entities:
tag = self.tags[i] if self.tags[i][2:] in entities else 'O'
else:
tag = self.tags[i]
if translate_tags:
tag = self.translate_tag(tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
token_dicts.append({
"orth": token.text,
"tag": token.tag_,
"ner": tag
})
spacy_json_sentence = {
"raw": self.full_text,
"sentences": [{
"tokens": token_dicts
}
]
}
return spacy_json_sentence
def to_spacy_doc(self):
doc = self.tokens
spacy_spans = []
for span in self.spans:
start_token = [token.i for token in self.tokens if token.idx == span.start_position][0]
end_token = [token.i for token in self.tokens if token.idx + len(token.text) == span.end_position][0] + 1
spacy_span = spacy.tokens.span.Span(doc, start=start_token, end=end_token,
label=span.entity_type)
spacy_spans.append(spacy_span)
doc.ents = spacy_spans
return doc
@staticmethod
def create_spacy_json(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
def template_sort(x):
return x.metadata['Template#']
if sort_by_template_id:
dataset.sort(key=template_sort)
json_str = []
for i, sample in tqdm(enumerate(dataset)):
paragraph = sample.to_spacy_json(entities=entities, translate_tags=translate_tags)
json_str.append({
"id": i,
"paragraphs": [paragraph]
})
return json_str
@staticmethod
def translate_tags(tags, dictionary, ignore_unknown):
"""
Translates entity types from one set to another
:param tags: list of entities to translate, e.g. ["LOCATION","O","PERSON"]
:param dictionary: Dictionary of old tags to new tags
:param ignore_unknown: Whether to put "O" when word not in dictionary or keep old entity type
:return: list of translated entities
"""
new_tags = []
for tag in tags:
new_tags.append(InputSample.translate_tag(tag, dictionary, ignore_unknown))
return new_tags
@staticmethod
def translate_tag(tag, dictionary, ignore_unknown):
has_prefix = len(tag) > 2 and tag[1] == '-'
no_prefix = tag[2:] if has_prefix else tag
if no_prefix in dictionary.keys():
return tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
else:
if ignore_unknown:
return "O"
else:
return tag
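# Illustrative examples: translate_tag("B-GPE", SPACY_PRESIDIO_ENTITIES, ignore_unknown=True)
# returns "B-LOCATION" (prefix kept, entity type mapped), while an unmapped type such as
# "B-FOO" returns "O" when ignore_unknown=True, or "B-FOO" unchanged otherwise.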
def bilou_to_bio(self):
new_tags = []
for tag in self.tags:
new_tag = tag
has_prefix = len(tag) > 2 and tag[1] == '-'
if has_prefix:
if tag[0] == 'U':
new_tag = 'B' + tag[1:]
elif tag[0] == 'L':
new_tag = 'I' + tag[1:]
new_tags.append(new_tag)
self.tags = new_tags
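# Illustrative example: BILOU tags ["U-PERSON"] become ["B-PERSON"], and
# ["B-PERSON", "I-PERSON", "L-PERSON"] become ["B-PERSON", "I-PERSON", "I-PERSON"].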
@staticmethod
def rename_from_spacy_tags(spacy_tags, ignore_unknown=False):
return InputSample.translate_tags(spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown)
@staticmethod
def rename_to_spacy_tags(tags, ignore_unknown=True):
return InputSample.translate_tags(tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown)
@staticmethod
def write_spacy_json_from_docs(dataset, filename="spacy_output.json"):
docs = [sample.to_spacy_doc() for sample in dataset]
srsly.write_json(filename, [spacy.gold.docs_to_json(docs)])
def to_flair(self):
for i, token in enumerate(self.tokens):
return "{} {} {}".format(token, token.pos_, self.tags[i])
def translate_input_sample_tags(self, dictionary=PRESIDIO_SPACY_ENTITIES, ignore_unknown=True):
self.tags = InputSample.translate_tags(self.tags, dictionary, ignore_unknown=ignore_unknown)
for span in self.spans:
if span.entity_value in PRESIDIO_SPACY_ENTITIES:
span.entity_value = PRESIDIO_SPACY_ENTITIES[span.entity_value]
elif ignore_unknown:
span.entity_value = 'O'
@staticmethod
def create_flair_dataset(dataset):
flair_samples = []
for sample in dataset:
flair_samples.append(sample.to_flair())
return flair_samples
class ModelError:
def __init__(self, error_type, annotation, prediction, token, full_text, metadata):
"""
Holds information about an error a model made for analysis purposes
:param error_type: str, e.g. FP, FN, Person->Address etc.
:param annotation: ground truth value
:param prediction: predicted value
:param token: token in question
:param full_text: full input text
:param metadata: metadata on text from InputSample
"""
self.error_type = error_type
self.annotation = annotation
self.prediction = prediction
self.token = token
self.full_text = full_text
self.metadata = metadata
def __str__(self):
return "type: {}, " \
"Annotation = {}, " \
"prediction = {}, " \
"Token = {}, " \
"Full text = {}, " \
"Metadata = {}".format(self.error_type,
self.annotation,
self.prediction,
self.token,
self.full_text,
self.metadata)
def __repr__(self):
return r"<ModelError {{0}}>".format(self.__str__())
class EvaluationResult(object):
def __init__(self, results: Counter, model_errors: List[ModelError], text: str = None):
"""
Holds the output of a comparison between ground truth and predicted
:param results: List of objects of type Counter
with structure {(actual, predicted) : count}
:param model_errors: List of ModelError
:param text: sample's full text (if used for one sample)
:type results: Counter
:type model_errors : List[ModelError]
:type text: object
"""
self.results = results
self.model_errors = model_errors
self.text = text
self.pii_recall = None
self.pii_precision = None
self.pii_f = None
self.entity_recall_dict = None
self.entity_precision_dict = None
def print(self):
recall_dict = self.entity_recall_dict
precision_dict = self.entity_precision_dict
recall_dict["PII"] = self.pii_recall
precision_dict["PII"] = self.pii_precision
entities = recall_dict.keys()
recall = recall_dict.values()
precision = precision_dict.values()
row_format = "{:>30}{:>30.2%}{:>30.2%}"
header_format = "{:>30}" * 3
print(header_format.format(*("Entity", "Precision", "Recall")))
for entity, recall_value, precision_value in zip(entities, recall, precision):
print(row_format.format(entity, precision_value, recall_value))
print("PII F measure: {}".format(self.pii_f))

Просмотреть файл

@ -0,0 +1,74 @@
from typing import List
try:
from flair.data import Sentence, build_spacy_tokenizer
from flair.models import SequenceTagger
except ImportError:
print("Flair is not installed by default")
from presidio_evaluator import ModelEvaluator, InputSample
import spacy
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
class FlairEvaluator(ModelEvaluator):
def __init__(self,
model=None,
model_path: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True,
translate_to_spacy_entities=True):
"""
Evaluator for Flair models
:param model: model of type SequenceTagger
:param model_path:
:param entities_to_keep:
:param verbose:
:param labeling_scheme:
:param compare_by_io:
:param translate_to_spacy_entities:
"""
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
if model is None:
if model_path is None:
raise ValueError("Either model_path or model object must be supplied")
self.model = SequenceTagger.load(model_path)
else:
self.model = model
self.spacy_tokenizer = build_spacy_tokenizer(model=spacy.blank('en'))
self.translate_to_spacy_entities = translate_to_spacy_entities
if self.translate_to_spacy_entities:
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_entities:
sample.translate_input_sample_tags()
sentence = Sentence(text=sample.full_text, use_tokenizer=self.spacy_tokenizer)
self.model.predict(sentence)
tags = self.get_tags_from_sentence(sentence)
if len(tags) != len(sample.tokens):
print("mismatch between previous tokens and new tokens")
return tags
@staticmethod
def get_tags_from_sentence(sentence):
tags = []
for token in sentence:
tags.append(token.get_tag('ner').value)
new_tags = []
for tag in tags:
new_tags.append("PERSON" if tag == "PER" else tag)
return new_tags
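# Illustrative usage (the model path and entity list below are assumptions, not part of this file):
# evaluator = FlairEvaluator(model_path="path/to/flair-ner-model.pt",
#                            entities_to_keep=["PERSON", "LOCATION"])
# results = evaluator.evaluate_all(dataset)   # dataset: List[InputSample]
# score = evaluator.calculate_score(results)  # inherited from ModelEvaluator
# score.print()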


@ -0,0 +1,398 @@
from abc import ABC, abstractmethod
from typing import List, Tuple, Dict
from collections import Counter
import numpy as np
import pandas as pd
from presidio_evaluator import InputSample, EvaluationResult, ModelError
from tqdm import tqdm
class ModelEvaluator(ABC):
def __init__(self, entities_to_keep: List[str] = None,
verbose: bool = False,
use_spans: bool = False, labeling_scheme="BIO",
compare_by_io=True):
"""
Abstract class for evaluating NER models and others
:param entities_to_keep: Which entities should be evaluated? All other
entities are ignored. If None, none are filtered
:param verbose: Whether to print more debug info
:param labeling_scheme: Type of scheme used for labeling (BILOU,
BIO/IOB or IO)
:param compare_by_io: True if comparison should be done on the entity
level and not the sub-entity level
"""
self.entities = entities_to_keep
self.verbose = verbose
self.use_spans = use_spans
self.compare_by_io = compare_by_io
self.labeling_scheme = labeling_scheme
@abstractmethod
def predict(self, sample: InputSample) -> List[str]:
"""
Abstract. Returns the predicted tokens/spans from the evaluated model
:param sample: Sample to be evaluated
:return: if self.use spans: list of spans
if not self.use_spans: tags in self.labeling_scheme format
"""
pass
def compare(self, input_sample: InputSample, prediction: List[str]):
"""
Compares ground truth tags (annotation) and predicted (prediction)
:param input_sample: input sample containing a list of tags in the
self.labeling_scheme format
:param prediction: predicted value for each token
"""
annotation = input_sample.tags
tokens = input_sample.tokens
if len(annotation) != len(prediction):
print("Annotation and prediction do not have the"
"same length. Sample={}".format(input_sample))
return Counter(), []
results = Counter()
mistakes = []
new_annotation = annotation.copy()
if self.compare_by_io:
new_annotation = self._to_io(new_annotation)
prediction = self._to_io(prediction)
# Ignore annotations that aren't in the list of
# requested entities.
if self.entities:
prediction = self._adjust_per_entities(prediction)
new_annotation = self._adjust_per_entities(new_annotation)
for i in range(0, len(new_annotation)):
results[(new_annotation[i], prediction[i])] += 1
if self.verbose:
print('Annotation:', new_annotation[i])
print('Prediction:', prediction[i])
print(results)
# check if there was an error
is_error = (new_annotation[i] != prediction[i])
if is_error:
if prediction[i] == 'O':
mistakes.append(ModelError("FN",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata))
elif new_annotation[i] == 'O':
mistakes.append(ModelError("FP",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata))
else:
mistakes.append(ModelError("Wrong entity",
new_annotation[i],
prediction[i],
tokens[i],
input_sample.full_text,
input_sample.metadata))
return results, mistakes
def _adjust_per_entities(self, tags):
if self.entities:
return [tag if tag in self.entities else 'O' for tag in tags]
return tags
@staticmethod
def _to_io(tags):
"""
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
['B-PERSON','I-PERSON','L-PERSON'] is translated into
['PERSON','PERSON','PERSON']
:param tags: the input tags in BILOU/IOB/BIO format
:return: a new list of IO tags
"""
return [tag[2:] if '-' in tag else tag for tag in tags]
def evaluate_sample(self, sample: InputSample) -> EvaluationResult:
if self.verbose:
print("Input sentence: {}".format(sample.full_text))
prediction = self.predict(sample)
results, mistakes = self.compare(
input_sample=sample,
prediction=prediction)
return EvaluationResult(results, mistakes, sample.full_text)
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
evaluation_results = []
for sample in tqdm(dataset, desc='Evaluating {}'.format(self.__class__)):
evaluation_result = self.evaluate_sample(sample)
evaluation_results.append(evaluation_result)
return evaluation_results
def calculate_score(self, evaluation_results: List[
EvaluationResult], beta: float = 1) \
-> EvaluationResult:
"""
Returns the pii_precision, pii_recall and f_measure, both per entity
and for all PII entities combined
:param evaluation_results: List of EvaluationResult
:param beta: beta value for the F measure (weights recall vs. precision)
:return: EvaluationResult with precision, recall and f measures
"""
# aggregate results
all_results = sum([er.results for er in evaluation_results], Counter())
# compute pii_recall per entity
entity_recall = {}
entity_precision = {}
if self.entities:
entities = self.entities
else:
entities = list(
set([x[0] for x in all_results.keys() if x[0] != 'O']))
for entity in entities:
# all annotation of given type
annotated = sum(
[all_results[x] for x in all_results if x[0] == entity])
predicted = sum(
[all_results[x] for x in all_results if x[1] == entity])
tp = all_results[(entity, entity)]
if annotated > 0:
entity_recall[entity] = tp / annotated
else:
entity_recall[entity] = np.NaN
if predicted > 0:
per_entity_tp = all_results[(entity, entity)]
entity_precision[entity] = per_entity_tp / predicted
else:
entity_precision[entity] = np.NaN
# compute pii_precision and pii_recall
annotated_all = sum(
[all_results[x] for x in all_results if x[0] != 'O'])
predicted_all = sum(
[all_results[x] for x in all_results if x[1] != 'O'])
if annotated_all > 0:
pii_recall = sum([all_results[x] for x in all_results if
(x[0] != 'O' and x[1] != 'O')]) / annotated_all
else:
pii_recall = np.NaN
if predicted_all > 0:
pii_precision = sum([all_results[x] for x in all_results if
(x[0] != 'O' and x[1] != 'O')]) / predicted_all
else:
pii_precision = np.NaN
# compute pii_f_beta-score
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
# aggregate errors
errors = []
for res in evaluation_results:
if res.model_errors:
errors.extend(res.model_errors)
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
evaluation_result.pii_precision = pii_precision
evaluation_result.pii_recall = pii_recall
evaluation_result.entity_recall_dict = entity_recall
evaluation_result.entity_precision_dict = entity_precision
evaluation_result.pii_f = pii_f_beta
return evaluation_result
@staticmethod
def precision(tp: int, fp: int) -> float:
return tp / (tp + fp + 1e-100)
@staticmethod
def recall(tp: int, fn: int) -> float:
return tp / (tp + fn + 1e-100)
@staticmethod
def f_beta(precision: float, recall: float, beta: float) -> float:
"""
Returns the F score for precision, recall and a beta parameter
:param precision: a float with the precision value
:param recall: a float with the recall value
:param beta: a float with the beta parameter of the F measure,
which gives more or less weight to precision
vs. recall
:return: a float value of the f(beta) measure.
"""
if np.isnan(precision) or np.isnan(recall) or (
precision == 0 and recall == 0):
return np.nan
return ((1 + beta ** 2) * precision * recall) / (
((beta ** 2) * precision) + recall)
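# Worked example: with precision=0.8, recall=0.5 and beta=2 (recall-weighted),
# f_beta = (1 + 4) * 0.8 * 0.5 / (4 * 0.8 + 0.5) = 2.0 / 3.7 ≈ 0.54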
@staticmethod
def align_input_samples_to_presidio_analyzer(input_samples: List[InputSample],
entities_mapping: Dict[str, str],
presidio_fields: List[str]=None) \
-> List[InputSample]:
"""
Change input samples to conform with Presidio's entities
:return: new list of InputSample
"""
new_input_samples = input_samples.copy()
# Match entity names to Presidio's
if not presidio_fields:
presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',
'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']
# A list that will contain updated input samples,
new_list = []
# Iterate on all samples
for input_sample in new_input_samples:
contains_presidio_field = False
new_spans = []
# Update spans to match Presidio's entity name
for span in input_sample.spans:
in_presidio_field = False
if span.entity_type in entities_mapping.keys():
new_name = entities_mapping.get(span.entity_type)
span.entity_type = new_name
contains_presidio_field = True
# Add to new span list, if the span contains an entity relevant to Presidio
new_spans.append(span)
input_sample.spans = new_spans
# Update tags in case this sample has relevant entities for evaluation
if contains_presidio_field:
for i, tag in enumerate(input_sample.tags):
has_prefix = '-' in tag
if has_prefix:
prefix = tag[:2]
clean = tag[2:]
else:
prefix = ""
clean = tag
if clean in entities_mapping.keys():
new_name = entities_mapping.get(clean)
input_sample.tags[i] = "{}{}".format(prefix, new_name)
else:
input_sample.tags[i] = 'O'
new_list.append(input_sample)
return new_list
@staticmethod
def get_false_positives(errors: List[ModelError], entity=None):
"""
Get a list of all false positive errors in the results
"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [model_error for model_error in errors if
model_error.error_type == 'FP' and model_error.prediction in entity]
else:
return [model_error for model_error in errors if model_error.error_type == 'FP']
@staticmethod
def get_false_negatives(errors: List[ModelError], entity=None):
"""
Get a list of all false negative errors in the results (false negatives and wrong entity detection)
"""
if isinstance(entity, str):
entity = [entity]
if entity:
return [model_error for model_error in errors if
model_error.error_type != 'FP' and model_error.annotation in entity]
else:
return [model_error for model_error in errors if model_error.error_type != 'FP']
@staticmethod
def most_common_fp_tokens(errors: List[ModelError], n: int = 10, entity=None):
"""
Print the n most common false positive tokens (tokens thought to be an entity)
"""
fps = ModelEvaluator.get_false_positives(errors, entity)
tokens = [err.token.text for err in fps]
from collections import Counter
by_frequency = Counter(tokens)
most_common = by_frequency.most_common(n)
print("Most common false positive tokens:")
print(most_common)
print("Example sentence with each FP token:")
for tok, val in most_common:
with_tok = [err for err in fps if err.token.text == tok]
print(with_tok[0].full_text)
@staticmethod
def most_common_fn_tokens(errors: List[ModelError], n: int = 10, entity=None):
"""
Print all tokens that were missed by the model, including an example of the full text in which they appear
"""
fns = ModelEvaluator.get_false_negatives(errors, entity)
fns_tokens = [err.token.text for err in fns]
from collections import Counter
by_frequency_fns = Counter(fns_tokens)
most_common_fns = by_frequency_fns.most_common(50)
print(most_common_fns)
for tok, val in most_common_fns:
with_tok = [err for err in fns if err.token.text == tok]
print("Token: {}, Annotation: {}, Full text: {}".format(with_tok[0].token, with_tok[0].annotation,
with_tok[0].full_text))
@staticmethod
def get_errors_df(errors: List[ModelError], entity: List[str] = None, error_type: str = 'FN'):
"""
Get ModelErrors as pd.DataFrame
"""
if error_type == 'FN':
filtered_errors = ModelEvaluator.get_false_negatives(errors, entity)
elif error_type == 'FP':
filtered_errors = ModelEvaluator.get_false_positives(errors, entity)
else:
raise ValueError("error_type should be either FP or FN")
if len(filtered_errors) == 0:
print("No errors of type {} and entity {} were found".format(error_type,entity))
return None
errors_df = pd.DataFrame.from_records([error.__dict__ for error in filtered_errors])
metadata_df = pd.DataFrame(errors_df['metadata'].tolist())
errors_df.drop(['metadata'], axis=1, inplace=True)
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
return new_errors_df
@staticmethod
def get_fps_dataframe(errors: List[ModelError], entity: List[str] = None):
"""
Get false positive ModelErrors as pd.DataFrame
"""
return ModelEvaluator.get_errors_df(errors, entity, error_type='FP')
@staticmethod
def get_fns_dataframe(errors: List[ModelError], entity: List[str] = None):
"""
Get false negative ModelErrors as pd.DataFrame
"""
return ModelEvaluator.get_errors_df(errors, entity, error_type='FN')
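# Illustrative subclass sketch (not part of this module): a trivial baseline that
# predicts "O" for every token, useful for sanity-checking the evaluation flow.
# class AllOutsideEvaluator(ModelEvaluator):
#     def predict(self, sample: InputSample) -> List[str]:
#         return ["O" for _ in sample.tokens]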


@ -0,0 +1,136 @@
'''
Presidio Analyzer not yet on PyPI, cannot explicitly reference it
'''
from typing import List, Dict
#
from presidio_evaluator import ModelEvaluator, InputSample, span_to_tag
#
from presidio_evaluator.data_generator import read_synth_dataset
#
#
class PresidioAnalyzer(ModelEvaluator):
def __init__(self, analyzer,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme="BIO",
compare_by_io=True,
score_threshold=0.4
):
"""
Evaluation wrapper for the Presidio Analyzer
:param analyzer: object of type AnalyzerEngine (from presidio-analyzer)
"""
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
self.analyzer = analyzer
self.score_threshold = score_threshold
def predict(self, sample: InputSample) -> List[str]:
if self.entities is None or len(self.entities) == 0:
all_fields = True
else:
all_fields = None
results = self.analyzer.analyze(sample.full_text, self.entities,
language='en', all_fields=all_fields)
starts = []
ends = []
scores = []
tags = []
#
for res in results:
#
if res.score >= self.score_threshold:
starts.append(res.start)
ends.append(res.end)
tags.append(res.entity_type)
scores.append(res.score)
#
response_tags = span_to_tag(scheme=self.labeling_scheme,
text=sample.full_text,
start=starts,
end=ends,
tokens=sample.tokens,
scores=scores,
tag=tags)
return response_tags
if __name__ == "__main__":
print("Reading dataset")
input_samples = read_synth_dataset("../data/generated_size_30000_date_July 24 2019.txt")
print("Preparing dataset by aligning entity names to Presidio's entity names")
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
entities_mapping = {
'PERSON': 'PERSON',
'EMAIL': 'EMAIL_ADDRESS',
'CREDIT_CARD': 'CREDIT_CARD',
'FIRST_NAME': 'PERSON',
'PHONE_NUMBER': 'PHONE_NUMBER',
'BIRTHDAY': 'DATE_TIME',
'DATE': 'DATE_TIME',
'DOMAIN': 'DOMAIN',
'CITY': 'LOCATION',
'ADDRESS': 'LOCATION',
'IBAN': 'IBAN_CODE',
'URL': 'DOMAIN_NAME',
'US_SSN': 'US_SSN',
'IP_ADDRESS': 'IP_ADDRESS',
'ORGANIZATION': 'ORG',
'O': 'O'
}
updated_samples = ModelEvaluator.align_input_samples_to_presidio_analyzer(input_samples,
entities_mapping)
flatten = lambda l: [item for sublist in l for item in sublist]
from collections import Counter
count_per_entity = Counter(
[span.entity_type for span in flatten([input_sample.spans for input_sample in updated_samples])])
print("Evaluating samples")
analyzer = PresidioAnalyzer(entities_to_keep=count_per_entity.keys())
evaluated_samples = analyzer.evaluate_all(updated_samples)
#
print("Estimating metrics")
score = analyzer.calculate_score(evaluation_results=evaluated_samples, beta=2.5)
precision, recall = score.pii_precision, score.pii_recall
entity_recall, entity_precision = score.entity_recall_dict, score.entity_precision_dict
f, errors = score.pii_f, score.model_errors
#
print("precision: {}".format(precision))
print("Recall: {}".format(recall))
print("F 2.5: {}".format(f))
print("Precision per entity: {}".format(entity_precision))
print("Recall per entity: {}".format(entity_recall))
#
FN_mistakes = [str(mistake) for mistake in errors if mistake.error_type == 'FN']
FP_mistakes = [str(mistake) for mistake in errors if mistake.error_type == 'FP']
other_mistakes = [str(mistake) for mistake in errors if mistake.error_type == 'Wrong entity']
fn = open('../data/fn_30000.txt', 'w+', encoding='utf-8')
fn1 = '\n'.join(FN_mistakes)
fn.write(fn1)
fn.close()
fp = open('../data/fp_30000.txt', 'w+', encoding='utf-8')
fp1 = '\n'.join(FP_mistakes)
fp.write(fp1)
fp.close()
mistakes_file = open('../data/mistakes_30000.txt', 'w+', encoding='utf-8')
mistakes1 = '\n'.join(other_mistakes)
mistakes_file.write(mistakes1)
mistakes_file.close()
from pickle import dump
dump(evaluated_samples, open("../data/evaluated_samples_30000.pickle", "wb"))
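The pickled results above can be reloaded later for offline analysis. A minimal sketch reusing the same path, assuming the script above has already been run:

# Sketch: reload the pickled evaluation results for later inspection.
from pickle import load
with open("../data/evaluated_samples_30000.pickle", "rb") as f:
    evaluated_samples = load(f)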


@ -0,0 +1,133 @@
import json
from typing import List
import requests
from presidio_evaluator import InputSample, ModelEvaluator
from presidio_evaluator.span_to_tag import span_to_tag, tokenize
ENDPOINT = "http://40.113.201.221:8080/api/v1/projects/test/analyze"
class PresidioAPIEvaluator(ModelEvaluator):
def __init__(self, endpoint=None, all_fields=False, entities_to_keep=None,
verbose=False, labeling_scheme="IO", **kwargs):
"""
Evaluator for the Presidio API as a system
:param endpoint: url of presidio API
:param all_fields: boolean, true if no entities filtering should take
place
:param entities_to_keep: list of entities to return if found
:param labeling_scheme: BIO/IOB or BILOU
:param verbose:
:param kwargs:
"""
if not endpoint:
print(
"Endpoint is missing. using default presidio API at {}".format(
ENDPOINT))
self.endpoint = ENDPOINT
else:
self.endpoint = endpoint
if not entities_to_keep and not all_fields:
raise ValueError("Please provide either a list of entities or"
"all_fields=true")
if all_fields:
entities_to_keep = None
super().__init__(verbose=verbose, entities_to_keep=entities_to_keep,
labeling_scheme=labeling_scheme, **kwargs)
self.set_analyze_template(all_fields=all_fields,
entities=entities_to_keep)
def predict(self, sample: InputSample):
text = sample.full_text
request = {"text": text,
"analyzeTemplate": self.analyze_template
}
# Call presidio API
r = requests.post(self.endpoint, json=request)
starts = []
ends = []
tags = []
if r.status_code == 200:
analyzer_results = json.loads(r.text)
if self.verbose:
print(analyzer_results)
if analyzer_results:
for res in analyzer_results:
if not res['location'].get('start'):
res['location']['start'] = 0
starts.append(res['location']['start'])
ends.append(res['location']['end'])
tags.append(res['field']['name'])
response_tags = span_to_tag(scheme=self.labeling_scheme,
text=text,
start=starts,
end=ends,
tag=tags)
elif r.status_code == 400 or r.text == "":
if self.verbose:
print("Status 400 received")
response_tags = ['O' for token in sample.tokens]
else:
print("Error getting result from Presidio API")
print("Request = {}".format(request))
print("Response = {}".format(r.text))
raise Exception(r)
return response_tags
def set_analyze_template(self, all_fields: bool, entities: List[str]):
template = {
"fields": [{"name": "EMAIL_ADDRESS"}, {"name": "IP_ADDRESS"},
{"name": "US_DRIVER_LICENSE"},
{"name": "US_ITIN"}, {"name": "US_SSN"},
{"name": "DOMAIN_NAME"},
{"name": "IBAN_CODE"}, {"name": "PERSON"},
{"name": "PHONE_NUMBER"},
{"name": "US_BANK_NUMBER"}, {"name": "CRYPTO"},
{"name": "NRP"},
{"name": "UK_NHS"}, {"name": "CREDIT_CARD"},
{"name": "DATE_TIME"},
{"name": "LOCATION"}, {"name": "US_PASSPORT"}]}
if all_fields:
self.analyze_template = template
return
requested_fields = []
for entity in entities:
for field in template['fields']:
if entity == field['name']:
requested_fields.append(field)
new_template = {'fields': requested_fields}
self.analyze_template = new_template
if __name__ == "__main__":
# Example:
text = "My siblings are Dan and magen"
bilou_tags = ['O', 'O', 'O', 'U-PERSON', 'O', 'U-PERSON']
presidio = PresidioAPIEvaluator(verbose=True, all_fields=True, compare_by_io=True)
tokens = tokenize(text)
s = InputSample(text, masked=None, spans=None)
s.tokens = tokens
s.tags = bilou_tags
evaluated_sample = presidio.evaluate_sample(s)
p, r, entity_recall, entity_precision, f, mistakes = presidio.calculate_score([evaluated_sample])
print("Precision = {}\n"
"Recall = {}\n"
"F_3 = {}\n"
"Errors = {}".format(p, r, f, mistakes))


@ -0,0 +1,82 @@
'''
Presidio Analyzer not yet on PyPI, therefore it cannot be referenced explicitly
'''
import math
from typing import List, Tuple, Dict
from presidio_evaluator import ModelEvaluator, InputSample
from presidio_evaluator.span_to_tag import span_to_tag
class PresidioRecognizerEvaluator(ModelEvaluator):
def __init__(self, recognizer, nlp_engine, entities_to_keep=None,
with_nlp_artifacts=False, verbose=False, compare_by_io=True,
):
"""
Evaluator for one recognizer
:param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
"""
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose, compare_by_io=compare_by_io)
self.withNlpArtifacts = with_nlp_artifacts
self.recognizer = recognizer
self.nlp_engine = nlp_engine
#
def __make_nlp_artifacts(self, text: str):
return self.nlp_engine.process_text(text, 'en')
#
def predict(self, sample: InputSample) -> List[str]:
nlpArtifacts = None
if self.withNlpArtifacts:
nlpArtifacts = self.__make_nlp_artifacts(sample.full_text)
results = self.recognizer.analyze(sample.full_text, self.entities,
nlpArtifacts)
starts = []
ends = []
tags = []
scores = []
for res in results:
if not res.start:
res.start = 0
starts.append(res.start)
ends.append(res.end)
tags.append(res.entity_type)
scores.append(res.score)
response_tags = span_to_tag(scheme=self.labeling_scheme,
text=sample.full_text,
start=starts,
end=ends,
tag=tags,
tokens=sample.tokens,
scores=scores,
io_tags_only=self.compare_by_io)
if len(sample.tags) == 0:
sample.tags = ['O' for word in response_tags]
return response_tags
def score_presidio_recognizer(recognizer, entities_to_keep, input_samples,
withNlpArtifacts=False) \
-> Tuple[Dict[str, float], Dict[str, float], Dict[str, float], Dict[
str, float], Dict[str, float], List[str]]:
model = PresidioRecognizerEvaluator(recognizer=recognizer,
entities_to_keep=entities_to_keep,
with_nlp_artifacts=withNlpArtifacts)
evaluated_samples = model.evaluate_all(input_samples[:])
precision, recall, ent_recall, \
ent_precision, fscore, mistakes = model.calculate_score(
evaluated_samples, beta=2.5)
print("p={precision}, r={recall},f={f},"
"entity recall={ent},entity precision={prec}".format(
precision=precision,
recall=recall,
f=fscore,
ent=ent_recall,
prec=ent_precision))
if math.isnan(precision):
precision = 0
return precision, recall, ent_recall, ent_precision, fscore, mistakes
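A minimal sketch of scoring a single recognizer against the synthetic dataset. presidio-analyzer is not on PyPI, so `my_recognizer` is a placeholder for any EntityRecognizer instance from a local installation; the dataset path is the one used elsewhere in this package:

# Hypothetical usage sketch: score one recognizer on synthetic data.
from presidio_evaluator.data_generator import read_synth_dataset

input_samples = read_synth_dataset("../data/generated_size_30000_date_July 24 2019.txt")
precision, recall, ent_recall, ent_precision, fscore, mistakes = \
    score_presidio_recognizer(recognizer=my_recognizer,  # placeholder EntityRecognizer
                              entities_to_keep=['CREDIT_CARD'],
                              input_samples=input_samples)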


@ -0,0 +1,52 @@
from typing import List
from presidio_evaluator import ModelEvaluator, InputSample
import spacy
from spacy.language import Language
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
class SpacyEvaluator(ModelEvaluator):
def __init__(self,
model: spacy.language.Language = None,
model_name: str = None,
entities_to_keep: List[str] = None,
verbose: bool = False,
labeling_scheme: str = "BIO",
compare_by_io: bool = True,
translate_to_spacy_ents = True):
super().__init__(entities_to_keep=entities_to_keep,
verbose=verbose,
labeling_scheme=labeling_scheme,
compare_by_io=compare_by_io)
if model is None:
if model_name is None:
raise ValueError("Either model_name or model object must be supplied")
self.model = spacy.load(model_name)
else:
self.model = model
self.translate_to_spacy_ents = translate_to_spacy_ents
if self.translate_to_spacy_ents:
print("Translating entites using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
def predict(self, sample: InputSample) -> List[str]:
if self.translate_to_spacy_ents:
sample.translate_input_sample_tags()
doc = self.model(sample.full_text)
tags = self.get_tags_from_doc(doc)
if len(doc) != len(sample.tokens):
print("mismatch between input tokens and new tokens")
return tags
@staticmethod
def get_tags_from_doc(doc):
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
return tags
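A minimal sketch of evaluating an off-the-shelf spaCy model on the synthetic dataset, assuming en_core_web_lg is installed (it is pinned in requirements.txt below) and that the chosen entities appear in the data:

# Hypothetical usage sketch: evaluate a pretrained spaCy model.
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("../data/generated_size_30000_date_July 24 2019.txt")
spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg",
                                 entities_to_keep=["PERSON", "GPE", "ORG"])
evaluation_results = spacy_evaluator.evaluate_all(samples)
precision, recall, entity_recall, entity_precision, f, errors = \
    spacy_evaluator.calculate_score(evaluation_results, beta=2.5)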


@ -0,0 +1,164 @@
from collections import namedtuple
from typing import List
import spacy
loaded_spacy = {}
def get_spacy(loaded_spacy=loaded_spacy, model_version="en_core_web_lg"):
if model_version not in loaded_spacy:
disable = ['vectors', 'textcat', 'ner']
print("loading model {}".format(model_version))
loaded_spacy[model_version] = spacy.load(model_version, disable=disable)
return loaded_spacy[model_version]
def tokenize(text, model_version="en_core_web_lg"):
return get_spacy(model_version=model_version)(text)
def _get_detailed_tags(scheme, cur_tags):
"""
Replaces IO tags (e.g. PERSON PERSON) with IOB/BIO/BILOU tags
:param cur_tags:
:param scheme:
:return:
"""
if all([tag == 'O' for tag in cur_tags]):
return cur_tags
return_tags = []
if len(cur_tags) == 1:
if scheme == "BILOU":
return_tags.append("U-{}".format(cur_tags[0]))
else:
return_tags.append("I-{}".format(cur_tags[0]))
elif len(cur_tags) > 0:
tg = cur_tags[0]
for j in range(0, len(cur_tags)):
if j == 0:
return_tags.append("B-{}".format(tg))
elif j == len(cur_tags) - 1:
if scheme == "BILOU":
return_tags.append("L-{}".format(tg))
else:
return_tags.append("I-{}".format(tg))
else:
return_tags.append("I-{}".format(tg))
return return_tags
def _sort_spans(start, end, tag, score):
if len(start) > 0:
tpl = [(a, b, c, d) for a, b, c, d in sorted(zip(start, end, tag, score), key=lambda pair: pair[0])]
start, end, tag, score = [[x[i] for x in tpl] for i in range(len(tpl[0]))]
return start, end, tag, score
def _handle_overlaps(start, end, tag, score):
start, end, tag, score = _sort_spans(start, end, tag, score)
if len(start) == 0:
return start, end, tag, score
max_end = max(end)
index = min(start)
number_of_spans = len(start)
i = 0
while i < number_of_spans-1:
for j in range(i+1,number_of_spans):
# Span j intersects with span i
if start[i] <= start[j] <= end[i]:
# i's score is higher, remove intersecting part
if score[i] > score[j]:
# j is contained within i but has lower score, remove
if start[i] >= end[j] >= end[i]:
score[j] = 0
# else, j continues after i ended:
else:
start[j] = end[i] + 1
# j's score is higher, break i
else:
# If i finishes after j ended, split i
if end[j] < end[i]:
# create new span at the end
start.append(end[j] + 1)
end.append(end[i])
score.append(score[i])
tag.append(tag[i])
number_of_spans += 1
# truncate the current i to end at start(j)
end[i] = start[j] - 1
# else, i finishes before j ended. truncate i
else:
end[i] = start[j] - 1
i += 1
start, end, tag, score = _sort_spans(start, end, tag, score)
return start, end, tag, score
def span_to_tag(scheme: str,
text: str,
start: List[int],
end: List[int],
tag: List[str],
scores: List[float] = None,
tokens: List[spacy.tokens.Token] = None,
io_tags_only=False) -> List[str]:
"""
Turns a list of start and end values with corresponding labels, into a NER
tagging (BILOU,BIO/IOB)
:param scheme: labeling scheme, either BILOU, BIO/IOB or IO
:param text: input text
:param tokens: text tokenized to tokens
:param start: list of indices where entities in the text start
:param end: list of indices where entities in the text end
:param tag: list of entity names
:param scores: score of tag (confidence)
:param io_tags_only: Whether to return only I and O tags
:return: list of strings, representing either BILOU or BIO for the input
"""
if not scores:
# assume all scores are of equal weight
scores = [0.5 for start in start]
start, end, tag, scores = _handle_overlaps(start, end, tag, scores)
if not tokens:
tokens = tokenize(text)
io_tags = []
for token in tokens:
found = False
for span_index in range(0, len(start)):
if start[span_index] <= token.idx < end[span_index]:
io_tags.append(tag[span_index])
found = True
break
if not found:
io_tags.append("O")
if io_tags_only or scheme == "IO":
return io_tags
# Set tagging based on scheme (BIO/IOB or BILOU)
current_tag = ""
span_index = 0
changes = []
for io_tag in io_tags:
if io_tag != current_tag:
changes.append(span_index)
span_index += 1
current_tag = io_tag
changes.append(len(io_tags))
new_return_tags = []
for i in range(len(changes) - 1):
new_return_tags.extend(
_get_detailed_tags(scheme=scheme,
cur_tags=io_tags[changes[i]:changes[i + 1]]))
return new_return_tags
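A small end-to-end example of the conversion performed by span_to_tag, assuming en_core_web_lg is available for tokenization; the single PERSON span covers the token "Dan":

# Example: one PERSON span converted to BILOU tags.
tags = span_to_tag(scheme="BILOU",
                   text="My name is Dan",
                   start=[11], end=[14], tag=["PERSON"])
print(tags)  # expected (sketch): ['O', 'O', 'O', 'U-PERSON']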


@ -0,0 +1,79 @@
from collections import defaultdict
import random
import numpy as np
from typing import List, Dict
import json
from presidio_evaluator import InputSample
def split_dataset(dataset: List[InputSample], ratios):
"""
Splits a provided dataset into n groups, by the Template# attribute in each sample's metadata
:param dataset: List of InputSamples to be split
:param ratios: list of ratios. The number of ratios determines the number of splits returned,
e.g. [0.7, 0.2, 0.1] for train, test, validation
"""
splits = []
remaining_dataset = dataset
remaining_ratio = 1.0
if sum(ratios) > 1 or sum(ratios) < 0.999:
raise ValueError("Ratios should sum to 1 and be in (0,1]")
for ratio in ratios:
if 1 >= ratio > 0:
first_templates, second_templates = split_by_template(remaining_dataset, ratio/remaining_ratio)
first_split = get_samples_by_pattern(remaining_dataset, first_templates)
second_split = get_samples_by_pattern(remaining_dataset, second_templates)
splits.append(first_split)
remaining_dataset = second_split
remaining_ratio -= ratio
else:
raise ValueError("Ratio needs to be in (0,1]")
return tuple(splits)
def group_by_template(dataset: List[InputSample]) -> Dict[str, List[InputSample]]:
"""
Creates a dict of key = template ID and value = List[InputSamples] for this template id
"""
samples_pattern_tup = [(sample.metadata["Template#"],sample) for sample in dataset]
group_by_template = defaultdict(list)
for sample in samples_pattern_tup:
group_by_template[sample[0]].append(sample[1])
return group_by_template
def split_by_template(input_samples: List[InputSample], train_pct: float = 0.7):
"""
Splits a dataset of type List[InputSample] into a tuple of train template IDs and test template IDs
"""
samples_grpd = group_by_template(input_samples)
templates = np.array(list(samples_grpd.keys()))
train_ind = set(random.sample(range(len(templates)), round(train_pct * len(templates))))
test_ind = set(range(len(templates))) - train_ind
return templates[list(train_ind)], templates[list(test_ind)]
def get_samples_by_pattern(input_samples, patterns_list):
samples_grpd = group_by_template(input_samples)
dataset = []
for pattern in patterns_list:
dataset.extend(samples_grpd[pattern])
random.shuffle(dataset)
return dataset
def save_to_json(samples, output_file):
examples_dict = [example.to_dict() for example in samples]
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
json.dump(examples_dict, f, ensure_ascii=False, indent=4)
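A minimal sketch of a 70/20/10 split, assuming the samples come from the data generator and therefore carry a Template# metadata field; the output paths are illustrative:

# Hypothetical usage sketch: split synthetic samples into train/test/validation.
from presidio_evaluator.data_generator import read_synth_dataset

samples = read_synth_dataset("../data/generated_size_30000_date_July 24 2019.txt")
train, test, validation = split_dataset(samples, [0.7, 0.2, 0.1])
save_to_json(train, "../data/train.json")
save_to_json(test, "../data/test.json")
save_to_json(validation, "../data/validation.json")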

14
pytest.ini Normal file

@ -0,0 +1,14 @@
[pytest]
testpaths = .
markers =
slow: marks tests as slow (deselect with '-m "not slow"')
inconclusive: marks tests as those that may sometimes fail due to threshold
none: regular tests
serial
# Commented out to avoid performance test failures. Uncomment when debugging tests.
#log_cli = true
#log_level = DEBUG
filterwarnings =
ignore::DeprecationWarning

17
requirements.txt Normal file

@ -0,0 +1,17 @@
spacy
requests==2.22.0
numpy==1.17.2
jupyter==1.0.0
pandas==0.25.1
tqdm
haikunator
schwifty
faker
sklearn
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz
regex
#azureml
#azureml-sdk
#flair
sklearn_crfsuite
pytest

38
setup.py Normal file

@ -0,0 +1,38 @@
from setuptools import setup
import os.path
# read the contents of the README file
from os import path
this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
long_description = f.read()
# print(long_description)
__version__ = ""
with open(os.path.join(this_directory, 'VERSION')) as version_file:
__version__ = version_file.read().strip()
setup(
name='presidio-evaluator',
long_description=long_description,
long_description_content_type='text/markdown',
version=__version__,
packages=['presidio_evaluator', 'presidio_evaluator.data_generator'
],
url='https://www.github.com/microsoft/presidio',
license='MIT',
description='PII dataset generator, model evaluator for Presidio and PII data in general',
install_requires=[
'spacy>=2.2.0',
'requests==2.22.0',
'numpy==1.16.4',
'pandas>=0.24.2',
'tqdm>=4.32.1',
'jupyter>=1.0.0',
'pytest>=4.6.2',
'haikunator',
'schwifty',
'faker',
'sklearn_crfsuite']
)

0
tests/__init__.py Normal file

31
tests/conftest.py Normal file

@ -0,0 +1,31 @@
import pytest
# pytest configuration file
# the configuration allows the following kinds of tests:
# * unmarked tests run on every pytest execution
# * tests with large datasets / long run times are marked as "slow" and only run with pytest --runslow
# * tests with inconclusive results are marked as "inconclusive" and only run with pytest --runinconclusive
# * tests can be both slow and inconclusive; they require pytest --runslow --runinconclusive
def pytest_addoption(parser):
parser.addoption(
"--runslow", action="store_true", default=False, help="run slow tests"
)
parser.addoption(
"--runinconclusive", action="store_true", default=False, help="run slow tests"
)
def pytest_collection_modifyitems(items, config):
if not config.getoption("--runslow"):
skip_slow = pytest.mark.skip(reason="need --runslow option to run")
for item in items:
if "slow" in item.keywords:
item.add_marker(skip_slow)
if not config.getoption("--runinconclusive"):
skip_slow = pytest.mark.skip(reason="need --runinconclusive option to run")
for item in items:
if "inconclusive" in item.keywords:
item.add_marker(skip_slow)
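For reference, the same selection can be triggered programmatically (a sketch using pytest's documented pytest.main entry point):

# Equivalent to running: pytest --runslow --runinconclusive
import pytest
pytest.main(["--runslow", "--runinconclusive"])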


@ -0,0 +1,18 @@
WORD,PARSING
a,()
a-,()
a 1,()
a b c,()
a cappella,()
a fortiori,()
a mensa et thoro,()
a posteriori,()
a priori,()
aam,(n.)
aard-vark,(n.)
aard-wolf,(n.)
aaronic,(a.)
aaronical,(a.)
aaron's rod,()
ab,(n.)
ab-,()


@ -0,0 +1,101 @@
Number,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,EmailAddress,Username,Password,BrowserUserAgent,TelephoneNumber,TelephoneCountryCode,MothersMaiden,Birthday,Age,TropicalZodiac,CCType,CCNumber,CVV2,CCExpires,NationalID,UPS,WesternUnionMTCN,MoneyGramMTCN,Color,Occupation,Company,Vehicle,Domain,BloodType,Pounds,Kilograms,FeetInches,Centimeters,GUID,Latitude,Longitude
1,female,Czech,Mrs.,Marie,J,Hamanová,"P.O. Box 255",Kangerlussuaq,QE,Qeqqata,3910,GL,Greenland,MarieHamanova@armyspy.com,Wasco1982,eiZookooB7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","84 23 30",299,Kubíková,3/29/1982,37,Aries,MasterCard,5545634085461876,511,1/2020,,"1Z 789 686 82 8979 914 6",6945116246,34746079,Purple,"Surveillance officer","Simple Solutions","1995 Zastava 65",MarathonDancing.gl,O+,217.6,98.9,"5' 5""",164,6781b04d-7b5f-4c1a-bceb-b953e6ef70d7,77.377518,-67.015569
2,female,French,Ms.,Patricia,G,Desrosiers,"Avenida Noruega 42","Vila Real",VR,"Vila Real",5000-047,PT,Portugal,PatriciaDesrosiers@superrito.com,Fultses,eb6soCha4ae,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","21 259 903 5696",351,Daviau,2/28/1956,63,Pisces,MasterCard,5317250628844522,874,3/2022,,"1Z V38 747 73 7311 832 9",7398998399,18093674,Blue,"Vascular technologist","Formula Gray","2006 Lexus GS",LostMillions.com.pt,O+,118.1,53.7,"5' 0""",152,2b2e7e1a-855f-4089-a570-c0af2381a6d6,41.274541,-7.876658
3,female,American,Ms.,Debra,O,Neal,"1659 Hoog St",Brakpan,GA,Gauteng,1553,ZA,"South Africa",DebraONeal@fleckens.hu,Cognoy,sha3Sohzee,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/19.4.0.2397 Yowser/2.5 Safari/537.36","082 490 1693",27,Barrett,6/11/1957,62,Gemini,Visa,4916429195104076,315,5/2020,5706114632083,"1Z 061 1E5 71 3400 427 4",6186449862,58702271,Blue,"Information architect librarian",Dahlkemper's,"1993 Honda Prelude",MediumTube.co.za,A+,120.1,54.6,"5' 4""",162,2ef83f4c-3102-4f79-839d-c75bf6a06f0a,-26.22096,28.283398
4,male,French,Mr.,Peverell,C,Racine,"183 Epimenidou Street",Limassol,LI,Limassol,3041,CY,"Cyprus (Anglicized)",PeverellRacine@teleworm.us,Restlys,Aekie7ohs,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","25 470375",357,Grondin,6/14/1962,57,Gemini,Visa,4485421519226702,653,5/2023,,"1Z F44 91V 14 3570 491 2",0850016444,52534088,Blue,"Desk clerk",Quickbiz,"2008 Infiniti G35",ImproveLook.com.cy,B+,142.1,64.6,"5' 9""",174,bfb4be71-3710-4ffa-baaf-5af6aa4b339e,41.30296,-72.989066
5,female,Slovenian,Mrs.,Iolanda,S,Tratnik,"Karu põik 61",Pärnu,PR,Pärnumaa,80098,EE,Estonia,IolandaTratnik@teleworm.us,Trely1962,jeiziejohH3ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","445 6271",372,Korbun,1/23/1962,57,Aquarius,Visa,4532820383285186,893,4/2024,,"1Z 060 418 64 7516 574 4",1178606881,74806227,Purple,"Production assistant","Dubrow's Cafeteria","2007 Fiat Idea",PostTan.com.ee,O+,141.5,64.3,"5' 3""",160,0cbb7bf3-466f-4df6-bda3-9c9fe7bfc5c1,58.293395,24.434851
6,male,Italian,Mr.,Domenico,D,Pisano,"Via Pisanelli 104",Traversara,RA,Ravenna,48020,IT,Italy,DomenicoPisano@armyspy.com,Hatelt,lohhee8Zah,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36","0312 0828589",39,Conti,6/1/1979,40,Gemini,Visa,4532872142737056,237,6/2023,WK48391724,"1Z 175 1F5 29 1963 168 1",7448393148,31617424,Blue,"Professional scout",Littler's,"1998 Nissan Serena",HardDriveBlog.it,O+,247.5,112.5,"6' 0""",182,f4feeb24-e3b1-4d99-9c71-e8c6a95762fe,44.588081,12.055283
7,male,Greenland,Mr.,Pavia,A,Rosing,"29 Wattle St","King William's Town",EC,"Eastern Cape",5601,ZA,"South Africa",PaviaRosing@superrito.com,Thattere,aiCheed7tie,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/19.4.0.2397 Yowser/2.5 Safari/537.36","082 692 3461",27,Lennert,5/5/1937,82,Taurus,Visa,4539980160229196,256,10/2020,3705057790082,"1Z 507 770 52 3012 473 1",1256867146,65720899,Green,"Chemical engineer","Pup 'N' Taco","2003 Peugeot Partner",RecruitSuit.co.za,O-,192.1,87.3,"6' 0""",182,b6b75cf9-dfbf-424d-a03c-90cdd859e9eb,-32.787712,27.343649
8,female,French,Mrs.,Ormazd,M,Jomphe,"Mattenstrasse 108",Sissach,,,4450,CH,Switzerland,OrmazdJomphe@rhyta.com,Deace1999,oochui5Eboe5T,"Mozilla/5.0 (Windows NT 6.1; rv:66.0) Gecko/20100101 Firefox/66.0","061 947 83 90",41,Busson,1/14/1999,20,Capricorn,Visa,4556603638439886,691,6/2024,,"1Z 091 192 83 9348 168 6",4380386435,24628087,Purple,"Clinical psychologist","Linens 'n Things","1996 Plymouth Neon",CyclingMonthly.ch,O+,115.3,52.4,"5' 1""",154,e5858bdb-9173-4991-9857-4e09b61e4e16,47.520557,7.863831
9,male,Norwegian,Mr.,Severin,L,Akhtar,"251 Charilaou Trikoupi Str.",Pigenia,NI,Nicosia,2962,CY,"Cyprus (Anglicized)",SeverinAkhtar@rhyta.com,Heremer,cieCipua8L,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 586625",357,Mathisen,4/30/1960,59,Taurus,MasterCard,5230940651584482,785,10/2022,,"1Z W81 228 52 4912 032 1",8897778249,55031915,Green,"Pump operator","Fragrant Flower Lawn Services","2005 Dodge Nitro",GainPain.com.cy,B+,155.3,70.6,"6' 0""",182,64383596-6dc8-4b77-9476-c1a8ef23ffc6,41.335894,-72.908321
10,female,Greenland,Mrs.,Margrethe,H,Kristiansen,"94 boulevard Amiral Courbet",ORLÉANS,CE,Centre,45100,FR,France,MargretheKristiansen@gustr.com,Theirturavid,Aiv4ohwae,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",02.64.62.52.37,33,Berthelsen,12/13/1979,40,Sagittarius,MasterCard,5306020150102745,915,12/2024,"2791269679323 49","1Z 987 E42 01 7982 218 2",0231937615,18687876,Purple,"Systems software engineer","Independent Wealth Management","2012 Porsche 911",ToyProtection.fr,A+,200.2,91.0,"5' 3""",159,ae7b1a56-0d6d-46ba-895e-75ee10482858,47.850047,1.875252
11,female,Hispanic,Mrs.,Myrna,G,Feliciano,"Männi 12",Mustoja,LV,Lääne-Virumaa,45429,EE,Estonia,MyrnaFelicianoCortes@superrito.com,Wilthe84,Chu6shiRees,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","329 3803",372,Cortés,1/23/1984,35,Aquarius,MasterCard,5402842306504596,788,7/2020,,"1Z 599 666 46 6430 018 3",8678427166,90687114,Blue,"Radiologic technician",Monmax,"2003 Mitsubishi Lancer",SharkStatistics.com.ee,O+,212.7,96.7,"5' 5""",164,4e8b5c6c-0b04-43c1-ad5b-2fc92365455c,59.638357,26.059683
12,male,Czech,Mr.,Michal,E,Horký,"Algade 33",Guldborg,SJ,"Region Sjælland",4862,DK,Denmark,MichalHorky@rhyta.com,Fiect1941,oxep7Aev,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",28-64-27-85,45,Siváková,6/19/1941,78,Gemini,MasterCard,5108116586316493,376,2/2021,190641-4941,"1Z 85E W86 50 8027 647 6",4547979244,81431928,Orange,"Dental assistant",Pointers,"1995 Daihatsu Rocky",ExShows.dk,B+,165.0,75.0,"5' 10""",177,a2aa0138-07d3-41e3-b3b7-455d9854e31f,54.815396,11.760822
13,male,French,Mr.,Donat,M,Lespérance,"96 rue de Penthièvre",PONTOISE,IL,Île-de-France,95000,FR,France,DonatLesperance@gustr.com,Sirep1950,re8ZieK4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",03.34.08.71.06,33,Dodier,11/12/1950,69,Scorpio,Visa,4556904288472270,512,9/2022,"1501143313127 93","1Z 070 9Y5 64 7265 236 0",1770634719,11974448,Black,Neurosonographer,"American Appliance","2009 Kia Cerato",BasketballBiz.fr,A+,180.8,82.2,"5' 7""",170,78b497ed-6d7d-4e5f-8150-1155abf9716e,48.977559,1.976986
14,female,"Japanese (Anglicized)",Ms.,Yuuka,M,Shimasaki,"Mjövattnet 1",NYLAND,,,"870 52",SE,Sweden,YuukaShimasaki@cuvox.de,Coun1976,ohleaT4ae,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",0613-9040212,46,Kimura,2/11/1976,43,Aquarius,MasterCard,5457403889440023,903,5/2021,760211-4105,"1Z 975 450 29 2316 562 4",8652144021,72590241,Blue,"Credit checker",Elek-Tek,"2006 Ford Territory",ClassInsider.se,A-,151.8,69.0,"5' 5""",165,df1a2d57-31d8-4a71-8ad6-cb687ee250d4,62.773416,17.853904
15,male,Swedish,Mr.,Wiktor,H,Ek,"Norðurbraut 27",Reykjavík,,,112,IS,Iceland,WiktorEk@rhyta.com,Boally,aigo2OoPhoi,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","434 6815",354,Göransson,5/4/1945,74,Taurus,Visa,4532063379896779,489,4/2024,,"1Z 278 965 48 6106 268 1",1158883056,79465382,White,"Legal secretary","Handy Andy Home Improvement Center","2014 Jaguar XF",SSLAlert.is,O+,227.0,103.2,"5' 10""",178,d39f58d7-7bb7-4f77-9956-9b505f8a4cc8,64.187422,-21.93344
16,female,Slovenian,Ms.,Polona,H,Ranković,"Õli 68",Himmiste,PL,Põlvamaa,64204,EE,Estonia,PolonaRankovic@rhyta.com,Whou1985,eeZae5oech,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36","798 9719",372,Orehek,7/18/1985,34,Cancer,MasterCard,5433975988486550,723,7/2022,,"1Z A77 0E9 59 0580 345 8",0984960146,75114603,Blue,"Personal trainer","Budget Tapes & Records","1995 Lancia Delta",TodayAlert.com.ee,A+,160.2,72.8,"5' 5""",165,c8e80b15-7ff4-4b21-bebf-5737d8133bdc,57.990467,27.141455
17,male,Scottish,Mr.,Ivan,M,King,"Pachergasse 64",BÜSCHENDORF,ST,Styria,8786,AT,Austria,IvanKing@einrot.com,Anempon,XooJoh0se5sh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0660 475 13 89",43,Watson,9/15/1994,25,Virgo,MasterCard,5257015834586726,714,12/2020,,"1Z 37A 329 60 9892 939 6",1176881962,86098320,Green,"Placement counselor",Opticomp,"2000 Citroen C 15",InvestmentBrowse.at,A+,130.2,59.2,"5' 7""",170,58aec14a-fc35-4e19-a61f-b8f170a9ec7d,47.492881,14.377639
18,female,Finnish,Mrs.,Nelma,M,Grönholm,"Rostsestraat 222",Froidchapelle,WLX,Luxembourg,6440,BE,Belgium,NelmaGronholm@jourrapide.com,Obect1946,aeJ6OhneiF3t,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0479 50 54 03",32,Pelkonen,9/15/1946,73,Virgo,MasterCard,5208559023644291,187,12/2021,,"1Z 416 A35 89 5065 644 8",4484163657,51232002,Green,"Claims adjuster","Starship Tapes & Records","2010 Land Rover Defender",MobileKicks.be,A+,216.5,98.4,"5' 7""",169,2f575438-1e23-4123-ab60-990725e8c08b,50.072755,4.344408
19,female,Hungarian,Ms.,Tünde,F,Hoffmann,"Via Nazario Sauro 112","Cusano Milanino",MI,Milano,20095,IT,Italy,HoffmannTunde@armyspy.com,Preacces,aob4eiteiL,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","0352 9353380",39,Bagi,11/1/1951,68,Scorpio,MasterCard,5559688562190559,258,7/2023,DA75119938,"1Z 959 98A 67 7929 896 3",2123939613,78515387,Orange,"Apparel worker","Hugh M. Woods","2005 BMW X5",CrabDealer.it,O+,186.1,84.6,"4' 11""",151,0a14766e-42f5-4943-ae10-884bcadf43f8,45.644863,9.128014
20,female,Finnish,Ms.,Riitta,N,Hahl,"ul. Elbląska 97",Olsztyn,,,10-672,PL,Poland,RiittaHahl@einrot.com,Thill1954,Aiwar2ooh1,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","67 788 02 83",48,Linna,3/2/1954,65,Pisces,MasterCard,5150663209146952,542,2/2021,54030268860,"1Z 626 853 00 4461 590 4",4657047492,90091015,Purple,"Medical secretary",Edwards,"2012 Toyota Prius",DustingSprays.pl,O+,187.7,85.3,"5' 7""",170,e7825dc5-9fdf-478c-ba72-ab25879699f1,53.768852,20.536572
21,male,Dutch,Mr.,Harwin,R,Galesloot,"Glynitveien 218",SKI,,,1400,NO,Norway,HarwinGalesloot@einrot.com,Saffive,wohHaix5fa,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","914 54 925",47,Ramakers,9/28/1994,25,Libra,MasterCard,5580997921941047,961,4/2022,,"1Z 761 410 55 6702 728 6",9467674600,08790710,Blue,"Electric motor repairer","Buena Vista Realty Service","2002 ZAZ Slavuta",SwankBlog.no,B+,217.1,98.7,"5' 6""",167,4029654d-5c8f-407e-b003-72bde9f593a1,59.814132,10.87172
22,male,Hispanic,Mr.,Azarías,A,Segovia,"Via Francesco Girardi 49","Carmignano Di Brenta",PD,Padova,35010,IT,Italy,AzariasSegoviaNava@cuvox.de,Portalime,duteiVeev1,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","0394 9130281",39,Nava,12/24/1949,70,Capricorn,MasterCard,5146736492498053,173,1/2022,SZ30479384,"1Z 931 W28 91 2882 876 3",2747014433,22062098,Blue,"Administrative office manager","Total Serve","2005 BMW 325",EmployeeVerified.it,B+,140.4,63.8,"6' 0""",184,916cbe50-55b2-4a11-acf4-b8d8d9cc9668,45.432966,12.000719
23,male,Hungarian,Mr.,Adelbert,A,Kuncz,"Turjaška 115","Rečica ob Savinji",,,3332,SI,Slovenia,KunczAdelbert@fleckens.hu,Firseten,woh3ejoo2Ei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",031-365-314,386,Pethô,10/17/1976,43,Libra,MasterCard,5431971800958886,663,4/2022,,"1Z 5V7 161 85 3863 932 0",7652480699,67841291,Blue,"Extruding, forming, pressing, and compacting machine setter","Red Owl","2002 Citroen C-Airdream",BankingDetective.si,O+,193.6,88.0,"6' 1""",186,6653d52f-870f-4956-8b1a-c1463007f387,46.357704,14.932894
24,female,England/Wales,Ms.,Charlie,S,Campbell,"Rue du Château 414",Limont,WLG,Liège,4357,BE,Belgium,CharlieCampbell@gustr.com,Lithatinquir,Kae4aetah,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0489 33 97 26",32,Tucker,10/16/1987,32,Libra,MasterCard,5268537479220623,072,1/2023,,"1Z 3V4 354 38 1677 342 8",0119239854,24858321,Purple,"Dry-cleaning worker","Red Robin Stores","2004 Audi S4",PinkCheek.be,O+,119.5,54.3,"5' 8""",173,f3a846e6-ec1c-4ccc-bf51-f66bb64ff559,50.616062,5.247283
25,female,American,Ms.,Thelma,K,Mitchell,"Väike-Laagri 80",Orissaare,SA,Saaremaa,94691,EE,Estonia,ThelmaKMitchell@einrot.com,Trind1979,eeThooy3ieph,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","455 6186",372,Rumbaugh,5/23/1979,40,Gemini,MasterCard,5252527082023181,249,9/2024,,"1Z 001 306 31 4745 481 7",4024903278,57842179,Black,"Camera repairer","Omni Superstore","2000 Ford Artic",BedroomRental.com.ee,A+,118.4,53.8,"5' 6""",168,6e2490c7-4a17-4ce8-9a80-4b82551d3180,58.540748,23.059808
26,male,Brazil,Mr.,Davi,G,Santos,"Rákóczi út 66.",Barnag,VE,Veszprém,8291,HU,Hungary,DaviGoncalvesSantos@superrito.com,Cumeneamord,AeYeRie2doo,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","(88) 158-170",36,Goncalves,3/15/1974,45,Pisces,Visa,4539342451489007,234,9/2021,,"1Z 598 735 11 9286 476 3",9550539865,84997161,Blue,"Clinical manager","Star Merchant Services","2011 Alfa Romeo Giulietta",ConfidentialCash.hu,A+,156.9,71.3,"5' 11""",180,95501fdd-a13f-43fd-981c-0dced4dd23e6,47.04597,17.745945
27,male,England/Wales,Mr.,Jonathan,A,Conway,"Rua Doutor Afrânio Junqueira 1460","São Paulo",SP,"São Paulo",04581-040,BR,Brazil,JonathanConway@jourrapide.com,Bessed,Egaez2Vuo,"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko","(11) 7113-8192",55,Humphries,1/20/1938,81,Aquarius,Visa,4916317241919037,680,3/2023,308.271.618-08,"1Z 02V 42E 21 2992 331 4",3494650897,26033304,Blue,"Electrical drafter","Pro Yard Services","2006 Dodge Caravan",AidsRate.com.br,B+,170.9,77.7,"6' 0""",183,0eb93cfe-1260-4caa-99d5-e0f31f73b1e9,-23.594567,-46.709971
28,male,French,Mr.,Guy,A,Migneault,"90 Petworth Rd",DUNSINNAN,,,"PH2 5HL",GB,"United Kingdom",GuyMigneault@teleworm.us,Enut1960,nahming4Oo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","077 5138 5842",44,Labrecque,4/28/1960,59,Taurus,MasterCard,5581802812256860,692,3/2022,"ZT 01 87 75","1Z 96V 661 27 4962 061 4",8863556519,92044686,Blue,"Technical trainer","Rogers Peet","1999 Fiat Siena",ReligiousCounselor.co.uk,B+,176.7,80.3,"5' 7""",171,eca54fd7-3576-426c-b4f9-b617a3662931,56.025747,-3.640577
29,male,Hispanic,Dr.,Breogan,J,Orosco,"Ηλίου 64",ΛΑΡΝΑΚΑ,LA,Λάρνακα,6031,CY,"Cyprus (Greek)",BreoganOroscoCeballos@teleworm.us,Priback,Quoo9choo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","97 687579",357,Ceballos,9/19/1957,62,Virgo,MasterCard,5567650882732569,392,10/2022,,"1Z A69 196 36 1803 521 0",6314780611,79148788,Green,"Machinery maintenance mechanic","Jack Lang","2010 Hyundai i30",TalkAbuse.com.cy,A+,155.1,70.5,"5' 10""",179,8dda77bb-d75c-40ef-b1c0-6381d305b053,41.348113,-72.957665
30,female,Czech,Mrs.,Jaroslava,M,Kindlová,"22 Rue de Sidi Bou Zid",Zouarine,33,"Governorate Kef",7170,TN,Tunisia,JaroslavaKindlova@armyspy.com,Tepen1939,thahShee7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763","78 427 062",216,Langová,11/28/1939,80,Sagittarius,Visa,4532368231815457,583,12/2022,,"1Z 349 036 22 9992 262 9",5257019048,20372734,Black,"Copy editor",Anthony's,"2005 Nissan Altima",ScanFund.tn,B+,108.7,49.4,"5' 4""",163,94e73608-33dd-4f7a-a86d-9575aa2e63e6,37.219459,9.881268
31,male,Croatian,Mr.,Stjepan,A,Perković,"Escuadro 26","Castelló de Rugat",V,Valencia,46841,ES,Spain,StjepanPerkovic@jourrapide.com,Spastry,uLiH7iech3,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","779 021 982",34,Topić,3/18/1978,41,Pisces,Visa,4539449132248031,692,6/2023,,"1Z V92 2E8 34 9201 447 0",3497779171,87502347,Orange,Geoscientist,Monmax,"2010 Chrysler PT Cruiser",AudioBoom.es,A+,175.8,79.9,"5' 10""",178,e49f858d-b218-4ce7-89ae-62599b3bb276,38.927331,-0.375157
32,male,Croatian,Mr.,Stanko,T,Crnić,"Avda. Alameda Sundheim 46",Benasque,HU,Huesca,22440,ES,Spain,StankoCrnic@fleckens.hu,Waskents,Piu4theeg1ae,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","793 358 347",34,Jozić,9/13/1974,45,Virgo,Visa,4929607743905830,463,2/2021,,"1Z 981 F24 67 9469 260 4",6183936244,94973462,Blue,"Studio camera operator","Wealthy Ideas","2004 Isuzu Axiom",PokerPortraits.es,B+,185.2,84.2,"6' 1""",186,6ee36d07-1397-43c8-86a4-455bce264fbc,42.623207,0.475571
33,female,Russian,Ms.,Marianne,I,Zhdanova,"18 Rue de bayrout","Cite Badrani",61,"Governorate Sfax",3083,TN,Tunisia,MarianneZhdanova@armyspy.com,Thock1968,uRai6thoh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","74 849 807",216,,11/30/1968,51,Sagittarius,MasterCard,5223193173825541,622,7/2020,,"1Z 60W 422 22 2116 147 5",6606006372,32841170,Blue,"Geographic information specialist","Star Interior Design","1996 Dodge Caravan",PreviewBuy.tn,B+,168.3,76.5,"5' 7""",169,eff7c2c8-85f4-45dc-8a4d-60770f4eda42,35.319852,9.785865
34,female,Hungarian,Ms.,Ferike,G,Jónás,"Brucker Bundesstrasse 31",FÜRLING,NO,"Lower Austria",4152,AT,Austria,JonasFerike@dayrep.com,Ofigaill49,ni8ooy1Thee,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763","0681 563 12 72",43,Tolnay,4/12/1949,70,Aries,Visa,4532628381402038,969,6/2021,,"1Z 099 5Y5 30 8995 126 3",7242712846,73346379,Red,"Computer systems administrator","ABCO Foods","1999 Land Rover Discovery",SemiCheap.at,O+,134.6,61.2,"5' 5""",166,4687baf3-0486-4466-a63b-d0dff8903249,48.633035,13.984742
35,female,Czech,Ms.,Lenka,T,Mizerová,"Rhinstrasse 91",München,BY,"Freistaat Bayern",80975,DE,Germany,LenkaMizerova@fleckens.hu,Daudgessed,keer4Ceej9j,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","089 60 25 65",49,Brožová,1/14/1944,75,Capricorn,Visa,4916250570933685,939,6/2023,,"1Z 580 A90 83 6948 520 2",9291535188,57982233,Red,Maid,AdventureSports!,"2000 Toyota Sienna",ExitMarketing.de,O+,184.1,83.7,"5' 1""",154,efb875ed-74dc-4ea3-b155-4a48cb61d187,48.189892,11.502201
36,female,Icelandic,Mrs.,Eyþóra,P,Runólfsdóttir,"Ditscheinergasse 80",HUNDSHAGEN,OO,"Upper Austria",4773,AT,Austria,EythoraRunolfsdottir@jourrapide.com,Expregiat,Beph8ieX,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","0699 830 58 07",43,,11/22/1960,59,Sagittarius,MasterCard,5426319483823638,482,11/2024,,"1Z 448 19V 41 8418 729 7",2455440003,04422500,Purple,"Building cleaning worker","Matrix Architectural Service ","2010 BMW 650",PaidValue.at,O-,130.7,59.4,"5' 6""",168,02a29393-6fdc-4120-adf0-568906c8c111,48.266115,13.568714
37,female,French,Mrs.,Rive,T,Lépicier,"144 Souniou Ave.",Menogeia,LA,Larnaca,7578,CY,"Cyprus (Anglicized)",RiveLepicier@teleworm.us,Carray,ieJij3no,"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0","24 102884",357,Lanteigne,4/15/1983,36,Aries,MasterCard,5494902843118711,376,8/2024,,"1Z W04 891 69 0373 898 8",7751332568,91800864,Yellow,"Payroll and benefits specialist","Lechters Housewares","2006 Nissan Pathfinder",MLSModels.com.cy,A+,119.7,54.4,"5' 4""",163,17afbc1b-4d95-4583-8602-9680b4fd7c5c,41.311392,-72.829123
38,male,Danish,Mr.,Marcus,O,Paulsen,"Plattenstrasse 57",Räterschen,,,8352,CH,Switzerland,MarcusOPaulsen@dayrep.com,Mesee1943,Fi7eiva8Ah,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","044 347 47 26",41,Simonsen,10/8/1943,76,Libra,MasterCard,5378357034830932,713,8/2021,,"1Z 387 E31 19 5962 225 9",2639724434,59408085,Black,"Management development specialist","Parts and Pieces","2006 BMW M3",MalpracticeAgents.ch,B+,207.5,94.3,"5' 10""",177,43ae4a6b-e1ca-4d5e-b6cd-3732697c9c71,47.488972,8.868299
39,female,"Chechen (Latin)",Mrs.,Zeliha,I,Sultygov,"Rookopli 96",Uralaane,VG,Valgamaa,68712,EE,Estonia,ZelihaSultygov@cuvox.de,Dary1953,nohgief2A,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","763 5734",372,Desheriyev,7/27/1953,66,Leo,Visa,4539696753097085,200,11/2021,,"1Z E17 641 44 8337 404 1",5174706823,32841274,Red,"Mental health assistant","Reliable Investments","2000 Opel Signum",MeDue.com.ee,O+,119.9,54.5,"5' 3""",160,68c2ef2c-9990-41bc-be94-afa07e6e2379,58.070181,26.064252
40,female,Russian,Mrs.,Ilona,B,Pirogova,"Αγ. Ανδρέα 130","ΒΑΣΑ ΚΟΙΛΑΝΙΟΥ",LI,Λεμεσός,4771,CY,"Cyprus (Greek)",IlonaPirogova@superrito.com,Hatiere,Ahsha4Ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","25 750307",357,,12/5/1970,49,Sagittarius,Visa,4539731441846112,542,4/2024,,"1Z 008 309 69 5575 108 1",3971097056,06463075,Purple,"Land acquisition manager","Wickes Furniture","1992 Ford Taurus",UGLive.com.cy,O-,135.1,61.4,"5' 1""",156,0c5899a0-ce9c-43a7-aba1-f279893620f9,41.295272,-72.961282
41,female,Croatian,Ms.,Aleksandra,K,Petković,"Binzmühlestrasse 30","San Bernardino",,,6565,CH,Switzerland,AleksandraPetkovic@fleckens.hu,Ramessanies1994,vie2quai7Ie8,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","091 808 37 22",41,Bašić,6/8/1994,25,Gemini,MasterCard,5560110586075796,991,3/2023,,"1Z 11E 310 61 4037 919 3",0620594797,44658507,Purple,"Personal banker",Ejecta,"2005 Kia Amanti",WirelessRelief.ch,A+,110.2,50.1,"5' 4""",163,cc560302-0d00-410c-9629-2e68bb4ef864,46.506804,9.159102
42,male,American,Mr.,Rogelio,A,Patrick,"Πλ Καραισκάκη 128","ΑΓΙΟΣ ΘΕΟ∆ΩΡΟΣ ΣΟΛΕΑΣ",NI,Λευκωσία,2823,CY,"Cyprus (Greek)",RogelioAPatrick@dayrep.com,Whortin1952,Yu7mah2z,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","95 432561",357,Thacker,10/25/1952,67,Scorpio,Visa,4716135824324942,242,9/2021,,"1Z 14A 327 32 1181 648 4",9342714384,12993510,Blue,"Typesetting machine tender","Vibrant Man","2001 Toyota MR2",MacroSigns.com.cy,B+,170.3,77.4,"5' 11""",180,76a566cc-be59-4327-862e-312da09e0c42,41.353523,-72.965839
43,female,American,Mrs.,Evelyn,R,Tucker,"Kringlan 66",Reykjavík,,,107,IS,Iceland,EvelynRTucker@armyspy.com,Arresplet,deiT0ahyu,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36","450 3756",354,Burton,9/25/1986,33,Libra,MasterCard,5592761939548814,873,9/2022,,"1Z 263 919 45 6552 555 7",5057305433,86266508,Purple,"Aquaculture farmer",Weatherill's,"2004 Mitsubishi Galant",TheyTell.is,O+,105.8,48.1,"5' 3""",159,716b5321-34bf-4514-8bca-fce5c482d8c3,64.159592,-21.928397
44,male,Icelandic,Mr.,Þorkell,H,Hallbjörnsson,"Školní 296","Kaplice 1",JC,"Jihoceský kraj","382 41",CZ,"Czech Republic",THorkellHallbjornsson@gustr.com,Dessesid,eeXahew1ui,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","772 616 930",420,,12/19/1957,62,Sagittarius,MasterCard,5534480983249093,443,6/2023,,"1Z 34E 320 47 9554 749 9",1825424609,81224507,Blue,"Financial aid director","Grand Union","2004 Ford Explorer",AttorneyBiographies.cz,A-,236.9,107.7,"6' 1""",186,393472fc-3454-4ba8-af3a-e1f7f626cfee,48.691433,14.516696
45,male,Greenland,Mr.,Jan,H,Geisler,"Bayerhamerstrasse 79",GLAUBENDORF,NO,"Lower Austria",3704,AT,Austria,JanGeisler@jourrapide.com,Subjecould,eepooz6U,"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko","0699 456 17 84",43,Lange,10/25/2000,19,Scorpio,MasterCard,5305776196130476,904,6/2020,,"1Z 084 34A 51 9322 259 5",7785136902,52035400,Blue,"Fire prevention specialist",Mikrotechnic,"2005 Bizzarrini BZ-2001",ProfilePeek.at,B+,203.1,92.3,"5' 9""",175,92630214-ba49-47e4-8717-93e4bba4262e,48.560667,15.917015
46,female,Norwegian,Mrs.,Caroline,M,Landmark,"Via Tasso 21",Perugia,PG,Perugia,06122,IT,Italy,CarolineLandmark@superrito.com,Sweves,ooNee0iechoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","0378 8718408",39,Benjaminsen,9/26/1975,44,Libra,MasterCard,5248190326919222,805,6/2024,PR74491787,"1Z 2Y4 773 67 4365 263 8",7709648575,99170582,Green,"Allopathic physician","Castro Convertibles","2012 Dodge Durango",CheapWarrants.it,A-,187.7,85.3,"5' 6""",168,dde3a962-10d8-4092-b9df-6da79f89f383,43.072973,12.459411
47,female,Swedish,Ms.,Lena,M,Andersson,"Parkring 7",STEINPARZ,OO,"Upper Austria",4730,AT,Austria,LenaAndersson@jourrapide.com,Freen1978,aeVaiHohy7,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","0650 858 08 11",43,Holm,10/24/1978,41,Scorpio,MasterCard,5451268671996177,795,8/2023,,"1Z 683 821 70 2253 409 0",7986278354,59684998,Blue,"Chemical technician","Wholesale Club, Inc.","2007 Kia Carnival",ProvidenceSold.at,AB+,180.2,81.9,"5' 1""",154,8d7f4c08-ee33-4024-9474-0beed004df45,48.234026,13.824536
48,male,Danish,Mr.,Elias,A,Jepsen,"Βασιλέως Αλεξάνδρου 195",ΦΑΡΜΑΚΑΣ,NI,Λευκωσία,2620,CY,"Cyprus (Greek)",EliasAJepsen@dayrep.com,Thenim,uki0Zae7l,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 228011",357,Olsen,9/10/1967,52,Virgo,Visa,4929867576889614,699,4/2023,,"1Z 348 2Y4 34 0337 091 2",3256227540,51352382,Blue,"Court, municipal, and license clerk","Golden's Distributors","2011 Volvo XC70",StudRules.com.cy,A+,214.5,97.5,"5' 9""",174,f06e06a7-59b0-4052-9928-92df476d7753,41.326352,-72.962624
49,male,French,Mr.,Honoré,N,Beaudouin,"13 Faubourg Saint Honoré",PAU,IL,Île-de-France,64000,FR,France,HonoreBeaudouin@superrito.com,Slise1955,Zei7phaeSuutu,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",05.39.21.79.15,33,Daoust,11/2/1955,64,Scorpio,Visa,4539798879618651,321,11/2020,"1551124428495 35","1Z 424 792 46 6757 249 6",9796303410,07585755,Blue,"Dairy scientist","York Steak House","1999 GAZ 3111",RankHunter.fr,O+,236.1,107.3,"5' 9""",175,00a9f1f4-bec6-4dda-ac22-97e2860b1662,43.241847,-0.41343
50,male,American,Mr.,Richard,K,Martinez,"Via Zannoni 49","Tiarno Di Sopra",TN,Trento,38060,IT,Italy,RichardKMartinez@rhyta.com,Himmest,Feitee9ien,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","0328 4921229",39,Dickenson,9/2/1985,34,Virgo,MasterCard,5258225275434802,882,2/2023,PJ53725800,"1Z 4V9 5Y7 22 3010 519 9",5935776326,45852443,Blue,Logistician,Macroserve,"1995 Fiat Bravo",YellowShoppers.it,O+,205.9,93.6,"5' 9""",176,fbe7a3e7-ace4-4495-b6e6-0c2cb4abcc88,45.964528,10.759331
51,male,"Chechen (Latin)",Mr.,Salambek,T,Melikov,"1678 Dorp St",Claremont,WC,"Western Cape",7740,ZA,"South Africa",SalambekMelikov@teleworm.us,Brint1956,ehahCh1xai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","083 792 9726",27,Gairbekov,7/13/1956,63,Cancer,MasterCard,5284109963636027,465,12/2023,5607139788084,"1Z 558 445 48 7417 922 3",1631262867,63160277,Green,"Telephone operator","Kinney Shoes","1998 Chevrolet Trans Sport",TripMetro.co.za,O+,211.9,96.3,"5' 8""",172,b1628f36-9fdd-487d-8297-a35ee6d72ebf,-33.89536,18.479041
52,female,Greenland,Mrs.,Mette,K,Olsen,"ul. Karpacka 69",Bydgoszcz,,,85-164,PL,Poland,MetteOlsen@cuvox.de,Liffew,queiy6ooGh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","88 165 40 96",48,Jeremiassen,1/30/1935,84,Aquarius,MasterCard,5353410735290150,175,3/2023,35013096720,"1Z 129 156 25 6468 002 5",2379843087,66415820,Yellow,Lawyer,MagnaSolution,"2005 Peugeot 107",ChildGaming.pl,O-,117.0,53.2,"5' 0""",152,01d98231-8e58-4e17-9c9a-bc5aa388928a,53.068638,18.093529
53,male,Russian,Mr.,Spartacus,N,Ignatieff,"Bahnhofstrasse 57",Glovelier,,,2855,CH,Switzerland,SpartacusIgnatieff@jourrapide.com,Imeting1968,yi2Eep8gieh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","032 803 90 31",41,,6/16/1968,51,Gemini,MasterCard,5399586162423418,375,3/2024,,"1Z 632 482 77 2949 860 1",2465813266,58937027,Blue,"Sales worker supervisor",Schweggmanns,"2008 Mazda 5",SoccerInstructor.ch,O+,152.9,69.5,"5' 8""",173,6365915b-b99c-426c-ae8e-698f846d3f03,47.232383,7.22689
54,male,Brazil,Mr.,Kauã,S,Cardoso,"P.O. Box 194",Upernavik,QA,Qaasuitsup,3962,GL,Greenland,KauaSantosCardoso@fleckens.hu,Searlitnot,caiT4reN,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 63 21",299,Castro,3/9/2000,19,Pisces,MasterCard,5301206781704786,476,5/2024,,"1Z 346 5Y6 16 7677 108 7",7407791874,63200812,Blue,"Press secretary","Dave Cooks","2002 Hyundai Elantra",SpaRules.gl,A+,155.5,70.7,"5' 10""",179,081371c1-479e-4055-95af-3110e72fc11a,72.786922,-56.131948
55,female,Brazil,Ms.,Fernanda,P,Cavalcanti,"Via degli Aldobrandeschi 3",Jelsi,CB,Campobasso,86015,IT,Italy,FernandaPereiraCavalcanti@superrito.com,Knour1941,ahChohqu4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0327 9982793",39,Souza,11/30/1941,78,Sagittarius,Visa,4929971103746071,969,10/2023,CI48765311,"1Z 223 435 25 6742 103 3",6973121025,59054247,Blue,"Budget analyst","Hughes & Hatcher","2001 Volkswagen Lupo",MartiniMobile.it,B+,106.3,48.3,"5' 5""",165,c75dc0e6-fe45-431f-8907-6e58db479a3d,41.444028,14.707643
56,female,Hungarian,Ms.,Mónika,Z,Göröncsér,"Nábřežní 243","Spálené Porící",PL,"Plzenský kraj","335 61",CZ,"Czech Republic",GoroncserMonika@fleckens.hu,Thenetiong,quohwae5Quoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","376 147 284",420,Szôts,6/11/1946,73,Gemini,Visa,4916007260260864,417,6/2024,,"1Z A43 364 39 5822 708 0",8027400869,93626602,Black,"Precision printing worker","Plunkett Home Furnishings","2001 Bugatti EB 118",LeftJournal.cz,B+,146.3,66.5,"5' 4""",162,13189ec1-db42-4f8c-b74e-e7a45d33a237,49.629,13.606864
57,female,Czech,Mrs.,Zuzana,M,Kozáková,"Via del Pontiere 101","Birgi Aerostazione",TP,Trapani,91020,IT,Italy,ZuzanaKozakova@fleckens.hu,Dinectich,mai5eiXaexai,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","0391 7843193",39,Minarčíková,1/21/1953,66,Aquarius,MasterCard,5164065137771907,924,5/2020,AT69882067,"1Z 736 067 39 5591 664 0",0937144872,77392590,Red,"Industrial engineering technician",Romp,"1998 Isuzu VX-02",EugeneTownhouse.it,A+,123.0,55.9,"5' 3""",161,c414c613-db2b-4f1a-8bf1-a91e518afb85,37.610445,12.42306
58,female,Russian,Mrs.,Eugene,R,Bykova,"Via Goffredo Mameli 149","Poggiovalle Di Borgorose",RI,Rieti,02020,IT,Italy,EugeneBykova@einrot.com,Ingentersed1943,aeN6eenul5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134","0366 9434948",39,,6/29/1943,76,Cancer,Visa,4556039653807659,972,5/2022,KX92160250,"1Z 975 21Y 29 8927 431 6",6861257852,04061254,Brown,"Executive secretary","Kinney Shoes","1994 Plymouth Voyager",GrandLunch.it,A+,195.8,89.0,"4' 11""",150,7eb2e374-3e73-41d3-95b6-f9bb93993872,42.079608,12.989079
59,female,Icelandic,Ms.,Nanna,S,Hallmundsdóttir,"85 Gimblett Street",Richmond,,Invercargill,9810,NZ,"New Zealand",NannaHallmundsdottir@cuvox.de,Wifflife1964,Deez4ooGi0,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","(022) 7929-466",64,,4/15/1964,55,Aries,MasterCard,5504023231705718,236,9/2020,,"1Z 668 366 39 7132 126 6",7750395260,67884791,Black,"Photographic process worker",Megatronic,"1999 Dodge Avenger",PrepaidCDs.co.nz,O+,170.1,77.3,"5' 0""",153,05974695-4733-4bd3-b1ac-bffb56992160,-46.301185,168.423853
60,female,Polish,Ms.,Halina,C,Zielinska,"1 Gloucester Road",CLACHANDHU,,,"PA68 7QD",GB,"United Kingdom",HalinaZielinska@jourrapide.com,Corsome74,Vaem2keeV6,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","077 5149 6820",44,Zielinska,8/7/1974,45,Leo,Visa,4539422686560762,956,3/2022,"TB 10 69 23","1Z 189 A93 08 3744 015 7",0022083989,01394977,Blue,"Surgical technician",Monit,"2011 Renault Grand Scenic",CreditEducate.co.uk,AB+,208.3,94.7,"5' 7""",171,44fd6c97-052a-42a8-b2d2-fb8b3fc70ba8,55.954988,-5.872046
61,female,"Japanese (Anglicized)",Ms.,Hatsuho,K,Yoneda,"Dalmatinova 35",Žabnica,,,4209,SI,Slovenia,HatsuhoYoneda@fleckens.hu,Therl1988,ahteel2maeSh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",051-632-354,386,Mikami,9/17/1988,31,Virgo,MasterCard,5104564324299139,201,4/2021,,"1Z 486 074 10 6995 124 2",8337173450,63167043,Purple,Treasurer,Practi-Plan,"2003 MG ZT",ConventionalMedicines.si,O+,134.4,61.1,"5' 8""",173,e51aa8e2-9fc6-43e7-8488-168ca87b75d7,46.108258,14.328302
62,male,Croatian,Mr.,Gojislav,V,Jukić,"Bahnhofstrasse 96",Gorgier,,,2023,CH,Switzerland,GojislavJukic@dayrep.com,Tinguen,GaiXa3ai,"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36","032 304 33 66",41,Pavić,9/21/1999,20,Virgo,MasterCard,5165115278847765,074,12/2022,,"1Z 385 466 21 1758 512 0",6790588061,83119520,Red,"Extractive metallurgical engineer","Value Giant","2011 Lexus LFA",AmericasFunny.ch,B+,155.5,70.7,"6' 1""",186,0165faab-d065-4a58-a543-057c540a6863,46.955216,6.830252
63,female,England/Wales,Mrs.,Megan,T,Swift,"Postbox 23",Maniitsoq,QE,Qeqqata,3912,GL,Greenland,MeganSwift@teleworm.us,Thersevere,ieGh5huoK6,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","81 32 04",299,Poole,4/30/1964,55,Taurus,Visa,4929574688538812,336,4/2024,,"1Z E38 W63 04 0063 263 0",6213800707,38860322,Green,"Support service manager","Little Folk Shops","2007 Chevrolet Optra",MontereySea.gl,B+,168.7,76.7,"5' 4""",163,394b47d6-e3cb-4869-b4b5-c42a81ef9b00,65.395922,-52.878832
64,male,Slovenian,Mr.,"Milan Franc",S,Košelnik,"Na Výsluní 272",Primda,PL,"Plzenský kraj","348 06",CZ,"Czech Republic",MilanFrancKoselnik@teleworm.us,Lospay67,queeN0ies,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","733 162 319",420,Mankoč,1/14/1967,52,Capricorn,MasterCard,5453102419958199,879,12/2021,,"1Z A16 354 25 3838 551 9",0820470346,05905803,Brown,"Clinical laboratory technologist","Singer Lumber","2003 Holden UTE",TextFraud.cz,A+,214.5,97.5,"5' 9""",175,7234b959-896e-4335-8745-c5716b1c7638,49.619319,12.730847
65,female,German,Ms.,Johanna,P,Maurer,"Kaisergasse 64",KURZENKIRCHEN,OO,"Upper Austria",4770,AT,Austria,JohannaMaurer@armyspy.com,Reptaked1981,ohwae5Tee,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","0688 992 50 51",43,Schuhmacher,7/12/1981,38,Cancer,Visa,4916529470563530,481,4/2020,,"1Z Y23 375 24 5962 121 9",0306589986,74875435,Purple,"Elevator repairer","Solution Answers","2008 Rover Streetwise",PublicityAid.at,AB+,186.6,84.8,"5' 5""",164,30ce10a0-2ea3-4e24-871a-f72cb3beedfc,48.439595,13.547802
66,male,Norwegian,Mr.,Teodor,K,Aune,"Gl. Sygehusvej 153",Narsaq,KU,Kujalleq,3921,GL,Greenland,TeodorAune@superrito.com,Dickent,ooph5leiG,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","66 28 45",299,Arntzen,7/18/1998,21,Cancer,MasterCard,5296939640241254,624,2/2024,,"1Z 216 Y97 87 6791 863 9",0054598944,41175224,Blue,"Private household cook","Builders Emporium","2008 Renault Laguna",GraffitiRoom.gl,O+,239.1,108.7,"5' 9""",174,0562be22-b239-4dbc-a0f7-d64679ae153f,60.827346,-46.022413
67,male,Dutch,Mr.,Abderrahman,I,Kempers,"Hjellestadnipen 66",HJELLESTAD,,,5259,NO,Norway,AbderrahmanKempers@einrot.com,Ancery,eeC4tien9,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","460 73 493",47,Cuperus,7/17/1977,42,Cancer,MasterCard,5358951722758050,004,9/2024,,"1Z Y78 20V 88 9025 826 2",1986315622,50825375,Blue,"Time clerk","Carrols Restaurant Group","2008 Dacia Sandero",PlatinumVoice.no,O+,182.4,82.9,"5' 6""",168,49c9f2d1-5e74-4ef9-8c9c-c5e6338e1f6d,60.280666,5.157123
68,male,French,Mr.,Nicolas,R,Lebrun,"95 Burton Avenue",Okoia,,Wanganui,4500,NZ,"New Zealand",NicolasLebrun@armyspy.com,Wroing,iR2rahpaim2a,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","(027) 0336-972",64,Bondy,3/3/1964,55,Pisces,MasterCard,5167661122227231,094,5/2021,,"1Z 418 1A1 09 5878 510 1",4757008829,69457336,Blue,"Corporate accountant",Playworld,"1997 Citroen Rally Raid",PopularFlicks.co.nz,O+,209.4,95.2,"5' 10""",178,851fc065-0061-4754-b807-421e4242b5ba,-39.863379,174.967351
69,male,Slovenian,Mr.,"Ivan Martin",J,Bugarski,"Breivangvegen 38",TROMSØ,,,9010,NO,Norway,IvanMartinBugarski@einrot.com,Dercy1937,iey2Xoh8o,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","448 63 713",47,Riboli,11/14/1937,82,Scorpio,MasterCard,5364317183005716,388,12/2022,,"1Z 291 4Y2 69 2883 563 7",1469869904,37602111,Red,"Office clerk",Quickbiz,"2000 Chrysler Grand Voyager",StrictlyIdeas.no,O-,174.2,79.2,"5' 10""",179,ca8717a7-1eba-4f93-9f1d-970b2fa1a45e,69.651262,18.958466
70,female,Czech,Ms.,Jarmila,M,Chloupková,"729 Albert St",Germiston,GA,Gauteng,1419,ZA,"South Africa",JarmilaChloupkova@superrito.com,Wountim81,yie8Cees,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","084 256 2607",27,Poláčková,2/5/1981,38,Aquarius,Visa,4716715994153559,960,3/2024,8102054742081,"1Z 454 A20 14 2101 291 2",8866810206,44986764,Red,"Industrial-organizational psychologist","Wells & Wade","1996 Mini MK VI",SleepsAround.co.za,B+,140.6,63.9,"5' 8""",172,940ec903-7fa0-421a-a1c5-7620bea7f7e0,-26.161314,28.133482
71,male,Italian,Dr.,Manlio,M,Capon,"Lützelflühstrasse 122",Wil,,,5300,CH,Switzerland,ManlioCapon@einrot.com,Theyear,ieThuo1fei,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","062 380 34 69",41,Folliero,9/17/1947,72,Virgo,Visa,4929441842722544,930,4/2021,,"1Z 15W 644 50 5740 843 7",5183466840,07258789,Black,"Billing and posting clerk","House Of Denmark","1997 Oldsmobile Eighty-Eight",PrepaidHoliday.ch,B+,162.1,73.7,"5' 10""",179,0e92196a-3502-41c6-83bc-9a3b43c49317,47.510722,8.303196
72,female,"Japanese (Anglicized)",Ms.,Tomomi,Y,Nishiyama,"Rua do Arenque 1634",Goiânia,GO,Goiás,74343-040,BR,Brazil,TomomiNishiyama@fleckens.hu,Mille1991,aeneThoh6x,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","(62) 9976-7986",55,Ishizaki,8/23/1991,28,Virgo,Visa,4716240309647674,486,9/2020,406.537.117-19,"1Z 01V 07A 25 5957 147 4",4689360184,38520562,Purple,"Oxy-gas cutter","Lechters Housewares","2011 Chevrolet HHR",PharmacyFile.com.br,B+,213.2,96.9,"5' 2""",158,5141a9c2-cd45-4813-9cbe-12d630eefce4,-16.687195,-49.226261
73,female,Russian,Dr.,Esther,R,Kalinina,"Τρικάλων 248",ΛΕΥΚΩΣΙΑ,NI,Λευκωσία,1687,CY,"Cyprus (Greek)",EstherKalinina@dayrep.com,Hics1952,euG0Aiqu2,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","22 723018",357,,7/5/1952,67,Cancer,MasterCard,5563120249912803,542,11/2024,,"1Z 734 93Y 11 9330 585 6",6006786087,26151161,Blue,"Ambulatory care nurse","Waccamaw Pottery","1999 MCC Smart",ShapeConsultant.com.cy,A+,133.5,60.7,"5' 7""",170,642dbeb6-defe-4c89-bc0a-5c64ae807dcb,41.266749,-72.834759
74,male,Icelandic,Mr.,Boði,L,Zóphoníasson,"Árpád fejedelem útja 3.",Budapest,BU,Budapest,1184,HU,Hungary,BodiZophoniasson@jourrapide.com,Inart1990,nugheiZ0eig5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","(1) 941-2250",36,,6/12/1990,29,Gemini,MasterCard,5199467250444016,134,12/2020,,"1Z 563 9F0 30 4262 753 9",9294260413,82490560,Orange,"Mail processing machine operator","Bell Markets","1997 Lada Natacha",ReportDiscount.hu,B+,144.5,65.7,"5' 9""",174,a3912b2f-c2dc-4ff7-ab81-af1e047108c5,47.515035,19.146851
75,female,England/Wales,Mrs.,Grace,G,Boyle,"62 Mavrokordatou Street",Foinikaria,LI,Limassol,4530,CY,"Cyprus (Anglicized)",GraceBoyle@einrot.com,Fortat81,cahBoot3eH,"Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","25 880993",357,Thompson,9/27/1981,38,Libra,MasterCard,5597431608392937,582,5/2021,,"1Z 791 957 61 6161 657 2",2596874442,13926446,Blue,"Forensic technician","id Boutiques","1993 Bristol Beaufighter",BayNeck.com.cy,AB+,181.7,82.6,"5' 8""",173,149314e4-80a7-493a-8ece-9b6f0890fd5d,41.321303,-72.986114
76,female,England/Wales,Ms.,Naomi,S,Ryan,"Λ. Μιχαλακοπούλου 160",ΕΓΚΩΜΗ,NI,Λευκωσία,2417,CY,"Cyprus (Greek)",NaomiRyan@rhyta.com,Fien1988,Gaitha4Ei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","22 498317",357,Clayton,11/21/1988,31,Scorpio,MasterCard,5283115033422307,714,12/2023,,"1Z 587 192 06 3768 866 1",7494134303,25065703,Blue,"Sewer pipe cleaner","Monk Real Estate Service","1995 BMW Dinan",MobLag.com.cy,O+,205.0,93.2,"5' 2""",158,65219a9f-2c32-459a-98d2-7a7332e0f52f,41.385016,-72.962431
77,male,Polish,Mr.,Szymon,B,Walczak,"2347 Lauzon Parkway",Windsor,ON,Ontario,"N9A 7A2",CA,Canada,SzymonWalczak@teleworm.us,Dintep,oog4aize7Ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",519-566-3375,1,Pawłowska,9/18/1958,61,Virgo,MasterCard,5342019646824967,284,11/2023,"727 633 539","1Z 883 8V5 80 2897 856 7",2176879008,20843357,Silver,Neurosonographer,"De Pinna","2009 Nissan Frontier",MissingWeapons.ca,B+,222.2,101.0,"5' 10""",179,b2e01ab2-c265-42eb-90b8-e26a6361eed4,42.423583,-82.942171
78,female,Hungarian,Ms.,Mercédesz,S,Szôllôssy,"Atamaria 86","Fornelos de Montes",PO,Pontevedra,36847,ES,Spain,SzollossyMercedesz@jourrapide.com,Musere,Eif1ce0ee,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36","641 572 459",34,Hofmann,1/15/1944,75,Capricorn,MasterCard,5535165808125664,767,3/2023,,"1Z 681 2E4 44 0770 552 8",5631978558,99957155,Red,"Case management aide","White Hen Pantry","2000 Noble M12",NeedCharge.es,O+,213.6,97.1,"5' 7""",169,a5f515e0-310f-47ab-bc0c-a71bac531a1e,42.263983,-8.431245
79,male,Norwegian,Mr.,Edgar,E,Andreassen,"Zistelweg 32",UNTERLAND,SZ,Salzburg,5661,AT,Austria,EdgarAndreassen@fleckens.hu,Waakis2000,Iejeiz1oodei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0664 701 04 17",43,Dybvik,6/29/2000,19,Cancer,Visa,4716078791994463,776,11/2020,,"1Z 311 159 63 7486 723 7",5685893521,99981816,Black,"Speech pathologist",Peaches,"2001 Pontiac Grand Am",WordRegistrar.at,A-,127.8,58.1,"5' 8""",172,15337913-059c-4dd1-9feb-5dc426abe8c7,47.203021,12.910163
80,male,Slovenian,Mr.,Šemsudin,M,Vrhovski,"ul. Dawida Jana 124",Wrocław,,,50-527,PL,Poland,SemsudinVrhovski@rhyta.com,Gother,keeLaz9lee0,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.132","67 534 85 44",48,Pataki,10/13/1955,64,Libra,Visa,4716877835558592,363,3/2022,55101320290,"1Z E40 449 57 5657 736 1",2373733334,21925106,Blue,"ABE teacher","Integra Wealth Planners","2015 BMW X5 M",VirginExpo.pl,B+,197.6,89.8,"5' 11""",180,5c11fc04-45f8-4989-801f-4102ff38d376,51.112923,17.027289
81,female,Finnish,Mrs.,Satu,A,Waltari,"2071 Maryland Avenue",Pinellas,FL,Florida,34624,US,"United States",SatuWaltari@teleworm.us,Stittair,jal6oNgoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:66.0) Gecko/20100101 Firefox/66.0",727-538-7059,1,Viitala,9/16/1995,24,Virgo,Visa,4556890465838575,158,12/2024,591-28-5104,"1Z 534 941 77 8508 193 2",5257097378,69898015,Yellow,"Soil scientist","White Hen Pantry","2003 Daihatsu Terios",kupitorta.com,O+,141.0,64.1,"5' 5""",164,37de7e34-2624-444a-978d-b1b758fbc993,27.864456,-82.748032
82,male,German,Mr.,Matthias,S,Himmel,"Degnehøjvej 45",Silkeborg,MI,"Region Midtjylland",8600,DK,Denmark,MatthiasHimmel@armyspy.com,Barted,Jemu5poosoo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15",30-62-84-08,45,Trommler,8/23/1983,36,Virgo,MasterCard,5289947968601628,320,12/2024,230883-1143,"1Z 006 174 60 6563 087 1",3945260717,87205650,Blue,"Mental health social worker","Superior Appraisals","1998 Alpina B 12",StLouisLighting.dk,A+,143.4,65.2,"5' 10""",179,116742f3-f65e-45f1-a917-d13ad1db7bd4,56.199078,9.447827
83,female,Danish,Ms.,Mia,A,Frederiksen,"ul. Zuchów 65","Dąbrowa Górnicza",,,41-303,PL,Poland,MiaAFrederiksen@rhyta.com,Fance1958,buY5faij,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0","53 459 91 54",48,Lauritsen,6/18/1958,61,Gemini,MasterCard,5454007072610160,208,4/2023,58061866242,"1Z 501 697 49 5209 014 8",9285893233,21439381,Blue,"Fine arts photographer","Coon Chicken Inn","2008 SSC Aero",WrestlingMonthly.pl,O+,181.1,82.3,"5' 2""",158,8969b475-9dce-4173-b060-32da08dbbf0d,50.417075,19.133549
84,male,Swedish,Mr.,Jesper,N,Lund,"Põllu 59",Kähu,VG,Valgamaa,68506,EE,Estonia,JesperLund@armyspy.com,Planstim,AeBeiNii0,"Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","763 2200",372,Lundgren,5/6/1938,81,Taurus,Visa,4539748641306150,938,3/2024,,"1Z 484 548 09 5749 331 5",6980674650,69979073,White,Photogrammetrist,"Expo Superstore","1997 Panoz AIV",CreditChaos.com.ee,A+,178.0,80.9,"5' 10""",178,32b3dfb9-2eaa-4af2-a012-2a044f866550,57.915676,26.169326
85,male,Icelandic,Mr.,Guðgeir,S,Bergsveinsson,"Rue du Centre 320",Marke,VWV,"West Flanders",8510,BE,Belgium,GudgeirBergsveinsson@armyspy.com,Ressen,phukieGae9c,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","0493 28 62 88",32,,4/2/1994,25,Aries,MasterCard,5432615789205137,688,11/2021,,"1Z 684 8A0 47 6831 298 9",0387412870,16497840,Orange,"Forging machine tender","Crafts & More","1994 Mitsubishi Sigma",SoldierResources.be,A+,165.9,75.4,"5' 8""",172,1016b5ad-56e3-4fb2-ba90-dc1f43694493,50.73779,3.22707
86,male,Icelandic,Mr.,Esjar,S,Sturluson,"Hauptstrasse 75","PUCH BEI HALLEIN",SZ,Salzburg,5412,AT,Austria,EsjarSturluson@teleworm.us,Conetund,taeWeF2Eeph4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0699 465 17 25",43,,3/24/1970,49,Aries,MasterCard,5400451109415331,492,9/2021,,"1Z 807 0E7 80 7325 125 1",3810913760,52784081,Blue,"Dietetic technician","Endicott Johnson","2002 Smart ForFour",MicroLists.at,B+,171.6,78.0,"6' 0""",182,b63ece12-4e7f-4348-bce8-1d8d5dc31dff,47.741555,13.137162
87,male,England/Wales,Mr.,Zak,M,Leonard,"27 Stroud Rd",OCHTERTYRE,,,"PH7 6LF",GB,"United Kingdom",ZakLeonard@fleckens.hu,Surn1940,ohteeF5RaeM,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","078 3687 4061",44,Henry,4/4/1940,79,Aries,MasterCard,5336964062411492,178,11/2023,"KY 97 49 93 A","1Z 77F 43A 87 5585 068 2",5973298171,66698464,Blue,"Heat treating equipment tender","The Independent Planners","1996 Mitsubishi Verada",HumorVids.co.uk,O+,221.1,100.5,"5' 7""",171,d667dde7-082e-4480-98d9-5bdd383eb187,56.07981,-4.643057
88,female,"Chechen (Latin)",Mrs.,Ezinet,B,Umkhayev,"216 Karaiskaki Sq",Ineia,PA,Paphos,8704,CY,"Cyprus (Anglicized)",EzinetUmkhayev@dayrep.com,Mothen1991,IeH2ceebae,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","97 696060",357,Masaev,3/27/1991,28,Aries,Visa,4929316208182816,160,4/2021,,"1Z 77V 114 81 0072 703 9",4908986837,97732729,Purple,"Quality assurance inspector","Chief Auto Parts","2005 Jaguar XKR",WealthyGadgets.com.cy,B+,218.9,99.5,"5' 9""",175,7cb11d28-cbd3-49c2-be74-e6dcdca65cb4,41.270842,-72.883851
89,female,Russian,Mrs.,Lucia,V,Voronina,"75 Sale-Heyfield Road",KONGWAK,VIC,Victoria,3951,AU,Australia,LuciaVoronina@gustr.com,Riets1976,aY3ohbe8ai,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","(03) 5371 4059",61,,7/25/1976,43,Leo,MasterCard,5459182457974252,335,5/2024,,"1Z 618 731 57 5565 866 4",4199244189,20580381,Blue,"Eligibility interviewer","William Wanamaker & Sons","2012 Tata Indica",CheatPrevention.com.au,O+,176.7,80.3,"5' 7""",169,bf8e68c3-2842-49da-8c4d-1ed4712a3852,-38.465215,145.830079
90,male,Slovenian,Mr.,Milorad,S,Musić,"Välja 61",Mustahamba,VR,Võrumaa,66258,EE,Estonia,MiloradMusic@dayrep.com,Entils,oan8Eiyoaz,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","789 0750",372,Flach,12/1/1968,51,Sagittarius,MasterCard,5211243470404849,924,2/2022,,"1Z F99 556 56 0740 270 3",0643392658,82292582,Blue,Housekeeper,"Cougar Investment","2000 Buick Rendezvous",StickerEmporium.com.ee,A+,244.4,111.1,"6' 1""",185,ea727fc4-6f40-412c-9574-b372a6aef26f,57.820634,26.970835
91,female,"Japanese (Anglicized)",Dr.,Chisaki,M,Fujimura,"1956 Uitsig St",Grahamstad,EC,"Eastern Cape",6139,ZA,"South Africa",ChisakiFujimura@cuvox.de,Rewhe1979,iayiQu9ahsie,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","082 875 2166",27,Wakabayashi,12/22/1979,40,Capricorn,Visa,4485068737963325,600,10/2020,7912221956187,"1Z 480 79V 08 4325 733 4",5381640942,65242007,Brown,"Extruding and drawing machine setters","Alert Alarm Company","2005 Porsche Cayenne",NoteBack.co.za,O+,216.5,98.4,"5' 2""",157,b1a8327e-794e-4685-ab8b-43d20c08ed68,-33.370929,26.578978
92,female,Russian,Mrs.,Inessa,D,Samoylova,"Bachloh 60",WATZING,OO,"Upper Austria",4673,AT,Austria,InessaSamoylova@fleckens.hu,Rolong,moot9aQu2d,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","0664 396 27 04",43,,5/11/1956,63,Taurus,Visa,4556264124085418,064,5/2024,,"1Z 596 V55 76 1512 629 4",1323820614,72364512,Purple,Patternmaker,"Erb Lumber","2009 Honda CR-V",EthanolSpecialist.at,O+,135.3,61.5,"5' 3""",161,13f411d7-0754-4233-a902-37a68ce4bb45,48.118598,13.654018
93,female,England/Wales,Ms.,Elise,C,Pearson,"215 Andrew Street",Monaco,,Nelson,7011,NZ,"New Zealand",ElisePearson@rhyta.com,Norly1997,eeNgoes7aez,"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0","(027) 7329-039",64,Wilson,5/12/1997,22,Taurus,Visa,4929452838531450,902,4/2022,,"1Z 245 516 58 7073 695 7",9911299106,38307168,Blue,"Activity specialist","The Independent Planners","1995 Nissan President",USFirm.co.nz,O+,162.1,73.7,"5' 9""",175,962cba25-bb3d-4b3b-8aca-e4689ee69dd5,-41.333626,173.307741
94,male,Norwegian,Mr.,Herman,A,Johansen,"1324 Mosman Rd","Alexander Bay",NC,"Northern Cape",8294,ZA,"South Africa",HermanJohansen@gustr.com,Thinde,loaH6shiemoh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","083 779 9214",27,Smestad,7/29/1947,72,Leo,Visa,4929709070861949,812,9/2023,4707295736082,"1Z 2E1 Y32 51 4242 127 7",3751331085,25346448,Blue,"Cost estimator","Brown Derby","2014 Audi SQ5",TypoPro.co.za,O+,182.8,83.1,"5' 10""",179,377c5af3-3ab2-405b-bc80-1ddd4f06ecca,-28.511777,16.410349
95,male,Russian,Mr.,Armen,D,Balabanov,"Bavorovská 788",Stachy,JC,"Jihoceský kraj","384 73",CZ,"Czech Republic",ArmenBalabanov@teleworm.us,Gery1975,Gaeg6uchoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","606 972 932",420,,1/11/1975,44,Capricorn,Visa,4532025806629404,529,1/2021,,"1Z 245 330 29 3529 731 4",5526134624,86584996,Orange,Rigger,"Sun Foods","1999 Isuzu VX-02",MyBloggers.cz,A+,199.8,90.8,"5' 6""",168,ec3e15ac-78df-4844-a8e3-c6c491c6dd39,49.090909,13.642637
96,male,Russian,Mr.,Evdokim,Y,Bazarov,"Reykjarhóli 70",Fljót,,,570,IS,Iceland,EvdokimBazarov@einrot.com,Deet1996,aiRubie9Poqu,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","413 4270",354,,10/5/1996,23,Libra,MasterCard,5413518495816218,546,9/2024,,"1Z 900 9A2 84 8206 503 2",0362518360,38724903,White,"Insurance investigator","Modern Realty","1996 ZAZ Wagon",WeekendScores.is,O+,212.1,96.4,"5' 8""",173,9112f339-d232-4fa7-a1f3-2e74571fd00a,66.154544,-17.801351
97,female,Hungarian,Mrs.,Agoti,B,Gyarmaty,"793 Buena Vista Avenue",Corvallis,OR,Oregon,97330,US,"United States",GyarmatyAgoti@jourrapide.com,Eage1963,pheiSha1aqu,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15",541-714-1388,1,Cseh,12/7/1963,56,Sagittarius,MasterCard,5328604768229802,989,8/2024,543-24-6755,"1Z 238 019 37 1904 563 8",2038177985,21602780,Blue,"Licensed clinical social worker","Circuit Design","2001 Suzuki Covie",adrifza.com,A+,140.8,64.0,"5' 4""",163,b1f4eed7-ff6b-4671-bf5d-a04c6a7b4beb,44.597298,-123.334112
98,male,England/Wales,Mr.,John,K,Carpenter,"Tavcarjeva 22",Senovo,,,8281,SI,Slovenia,JohnCarpenter@dayrep.com,Foris1988,el6xoh7Qu,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",070-783-977,386,Wheeler,1/10/1988,31,Capricorn,MasterCard,5174982341006037,269,3/2020,,"1Z E95 341 79 8897 978 9",0309942560,54166967,Blue,"Marketing coordinator","Balanced Fortune","1992 Mazda AZ-1",KeywordAlbum.si,O+,226.4,102.9,"5' 8""",173,d2754fd9-f1c9-47cd-b6e1-7c8d5c0eec30,46.102339,15.464625
99,female,Hispanic,Mrs.,Maha,A,Cazares,"Reyes Católicos 75","Chiclana de la Frontera",CA,Cádiz,11130,ES,Spain,MahaCazaresMendez@superrito.com,Martrust57,Ohqu6achie,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","624 412 511",34,Méndez,9/25/1957,62,Libra,Visa,4485545084297530,898,6/2020,,"1Z 508 603 87 4474 636 9",1330099115,99991665,Blue,"Diesel train engineer","Sew-Fro Fabrics","2001 Alfa Romeo GTV",WellnessPlant.es,A+,213.2,96.9,"5' 8""",172,d768e30b-a6de-4977-8bb8-0c8432acce44,36.447765,-6.204969
100,female,American,Mrs.,Patricia,J,Nevels,"72 Acheron Road",BUNDALAGUAH,VIC,Victoria,3851,AU,Australia,PatriciaJNevels@rhyta.com,Butimis1962,eekea5Thoo,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","(03) 5301 7984",61,Thomas,7/30/1962,57,Leo,Visa,4716717759727577,065,9/2024,,"1Z 449 366 30 8287 656 2",3560472157,65535722,Blue,"Identification clerk","PriceRite Warehouse Club","2005 Infiniti QX56",PlayDetails.com.au,A-,209.9,95.4,"5' 3""",161,6156ce11-c2a6-4266-bb4d-f47b06292e4e,-38.111213,147.271178
80 79 male Norwegian Mr. Edgar E Andreassen Zistelweg 32 UNTERLAND SZ Salzburg 5661 AT Austria EdgarAndreassen@fleckens.hu Waakis2000 Iejeiz1oodei Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36 0664 701 04 17 43 Dybvik 6/29/2000 19 Cancer Visa 4716078791994463 776 11/2020 1Z 311 159 63 7486 723 7 5685893521 99981816 Black Speech pathologist Peaches 2001 Pontiac Grand Am WordRegistrar.at A- 127.8 58.1 5' 8" 172 15337913-059c-4dd1-9feb-5dc426abe8c7 47.203021 12.910163
81 80 male Slovenian Mr. Šemsudin M Vrhovski ul. Dawida Jana 124 Wrocław 50-527 PL Poland SemsudinVrhovski@rhyta.com Gother keeLaz9lee0 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.132 67 534 85 44 48 Pataki 10/13/1955 64 Libra Visa 4716877835558592 363 3/2022 55101320290 1Z E40 449 57 5657 736 1 2373733334 21925106 Blue ABE teacher Integra Wealth Planners 2015 BMW X5 M VirginExpo.pl B+ 197.6 89.8 5' 11" 180 5c11fc04-45f8-4989-801f-4102ff38d376 51.112923 17.027289
82 81 female Finnish Mrs. Satu A Waltari 2071 Maryland Avenue Pinellas FL Florida 34624 US United States SatuWaltari@teleworm.us Stittair jal6oNgoh Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:66.0) Gecko/20100101 Firefox/66.0 727-538-7059 1 Viitala 9/16/1995 24 Virgo Visa 4556890465838575 158 12/2024 591-28-5104 1Z 534 941 77 8508 193 2 5257097378 69898015 Yellow Soil scientist White Hen Pantry 2003 Daihatsu Terios kupitorta.com O+ 141.0 64.1 5' 5" 164 37de7e34-2624-444a-978d-b1b758fbc993 27.864456 -82.748032
83 82 male German Mr. Matthias S Himmel Degnehøjvej 45 Silkeborg MI Region Midtjylland 8600 DK Denmark MatthiasHimmel@armyspy.com Barted Jemu5poosoo Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15 30-62-84-08 45 Trommler 8/23/1983 36 Virgo MasterCard 5289947968601628 320 12/2024 230883-1143 1Z 006 174 60 6563 087 1 3945260717 87205650 Blue Mental health social worker Superior Appraisals 1998 Alpina B 12 StLouisLighting.dk A+ 143.4 65.2 5' 10" 179 116742f3-f65e-45f1-a917-d13ad1db7bd4 56.199078 9.447827
84 83 female Danish Ms. Mia A Frederiksen ul. Zuchów 65 Dąbrowa Górnicza 41-303 PL Poland MiaAFrederiksen@rhyta.com Fance1958 buY5faij Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 53 459 91 54 48 Lauritsen 6/18/1958 61 Gemini MasterCard 5454007072610160 208 4/2023 58061866242 1Z 501 697 49 5209 014 8 9285893233 21439381 Blue Fine arts photographer Coon Chicken Inn 2008 SSC Aero WrestlingMonthly.pl O+ 181.1 82.3 5' 2" 158 8969b475-9dce-4173-b060-32da08dbbf0d 50.417075 19.133549
85 84 male Swedish Mr. Jesper N Lund Põllu 59 Kähu VG Valgamaa 68506 EE Estonia JesperLund@armyspy.com Planstim AeBeiNii0 Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0 763 2200 372 Lundgren 5/6/1938 81 Taurus Visa 4539748641306150 938 3/2024 1Z 484 548 09 5749 331 5 6980674650 69979073 White Photogrammetrist Expo Superstore 1997 Panoz AIV CreditChaos.com.ee A+ 178.0 80.9 5' 10" 178 32b3dfb9-2eaa-4af2-a012-2a044f866550 57.915676 26.169326
86 85 male Icelandic Mr. Guðgeir S Bergsveinsson Rue du Centre 320 Marke VWV West Flanders 8510 BE Belgium GudgeirBergsveinsson@armyspy.com Ressen phukieGae9c Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 0493 28 62 88 32 4/2/1994 25 Aries MasterCard 5432615789205137 688 11/2021 1Z 684 8A0 47 6831 298 9 0387412870 16497840 Orange Forging machine tender Crafts & More 1994 Mitsubishi Sigma SoldierResources.be A+ 165.9 75.4 5' 8" 172 1016b5ad-56e3-4fb2-ba90-dc1f43694493 50.73779 3.22707
87 86 male Icelandic Mr. Esjar S Sturluson Hauptstrasse 75 PUCH BEI HALLEIN SZ Salzburg 5412 AT Austria EsjarSturluson@teleworm.us Conetund taeWeF2Eeph4 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 0699 465 17 25 43 3/24/1970 49 Aries MasterCard 5400451109415331 492 9/2021 1Z 807 0E7 80 7325 125 1 3810913760 52784081 Blue Dietetic technician Endicott Johnson 2002 Smart ForFour MicroLists.at B+ 171.6 78.0 6' 0" 182 b63ece12-4e7f-4348-bce8-1d8d5dc31dff 47.741555 13.137162
88 87 male England/Wales Mr. Zak M Leonard 27 Stroud Rd OCHTERTYRE PH7 6LF GB United Kingdom ZakLeonard@fleckens.hu Surn1940 ohteeF5RaeM Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 078 3687 4061 44 Henry 4/4/1940 79 Aries MasterCard 5336964062411492 178 11/2023 KY 97 49 93 A 1Z 77F 43A 87 5585 068 2 5973298171 66698464 Blue Heat treating equipment tender The Independent Planners 1996 Mitsubishi Verada HumorVids.co.uk O+ 221.1 100.5 5' 7" 171 d667dde7-082e-4480-98d9-5bdd383eb187 56.07981 -4.643057
89 88 female Chechen (Latin) Mrs. Ezinet B Umkhayev 216 Karaiskaki Sq Ineia PA Paphos 8704 CY Cyprus (Anglicized) EzinetUmkhayev@dayrep.com Mothen1991 IeH2ceebae Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 97 696060 357 Masaev 3/27/1991 28 Aries Visa 4929316208182816 160 4/2021 1Z 77V 114 81 0072 703 9 4908986837 97732729 Purple Quality assurance inspector Chief Auto Parts 2005 Jaguar XKR WealthyGadgets.com.cy B+ 218.9 99.5 5' 9" 175 7cb11d28-cbd3-49c2-be74-e6dcdca65cb4 41.270842 -72.883851
90 89 female Russian Mrs. Lucia V Voronina 75 Sale-Heyfield Road KONGWAK VIC Victoria 3951 AU Australia LuciaVoronina@gustr.com Riets1976 aY3ohbe8ai Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0 (03) 5371 4059 61 7/25/1976 43 Leo MasterCard 5459182457974252 335 5/2024 1Z 618 731 57 5565 866 4 4199244189 20580381 Blue Eligibility interviewer William Wanamaker & Sons 2012 Tata Indica CheatPrevention.com.au O+ 176.7 80.3 5' 7" 169 bf8e68c3-2842-49da-8c4d-1ed4712a3852 -38.465215 145.830079
91 90 male Slovenian Mr. Milorad S Musić Välja 61 Mustahamba VR Võrumaa 66258 EE Estonia MiloradMusic@dayrep.com Entils oan8Eiyoaz Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 789 0750 372 Flach 12/1/1968 51 Sagittarius MasterCard 5211243470404849 924 2/2022 1Z F99 556 56 0740 270 3 0643392658 82292582 Blue Housekeeper Cougar Investment 2000 Buick Rendezvous StickerEmporium.com.ee A+ 244.4 111.1 6' 1" 185 ea727fc4-6f40-412c-9574-b372a6aef26f 57.820634 26.970835
92 91 female Japanese (Anglicized) Dr. Chisaki M Fujimura 1956 Uitsig St Grahamstad EC Eastern Cape 6139 ZA South Africa ChisakiFujimura@cuvox.de Rewhe1979 iayiQu9ahsie Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 082 875 2166 27 Wakabayashi 12/22/1979 40 Capricorn Visa 4485068737963325 600 10/2020 7912221956187 1Z 480 79V 08 4325 733 4 5381640942 65242007 Brown Extruding and drawing machine setters Alert Alarm Company 2005 Porsche Cayenne NoteBack.co.za O+ 216.5 98.4 5' 2" 157 b1a8327e-794e-4685-ab8b-43d20c08ed68 -33.370929 26.578978
93 92 female Russian Mrs. Inessa D Samoylova Bachloh 60 WATZING OO Upper Austria 4673 AT Austria InessaSamoylova@fleckens.hu Rolong moot9aQu2d Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0 0664 396 27 04 43 5/11/1956 63 Taurus Visa 4556264124085418 064 5/2024 1Z 596 V55 76 1512 629 4 1323820614 72364512 Purple Patternmaker Erb Lumber 2009 Honda CR-V EthanolSpecialist.at O+ 135.3 61.5 5' 3" 161 13f411d7-0754-4233-a902-37a68ce4bb45 48.118598 13.654018
94 93 female England/Wales Ms. Elise C Pearson 215 Andrew Street Monaco Nelson 7011 NZ New Zealand ElisePearson@rhyta.com Norly1997 eeNgoes7aez Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0 (027) 7329-039 64 Wilson 5/12/1997 22 Taurus Visa 4929452838531450 902 4/2022 1Z 245 516 58 7073 695 7 9911299106 38307168 Blue Activity specialist The Independent Planners 1995 Nissan President USFirm.co.nz O+ 162.1 73.7 5' 9" 175 962cba25-bb3d-4b3b-8aca-e4689ee69dd5 -41.333626 173.307741
95 94 male Norwegian Mr. Herman A Johansen 1324 Mosman Rd Alexander Bay NC Northern Cape 8294 ZA South Africa HermanJohansen@gustr.com Thinde loaH6shiemoh Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 083 779 9214 27 Smestad 7/29/1947 72 Leo Visa 4929709070861949 812 9/2023 4707295736082 1Z 2E1 Y32 51 4242 127 7 3751331085 25346448 Blue Cost estimator Brown Derby 2014 Audi SQ5 TypoPro.co.za O+ 182.8 83.1 5' 10" 179 377c5af3-3ab2-405b-bc80-1ddd4f06ecca -28.511777 16.410349
96 95 male Russian Mr. Armen D Balabanov Bavorovská 788 Stachy JC Jihoceský kraj 384 73 CZ Czech Republic ArmenBalabanov@teleworm.us Gery1975 Gaeg6uchoh Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 606 972 932 420 1/11/1975 44 Capricorn Visa 4532025806629404 529 1/2021 1Z 245 330 29 3529 731 4 5526134624 86584996 Orange Rigger Sun Foods 1999 Isuzu VX-02 MyBloggers.cz A+ 199.8 90.8 5' 6" 168 ec3e15ac-78df-4844-a8e3-c6c491c6dd39 49.090909 13.642637
97 96 male Russian Mr. Evdokim Y Bazarov Reykjarhóli 70 Fljót 570 IS Iceland EvdokimBazarov@einrot.com Deet1996 aiRubie9Poqu Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 413 4270 354 10/5/1996 23 Libra MasterCard 5413518495816218 546 9/2024 1Z 900 9A2 84 8206 503 2 0362518360 38724903 White Insurance investigator Modern Realty 1996 ZAZ Wagon WeekendScores.is O+ 212.1 96.4 5' 8" 173 9112f339-d232-4fa7-a1f3-2e74571fd00a 66.154544 -17.801351
98 97 female Hungarian Mrs. Agoti B Gyarmaty 793 Buena Vista Avenue Corvallis OR Oregon 97330 US United States GyarmatyAgoti@jourrapide.com Eage1963 pheiSha1aqu Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15 541-714-1388 1 Cseh 12/7/1963 56 Sagittarius MasterCard 5328604768229802 989 8/2024 543-24-6755 1Z 238 019 37 1904 563 8 2038177985 21602780 Blue Licensed clinical social worker Circuit Design 2001 Suzuki Covie adrifza.com A+ 140.8 64.0 5' 4" 163 b1f4eed7-ff6b-4671-bf5d-a04c6a7b4beb 44.597298 -123.334112
99 98 male England/Wales Mr. John K Carpenter Tavcarjeva 22 Senovo 8281 SI Slovenia JohnCarpenter@dayrep.com Foris1988 el6xoh7Qu Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0 070-783-977 386 Wheeler 1/10/1988 31 Capricorn MasterCard 5174982341006037 269 3/2020 1Z E95 341 79 8897 978 9 0309942560 54166967 Blue Marketing coordinator Balanced Fortune 1992 Mazda AZ-1 KeywordAlbum.si O+ 226.4 102.9 5' 8" 173 d2754fd9-f1c9-47cd-b6e1-7c8d5c0eec30 46.102339 15.464625
100 99 female Hispanic Mrs. Maha A Cazares Reyes Católicos 75 Chiclana de la Frontera CA Cádiz 11130 ES Spain MahaCazaresMendez@superrito.com Martrust57 Ohqu6achie Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 624 412 511 34 Méndez 9/25/1957 62 Libra Visa 4485545084297530 898 6/2020 1Z 508 603 87 4474 636 9 1330099115 99991665 Blue Diesel train engineer Sew-Fro Fabrics 2001 Alfa Romeo GTV WellnessPlant.es A+ 213.2 96.9 5' 8" 172 d768e30b-a6de-4977-8bb8-0c8432acce44 36.447765 -6.204969
101 100 female American Mrs. Patricia J Nevels 72 Acheron Road BUNDALAGUAH VIC Victoria 3851 AU Australia PatriciaJNevels@rhyta.com Butimis1962 eekea5Thoo Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0 (03) 5301 7984 61 Thomas 7/30/1962 57 Leo Visa 4716717759727577 065 9/2024 1Z 449 366 30 8287 656 2 3560472157 65535722 Blue Identification clerk PriceRite Warehouse Club 2005 Infiniti QX56 PlayDetails.com.au A- 209.9 95.4 5' 3" 161 6156ce11-c2a6-4266-bb4d-f47b06292e4e -38.111213 147.271178

View file

@ -0,0 +1,3 @@
ROCKET
rocket
racket

View file

@ -0,0 +1,5 @@
ROCKET
irocketiere
rock
pocket
racket

View file

@ -0,0 +1,4 @@
ROCKET
Rocket
Rocket
rocket

74124
tests/data/generated_large.txt Normal file

Diff not shown because of the file's large size.

Diff not shown because of the file's large size.

View file

@ -0,0 +1,4 @@
My name is [FIRST_NAME] [LAST_NAME] and I fly a [ROCKET]
I'm [ROCKET]
The customer's name is [LAST_NAME], [FIRST_NAME] where is my [ROCKET]
The customer's name is [FIRST_NAME] [ROCKET]

15
tests/data/templates.txt Normal file
View file

@ -0,0 +1,15 @@
My email is [EMAIL]
My address is [ADDRESS]
My first name is [FIRST_NAME] and my last is [LAST_NAME]
My name is [PERSON]
My zip is [ZIP]
I live in [CITY]
Here's my phone number: [PHONE_NUMBER]
You want my credit card? No problem: [CREDIT_CARD]
I was born on [BIRTHDAY]
My full address is [FULL_ADDRESS]
My kids are [PERSON] and [PERSON2]
I either live on [ADDRESS] or [ADDRESS2]
Our last names are [LAST_NAME] and [LAST_NAME2]
My first name is [FIRST_NAME] and [FIRST_NAME2]
My accounts are [ACCOUNT_NUMBER] and [ACCOUNT_NUMBER2]
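Each line above is a template: the bracketed placeholders (e.g. [PERSON], [CREDIT_CARD]) are filled in with values from the fake-PII CSV. As a rough sketch of how these templates are consumed, here is the generate() call used in tests/test_generator.py below, with illustrative paths:

from presidio_evaluator.data_generator import generate, read_synth_dataset

# Paths and the output file name are illustrative; point them at your local copies.
generate(fake_pii_csv="tests/data/FakeNameGenerator.com_100.csv",
         utterances_file="tests/data/templates.txt",
         dictionary_path="tests/data/Dictionary_test.csv",
         output_file="generated_sample.txt",
         lower_case_ratio=0.3,
         num_of_examples=10)

# Each record keeps the filled-in text plus spans/tags (field names as serialized in tests/generated_test.txt).
for sample in read_synth_dataset("generated_sample.txt"):
    print(sample.full_text, [span.entity_type for span in sample.spans])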

428
tests/generated_test.txt Normal file
View file

@ -0,0 +1,428 @@
[
{
"full_text": "My full address is Avda. Alameda Sundheim 46",
"masked": null,
"spans": [
{
"entity_type": "FULL_ADDRESS",
"entity_value": "Avda. Alameda Sundheim 46",
"start_position": 19,
"end_position": 44
}
],
"tokens": [
{
"text": "My",
"idx": 0,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "full",
"idx": 3,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "amod",
"lemma_": "full",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "address",
"idx": 8,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "nsubj",
"lemma_": "address",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "is",
"idx": 16,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "ROOT",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Avda",
"idx": 19,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Avda",
"_": {
"is_in_vocabulary": false
}
},
{
"text": ".",
"idx": 23,
"tag_": ".",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ".",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Alameda",
"idx": 25,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "compound",
"lemma_": "Alameda",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Sundheim",
"idx": 33,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "ROOT",
"lemma_": "Sundheim",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "46",
"idx": 42,
"tag_": "CD",
"pos_": "NUM",
"dep_": "nummod",
"lemma_": "46",
"_": {
"is_in_vocabulary": false
}
}
],
"tags": [
"O",
"O",
"O",
"O",
"B-FULL_ADDRESS",
"I-FULL_ADDRESS",
"I-FULL_ADDRESS",
"I-FULL_ADDRESS",
"L-FULL_ADDRESS"
],
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "Croatian",
"Country": "Uganda",
"Lowercase": false,
"Template#": 9
}
},
{
"full_text": "You want my credit card? No problem: 4532368231815457",
"masked": null,
"spans": [
{
"entity_type": "CREDIT_CARD",
"entity_value": "4532368231815457",
"start_position": 37,
"end_position": 53
}
],
"tokens": [
{
"text": "You",
"idx": 0,
"tag_": "PRP",
"pos_": "PRON",
"dep_": "nsubj",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "want",
"idx": 4,
"tag_": "VBP",
"pos_": "VERB",
"dep_": "ROOT",
"lemma_": "want",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "my",
"idx": 9,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "credit",
"idx": 12,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "compound",
"lemma_": "credit",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "card",
"idx": 19,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "dobj",
"lemma_": "card",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "?",
"idx": 23,
"tag_": ".",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": "?",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "No",
"idx": 25,
"tag_": "DT",
"pos_": "DET",
"dep_": "det",
"lemma_": "no",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "problem",
"idx": 28,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "ROOT",
"lemma_": "problem",
"_": {
"is_in_vocabulary": false
}
},
{
"text": ":",
"idx": 35,
"tag_": ":",
"pos_": "PUNCT",
"dep_": "punct",
"lemma_": ":",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "4532368231815457",
"idx": 37,
"tag_": "CD",
"pos_": "NUM",
"dep_": "appos",
"lemma_": "4532368231815457",
"_": {
"is_in_vocabulary": false
}
}
],
"tags": [
"O",
"O",
"O",
"O",
"O",
"O",
"O",
"O",
"O",
"U-CREDIT_CARD"
],
"template_id": null,
"metadata": {
"Gender": "female",
"NameSet": "Czech",
"Country": "Austria",
"Lowercase": false,
"Template#": 7
}
},
{
"full_text": "My first name is Rogelio and my last is Patrick",
"masked": null,
"spans": [
{
"entity_type": "PERSON",
"entity_value": "Rogelio",
"start_position": 17,
"end_position": 24
},
{
"entity_type": "PERSON",
"entity_value": "Patrick",
"start_position": 40,
"end_position": 47
}
],
"tokens": [
{
"text": "My",
"idx": 0,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "first",
"idx": 3,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "amod",
"lemma_": "first",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "name",
"idx": 9,
"tag_": "NN",
"pos_": "NOUN",
"dep_": "nsubj",
"lemma_": "name",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "is",
"idx": 14,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "ROOT",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Rogelio",
"idx": 17,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Rogelio",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "and",
"idx": 25,
"tag_": "CC",
"pos_": "CCONJ",
"dep_": "cc",
"lemma_": "and",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "my",
"idx": 29,
"tag_": "PRP$",
"pos_": "DET",
"dep_": "poss",
"lemma_": "-PRON-",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "last",
"idx": 32,
"tag_": "JJ",
"pos_": "ADJ",
"dep_": "nsubj",
"lemma_": "last",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "is",
"idx": 37,
"tag_": "VBZ",
"pos_": "AUX",
"dep_": "conj",
"lemma_": "be",
"_": {
"is_in_vocabulary": false
}
},
{
"text": "Patrick",
"idx": 40,
"tag_": "NNP",
"pos_": "PROPN",
"dep_": "attr",
"lemma_": "Patrick",
"_": {
"is_in_vocabulary": false
}
}
],
"tags": [
"O",
"O",
"O",
"O",
"U-PERSON",
"O",
"O",
"O",
"O",
"U-PERSON"
],
"template_id": null,
"metadata": {
"Gender": "male",
"NameSet": "American",
"Country": "California",
"Lowercase": false,
"Template#": 2
}
}
]
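The records above are serialized InputSample objects: full_text, entity spans, spaCy-style token attributes, and BILOU tags. A minimal sketch of loading such a file back in, assuming the read_synth_dataset helper used throughout the tests below:

import os

from presidio_evaluator.data_generator import read_synth_dataset

dir_path = os.path.dirname(os.path.realpath(__file__))
samples = read_synth_dataset(os.path.join(dir_path, "generated_test.txt"))

for sample in samples:
    # Tokens and tags are aligned one-to-one, as asserted in tests/test_generator.py.
    assert len(sample.tokens) == len(sample.tags)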

3
tests/mocks/__init__.py Normal file
View file

@ -0,0 +1,3 @@
from .model_mock import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, \
    MockTokensModel

50
tests/mocks/model_mock.py Normal file
View file

@ -0,0 +1,50 @@
from typing import List

from presidio_evaluator import InputSample, ModelEvaluator


class MockTokensModel(ModelEvaluator):
    """
    Simulates a real model, returns the prediction given in the constructor
    """

    def __init__(self, prediction: List[str], entities_to_keep: List = None,
                 verbose: bool = False, **kwargs):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose,
                         **kwargs)
        self.prediction = prediction

    def predict(self, sample: InputSample) -> List[str]:
        return self.prediction


class IdentityTokensMockModel(ModelEvaluator):
    """
    Simulates a real model, always returns the label as the prediction
    """

    def __init__(self, entities_to_keep: List = None,
                 verbose: bool = False):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)

    def predict(self, sample: InputSample) -> List[str]:
        return sample.tags


class FiftyFiftyIdentityTokensMockModel(ModelEvaluator):
    """
    Simulates a real model, returns either the label or no predictions
    (a list of 'O') alternately
    """

    def __init__(self, entities_to_keep: List = None,
                 verbose: bool = False):
        super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
        self.counter = 0

    def predict(self, sample: InputSample) -> List[str]:
        self.counter += 1
        if self.counter % 2 == 0:
            return sample.tags
        else:
            return ["O" for i in range(len(sample.tags))]

View file

@ -0,0 +1,22 @@
import numpy as np

from presidio_evaluator.crf_evaluator import CRFEvaluator
from presidio_evaluator.data_generator import read_synth_dataset


# no_test since the CRF model is not supplied with the package
def no_test_test_crf_simple():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
    model_path = os.path.abspath(os.path.join(dir_path, "..", "model-outputs/crf.pickle"))

    crf_evaluator = CRFEvaluator(model_pickle_path=model_path, entities_to_keep=['PERSON'])
    evaluation_results = crf_evaluator.evaluate_all(input_samples)
    scores = crf_evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
    assert scores.pii_recall > 0
    assert scores.pii_precision > 0

View file

@ -0,0 +1,48 @@
from presidio_evaluator import InputSample
from presidio_evaluator.data_generator import read_synth_dataset


def test_to_conll():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
    conll = InputSample.create_conll_dataset(input_samples)
    sentences = conll['sentence'].unique()
    assert len(sentences) == len(input_samples)


def test_to_spacy_all_entities():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
    spacy_ver = InputSample.create_spacy_dataset(input_samples)
    assert len(spacy_ver) == len(input_samples)


def test_to_spacy_all_entities_specific_entities():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
    spacy_ver = InputSample.create_spacy_dataset(input_samples, entities=['PERSON'])
    spacy_ver_with_labels = [sample for sample in spacy_ver if len(sample[1]['entities'])]
    assert len(spacy_ver_with_labels) < len(input_samples)
    assert len(spacy_ver_with_labels) > 0


def test_to_spacy_json():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
    spacy_ver = InputSample.create_spacy_json(input_samples)
    assert len(spacy_ver) == len(input_samples)
    assert 'id' in spacy_ver[0]
    assert 'paragraphs' in spacy_ver[0]

View file

@ -0,0 +1,26 @@
try:
    from flair.models import SequenceTagger
except ImportError:
    print("Flair is not installed by default")

from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.flair_evaluator import FlairEvaluator

import numpy as np


# no-unit because flair is not a dependency by default
def no_unit_test_flair_simple():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    model = SequenceTagger.load('ner-ontonotes-fast')  # .load('ner')
    flair_evaluator = FlairEvaluator(model=model, entities_to_keep=['PERSON'])
    evaluation_results = flair_evaluator.evaluate_all(input_samples)
    scores = flair_evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
    assert scores.pii_recall > 0
    assert scores.pii_precision > 0

121
tests/test_generator.py Normal file
View file

@ -0,0 +1,121 @@
from presidio_evaluator.data_generator import generate, read_synth_dataset, FakeDataGenerator


def get_fake_generator(template, fake_pii_df):
    class MockFakeGenerator(FakeDataGenerator):
        """
        Mock class that doesn't add to the fake PII DF so you could inject entities yourself.
        """

        def __init__(self, **kwargs):
            super().__init__(**kwargs)

        def prep_fake_pii(self, df):
            return df

    return MockFakeGenerator(templates=[template],
                             fake_pii_df=fake_pii_df,
                             include_metadata=False,
                             span_to_tag=False,
                             dictionary_path=None,
                             lower_case_ratio=0)


def test_generator_correct_output():
    OUTPUT = "generated_test.txt"
    EXAMPLES = 3
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    fake_pii_csv = "{}/data/FakeNameGenerator.com_100.csv".format(dir_path)
    utterances_file = "{}/data/templates.txt".format(dir_path)
    dictionary = "{}/data/Dictionary_test.csv".format(dir_path)

    generate(fake_pii_csv=fake_pii_csv,
             utterances_file=utterances_file,
             dictionary_path=dictionary,
             output_file=OUTPUT,
             lower_case_ratio=0.3,
             num_of_examples=EXAMPLES)

    input_samples = read_synth_dataset(OUTPUT)
    for sample in input_samples:
        assert len(sample.tags) == len(sample.tokens)


def test_a_turned_to_an():
    fake_pii_df = get_mock_fake_df(GENDER="Ale")
    template = "I am a [GENDER] living in [COUNTRY]"
    bracket_location = template.find("[")
    fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
                                        template=template)
    examples = [x for x in fake_generator.sample_examples(1)]
    assert " an " in examples[0].full_text
    # entity location updated
    assert examples[0].spans[0].start_position == bracket_location + 1
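    # Why "+ 1": the template reads "I am a [GENDER] ..." and the injected value "Ale"
    # starts with a vowel, so the generator rewrites "a" into "an". That adds one
    # character before the entity, shifting its start position one place to the right.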

def test_a_not_turning_into_an():
    fake_pii_df = get_mock_fake_df(GENDER="Male")
    template = "I am a [GENDER] living in [COUNTRY]"
    previous_bracket = template.find("[")
    fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
                                        template=template)
    examples = [x for x in fake_generator.sample_examples(1)]
    assert " an " not in examples[0].full_text
    assert examples[0].spans[0].start_position == previous_bracket


def test_A_turning_into_An():
    fake_pii_df = get_mock_fake_df(GENDER="ale")
    template = "A [GENDER] living in [COUNTRY]"
    previous_bracket = template.find("[")
    fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
                                        template=template)
    examples = [x for x in fake_generator.sample_examples(1)]
    assert "An " in examples[0].full_text
    assert examples[0].spans[0].start_position == previous_bracket + 1


def get_mock_fake_df(**kwargs):
    dict = {
        "Number": 1,
        "Gender": "Male",
        "NameSet": "English",
        "Title": "Mr.",
        "GivenName": "Dondo",
        "MiddleInitial": "N",
        "Surname": "Mondo",
        "StreetAddress": "Where I live 15",
        "City": "Amsterdam",
        "State": "",
        "StateFull": "",
        "ZipCode": "12345",
        "Country": "Netherlands",
        "CountryFull": "Netherlands",
        "EmailAddress": "dondo@mondo.net",
        "Username": "Dondo12",
        "Password": "123456",
        "TelephoneNumber": "+1412391",
        "TelephoneCountryCode": "14",
        "MothersMaiden": "",
        "Birthday": "15 Aug 1966",
        "Age": "200",
        "CCType": "astercard",
        "CCNumber": "12371832821",
        "CVV2": "123",
        "CCExpires": "19-19",
        "NationalID": "14124",
        "Occupation": "Hunter",
        "Company": "Lolo and sons",
        "Domain": "lolo.com"}
    dict.update(kwargs)
    import pandas as pd
    fake_pii_df = pd.DataFrame(dict, index=[0])
    return fake_pii_df

View file

@ -0,0 +1,271 @@
import numpy as np
import pytest

from presidio_evaluator import InputSample, EvaluationResult
from presidio_evaluator.data_generator import read_synth_dataset
from tests.mocks import IdentityTokensMockModel, \
    FiftyFiftyIdentityTokensMockModel, MockTokensModel


def test_evaluator_simple():
    prediction = ["O", "O", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
    sample = InputSample(full_text="I am the walrus",
                         masked="I am the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    final_evaluation = model.calculate_score([evaluated])

    assert final_evaluation.pii_precision == 1
    assert final_evaluation.pii_recall == 1


def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
    prediction = ["O", "O", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction,
                            entities_to_keep=['SPACESHIP'])
    sample = InputSample(full_text="I am the walrus",
                         masked="I am the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    assert evaluated.results[("O", "O")] == 4


def test_evaluate_same_entity_correct_statistics():
    prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
    sample = InputSample(full_text="I dog the walrus",
                         masked="I [ANIMAL] the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluation_result = model.evaluate_sample(sample)
    assert evaluation_result.results[("O", "O")] == 2
    assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
    assert evaluation_result.results[("O", "ANIMAL")] == 1


def test_evaluate_multiple_entities_to_keep_correct_statistics():
    prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
    model = MockTokensModel(prediction=prediction, labeling_scheme='BIO',
                            entities_to_keep=['ANIMAL', 'PLANT', 'SPACESHIP'])
    sample = InputSample(full_text="I dog the walrus",
                         masked="I [ANIMAL] the [ANIMAL]",
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus"]
    sample.tags = ["O", "O", "O", "U-ANIMAL"]

    evaluation_result = model.evaluate_sample(sample)
    assert evaluation_result.results[("O", "O")] == 2
    assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
    assert evaluation_result.results[("O", "ANIMAL")] == 1


def test_evaluate_multiple_tokens_correct_statistics():
    prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
    sample = InputSample("I am the walrus americanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the",
                     "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O",
                   "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score([evaluated])

    assert evaluation.pii_precision == 1
    assert evaluation.pii_recall == 1


def test_evaluate_multiple_tokens_partial_match_correct_statistics():
    prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
    sample = InputSample("I am the walrus americanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score([evaluated])

    assert evaluation.pii_precision == 1
    assert evaluation.pii_recall == 4 / 6


def test_evaluate_multiple_tokens_no_match_correct_statistics():
    prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
    model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
    sample = InputSample("I am the walrus americanus magnifico", masked=None,
                         spans=None)
    sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
    sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]

    evaluated = model.evaluate_sample(sample)
    evaluation = model.calculate_score([evaluated])

    assert np.isnan(evaluation.pii_precision)
    assert evaluation.pii_recall == 0


def test_evaluate_multiple_examples_correct_statistics():
    prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
    model = MockTokensModel(prediction=prediction,
                            labeling_scheme='BILOU',
                            entities_to_keep=['PERSON'])
    input_sample = InputSample("My name is Raphael or David", masked=None,
                               spans=None)
    input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
    input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]

    evaluated = model.evaluate_all(
        [input_sample, input_sample, input_sample, input_sample])
    scores = model.calculate_score(evaluated)

    assert scores.pii_precision == 0.5
    assert scores.pii_recall == 0.5


def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
    prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
    model = MockTokensModel(prediction=prediction,
                            labeling_scheme='BILOU',
                            entities_to_keep=['PERSON', 'TENNIS_PLAYER'])
    input_sample = InputSample("My name is Raphael or David", masked=None,
                               spans=None)
    input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
    input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]

    evaluated = model.evaluate_all(
        [input_sample, input_sample, input_sample, input_sample])
    scores = model.calculate_score(evaluated)

    assert scores.pii_precision == 1
    assert scores.pii_recall == 1


def test_confusion_matrix_correct_metrics():
    from collections import Counter
    evaluated = [EvaluationResult(results=Counter({
        ('O', 'O'): 150,
        ('O', 'PERSON'): 30,
        ('O', 'COMPANY'): 30,
        ('PERSON', 'PERSON'): 40,
        ('COMPANY', 'COMPANY'): 40,
        ('PERSON', 'COMPANY'): 10,
        ('COMPANY', 'PERSON'): 10,
        ('PERSON', 'O'): 30,
        ('COMPANY', 'O'): 30}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None,
                            entities_to_keep=['PERSON', 'COMPANY'])
    scores = model.calculate_score(evaluated, beta=2.5)

    assert scores.pii_precision == 0.625
    assert scores.pii_recall == 0.625
    assert scores.entity_recall_dict['PERSON'] == 0.5
    assert scores.entity_precision_dict['PERSON'] == 0.5
    assert scores.entity_recall_dict['COMPANY'] == 0.5
    assert scores.entity_precision_dict['COMPANY'] == 0.5
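    # Arithmetic behind the expected values (commentary, not part of the original assertions):
    # PII-level TP = 40 + 40 + 10 + 10 = 100, FP = 30 + 30 = 60, FN = 30 + 30 = 60,
    # so precision = recall = 100 / 160 = 0.625. Per entity, e.g. PERSON: 40 correct
    # out of 80 annotated and 80 predicted PERSON tokens, giving 0.5 for both metrics.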

def test_confusion_matrix_2_correct_metrics():
    from collections import Counter
    evaluated = [EvaluationResult(results=Counter(
        {('O', 'O'): 65467,
         ('O', 'ORG'): 4189,
         ('GPE', 'O'): 3370,
         ('PERSON', 'PERSON'): 2024,
         ('GPE', 'PERSON'): 1488,
         ('GPE', 'GPE'): 1033,
         ('O', 'GPE'): 964,
         ('ORG', 'ORG'): 914,
         ('O', 'PERSON'): 834,
         ('GPE', 'ORG'): 401,
         ('PERSON', 'ORG'): 35,
         ('PERSON', 'O'): 33,
         ('ORG', 'O'): 8,
         ('PERSON', 'GPE'): 5,
         ('ORG', 'PERSON'): 1}), model_errors=None, text=None)]

    model = MockTokensModel(prediction=None)
    scores = model.calculate_score(evaluated, beta=2.5)

    pii_tp = evaluated[0].results[('PERSON', 'PERSON')] + \
             evaluated[0].results[('ORG', 'ORG')] + \
             evaluated[0].results[('GPE', 'GPE')] + \
             evaluated[0].results[('ORG', 'GPE')] + \
             evaluated[0].results[('ORG', 'PERSON')] + \
             evaluated[0].results[('GPE', 'ORG')] + \
             evaluated[0].results[('GPE', 'PERSON')] + \
             evaluated[0].results[('PERSON', 'GPE')] + \
             evaluated[0].results[('PERSON', 'ORG')]

    pii_fp = evaluated[0].results[('O', 'PERSON')] + \
             evaluated[0].results[('O', 'GPE')] + \
             evaluated[0].results[('O', 'ORG')]

    pii_fn = evaluated[0].results[('PERSON', 'O')] + \
             evaluated[0].results[('GPE', 'O')] + \
             evaluated[0].results[('ORG', 'O')]

    assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
    assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)


def test_dataset_to_metric_identity_model():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        "{}/data/generated_small.txt".format(dir_path), length=10)

    model = IdentityTokensMockModel()
    evaluation_results = model.evaluate_all(input_samples)
    metrics = model.calculate_score(evaluation_results)

    assert metrics.pii_precision == 1
    assert metrics.pii_recall == 1


def test_dataset_to_metric_50_50_model():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(
        "{}/data/generated_small.txt".format(dir_path), length=100)

    # Replace 50% of the predictions with a list of "O"
    model = FiftyFiftyIdentityTokensMockModel(entities_to_keep='PERSON')
    evaluation_results = model.evaluate_all(input_samples)
    metrics = model.calculate_score(evaluation_results)

    print(metrics.pii_precision)
    print(metrics.pii_recall)
    print(metrics.pii_f)

    assert metrics.pii_precision == 1
    assert metrics.pii_recall < 0.75
    assert metrics.pii_recall > 0.25
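    # Commentary: the identity half of the predictions never produces a false positive,
    # which is why precision stays at 1, while the all-"O" half contributes zero recall,
    # pulling the averaged recall towards ~0.5 (hence the 0.25-0.75 acceptance band).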

View file

@ -0,0 +1,80 @@
'''
Presidio Analyzer not yet on PyPI, ignoring temporarily
'''
#
# import pytest
#
# from presidio_evaluator import InputSample, Span
# from presidio_evaluator.data_generator import read_synth_dataset
# from presidio_evaluator.presidio_analyzer import PresidioAnalyzer
#
#
# class GeneratedTextTestCase:
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
# self.test_name = test_name
# self.test_input = test_input
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
#
# def to_pytest_param(self):
# return pytest.param(self.test_input, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
#
#
# # generated-text test cases
# analyzer_test_generate_text_testdata = [
# # small set fixture which expects all results.
# GeneratedTextTestCase(
# test_name="small-set",
# test_input="{}/data/generated_small.txt",
# acceptance_threshold=0.3,
# marks=pytest.mark.none
# )
# ]
#
#
# @pytest.mark.skip(reason="Presidio analyzer not on PyPi")
# def test_analyzer_simple_input():
# model = PresidioAnalyzer(entities_to_keep=['PERSON'])
#
# sample = InputSample(full_text="My name is Mike",
# masked="My name is [PERSON]",
# spans=[Span('PERSON', 'Mike', 10, 14)],
# create_tags_from_span=True)
#
# evaluated = model.evaluate_sample(sample)
# metrics = model.calculate_score(
# [evaluated])
#
# assert metrics.pii_precision == 1
# assert metrics.pii_recall == 1
#
#
# # analyzer tests on generated data
# @pytest.mark.skip(reason="Presidio analyzer not on PyPi")
# @pytest.mark.parametrize("test_input,acceptance_threshold",
# [testcase.to_pytest_param() for testcase in
# analyzer_test_generate_text_testdata])
# def test_analyzer_with_generated_text(test_input, acceptance_threshold):
# """
# Test analyzer with a generated dataset text file
# :param test_input: input text file location
# :param acceptance_threshold: minimim precision/recall
# allowed for tests to pass
# """
# # read test input from generated file
#
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# input_samples = read_synth_dataset(
# test_input.format(dir_path))
#
# updated_samples = PresidioAnalyzer. \
# align_input_samples_to_presidio_analyzer(input_samples)
#
# analyzer = PresidioAnalyzer()
# evaluated_samples = analyzer.evaluate_all(updated_samples)
# scores = analyzer.calculate_score(evaluation_results=evaluated_samples)
#
# assert acceptance_threshold <= scores.pii_precision
# assert acceptance_threshold <= scores.pii_recall

View file

@ -0,0 +1,62 @@
'''
Presidio Analyzer not yet on PyPI, ignoring temporarily
'''
# from presidio_evaluator.data_generator import read_synth_dataset
# from presidio_evaluator.presidio_recognizer_evaluator import score_presidio_recognizer
# import pytest
#
# from analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
#
# # test case parameters for tests with dataset which was previously generated.
# class GeneratedTextTestCase:
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
# self.test_name = test_name
# self.test_input = test_input
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
#
# def to_pytest_param(self):
# return pytest.param(self.test_input, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
#
#
# # generated-text test cases
# cc_test_generate_text_testdata = [
# # small set fixture which expects all type results.
# GeneratedTextTestCase(
# test_name="small-set",
# test_input="{}/data/generated_small.txt",
# acceptance_threshold=1,
# marks=pytest.mark.none
# ),
# # large set fixture which expects all type results. marked as "slow"
# GeneratedTextTestCase(
# test_name="large_set",
# test_input="{}/data/generated_large.txt",
# acceptance_threshold=1,
# marks=pytest.mark.slow
# )
# ]
#
#
# # credit card recognizer tests on generated data
# @pytest.mark.parametrize("test_input,acceptance_threshold",
# [testcase.to_pytest_param()
# for testcase in cc_test_generate_text_testdata])
# def test_credit_card_recognizer_with_generated_text(test_input, acceptance_threshold):
# """
# Test credit card recognizer with a generated dataset text file
# :param test_input: input text file location
# :param acceptance_threshold: minimim precision/recall
# allowed for tests to pass
# """
#
# # read test input from generated file
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# input_samples = read_synth_dataset(
# test_input.format(dir_path))
# scores = score_presidio_recognizer(
# CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
# assert acceptance_threshold <= scores.pii_f

View file

@ -0,0 +1,83 @@
'''
Presidio Analyzer not yet on PyPI, ignoring temporarily
'''
# from presidio_evaluator.data_generator import generate
# from presidio_evaluator.presidio_recognizer_evaluator import \
# score_presidio_recognizer
# import pytest
# import numpy as np
#
# from analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
#
# # test case parameters for tests with dataset generated from a template and csv values
# class TemplateTextTestCase:
# def __init__(self, test_name, pii_csv, utterances, dictionary_path,
# num_of_examples, acceptance_threshold, marks):
# self.test_name = test_name
# self.pii_csv = pii_csv
# self.utterances = utterances
# self.dictionary_path = dictionary_path
# self.num_of_examples = num_of_examples
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
#
# def to_pytest_param(self):
# return pytest.param(self.pii_csv, self.utterances, self.dictionary_path,
# self.num_of_examples, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
#
#
# # template-dataset test cases
# cc_test_template_testdata = [
# # large dataset fixture. marked as slow
# TemplateTextTestCase(
# test_name="fake-names-100",
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# utterances="{}/data/templates.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=0.9,
# marks=pytest.mark.slow
# )
# ]
#
#
# # credit card recognizer tests on template-generates data
# @pytest.mark.parametrize("pii_csv, "
# "utterances, "
# "dictionary_path, "
# "num_of_examples, "
# "acceptance_threshold",
# [testcase.to_pytest_param()
# for testcase in cc_test_template_testdata])
# def test_credit_card_recognizer_with_template(pii_csv, utterances,
# dictionary_path,
# num_of_examples,
# acceptance_threshold):
# """
# Test credit card recognizer with a dataset generated from
# template and a CSV values file
# :param pii_csv: input csv file location
# :param utterances: template file location
# :param dictionary_path: dictionary/vocabulary file location
# :param num_of_examples: number of samples to be used from dataset
# to test
# :param acceptance_threshold: minimim precision/recall
# allowed for tests to pass
# """
#
# # read template and CSV files
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
#
# input_samples = generate(fake_pii_csv=pii_csv.format(dir_path),
# utterances_file=utterances.format(dir_path),
# dictionary_path=dictionary_path.format(dir_path),
# lower_case_ratio=0.5,
# num_of_examples=num_of_examples)
#
# scores = score_presidio_recognizer(
# CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
# if not np.isnan(scores.pii_f):
# assert acceptance_threshold <= scores.pii_f

View file

@ -0,0 +1,148 @@
'''
Presidio Analyzer not yet on PyPI, ignoring temporarily
'''
# from presidio_evaluator.data_generator import FakeDataGenerator
# from presidio_evaluator.presidio_recognizer_evaluator import \
# score_presidio_recognizer
# import pandas as pd
# import pytest
# import numpy as np
#
# from analyzer import Pattern, PatternRecognizer
#
# # test case parameters for tests with dataset generated from a template and
# # two csv value files, one containing the common-entities and another one with custom entities
# class PatternRecognizerTestCase:
# def __init__(self, test_name, entity_name, pattern, score, pii_csv, ext_csv,
# utterances, dictionary_path, num_of_examples, acceptance_threshold,
# max_mistakes_number, marks):
# self.test_name = test_name
# self.entity_name = entity_name
# self.pattern = pattern
# self.score = score
# self.pii_csv = pii_csv
# self.ext_csv = ext_csv
# self.utterances = utterances
# self.dictionary_path = dictionary_path
# self.num_of_examples = num_of_examples
# self.acceptance_threshold = acceptance_threshold
# self.max_mistakes_number = max_mistakes_number
# self.marks = marks
#
# def to_pytest_param(self):
# return pytest.param(self.pii_csv, self.ext_csv, self.utterances,
# self.dictionary_path,
# self.entity_name, self.pattern, self.score,
# self.num_of_examples, self.acceptance_threshold,
# self.max_mistakes_number, id=self.test_name,
# marks=self.marks)
#
#
# # template-dataset test cases
# rocket_test_template_testdata = [
# # large dataset fixture. marked as slow.
# # all input is correct, test is conclusive
# PatternRecognizerTestCase(
# test_name="rocket-no-errors",
# entity_name="ROCKET",
# pattern=r'\W*(rocket)\W*',
# score=0.8,
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# ext_csv="{}/data/FakeRocketGenerator.csv",
# utterances="{}/data/rocket_example_sentences.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=1,
# max_mistakes_number=0,
# marks=pytest.mark.slow
# ),
# # large dataset fixture. marked as slow
# # all input is correct, test is conclusive
# PatternRecognizerTestCase(
# test_name="rocket-all-errors",
# entity_name="ROCKET",
# pattern=r'\W*(rocket)\W*',
# score=0.8,
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# ext_csv="{}/data/FakeRocketErrorsGenerator.csv",
# utterances="{}/data/rocket_example_sentences.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=0,
# max_mistakes_number=100,
# marks=pytest.mark.slow
# ),
# # large dataset fixture. marked as slow
# # some input is correct some is not, test is inconclusive
# PatternRecognizerTestCase(
# test_name="rocket-some-errors",
# entity_name="ROCKET",
# pattern=r'\W*(rocket)\W*',
# score=0.8,
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
# ext_csv="{}/data/FakeRocket50PercentErrorsGenerator.csv",
# utterances="{}/data/rocket_example_sentences.txt",
# dictionary_path="{}/data/Dictionary_test.csv",
# num_of_examples=100,
# acceptance_threshold=0.3,
# max_mistakes_number=70,
# marks=[pytest.mark.slow, pytest.mark.inconclusive]
# )
# ]
#
#
# @pytest.mark.parametrize(
# "pii_csv, ext_csv, utterances, dictionary_path, "
# "entity_name, pattern, score, num_of_examples, "
# "acceptance_threshold, max_mistakes_number",
# [testcase.to_pytest_param()
# for testcase in rocket_test_template_testdata])
# def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
# entity_name, pattern,
# score, num_of_examples, acceptance_threshold,
# max_mistakes_number):
# """
# Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
# and another CSV values file with a custom entity
# :param pii_csv: input csv file location with the common entities
# :param ext_csv: input csv file location with custom entities
# :param utterances: template file location
# :param dictionary_path: vocabulary/dictionary file location
# :param entity_name: custom entity name
# :param pattern: recognizer pattern
# :param num_of_examples: number of samples to be used from dataset to test
# :param acceptance_threshold: minimim precision/recall
# allowed for tests to pass
# """
#
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# dfpii = pd.read_csv(pii_csv.format(dir_path), encoding='utf-8')
# dfext = pd.read_csv(ext_csv.format(dir_path), encoding='utf-8')
# dictionary_path = dictionary_path.format(dir_path)
# ext_column_name = dfext.columns[0]
#
# def get_from_ext(i):
# index = i % dfext.shape[0]
# return dfext.iat[index, 0]
#
# # extend pii with ext data
# dfpii[ext_column_name] = [get_from_ext(i) for i in range(0, dfpii.shape[0])]
#
# # generate examples
# generator = FakeDataGenerator(fake_pii_csv_file=dfpii,
# utterances_file=utterances.format(dir_path),
# dictionary_path=dictionary_path)
# examples = generator.sample_examples(num_of_examples)
#
# pattern = Pattern("test pattern", pattern, score)
# pattern_recognizer = PatternRecognizer(entity_name,
# name="test recognizer",
# patterns=[pattern])
#
# scores = score_presidio_recognizer(
# pattern_recognizer, [entity_name], examples)
# if not np.isnan(scores.pii_f):
# assert acceptance_threshold <= scores.pii_f
# assert max_mistakes_number >= len(scores.model_errors)

View file

@ -0,0 +1,18 @@
from presidio_evaluator.data_generator import read_synth_dataset
from presidio_evaluator.spacy_evaluator import SpacyEvaluator

import numpy as np


def test_spacy_simple():
    import os
    dir_path = os.path.dirname(os.path.realpath(__file__))
    input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))

    spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
    evaluation_results = spacy_evaluator.evaluate_all(input_samples)
    scores = spacy_evaluator.calculate_score(evaluation_results)

    np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
    np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
    assert scores.pii_recall > 0
    assert scores.pii_precision > 0

View file

@ -0,0 +1,63 @@
'''
Presidio Analyzer not yet on PyPI, ignoring temporarily
'''
# from presidio_evaluator.data_generator import read_synth_dataset
# from presidio_evaluator.presidio_recognizer_evaluator import \
# score_presidio_recognizer
#
# import pytest
# from analyzer.predefined_recognizers.spacy_recognizer import SpacyRecognizer
#
# # test case parameters for tests with dataset which was previously generated.
# class GeneratedTextTestCase:
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
# self.test_name = test_name
# self.test_input = test_input
# self.acceptance_threshold = acceptance_threshold
# self.marks = marks
#
# def to_pytest_param(self):
# return pytest.param(self.test_input, self.acceptance_threshold,
# id=self.test_name, marks=self.marks)
#
#
# # generated-text test cases
# cc_test_generate_text_testdata = [
# # small dataset, inconclusive results
# GeneratedTextTestCase(
# test_name="small-set",
# test_input="{}/data/generated_small.txt",
# acceptance_threshold=0.5,
# marks=pytest.mark.inconclusive
# ),
# # large dataset - test is slow and inconclusive
# GeneratedTextTestCase(
# test_name="large-set",
# test_input="{}/data/generated_large.txt",
# acceptance_threshold=0.5,
# marks=pytest.mark.slow
# )
# ]
#
#
# # credit card recognizer tests on generated data
# @pytest.mark.parametrize("test_input,acceptance_threshold",
# [testcase.to_pytest_param() for testcase in
# cc_test_generate_text_testdata])
# def test_spacy_recognizer_with_generated_text(test_input, acceptance_threshold):
# """
# Test spacy recognizer with a generated dataset text file
# :param test_input: input text file location
# :param acceptance_threshold: minimim precision/recall
# allowed for tests to pass
# """
#
# # read test input from generated file
# import os
# dir_path = os.path.dirname(os.path.realpath(__file__))
# input_samples = read_synth_dataset(
# test_input.format(dir_path))
# scores = score_presidio_recognizer(
# SpacyRecognizer(), ['PERSON'], input_samples, True)
# assert acceptance_threshold <= scores.pii_f

212
tests/test_span_to_tag.py Normal file
View file

@ -0,0 +1,212 @@
from presidio_evaluator import span_to_tag
BILOU_SCHEME = "BILOU"
BIO_SCHEME = "BIO"
def test_span_to_bio_multiple_tokens():
    text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
    start = 14
    end = 38
    tag = "ADDRESS"
    bio = span_to_tag(BIO_SCHEME, text, [start], [end], [tag])
    print(bio)
    expected = ['O', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS',
                'I-ADDRESS', 'I-ADDRESS', 'I-ADDRESS', 'O', 'O', 'O', 'O', 'O']
    assert bio == expected


def test_span_to_bio_single_at_end():
    text = "My name is Josh"
    start = 11
    end = 15
    tag = "NAME"
    bilou = span_to_tag(BIO_SCHEME, text, [start], [end], [tag])
    print(bilou)
    expected = ['O', 'O', 'O', 'I-NAME']
    assert bilou == expected


def test_span_to_bilou_multiple_tokens():
    text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
    start = 14
    end = 38
    tag = "ADDRESS"
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], [tag])
    print(bilou)
    expected = ['O', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS',
                'I-ADDRESS', 'I-ADDRESS', 'L-ADDRESS', 'O', 'O', 'O', 'O', 'O']
    assert bilou == expected


def test_span_to_bilou_adjacent_entities():
    text = "Mr. Tree"
    start1 = 0
    end1 = 2
    start2 = 4
    end2 = 8
    start = [start1, start2]
    end = [end1, end2]
    tag = ["TITLE", "NAME"]
    bilou = span_to_tag(BILOU_SCHEME, text, start, end, tag)
    print(bilou)
    expected = ['U-TITLE', 'U-NAME']
    assert bilou == expected


def test_span_to_bilou_single_at_end():
    text = "My name is Josh"
    start = 11
    end = 15
    tag = "NAME"
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], [tag])
    print(bilou)
    expected = ['O', 'O', 'O', 'U-NAME']
    assert bilou == expected


def test_span_to_bilou_multiple_entities():
    text = "My name is Josh or David"
    start1 = 11
    end1 = 15
    start2 = 19
    end2 = 26
    start = [start1, start2]
    end = [end1, end2]
    tag = ["NAME", "NAME"]
    bilou = span_to_tag(BILOU_SCHEME, text, start, end, tag)
    print(bilou)
    expected = ['O', 'O', 'O', 'U-NAME', 'O', 'U-NAME']
    assert bilou == expected


def test_span_to_bio_multiple_entities():
    text = "My name is Josh or David"
    start1 = 11
    end1 = 15
    start2 = 19
    end2 = 26
    start = [start1, start2]
    end = [end1, end2]
    tag = ["NAME", "NAME"]
    bilou = span_to_tag(scheme=BIO_SCHEME, text=text, start=start,
                        end=end, tag=tag)
    print(bilou)
    expected = ['O', 'O', 'O', 'I-NAME', 'O', 'I-NAME']
    assert bilou == expected


def test_span_to_bio_specific_input():
    text = "Someone stole my credit card. The number is 5277716201469117 and " \
           "the my name is Mary Anguiano"
    start = 80
    end = 93
    expected = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
                'O', 'O', 'B-PERSON', 'I-PERSON']
    tag = ["PERSON"]
    bilou = span_to_tag(BIO_SCHEME, text, [start], [end], tag)
    assert bilou == expected


def test_span_to_bilou_specific_input():
    text = "Someone stole my credit card. The number is 5277716201469117 and " \
           "the my name is Mary Anguiano"
    start = 80
    end = 93
    expected = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
                'O', 'O', 'B-PERSON', 'L-PERSON']
    tag = ["PERSON"]
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], tag)
    assert bilou == expected


def test_span_to_bilou_adjacent_identical_entities():
    text = "May I get access to Jessica Gump's account?"
    start = 20
    end = 32
    expected = ['O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O']
    tag = ["PERSON"]
    bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], tag)
    assert bilou == expected
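

# The overlapping-entity tests below pass io_tags_only=True, so tokens are tagged
# with the bare entity name (no B/I/L/U prefix). Where spans overlap, the
# higher-scoring span determines a token's tag.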
def test_overlapping_entities_first_ends_in_mid_second():
    text = "My new phone number is 1 705 774 8720. Thanks, man"
    start = [22, 25]
    end = [37, 37]
    scores = [0.6, 0.6]
    tag = ["PHONE_NUMBER", "US_PHONE_NUMBER"]
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'US_PHONE_NUMBER', 'US_PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    assert io == expected


def test_overlapping_entities_second_embedded_in_first_with_lower_score():
    text = "My new phone number is 1 705 774 8720. Thanks, man"
    start = [22, 25]
    end = [37, 33]
    scores = [0.6, 0.5]
    tag = ["PHONE_NUMBER", "US_PHONE_NUMBER"]
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'PHONE_NUMBER',
                'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    assert io == expected


def test_overlapping_entities_second_embedded_in_first_has_higher_score():
    text = "My new phone number is 1 705 774 8720. Thanks, man"
    start = [23, 25]
    end = [37, 28]
    scores = [0.6, 0.7]
    tag = ["PHONE_NUMBER", "US_PHONE_NUMBER"]
    expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
                'PHONE_NUMBER', 'PHONE_NUMBER',
                'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    assert io == expected


def test_overlapping_entities_pyramid():
    text = "My new phone number is 1 705 999 774 8720. Thanks, cya"
    start = [23, 25, 29]
    end = [41, 36, 32]
    scores = [0.6, 0.7, 0.8]
    tag = ["A1", "B2", "C3"]
    expected = ['O', 'O', 'O', 'O', 'O', 'A1', 'B2', 'C3', 'B2',
                'A1', 'O', 'O', 'O', 'O']
    io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
                     io_tags_only=True)
    assert io == expected

98
tests/test_validation.py Normal file

@ -0,0 +1,98 @@
import pytest
from presidio_evaluator import InputSample
from presidio_evaluator.validation import split_by_template, get_samples_by_pattern, split_dataset
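

# get_mock_dataset builds eight samples from four templates: template 1 appears
# four times, templates 2 and 3 once each, and template 4 twice.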
def get_mock_dataset():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample5 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample6 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample7 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    sample8 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    return [sample1, sample2, sample3, sample4, sample5, sample6, sample7, sample8]
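

# split_by_template picks disjoint sets of template IDs, and get_samples_by_pattern
# collects the samples belonging to those templates, so no template appears in
# both the train and the test split.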
def test_split_by_template():
    dataset = get_mock_dataset()
    train_templates, test_templates = split_by_template(dataset, 0.5)
    assert len(train_templates) == 2
    assert len(test_templates) == 2


def test_get_samples_by_pattern():
    dataset = get_mock_dataset()
    train_templates, test_templates = split_by_template(dataset, 0.5)
    train_samples = get_samples_by_pattern(dataset, train_templates)
    test_samples = get_samples_by_pattern(dataset, test_templates)
    dataset_templates = set([sample.metadata['Template#'] for sample in dataset])
    train_samples_templates = set([sample.metadata['Template#'] for sample in train_samples])
    test_samples_templates = set([sample.metadata['Template#'] for sample in test_samples])
    assert len(train_samples) + len(test_samples) == len(dataset)
    assert dataset_templates == train_samples_templates | test_samples_templates
    assert train_samples_templates & test_samples_templates == set()
    assert train_samples_templates == set(train_templates)
    assert test_samples_templates == set(test_templates)
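

# split_dataset receives a list of ratios and returns one subset per ratio.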
def test_split_dataset_two_sets():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    train, test = split_dataset([sample1, sample2, sample3, sample4], [0.5, 0.5])
    assert len(train) == 2
    assert len(test) == 2


def test_split_dataset_four_sets():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    dataset = [sample1, sample2, sample3, sample4]
    train, test, val, dev = split_dataset(dataset, [0.25, 0.25, 0.25, 0.25])
    assert len(train) == 1
    assert len(test) == 1
    assert len(val) == 1
    assert len(dev) == 1
    # make sure all original template IDs are in the new sets
    original_keys = set([1, 2, 3, 4])
    t1 = set([sample.metadata['Template#'] for sample in train])
    t2 = set([sample.metadata['Template#'] for sample in test])
    t3 = set([sample.metadata['Template#'] for sample in dev])
    t4 = set([sample.metadata['Template#'] for sample in val])
    assert original_keys == t1 | t2 | t3 | t4
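

# A ratio of exactly 0 raises a ValueError, while an arbitrarily small positive
# ratio is accepted and yields an empty subset.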
def test_split_dataset_test_with_0_ratio():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    dataset = [sample1, sample2, sample3, sample4]
    with pytest.raises(ValueError):
        train, test, zero = split_dataset(dataset, [0.5, 0.5, 0])


def test_split_dataset_test_with_smallish_ratio():
    sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
    sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
    sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
    sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
    dataset = [sample1, sample2, sample3, sample4]
    train, test, zero = split_dataset(dataset, [0.5, 0.4999995, 0.0000005])
    assert len(train) == 2
    assert len(test) == 2
    assert len(zero) == 0