initial commit
Parent
d93c57420d
Commit
a25510b8bc
|
@ -20,8 +20,6 @@ parts/
|
|||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
pip-wheel-metadata/
|
||||
share/python-wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
|
@ -40,14 +38,12 @@ pip-delete-this-directory.txt
|
|||
# Unit test / coverage reports
|
||||
htmlcov/
|
||||
.tox/
|
||||
.nox/
|
||||
.coverage
|
||||
.coverage.*
|
||||
.cache
|
||||
nosetests.xml
|
||||
coverage.xml
|
||||
*.cover
|
||||
*.py,cover
|
||||
.hypothesis/
|
||||
.pytest_cache/
|
||||
|
||||
|
@ -59,7 +55,6 @@ coverage.xml
|
|||
*.log
|
||||
local_settings.py
|
||||
db.sqlite3
|
||||
db.sqlite3-journal
|
||||
|
||||
# Flask stuff:
|
||||
instance/
|
||||
|
@ -77,26 +72,11 @@ target/
|
|||
# Jupyter Notebook
|
||||
.ipynb_checkpoints
|
||||
|
||||
# IPython
|
||||
profile_default/
|
||||
ipython_config.py
|
||||
|
||||
# pyenv
|
||||
.python-version
|
||||
|
||||
# pipenv
|
||||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
||||
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
||||
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
||||
# install all needed dependencies.
|
||||
#Pipfile.lock
|
||||
|
||||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
|
||||
__pypackages__/
|
||||
|
||||
# Celery stuff
|
||||
# celery beat schedule file
|
||||
celerybeat-schedule
|
||||
celerybeat.pid
|
||||
|
||||
# SageMath parsed files
|
||||
*.sage.py
|
||||
|
@ -122,8 +102,87 @@ venv.bak/
|
|||
|
||||
# mypy
|
||||
.mypy_cache/
|
||||
.dmypy.json
|
||||
dmypy.json
|
||||
|
||||
# Pyre type checker
|
||||
.pyre/
|
||||
# Binaries for programs and plugins
|
||||
*.exe
|
||||
*.exe~
|
||||
*.dll
|
||||
*.so
|
||||
*.dylib
|
||||
|
||||
# Test binary, build with `go test -c`
|
||||
*.test
|
||||
|
||||
# Output of the go coverage tool, specifically when used with LiteIDE
|
||||
*.out
|
||||
|
||||
.vscode
|
||||
|
||||
debug.test
|
||||
debug
|
||||
|
||||
/bin/
|
||||
*/rootfs/*
|
||||
/tests/testdata/*-generated.*
|
||||
.DS_Store
|
||||
vendor/
|
||||
*.db
|
||||
|
||||
#pycharm
|
||||
.idea/*
|
||||
.idea
|
||||
#Ignore thumbnails created by Windows
|
||||
Thumbs.db
|
||||
#Ignore files built by Visual Studio
|
||||
*.obj
|
||||
*.pdb
|
||||
*.user
|
||||
*.aps
|
||||
*.pch
|
||||
*.vspscc
|
||||
*_i.c
|
||||
*_p.c
|
||||
*.ncb
|
||||
*.suo
|
||||
*.tlb
|
||||
*.tlh
|
||||
*.bak
|
||||
*.cache
|
||||
*.ilk
|
||||
[Bb]in
|
||||
[Dd]ebug*/
|
||||
*.lib
|
||||
*.sbr
|
||||
obj/
|
||||
[Rr]elease*/
|
||||
_ReSharper*/
|
||||
[Tt]est[Rr]esult*
|
||||
.vs/
|
||||
#Nuget packages folder
|
||||
packages/
|
||||
|
||||
|
||||
#R
|
||||
# History files
|
||||
.Rhistory
|
||||
.Rapp.history
|
||||
|
||||
# Session Data files
|
||||
.RData
|
||||
|
||||
# Example code in package build process
|
||||
*-Ex.R
|
||||
|
||||
# Output files from R CMD build
|
||||
/*.tar.gz
|
||||
|
||||
# Output files from R CMD check
|
||||
/*.Rcheck/
|
||||
|
||||
# RStudio files
|
||||
.Rproj.user/
|
||||
.Rproj.user
|
||||
|
||||
model-outputs/
|
||||
datasets/
|
||||
/model-outputs/
|
138
README.md
|
@ -1,14 +1,132 @@
|
|||
# Presidio-evaluator
|
||||
This package features data-science related tasks for developing new recognizers for Presidio.
|
||||
It is used for evaluating the entire system, as well as for evaluating specific PII recognizers or PII detection models.
|
||||
|
||||
# Contributing
|
||||
## Who should use it?
|
||||
Anyone interested in evaluating an existing Presidio instance or a specific PII recognizer, or in developing new models or logic for detecting PII, could leverage the preexisting work in this package.
|
||||
Additionally, anyone interested in generating new data based on previous datasets (e.g. to increase the coverage of entity values) for Named Entity Recognition models could leverage the data generator contained in this package.
|
||||
|
||||
This project welcomes contributions and suggestions. Most contributions require you to agree to a
|
||||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
|
||||
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
|
||||
|
||||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
|
||||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
|
||||
provided by the bot. You will only need to do this once across all repos using our CLA.
|
||||
## What's in this package?
|
||||
|
||||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
|
||||
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
|
||||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
|
||||
1. **Data generator** for PII recognizers and NER models
|
||||
2. **Data representation layer** for data generation, modeling and analysis
|
||||
3. Multiple **Model/Recognizer evaluation** files (e.g. for Spacy, Flair, CRF, Presidio API, Presidio Analyzer python package, specific Presidio recognizers)
|
||||
4. **Training and modeling code** for multiple models
|
||||
5. Helper functions for **results analysis**
|
||||
|
||||
## 1. Data generation
|
||||
See [Data Generator README](/presidio_evaluator/data_generator/README.md) for more details.
|
||||
|
||||
The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
|
||||
Then, it creates new synthetic sentences by sampling templates and PII values. Furthermore, it tokenizes the data, creates tags (either IO/IOB/BILOU) and spans for the newly created samples.
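
The sampling step described above can be sketched as follows. This is a minimal illustration, not the package's data generator: the function name `fill_template`, the `fake_values` dictionary and the `(entity_type, start, end)` span layout are assumptions made for this example.

```python
import random
import re

# Placeholders look like [FIRST_NAME]; the pattern is an assumption for this sketch.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, fake_values, rng):
    """Replace each placeholder with a sampled value.

    Returns the generated sentence and a list of (entity_type, start, end)
    character spans pointing at the inserted values.
    """
    parts = []
    spans = []
    cursor = 0  # position in the template
    length = 0  # length of the sentence built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)

        entity_type = match.group(1)
        value = rng.choice(fake_values[entity_type])
        spans.append((entity_type, length, length + len(value)))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


# Hypothetical fake PII values, standing in for the data frame mentioned above.
fake_values = {"FIRST_NAME": ["Dana"]}
sentence, spans = fill_template("My name is [FIRST_NAME]", fake_values, random.Random(0))
```

Tracking the output length while the sentence is built keeps the spans correct even when the sampled value is longer or shorter than the placeholder it replaces.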
|
||||
|
||||
- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
|
||||
|
||||
- For an example of running the generation process, see [this notebook](notebooks/Generate%20data.ipynb).
|
||||
|
||||
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
|
||||
Note that the generation process might not work off-the-shelf, as we are not sharing the fake PII datasets and templates used in this analysis, due to copyright and other restrictions.
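
The tagging step mentioned in this section (IO/IOB/BILOU) can be illustrated for the BILOU scheme. This is a hedged sketch that assumes whitespace tokenization and `(label, start, end)` character spans; the helper names `whitespace_tokens` and `bilou_tags` are hypothetical and are not part of this package.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace and keep character offsets: (token, start, end)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bilou_tags(tokens, spans):
    """Assign one BILOU tag per token from (label, start, end) character spans."""
    tags = ["O"] * len(tokens)
    for label, span_start, span_end in spans:
        # Indices of the tokens that lie fully inside the span.
        inside = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= span_start and end <= span_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label  # Unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label  # Beginning
            for i in inside[1:-1]:
                tags[i] = "I-" + label      # Inside
            tags[inside[-1]] = "L-" + label  # Last
    return tags


text = "My name is Dana Cohen"
tags = bilou_tags(whitespace_tokens(text), [("PERSON", 11, 21)])
```

The IOB scheme drops the `L-` and `U-` tags (using `I-` and `B-` instead), and IO keeps only `I-` and `O`.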
|
||||
|
||||
Once data is generated, it could be split into train/test/validation sets while ensuring that each template only exists in one set. See [this notebook for more details](notebooks/Split%20by%20pattern%20%23.ipynb).
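
One way to keep each template in exactly one set is to split the template ids rather than the sentences. The sketch below assumes samples are `(template_id, sentence)` pairs and uses a hypothetical `split_by_template` helper; the linked notebook may do this differently.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (template_id, sentence) pairs into train/test/validation sets.

    All sentences generated from one template end up in the same set.
    """
    by_template = defaultdict(list)
    for template_id, sentence in samples:
        by_template[template_id].append(sentence)

    # Shuffle the template ids, not the sentences.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    train_end = int(len(template_ids) * ratios[0])
    test_end = train_end + int(len(template_ids) * ratios[1])
    groups = {
        "train": template_ids[:train_end],
        "test": template_ids[train_end:test_end],
        "validation": template_ids[test_end:],
    }
    return {
        name: [(t, s) for t in ids for s in by_template[t]]
        for name, ids in groups.items()
    }


samples = [(t, "sentence {}-{}".format(t, i)) for t in range(10) for i in range(3)]
splits = split_by_template(samples)
```

Splitting at the template level means the test and validation sets only contain sentence patterns the model has not seen during training.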
|
||||
|
||||
## 2. Data representation
|
||||
|
||||
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [data_objects.py](presidio_evaluator/data_objects.py).
|
||||
|
||||
## 3. Recognizer evaluation
|
||||
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
|
||||
The main logic lies in the [ModelEvaluator](presidio_evaluator/model_evaluator.py) class. It provides a structured way of evaluating models and recognizers.
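
To make the two metrics concrete, here is a minimal token-level sketch of PII precision and recall, where any tag other than `O` counts as PII. It is an illustration under that assumption only; the function `pii_precision_recall` is hypothetical and does not reproduce the `ModelEvaluator` logic.

```python
def pii_precision_recall(gold_tags, predicted_tags, outside="O"):
    """Compute binary (PII vs. non-PII) precision and recall over tokens."""
    true_pos = false_pos = false_neg = 0
    for gold, predicted in zip(gold_tags, predicted_tags):
        gold_is_pii = gold != outside
        predicted_is_pii = predicted != outside
        if gold_is_pii and predicted_is_pii:
            true_pos += 1
        elif predicted_is_pii:
            false_pos += 1  # flagged a token that is not PII
        elif gold_is_pii:
            false_neg += 1  # missed a PII token

    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


gold = ["O", "PERSON", "PERSON", "O", "LOCATION"]
predicted = ["O", "PERSON", "O", "PERSON", "LOCATION"]
precision, recall = pii_precision_recall(gold, predicted)
```

Precision is the share of flagged tokens that really are PII, and recall is the share of PII tokens that were flagged.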
|
||||
|
||||
|
||||
### Ready evaluators
|
||||
Some evaluators were developed for analysis and reference. These include:
|
||||
|
||||
#### 1. Presidio API evaluator
|
||||
|
||||
Allows you to evaluate an existing Presidio deployment through the API. [See this notebook for details](notebooks/Evaluate%20Presidio-API.ipynb).
|
||||
|
||||
#### 2. Presidio analyzer evaluator
|
||||
Allows you to evaluate the local Presidio-Analyzer package. Faster than the API option but requires you to have Presidio-Analyzer installed locally. [See this class for more information](presidio_evaluator/presidio_analyzer.py)
|
||||
|
||||
#### 3. One recognizer evaluator
|
||||
Evaluate one specific recognizer for precision and recall. See [presidio_recognizer_evaluator.py](presidio_evaluator/presidio_recognizer_evaluator.py)
|
||||
|
||||
|
||||
## 4. Modeling
|
||||
|
||||
### Conditional Random Fields
|
||||
To train a CRF on a new dataset, see [this notebook](notebooks/models/CRF).
|
||||
To evaluate a CRF model, see [this notebook](notebooks/models/CRF.ipynb) or [this class](presidio_evaluator/crf_evaluator.py).
|
||||
|
||||
### spaCy based models
|
||||
There are three ways of interacting with spaCy models:
|
||||
1. Evaluate an existing trained model
|
||||
2. Train with pretrained embeddings
|
||||
3. Fine tune an existing spaCy model
|
||||
|
||||
Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
|
||||
See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
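
As a rough sketch of that adaptation, the training code in this commit iterates over `(text, annotations)` pairs where `annotations["entities"]` holds tuples whose third element is the label, which matches the spaCy v2 NER training format `(start, end, label)`. The helper `to_spacy_example` and the `(label, start, end)` input layout are assumptions made for this example.

```python
import pickle


def to_spacy_example(text, spans):
    """Convert (label, start, end) spans to a spaCy v2 style training example."""
    entities = [(start, end, label) for label, start, end in spans]
    return text, {"entities": entities}


examples = [to_spacy_example("My name is Dana", [("PERSON", 11, 15)])]

# The retrainer reads pickled lists of such examples; here the round trip
# is done in memory instead of writing a train.pickle file.
payload = pickle.dumps(examples)
restored = pickle.loads(payload)
```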
|
||||
|
||||
#### Evaluate an existing trained model
|
||||
To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).
|
||||
|
||||
#### Train with pretrained embeddings
|
||||
In order to train a new spaCy model from scratch with pretrained embeddings (FastText wiki news subword in this case), follow these three steps:
|
||||
|
||||
##### 1. Download FastText pretrained (sub) word embeddings
|
||||
``` sh
|
||||
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
|
||||
unzip wiki-news-300d-1M-subword.vec.zip
|
||||
```
|
||||
|
||||
##### 2. Init spaCy model with pre-trained embeddings
|
||||
Using spaCy CLI:
|
||||
``` sh
|
||||
python -m spacy init-model en spacy_fasttext --vectors-loc wiki-news-300d-1M-subword.vec
|
||||
```
|
||||
|
||||
##### 3. Train spaCy NER model
|
||||
Using spaCy CLI:
|
||||
``` sh
|
||||
python -m spacy train en spacy_fasttext_100 train.json test.json --vectors spacy_fasttext --pipeline ner -n 100
|
||||
```
|
||||
|
||||
#### Fine-tune an existing spaCy model
|
||||
First, create train and test pickle files for your train and test sets. See [this notebook](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb) for more information.

Then run a `SpacyRetrainer`, as implemented in [this code for retraining an existing spaCy model](models/spacy_retrain.py):
|
||||
|
||||
```python
|
||||
from models import SpacyRetrainer
|
||||
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
|
||||
experiment_name='new_spacy_experiment',
|
||||
n_iter=500, dropout=0.1, aml_config=None)
|
||||
spacy_retrainer.run()
|
||||
```
|
||||
|
||||
### Flair based models
|
||||
To train a new model, see the [FlairTrainer](presidio_evaluator/models/flair_train.py) object.
|
||||
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
|
||||
To train a Flair model, run:
|
||||
|
||||
```python
|
||||
from models import FlairTrainer
|
||||
train_samples = "../data/generated_train.json"
|
||||
test_samples = "../data/generated_test.json"
|
||||
val_samples = "../data/generated_validation.json"
|
||||
|
||||
trainer = FlairTrainer()
|
||||
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
|
||||
|
||||
corpus = trainer.read_corpus("")
|
||||
trainer.train(corpus)
|
||||
```
|
||||
|
||||
To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).
|
||||
|
||||
|
||||
|
||||
Copyright notice:
|
||||
|
||||
Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/)
|
||||
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
|
||||
|
|
|
@ -0,0 +1,2 @@
|
|||
0.0
|
||||
|
File diff suppressed because it is too large
|
@ -0,0 +1,2 @@
|
|||
from .spacy_retrain import SpacyRetrainer
|
||||
from .flair_train import FlairTrainer
|
|
@ -0,0 +1,123 @@
|
|||
from typing import List
|
||||
|
||||
from flair.data import Corpus, Sentence
|
||||
from flair.datasets import ColumnCorpus
|
||||
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BertEmbeddings
|
||||
from flair.models import SequenceTagger
|
||||
from flair.trainers import ModelTrainer
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from os import path
|
||||
|
||||
|
||||
class FlairTrainer:
|
||||
|
||||
@staticmethod
|
||||
    def to_flair_row(text, pos, label):
|
||||
return "{} {} {}".format(text, pos, label)
|
||||
|
||||
def to_flair(self, df, outfile="flair_train.txt"):
|
||||
sentence = 0
|
||||
flair = []
|
||||
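        for row in df.itertuples():
            # A change of sentence id marks a sentence boundary: write a blank
            # separator line, then still write the token that starts the new
            # sentence (an if/else here would silently drop that token).
            if row.sentence != sentence:
                sentence = row.sentence
                flair.append("")
            flair.append(self.to_flair_row(row.text, row.pos, row.label))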
|
||||
|
||||
if outfile:
|
||||
with open(outfile, "w", encoding="utf-8") as f:
|
||||
for item in flair:
|
||||
f.write("{}\n".format(item))
|
||||
|
||||
def create_flair_corpus(self, train_samples_path, test_samples_path, val_samples_path):
|
||||
if not path.exists("flair_train.txt"):
|
||||
train_samples = read_synth_dataset(train_samples_path)
|
||||
train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]
|
||||
print("Kept {} train samples after removal of non-tagged samples".format(len(train_tagged)))
|
||||
train_data = InputSample.create_conll_dataset(train_tagged)
|
||||
self.to_flair(train_data, outfile="flair_train.txt")
|
||||
|
||||
if not path.exists("flair_test.txt"):
|
||||
test_samples = read_synth_dataset(test_samples_path)
|
||||
test_data = InputSample.create_conll_dataset(test_samples)
|
||||
self.to_flair(test_data, outfile="flair_test.txt")
|
||||
|
||||
if not path.exists("flair_val.txt"):
|
||||
val_samples = read_synth_dataset(val_samples_path)
|
||||
val_data = InputSample.create_conll_dataset(val_samples)
|
||||
self.to_flair(val_data, outfile="flair_val.txt")
|
||||
|
||||
@staticmethod
|
||||
def read_corpus(data_folder) -> Corpus:
|
||||
columns = {0: 'text', 1: 'pos', 2: 'ner'}
|
||||
corpus: Corpus = ColumnCorpus(data_folder, columns,
|
||||
train_file='flair_train.txt',
|
||||
                                      test_file='flair_test.txt',
                                      dev_file='flair_val.txt')
|
||||
return corpus
|
||||
|
||||
@staticmethod
|
||||
def train(corpus):
|
||||
print(corpus)
|
||||
|
||||
# 2. what tag do we want to predict?
|
||||
tag_type = 'ner'
|
||||
|
||||
# 3. make the tag dictionary from the corpus
|
||||
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
|
||||
print(tag_dictionary.idx2item)
|
||||
|
||||
# 4. initialize embeddings
|
||||
embedding_types: List[TokenEmbeddings] = [
|
||||
WordEmbeddings('glove'),
|
||||
FlairEmbeddings('news-forward'),
|
||||
FlairEmbeddings('news-backward')
|
||||
]
|
||||
|
||||
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
|
||||
|
||||
# 5. initialize sequence tagger
|
||||
|
||||
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
|
||||
embeddings=embeddings,
|
||||
tag_dictionary=tag_dictionary,
|
||||
tag_type=tag_type,
|
||||
use_crf=True)
|
||||
|
||||
# 6. initialize trainer
|
||||
|
||||
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
|
||||
|
||||
checkpoint = 'resources/taggers/presidio-ner/checkpoint.pt'
|
||||
# trainer = ModelTrainer.load_checkpoint(checkpoint, corpus)
|
||||
trainer.train('resources/taggers/presidio-ner',
|
||||
learning_rate=0.1,
|
||||
mini_batch_size=32,
|
||||
max_epochs=150,
|
||||
checkpoint=True)
|
||||
|
||||
sentence = Sentence('I am from Jerusalem')
|
||||
# run NER over sentence
|
||||
tagger.predict(sentence)
|
||||
|
||||
print(sentence)
|
||||
print('The following NER tags are found:')
|
||||
|
||||
# iterate over entities and print
|
||||
for entity in sentence.get_spans('ner'):
|
||||
print(entity)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
train_samples = "../data/generated_train_November 12 2019.json"
|
||||
test_samples = "../data/generated_test_November 12 2019.json"
|
||||
val_samples = "../data/generated_validation_November 12 2019.json"
|
||||
|
||||
trainer = FlairTrainer()
|
||||
trainer.create_flair_corpus(train_samples, test_samples, val_samples)
|
||||
|
||||
corpus = trainer.read_corpus("")
|
||||
trainer.train(corpus)
|
|
@ -0,0 +1,206 @@
|
|||
import logging
|
||||
import pickle
|
||||
import random
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import spacy
|
||||
from azureml.core import Workspace, Experiment
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
from presidio_evaluator import SpacyEvaluator, InputSample
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
root = logging.getLogger()
|
||||
root.setLevel(logging.INFO)
|
||||
|
||||
handler = logging.StreamHandler(sys.stdout)
|
||||
handler.setLevel(logging.INFO)
|
||||
root.addHandler(handler)
|
||||
|
||||
|
||||
class SpacyRetrainer:
|
||||
|
||||
def __init__(self, original_model_name=None, experiment_name=None, n_iter=100, dropout=0.5,
|
||||
aml_config='config.json', output_dir='../../model-outputs', train_pickle='../data/train.pickle',
|
||||
test_pickle='../data/test.pickle'):
|
||||
self.experiment_name = experiment_name
|
||||
if aml_config:
|
||||
self.ws = Workspace.from_config(aml_config)
|
||||
self.experiment = Experiment(workspace=self.ws, name=experiment_name)
|
||||
self.aml_run = self.experiment.start_logging()
|
||||
self.has_aml = True
|
||||
else:
|
||||
self.has_aml = False
|
||||
|
||||
self.model = original_model_name
|
||||
self.n_iter = n_iter
|
||||
self.output_dir = output_dir
|
||||
self.train_file = train_pickle
|
||||
self.test_file = test_pickle
|
||||
self.dropout = dropout
|
||||
|
||||
def run(self):
|
||||
if self.has_aml:
|
||||
self.aml_run.log("model", self.model)
|
||||
self.aml_run.log("n_iter", self.n_iter)
|
||||
self.aml_run.log("train_file", self.train_file)
|
||||
self.aml_run.log("test_file", self.test_file)
|
||||
self.aml_run.log("dropout rate", self.dropout)
|
||||
model_path = self._train(self.model, self.output_dir, self.n_iter, self.train_file, self.experiment_name)
|
||||
self._score_validate(model_path, self.test_file)
|
||||
if self.has_aml:
|
||||
self.aml_run.complete()
|
||||
|
||||
def print_scores(self, split, evaluation_result):
|
||||
"""
|
||||
Logs results into experiment run.
|
||||
:param split: Name of this split. For ex 'train' or 'valid'
|
||||
:param evaluation_result: EvaluationResult containing various metrics
|
||||
:return: None. Writes to experiment runner and logs locally.
|
||||
"""
|
||||
        logging.info('SPLIT: {0}. PII_precision: {1}, PII_recall: {2}, '
|
||||
'Person_precision: {3}, Person_recall: {4}'. \
|
||||
format(split, evaluation_result.pii_precision, evaluation_result.pii_recall,
|
||||
evaluation_result.entity_precision_dict['PERSON'],
|
||||
evaluation_result.entity_recall_dict['PERSON']))
|
||||
if self.has_aml:
|
||||
self.aml_run.log('Precision', evaluation_result.pii_precision, split)
|
||||
self.aml_run.log('Recall', evaluation_result.pii_recall, split)
|
||||
|
||||
@staticmethod
|
||||
def _score(model, data):
|
||||
"""
|
||||
Score the model against the data
|
||||
:param model: Trained model
|
||||
:param data: Data split which is being scored.
|
||||
:return: An EvaluationResult containing various metrics
|
||||
"""
|
||||
|
||||
spacy_evaluator = SpacyEvaluator(model=model)
|
||||
|
||||
results = []
|
||||
for text, ground_truth_annotations in data:
|
||||
ground_truth_entities = ground_truth_annotations['entities']
|
||||
input_sample = InputSample.from_spacy(text, ground_truth_entities)
|
||||
results.append(spacy_evaluator.evaluate_sample(input_sample))
|
||||
|
||||
return spacy_evaluator.calculate_score(evaluation_results=results)
|
||||
|
||||
def _score_validate(self, model_path, test_data_file):
|
||||
"""
|
||||
Validation step for the model. Also prints the scores.
|
||||
:param model_path: Path to trained model.
|
||||
:param test_data_file: Data file which has the dataset for this split.
|
||||
:return: None. Prints the scores.
|
||||
"""
|
||||
with open(test_data_file, 'rb') as f:
|
||||
valid_data = pickle.load(f)
|
||||
nlp = spacy.load(model_path)
|
||||
self.print_scores('Valid', self._score(nlp, valid_data))
|
||||
|
||||
# @plac.annotations(
|
||||
# model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
# output_dir=("Optional output directory", "option", "o", Path),
|
||||
# n_iter=("Number of training iterations", "option", "n", int),
|
||||
# train_file=("File containing pickled training Spacy NER formatted data", "option", "d", Path),
|
||||
# test_file=("File containing pickled test Spacy NER formatted data", "option", "d", Path),
|
||||
# exp_name=("Name of this experiment", "option", "e")
|
||||
# )
|
||||
|
||||
def _train(self, model, output_dir, n_iter, train_file, exp_name):
|
||||
"""Load the model, set up the pipeline and train the entity recognizer."""
|
||||
nlp = self.load_or_create_empty_model(model)
|
||||
|
||||
if "ner" not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner, last=True)
|
||||
else:
|
||||
ner = nlp.get_pipe("ner")
|
||||
|
||||
with open(train_file, 'rb') as f:
|
||||
train_data = pickle.load(f)
|
||||
|
||||
        # DEBUG: uncomment to train on a small subset only
        # train_data = train_data[:50]
|
||||
|
||||
# add labels
|
||||
for _, annotations in train_data:
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
# reset and initialize the weights randomly – but only if we're
|
||||
# training a new model
|
||||
if model is None:
|
||||
nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(train_data)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, drop=self.dropout, losses=losses, )
|
||||
                logging.debug("Losses: %s", losses)
|
||||
if self.has_aml:
|
||||
self.aml_run.log('Losses', losses['ner'])
|
||||
self.print_scores('Itn {}'.format(itn), self._score(nlp, train_data))
|
||||
|
||||
self.print_scores('Train', self._score(nlp, train_data))
|
||||
|
||||
saved_model_path = self.save_model(exp_name, nlp, output_dir)
|
||||
return saved_model_path
|
||||
|
||||
@staticmethod
|
||||
def save_model(exp_name, model, output_dir):
|
||||
"""
|
||||
Saves model to disk for later use.
|
||||
:param exp_name: Name of the running experiment. This is used as folder name for storing the model.
|
||||
:param model: Model being saved
|
||||
:param output_dir: Directory where to save the model.
|
||||
:return: Full path to saved model.
|
||||
"""
|
||||
saved_model_path = Path(output_dir, exp_name)
|
||||
if not saved_model_path.exists():
|
||||
saved_model_path.mkdir(parents=True)
|
||||
model.to_disk(saved_model_path)
|
||||
        logging.info("Saved model to {}".format(saved_model_path))
|
||||
return saved_model_path
|
||||
|
||||
@staticmethod
|
||||
def load_model(exp_name, model_dir):
|
||||
"""
|
||||
Loads a spacy model from disk
|
||||
|
||||
:param exp_name: Name of experiment under which the model was saved
|
||||
:param model_dir: path to saved model
|
||||
:return: spacy model
|
||||
"""
|
||||
saved_model_path = Path(model_dir, exp_name)
|
||||
return spacy.load(saved_model_path)
|
||||
|
||||
@staticmethod
|
||||
def load_or_create_empty_model(model=None):
|
||||
"""
|
||||
Loads a given model or creates a blank english model.
|
||||
:param model: Optional Model to load.
|
||||
:return: Loaded or blank model.
|
||||
"""
|
||||
if model:
|
||||
nlp = spacy.load(model)
|
||||
logging.debug("Loaded model {}".format(model))
|
||||
else:
|
||||
nlp = spacy.blank("en")
|
||||
logging.debug("Created blank 'en' model")
|
||||
return nlp
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
spacy_retrainer = SpacyRetrainer(original_model_name='en_core_web_lg',
|
||||
experiment_name='spacy_new_ontonotes28',
|
||||
n_iter=500, dropout=0.5, aml_config=None)
|
||||
spacy_retrainer.run()
|
|
@ -0,0 +1,151 @@
|
|||
# coding: utf-8
|
||||
"""
|
||||
Example of a Streamlit app for an interactive spaCy model visualizer. You can
|
||||
either download the script, or point streamlit run to the raw URL of this
|
||||
file. For more details, see https://streamlit.io.
|
||||
|
||||
Installation:
|
||||
pip install streamlit
|
||||
python -m spacy download en_core_web_sm
|
||||
python -m spacy download en_core_web_md
|
||||
python -m spacy download de_core_news_sm
|
||||
|
||||
Usage:
|
||||
streamlit run streamlit_spacy.py
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import streamlit as st
|
||||
import spacy
|
||||
from spacy import displacy
|
||||
import pandas as pd
|
||||
|
||||
|
||||
SPACY_MODEL_NAMES = ["en_core_web_lg", "spacy_new_ontonotes28","spacy_ft_100/model-final"]
|
||||
DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
||||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
||||
|
||||
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def load_model(name):
|
||||
return spacy.load(name)
|
||||
|
||||
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def process_text(model_name, text):
|
||||
nlp = load_model(model_name)
|
||||
return nlp(text)
|
||||
|
||||
|
||||
st.sidebar.title("Interactive spaCy visualizer")
|
||||
st.sidebar.markdown(
|
||||
"""
|
||||
Process text with [spaCy](https://spacy.io) models and visualize named entities,
|
||||
dependencies and more. Uses spaCy's built-in
|
||||
[displaCy](http://spacy.io/usage/visualizers) visualizer under the hood.
|
||||
"""
|
||||
)
|
||||
|
||||
spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES)
|
||||
model_load_state = st.info(f"Loading model '{spacy_model}'...")
|
||||
nlp = load_model(spacy_model)
|
||||
model_load_state.empty()
|
||||
|
||||
text = st.text_area("Text to analyze", DEFAULT_TEXT)
|
||||
doc = process_text(spacy_model, text)
|
||||
|
||||
if "parser" in nlp.pipe_names:
|
||||
st.header("Dependency Parse & Part-of-speech tags")
|
||||
st.sidebar.header("Dependency Parse")
|
||||
split_sents = st.sidebar.checkbox("Split sentences", value=True)
|
||||
collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True)
|
||||
collapse_phrases = st.sidebar.checkbox("Collapse phrases")
|
||||
compact = st.sidebar.checkbox("Compact mode")
|
||||
options = {
|
||||
"collapse_punct": collapse_punct,
|
||||
"collapse_phrases": collapse_phrases,
|
||||
"compact": compact,
|
||||
}
|
||||
docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
|
||||
for sent in docs:
|
||||
html = displacy.render(sent, options=options)
|
||||
# Double newlines seem to mess with the rendering
|
||||
html = html.replace("\n\n", "\n")
|
||||
if split_sents and len(docs) > 1:
|
||||
st.markdown(f"> {sent.text}")
|
||||
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
|
||||
|
||||
if "ner" in nlp.pipe_names:
|
||||
st.header("Named Entities")
|
||||
st.sidebar.header("Named Entities")
|
||||
label_set = nlp.get_pipe("ner").labels
|
||||
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
|
||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
||||
# Newlines seem to mess with the rendering
|
||||
html = html.replace("\n", " ")
|
||||
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
|
||||
attrs = ["text", "label_", "start", "end", "start_char", "end_char"]
|
||||
if "entity_linker" in nlp.pipe_names:
|
||||
attrs.append("kb_id_")
|
||||
data = [
|
||||
[str(getattr(ent, attr)) for attr in attrs]
|
||||
for ent in doc.ents
|
||||
if ent.label_ in labels
|
||||
]
|
||||
df = pd.DataFrame(data, columns=attrs)
|
||||
st.dataframe(df)
|
||||
|
||||
|
||||
if "textcat" in nlp.pipe_names:
|
||||
st.header("Text Classification")
|
||||
st.markdown(f"> {text}")
|
||||
df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score"))
|
||||
st.dataframe(df)
|
||||
|
||||
|
||||
vector_size = nlp.meta.get("vectors", {}).get("width", 0)
|
||||
if vector_size:
|
||||
st.header("Vectors & Similarity")
|
||||
st.code(nlp.meta["vectors"])
|
||||
text1 = st.text_input("Text or word 1", "apple")
|
||||
text2 = st.text_input("Text or word 2", "orange")
|
||||
doc1 = process_text(spacy_model, text1)
|
||||
doc2 = process_text(spacy_model, text2)
|
||||
similarity = doc1.similarity(doc2)
|
||||
if similarity > 0.5:
|
||||
st.success(similarity)
|
||||
else:
|
||||
st.error(similarity)
|
||||
|
||||
st.header("Token attributes")
|
||||
|
||||
if st.button("Show token attributes"):
|
||||
attrs = [
|
||||
"idx",
|
||||
"text",
|
||||
"lemma_",
|
||||
"pos_",
|
||||
"tag_",
|
||||
"dep_",
|
||||
"head",
|
||||
"ent_type_",
|
||||
"ent_iob_",
|
||||
"shape_",
|
||||
"is_alpha",
|
||||
"is_ascii",
|
||||
"is_digit",
|
||||
"is_punct",
|
||||
"like_num",
|
||||
]
|
||||
data = [[str(getattr(token, attr)) for attr in attrs] for token in doc]
|
||||
df = pd.DataFrame(data, columns=attrs)
|
||||
st.dataframe(df)
|
||||
|
||||
|
||||
st.header("JSON Doc")
|
||||
if st.button("Show JSON Doc"):
|
||||
st.json(doc.to_json())
|
||||
|
||||
st.header("JSON model meta")
|
||||
if st.button("Show JSON model meta"):
|
||||
st.json(nlp.meta)
|
|
@ -0,0 +1,245 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset\n",
|
||||
"from presidio_evaluator import ModelEvaluator\n",
|
||||
"from collections import Counter\n",
|
||||
"%load_ext autoreload\n",
|
||||
"%autoreload 2\n",
|
||||
"\n",
|
||||
"MY_PRESIDIO_ENDPOINT = \"http://presidio-api.westeurope.cloudapp.azure.com/api/v1/projects/test/analyze\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate your Presidio instance via the Presidio API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### A. Read dataset for evaluation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"input_samples = read_synth_dataset(\"../data/synth_dataset.txt\")\n",
|
||||
"print(\"Read {} samples\".format(len(input_samples)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### B. Descriptive statistics"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"flatten = lambda l: [item for sublist in l for item in sublist]\n",
|
||||
"\n",
|
||||
"count_per_entity = Counter([span.entity_type for span in flatten([input_sample.spans for input_sample in input_samples])])\n",
|
||||
"count_per_entity"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### C. Match the dataset's entity names with Presidio's entity names"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity\n",
|
||||
"entities_mapping = {\n",
|
||||
" 'PERSON': 'PERSON',\n",
|
||||
" 'EMAIL': 'EMAIL_ADDRESS',\n",
|
||||
" 'CREDIT_CARD': 'CREDIT_CARD',\n",
|
||||
" 'FIRST_NAME': 'PERSON',\n",
|
||||
" 'PHONE_NUMBER': 'PHONE_NUMBER',\n",
|
||||
" 'LOCATION':'LOCATION',\n",
|
||||
" # 'BIRTHDAY': 'DATE_TIME',\n",
|
||||
" # 'DATE': 'DATE_TIME',\n",
|
||||
" 'DOMAIN': 'DOMAIN',\n",
|
||||
" # 'CITY': 'LOCATION',\n",
|
||||
" # 'ADDRESS': 'LOCATION',\n",
|
||||
" 'IBAN': 'IBAN_CODE',\n",
|
||||
" # 'URL': 'DOMAIN_NAME',\n",
|
||||
" 'US_SSN': 'US_SSN',\n",
|
||||
" 'IP_ADDRESS': 'IP_ADDRESS',\n",
|
||||
" # 'ORGANIZATION':'ORG'\n",
|
||||
" 'O': 'O'\n",
|
||||
"}\n",
|
||||
"presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',\n",
|
||||
" 'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']\n",
|
||||
"\n",
|
||||
"new_list = ModelEvaluator.align_input_samples_to_presidio_analyzer(input_samples,\n",
|
||||
" entities_mapping,\n",
|
||||
" presidio_fields)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### D. Recalculate statistics on updated dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## recheck counter\n",
|
||||
"count_per_entity_new = Counter([span.entity_type for span in flatten([input_sample.spans for input_sample in new_list])])\n",
|
||||
"count_per_entity_new"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### E. Run the presidio-evaluator framework with Presidio's API as the 'model' at test"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator import PresidioAPIEvaluator\n",
|
||||
"presidio = PresidioAPIEvaluator(entities_to_keep=list(count_per_entity_new.keys()),endpoint=MY_PRESIDIO_ENDPOINT)\n",
|
||||
"evaluted_samples = presidio.evaluate_all(new_list[:100])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### F. Extract statistics\n",
|
||||
"- Presicion, recall and F measure are calculated based on a PII/Not PII binary classification per token.\n",
|
||||
"- Specific entity recall and precision are calculated on the specific PII entity level."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"evaluation_result = presidio.calculate_score(evaluted_samples)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"evaluation_result.print()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### G. Analyze wrong predictions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"errors = evaluation_result.model_errors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ModelEvaluator.most_common_fp_tokens(errors,n=5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
|
||||
"fps_df[['full_text','token','prediction']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')\n",
|
||||
"fns_df"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": []
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,226 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from tqdm import tqdm_notebook as tqdm\n",
|
||||
"from presidio_evaluator.data_generator.main import generate,read_synth_dataset\n",
|
||||
"\n",
|
||||
"import datetime\n",
|
||||
"import json"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Generate fake PII data using Presidio's data generator"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Presidio's data generator allows you to generate a synthetic dataset with two preriquisites:\n",
|
||||
"1. A fake PII csv (We used https://www.fakenamegenerator.com/)\n",
|
||||
"2. A text file with template sentences or paragraphs. In this file, each PII entity placeholder is written in brackets. The name of the PII entity should be one of the columns in the fake PII csv file.\n",
|
||||
"\n",
|
||||
"The generator creates fake sentences based on the provided fake PII csv AND a list of [extension functions](../presidio_evaluator/data_generator/extensions.py) and a few additional 3rd party libraries like `Faker`, and `haikunator`.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"For example:\n",
|
||||
"1. **A fake PII csv**:\n",
|
||||
"\n",
|
||||
"| FIRST_NAME | LAST_NAME | EMAIL |\n",
|
||||
"|-------------|-------------|-----------|\n",
|
||||
"| David | Brown | david.brown@jobhop.com |\n",
|
||||
"| Mel | Brown | melb@hobjob.com |\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"2. **Templates**:\n",
|
||||
"\n",
|
||||
"My name is [FIRST_NAME]\n",
|
||||
"\n",
|
||||
"You can email me at [EMAIL]. Thanks, [FIRST_NAME]\n",
|
||||
"\n",
|
||||
"What's your last name? It's [LAST_NAME]\n",
|
||||
"\n",
|
||||
"Every time I see you falling I get down on my knees and pray\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Generate files\n",
|
||||
"Based on these two prerequisites, a requested number of examples and an output file name:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"EXAMPLES = 100\n",
|
||||
"SPAN_TO_TAG = True #Whether to create tokens + token labels (tags)\n",
|
||||
"TEMPLATES_FILE = '../presidio_evaluator/data_generator/' \\\n",
|
||||
" 'raw_data/ontonotes_based_templates.txt'\n",
|
||||
"KEEP_ONLY_TAGGED = False\n",
|
||||
"LOWER_CASE_RATIO = 0.1\n",
|
||||
"IGNORE_TYPES = {\"IP_ADDRESS\", 'US_SSN', 'URL'}\n",
|
||||
"\n",
|
||||
"OUTPUT = \"generated_size_{}_date_{}.txt\".format(EXAMPLES, cur_time)\n",
|
||||
"\n",
|
||||
"cur_time = datetime.date.today().strftime(\"%B %d %Y\")\n",
|
||||
"fake_pii_csv = '../presidio_evaluator/data_generator/' \\\n",
|
||||
" 'raw_data/FakeNameGenerator.com_100.csv'\n",
|
||||
"utterances_file = TEMPLATES_FILE\n",
|
||||
"dictionary_path = '../presidio_evaluator/data_generator/' \\\n",
|
||||
" 'raw_data/Dictionary.csv'\n",
|
||||
"\n",
|
||||
"examples = generate(fake_pii_csv=fake_pii_csv,\n",
|
||||
" utterances_file=utterances_file,\n",
|
||||
" dictionary_path=dictionary_path,\n",
|
||||
" output_file=OUTPUT,\n",
|
||||
" lower_case_ratio=LOWER_CASE_RATIO,\n",
|
||||
" num_of_examples=EXAMPLES,\n",
|
||||
" ignore_types=IGNORE_TYPES,\n",
|
||||
" keep_only_tagged=KEEP_ONLY_TAGGED,\n",
|
||||
" span_to_tag=SPAN_TO_TAG)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To read a dataset file into the InputSample format, use `read_synth_dataset`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"input_samples = read_synth_dataset(OUTPUT)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"input_samples[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The full structure of each input_sample is the following. It includes different feature values per token as calculated by Spacy"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"input_samples[0].to_dict()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Verify randomness of dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"count_per_template_id = Counter([sample.metadata['Template#'] for sample in input_samples])\n",
|
||||
"for key in sorted(count_per_template_id):\n",
|
||||
" print(\"{}: {}\".format(key,count_per_template_id[key]))\n",
|
||||
" \n",
|
||||
"print(sum(count_per_template_id.values()))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Transform to the CONLL structure:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator import InputSample\n",
|
||||
"\n",
|
||||
"conll = InputSample.create_conll_dataset(input_samples)\n",
|
||||
"conll.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Copyright notice:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Data generated for evaluation was created using Fake Name Generator.\n",
|
||||
"\n",
|
||||
"Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/) \n",
|
||||
"are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,274 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Fake PII data: Exploratory data analysis\n",
|
||||
"\n",
|
||||
"This notebook is used to verify the different fake entities before and after the creation of a synthetic dataset / augmented dataset. First part looks at the generation details and stats, second part evaluates the created synthetic dataset after it has been generated."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"from presidio_evaluator.data_generator.extensions import generate_iban, generate_ip_addresses, generate_SSNs, \\\n",
|
||||
" generate_company_names, generate_url, generate_roles, generate_titles, generate_nationality, generate_nation_man, \\\n",
|
||||
" generate_nation_woman, generate_nation_plural, generate_title\n",
|
||||
"\n",
|
||||
"from presidio_evaluator.data_generator import FakeDataGenerator, read_synth_dataset\n",
|
||||
"\n",
|
||||
"from collections import Counter\n",
|
||||
"\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"%matplotlib inline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Evaluate generation logic and the fake PII bank used during generation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = pd.read_csv(\"../presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_100000.csv\",encoding=\"utf-8\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"generator = FakeDataGenerator(fake_pii_df=df, \n",
|
||||
" templates=None, \n",
|
||||
" dictionary_path=None,\n",
|
||||
" ignore_types={\"IP_ADDRESS\", 'US_SSN', 'URL','ADDRESS'})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pii_df = generator.prep_fake_pii(df)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for (name, series) in pii_df.iteritems():\n",
|
||||
" print(name)\n",
|
||||
" print(\"Unique values: {}\".format(len(series.unique())))\n",
|
||||
" print(series.value_counts())\n",
|
||||
" print(\"\\n**************\\n\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from wordcloud import WordCloud\n",
|
||||
"\n",
|
||||
"def series_to_wordcloud(series):\n",
|
||||
" freqs = series.value_counts()\n",
|
||||
" wordcloud = WordCloud(background_color='white',width=800,height=400).generate_from_frequencies(freqs)\n",
|
||||
" fig = plt.figure(figsize=(16, 8))\n",
|
||||
" plt.suptitle(\"{} word cloud\".format(series.name))\n",
|
||||
" plt.imshow(wordcloud, interpolation='bilinear')\n",
|
||||
" plt.axis(\"off\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"series_to_wordcloud(df.FIRST_NAME)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"series_to_wordcloud(df.LAST_NAME)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 96,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"series_to_wordcloud(df.COUNTRY)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"series_to_wordcloud(df.ORGANIZATION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"series_to_wordcloud(df.CITY)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"2. Evaluate different entities in the synthetic dataset after creation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"synth = read_synth_dataset(\"../data/generated_train_November 12 2019.json\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentences_only = [(sample.full_text,sample.metadata) for sample in synth]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentences_only[2]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"Proportions of female vs. male based samples:\")\n",
|
||||
"Counter([sentence[1]['Gender'] for sentence in sentences_only])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"Proportion of lower case samples:\")\n",
|
||||
"Counter([sentence[1]['Lowercase'] for sentence in sentences_only])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"Proportion of nameset across samples:\")\n",
|
||||
"Counter([sentence[1]['NameSet'] for sentence in sentences_only])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def get_entity_values_from_sample(sample,entity_types):\n",
|
||||
" name_entities = [span.entity_value for span in sample.spans if span.entity_type in entity_types]\n",
|
||||
" return name_entities\n",
|
||||
" \n",
|
||||
"names = [get_entity_values_from_sample(sample,['PERSON','FIRST_NAME','LAST_NAME']) for sample in synth]\n",
|
||||
"names = [item for sublist in names for item in sublist]\n",
|
||||
"series_to_wordcloud(pd.Series(names,name='PERSON, FIRST_NAME, LAST_NAME'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"countries = [get_entity_values_from_sample(sample,['LOCATION']) for sample in synth]\n",
|
||||
"countries = [item for sublist in countries for item in sublist]\n",
|
||||
"series_to_wordcloud(pd.Series(countries,name='LOCATION'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"orgs = [get_entity_values_from_sample(sample,['ORGANIZATION']) for sample in synth]\n",
|
||||
"orgs = [item for sublist in orgs for item in sublist]\n",
|
||||
"series_to_wordcloud(pd.Series(orgs,name='ORGANIZATION'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,166 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Train/Test/Validation split of input samples. \n",
|
||||
"This notebook shows how train/test/split is being made on a List[InputSample]\n",
|
||||
"\n",
|
||||
"This is different for the normal split since we don't want sentences generated from the same pattern to be in more than one set. (Applicable only if the dataset was generated from templates)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator import InputSample\n",
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset\n",
|
||||
"from presidio_evaluator.validation import split_dataset, save_to_json\n",
|
||||
"\n",
|
||||
"%reload_ext autoreload"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATE_DATE = \"November 12 2019\"\n",
|
||||
"SIZE = 80000"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load full dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"all_samples = read_synth_dataset(\"../presidio_evaluator/data_generator/generated_size_{}_date_{}.txt\".format(SIZE, DATE_DATE))\n",
|
||||
"print(len(all_samples))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Split to train/test/dev"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"TRAIN_TEST_VAL_RATIOS = [0.7,0.2,0.1]\n",
|
||||
"\n",
|
||||
"train, test, validation = split_dataset(all_samples,TRAIN_TEST_VAL_RATIOS)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Train/Test only (no validation)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"#TRAIN_TEST_RATIOS = [0.7,0.3]\n",
|
||||
"#train,test = split_dataset(all_sampleTRAIN_TEST_RATIOSEST_RATIOS)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Save the different sets to files"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"save_to_json(train,\"../data/train_{}.json\".format(DATE_DATE))\n",
|
||||
"save_to_json(test,\"../data/test_{}.json\".format(DATE_DATE))\n",
|
||||
"save_to_json(validation,\"../data/1validation_{}.json\".format(DATE_DATE))\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(len(train))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(len(test))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(len(validation))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"assert len(train) + len(test) + len(validation) == len(all_samples)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,308 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"CRF trainer using the sklearn_crfsuite package (Python wrapper for CRFSuite): https://sklearn-crfsuite.readthedocs.io/en/latest/"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"\n",
|
||||
"import sklearn_crfsuite\n",
|
||||
"from sklearn_crfsuite import metrics\n",
|
||||
"\n",
|
||||
"from presidio_evaluator import InputSample\n",
|
||||
"from presidio_evaluator.crf_evaluator import CRFEvaluator\n",
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATA_DATE = \"November 12 2019\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Source a dataset to use for training / testing:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": true,
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_samples = read_synth_dataset(\"../../data/generated_train_{}.json\".format(DATA_DATE))\n",
|
||||
"test_samples = read_synth_dataset(\"../../data/generated_test_{}.json\".format(DATA_DATE))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_tagged = [sample for sample in train_samples if len(sample.spans) > 0]\n",
|
||||
"print(\"Kept {} train samples after removal of non-tagged samples\".format(len(train_tagged)))\n",
|
||||
"train_data = InputSample.create_conll_dataset(train_tagged)\n",
|
||||
"\n",
|
||||
"test_data = InputSample.create_conll_dataset(test_samples)\n",
|
||||
"test_data.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Turn every sentence into a list of lists (list of tokens + pos + label)\n",
|
||||
"test_sents=test_data.groupby('sentence')[['text','pos','label']].apply(lambda x: x.values.tolist())\n",
|
||||
"train_sents=train_data.groupby('sentence')[['text','pos','label']].apply(lambda x: x.values.tolist())\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Create features for CRF"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"CRFEvaluator.sent2features(train_sents[0])[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%time\n",
|
||||
"X_train = [CRFEvaluator.sent2features(s) for s in train_sents]\n",
|
||||
"y_train = [CRFEvaluator.sent2labels(s) for s in train_sents]\n",
|
||||
"\n",
|
||||
"X_test = [CRFEvaluator.sent2features(s) for s in test_sents]\n",
|
||||
"y_test = [CRFEvaluator.sent2labels(s) for s in test_sents]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%time\n",
|
||||
"crf = sklearn_crfsuite.CRF(\n",
|
||||
" algorithm='lbfgs',\n",
|
||||
" c1=0.1,\n",
|
||||
" c2=0.1,\n",
|
||||
" max_iterations=100,\n",
|
||||
" all_possible_transitions=True\n",
|
||||
")\n",
|
||||
"crf.fit(X_train, y_train)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Save trained model to pickle"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pickle\n",
|
||||
"with open(\"../../models/crf.pickle\",'wb') as f:\n",
|
||||
" data = pickle.dump(crf, f,protocol=pickle.HIGHEST_PROTOCOL)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Open saved model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open(\"../../models/crf.pickle\", 'rb') as f:\n",
|
||||
" crf = pickle.load(f)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Extract info and predictions from model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"labels = list(crf.classes_)\n",
|
||||
"labels.remove('O')\n",
|
||||
"labels"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"y_pred = crf.predict(X_test)\n",
|
||||
"metrics.flat_f1_score(y_test, y_pred,\n",
|
||||
" average='weighted', labels=labels)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## predict one:\n",
|
||||
"y_5_pred = crf.predict([X_test[5]])\n",
|
||||
"y_5_pred[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# group B and I results\n",
|
||||
"sorted_labels = sorted(\n",
|
||||
" labels,\n",
|
||||
" key=lambda name: (name[1:], name[0])\n",
|
||||
")\n",
|
||||
"print(metrics.flat_classification_report(\n",
|
||||
" y_test, y_pred, labels=sorted_labels, digits=3\n",
|
||||
"))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Model explainability"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def print_transitions(trans_features):\n",
|
||||
" for (label_from, label_to), weight in trans_features:\n",
|
||||
" print(\"%-6s -> %-7s %0.6f\" % (label_from, label_to, weight))\n",
|
||||
"\n",
|
||||
"print(\"Top likely transitions:\")\n",
|
||||
"print_transitions(Counter(crf.transition_features_).most_common(20))\n",
|
||||
"\n",
|
||||
"print(\"\\nTop unlikely transitions:\")\n",
|
||||
"print_transitions(Counter(crf.transition_features_).most_common()[-20:])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def print_state_features(state_features):\n",
|
||||
" for (attr, label), weight in state_features:\n",
|
||||
" print(\"%0.6f %-8s %s\" % (weight, label, attr))\n",
|
||||
"\n",
|
||||
"print(\"Top positive:\")\n",
|
||||
"print_state_features(Counter(crf.state_features_).most_common(30))\n",
|
||||
"\n",
|
||||
"print(\"\\nTop negative:\")\n",
|
||||
"print_state_features(Counter(crf.state_features_).most_common()[-30:])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
|
@ -0,0 +1,315 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Spacy dataset creation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"This notebook takes train and test datasets (of type `List[InputSample]`)\n",
|
||||
"and transforms them into two structures consumed by Spacy:\n",
|
||||
"1. Spacy JSON (see https://spacy.io/api/annotation#json-input)\n",
|
||||
"2. Spacy Pickle files (of structure `[(full_text, {\"entities\": [(start, end, type), ...]}), ...]`. \n",
|
||||
"See more details here: https://spacy.io/api/annotation#json-input)\n",
|
||||
"\n",
|
||||
"JSON is used for Spacy's CLI trainer. \n",
|
||||
"Pickle is used for fine-tuning using the logic in [../models/spacy_retrain.py](../models/spacy_retrain.py)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset\n",
|
||||
"%reload_ext autoreload"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATA_DATE = 'November 12 2019'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"data_path = \"../data/generated_{}_{}.json\"\n",
|
||||
"\n",
|
||||
"train_samples = read_synth_dataset(data_path.format(\"train\",DATA_DATE))\n",
|
||||
"print(\"Read {} samples\".format(len(train_samples)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"For training, keep only sentences with entities:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_tagged = [sample for sample in train_samples if len(sample.spans)>0]\n",
|
||||
"print(\"Kept {} samples after removal of non-tagged samples\".format(len(train_tagged)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Evaluate training set's entities"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"Entities found in training set:\")\n",
|
||||
"entities = []\n",
|
||||
"for sample in train_tagged:\n",
|
||||
" entities.extend([tag for tag in sample.tags])\n",
|
||||
"set(entities)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Create Spacy dataset (option 2: pickle)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator import InputSample\n",
|
||||
"import pickle\n",
|
||||
"\n",
|
||||
"spacy_train = InputSample.create_spacy_dataset(train_tagged)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"entities_spacy = [x[1]['entities'] for x in spacy_train]\n",
|
||||
"entities_spacy\n",
|
||||
"entities_spacy_flat = []\n",
|
||||
"for samp in entities_spacy:\n",
|
||||
"    for ent in samp:\n",
|
||||
"        entities_spacy_flat.append(ent[2])\n",
|
||||
"set(entities_spacy_flat)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Create Spacy dataset (option 1: JSON)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator import InputSample\n",
|
||||
"spacy_train_json = InputSample.create_spacy_json(train_tagged)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Quick evaluation of samples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"[sample[0] for sample in spacy_train[:100]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"spacy_train_json[0]['paragraphs'][0]['sentences']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Dump training set to pickle and json respectively"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pickle\n",
|
||||
"import json\n",
|
||||
"with open(\"../data/train.pickle\", 'wb') as handle:\n",
|
||||
" pickle.dump(spacy_train,handle, protocol=pickle.HIGHEST_PROTOCOL)\n",
|
||||
"\n",
|
||||
"with open(\"../data/train.json\",\"w\") as f:\n",
|
||||
" json.dump(spacy_train_json,f)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Create JSON and pickle files for test dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
|
||||
"print(\"Read {} samples\".format(len(test_samples)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"spacy_test = InputSample.create_spacy_dataset(test_samples)\n",
|
||||
"spacy_test_json = InputSample.create_spacy_json(test_samples)\n",
|
||||
"print(spacy_test[14])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Dump test set to pickle and json respectively"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pickle\n",
|
||||
"with open(\"../data/test.pickle\", 'wb') as handle:\n",
|
||||
" pickle.dump(spacy_test,handle, protocol=pickle.HIGHEST_PROTOCOL)\n",
|
||||
" \n",
|
||||
"with open(\"../data/test.json\",\"w\") as f:\n",
|
||||
" json.dump(spacy_test_json,f)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": []
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,380 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Evaluate CRF models for person names, orgs and locations using the Presidio Evaluator framework\n",
|
||||
"\n",
|
||||
"Data = `generated_test_November 12 2019`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from tqdm import tqdm_notebook as tqdm\n",
|
||||
"import logging\n",
|
||||
"from presidio_evaluator import InputSample\n",
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset\n",
|
||||
"import spacy\n",
|
||||
"import pandas as pd\n",
|
||||
"import pickle\n",
|
||||
"\n",
|
||||
"pd.set_option('display.width', 10000)\n",
|
||||
"pd.set_option('display.max_colwidth', -1)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"%reload_ext autoreload\n",
|
||||
"%autoreload 2\n",
|
||||
"\n",
|
||||
"DATA_DATE = 'November 12 2019'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"data_path = \"../../data/generated_{}_{}.json\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Select data for evaluation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#test_samples = read_synth_dataset(data_path.format(\"test\", DATA_DATE))\n",
|
||||
"#print(len(test_samples))\n",
|
||||
"\n",
|
||||
"val_samples = read_synth_dataset(data_path.format(\"validation\", DATA_DATE))\n",
|
||||
"#print(len(val_samples))\n",
|
||||
"\n",
|
||||
"#synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
|
||||
"#print(len(synth_samples))\n",
|
||||
"\n",
|
||||
"#conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
|
||||
"\n",
|
||||
"DATASET = val_samples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"entity_counter = Counter()\n",
|
||||
"for sample in DATASET:\n",
|
||||
"    for tag in sample.tags:\n",
|
||||
"        entity_counter[tag] += 1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"entity_counter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATASET[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Length (in tokens) of the longest sentence\n",
|
||||
"max([len(sample.tokens) for sample in DATASET])\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Select models for evaluation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"crf_vanilla = \"../../model-outputs/crf.pickle\"\n",
|
||||
" \n",
|
||||
"models = [crf_vanilla]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run evaluation on all models:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator.crf_evaluator import CRFEvaluator\n",
|
||||
"\n",
|
||||
"for model in models:\n",
|
||||
" print(\"-----------------------------------\")\n",
|
||||
" print(\"Evaluating model {}\".format(model))\n",
|
||||
" crf_evaluator = CRFEvaluator(model_pickle_path=model)\n",
|
||||
" evaluation_results = crf_evaluator.evaluate_all(DATASET)\n",
|
||||
" scores = crf_evaluator.calculate_score(evaluation_results)\n",
|
||||
" \n",
|
||||
" print(\"Confusion matrix:\")\n",
|
||||
" print(scores.results)\n",
|
||||
"\n",
|
||||
" print(\"Precision and recall\")\n",
|
||||
" scores.print()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Custom evaluation of the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Try out the model\n",
|
||||
"def sent_to_features(model_path,sent):\n",
|
||||
"    \"\"\"\n",
|
||||
"    Translates a sentence into a prediction using a saved CRF model\n",
|
||||
"    \"\"\"\n",
|
||||
"\n",
|
||||
"    with open(model_path, 'rb') as f:\n",
|
||||
"        model = pickle.load(f)\n",
|
||||
"\n",
|
||||
"    tokenizer = spacy.blank('en')\n",
|
||||
"    tokens = tokenizer(sent)\n",
|
||||
"    tags = ['O' for token in tokens]  # Placeholder: Not used but required.\n",
|
||||
"    metadata = {'Template#':1,'Gender':'1','Country':'2'}  # Placeholder: Not used but required\n",
|
||||
"    input_sample = InputSample(full_text=sent,masked=\"\",spans=None,tokens=tokens,tags=tags,metadata=metadata,create_tags_from_span=False,)\n",
|
||||
"\n",
|
||||
"    return CRFEvaluator.crf_predict(input_sample, model)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"SENTENCE = \"Michael is American\"\n",
|
||||
"\n",
|
||||
"sent_to_features(model_path=crf_vanilla, sent=SENTENCE)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### False positives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Most false positive tokens:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"errors = scores.model_errors\n",
|
||||
"\n",
|
||||
"from presidio_evaluator import ModelEvaluator\n",
|
||||
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"2. Review false positives for entity 'PERSON'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
|
||||
"fps_df[['full_text','token','prediction']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### False negative examples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"More FN analysis"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df[['full_text','token','annotation','prediction']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": []
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,326 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Evaluate Flair models for person names, orgs and locations using the Presidio Evaluator framework\n",
|
||||
"\n",
|
||||
"Data = `generated_test_November 12 2019`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset\n",
|
||||
"%reload_ext autoreload\n",
|
||||
"%autoreload 2"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATA_DATE = \"November 12 2019\"\n",
|
||||
"data_path = \"../../data/generated_{}_{}.json\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Select data for evaluation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
|
||||
"#print(len(test_samples))\n",
|
||||
"\n",
|
||||
"#val_samples = read_synth_dataset(data_path.format(\"validation\",DATA_DATE))\n",
|
||||
"#print(len(val_samples))\n",
|
||||
"\n",
|
||||
"#synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
|
||||
"#print(len(synth_samples))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
|
||||
"\n",
|
||||
"DATASET = conll_samples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"entity_counter = Counter()\n",
|
||||
"for sample in DATASET:\n",
|
||||
"    for tag in sample.tags:\n",
|
||||
"        entity_counter[tag] += 1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"entity_counter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATASET[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Length (in tokens) of the longest sentence\n",
|
||||
"max([len(sample.tokens) for sample in DATASET])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Select models for evaluation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"flair_ner = 'ner'\n",
|
||||
"flair_ner_fast = 'ner-fast'\n",
|
||||
"flair_ontonotes = 'ner-ontonotes-fast'\n",
|
||||
"flair_bert_embeddings = '../../models/presidio-ner/flair-bert-embeddings.pt'\n",
|
||||
"glove_flair_embeddings = '../../models/presidio-ner/flair-embeddings.pt'\n",
|
||||
"models = [glove_flair_embeddings]\n",
|
||||
"#models = [flair_bert_embeddings, glove_flair_embeddings, flair_ner,flair_ner_fast,flair_ontonotes ]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": true
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator.flair_evaluator import FlairEvaluator\n",
|
||||
"\n",
|
||||
"for model in models:\n",
|
||||
" print(\"-----------------------------------\")\n",
|
||||
" print(\"Evaluating model {}\".format(model))\n",
|
||||
" flair_evaluator = FlairEvaluator(model_path=model)\n",
|
||||
" evaluation_results = flair_evaluator.evaluate_all(DATASET)\n",
|
||||
" scores = flair_evaluator.calculate_score(evaluation_results)\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" print(\"Confusion matrix:\")\n",
|
||||
" print(scores.results)\n",
|
||||
"\n",
|
||||
" print(\"Precision and recall\")\n",
|
||||
" scores.print()\n",
|
||||
" errors = scores.model_errors\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Custom evaluation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### False positives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Most false positive tokens:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"errors = scores.model_errors\n",
|
||||
"\n",
|
||||
"from presidio_evaluator import ModelEvaluator\n",
|
||||
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='PERSON')\n",
|
||||
"fps_df[['full_text','token','prediction']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"2. False negative examples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"More FN analysis"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='PERSON')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false,
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df[['full_text','token','annotation','prediction']]"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": []
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,404 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Evaluate Spacy models for person names, orgs and locations using the Presidio Evaluator framework\n",
|
||||
"\n",
|
||||
"Data = `generated_test_November 12 2019`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import spacy\n",
|
||||
"\n",
|
||||
"from presidio_evaluator import ModelEvaluator\n",
|
||||
"from presidio_evaluator.data_generator import read_synth_dataset\n",
|
||||
"%reload_ext autoreload\n",
|
||||
"%autoreload 2"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATA_DATE = \"November 12 2019\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#!pip freeze | grep en_core_web_lg\n",
|
||||
"!pip freeze | findstr en-core-web-lg\n",
|
||||
"!pip freeze | findstr spacy"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"data_path = \"../../data/generated_{}_{}.json\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Select data for evaluation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# test_samples = read_synth_dataset(data_path.format(\"test\",DATA_DATE))\n",
|
||||
"# print(len(test_samples))\n",
|
||||
"\n",
|
||||
"# val_samples = read_synth_dataset(data_path.format(\"validation\",DATA_DATE))\n",
|
||||
"# print(len(val_samples))\n",
|
||||
"\n",
|
||||
"# synth_samples = read_synth_dataset(\"../../data/synth_dataset.txt\")\n",
|
||||
"# print(len(synth_samples))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"conll_samples = read_synth_dataset(\"../../data/conll_generated_size_20000_date_November 12 2019.txt\")\n",
|
||||
"\n",
|
||||
"DATASET = conll_samples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"entity_counter = Counter()\n",
|
||||
"for sample in DATASET:\n",
|
||||
"    for span in sample.spans:\n",
|
||||
"        entity_counter[span.entity_type] += 1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"entity_counter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATASET[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Length (in tokens) of the longest sentence\n",
|
||||
"max([len(sample.tokens) for sample in DATASET])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Select models for evaluation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"models = []\n",
|
||||
"\n",
|
||||
"en_core_web_lg = r\"en_core_web_lg\"\n",
|
||||
"spacy_new_ontonotes28 = r\"C:\\Users\\ommendel\\OneDrive - Microsoft\\Projects\\presidio\\Presidio-internal\\presidio-evaluator\\models\\spacy_new_ontonotes28\"\n",
|
||||
"\n",
|
||||
"spacy_ft_100 = r\"C:\\Users\\ommendel\\OneDrive - Microsoft\\Projects\\presidio\\Presidio-internal\\presidio-evaluator\\models\\spacy_ft_100\\model-final\"\n",
|
||||
"\n",
|
||||
"models = [en_core_web_lg, spacy_new_ontonotes28, spacy_ft_100]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run evaluation on all models:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": true
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from presidio_evaluator.spacy_evaluator import SpacyEvaluator\n",
|
||||
"\n",
|
||||
"for model in models:\n",
|
||||
" print(\"-----------------------------------\")\n",
|
||||
" print(\"Evaluating model {}\".format(model))\n",
|
||||
" nlp = spacy.load(model)\n",
|
||||
" spacy_evaluator = SpacyEvaluator(model=nlp,entities_to_keep=['PERSON','GPE','ORG'])\n",
|
||||
" evaluation_results = spacy_evaluator.evaluate_all(DATASET)\n",
|
||||
" scores = spacy_evaluator.calculate_score(evaluation_results)\n",
|
||||
" \n",
|
||||
" print(\"Confusion matrix:\")\n",
|
||||
" print(scores.results)\n",
|
||||
"\n",
|
||||
" print(\"Precision and recall\")\n",
|
||||
" scores.print()\n",
|
||||
" errors = scores.model_errors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Custom evaluation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#evaluate custom sentences\n",
|
||||
"nlp = spacy.load(spacy_ft_100)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Results analysis"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#sent = input(\"Enter sentence: \")\n",
|
||||
"sent = 'David is talking loudly'\n",
|
||||
"doc = nlp(sent)\n",
|
||||
"for ent in doc.ents:\n",
|
||||
" print(\"Entity = {} value = {}\".format(ent.label_,ent.text))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### False positives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Most false positive tokens:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ModelEvaluator.most_common_fp_tokens(errors)#[model_error for model_error in errors if model_error.error_type =='FP']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fps_df = ModelEvaluator.get_fps_dataframe(errors,entity='LOCATION')\n",
|
||||
"fps_df[['full_text','token','prediction']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"2. False negative examples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"errors = scores.model_errors\n",
|
||||
"ModelEvaluator.most_common_fn_tokens(errors,n=50, entity='PERSON')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"More FN analysis"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df = ModelEvaluator.get_fns_dataframe(errors,entity='GPE')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fns_df[['full_text','token','annotation','prediction']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"[print(error,\"\\n\") for error in errors]"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": []
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,5 @@
|
|||
from .span_to_tag import span_to_tag, tokenize
|
||||
from .data_objects import Span, InputSample, EvaluationResult, ModelError
|
||||
from .model_evaluator import ModelEvaluator
|
||||
from .spacy_evaluator import SpacyEvaluator
|
||||
from .presidio_api_evaluator import PresidioAPIEvaluator
|
|
@ -0,0 +1,97 @@
|
|||
import pickle
|
||||
from typing import List
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
|
||||
|
||||
class CRFEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model_pickle_path: str = "../models/crf.pickle",
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True):
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model_pickle_path is None:
|
||||
raise ValueError("model_pickle_path must be supplied")
|
||||
|
||||
with open(model_pickle_path, 'rb') as f:
|
||||
self.model = pickle.load(f)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
tags = CRFEvaluator.crf_predict(sample,self.model)
|
||||
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
# translated_tags = sample.rename_from_spacy_tags(tags)
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def crf_predict(sample, model):
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
conll = sample.to_conll(translate_tags=True)
|
||||
sentence = [(di['text'], di['pos'], di['label']) for di in conll]
|
||||
features = CRFEvaluator.sent2features(sentence)
|
||||
return model.predict([features])[0]
|
||||
|
||||
@staticmethod
|
||||
def word2features(sent, i):
|
||||
word = sent[i][0]
|
||||
postag = sent[i][1]
|
||||
|
||||
features = {
|
||||
'bias': 1.0,
|
||||
'word.lower()': word.lower(),
|
||||
'word[-3:]': word[-3:],
|
||||
'word[-2:]': word[-2:],
|
||||
'word.isupper()': word.isupper(),
|
||||
'word.istitle()': word.istitle(),
|
||||
'word.isdigit()': word.isdigit(),
|
||||
'postag': postag,
|
||||
'postag[:2]': postag[:2],
|
||||
}
|
||||
if i > 0:
|
||||
word1 = sent[i - 1][0]
|
||||
postag1 = sent[i - 1][1]
|
||||
features.update({
|
||||
'-1:word.lower()': word1.lower(),
|
||||
'-1:word.istitle()': word1.istitle(),
|
||||
'-1:word.isupper()': word1.isupper(),
|
||||
'-1:postag': postag1,
|
||||
'-1:postag[:2]': postag1[:2],
|
||||
})
|
||||
else:
|
||||
features['BOS'] = True
|
||||
|
||||
if i < len(sent) - 1:
|
||||
word1 = sent[i + 1][0]
|
||||
postag1 = sent[i + 1][1]
|
||||
features.update({
|
||||
'+1:word.lower()': word1.lower(),
|
||||
'+1:word.istitle()': word1.istitle(),
|
||||
'+1:word.isupper()': word1.isupper(),
|
||||
'+1:postag': postag1,
|
||||
'+1:postag[:2]': postag1[:2],
|
||||
})
|
||||
else:
|
||||
features['EOS'] = True
|
||||
|
||||
return features
|
||||
|
||||
@staticmethod
|
||||
def sent2features(sent):
|
||||
return [CRFEvaluator.word2features(sent, i) for i in range(len(sent))]
|
||||
|
||||
@staticmethod
|
||||
def sent2labels(sent):
|
||||
return [label for token, postag, label in sent]
|
||||
|
||||
@staticmethod
|
||||
def sent2tokens(sent):
|
||||
return [token for token, postag, label in sent]
|
|
@ -0,0 +1,48 @@
|
|||
# PII dataset generator
|
||||
This data generator takes a text file with templates (e.g. `my name is [PERSON]`) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
|
||||
It also creates Spans (start and end of each entity), tokens (using the `spaCy` tokenizer) and tags in various schemes (BIO/IOB, IO, BILOU).
In addition, it provides some off-the-shelf features for each token, such as `pos`, `dep` and `is_in_vocabulary`.
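
To illustrate how the tagging schemes differ, here is a minimal sketch that turns character-level spans into IO, BIO and BILOU tags. It is not the project's implementation: `spans_to_tags` is a hypothetical helper, it splits on whitespace instead of using the `spaCy` tokenizer, and the `PREFIX-ENTITY` tag format is an assumption made for the example.

```python
from typing import List, Tuple


def spans_to_tags(text: str, spans: List[Tuple[int, int, str]], scheme: str = "BIO"):
    """Tag whitespace-separated tokens from character-level (start, end, entity) spans."""
    tokens, tags = [], []
    offset = 0
    for token in text.split():
        # Locate the token in the original text to get its character offsets
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        tag = "O"
        for span_start, span_end, entity in spans:
            if span_start <= start and end <= span_end:
                first, last = start == span_start, end == span_end
                if scheme == "IO":
                    prefix = "I"
                elif scheme == "BIO":
                    prefix = "B" if first else "I"
                else:  # BILOU
                    prefix = "U" if first and last else "B" if first else "L" if last else "I"
                tag = f"{prefix}-{entity}"
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


text = "My name is John Smith"
spans = [(11, 21, "PERSON")]  # character offsets of "John Smith"
for scheme in ("IO", "BIO", "BILOU"):
    print(scheme, spans_to_tags(text, spans, scheme)[1])
```

Because the sketch tokenizes on whitespace, punctuation attached to an entity (for example `Smith.`) would fall outside the span and be tagged `O`; a real tokenizer avoids this.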
|
||||
|
||||
The main class is `FakeDataGenerator`. In addition, the `main` module has two functions, `generate` and `read_synth_dataset`, for creating and reading a fake dataset.
During the generation process, the tool takes fake PII from a provided CSV with a known format and/or from extension functions, which can be found in the `extensions.py` file.
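
To make the generation step concrete, here is a minimal sketch of how a template can be filled while recording the span of each inserted value. It is an illustration only: `fill_template` and the sample values are hypothetical, and the sketch skips what `FakeDataGenerator` does in addition (lower-casing, a/an correction, duplicate entities).

```python
import re


def fill_template(template, values):
    """Replace [ENTITY] placeholders and record (start, end, entity) for each value."""
    parts, spans = [], []
    cursor = 0  # position in the template
    length = 0  # length of the output built so far
    for match in re.finditer(r"\[([A-Z_0-9]+)\]", template):
        literal = template[cursor:match.start()]
        entity = match.group(1)
        value = values[entity]
        parts.extend([literal, value])
        start = length + len(literal)
        spans.append((start, start + len(value), entity))
        length = start + len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


sentence, spans = fill_template(
    "My name is [PERSON] and I live in [CITY]",
    {"PERSON": "Dana Levi", "CITY": "Oslo"},  # made-up values
)
print(sentence)
for start, end, entity in spans:
    print(entity, repr(sentence[start:end]))
```

Computing the offsets while the sentence is built means the spans always index into the final text, whatever the length of the inserted values.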
|
||||
|
||||
At a high level, the process is as follows:
|
||||
1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of templates: `My name is John` -> `My name is [PERSON]`
|
||||
2. (Optional) adapt the FakeDataGenerator to support new extensions which could generate fake PII entities
|
||||
3. Generate X samples using the templates list + a fake PII dataset + extensions that add additional PII entities
|
||||
4. Split the generated dataset into train/test/validation sets, making sure that samples from the same template appear in only one set
|
||||
5. Adapt datasets for the various models (spaCy, Flair, CRF, sklearn)
|
||||
6. Train models
|
||||
7. Evaluate using the evaluation notebooks and using the Presidio Evaluator framework
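
Step 4 above is a group-aware split: the split happens over templates, not over individual samples. The following is a rough sketch under stated assumptions: `split_by_template` is a hypothetical helper, the ratios are arbitrary, and each sample is assumed to carry the index of the template it came from (the generator stores this in the `Template#` metadata field).

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios=(0.7, 0.2), seed=0):
    """samples: iterable of (template_id, sample) pairs.

    Returns (train, test, validation) lists; a template never spans two sets.
    """
    by_template = defaultdict(list)
    for template_id, sample in samples:
        by_template[template_id].append(sample)

    # Shuffle the template ids, not the samples, so that groups stay intact
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    train_end = round(len(template_ids) * ratios[0])
    test_end = train_end + round(len(template_ids) * ratios[1])
    groups = (template_ids[:train_end],
              template_ids[train_end:test_end],
              template_ids[test_end:])
    return tuple([s for tid in ids for s in by_template[tid]] for ids in groups)


# Toy data: 50 samples generated from 10 templates
samples = [(i % 10, {"template": i % 10, "id": i}) for i in range(50)]
train, test, validation = split_by_template(samples)
print(len(train), len(test), len(validation))
```

Splitting this way keeps a model from scoring well on the test set only because it has already seen the same sentence structure during training.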
|
||||
|
||||
|
||||
|
||||
Notes:
|
||||
- For steps 5, 6, 7 see the main [README](../../README.md).
|
||||
- For a simple data generation pipeline, [see this notebook](../../notebooks/Generate data.ipynb).
|
||||
- For information on transforming a NER dataset into templates, see the notebooks in the [helper notebooks](helper%20notebooks) folder.
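
As a rough idea of what the NER-to-template transformation involves, the sketch below collapses BIO-tagged tokens into placeholders. It is a simplified illustration, not the code in the helper notebooks: `to_template` is a hypothetical function, and it assumes the tags already use the target entity names.

```python
def to_template(tokens, tags):
    """Collapse each BIO-tagged entity into a single [ENTITY] placeholder."""
    out = []
    previous = "O"
    for token, tag in zip(tokens, tags):
        if tag == "O":
            out.append(token)
        else:
            entity = tag[2:]
            # Start a new placeholder on B- tags or when the entity type changes
            if tag.startswith("B-") or previous == "O" or previous[2:] != entity:
                out.append(f"[{entity}]")
        previous = tag
    return " ".join(out)


tokens = ["My", "name", "is", "John", "Smith"]
tags = ["O", "O", "O", "B-PERSON", "I-PERSON"]
print(to_template(tokens, tags))
```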
|
||||
|
||||
Example run:
|
||||
|
||||
```python
|
||||
from presidio_evaluator.data_generator import generate

TEMPLATES_FILE = 'raw_data/templates.txt'
|
||||
OUTPUT = "generated_.txt"
|
||||
|
||||
# Should be downloaded from Fake Name Generator (https://www.fakenamegenerator.com/)
|
||||
fake_pii_csv = 'raw_data/FakeNameGenerator.csv'
|
||||
|
||||
examples = generate(fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=TEMPLATES_FILE,
|
||||
dictionary_path=None,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=0.1,
|
||||
num_of_examples=100,
|
||||
ignore_types={"IP_ADDRESS", 'US_SSN', 'URL'},
|
||||
keep_only_tagged=False,
|
||||
span_to_tag=True)
|
||||
```
|
||||
|
||||
|
||||
*Copyright notice:*
|
||||
|
||||
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
|
|
@ -0,0 +1,2 @@
|
|||
from .generator import FakeDataGenerator
|
||||
from .main import generate, read_synth_dataset
|
|
@ -0,0 +1,124 @@
|
|||
import random
|
||||
import pandas as pd
|
||||
from faker import Faker
|
||||
from haikunator import Haikunator
|
||||
|
||||
from presidio_evaluator.data_generator.nationality_generator import NationalityGenerator
|
||||
from presidio_evaluator.data_generator.org_name_generator import OrgNameGenerator
|
||||
|
||||
fake = Faker()
|
||||
haikunator = Haikunator()
|
||||
IP_V4_RATIO = 0.8
|
||||
|
||||
org_name_generator = OrgNameGenerator()
|
||||
nationality_generator = NationalityGenerator()
|
||||
|
||||
def generate_url(domain: pd.Series):
|
||||
def generate_url_postfix():
|
||||
length = random.randint(4, 8)
|
||||
delim = "/" if random.random() > 0.5 else ""
|
||||
postfix = haikunator.haikunate(delimiter=delim,
|
||||
token_chars='abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',
|
||||
token_length=length)
|
||||
return postfix
|
||||
|
||||
def generate_url_prefix():
|
||||
rand = random.random()
|
||||
|
||||
if rand < 0.3:
|
||||
return "http://"
|
||||
elif rand < 0.6:
|
||||
return "http://www."
|
||||
else:
|
||||
return ""
|
||||
|
||||
def concat_url(prefix, domain, postfix):
|
||||
return "{}{}/{}".format(prefix, domain, postfix)
|
||||
|
||||
return domain.apply(lambda x: concat_url(generate_url_prefix(), x.lower(), generate_url_postfix()))
|
||||
#
|
||||
# urls = []
|
||||
# for index, value in domain.items():
|
||||
# url = "{}{}/{}".format(generate_url_prefix(), value.lower(), generate_url_postfix())
|
||||
# urls.append(url)
|
||||
#
|
||||
# return urls
|
||||
|
||||
|
||||
def generate_SSNs(length):
|
||||
return [fake.ssn() for _ in range(length)]
|
||||
|
||||
|
||||
def generate_iban(country: pd.Series):
    def generate_one_iban(cntry):
        try:
            # _get_iban_spec and code_length are private schwifty helpers
            from schwifty.iban import _get_iban_spec, code_length, IBAN

            spec = _get_iban_spec(cntry)
            bank_code_length = code_length(spec, 'bank_code')
            branch_code_length = code_length(spec, 'branch_code')
            bank_and_branch_code_length = bank_code_length + branch_code_length
            account_code_length = code_length(spec, 'account_code')

            # Integer exponentiation: random.randint expects integer bounds, math.pow returns a float
            bank_code = random.randint(1, 10 ** bank_and_branch_code_length - 1)
            account_code = random.randint(1, 10 ** account_code_length - 1)
            iban = IBAN.generate(cntry, str(bank_code), str(account_code))
            return iban.formatted
        except ValueError:
            # Failed to generate an IBAN for this country, fall back to a fixed value
            return "IL270126100000000544211"

    return country.apply(generate_one_iban)
|
||||
|
||||
|
||||
def generate_company_names(length):
|
||||
return [org_name_generator.get_organization() for _ in range(length)]
|
||||
|
||||
|
||||
def generate_ip_addresses(length):
    def generate_one():
        # IP_V4_RATIO is the share of generated addresses that should be IPv4
        v = 4 if random.random() < IP_V4_RATIO else 6
        return fake.ipv4() if v == 4 else fake.ipv6()

    return [generate_one() for _ in range(length)]
|
||||
|
||||
|
||||
def generate_title(gender=None):
    MALE_TITLES = ['Mr.', 'Dr.', 'Professor.', 'Eng.', 'Prof.', 'Doctor.']
    FEMALE_TITLES = ['Mrs.', 'Ms.', 'Miss', 'Dr.', 'Professor.', 'Eng.', 'Prof.', 'Doctor']

    # Guard against gender=None (the default) or non-string values, which have no .lower()
    if isinstance(gender, str) and gender.lower() == 'male':
        return random.choices(MALE_TITLES, weights=[0.7, 0.1, 0.05, 0.05, 0.05, 0.05])[0]
    else:
        return random.choices(FEMALE_TITLES, weights=[0.3, 0.25, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05])[0]
|
||||
|
||||
|
||||
def generate_titles(gender: pd.Series):
|
||||
return gender.apply(generate_title)
|
||||
|
||||
|
||||
def generate_roles(length):
|
||||
roles = ['President', 'Vice-president', 'Chief of staff', 'Chief Architect', 'CEO', 'CFO', 'Engineer', 'Accountant',
|
||||
'Attorney', 'Scientist', 'Journalist', 'Operator', 'CIO', "Chief Information Officer", "General Manager",
|
||||
"Manager", "Chief Executive Officer", 'Actuary', 'Secretary', 'Prime minister', 'Minister', 'Director']
|
||||
return [random.choice(roles) for _ in range(length)]
|
||||
|
||||
|
||||
def generate_nationality(length):
|
||||
return [nationality_generator.get_nationality() for _ in range(length)]
|
||||
|
||||
|
||||
def generate_country(length):
|
||||
return [nationality_generator.get_country() for _ in range(length)]
|
||||
|
||||
|
||||
def generate_nation_woman(length):
|
||||
return [nationality_generator.get_nation_woman() for _ in range(length)]
|
||||
|
||||
|
||||
def generate_nation_man(length):
|
||||
return [nationality_generator.get_nation_man() for _ in range(length)]
|
||||
|
||||
def generate_nation_plural(length):
|
||||
return [nationality_generator.get_nation_plural() for _ in range(length)]
|
|
@ -0,0 +1,343 @@
|
|||
import random
|
||||
from typing import List
|
||||
|
||||
import re
|
||||
from collections import Counter
|
||||
|
||||
import pandas as pd
|
||||
from spacy.tokens import Token
|
||||
from tqdm import tqdm
|
||||
|
||||
from presidio_evaluator import Span, InputSample
|
||||
from presidio_evaluator.data_generator.extensions import generate_iban, generate_ip_addresses, generate_SSNs, \
|
||||
generate_company_names, generate_url, generate_roles, generate_titles, generate_nationality, generate_nation_man, \
|
||||
generate_nation_woman, generate_nation_plural, generate_title, generate_country
|
||||
|
||||
|
||||
class FakeDataGenerator:
|
||||
|
||||
def __init__(self, fake_pii_df: pd.DataFrame, templates: List[str],
|
||||
lower_case_ratio: float = 0.5, include_metadata=True,
|
||||
dictionary_path: str = None,
|
||||
ignore_types=None, span_to_tag=True, labeling_scheme="BILOU"):
|
||||
"""
|
||||
Fake data generator.
|
||||
Attaches fake PII entities into predefined templates of structure: a b c [PII] d e f,
|
||||
e.g. "My name is [FIRST_NAME]"
|
||||
:param fake_pii_df:
|
||||
A pd.DataFrame with a predefined set of PII entities as columns created using https://www.fakenamegenerator.com/
|
||||
:param templates: A list of templates
|
||||
with place holders for PII entities.
|
||||
For example: "My name is [FIRST_NAME] and I live in [ADDRESS]"
|
||||
Note that in case you have multiple entities of the same type
|
||||
in a template, you should put a number on the second. For example:
|
||||
"I'm changing my name from [FIRST_NAME] to [FIRST_NAME2]."
|
||||
More than two are currently not supported but extending this
|
||||
is straightforward.
|
||||
:param lower_case_ratio: Percentage of names that should start
|
||||
with lower case
|
||||
:param include_metadata: Whether to include additional
|
||||
information in the output
|
||||
(e.g. NameSet from which the name was taken, gender, country etc.)
|
||||
:param dictionary_path: A path to a csv containing a vocabulary of
|
||||
a language, to check if a token exists in the vocabulary or not.
|
||||
:param ignore_types: set of types to ignore
|
||||
:param span_to_tag: whether to tokenize the generated samples or not
|
||||
:param labeling_scheme: labeling scheme (BILOU, BIO, IO)
|
||||
"""
|
||||
if ignore_types is None:
|
||||
ignore_types = set()
|
||||
self.lower_case_ratio = lower_case_ratio
|
||||
self.include_metadata = include_metadata
|
||||
self.ignore_types = ignore_types
|
||||
|
||||
if dictionary_path:
|
||||
vocab_df = pd.read_csv(dictionary_path, sep=',')
|
||||
self.vocabulary_words = set(vocab_df['WORD'].values.tolist())
|
||||
else:
|
||||
print("Warning: Dictionary path not provided. "
|
||||
"Feature `is_in_vocabulary` will be set to False for all samples")
|
||||
self.vocabulary_words = []
|
||||
Token.set_extension("is_in_vocabulary",
|
||||
getter=self.get_is_in_vocabulary,
|
||||
force=True)
|
||||
|
||||
if templates:
|
||||
self.templates = self.prep_templates(templates)
|
||||
else:
|
||||
print("Warning: templates not provided")
|
||||
self.templates = None
|
||||
self.original_pii_df = fake_pii_df
|
||||
self.fake_pii = None
|
||||
self.span_to_tag = span_to_tag
|
||||
self.labeling_scheme = labeling_scheme
|
||||
|
||||
def get_is_in_vocabulary(self, token):
|
||||
return token.text.lower() in self.vocabulary_words
|
||||
|
||||
def prep_fake_pii(self, df):
|
||||
print("Preparing fake PII data for ingestion")
|
||||
# define new column names
|
||||
column_names = {"Surname": "LAST_NAME", "GivenName": "FIRST_NAME",
|
||||
"Title": "TITLE", "Gender": "GENDER",
|
||||
"City": "CITY", "ZipCode": "ZIP",
|
||||
"CountryFull": "COUNTRY",
|
||||
"Occupation": "OCCUPTAION",
|
||||
"TelephoneNumber": "PHONE_NUMBER",
|
||||
"CCNumber": "CREDIT_CARD", "Birthday": "BIRTHDAY",
|
||||
"EmailAddress": "EMAIL",
|
||||
"StreetAddress": "FULL_ADDRESS",
|
||||
"Domain": "DOMAIN_NAME"}
|
||||
|
||||
# Remove brackets as they interfere with the process
|
||||
|
||||
        def remove_brackets(series):
            if series.dtype == object or series.dtype == str:
                # regex=False: treat "[" and "]" as literal characters, not regex syntax
                series = series.str.replace("[", "(", regex=False)
                series = series.str.replace("]", ")", regex=False)
            return series
|
||||
|
||||
df = df.apply(remove_brackets, axis=0)
|
||||
|
||||
# change column names
|
||||
column_names = {key: value for (key, value) in column_names.items() if value not in self.ignore_types}
|
||||
df.rename(columns=column_names, inplace=True)
|
||||
|
||||
# define PERSON as FIRST_NAME + LAST_NAME
|
||||
df["PERSON"] = df["FIRST_NAME"] + " " + df["LAST_NAME"]
|
||||
|
||||
df['COUNTRY'] = generate_country(len(df)) # replace previous country which has limited options
|
||||
# Copied entities
|
||||
|
||||
df["DATE"] = df["BIRTHDAY"]
|
||||
df['LOCATION'] = df[random.choice(["CITY", "COUNTRY"])].str.title()
|
||||
df['LOCATION'] = self.reshuffle_entity(df['LOCATION']) # Reshuffle to not have the same location and country
|
||||
|
||||
if 'ADDRESS' not in self.ignore_types:
|
||||
self.address_parts(df)
|
||||
|
||||
# title and role
|
||||
if 'ROLE' not in self.ignore_types:
|
||||
print("Generating roles")
|
||||
df['ROLE'] = generate_roles(length=len(df))
|
||||
if 'TITLE' not in self.ignore_types:
|
||||
print("Generating titles")
|
||||
df['TITLE'] = generate_titles(df['GENDER'])
|
||||
df['FEMALE_TITLE'] = [generate_title('female') for _ in range(len(df))]
|
||||
df['MALE_TITLE'] = [generate_title('male') for _ in range(len(df))]
|
||||
|
||||
if 'NATIONALITY' not in self.ignore_types:
|
||||
print("Generating nationalities")
|
||||
df['NATIONALITY'] = generate_nationality(len(df))
|
||||
df['NATION_MAN'] = generate_nation_man(len(df))
|
||||
df['NATION_WOMAN'] = generate_nation_woman(len(df))
|
||||
df['NATION_PLURAL'] = generate_nation_plural(len(df))
|
||||
|
||||
if 'IBAN' not in self.ignore_types:
|
||||
print("Generating IBANs")
|
||||
df['IBAN'] = generate_iban(df['COUNTRY']) # "IL270126100000000544211"
|
||||
|
||||
if 'IP_ADDRESS' not in self.ignore_types:
|
||||
print("Generating IP addresses")
|
||||
df['IP_ADDRESS'] = generate_ip_addresses(len(df))
|
||||
|
||||
if 'US_SSN' not in self.ignore_types:
|
||||
print("Generating SSN numbers")
|
||||
df['US_SSN'] = generate_SSNs(len(df))
|
||||
|
||||
if 'URL' not in self.ignore_types:
|
||||
print("Generating URLs")
|
||||
df['URL'] = generate_url(df['DOMAIN_NAME'])
|
||||
|
||||
if 'ORGANIZATION' not in self.ignore_types:
|
||||
print("Generating company names")
|
||||
df['ORG'] = generate_company_names(len(df))
|
||||
df['ORGANIZATION'] = df[random.choice(["Company", "ORG"])].str.title()
|
||||
|
||||
print("Finished preparing fake PII data")
|
||||
|
||||
return df
|
||||
|
||||
def address_parts(self, df):
|
||||
# extract street no, street and full address
|
||||
print("Generating address parts")
|
||||
if 'STREET_NO' not in self.ignore_types:
|
||||
df["STREET_NO"] = df["FULL_ADDRESS"].map(
|
||||
lambda r: re.search(r"([\d]+)", r).group(1))
|
||||
if 'STREET' not in self.ignore_types:
|
||||
df["STREET"] = df["FULL_ADDRESS"].map(
|
||||
lambda r: re.search(r"[\d]+(.*)", r).group(1))
|
||||
if 'ADDRESS' not in self.ignore_types:
|
||||
df["ADDRESS"] = df.apply(
|
||||
lambda r: "{0}, {2} {1}".format(r["FULL_ADDRESS"],
|
||||
r["ZIP"].replace(" ", ""),
|
||||
r["CITY"]), axis=1)
|
||||
|
||||
@staticmethod
|
||||
def get_additional_entity(df, entity):
|
||||
return df.sample(1).iloc[0][entity]
|
||||
|
||||
@staticmethod
|
||||
def reshuffle_entity(series):
|
||||
shuffled = series.sample(frac=1)
|
||||
shuffled.reset_index(inplace=True, drop=True)
|
||||
return shuffled
|
||||
|
||||
@staticmethod
|
||||
def prep_templates(raw_templates):
|
||||
print("Preparing sample sentences for ingestion")
|
||||
# Todo: introduce typos
|
||||
templates = [l.strip().replace("[", "{").replace("]", "}") for l in
|
||||
raw_templates]
|
||||
return templates
|
||||
|
||||
@staticmethod
|
||||
def get_template_entities(template):
|
||||
templates = []
|
||||
entities_count = Counter()
|
||||
for m in re.finditer(r"\{([A-Z_0-9]+)\}", template):
|
||||
ent = m.groups()[0]
|
||||
start, end = m.span()
|
||||
entities_count[ent] += 1
|
||||
if entities_count.get(ent) == 1:
|
||||
templates.append(ent)
|
||||
else:
|
||||
# Add an index to all additional entities of this type (LOCATION2, LOCATION3 etc.)
|
||||
templates.append(ent + str(entities_count[ent]))
|
||||
|
||||
for entity, count in entities_count.items():
|
||||
while count > 1:
|
||||
template = template.replace("{" + entity + "}", "{" + entity + str(count) + "}", 1)
|
||||
count -= 1
|
||||
|
||||
return template, templates, entities_count
|
||||
|
||||
def sample_examples(self, count):
|
||||
|
||||
if self.fake_pii is None:
|
||||
self.fake_pii = self.prep_fake_pii(self.original_pii_df)
|
||||
|
||||
for _ in tqdm(range(count)):
|
||||
template_sentence_index = random.choice(range(len(self.templates)))
|
||||
original_sentence = self.templates[template_sentence_index]
|
||||
fake_pii_sample = self.fake_pii.sample(1).iloc[0]
|
||||
|
||||
# Find entities to be replaced + add running index for multiple entities of the same type
|
||||
original_sentence, replacements, entity_counts = self.get_template_entities(original_sentence)
|
||||
|
||||
# Get additional fake entries in case of multiple entities of the same type
|
||||
fake_pii_sample_duplicated = self.add_duplicated_entities(fake_pii_sample, entity_counts)
|
||||
|
||||
# Fill in fake entities for each template slot
|
||||
values = {}
|
||||
for h in replacements:
|
||||
if h in fake_pii_sample_duplicated:
|
||||
values[h] = str(fake_pii_sample_duplicated[h])
|
||||
else:
|
||||
print("Warning: entity {} is in the templates but not in the PII dataset. Ignoring.".format(h))
|
||||
values[h] = ''
|
||||
|
||||
# Create a new InputSample combining template with fake PII data
|
||||
input_sample = self.create_input_sample(original_sentence, values)
|
||||
|
||||
if self.include_metadata:
|
||||
metadata = {"Gender": fake_pii_sample['GENDER'],
|
||||
"NameSet": fake_pii_sample['NameSet'],
|
||||
"Country": fake_pii_sample['COUNTRY'],
|
||||
"Lowercase": input_sample.full_text.islower(),
|
||||
"Template#": template_sentence_index
|
||||
}
|
||||
input_sample.metadata = metadata
|
||||
|
||||
self.consolidate_names(input_sample)
|
||||
|
||||
# Creating tokens only after entities consolidation
|
||||
if self.span_to_tag:
|
||||
tokens, tags = input_sample.get_tags(scheme=self.labeling_scheme)
|
||||
input_sample.tokens = tokens
|
||||
input_sample.tags = tags
|
||||
|
||||
yield input_sample
|
||||
|
||||
@staticmethod
|
||||
def consolidate_names(input_sample):
|
||||
locations = ("LOCATION", "CITY", "STATE", "COUNTRY", "ADDRESS", "STREET")
|
||||
names = ("FIRST_NAME", "LAST_NAME", "PERSON")
|
||||
|
||||
for span in input_sample.spans:
|
||||
if span.entity_type in names:
|
||||
span.entity_type = 'PERSON'
|
||||
elif span.entity_type in locations:
|
||||
span.entity_type = "LOCATION"
|
||||
|
||||
masked = input_sample.masked
|
||||
for location in locations:
|
||||
masked = masked.replace("[" + location + "]", "[LOCATION]")
|
||||
for name in names:
|
||||
masked = masked.replace("[" + name + "]", "[PERSON]")
|
||||
|
||||
input_sample.masked = masked
|
||||
|
||||
def create_input_sample(self, original_sentence, values):
|
||||
"""
|
||||
Creates an InputSample out of a template sentence
|
||||
and a dict of entity names and values
|
||||
:param original_sentence: template (e.g. My name is {FIRST_NAME})
|
||||
:param values: Key = entity name, value = entity value
|
||||
(e.g. {"TITLE":"Mr."})
|
||||
:return: an InputSample
|
||||
"""
|
||||
sentence = original_sentence
|
||||
spans = []
|
||||
|
||||
to_lower = random.random() < self.lower_case_ratio
|
||||
|
||||
i = 0
|
||||
# replaces placeholders with values and retrieve indices
|
||||
while i < len(sentence):
|
||||
entity_start = re.search("{", sentence, flags=0)
|
||||
if entity_start:
|
||||
entity_start = entity_start.start()
|
||||
else:
|
||||
break
|
||||
entity_end = re.search("}", sentence[entity_start:],
|
||||
flags=0).start() + entity_start
|
||||
entity = sentence[entity_start + 1:entity_end]
|
||||
entity_value = values[entity]
|
||||
entity_value = entity_value.strip()
|
||||
# Remove duplicate entity indices:
|
||||
entity = ''.join(ch for ch in entity if not ch.isdigit())
|
||||
|
||||
entity_value_len = len(entity_value)
|
||||
sentence = sentence[:entity_start] + entity_value + sentence[
|
||||
entity_end + 1:]
|
||||
# replace a with an if
|
||||
if ((sentence[entity_start - 2: entity_start].lower() == "a " and entity_start == 2)
|
||||
or (sentence[entity_start - 3: entity_start].lower() == " a ")) \
|
||||
and entity_value[:1].lower() in ['a', 'e', 'i', 'o', 'u']:
|
||||
sentence = sentence[:entity_start - 1] + "n " + sentence[entity_start:]
|
||||
entity_start = entity_start + 1
|
||||
|
||||
if to_lower:
|
||||
entity_value = entity_value.lower()
|
||||
|
||||
spans.append(Span(entity_type=entity,
|
||||
entity_value=entity_value,
|
||||
start_position=entity_start,
|
||||
end_position=entity_start + entity_value_len))
|
||||
i = entity_start + entity_value_len
|
||||
|
||||
if to_lower:
|
||||
sentence = sentence.lower()
|
||||
|
||||
# Not creating tokens here since we're consolidating names afterwards
|
||||
return InputSample(sentence, original_sentence, spans,
|
||||
create_tags_from_span=False)
|
||||
|
||||
def add_duplicated_entities(self, fake_pii_sample, entity_counts):
|
||||
for entity, ent_count in entity_counts.items():
|
||||
while ent_count > 1:
|
||||
fake_pii_sample[entity + str(ent_count)] = self.get_additional_entity(self.fake_pii, entity)
|
||||
ent_count -= 1
|
||||
|
||||
return fake_pii_sample
|
|
@ -0,0 +1,661 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook loads the CoNLL-2003 dataset using DeepPavlov and creates templates (utterances with placeholders) that a synthetic PII data generator can use to create new sentences.\n",
|
||||
"\n",
|
||||
"The notebook additionally introduces two new entities, TITLE and ROLE. Without them, a sentence such as \"UK Prime Minister Boris Johnson called his wife\" could turn into \"UK David Scott called his wife\", because \"Prime Minister\" is tagged as PER in the original dataset. The same logic applies to titles such as Mr., Mrs. and Ms."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"pd.options.display.max_rows = 4000\n",
|
||||
"pd.set_option('display.max_colwidth', None)\n",
|
||||
"from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"reader = Conll2003DatasetReader()\n",
|
||||
"dataset = reader.read(data_path =\"../../data\",dataset_name='conll2003')\n",
|
||||
"#Note: make sure you haven't downloaded something else with this function before, \n",
|
||||
"# as it will not download a new dataset (even if your previous download was for a different dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### To pandas + add sentence_idx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"new_dataset = [list(zip(a,b)) for a,b in dataset['train']]\n",
|
||||
"df_list = []\n",
|
||||
"sentence_id = 0\n",
|
||||
"for sentence in new_dataset:\n",
|
||||
" \n",
|
||||
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
|
||||
" df[\"sentence_idx\"] = sentence_id\n",
|
||||
" sentence_id+=1\n",
|
||||
" df_list.append(df)\n",
|
||||
"ner_dataset = pd.concat(df_list)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==12]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentences = ner_dataset.groupby('sentence_idx')['word'].apply(lambda x: \" \".join(x))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(sentences[:5])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Example sentence:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==3]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Unique entities\n",
|
||||
"ner_dataset['tag'].unique()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Undo tokenization replacements (e.g. turn -LRB- back into an opening parenthesis)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['word'] = ner_dataset['word']\\\n",
|
||||
".replace('-LRB-','(')\\\n",
|
||||
".replace('-RRB-',')')\\\n",
|
||||
".replace('-LCB-','(')\\\n",
|
||||
".replace('-RCB-',')')\\\n",
|
||||
".replace('``','\"')\\\n",
|
||||
".replace(\"''\",'\"')\\\n",
|
||||
".replace('/.','.')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# helper columns:\n",
|
||||
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
|
||||
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
|
||||
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
|
||||
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
|
||||
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
|
||||
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Remove unneeded (non-PII) entities:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TAGS_TO_IGNORE = ['CARDINAL','FAC','LAW','LANGUAGE','MISC','TIME','DATE','ORDINAL','EVENT','QUANTITY','WORK_OF_ART','MONEY','PRODUCT','PERCENT']\n",
|
||||
"def remote_unwanted_tags(x):\n",
|
||||
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
|
||||
" return 'O'\n",
|
||||
" else:\n",
|
||||
" return x\n",
|
||||
"\n",
|
||||
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
|
||||
"ner_dataset[ner_dataset['sentence_idx']==3]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Remove PERSON tags if preceding word is 'the' (e.g. the Bush administration)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
|
||||
"\n",
|
||||
"def remove_tag_if_the_person(row):\n",
|
||||
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
|
||||
" return 'O'\n",
|
||||
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"\n",
|
||||
"ner_dataset['prev-word']=ner_dataset['prev-word'].astype('str')\n",
|
||||
"ner_dataset['prev-prev-word']=ner_dataset['prev-prev-word'].astype('str')\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Remove tag from 's (Joe Wilson's cat)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def remove_tag_if_apostraphe_after_tag(row):\n",
|
||||
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_apostraphe_after_tag,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Re-tag words from dictionaries (countries, nationalities, roles, titles)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Nationalities and countries:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
|
||||
"nationalities.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\"algeria\" in nationalities['country'].values"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"ner_dataset['metadata'] = None\n",
|
||||
"\n",
|
||||
"def get_nationality_as_metadata(row):\n",
|
||||
" if row['word'].lower() in nationalities['country'].values:\n",
|
||||
" return 'COUNTRY'\n",
|
||||
" elif row['word'].lower() in nationalities['nationality'].values:\n",
|
||||
" return 'NATIONALITY'\n",
|
||||
" elif row['word'].lower() in nationalities['man'].values:\n",
|
||||
" return 'NATION_MAN'\n",
|
||||
" elif row['word'].lower() in nationalities['woman'].values:\n",
|
||||
" return 'NATION_WOMAN'\n",
|
||||
" elif row['word'].lower() in nationalities['plural'].values:\n",
|
||||
" return 'NATION_PLURAL'\n",
|
||||
" return row['metadata']\n",
|
||||
"\n",
|
||||
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
|
||||
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
|
||||
"\n",
|
||||
"def update_tag_based_on_metadata(row):\n",
|
||||
" if row['metadata'] is not None:\n",
|
||||
" return \"B-\"+row['metadata']\n",
|
||||
" else:\n",
|
||||
" return row['tag']\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Titles"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"MALE_TITLES = ['mr', 'dr', 'professor', 'eng','prof','doctor']\n",
|
||||
"FEMALE_TITLES = ['mrs', 'ms', 'miss', 'dr', 'professor', 'eng', 'prof','doctor']\n",
|
||||
"\n",
|
||||
"def get_title_as_metadata(row):\n",
|
||||
" if row['word'].lower() in MALE_TITLES:\n",
|
||||
" return 'MALE_TITLE'\n",
|
||||
" elif row['word'].lower() in FEMALE_TITLES:\n",
|
||||
" return 'FEMALE_TITLE'\n",
|
||||
" return row['metadata']\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def update_title_tag_if_missing(row):\n",
|
||||
" if row['word'].lower() in MALE_TITLES and row['tag']=='O':\n",
|
||||
" return 'B-MALE_TITLE'\n",
|
||||
" elif row['word'].lower() in FEMALE_TITLES and row['tag']=='O':\n",
|
||||
" return 'B-FEMALE_TITLE'\n",
|
||||
" else:\n",
|
||||
" return row['tag']\n",
|
||||
"\n",
|
||||
"ner_dataset['metadata'] = ner_dataset.apply(get_title_as_metadata,axis=1)\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(update_title_tag_if_missing,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==18]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Remove NORP tags preceded by 'the' if the NORP is not in the nationalities list."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def remove_tag_if_the_norp(row):\n",
|
||||
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
|
||||
" return 'O'\n",
|
||||
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Remove sentences with adjacent different entities (e.g. calling from New York Larry King)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
|
||||
"ner_dataset['next-entity']=ner_dataset['next-tag'].str[2:]\n",
|
||||
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
|
||||
"sentences_to_remove = ner_dataset[adjacent_idc]['sentence_idx'].values\n",
|
||||
"sentences_to_remove\n",
|
||||
"\n",
|
||||
"ner_dataset=ner_dataset[~ner_dataset['sentence_idx'].isin(sentences_to_remove)]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Update tag for discovered metadata values (e.g. nationalities)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['tag'] = ner_dataset.apply(update_tag_based_on_metadata, axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create templates based on NER dataset\n",
|
||||
"Here we create the actual templates and handle several edge cases that would otherwise produce odd template sentences. Note that a manual review of the templates dataset is still required after this step."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"class SentenceGetter(object):\n",
|
||||
" \n",
|
||||
" def __init__(self, dataset):\n",
|
||||
" self.n_sent = 1\n",
|
||||
" self.dataset = dataset\n",
|
||||
" self.empty = False\n",
|
||||
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
|
||||
" s[\"tag\"].values.tolist())]\n",
|
||||
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
|
||||
" self.sentences = [s for s in self.grouped]\n",
|
||||
" \n",
|
||||
" def get_next(self):\n",
|
||||
" try:\n",
|
||||
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
|
||||
" self.n_sent += 1\n",
|
||||
" return s\n",
|
||||
" except:\n",
|
||||
" return None\n",
|
||||
" \n",
|
||||
" @staticmethod \n",
|
||||
" def cleanse_template(template, ents):\n",
|
||||
" # Remove whitespace before certain punctuation marks\n",
|
||||
" template = re.sub(r'\\s([?,:.!](?:|$))+', r'\\1', template)\n",
|
||||
" \n",
|
||||
" # Remove whitespaces within double quotes\n",
|
||||
" template = re.sub('\\\"\\s*([^\\\"]*?)\\s*\\\"', r'\"\\1\"', template) \n",
|
||||
" \n",
|
||||
" # Remove whitespaces within quotes\n",
|
||||
" template = re.sub(\"\\'\\s*([^\\']*?)\\s*\\'\", r\"'\\1'\", template) \n",
|
||||
" \n",
|
||||
" # Remove whitespaces within parentheses\n",
|
||||
" template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
|
||||
" \n",
|
||||
" for ent in ents:\n",
|
||||
" #Turn PERSON PERSON into PERSON\n",
|
||||
" duplicates = \"[{}] [{}]\".format(ent,ent)\n",
|
||||
" template = template.replace(duplicates,\"[{}]\".format(ent))\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" # Replace additional weird templates:\n",
|
||||
" to_replace = {\n",
|
||||
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
|
||||
" \"[LOCATION] said\" : \"[PERSON] said\",\n",
|
||||
" \"[ORGANIZATION] of [ORGANIZATION]\" : \"[ORGANIZATION]\",\n",
|
||||
" \"the [COUNTRY]\" : \"[COUNTRY]\",\n",
|
||||
" \" 's \":\"'s\",\n",
|
||||
" \"] 's \":\"]'s \",\n",
|
||||
" \"] 's,\":\"]'s,\",\n",
|
||||
" \"] 's.\":\"]'s.\",\n",
|
||||
" \" n't\" : \"n't\",\n",
|
||||
" \"/?\":\"?\",\n",
|
||||
" \"%u\":\"u\",\n",
|
||||
" \"%m\":\"m\",\n",
|
||||
" \"%e\":\"e\", \n",
|
||||
" \"%h\":\"h\", \n",
|
||||
" \"%a\":\"a\",\n",
|
||||
" \" %\":\"%\",\n",
|
||||
" \" ?\":\"?\",\n",
|
||||
" \" /?\":\"?\",\n",
|
||||
" \" ' .\":\"'.\",\n",
|
||||
" \"[ \":\"(\",\n",
|
||||
" \" ]\":\")\",\n",
|
||||
" \"[PERSON] -- [PERSON]\":\"[PERSON]\",\n",
|
||||
" \"[COUNTRY] -- [ORGANIZATION]\":\"[ORGANIZATION]\",\n",
|
||||
" \"Jews\" : \"[NATIONALITY]\",\n",
|
||||
" \"Chinese\" : \"[NATIONALITY]\",\n",
|
||||
" \"Dutch\" : \"[NATIONALITY]\",\n",
|
||||
" \"[LOCATION], [LOCATION]\":\"[LOCATION]\",\n",
|
||||
" \"[LOCATION] [ORGANIZATION]\":\"[ORGANIZATION]\"\n",
|
||||
" }\n",
|
||||
" \n",
|
||||
" for weird in to_replace.keys():\n",
|
||||
" #if weird in template:\n",
|
||||
" # print(\"Weird sentence\",template)\n",
|
||||
" template = template.replace(weird,to_replace[weird])\n",
|
||||
" \n",
|
||||
" template = template.replace(\" -- \",\" - \")\n",
|
||||
" \n",
|
||||
" #Ignore templates that are incomplete\n",
|
||||
" if \"/-\" in template:\n",
|
||||
" template = \"\"\n",
|
||||
" \n",
|
||||
" #Ignore templates that have numbers after the end or start of the entity\n",
|
||||
" if len(re.findall(r\"\\]\\s[0-9]\",template)) > 0:\n",
|
||||
" template = \"\"\n",
|
||||
" \n",
|
||||
" if len(re.findall(r\"[0-9]\\s\\[\",template)) > 0:\n",
|
||||
" template = \"\"\n",
|
||||
" \n",
|
||||
" if len(re.findall(r\"[0-9].\\s\\[\",template)) > 0:\n",
|
||||
" template = \"\"\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" if \"[PERSON] ([COUNTRY])\" in template:\n",
|
||||
" template = \"\"\n",
|
||||
" if \"[PERSON] ([LOCATION])\" in template:\n",
|
||||
" template = \"\"\n",
|
||||
" \n",
|
||||
" if template.count('\"') == 1:\n",
|
||||
" template = template.replace('\"','')\n",
|
||||
"\n",
|
||||
" return template\n",
|
||||
" \n",
|
||||
" @staticmethod \n",
|
||||
" def get_template(grouped,entity_name_replace_dict):\n",
|
||||
" template = \"\"\n",
|
||||
" i=0\n",
|
||||
" cur_index = 0\n",
|
||||
" ents = []\n",
|
||||
" for token in grouped:\n",
|
||||
" # remove brackets as they interfere with the data generation process\n",
|
||||
" token_text = token[0].replace(\"[\", \"(\").replace(\"]\",\")\")\n",
|
||||
" token_text = token_text.replace(\"{\", \"(\").replace(\"}\",\")\")\n",
|
||||
" token_tag = token[1]\n",
|
||||
" token_entity = token_tag[2:] if len(token_tag)>1 else token_tag\n",
|
||||
" \n",
|
||||
" if token_entity == 'O':\n",
|
||||
" template += \" \" + token_text\n",
|
||||
" elif 'B-' in token_tag and token_entity not in TAGS_TO_IGNORE:\n",
|
||||
" #print(\"found entity: {}\".format(token_entity))\n",
|
||||
" ent = entity_name_replace_dict[token_entity]\n",
|
||||
" ents.append(ent)\n",
|
||||
" \n",
|
||||
" template += \" [\" + ent + \"]\"\n",
|
||||
" #print(\"template: \",template)\n",
|
||||
" \n",
|
||||
" template = SentenceGetter.cleanse_template(template, ents)\n",
|
||||
" \n",
|
||||
" return template.strip()\n",
|
||||
" \n",
|
||||
"getter = SentenceGetter(ner_dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ENTITIES_DICTIONARY = {\"PERSON\":\"PERSON\",\n",
|
||||
" \"PER\":\"PERSON\",\n",
|
||||
" \"GPE\":\"COUNTRY\",\n",
|
||||
" \"NORP\":\"LOCATION\",\n",
|
||||
" \"LOC\":\"LOCATION\",\n",
|
||||
" \"ORG\":\"ORGANIZATION\",\n",
|
||||
" \"MALE_TITLE\":\"MALE_TITLE\",\n",
|
||||
" \"FEMALE_TITLE\":\"FEMALE_TITLE\",\n",
|
||||
" \"COUNTRY\":\"COUNTRY\",\n",
|
||||
" \"NATIONALITY\":\"NATIONALITY\",\n",
|
||||
" \"NATION_WOMAN\":\"NATION_WOMAN\",\n",
|
||||
" \"NATION_MAN\":\"NATION_MAN\",\n",
|
||||
" \"NATION_PLURAL\":\"NATION_PLURAL\"}\n",
|
||||
"\n",
|
||||
"sentences = getter.sentences\n",
|
||||
"\n",
|
||||
"sent_id = 445\n",
|
||||
"\n",
|
||||
"print(\"original:\",sentences[sent_id])\n",
|
||||
"print(\"template:\", getter.get_template(sentences[sent_id],entity_name_replace_dict=ENTITIES_DICTIONARY))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"all_templates = [getter.get_template(sentence,entity_name_replace_dict=ENTITIES_DICTIONARY) for sentence in sentences]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"original length of templates: {}\".format(len(all_templates)))\n",
|
||||
"all_templates = list(set(all_templates))\n",
|
||||
"print(\"length after duplicates removal: {}\".format(len(all_templates)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Save templates to file:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open(\"../raw_data/conll_based_templates.txt\",\"w+\",encoding='utf-8') as f:\n",
|
||||
" for template in all_templates:\n",
|
||||
" f.write(\"%s\\n\" % template) "
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": []
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,396 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Generate new examples based on this dataset: \n",
|
||||
"https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus\n",
|
||||
"\n",
|
||||
"This notebook takes the NER dataset from the link above and creates templates (utterances with placeholders) that a PII synthetic data generator can use to create new sentences.\n",
|
||||
"Note that due to the nature of the tagging, there might be weird output sentences. For example:\n",
|
||||
"\n",
|
||||
"- The same entity appears multiple times in a sentence: \"I travel from Argentina to Argentina\"\n",
|
||||
"- Bad grammar, because nouns are not inflected to fit their context: \"*The statement said no Denmark or India-led troops were killed*\" instead of \"*The statement said no Danish or Indian-led troops were killed*\"\n",
|
||||
"- Unrealistic sentences due to change in entities: \"Prime minister Lebron James enters the government building in Kuala Lumpur\"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The notebook additionally introduces two new entities, TITLE and ROLE, to avoid cases like \"UK David Scott called his wife\", generated from the original sentence \"UK Prime Minister Boris Johnson called his wife\" because \"Prime Minister\" was tagged as PER in the original dataset. The same logic applies to titles such as Mr., Mrs. and Ms."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# First, download ner.csv from https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus\n",
|
||||
"ner_dataset = pd.read_csv(\"ner.csv\",encoding = \"ISO-8859-1\", error_bad_lines=False)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset.columns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"len(ner_dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset = ner_dataset.drop_duplicates()\n",
|
||||
"len(ner_dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Example sentence:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==13][['sentence_idx','word','tag','prev-word','prev-prev-word','next-word']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### New entities - Title and Role\n",
|
||||
"\n",
|
||||
"- **Title**: Mr., Mrs., Professor, Doctor, ...\n",
|
||||
"- **Role**: President, Secretary General, U.N. Secretary, ..."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Quick exploratory analysis of frequencies:\n",
|
||||
"- First PER token\n",
|
||||
"- Second PER token\n",
|
||||
"- First and second PER token\n",
|
||||
"- One before and first tokens of PER"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Collect B-per tokens together with the words before and after them\n",
|
||||
"bper = ner_dataset[ner_dataset['tag']=='B-per']\n",
|
||||
"bper_tokens = bper['word']\n",
|
||||
"prev_bper_token = bper['prev-word']\n",
|
||||
"next_bper_token = bper['next-word']\n",
|
||||
"two_prev_tokens = zip(prev_bper_token, bper_tokens)\n",
|
||||
"two_next_tokens = zip(bper_tokens, next_bper_token)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"print(\"20 most common PER token frequencies:\")\n",
|
||||
"Counter(bper_tokens).most_common(20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"20 most common previous and first PER token frequencies:\")\n",
|
||||
"Counter(two_prev_tokens).most_common(20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"20 most common first and second PER token frequencies:\")\n",
|
||||
"Counter(two_next_tokens).most_common(20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Lists of titles and roles to update as ttl, rol\n",
|
||||
"TITLES = ['Mr.','Ms.','Mrs.']\n",
|
||||
"ROLES = ['President','General','Senator','Secretary-General','Minister']\n",
|
||||
"BIGRAMS_ROLES = [('Prime','Minister'),('prime','minister'),('U.S.','President'),\n",
|
||||
" ('Venezuelan', 'President'),('Vice','President'), ('Foreign', 'Minister'),\n",
|
||||
" ('U.S.','Secretary'),('U.N.','Secretary'),('Defence','Secretary')]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Update title and per for most common cases\n",
|
||||
"\n",
|
||||
"def fix_bigram_title(df, row,index,first='Prime',second='Minister',tag='ttl'):\n",
|
||||
" if row['word'] == first and row['next-word'] == second and 'per' in row['tag']:\n",
|
||||
" df.loc[index,'tag'] = 'B-{}'.format(tag)\n",
|
||||
" elif row['word'] == second and row['prev-word'] == first and 'per' in row['tag']:\n",
|
||||
" df.loc[index,'tag'] = 'I-{}'.format(tag)\n",
|
||||
" elif row['tag']== 'I-per' and row['prev-word'] == second and 'per' in row['tag']:\n",
|
||||
" df.loc[index,'tag'] = 'B-per'\n",
|
||||
"\n",
|
||||
"def fix_unigram_title(df, prev_row,prev_index, row , index, title='President',tag='ttl'):\n",
|
||||
" #print(row)\n",
|
||||
" if prev_row['word'] == title and prev_row['tag'] == 'B-per' and row['tag']=='I-per':\n",
|
||||
" df.loc[prev_index,'tag']='B-{}'.format(tag)\n",
|
||||
" df.loc[index,'tag'] = 'B-per'\n",
|
||||
"\n",
|
||||
"prev_row = None\n",
|
||||
"prev_index = None\n",
|
||||
"for index, row in ner_dataset.iterrows():\n",
|
||||
" # Handle two-word roles such as 'Prime Minister'\n",
|
||||
" for bigram in BIGRAMS_ROLES:\n",
|
||||
" fix_bigram_title(ner_dataset,row,index,bigram[0],bigram[1],'rol')\n",
|
||||
"\n",
|
||||
" if prev_row is not None:\n",
|
||||
" for title in TITLES:\n",
|
||||
" fix_unigram_title(df=ner_dataset,prev_row=prev_row,prev_index=prev_index,row=row,index=index,title=title,tag='ttl')\n",
|
||||
" for role in ROLES:\n",
|
||||
" fix_unigram_title(ner_dataset,prev_row,prev_index,row,index,role,'rol')\n",
|
||||
"\n",
|
||||
" prev_row = row\n",
|
||||
" prev_index = index"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==13][['sentence_idx','word','tag','prev-word','prev-prev-word','next-word']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# keep only relevant columns\n",
|
||||
"dataset = ner_dataset[['sentence_idx','word','tag']]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dataset.to_csv(\"../../../datasets/ner_with_titles.csv\",encoding = \"ISO-8859-1\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create templates based on NER dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"class SentenceGetter(object):\n",
|
||||
" \n",
|
||||
" def __init__(self, dataset):\n",
|
||||
" self.n_sent = 1\n",
|
||||
" self.dataset = dataset\n",
|
||||
" self.empty = False\n",
|
||||
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
|
||||
" s[\"tag\"].values.tolist())]\n",
|
||||
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
|
||||
" self.sentences = [s for s in self.grouped]\n",
|
||||
" \n",
|
||||
" def get_next(self):\n",
|
||||
" try:\n",
|
||||
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
|
||||
" self.n_sent += 1\n",
|
||||
" return s\n",
|
||||
" except:\n",
|
||||
" return None\n",
|
||||
" \n",
|
||||
" @staticmethod \n",
|
||||
" def get_template(grouped,entity_name_replace_dict=None):\n",
|
||||
" TAGS_TO_IGNORE = ['nat','eve','art','tim']\n",
|
||||
" template = \"\"\n",
|
||||
" i=0\n",
|
||||
" cur_index = 0\n",
|
||||
" ents = []\n",
|
||||
" for token in grouped:\n",
|
||||
" token_text = token[0].replace(\"[\", \"\").replace(\"]\",\"\")\n",
|
||||
" token_tag = token[1]\n",
|
||||
" if token_tag == 'O':\n",
|
||||
" template += \" \" + token_text\n",
|
||||
" elif 'B-' in token_tag and token_tag[2:] not in TAGS_TO_IGNORE:\n",
|
||||
" if entity_name_replace_dict:\n",
|
||||
" ent = entity_name_replace_dict[token[1][2:]]\n",
|
||||
" else:\n",
|
||||
" ent = token_tag[2:]\n",
|
||||
" ents.append(ent)\n",
|
||||
" template += \" [\" + ent + \"]\"\n",
|
||||
" template = re.sub(r'\\s([?,\\':.!\"](?:|$))+', r'\\1', template)\n",
|
||||
" \n",
|
||||
" for ent in ents:\n",
|
||||
" weird = \"[{}] [{}]\".format(ent,ent)\n",
|
||||
" template = template.replace(weird,\"[{}]\".format(ent))\n",
|
||||
" \n",
|
||||
" #remove additional weird combinations:\n",
|
||||
" \n",
|
||||
" to_replace = {\n",
|
||||
" \"[COUNTRY] [ROLE] [PERSON]\": \"[ROLE] [PERSON]\",\n",
|
||||
" \"[COUNTRY] [ROLE]\" : \"[ROLE]\",\n",
|
||||
" \"[ORGANIZATION] [ROLE] [PERSON]\" : \"[ORGANIZATION]'s [ROLE] [PERSON]\",\n",
|
||||
" \"[COUNTRY] [LOCATION]\" : \"[LOCATION]\",\n",
|
||||
" \"[LOCATION] [COUNTRY]\": \"[LOCATION]\",\n",
|
||||
" \"[PERSON] [COUNTRY]\" : \"[PERSON]\",\n",
|
||||
" \"[PERSON] [LOCATION]\" : \"[PERSON]\",\n",
|
||||
" \"[COUNTRY] [PERSON]\" : \"[PERSON]\",\n",
|
||||
" \"[LOCATION] [PERSON]\" : \"[PERSON]\",\n",
|
||||
" \"The [ORGANIZATION]\" : \"[ORGANIZATION]\",\n",
|
||||
" \"[PERSON] [ORGANIZATION]\" : \"[PERSON]\",\n",
|
||||
" \"of [ORGANIZATION] [PERSON]\" : \"of [ORGANIZATION], [PERSON]\",\n",
|
||||
" \"[ORGANIZATION] [PERSON]\" : \"[PERSON]\",\n",
|
||||
" \"[PERSON] [PERSON]\": \"[PERSON]\",\n",
|
||||
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
|
||||
" \"[LOCATION] said\" : \"[PERSON] said\"\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" }\n",
|
||||
" \n",
|
||||
" for weird in to_replace.keys():\n",
|
||||
" template = template.replace(weird,to_replace[weird])\n",
|
||||
" \n",
|
||||
" return template.strip()\n",
|
||||
" \n",
|
||||
"getter = SentenceGetter(dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ENTITIES_DICTIONARY = {\"per\":\"PERSON\",\"gpe\":\"COUNTRY\",\"geo\":\"LOCATION\",\"org\":\"ORGANIZATION\",'ttl':'TITLE','rol':'ROLE'}\n",
|
||||
"\n",
|
||||
"sentences = getter.sentences\n",
|
||||
"print(\"original:\",sentences[12])\n",
|
||||
"print(\"template:\", getter.get_template(sentences[12],entity_name_replace_dict=ENTITIES_DICTIONARY))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"new_templates = [SentenceGetter.get_template(sentence, ENTITIES_DICTIONARY) for sentence in sentences]\n",
|
||||
"new_templates[:5]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# save to file\n",
|
||||
"\n",
|
||||
"with open(\"../../presidio_evaluator/data_generator/raw_data/new_templates2.txt\",\"w+\", encoding = \"ISO-8859-1\") as f:\n",
|
||||
" for template in new_templates:\n",
|
||||
" f.write(\"%s\\n\" % template)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,664 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook takes the OntoNotes NER dataset and creates templates (utterances with placeholders) that a PII synthetic data generator can use to create new sentences.\n",
|
||||
"\n",
|
||||
"The notebook additionally introduces two new entities, TITLE and ROLE, to avoid cases like \"UK David Scott called his wife\", generated from the original sentence \"UK Prime Minister Boris Johnson called his wife\" because \"Prime Minister\" was tagged as PER in the original dataset. The same logic applies to titles such as Mr., Mrs. and Ms."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"pd.options.display.max_rows = 4000\n",
|
||||
"pd.set_option('display.max_colwidth', -1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Download OntoNotes data\n",
|
||||
"ontonotes = \"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### To pandas + add sentence_idx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_list = []\n",
|
||||
"sentence_id = 0\n",
|
||||
"for sentence in ontonotes:\n",
|
||||
" \n",
|
||||
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
|
||||
" df[\"sentence_idx\"] = sentence_id\n",
|
||||
" sentence_id+=1\n",
|
||||
" df_list.append(df)\n",
|
||||
"ner_dataset = pd.concat(df_list)\n",
|
||||
"ner_dataset.head(10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentences = ner_dataset.groupby('sentence_idx')['word'].apply(lambda x: \" \".join(x))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(sentences[:5])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Example sentence:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==3]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Unique entities\n",
|
||||
"ner_dataset['tag'].unique()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Replace tokenization replacements"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['word'] = ner_dataset['word']\\\n",
|
||||
".replace('-LRB-','(')\\\n",
|
||||
".replace('-RRB-',')')\\\n",
|
||||
".replace('-LCB-','(')\\\n",
|
||||
".replace('-RCB-',')')\\\n",
|
||||
".replace('``','\"')\\\n",
|
||||
".replace(\"''\",'\"')\\\n",
|
||||
".replace('/.','.')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# helper columns:\n",
|
||||
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
|
||||
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
|
||||
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
|
||||
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
|
||||
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
|
||||
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Remove unneeded (non PII) entities:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TAGS_TO_IGNORE = ['CARDINAL','FAC','LAW','LANGUAGE','TIME','DATE','ORDINAL','EVENT','QUANTITY','WORK_OF_ART','MONEY','PRODUCT','PERCENT']\n",
|
||||
"def remote_unwanted_tags(x):\n",
|
||||
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
|
||||
" return 'O'\n",
|
||||
" else:\n",
|
||||
" return x\n",
|
||||
"\n",
|
||||
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
|
||||
"ner_dataset[ner_dataset['sentence_idx']==3]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Remove PERSON tags if preceding word is 'the' (e.g. the Bush administration)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
|
||||
"\n",
|
||||
"def remove_tag_if_the_person(row):\n",
|
||||
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
|
||||
" return 'O'\n",
|
||||
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"\n",
|
||||
"ner_dataset['prev-word']=ner_dataset['prev-word'].astype('str')\n",
|
||||
"ner_dataset['prev-prev-word']=ner_dataset['prev-prev-word'].astype('str')\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Remove tag from 's (Joe Wilson's cat)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def remove_tag_if_apostraphe_after_tag(row):\n",
|
||||
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Re-tag words from dictionaries (countries, nationalities, roles, titles)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Nationalities and countries:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
|
||||
"nationalities.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\"algeria\" in nationalities['country'].values"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"ner_dataset['metadata'] = None\n",
|
||||
"\n",
|
||||
"def get_nationality_as_metadata(row):\n",
|
||||
" if row['word'].lower() in nationalities['country'].values:\n",
|
||||
" return 'COUNTRY'\n",
|
||||
" elif row['word'].lower() in nationalities['nationality'].values:\n",
|
||||
" return 'NATIONALITY'\n",
|
||||
" elif row['word'].lower() in nationalities['man'].values:\n",
|
||||
" return 'NATION_MAN'\n",
|
||||
" elif row['word'].lower() in nationalities['woman'].values:\n",
|
||||
" return 'NATION_WOMAN'\n",
|
||||
" elif row['word'].lower() in nationalities['plural'].values:\n",
|
||||
" return 'NATION_PLURAL'\n",
|
||||
" return row['metadata']\n",
|
||||
"\n",
|
||||
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
|
||||
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
|
||||
"\n",
|
||||
"def update_tag_based_on_metadata(row):\n",
|
||||
" if row['tag'] != 'O' and row['metadata'] is not None:\n",
|
||||
" return row['tag'][:2] + row['metadata']\n",
|
||||
" else:\n",
|
||||
" return row['tag']\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Titles"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"MALE_TITLES = ['mr', 'dr', 'professor', 'eng','prof','doctor']\n",
|
||||
"FEMALE_TITLES = ['mrs', 'ms', 'miss', 'dr', 'professor', 'eng', 'prof','doctor']\n",
|
||||
"\n",
|
||||
"def get_title_as_metadata(row):\n",
|
||||
" if row['word'].lower() in MALE_TITLES:\n",
|
||||
" return 'MALE_TITLE'\n",
|
||||
" elif row['word'].lower() in FEMALE_TITLES:\n",
|
||||
" return 'FEMALE_TITLE'\n",
|
||||
" return row['metadata']\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def update_title_tag_if_missing(row):\n",
|
||||
" if row['word'].lower() in MALE_TITLES and row['tag']=='O':\n",
|
||||
" return 'B-MALE_TITLE'\n",
|
||||
" elif row['word'].lower() in FEMALE_TITLES and row['tag']=='O':\n",
|
||||
" return 'B-FEMALE_TITLE'\n",
|
||||
" else:\n",
|
||||
" return row['tag']\n",
|
||||
"\n",
|
||||
"ner_dataset['metadata'] = ner_dataset.apply(get_title_as_metadata,axis=1)\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(update_title_tag_if_missing,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 39,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==18]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Remove 'the' from 'the NORP' if NORP is not in nationalities list."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 40,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def remove_tag_if_the_norp(row):\n",
|
||||
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
|
||||
" return 'O'\n",
|
||||
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Remove sentences with adjacent different entities (e.g calling from New York Larry King)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
|
||||
"ner_dataset['next-entity']=ner_dataset['next-tag'].str[2:]\n",
|
||||
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
|
||||
"sentences_to_remove = ner_dataset[adjacent_idc]['sentence_idx'].values\n",
|
||||
"sentences_to_remove\n",
|
||||
"\n",
|
||||
"ner_dataset=ner_dataset[~ner_dataset['sentence_idx'].isin(sentences_to_remove)]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Update tag for discovered metadata values (eg. nationalities)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 42,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['tag'] = ner_dataset.apply(update_tag_based_on_metadata, axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 43,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create templates base on NER dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 331,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"class SentenceGetter(object):\n",
|
||||
" \n",
|
||||
" def __init__(self, dataset):\n",
|
||||
" self.n_sent = 1\n",
|
||||
" self.dataset = dataset\n",
|
||||
" self.empty = False\n",
|
||||
" agg_func = lambda s: [(w, t) for w,t in zip(s[\"word\"].values.tolist(),\n",
|
||||
" s[\"tag\"].values.tolist())]\n",
|
||||
" self.grouped = self.dataset.groupby(\"sentence_idx\").apply(agg_func)\n",
|
||||
" self.sentences = [s for s in self.grouped]\n",
|
||||
" \n",
|
||||
" def get_next(self):\n",
|
||||
" try:\n",
|
||||
" s = self.grouped[\"Sentence: {}\".format(self.n_sent)]\n",
|
||||
" self.n_sent += 1\n",
|
||||
" return s\n",
|
||||
" except:\n",
|
||||
" return None\n",
|
||||
" \n",
|
||||
" @staticmethod \n",
|
||||
" def cleanse_template(template, ents):\n",
|
||||
" # Remove whitespace before certain punctuation marks\n",
|
||||
" template = re.sub(r'\\s([?,:.!](?:|$))+', r'\\1', template)\n",
|
||||
" \n",
|
||||
" # Remove whitespaces within double quotes\n",
|
||||
" template = re.sub('\\\"\\s*([^\\\"]*?)\\s*\\\"', r'\"\\1\"', template) \n",
|
||||
" \n",
|
||||
" # Remove whitespaces within quotes\n",
|
||||
" template = re.sub(\"\\'\\s*([^\\']*?)\\s*\\'\", r\"'\\1'\", template) \n",
|
||||
" \n",
|
||||
" # Remove whitespaces within parentheses\n",
|
||||
" template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
|
||||
" \n",
|
||||
" for ent in ents:\n",
|
||||
" #Turn PERSON PERSON into PERSON\n",
|
||||
" duplicates = \"[{}] [{}]\".format(ent,ent)\n",
|
||||
" template = template.replace(duplicates,\"[{}]\".format(ent))\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" # Replace additional weird templates:\n",
|
||||
" to_replace = {\n",
|
||||
" \"[LOCATION] says\" : \"[PERSON] says\",\n",
|
||||
" \"[LOCATION] said\" : \"[PERSON] said\",\n",
|
||||
" \"[ORGANIZATION] of [ORGANIZATION]\" : \"[ORGANIZATION]\",\n",
|
||||
" \"the [COUNTRY]\" : \"[COUNTRY]\",\n",
|
||||
" \" 's \":\"'s\",\n",
|
||||
" \"] 's \":\"]'s \",\n",
|
||||
" \"] 's,\":\"]'s,\",\n",
|
||||
" \"] 's.\":\"]'s.\",\n",
|
||||
" \" n't\" : \"n't\",\n",
|
||||
" \"/?\":\"?\",\n",
|
||||
" \"%u\":\"u\",\n",
|
||||
" \"%m\":\"m\",\n",
|
||||
" \"%e\":\"e\", \n",
|
||||
" \"%h\":\"h\", \n",
|
||||
" \"%a\":\"a\",\n",
|
||||
" \" %\":\"%\",\n",
|
||||
" \" ?\":\"?\",\n",
|
||||
" \" /?\":\"?\",\n",
|
||||
" \" ' .\":\"'.\",\n",
|
||||
" \"[ \":\"(\",\n",
|
||||
" \" ]\":\")\",\n",
|
||||
" \"[PERSON] -- [PERSON]\":\"[PERSON]\",\n",
|
||||
" \"[COUNTRY] -- [ORGANIZATION]\":\"[ORGANIZATION]\",\n",
|
||||
" \"Jews\" : \"[NATIONALITY]\",\n",
|
||||
" \"Chinese\" : \"[NATIONALITY]\",\n",
|
||||
" \"Dutch\" : \"[NATIONALITY]\",\n",
|
||||
" \"[LOCATION], [LOCATION]\":\"[LOCATION]\"\n",
|
||||
" }\n",
|
||||
" \n",
|
||||
" for weird in to_replace.keys():\n",
|
||||
" #if weird in template:\n",
|
||||
" # print(\"Weird sentence\",template)\n",
|
||||
" template = template.replace(weird,to_replace[weird])\n",
|
||||
" \n",
|
||||
" template = template.replace(\" -- \",\" - \")\n",
|
||||
" \n",
|
||||
" #Ignore templates that are incomplete\n",
|
||||
" if \"/-\" in template:\n",
|
||||
" template = \"\"\n",
|
||||
" \n",
|
||||
" if template.count('\"') == 1:\n",
|
||||
" template = template.replace('\"','')\n",
|
||||
"\n",
|
||||
" return template\n",
|
||||
" \n",
|
||||
" @staticmethod \n",
|
||||
" def get_template(grouped,entity_name_replace_dict):\n",
|
||||
" template = \"\"\n",
|
||||
" i=0\n",
|
||||
" cur_index = 0\n",
|
||||
" ents = []\n",
|
||||
" for token in grouped:\n",
|
||||
" # remove brackets as they interefere with the data generation process\n",
|
||||
" token_text = token[0].replace(\"[\", \"(\").replace(\"]\",\")\")\n",
|
||||
" token_text = token[0].replace(\"{\", \"(\").replace(\"}\",\")\")\n",
|
||||
" token_tag = token[1]\n",
|
||||
" token_entity = token_tag[2:] if len(token_tag)>1 else token_tag\n",
|
||||
" \n",
|
||||
" if token_entity == 'O':\n",
|
||||
" template += \" \" + token_text\n",
|
||||
" elif 'B-' in token_tag and token_entity not in TAGS_TO_IGNORE:\n",
|
||||
" #print(\"found entity: {}\".format(token_entity))\n",
|
||||
" ent = entity_name_replace_dict[token_entity]\n",
|
||||
" ents.append(ent)\n",
|
||||
" \n",
|
||||
" template += \" [\" + ent + \"]\"\n",
|
||||
" #print(\"template: \",template)\n",
|
||||
" \n",
|
||||
" template = SentenceGetter.cleanse_template(template, ents)\n",
|
||||
" \n",
|
||||
" return template.strip()\n",
|
||||
" \n",
|
||||
"getter = SentenceGetter(ner_dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 321,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ENTITIES_DICTIONARY = {\"PERSON\":\"PERSON\",\n",
|
||||
" \"GPE\":\"COUNTRY\",\n",
|
||||
" \"NORP\":\"LOCATION\",\n",
|
||||
" \"LOC\":\"LOCATION\",\n",
|
||||
" \"ORG\":\"ORGANIZATION\",\n",
|
||||
" \"MALE_TITLE\":\"MALE_TITLE\",\n",
|
||||
" \"FEMALE_TITLE\":\"FEMALE_TITLE\",\n",
|
||||
" \"COUNTRY\":\"COUNTRY\",\n",
|
||||
" \"NATIONALITY\":\"NATIONALITY\",\n",
|
||||
" \"NATION_WOMAN\":\"NATION_WOMAN\",\n",
|
||||
" \"NATION_MAN\":\"NATION_MAN\",\n",
|
||||
" \"NATION_PLURAL\":\"NATION_PLURAL\"}\n",
|
||||
" \n",
|
||||
"\n",
|
||||
"\n",
|
||||
"sentences = getter.sentences\n",
|
||||
"\n",
|
||||
"sent_id = 445\n",
|
||||
"\n",
|
||||
"print(\"original:\",sentences[sent_id])\n",
|
||||
"print(\"template:\", getter.get_template(sentences[sent_id],entity_name_replace_dict=ENTITIES_DICTIONARY))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 322,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"all_templates = [getter.get_template(sentence,entity_name_replace_dict=ENTITIES_DICTIONARY) for sentence in sentences]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 323,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"original length of templates: {}\".format(len(all_templates)))\n",
|
||||
"all_templates = list(set(all_templates))\n",
|
||||
"print(\"length after duplicates removal: {}\".format(len(all_templates)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 324,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# save to file\n",
|
||||
"\n",
|
||||
"with open(\"../raw_data/ontonotes_based_templates.txt\",\"w+\",encoding='utf-8') as f:\n",
|
||||
" for template in all_templates:\n",
|
||||
" f.write(\"%s\\n\" % template)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 330,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"template = \"[NATIONALITY]'s[MALE_TITLE]'\"\n",
|
||||
"\n",
|
||||
"template = getter.cleanse_template(template,[])\n",
|
||||
"#template = re.sub('\\(\\s*([^\\(]*?)\\s*\\)', r'(\\1)', template) \n",
|
||||
"template"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 326,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"if template.count(\"'\")==1:\n",
|
||||
" print(True)\n",
|
||||
" template = template.replace(\"'\",'')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 327,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"template"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,436 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Exploratory data analysis on the OntoNotes dataset, to gain insights towards the templating of the dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"pd.options.display.max_rows = 4000\n",
|
||||
"pd.set_option('display.max_colwidth', -1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"conll = \"\" # Download CoNLL-2003\n",
|
||||
"\n",
|
||||
"df_list = []\n",
|
||||
"sentence_id = 0\n",
|
||||
"for sentence in conll:\n",
|
||||
" \n",
|
||||
" df = pd.DataFrame(sentence,columns = [\"word\",\"tag\"])\n",
|
||||
" df[\"sentence_idx\"] = sentence_id\n",
|
||||
" sentence_id+=1\n",
|
||||
" df_list.append(df)\n",
|
||||
"ner_dataset = pd.concat(df_list)\n",
|
||||
"ner_dataset.head(10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"TAGS_TO_IGNORE = ['CARDINAL','FAC','LAW','LANGUAGE','TIME','DATE','ORDINAL','EVENT','QUANTITY','WORK_OF_ART','MONEY','PRODUCT','PERCENT']\n",
|
||||
"def remote_unwanted_tags(x):\n",
|
||||
" if len(x)>1 and x[2:] in TAGS_TO_IGNORE:\n",
|
||||
" return 'O'\n",
|
||||
" else:\n",
|
||||
" return x\n",
|
||||
"\n",
|
||||
"ner_dataset['tag'] = ner_dataset['tag'].apply(remote_unwanted_tags)\n",
|
||||
"ner_dataset[ner_dataset['sentence_idx']==3]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentences = ner_dataset.groupby('sentence_idx')['word'].transform(lambda x: ' '.join(x)).unique().tolist()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"len(sentences)\n",
|
||||
"#print(sentences[:5])\n",
|
||||
"with open(\"raw_sentences.txt\",\"w\",encoding=\"utf8\") as f:\n",
|
||||
" for item in sentences:\n",
|
||||
" f.write(\"{}\\n\".format(item))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Number of labels per tag"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 261,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset.groupby('tag')['tag'].count()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 264,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['word'] = ner_dataset['word'].replace('-LRB-',')')\\\n",
|
||||
".replace('-RRB-',')')\\\n",
|
||||
".replace('``',\"\\\"\")\\\n",
|
||||
".replace(\"''\",'\"')\\\n",
|
||||
".replace('/.','.')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 265,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"Counter(ner_dataset['word']).most_common(30)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Add lead and lag words and tags to dataset_no_punct"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 267,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import string\n",
|
||||
"punct = [c for c in string.punctuation]\n",
|
||||
"punct.extend([\"--\",\"''\",\"/.\"])\n",
|
||||
"print(punct)\n",
|
||||
"dataset_no_punct = ner_dataset[~ner_dataset.word.str.strip().isin(punct)]\n",
|
||||
"dataset_no_punct['prev-word'] = dataset_no_punct.word.shift(1)\n",
|
||||
"dataset_no_punct['prev-prev-word'] = dataset_no_punct['word'].shift(2)\n",
|
||||
"dataset_no_punct['next-word'] = dataset_no_punct['word'].shift(-1)\n",
|
||||
"dataset_no_punct['prev-tag'] = dataset_no_punct['tag'].shift(1)\n",
|
||||
"dataset_no_punct['next-tag'] = dataset_no_punct['tag'].shift(-1)\n",
|
||||
"dataset_no_punct.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Add features for easier manipulation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 268,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['prev-word'] = ner_dataset.word.shift(1)\n",
|
||||
"ner_dataset['prev-prev-word'] = ner_dataset['word'].shift(2)\n",
|
||||
"ner_dataset['next-word'] = ner_dataset['word'].shift(-1)\n",
|
||||
"ner_dataset['next-next-word'] = ner_dataset['word'].shift(-2)\n",
|
||||
"ner_dataset['prev-tag'] = ner_dataset['tag'].shift(1)\n",
|
||||
"ner_dataset['next-tag'] = ner_dataset['tag'].shift(-1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Gather statistics on the first person token"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 269,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"bper = dataset_no_punct[dataset_no_punct['tag']=='B-PERSON']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 270,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# histogram of B-PERSON tokens\n",
|
||||
"from collections import Counter\n",
|
||||
"Counter(bper['word']).most_common(20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 271,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prev_bper_token = bper['prev-word'].str.lower()\n",
|
||||
"Counter(prev_bper_token).most_common(20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 272,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prev_prev_bper_token = bper['prev-prev-word']\n",
|
||||
"two_prev_tokens = zip(prev_prev_bper_token.str.lower(), prev_bper_token.str.lower())\n",
|
||||
"Counter(two_prev_tokens).most_common(20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 273,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# find \"the\" followed by B-PERSON\n",
|
||||
"the_PERSON = ner_dataset[(ner_dataset['prev-word'].str.lower()==\"the\") & (ner_dataset['tag']=='B-PERSON')]\n",
|
||||
"print(the_PERSON['prev-word']+\" \"+the_PERSON['word']+\" \"+the_PERSON['next-word']+\" \"+the_PERSON['next-next-word'].values)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 296,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## add metadata for nationalities (to differentiate between America, Americans and US citizen)\n",
|
||||
"nationalities = pd.read_csv(\"../raw_data/nationalities.csv\")\n",
|
||||
"nationalities.head()\n",
|
||||
"\n",
|
||||
"ner_dataset['metadata'] = None\n",
|
||||
"\n",
|
||||
"def get_nationality_as_metadata(row):\n",
|
||||
" if row['word'].lower() in nationalities['country'].values:\n",
|
||||
" return 'COUNTRY'\n",
|
||||
" elif row['word'].lower() in nationalities['nationality'].values:\n",
|
||||
" return 'NATIONALITY'\n",
|
||||
" elif row['word'].lower() in nationalities['man'].values:\n",
|
||||
" return 'NATION_MAN'\n",
|
||||
" elif row['word'].lower() in nationalities['woman'].values:\n",
|
||||
" return 'NATION_WOMAN'\n",
|
||||
" return row['metadata']\n",
|
||||
"\n",
|
||||
"row = pd.Series({'word':'Frenchwoman','metadata':None})\n",
|
||||
"print(\"Example: Frenchwoman -> \",get_nationality_as_metadata(row))\n",
|
||||
"\n",
|
||||
"ner_dataset['metadata'] = ner_dataset.apply(get_nationality_as_metadata, axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 297,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# removing PERSON tags from sentences with a 'the' preceding the person:\n",
|
||||
"\n",
|
||||
"def remove_tag_if_the_person(row):\n",
|
||||
" if row['prev-word'].lower() == 'the' and row['tag']=='B-PERSON':\n",
|
||||
" return 'O'\n",
|
||||
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-PERSON' and row['tag']=='B-PERSON':\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"\n",
|
||||
"def remove_tag_if_the_norp(row):\n",
|
||||
" if row['prev-word'].lower() == 'the' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
|
||||
" return 'O'\n",
|
||||
" elif row['prev-prev-word'].lower() == 'the' and row['prev-tag']=='I-NORP' and row['tag']=='B-NORP' and row['metadata'] is None:\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"\n",
|
||||
"ner_dataset['prev-word']=ner_dataset['prev-word'].astype('str')\n",
|
||||
"ner_dataset['prev-prev-word']=ner_dataset['prev-prev-word'].astype('str')\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_person,axis=1)\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_the_norp,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 299,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# find \"the\" followed by B-NORP\n",
|
||||
"the_NORP = ner_dataset[(ner_dataset['prev-word'].str.lower()==\"the\") & (ner_dataset['tag']=='B-NORP')]\n",
|
||||
"print(the_NORP['prev-word']+\" \"+the_NORP['word']+\" \"+the_NORP['next-word']+\" \"+the_NORP['next-next-word'].values + \" (\" + the_NORP['metadata'] + \")\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 276,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def remove_tag_if_apostraphe_after_tag(row):\n",
|
||||
" if row['prev-tag'] != 'O' and row['word']==\"'s\":\n",
|
||||
" return 'O'\n",
|
||||
" return row['tag']\n",
|
||||
"ner_dataset['tag'] = ner_dataset.apply(remove_tag_if_apostraphe_after_tag,axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 277,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sentences_with_president=ner_dataset[ner_dataset['word'].str.lower() == 'president']['sentence_idx']\n",
|
||||
"ner_dataset[ner_dataset['sentence_idx']==sentences_with_president.iloc[0]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 279,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['tag']=='B-PERSON']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Adjacent tags"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 281,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset['entity'] = ner_dataset['tag'].str[2:]\n",
|
||||
"ner_dataset['next-entity']=ner_dataset['next-tag'].str[2:]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 286,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"adjacent_idc = (ner_dataset['tag'] != 'O') & (ner_dataset['next-tag'] != 'O') & (ner_dataset['entity'] != ner_dataset['next-entity'])\n",
|
||||
"print(\"sentences with duplicate different entities: \",str(len(ner_dataset[adjacent_idc])))\n",
|
||||
"ner_dataset[adjacent_idc]['sentence_idx']\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 289,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ner_dataset[ner_dataset['sentence_idx']==8759]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"NORP values"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 293,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"norp_values = ner_dataset[ner_dataset['entity']=='NORP']['word']\n",
|
||||
"Counter(norp_values).most_common(50)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The country?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 311,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"the_X_idx = (ner_dataset['prev-word']=='the') & (ner_dataset['tag'] != 'O')\n",
|
||||
"the_X_sentences = ner_dataset[the_X_idx]['sentence_idx']\n",
|
||||
"the_X_sentences.values[0]\n",
|
||||
"ner_dataset[ner_dataset['sentence_idx']==the_X_sentences.values[0]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -0,0 +1,116 @@
|
|||
import datetime
|
||||
import json
|
||||
|
||||
import pandas as pd
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import FakeDataGenerator
|
||||
|
||||
|
||||
def read_utterances(utterances_file):
|
||||
with open(utterances_file) as f:
|
||||
return f.readlines()
|
||||
|
||||
|
||||
def generate(fake_pii_csv,
|
||||
utterances_file,
|
||||
output_file=None,
|
||||
num_of_examples=1000,
|
||||
dictionary_path=None,
|
||||
store_masked_text=False,
|
||||
keep_only_tagged=False,
|
||||
**kwargs):
|
||||
"""
|
||||
|
||||
:param fake_pii_csv: csv containing fake PII
|
||||
:param utterances_file: txt file containing template sentences
|
||||
:param output_file: filepath for json or csv output
|
||||
:param num_of_examples: number of examples to generate
|
||||
:param dictionary_path: path to vocabulary file
|
||||
:param store_masked_text: Whether to remove or keep masked version of text
|
||||
:param keep_only_tagged: Ignore utterances with no entity
|
||||
(e.g. Remove: 'I went to the shop today', Keep: '[PERSON] went to the shop today')
|
||||
:return: list of generated InputSamples
|
||||
"""
|
||||
|
||||
if not output_file:
|
||||
raise ValueError("Please provide an output file path")
|
||||
|
||||
templates = read_utterances(utterances_file)
|
||||
|
||||
if keep_only_tagged:
|
||||
templates = [template for template in templates if "[" in template]
|
||||
|
||||
df = pd.read_csv(fake_pii_csv, encoding='utf-8')
|
||||
|
||||
generator = FakeDataGenerator(fake_pii_df=df,
|
||||
dictionary_path=dictionary_path,
|
||||
templates=templates, **kwargs)
|
||||
counter = 0
|
||||
|
||||
examples = []
|
||||
for example in generator.sample_examples(num_of_examples):
|
||||
if not store_masked_text:
|
||||
example.masked = None
|
||||
examples.append(example)
|
||||
|
||||
examples_json = [example.to_dict() for example in examples]
|
||||
|
||||
with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
|
||||
json.dump(examples_json, f, ensure_ascii=False, indent=4)
|
||||
|
||||
print("generated {} examples".format(len(examples)))
|
||||
print("Finished creating generated dataset. File location:{}".format(output_file))
|
||||
|
||||
return examples
|
||||
|
||||
|
||||
def read_synth_dataset(filepath=None, length=None):
|
||||
import json
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
dataset = json.load(f)
|
||||
|
||||
if length:
|
||||
dataset = dataset[:length]
|
||||
|
||||
input_samples = [InputSample.from_json(row) for row in dataset]
|
||||
|
||||
return input_samples
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
# PARAMS:
|
||||
EXAMPLES = 30
|
||||
PII_FILE_SIZE = 3000
|
||||
SPAN_TO_TAG = True
|
||||
TEMPLATES_FILE = 'raw_data/templates.txt'
|
||||
KEEP_ONLY_TAGGED = False
|
||||
LOWER_CASE_RATIO = 0.1
|
||||
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}
|
||||
|
||||
cur_time = datetime.date.today().strftime("%B %d %Y")
|
||||
OUTPUT = "generated_size_{}_date_{}.txt".format(EXAMPLES, cur_time)
|
||||
|
||||
fake_pii_csv = '../../presidio_evaluator/data_generator/' \
|
||||
'raw_data/FakeNameGenerator.com_{}.csv'.format(PII_FILE_SIZE)
|
||||
utterances_file = TEMPLATES_FILE
|
||||
dictionary_path = None
|
||||
|
||||
examples = generate(fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=utterances_file,
|
||||
dictionary_path=dictionary_path,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=LOWER_CASE_RATIO,
|
||||
num_of_examples=EXAMPLES,
|
||||
ignore_types=IGNORE_TYPES,
|
||||
keep_only_tagged=KEEP_ONLY_TAGGED,
|
||||
span_to_tag=SPAN_TO_TAG)
|
||||
|
||||
# sanity
|
||||
input_samples = read_synth_dataset(OUTPUT)
|
||||
for sample in input_samples:
|
||||
if len(sample.tags) != len(sample.tokens):
|
||||
print("ERROR during generation. sample: {}".format(sample))
|
||||
|
||||
print(input_samples[:10])
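The `keep_only_tagged` option in `generate` keeps only templates that contain a `[`. A minimal sketch of that filter follows, together with a small helper that lists which entities a template references. The `placeholders` helper and the `[A-Z_]+` placeholder pattern are assumptions made for illustration; they are not part of `presidio_evaluator`.

```python
import re

# Assumption: placeholders look like [ENTITY_NAME], as in the templates file.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def keep_tagged(templates):
    """Mirror the `keep_only_tagged` filter: keep templates containing '['."""
    return [t for t in templates if "[" in t]


def placeholders(template):
    """Hypothetical helper: entity names referenced by a template, in order."""
    return PLACEHOLDER.findall(template)


templates = [
    "How do I check my balance on my credit card?",
    "Need to change billing date of my card [CREDIT_CARD]",
]
tagged = keep_tagged(templates)
print(tagged)
print(placeholders(tagged[0]))
```

Checking for a `[` is a cheap heuristic: a template with a stray bracket but no real entity would also pass, so a pattern-based check like `placeholders` may be a stricter alternative.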
|
|
@ -0,0 +1,38 @@
|
|||
import random
|
||||
import os
|
||||
from pathlib import Path
|
||||
import pandas as pd
|
||||
import re
|
||||
|
||||
|
||||
class NationalityGenerator:
|
||||
def __init__(self, company_name_file_path="raw_data/nationalities.csv"):
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
file_path = Path(dir_path, company_name_file_path)
|
||||
df = pd.read_csv(str(file_path))
|
||||
|
||||
self.df = df
|
||||
|
||||
def get_country(self):
|
||||
## [COUNTRY]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['country'].values))
|
||||
|
||||
def get_nationality(self):
|
||||
## [NATIONALITY]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['nationality'].values))
|
||||
|
||||
def get_nation_woman(self):
|
||||
## [NATION_WOMAN]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['woman'].values))
|
||||
|
||||
def get_nation_man(self):
|
||||
## [NATION_MAN]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['man'].values))
|
||||
|
||||
def get_nation_plural(self):
|
||||
## [NATION_PLURAL]
|
||||
return NationalityGenerator.capitalizeWords(random.choice(self.df['plural'].values))
|
||||
|
||||
@staticmethod
|
||||
def capitalizeWords(s):
|
||||
return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)
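To illustrate what `capitalizeWords` does with the lower-case CSV values, here is a standalone sketch of the same regular expression. The function name `capitalize_words` is only used for this example. Note that `str.capitalize` lower-cases everything after the first character, so acronyms lose their casing.

```python
import re


def capitalize_words(s):
    """Capitalize every run of word characters, as capitalizeWords does."""
    return re.sub(r"\w+", lambda m: m.group(0).capitalize(), s)


print(capitalize_words("costa rica"))          # Costa Rica
print(capitalize_words("bosnia-herzegovina"))  # Bosnia-Herzegovina
print(capitalize_words("USA"))                 # Usa
```

Because the pattern matches each `\w+` run separately, hyphenated and multi-word names are capitalized part by part.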
|
|
@ -0,0 +1,16 @@
|
|||
import random
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class OrgNameGenerator:
|
||||
def __init__(self, company_name_file_path="raw_data/organizations.csv"):
|
||||
self.companies = []
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
file_path = Path(dir_path, company_name_file_path)
|
||||
|
||||
with open(str(file_path)) as file:
|
||||
self.companies = file.read().splitlines()
|
||||
|
||||
def get_organization(self):
|
||||
return random.choice(self.companies)
|
File diff suppressed because it is too large
|
@ -0,0 +1,203 @@
|
|||
country,nationality,man,woman,plural
|
||||
algeria,algerian,algerian,algerian,algerians
|
||||
andorra,andorran,andorran,andorran,andorrans
|
||||
angola,angolan,angolan,angolan,angolans
|
||||
argentina,argentinian,argentinian,argentinian,argentinians
|
||||
armenia,armenian,armenian,armenian,armenians
|
||||
australia,australian,australian,australian,australians
|
||||
austria,austrian,austrian,austrian,austrians
|
||||
azerbaijan,azerbaijani,azerbaijani,azerbaijani,azerbaijanis
|
||||
bahamas,bahamian,bahamian,bahamian,bahamians
|
||||
bahrain,bahraini,bahraini,bahraini,bahrainis
|
||||
bangladesh,bangladeshi,bangladeshi,bangladeshi,bangladeshis
|
||||
barbados,barbadian,barbadian,barbadian,barbadians
|
||||
belarus,belarusian,belarusian,belarusian,belarusians
|
||||
belgium,belgian,belgian,belgian,belgians
|
||||
belize,belizian,belizian,belizian,belizians
|
||||
benin,beninese,beninese,beninese,benineses
|
||||
bhutan,bhutanese,bhutanese,bhutanese,bhutaneses
|
||||
bolivia,bolivian,bolivian,bolivian,bolivians
|
||||
bosnia-herzegovina,bosnian,bosnian,bosnian,bosnians
|
||||
botswana,botswanan,tswana,tswana,tswanas
|
||||
brazil,brazilian,brazilian,brazilian,brazilians
|
||||
britain,british,briton,briton,britons
|
||||
brunei,bruneian,bruneian,bruneian,bruneians
|
||||
bulgaria,bulgarian,bulgarian,bulgarian,bulgarians
|
||||
burkina,burkinese,burkinese,burkinese,burkineses
|
||||
burma (or myanmar),burmese,burmese,burmese,burmeses
|
||||
burundi,burundian,burundian,burundian,burundians
|
||||
cambodia,cambodian,cambodian,cambodian,cambodians
|
||||
cameroon,cameroonian,cameroonian,cameroonian,cameroonians
|
||||
canada,canadian,canadian,canadian,canadians
|
||||
cape verde islands,cape verdean,cape verdean,cape verdean,cape verdeans
|
||||
chad,chadian,chadian,chadian,chadians
|
||||
chile,chilean,chilean,chilean,chileans
|
||||
china,chinese,chinese,chinese,chineses
|
||||
colombia,colombian,colombian,colombian,colombians
|
||||
congo,congolese,congolese,congolese,congoleses
|
||||
costa rica,costa rican,costa rican,costa rican,costa ricans
|
||||
croatia,croatian,croatian,croatian,croatians
|
||||
cuba,cuban,cuban,cuban,cubans
|
||||
cyprus,cypriot,cypriot,cypriot,cypriots
|
||||
czech republic,czech,czech,czech,czechs
|
||||
denmark,danish,dane,dane,danes
|
||||
djibouti,djiboutian,djiboutian,djiboutian,djiboutians
|
||||
dominica,dominican,dominican,dominican,dominicans
|
||||
dominican republic,dominican,dominican,dominican,dominicans
|
||||
ecuador,ecuadorean,ecuadorean,ecuadorean,ecuadoreans
|
||||
egypt,egyptian,egyptian,egyptian,egyptians
|
||||
el salvador,salvadorean,salvadorean,salvadorean,salvadoreans
|
||||
england,english,englishman,englishwoman,englishmans
|
||||
eritrea,eritrean,eritrean,eritrean,eritreans
|
||||
estonia,estonian,estonian,estonian,estonians
|
||||
ethiopia,ethiopian,ethiopian,ethiopian,ethiopians
|
||||
fiji,fijian,fijian,fijian,fijians
|
||||
finland,finnish,finn,finn,finns
|
||||
france,french,frenchman,frenchwoman,frenchmans
|
||||
gabon,gabonese,gabonese,gabonese,gaboneses
|
||||
gambia,gambian,gambian,gambian,gambians
|
||||
georgia,georgian,georgian,georgian,georgians
|
||||
germany,german,german,german,germans
|
||||
ghana,ghanaian,ghanaian,ghanaian,ghanaians
|
||||
greece,greek,greek,greek,greeks
|
||||
grenada,grenadian,grenadian,grenadian,grenadians
|
||||
guatemala,guatemalan,guatemalan,guatemalan,guatemalans
|
||||
guinea,guinean,guinean,guinean,guineans
|
||||
guyana,guyanese,guyanese,guyanese,guyaneses
|
||||
haiti,haitian,haitian,haitian,haitians
|
||||
holland,dutch,dutchman,dutchwoman,dutchmans
|
||||
netherlands,dutch,dutchman,dutchwoman,dutchmans
|
||||
honduras,honduran,honduran,honduran,hondurans
|
||||
hungary,hungarian,hungarian,hungarian,hungarians
|
||||
iceland,icelandic,icelander,icelander,icelanders
|
||||
india,indian,indian,indian,indians
|
||||
indonesia,indonesian,indonesian,indonesian,indonesians
|
||||
iran,iranian,iranian,iranian,iranians
|
||||
iraq,iraqi,iraqi,iraqi,iraqis
|
||||
ireland,irish,irishman,irishwoman,irishmans
|
||||
republic of ireland,irish,irishman,irishwoman,irishmans
|
||||
israel,israeli,israeli,israeli,israelis
|
||||
italy,italian,italian,italian,italians
|
||||
jamaica,jamaican,jamaican,jamaican,jamaicans
|
||||
japan,japanese,japanese,japanese,japaneses
|
||||
jordan,jordanian,jordanian,jordanian,jordanians
|
||||
kazakhstan,kazakh,kazakh,kazakh,kazakhs
|
||||
kenya,kenyan,kenyan,kenyan,kenyans
|
||||
kuwait,kuwaiti,kuwaiti,kuwaiti,kuwaitis
|
||||
laos,laotian,laotian,laotian,laotians
|
||||
latvia,latvian,latvian,latvian,latvians
|
||||
lebanon,lebanese,lebanese,lebanese,lebaneses
|
||||
liberia,liberian,liberian,liberian,liberians
|
||||
libya,libyan,libyan,libyan,libyans
|
||||
liechtenstein,liechtensteiner,liechtensteiner,liechtensteiner,liechtensteiners
|
||||
lithuania,lithuanian,lithuanian,lithuanian,lithuanians
|
||||
luxembourg,luxembourger,luxembourger,luxembourger,luxembourgers
|
||||
macedonia,macedonian,macedonian,macedonian,macedonians
|
||||
madagascar,madagascan,malagasy,malagasy,malagasys
|
||||
malawi,malawian,malawian,malawian,malawians
|
||||
malaysia,malaysian,malaysian,malaysian,malaysians
|
||||
maldives,maldivian,maldivian,maldivian,maldivians
|
||||
mali,malian,malian,malian,malians
|
||||
malta,maltese,maltese,maltese,malteses
|
||||
mauritania,mauritanian,mauritanian,mauritanian,mauritanians
|
||||
mauritius,mauritian,mauritian,mauritian,mauritians
|
||||
mexico,mexican,mexican,mexican,mexicans
|
||||
moldova,moldovan,moldovan,moldovan,moldovans
|
||||
monaco,monégasque,monacan,monacan,monacans
|
||||
mongolia,mongolian,mongolian,mongolian,mongolians
|
||||
montenegro,montenegrin,montenegrin,montenegrin,montenegrins
|
||||
morocco,moroccan,moroccan,moroccan,moroccans
|
||||
mozambique,mozambican,mozambican,mozambican,mozambicans
|
||||
namibia,namibian,namibian,namibian,namibians
|
||||
nepal,nepalese,nepalese,nepalese,nepaleses
|
||||
new zealand,new zealand,new zealander,new zealander,new zealanders
|
||||
nicaragua,nicaraguan,nicaraguan,nicaraguan,nicaraguans
|
||||
niger,nigerien,nigerien,nigerien,nigeriens
|
||||
nigeria,nigerian,nigerian,nigerian,nigerians
|
||||
north korea,north korean,north korean,north korean,north koreans
|
||||
norway,norwegian,norwegian,norwegian,norwegians
|
||||
oman,omani,omani,omani,omanis
|
||||
pakistan,pakistani,pakistani,pakistani,pakistanis
|
||||
panama,panamanian,panamanian,panamanian,panamanians
|
||||
papua new guinea,papua new guinean,papua new guinean,papua new guinean,papua new guineans
|
||||
paraguay,paraguayan,paraguayan,paraguayan,paraguayans
|
||||
peru,peruvian,peruvian,peruvian,peruvians
|
||||
the philippines,philippine,filipino,filipino,filipinos
|
||||
poland,polish,pole,pole,poles
|
||||
portugal,portuguese,portuguese,portuguese,portugueses
|
||||
qatar,qatari,qatari,qatari,qataris
|
||||
romania,romanian,romanian,romanian,romanians
|
||||
russia,russian,russian,russian,russians
|
||||
rwanda,rwandan,rwandan,rwandan,rwandans
|
||||
saudi arabia,saudi arabian,saudi,saudi,saudis
|
||||
scotland,scottish,scot,scot,scots
|
||||
senegal,senegalese,senegalese,senegalese,senegaleses
|
||||
serbia,serbian,serbian,serbian,serbians
|
||||
seychelles,seychellois,seychellois,seychellois,seychellois
|
||||
sierra leone,sierra leonean,sierra leonean,sierra leonean,sierra leoneans
|
||||
singapore,singaporean,singaporean,singaporean,singaporeans
|
||||
slovakia,slovak,slovak,slovak,slovaks
|
||||
slovenia,slovenian,slovenian,slovenian,slovenians
|
||||
solomon islands,solomon islander,solomon islander,solomon islander,solomon islanders
|
||||
somalia,somali,somali,somali,somalis
|
||||
south africa,south african,south african,south african,south africans
|
||||
south korea,south korean,south korean,south korean,south koreans
|
||||
spain,spanish,spaniard,spaniard,spaniards
|
||||
sri lanka,sri lankan,sri lankan,sri lankan,sri lankans
|
||||
sudan,sudanese,sudanese,sudanese,sudaneses
|
||||
suriname,surinamese,surinamese,surinamese,surinameses
|
||||
swaziland,swazi,swazi,swazi,swazis
|
||||
sweden,swedish,swede,swede,swedes
|
||||
switzerland,swiss,swiss,swiss,swiss
|
||||
syria,syrian,syrian,syrian,syrians
|
||||
taiwan,taiwanese,taiwanese,taiwanese,taiwanese
|
||||
tajikistan,tajik,tajik,tajik,tajiks
|
||||
tanzania,tanzanian,tanzanian,tanzanian,tanzanians
|
||||
thailand,thai,thai,thai,thais
|
||||
togo,togolese,togolese,togolese,togoleses
|
||||
trinidad and tobago,trinidadian,trinidadian,trinidadian,trinidadians
|
||||
tunisia,tunisian,tunisian,tunisian,tunisians
|
||||
turkey,turkish,turk,turk,turks
|
||||
turkmenistan,turkmen,turkmen,turkmen,turkmens
|
||||
tuvalu,tuvaluan,tuvaluan,tuvaluan,tuvaluans
|
||||
uganda,ugandan,ugandan,ugandan,ugandans
|
||||
ukraine,ukrainian,ukrainian,ukrainian,ukrainians
|
||||
united arab emirates (uae),emirati,emirati,emirati,emiratis
|
||||
united arab emirates,emirati,emirati,emirati,emiratis
|
||||
uae,emirati,emirati,emirati,emiratis
|
||||
united kingdom,british,briton,briton,britons
|
||||
england,british,briton,briton,britons
|
||||
uk,british,briton,briton,britons
|
||||
united states of america (usa),american,american,american,american
|
||||
united states of america,american,american,american,american
|
||||
usa,american,american,american,american
|
||||
us,american,us citizen,us citizen,us citizens
|
||||
u.s.a,american,us citizen,us citizen,us citizens
|
||||
uruguay,uruguayan,uruguayan,uruguayan,uruguayans
|
||||
uzbekistan,uzbek,uzbek,uzbek,uzbeks
|
||||
vanuatu,vanuatuan,vanuatuan,vanuatuan,vanuatuans
|
||||
vatican city,vatican,vatican,vatican,vaticans
|
||||
venezuela,venezuelan,venezuelan,venezuelan,venezuelans
|
||||
vietnam,vietnamese,vietnamese,vietnamese,vietnameses
|
||||
wales,welsh,welshman,welshwoman,welshmans
|
||||
western samoa,western samoan,western samoan,western samoan,western samoans
|
||||
yemen,yemeni,yemeni,yemeni,yemenis
|
||||
yugoslavia,yugoslav,yugoslav,yugoslav,yugoslavs
|
||||
zaire,zairean,zairean,zairean,zaireans
|
||||
zambia,zambian,zambian,zambian,zambians
|
||||
zimbabwe,zimbabwean,zimbabwean,zimbabwean,zimbabweans
|
||||
europe,european,european,european,europeans
|
||||
america,american,american,american,americans
|
||||
asia,asian,asian,asian,asians
|
||||
africa,african,african,african,africans
|
||||
middle east,middle-eastern,middle-eastern,middle-eastern,middle-easterns
|
||||
middle-east,middle-eastern,middle-eastern,middle-eastern,middle-easterns
|
||||
south-america,south-american,south-american,south-american,south-americans
|
||||
north-america,north-american,north-american,north-american,north-americans
|
||||
california,californian,californian,californian,californians
|
||||
new-york,new-yorker,new-yorker,new-yorker,new-yorkers
|
||||
palestine,palestinian,palestinian,palestinian,palestinians
|
||||
sunni,sunni,sunni,sunni,sunnis
|
||||
israel,jewish,jewish,jewish,jews
|
||||
israel,jew,jew,jew,jews
|
||||
kurdistan,kurd,kurd,kurd,kurds
|
|
|
@ -0,0 +1,659 @@
|
|||
3 Round Stones Inc
|
||||
48 Factoring Inc
|
||||
5Psolutions
|
||||
Abt Associates
|
||||
Accela
|
||||
Accenture
|
||||
Accuweather
|
||||
Acxiom
|
||||
Adaptive
|
||||
Adobe Digital Government
|
||||
Aidin
|
||||
Alarmcom
|
||||
Allianz
|
||||
Allied Van Lines
|
||||
Allstate Insurance Group
|
||||
Alltuition
|
||||
Altova
|
||||
Amazon Web Services
|
||||
American Red Ball Movers
|
||||
Amida Technology Solutions
|
||||
Analytica
|
||||
Apextech LLC
|
||||
Appallicious
|
||||
Aquicore
|
||||
Archimedes Inc
|
||||
Areavibes Inc
|
||||
Arpin Van Lines
|
||||
Arrive Labs
|
||||
Asc Partners
|
||||
Asset4
|
||||
Atlas Van Lines
|
||||
Atsite
|
||||
Aunt Bertha Inc
|
||||
Aureus Sciences Now Part Of Elsevier
|
||||
Autogrid Systems
|
||||
Avalara
|
||||
Avvo
|
||||
Ayasdi
|
||||
Azavea
|
||||
Balefire Global
|
||||
Barchart
|
||||
Be Informed
|
||||
Bekins
|
||||
Berkery Noyes Mandasoft
|
||||
Berkshire Hathaway
|
||||
Betterlesson
|
||||
Billguard
|
||||
Bing
|
||||
Biovia
|
||||
Bizvizz
|
||||
Blackrock
|
||||
Bloomberg
|
||||
Booz Allen Hamilton
|
||||
Boston Consulting Group
|
||||
Boundless
|
||||
Bridgewater
|
||||
Brightscope
|
||||
Buildfax
|
||||
Buildingeye
|
||||
Buildzoom
|
||||
Business And Legal Resources
|
||||
Business Monitor International
|
||||
Calcbench Inc
|
||||
Cambridge Information Group
|
||||
Cambridge Semantics
|
||||
Can Capital
|
||||
Canon
|
||||
Capital Cube
|
||||
Cappex
|
||||
Captricity
|
||||
Careset Systems
|
||||
Caresetcom
|
||||
Carfax
|
||||
Caspio
|
||||
Castle Biosciences
|
||||
Cb Insights
|
||||
Ceiba Solutions
|
||||
Center For Responsive Politics
|
||||
Cerner
|
||||
Certara
|
||||
CGI
|
||||
Charles River Associates
|
||||
Charles Schwab Corp.
|
||||
Chemical Abstracts Service
|
||||
Child Care Desk
|
||||
Chubb
|
||||
Citigroup
|
||||
Cityscan
|
||||
Citysourced
|
||||
Civic Impulse LLC
|
||||
Civic Insight
|
||||
Civinomics
|
||||
Civis Analytics
|
||||
Clean Power Finance
|
||||
Clearhealthcosts
|
||||
Clearstory Data
|
||||
Climate Corporation
|
||||
Clinicast
|
||||
Cloudmade
|
||||
Cloudspyre
|
||||
Code For America
|
||||
Coden
|
||||
Collective Ip
|
||||
College Abacus An Ecmc Initiative
|
||||
College Board
|
||||
Compared Care
|
||||
Compendia Bioscience Life Technologies
|
||||
Compliance And Risks
|
||||
Computer Packages Inc
|
||||
Connectdot LLC
|
||||
Connectedu
|
||||
Connotate
|
||||
Construction Monitor LLC
|
||||
Consumer Reports
|
||||
Coolclimate
|
||||
Copyright Clearance Center
|
||||
Corelogic
|
||||
Costquest
|
||||
Credit Karma
|
||||
Credit Sesame
|
||||
Crowdanalytix
|
||||
Dabo Health
|
||||
Datalogix
|
||||
Datamade
|
||||
Datamarket
|
||||
Datamyne
|
||||
Dataweave
|
||||
Deloitte
|
||||
Demystdata
|
||||
Department Of Better Technology
|
||||
Development Seed
|
||||
Docket Alarm Inc
|
||||
Dow Jones Co
|
||||
Dun Bradstreet
|
||||
Earth Networks
|
||||
Earthobserver App
|
||||
Earthquake Alert
|
||||
Eat Shop Sleep
|
||||
Ecodesk
|
||||
Einstitutional
|
||||
Embark
|
||||
Emc
|
||||
Energy Points Inc
|
||||
Energy Solutions Forum
|
||||
Enervee Corporation
|
||||
Enigmaio
|
||||
Ensco
|
||||
Environmental Data Resources
|
||||
Epsilon
|
||||
Equal Pay For Women
|
||||
Equifax
|
||||
Equilar
|
||||
Ernst Young Llp
|
||||
Escholar LLC
|
||||
Esri
|
||||
Estately
|
||||
Everyday Health
|
||||
Evidera
|
||||
Experian
|
||||
Expert Health Data Programming Inc
|
||||
Exversion
|
||||
Ezxbrl
|
||||
Factset
|
||||
Factual
|
||||
Farmers
|
||||
Farmlogs
|
||||
Fastcase
|
||||
Fidelity Investments
|
||||
Findthebestcom
|
||||
First Fuel Software
|
||||
Firstpoint Inc
|
||||
Fitch
|
||||
Flightaware
|
||||
Flightstats
|
||||
Flightview
|
||||
Foodtech Connect
|
||||
Forrester Research
|
||||
Foursquare
|
||||
Fujitsu
|
||||
Funding Circle
|
||||
Futureadvisor
|
||||
Fuzion Apps Inc
|
||||
Gallup
|
||||
Galorath Incorporated
|
||||
Garmin
|
||||
Genability
|
||||
Genospace
|
||||
Geofeedia
|
||||
Geolytics
|
||||
Geoscape
|
||||
Getraised
|
||||
Github
|
||||
Glassy Media
|
||||
Golden Helix
|
||||
Goodguide
|
||||
Google Maps
|
||||
Google Public Data Explorer
|
||||
Government Transaction Services
|
||||
Govini
|
||||
Govtribe
|
||||
Govzilla Inc
|
||||
Gradiant Research LLC
|
||||
Graebel Van Lines
|
||||
Graematter Inc
|
||||
Granicus
|
||||
Greatschools
|
||||
Guidestar
|
||||
H3 Biomedicine
|
||||
Harris Corporation
|
||||
Hdscores Inc
|
||||
Headlight
|
||||
Healthgrades
|
||||
Healthline
|
||||
Healthmap
|
||||
Healthpocket Inc
|
||||
Hellowallet
|
||||
Here
|
||||
Honest Buildings
|
||||
Hopstop
|
||||
Housefax
|
||||
Hows My Offer
|
||||
Ibm
|
||||
Ideas42
|
||||
Ifactor Consulting
|
||||
Ifi Claims Patent Services
|
||||
Imedicare
|
||||
Impact Forecasting Aon
|
||||
Impaq International
|
||||
Intuit
|
||||
Importio
|
||||
Ims Health
|
||||
Incadence
|
||||
Indoors
|
||||
Infocommerce Group
|
||||
Informatica
|
||||
Innocentive
|
||||
Innography
|
||||
Innovest Systems
|
||||
Inovalon
|
||||
Inrix Traffic
|
||||
Intelius
|
||||
Intermap Technologies
|
||||
Investormill
|
||||
Iodine
|
||||
Iphix
|
||||
Irecycle
|
||||
Itriage
|
||||
Ives Group Inc
|
||||
Iw Financial
|
||||
Jj Keller
|
||||
Jp Morgan Chase
|
||||
Junar Inc
|
||||
Junyo
|
||||
Jurispect
|
||||
Kaiser Permanante
|
||||
Karmadata
|
||||
Keychain Logistics Corp.
|
||||
Kidadmit Inc
|
||||
Kimono Labs
|
||||
Kld Research
|
||||
Knoema
|
||||
Knowledge Agency
|
||||
Kpmg
|
||||
Kroll Bond Ratings Agency
|
||||
Kyruus
|
||||
Lawdragon
|
||||
Legal Science Partners
|
||||
Legcyte
|
||||
Legination Inc
|
||||
Legistorm
|
||||
Lenddo
|
||||
Lending Club
|
||||
Level One Technologies
|
||||
Lexisnexis
|
||||
Liberty Mutual Insurance Cos
|
||||
Lilly Open Innovation Drug Discovery
|
||||
Liquid Robotics
|
||||
Locavore
|
||||
Logixdata LLC
|
||||
Loopnet
|
||||
Loqate Inc
|
||||
Loseitcom
|
||||
Loveland Technologies
|
||||
Lucid
|
||||
Lumesis Inc
|
||||
Mango Transit
|
||||
Mapbox
|
||||
Maponics
|
||||
Mapquest
|
||||
Marinexplore Inc
|
||||
Marketsense
|
||||
Marlin Associates
|
||||
Marlin Alter And Associates
|
||||
Mcgraw Hill Financial
|
||||
Mckinsey
|
||||
Medwatcher
|
||||
Mercaris
|
||||
Merrill Corp.
|
||||
Merrill Lynch
|
||||
Metlife
|
||||
Mhealthcoach
|
||||
Microbilt Corporation
|
||||
Microsoft Corporation
|
||||
Mint
|
||||
Moodys
|
||||
Morgan Stanley
|
||||
Morningstar Inc.
|
||||
Mozio
|
||||
Muckrockcom
|
||||
Munetrix
|
||||
Municode
|
||||
National Van Lines
|
||||
Nationwide Mutual Insurance Company
|
||||
Nautilytics
|
||||
Navico
|
||||
Nera Economic Consulting
|
||||
Nerdwallet
|
||||
New Media Parents
|
||||
Next Step Living
|
||||
Nextbus
|
||||
Ngap Incorporated
|
||||
Nielsen
|
||||
Noesis
|
||||
Nonprofitmetrics
|
||||
North American Van Lines
|
||||
Noveda Technologies
|
||||
Nucivic
|
||||
Numedii
|
||||
Oliver Wyman
|
||||
Ondeck
|
||||
Onstar
|
||||
Ontodia Inc.
|
||||
Onvia
|
||||
Open Data Nation
|
||||
Opencounter
|
||||
Opengov
|
||||
Openplans
|
||||
Opportunityspace Inc.
|
||||
Optensity
|
||||
Optigov
|
||||
Optuminsight
|
||||
Orlin Research
|
||||
Osisoft
|
||||
Otc Markets
|
||||
Outline
|
||||
Oversight Systems
|
||||
Overture Technologies
|
||||
Owler
|
||||
Palantir Technologies
|
||||
Panjiva
|
||||
Parsons Brinckerhoff
|
||||
Patentlyo
|
||||
Patientslikeme
|
||||
Pave
|
||||
Paxata
|
||||
Payscale Inc.
|
||||
Peerj
|
||||
People Power
|
||||
Persint
|
||||
Personal Democracy Media
|
||||
Personal Inc.
|
||||
Personalis
|
||||
Petersons
|
||||
Pev4Mecom
|
||||
Pixia Corp.
|
||||
Placeilivecom
|
||||
Planetecosystems
|
||||
Plotwatt
|
||||
Plusu
|
||||
Policymap
|
||||
Politify
|
||||
Poncho App
|
||||
Popvox
|
||||
Porch
|
||||
Possibilityu
|
||||
Poweradvocate
|
||||
Practice Fusion
|
||||
Predilytics
|
||||
Pricewaterhousecoopers Pwc
|
||||
Programmableweb
|
||||
Progressive Insurance Group
|
||||
Propeller Health
|
||||
Propublica
|
||||
Publicengines
|
||||
Pya Analytics
|
||||
Qado Energy Inc.
|
||||
Quandl
|
||||
Quertle
|
||||
Quid
|
||||
R R Donnelley
|
||||
Rand Corporation
|
||||
Rand Mcnally
|
||||
Rank And Filed
|
||||
Ranku
|
||||
Rapid Cycle Solutions
|
||||
Realtorcom
|
||||
Recargo
|
||||
Recipal
|
||||
Redfin
|
||||
Redlaser
|
||||
Reed Elsevier
|
||||
Rei Systems
|
||||
Relationship Science
|
||||
Remi
|
||||
Retroficiency
|
||||
Revaluate
|
||||
Revelstone
|
||||
Rezolve Group
|
||||
Rivet Software
|
||||
Roadify Transit
|
||||
Robinson Yu
|
||||
Russell Investments
|
||||
Sage Bionetworks
|
||||
SAP
|
||||
Sap
|
||||
Sas
|
||||
Scale Unlimited
|
||||
Science Exchange
|
||||
Seabourne
|
||||
Seeclickfix
|
||||
Sigfig
|
||||
Simple Energy
|
||||
Simpletuition
|
||||
Slashdb
|
||||
Smart Utility Systems
|
||||
Smartasset
|
||||
Smartprocure
|
||||
Smartronix
|
||||
Snapsense
|
||||
Social Explorer
|
||||
Social Health Insights
|
||||
Socialeffort Inc.
|
||||
Socrata
|
||||
Solar Census
|
||||
Solarlist
|
||||
Sophic Systems Alliance
|
||||
Sp Capital Iq
|
||||
Spacecurve
|
||||
Speso Health
|
||||
Spikes Cavell Analytic Inc.
|
||||
Splunk
|
||||
Spokeo
|
||||
Spotcrime
|
||||
Spotherocom
|
||||
Stamen Design
|
||||
Standard And Poors
|
||||
State Farm Insurance
|
||||
Sterling Infosystems
|
||||
Stevens Worldwide Van Lines
|
||||
Stillwater Supercomputing Inc.
|
||||
Stocksmart
|
||||
Stormpulse
|
||||
Streamlink Software
|
||||
Streetcred Software Inc.
|
||||
Streeteasy
|
||||
Suddath
|
||||
Symcat
|
||||
Synthicity
|
||||
T Rowe Price
|
||||
Tableau Software
|
||||
Tagnifi
|
||||
Telenav
|
||||
Tendril
|
||||
Teradata
|
||||
The Advisory Board Company
|
||||
The Bridgespan Group
|
||||
The Docgraph Journal
|
||||
The Govtech Fund
|
||||
The Schork Report
|
||||
The Vanguard Group
|
||||
Think Computer Corporation
|
||||
Thinknum
|
||||
Thomson Reuters
|
||||
Topcoder
|
||||
Towerdata
|
||||
Transparagov
|
||||
Transunion
|
||||
Trialtrove
|
||||
Trialx
|
||||
Trintech
|
||||
Truecar
|
||||
Trulia
|
||||
Trustedid
|
||||
Tuvalabs
|
||||
Uber
|
||||
Unigo LLC
|
||||
United Mayflower
|
||||
Urban Airship
|
||||
Urban Mapping Inc.
|
||||
Us Green Data
|
||||
Us News Schools
|
||||
Usaa Group
|
||||
Ussearch
|
||||
Verdafero
|
||||
Vimo
|
||||
BioFlower
|
||||
MysticWeb
|
||||
DeepOntoscomy
|
||||
6Sigma
|
||||
Visualdod LLC
|
||||
Vital Axiom Niinja
|
||||
Vitalchek
|
||||
Vitals
|
||||
Vizzuality
|
||||
Votizen
|
||||
Walk Score
|
||||
Watersmart Software
|
||||
Wattzon
|
||||
Way Better Patents
|
||||
Weather Channel
|
||||
Weather Decision Technologies
|
||||
Weather Underground
|
||||
Webfilings
|
||||
Webitects
|
||||
Webmd
|
||||
Weight Watchers
|
||||
Wemakeitsafer
|
||||
Wheaton World Wide Moving
|
||||
Whitby Group
|
||||
Wolfram Research
|
||||
Wolters Kluwer
|
||||
Workhands
|
||||
Xatori
|
||||
Xcential
|
||||
Xdayta
|
||||
Xignite
|
||||
Yahoo
|
||||
Yei Healthcare
|
||||
Yelp
|
||||
Yourmapper
|
||||
Zillow
|
||||
Zocdoc
|
||||
Zonability
|
||||
Zoner
|
||||
Zurich Insurance Risk Room
|
||||
Smith'S
|
||||
H&M
|
||||
Ministry Of Defence
|
||||
Ministry Of Agriculture
|
||||
NSA
|
||||
23 And Me
|
||||
E&Y
|
||||
Ortiz LLC
|
||||
Hill Inc.
|
||||
Underwood Group
|
||||
White, Nelson and Townsend
|
||||
Lester-Smith
|
||||
Rosales-Mcguire
|
||||
Johnson, Wallace and Santos
|
||||
Macdonald-Clark
|
||||
Scott Group
|
||||
Mills, Smith and Lopez
|
||||
Vazquez-Riggs
|
||||
Marshall, Hernandez and Simpson
|
||||
Mayer-Watkins
|
||||
Smith Ltd.
|
||||
Parker, Williams and Hill
|
||||
Jones, Mitchell and Williams
|
||||
Aguilar LLC
|
||||
Thomas, Holt and Myers
|
||||
Mendoza-Thompson
|
||||
Johnson Inc.
|
||||
Rose, Turner and Thompson
|
||||
Weeks-Rivas
|
||||
Frost LLC
|
||||
Henderson, Hicks and Brown
|
||||
Davis, Reynolds and Williamson
|
||||
Taylor-Jones
|
||||
Glover, Ruiz and Armstrong
|
||||
Nguyen-Johnson
|
||||
Hubbard-Thomas
|
||||
Jones, Smith and Davis
|
||||
Hawkins, Richardson and Santana
|
||||
Butler-Peters
|
||||
Barnett, Melton and Garcia
|
||||
Valentine-Murray
|
||||
Weeks, Smith and Jones
|
||||
Green Inc.
|
||||
Fernandez Inc.
|
||||
Mclaughlin Ltd.
|
||||
Drake PLC
|
||||
Conway Inc.
|
||||
Becker-Shaffer
|
||||
Hopkins, Marshall and Bruce
|
||||
Ramsey-Johnson
|
||||
Bennett-Howell
|
||||
Smith, Roberts and Turner
|
||||
Allen, Mitchell and Jones
|
||||
Davis Ltd.
|
||||
Gray, Hawkins and Williamson
|
||||
Freeman, Ho and Hoffman
|
||||
Clark, Romero and Hall
|
||||
Williams LLC
|
||||
Hubbard, Fox and Gillespie
|
||||
Valenzuela, King and Acosta
|
||||
Sharp, Lynn and Jones
|
||||
Williams, Morgan and Lynch
|
||||
Watson, Jones and Wright
|
||||
Mayo-Walters
|
||||
Smith-Lawrence
|
||||
Vasquez-Rivas
|
||||
Davidson, Holmes and Rodriguez
|
||||
Thomas and Sons
|
||||
Weber-Santana
|
||||
Evans-Bonilla
|
||||
Larsen Ltd.
|
||||
Brown-Weaver
|
||||
Ford LLC
|
||||
Rogers-Baxter
|
||||
White, Willis and Hoffman
|
||||
Hamilton, Diaz and Contreras
|
||||
Hoover, Morris and Johnson
|
||||
Lopez-Lang
|
||||
Rivera, Patel and Guerra
|
||||
Garcia-Smith
|
||||
Brown-Oneal
|
||||
Young-Stokes
|
||||
Garcia-Roberson
|
||||
Evans-Miller
|
||||
Perry-Sullivan
|
||||
Hinton LLC
|
||||
Kelly-Green
|
||||
Powers-Garcia
|
||||
Ellis-Ingram
|
||||
Huber LLC
|
||||
Baker, Moody and Williams
|
||||
Carr-Schaefer
|
||||
Coleman Group
|
||||
Underwood-Brown
|
||||
Mccarthy-Hill
|
||||
Wolf-Carpenter
|
||||
Graham, Ochoa and Vasquez
|
||||
Shepherd Ltd.
|
||||
Michael Inc.
|
||||
Cantrell Ltd.
|
||||
Fritz-Armstrong
|
||||
Miller Ltd.
|
||||
Lopez, Santos and Coleman
|
||||
Craig, Palmer and Quinn
|
||||
Sanders-Gill
|
||||
Rodriguez, West and Lynch
|
||||
Olsen, Mitchell and Jackson
|
||||
Owens, Duran and Oneal
|
||||
Thomas-James
|
||||
Moore LLC
|
||||
Green Group
|
||||
Lara-Cruz
|
||||
Crawford PL
|
||||
U.N
|
||||
NATO
|
||||
Seeds of peace
|
||||
The Bill & Melinda Gates Foundation
|
||||
AppleSeeds
|
||||
U.N.
|
||||
CNN
|
||||
CBS
|
||||
BBC
|
||||
SKY
|
||||
Sky News
|
This file cannot be displayed because it has an incorrect number of fields on line 546.
|
|
@ -0,0 +1,126 @@
|
|||
I want to increase limit on my card # [CREDIT_CARD] for certain duration of time. is it possible?
|
||||
My credit card [CREDIT_CARD] has been lost, Can I request you to block it.
|
||||
Need to change billing date of my card [CREDIT_CARD]
|
||||
I want to upadte my primary and secondary address to same: [ADDRESS]
|
||||
In case of my child's account, we need to add [PERSON] as guardian
|
||||
Are there any charges applied for money transfer from [IBAN] to other bank accounts
|
||||
Are there any charges applied to widraw money from ATM with the card [CREDIT_CARD]
|
||||
Not getting bank documents on my addres. Can you please validate the following [ADDRESS]
|
||||
Please update billing addrress with [ADDRESS] for this card: [CREDIT_CARD]
|
||||
Need to see last 10 transaction of card [CREDIT_CARD]
|
||||
I have lost my card [CREDIT_CARD]. Could you please block my credit card ASAP ? , My name is [PERSON].
|
||||
My card [CREDIT_CARD] is expiring this month. Please let me know process to it's extend validity.
|
||||
I have done an online order but didn't get any message on my registered [PHONE_NUMBER]. Could you please look into it ?
|
||||
What is procedure to redeem points won on credit card [CREDIT_CARD] transactions ?
|
||||
My card [CREDIT_CARD] expires soon - when will I get a new one?
|
||||
How do I check my balance on my credit card?
|
||||
Could I change the payment due date of my credit card?
|
||||
How can I request a new credit card pin ?
|
||||
Can I withdraw cash using my card [CREDIT_CARD] at aTM center ?
|
||||
How do I change the address linked to my credit card to [ADDRESS]?
|
||||
How do I open my credit card statement?
|
||||
I'm originally from [COUNTRY]
|
||||
I will be travelling to [COUNTRY] next week, so I need my passport to be ready by then
|
||||
Who's coming to [COUNTRY] with me?
|
||||
[COUNTRY] was super fun to visit!
|
||||
Could you please email me the statement for laste month , my credit card number is [CREDIT_CARD]?
|
||||
Could you please send me the last billed amount for cc [CREDIT_CARD] on my e-mail [EMAIL]?
|
||||
How do I change my address to [ADDRESS] for post mail?
|
||||
My name appears incorrectly on credit card statement could you please correct it to [TITLE] [PERSON]?
|
||||
card number [CREDIT_CARD] is lost, can you please send a new one to [ADDRESS] i am in [CITY] for a business trip
|
||||
Please transfer all funds from my account to this hackers' [EMAIL]
|
||||
I can't browse to your site, keep getting address [IP_ADDRESS] blocked error
|
||||
My religion does not allow speaking to bots, they are evil and hacked by the Devil
|
||||
Excuse me, Sir bot, but I really don't like this tone
|
||||
WHAT ??? I DONT KNOW WHAT TO PRESS NEXT!!! ? !! ?!
|
||||
Please have the manager call me at [PHONE_NUMBER] I'd like to join accounts with ms. [FIRST_NAME]
|
||||
Inject SELECT * FROM Users WHERE clinet_ip = ?%//!%20\|[IP_ADDRESS]|%20/
|
||||
[FIRST_NAME], can I please speak to your boss?
|
||||
May I request to have the statement sent to [ADDRESS]?
|
||||
Will my account stay active? It's under my partner's name [PERSON]
|
||||
What are my options?
|
||||
Bot: Where would you like this to be sent to? User: [ADDRESS]
|
||||
Bot: What's the name on the account? User: [PERSON]
|
||||
I would like to stop receiving messages to [PHONE_NUMBER]
|
||||
CAN I SPEAK TO A REAL PERSON?!?!
|
||||
I would like to remove my kid [FIRST_NAME] from the will. How do I do that?
|
||||
The name in the account is not correct, please change it to [PERSON]
|
||||
Hello I moved, please update my new address is [ADDRESS]
|
||||
I need to add addresses, here they are: [ADDRESS], [ADDRESS]
|
||||
Please send my portfolio to this email [EMAIL]
|
||||
Hello, this is [TITLE] [PERSON]. Who are you?
|
||||
I want to add [PERSON] as a beneficiary to my account
|
||||
I want to cancel my card [CREDIT_CARD] because I lost it
|
||||
Please block card no [CREDIT_CARD]
|
||||
What is the limit for card [CREDIT_CARD]?
|
||||
Can someone call me on [PHONE_NUMBER]? I have some questions about opening an account.
|
||||
My nam is [FIRST_NAME]
|
||||
I'm moving out of the country, so please cancel my subscription
|
||||
My name is [PERSON] but everyone calls me [FIRST_NAME]
|
||||
Please tell me your date of birth. It's [BIRTHDAY]
|
||||
You said your email is [EMAIL]. Is that correct?
|
||||
I once lived in [ADDRESS]. I now live in [ADDRESS]
|
||||
I'd like to order a taxi to [ADDRESS]
|
||||
Please charge my credit card. Number is [CREDIT_CARD]
|
||||
What's your email? [EMAIL]
|
||||
What's your credit card? [CREDIT_CARD]
|
||||
What's your name? [PERSON]
|
||||
What's your last name? [LAST_NAME]
|
||||
How can we reach you? You can call [PHONE_NUMBER]
|
||||
I'd like it to be sent to [ADDRESS]
|
||||
Meet me at [ADDRESS]
|
||||
So where are we meeting? There's this nice new Thai place downtown. Cool, what's the address? Oh do they serve vegan stuff? It's in [ADDRESS]
|
||||
Hi [FIRST_NAME], I'm contacting you about a problem I have with sending a wire transfer using this IBAN [IBAN]
|
||||
She was born on [BIRTHDAY]. Her maiden name is [LAST_NAME]
|
||||
Sometimes people call me [FIRST_NAME]
|
||||
Maybe it's under [PERSON]
|
||||
It's like that since [BIRTHDAY]
|
||||
Just posted a photo [URL]
|
||||
My website is [URL]
|
||||
My IBAN is [IBAN]
|
||||
I've shared files with you [URL]
|
||||
I work for [ORGANIZATION]
|
||||
[PERSON] from [ORGANIZATION] is the keynote speaker
|
||||
[FIRST_NAME] is from [ORGANIZATION]
|
||||
The address of [ORGANIZATION] is [ADDRESS]
|
||||
His social security number is [US_SSN]
|
||||
Here's my SSN: [US_SSN]
|
||||
[FIRST_NAME] is a very sympathetic person. He's also a good listener
|
||||
[FIRST_NAME] is very reliable. You can always depend on him.
|
||||
Why is [FIRST_NAME] so impulsive?
|
||||
[PERSON] will be talking in the conference
|
||||
have you heard [PERSON] speak yet?
|
||||
Have you been to a [PERSON] concert before?
|
||||
I'm so jealous! said [FIRST_NAME] to [FIRST_NAME]
|
||||
The true gender of [FIRST_NAME] has been under debate for years, but the riff and building energy is a rock masterpiece regardless.
|
||||
For my take on Mr. [LAST_NAME], see Guilty Pleasures: 5 Musicians Of The 70s You're Supposed To Hate (But Secretly Love)
|
||||
Unlike the [LAST_NAME] novel, it's not about necrophilia. What it is about, I suppose is anyone's guess. A brilliant piece of baroque pop.
|
||||
One of the most depressing songs on the list. He's injured from the waist down from [COUNTRY], but [FIRST_NAME] just has to get laid. Don't go to town, [FIRST_NAME]!
|
||||
Is there a better crafted pop song on this list? [LAST_NAME] and [LAST_NAME] were precision engineers.
|
||||
C'mon, sing it with me: "You picked a fine time to leave me [FIRST_NAME], four hungry children and a crop in the field..."
|
||||
A tribute to [PERSON] – sadly, she wasn't impressed.
|
||||
When they weren't singing about Hobbits, satanic felines and interstellar journeys, they were singing about the verses from [PERSON]'s Cautionary Tales. Is there a better example of unbridled creativity than early [LAST_NAME]?
|
||||
A great song made even greater by a mandolin coda (not by [PERSON]).
|
||||
[PERSON] listed his top 20 songs for Entertainment Weekly and had the balls to list this song at #15. (What did he put at #1 you ask? Answer:"Tube Snake Boogie" by [PERSON] – go figure)
|
||||
From the film American graffiti (also features [PERSON]. What's not to love?
|
||||
You can tell [FIRST_NAME] was a huge [PERSON] fan. Written when he was only 14.
|
||||
This song by ex-Zombie [LAST_NAME] is a perfect example of why you shouldn't concentrate on the order of this list. An argument could be made that this should be at number one, and I wouldn't argue with it.
|
||||
The title refers to [STREET] Street in [CITY]. It was on this street that many of the clubs where Metallica first played were situated. "Battery is found in me" shows that these early shows on [STREET] Street were important to them. Battery is where "lunacy finds you" and you "smash through the boundaries."
|
||||
Blink-182 pay tribute here to the [COUNTRY]. Producer [PERSON] explained to Fuse TV: "We all liked the idea of writing a song about our state, where we live and love. To me it's the most beautiful place in the world, this song was us giving credit to how lucky we are to have lived here and grown up here, raising families here, the whole thing."
|
||||
It may be too that [LAST_NAME] was influenced by an earlier song, "Carry Me Back To [COUNTRY]," which was arranged and sung by [PERSON] in 1847 (though [LAST_NAME]'s song was actually about a boat!).
|
||||
The [PERSON] version recorded for [ORGANIZATION] became the first celebrity recording by a classical musician to sell one million copies. The song was awarded the seventh gold disc ever granted.
|
||||
In [COUNTRY] they have company songs, musical expressions of employee loyalty sung by salarymen. Unfortunately, as regular RR commenter [PERSON] points out, "most are horrible".
|
||||
"The big three" of The Big Three Killed My Baby are the car manufacturers that dominate the economy of the White Stripes' home city [CITY]: [ORGANIZATION], [ORGANIZATION] and [ORGANIZATION]. "Don't feed me planned obsolescence," says [PERSON] in an uncharacteristically political song, lamenting the demise of the unions in the 60s.
|
||||
[ORGANIZATION] songwriter [PERSON] employs corporate lingo in the first verse of his [ORGANIZATION] Resignation Letter
|
||||
Mission Statement: This non-profit founded by radio executives "serves as an advocate for the value of music" and "supports its songwriters, composers and publishers by taking care of an important aspect of their careers – getting paid," according to the [ORGANIZATION] website. They offer blanket music licenses to businesses and organizations that allow them to play nearly 13 million musical works.
|
||||
The [ORGANIZATION] Orchestra was founded in 1929. Since then, the TSO has grown from a volunteer community orchestra to a fully professional orchestra serving Southern [COUNTRY]
|
||||
Celebrating its 10th year in [CITY], [ORGANIZATION] is a 501(c)3 that invites songwriters from around the world to Texas to share the universal language of music in collaborations designed to bridge cultures, build friendships and cultivate peace.
|
||||
[ORGANIZATION] is the brainchild of our 3 founders: [PERSON], [PERSON] and [PERSON]. The idea was born (on the beach) while they were constructing a website to be the basis of another start-up idea.
|
||||
[ORGANIZATION] is an [NATIONALITY] multinational investment bank and financial services company
|
||||
Zoolander is a 2001 American action-comedy film directed by [PERSON] and starring [LAST_NAME]
|
||||
During the 1990s, [ORGANIZATION] invested heavily in new microprocessor designs fostering the rapid growth of the computer industry.
|
||||
On 29 March 2017, the [NATIONALITY] government formally began the process of withdrawal by invoking Article 50 of the Treaty on European Union
|
||||
[FIRST_NAME] shouted at [FIRST_NAME]: "What are you doing here?"
|
||||
[LAST_NAME] spent a year at [ORGANIZATION] as the assistant to [PERSON], and the following year at [ORGANIZATION] in [CITY], which later became [ORGANIZATION] in 1965.
|
||||
[LAST_NAME] began writing as a teenager, publishing her first story, "The Dimensions of a Shadow", in 1950 while studying English and journalism at the University of [CITY].
|
||||
|
|
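The templates above mark each PII slot with a bracketed entity name such as `[PERSON]` or `[CREDIT_CARD]`. As a minimal sketch of how such a template could be turned into a labelled sample, the snippet below substitutes values and records character offsets. The function name `fill_template` and the regular expression are assumptions made for illustration and are not taken from this commit; only the span field names mirror the `Span` constructor in the Python file that follows.

```python
import re
from typing import Dict, List, Tuple

# Assumed placeholder syntax: upper-case entity name in square brackets.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template: str, values: Dict[str, str]) -> Tuple[str, List[dict]]:
    """Replace known [ENTITY] placeholders and record where each value landed."""
    parts: List[str] = []
    spans: List[dict] = []
    cursor = 0   # current position in the template
    out_len = 0  # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        entity_type = match.group(1)
        if entity_type not in values:
            continue  # unknown placeholders are left in the text untouched
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        value = values[entity_type]
        spans.append({
            "entity_type": entity_type,
            "entity_value": value,
            "start_position": out_len,
            "end_position": out_len + len(value),
        })
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


text, spans = fill_template(
    "My name is [PERSON] but everyone calls me [FIRST_NAME]",
    {"PERSON": "Dana Smith", "FIRST_NAME": "Dee"},
)
print(text)
print(spans)
```

Each recorded dictionary has the keys `entity_type`, `entity_value`, `start_position` and `end_position`, so under this assumption it could be passed to `Span.from_json`.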
@ -0,0 +1,541 @@
|
|||
from typing import List, Counter, Dict
|
||||
|
||||
import spacy
|
||||
import srsly
|
||||
from spacy.tokens import Token
|
||||
from tqdm import tqdm
|
||||
|
||||
from presidio_evaluator import span_to_tag, tokenize
|
||||
|
||||
SPACY_PRESIDIO_ENTITIES = {
|
||||
"ORG": "ORGANIZATION",
|
||||
"NORP": "ORGANIZATION",
|
||||
"GPE": "LOCATION",
|
||||
"LOC": "LOCATION",
|
||||
"FAC": "LOCATION",
|
||||
"PERSON": "PERSON",
|
||||
"LOCATION": "LOCATION",
|
||||
"ORGANIZATION": "ORGANIZATION"
|
||||
}
|
||||
PRESIDIO_SPACY_ENTITIES = {
|
||||
"ORGANIZATION": "ORG",
|
||||
"COUNTRY": "GPE",
|
||||
"CITY": "GPE",
|
||||
"LOCATION": "GPE",
|
||||
"PERSON": "PERSON",
|
||||
"FIRST_NAME": "PERSON",
|
||||
"LAST_NAME": "PERSON",
|
||||
"NATION_MAN": "GPE",
|
||||
"NATION_WOMAN": "GPE",
|
||||
"NATION_PLURAL": "GPE",
|
||||
"NATIONALITY": "GPE",
|
||||
"GPE": "GPE",
|
||||
"ORG": "ORG",
|
||||
}
|
||||
|
||||
|
||||
class Span:
|
||||
"""
|
||||
Holds information about the start, end, type and value
|
||||
of an entity in a text
|
||||
"""
|
||||
|
||||
def __init__(self, entity_type, entity_value, start_position, end_position):
|
||||
self.entity_type = entity_type
|
||||
self.entity_value = entity_value
|
||||
self.start_position = start_position
|
||||
self.end_position = end_position
|
||||
|
||||
def intersect(self, other, ignore_entity_type: bool):
|
||||
"""
|
||||
Checks if self intersects with a different Span
|
||||
:return: If intersecting, returns the number of
|
||||
intersecting characters.
|
||||
If not, returns 0
|
||||
"""
|
||||
|
||||
# if they do not overlap the intersection is 0
|
||||
if self.end_position < other.start_position or other.end_position < \
|
||||
self.start_position:
|
||||
return 0
|
||||
|
||||
# when entity types are taken into account, a different type means no intersection
|
||||
if not ignore_entity_type and (self.entity_type != other.entity_type):
|
||||
return 0
|
||||
|
||||
# otherwise the intersection is min(end) - max(start)
|
||||
return min(self.end_position, other.end_position) - max(
|
||||
self.start_position,
|
||||
other.start_position)
|
||||
|
||||
def __repr__(self):
|
||||
return "Type: {}, value: {}, start: {}, end: {}".format(
|
||||
self.entity_type, self.entity_value, self.start_position,
|
||||
self.end_position)
|
||||
|
||||
def __eq__(self, other):
|
||||
return self.entity_type == other.entity_type \
|
||||
and self.entity_value == other.entity_value \
|
||||
and self.start_position == other.start_position \
|
||||
and self.end_position == other.end_position
|
||||
|
||||
def __hash__(self):
|
||||
return hash(('entity_type', self.entity_type,
|
||||
'entity_value', self.entity_value,
|
||||
'start_position', self.start_position,
|
||||
'end_position', self.end_position))
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
return cls(**data)
|
||||
|
||||
|
||||
class SimpleSpacyExtensions(object):
|
||||
def __init__(self, **kwargs):
|
||||
"""
|
||||
Serialization of Spacy Token extensions.
|
||||
see https://spacy.io/api/token#set_extension
|
||||
:param kwargs: dictionary of spacy extensions and their values
|
||||
"""
|
||||
self.__dict__.update(kwargs)
|
||||
|
||||
def to_dict(self):
|
||||
return self.__dict__
|
||||
|
||||
|
||||
class SimpleToken(object):
|
||||
"""
|
||||
A class mimicking the Spacy Token class, for serialization purposes
|
||||
"""
|
||||
|
||||
def __init__(self, text, idx, tag_=None,
|
||||
pos_=None,
|
||||
dep_=None,
|
||||
lemma_=None,
|
||||
spacy_extensions: SimpleSpacyExtensions = None,
|
||||
**kwargs):
|
||||
self.text = text
|
||||
self.idx = idx
|
||||
self.tag_ = tag_
|
||||
self.pos_ = pos_
|
||||
self.dep_ = dep_
|
||||
self.lemma_ = lemma_
|
||||
|
||||
# serialization for Spacy extensions:
|
||||
if spacy_extensions is None:
|
||||
self._ = SimpleSpacyExtensions()
|
||||
else:
|
||||
self._ = spacy_extensions
|
||||
self.params = kwargs
|
||||
|
||||
@classmethod
|
||||
def from_spacy_token(cls, token):
|
||||
|
||||
if isinstance(token, SimpleToken):
|
||||
return token
|
||||
|
||||
elif isinstance(token, Token):
|
||||
|
||||
if token._ and token._._extensions:
|
||||
extensions = list(token._.token_extensions.keys())
|
||||
extension_values = {}
|
||||
for extension in extensions:
|
||||
extension_values[extension] = token._.__getattr__(extension)
|
||||
spacy_extensions = SimpleSpacyExtensions(**extension_values)
|
||||
else:
|
||||
spacy_extensions = None
|
||||
|
||||
return cls(text=token.text,
|
||||
idx=token.idx,
|
||||
tag_=token.tag_,
|
||||
pos_=token.pos_,
|
||||
dep_=token.dep_,
|
||||
lemma_=token.lemma_,
|
||||
spacy_extensions=spacy_extensions)
|
||||
|
||||
def to_dict(self):
|
||||
return {
|
||||
"text": self.text,
|
||||
"idx": self.idx,
|
||||
"tag_": self.tag_,
|
||||
"pos_": self.pos_,
|
||||
"dep_": self.dep_,
|
||||
"lemma_": self.lemma_,
|
||||
"_": self._.to_dict()
|
||||
}
|
||||
|
||||
def __repr__(self):
|
||||
return self.text
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
|
||||
if '_' in data:
|
||||
data['spacy_extensions'] = \
|
||||
SimpleSpacyExtensions(**data['_'])
|
||||
return cls(**data)
|
||||
|
||||
|
||||
class InputSample(object):
|
||||
|
||||
def __init__(self, full_text: str, masked: str, spans: List[Span],
|
||||
tokens=[], tags=[],
|
||||
create_tags_from_span=True, scheme="IO", metadata=None, template_id=None):
|
||||
"""
|
||||
Holds all the information needed for evaluation in the
|
||||
presidio-evaluator framework.
|
||||
Can generate tags (BIO/BILOU/IO) based on spans
|
||||
|
||||
:param full_text: The raw text of this sample
|
||||
:param masked: Masked version of the raw text (desired output)
|
||||
:param spans: List of spans for entities
|
||||
:param create_tags_from_span: True if tags (tokens+tags) should be added
|
||||
:param scheme: IO, BIO/IOB or BILOU. Only applicable if create_tags_from_span=True
|
||||
:param tokens: list of items of type SimpleToken
|
||||
:param tags: list of strings representing the label for each token,
|
||||
given the scheme
|
||||
:param metadata: A dictionary of additional metadata on the sample,
|
||||
in the English (or other language) vocabulary
|
||||
:param template_id: Original template (utterance) of sample, in case it was generated
|
||||
"""
|
||||
self.full_text = full_text
|
||||
self.masked = masked
|
||||
self.spans = spans if spans else []
|
||||
self.metadata = metadata
|
||||
|
||||
# generated samples have a template from which they were generated
|
||||
if not template_id and self.metadata:
|
||||
self.template_id = self.metadata.get("Template#")
|
||||
else:
|
||||
self.template_id = template_id
|
||||
|
||||
if create_tags_from_span:
|
||||
tokens, tags = self.get_tags(scheme)
|
||||
self.tokens = tokens
|
||||
self.tags = tags
|
||||
else:
|
||||
self.tokens = tokens
|
||||
self.tags = tags
|
||||
|
||||
def __repr__(self):
|
||||
return "Full text: {}\n" \
|
||||
"Spans: {}\n" \
|
||||
"Tokens: {}\n" \
|
||||
"Tags: {}\n".format(self.full_text, self.spans, self.tokens,
|
||||
self.tags)
|
||||
|
||||
def to_dict(self):
|
||||
|
||||
return {
|
||||
"full_text": self.full_text,
|
||||
"masked": self.masked,
|
||||
"spans": [span.__dict__ for span in self.spans],
|
||||
"tokens": [SimpleToken.from_spacy_token(token).to_dict()
|
||||
for token in self.tokens],
|
||||
"tags": self.tags,
|
||||
"template_id": self.template_id,
|
||||
"metadata": self.metadata
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data):
|
||||
if 'spans' in data:
|
||||
data['spans'] = [Span.from_json(span) for span in data['spans']]
|
||||
if 'tokens' in data:
|
||||
data['tokens'] = [SimpleToken.from_json(val) for val in
|
||||
data['tokens']]
|
||||
return cls(**data, create_tags_from_span=False)
|
||||
|
||||
def get_tags(self, scheme="IOB"):
|
||||
start_indices = [span.start_position for span in self.spans]
|
||||
end_indices = [span.end_position for span in self.spans]
|
||||
tags = [span.entity_type for span in self.spans]
|
||||
tokens = tokenize(self.full_text)
|
||||
|
||||
labels = span_to_tag(scheme=scheme, text=self.full_text, tag=tags,
|
||||
start=start_indices, end=end_indices,
|
||||
tokens=tokens)
|
||||
|
||||
return tokens, labels
|
||||
|
||||
def to_conll(self, translate_tags, scheme="BIO"):
|
||||
|
||||
conll = []
|
||||
for i, token in enumerate(self.tokens):
|
||||
if translate_tags:
|
||||
label = self.translate_tag(self.tags[i], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
else:
|
||||
label = self.tags[i]
|
||||
conll.append({"text": token.text,
|
||||
"pos": token.pos_,
|
||||
"tag": token.tag_,
|
||||
"Template#": self.metadata['Template#'],
|
||||
"gender": self.metadata['Gender'],
|
||||
"country": self.metadata['Country'],
|
||||
"label": label},
|
||||
)
|
||||
|
||||
return conll
|
||||
|
||||
def get_template_id(self):
|
||||
return self.metadata['Template#']
|
||||
|
||||
@staticmethod
|
||||
def create_conll_dataset(dataset, translate_tags=True, to_bio=True):
|
||||
import pandas as pd
|
||||
|
||||
conlls = []
|
||||
i = 0
|
||||
for sample in dataset:
|
||||
if to_bio:
|
||||
sample.bilou_to_bio()
|
||||
conll = sample.to_conll(translate_tags=translate_tags)
|
||||
for token in conll:
|
||||
token['sentence'] = i
|
||||
conlls.append(token)
|
||||
i += 1
|
||||
|
||||
return pd.DataFrame(conlls)
|
||||
|
||||
def to_spacy(self, entities=None, translate_tags=True):
|
||||
entities = [(span.start_position, span.end_position, span.entity_type)
|
||||
for span in self.spans if (entities is None) or (span.entity_type in entities)]
|
||||
new_entities = []
|
||||
if translate_tags:
|
||||
for entity in entities:
|
||||
new_tag = self.translate_tag(entity[2], PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
new_entities.append((entity[0], entity[1], new_tag))
|
||||
else:
|
||||
new_entities = entities
|
||||
return (self.full_text,
|
||||
{"entities": new_entities})
|
||||
|
||||
@classmethod
|
||||
def from_spacy(cls, text, annotations, translate_from_spacy=True):
|
||||
spans = []
|
||||
for annotation in annotations:
|
||||
tag = cls.rename_from_spacy_tags([annotation[2]])[0] if translate_from_spacy else annotation[2]
|
||||
span = Span(tag, text[annotation[0]: annotation[1]], annotation[0], annotation[1])
|
||||
spans.append(span)
|
||||
return cls(full_text=text, masked=None, spans=spans)
|
||||
|
||||
@staticmethod
|
||||
def create_spacy_dataset(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
|
||||
def template_sort(x):
|
||||
return x.metadata['Template#']
|
||||
|
||||
if sort_by_template_id:
|
||||
dataset.sort(key=template_sort)
|
||||
|
||||
return [sample.to_spacy(entities=entities, translate_tags=translate_tags) for sample in dataset]
|
||||
|
||||
def to_spacy_json(self, entities=None, translate_tags=True):
|
||||
token_dicts = []
|
||||
for i, token in enumerate(self.tokens):
|
||||
if entities:
|
||||
tag = self.tags[i] if self.tags[i][2:] in entities else 'O'
|
||||
else:
|
||||
tag = self.tags[i]
|
||||
|
||||
if translate_tags:
|
||||
tag = self.translate_tag(tag, PRESIDIO_SPACY_ENTITIES, ignore_unknown=True)
|
||||
token_dicts.append({
|
||||
"orth": token.text,
|
||||
"tag": token.tag_,
|
||||
"ner": tag
|
||||
})
|
||||
|
||||
spacy_json_sentence = {
|
||||
"raw": self.full_text,
|
||||
"sentences": [{
|
||||
"tokens": token_dicts
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
return spacy_json_sentence
|
||||
|
||||
def to_spacy_doc(self):
|
||||
doc = self.tokens
|
||||
spacy_spans = []
|
||||
for span in self.spans:
|
||||
start_token = [token.i for token in self.tokens if token.idx == span.start_position][0]
|
||||
end_token = [token.i for token in self.tokens if token.idx + len(token.text) == span.end_position][0] + 1
|
||||
spacy_span = spacy.tokens.span.Span(doc, start=start_token, end=end_token,
|
||||
label=span.entity_type)
|
||||
spacy_spans.append(spacy_span)
|
||||
doc.ents = spacy_spans
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def create_spacy_json(dataset, entities=None, sort_by_template_id=False, translate_tags=True):
|
||||
def template_sort(x):
|
||||
return x.metadata['Template#']
|
||||
|
||||
if sort_by_template_id:
|
||||
dataset.sort(key=template_sort)
|
||||
|
||||
json_str = []
|
||||
for i, sample in tqdm(enumerate(dataset)):
|
||||
paragraph = sample.to_spacy_json(entities=entities, translate_tags=translate_tags)
|
||||
json_str.append({
|
||||
"id": i,
|
||||
"paragraphs": [paragraph]
|
||||
})
|
||||
|
||||
return json_str
|
||||
|
||||
@staticmethod
|
||||
def translate_tags(tags, dictionary, ignore_unknown):
|
||||
"""
|
||||
Translates entity types from one set to another
|
||||
:param tags: list of entities to translate, e.g. ["LOCATION","O","PERSON"]
|
||||
:param dictionary: Dictionary of old tags to new tags
|
||||
:param ignore_unknown: Whether to put "O" when word not in dictionary or keep old entity type
|
||||
:return: list of translated entities
|
||||
"""
|
||||
new_tags = []
|
||||
for tag in tags:
|
||||
new_tags.append(InputSample.translate_tag(tag, dictionary, ignore_unknown))
|
||||
|
||||
return new_tags
|
||||
|
||||
@staticmethod
|
||||
def translate_tag(tag, dictionary, ignore_unknown):
|
||||
has_prefix = len(tag) > 2 and tag[1] == '-'
|
||||
no_prefix = tag[2:] if has_prefix else tag
|
||||
if no_prefix in dictionary.keys():
|
||||
return tag[:2] + dictionary[no_prefix] if has_prefix else dictionary[no_prefix]
|
||||
else:
|
||||
if ignore_unknown:
|
||||
return "O"
|
||||
else:
|
||||
return tag
|
||||
|
||||
def bilou_to_bio(self):
|
||||
new_tags = []
|
||||
for tag in self.tags:
|
||||
new_tag = tag
|
||||
has_prefix = len(tag) > 2 and tag[1] == '-'
|
||||
if has_prefix:
|
||||
if tag[0] == 'U':
|
||||
new_tag = 'B' + tag[1:]
|
||||
elif tag[0] == 'L':
|
||||
new_tag = 'I' + tag[1:]
|
||||
new_tags.append(new_tag)
|
||||
|
||||
self.tags = new_tags
|
||||
|
||||
|
||||
@staticmethod
|
||||
def rename_from_spacy_tags(spacy_tags, ignore_unknown=False):
|
||||
return InputSample.translate_tags(spacy_tags, SPACY_PRESIDIO_ENTITIES, ignore_unknown=ignore_unknown)
|
||||
|
||||
@staticmethod
|
||||
def rename_to_spacy_tags(tags, ignore_unknown=True):
|
||||
return InputSample.translate_tags(tags, PRESIDIO_SPACY_ENTITIES, ignore_unknown=ignore_unknown)
|
||||
|
||||
@staticmethod
|
||||
def write_spacy_json_from_docs(dataset, filename="spacy_output.json"):
|
||||
docs = [sample.to_spacy_doc() for sample in dataset]
|
||||
srsly.write_json(filename, [spacy.gold.docs_to_json(docs)])
|
||||
|
||||
def to_flair(self):
|
||||
# one "text POS tag" line per token; enumerate yields (index, token)
|
||||
return "\n".join("{} {} {}".format(token.text, token.pos_, self.tags[i]) for i, token in enumerate(self.tokens))
|
||||
|
||||
def translate_input_sample_tags(self, dictionary=PRESIDIO_SPACY_ENTITIES, ignore_unknown=True):
|
||||
self.tags = InputSample.translate_tags(self.tags, dictionary, ignore_unknown=ignore_unknown)
|
||||
for span in self.spans:
|
||||
if span.entity_type in dictionary:
|
||||
span.entity_type = dictionary[span.entity_type]
|
||||
elif ignore_unknown:
|
||||
span.entity_type = 'O'
|
||||
|
||||
@staticmethod
|
||||
def create_flair_dataset(dataset):
|
||||
flair_samples = []
|
||||
for sample in dataset:
|
||||
flair_samples.append(sample.to_flair())
|
||||
|
||||
return flair_samples
|
||||
|
||||
|
||||
class ModelError:
|
||||
|
||||
def __init__(self, error_type, annotation, prediction, token, full_text, metadata):
|
||||
"""
|
||||
Holds information about an error a model made for analysis purposes
|
||||
:param error_type: str, e.g. FP, FN, Person->Address etc.
|
||||
:param annotation: ground truth value
|
||||
:param prediction: predicted value
|
||||
:param token: token in question
|
||||
:param full_text: full input text
|
||||
:param metadata: metadata on text from InputSample
|
||||
"""
|
||||
|
||||
self.error_type = error_type
|
||||
self.annotation = annotation
|
||||
self.prediction = prediction
|
||||
self.token = token
|
||||
self.full_text = full_text
|
||||
self.metadata = metadata
|
||||
|
||||
def __str__(self):
|
||||
return "type: {}, " \
|
||||
"Annotation = {}, " \
|
||||
"prediction = {}, " \
|
||||
"Token = {}, " \
|
||||
"Full text = {}, " \
|
||||
"Metadata = {}".format(self.error_type,
|
||||
self.annotation,
|
||||
self.prediction,
|
||||
self.token,
|
||||
self.full_text,
|
||||
self.metadata)
|
||||
|
||||
def __repr__(self):
|
||||
return "<ModelError {0}>".format(self.__str__())
|
||||
|
||||
|
||||
class EvaluationResult(object):
|
||||
def __init__(self, results: Counter, model_errors: List[ModelError], text: str = None):
|
||||
"""
|
||||
Holds the output of a comparison between ground truth and predicted
|
||||
:param results: A Counter object
|
||||
with structure {(actual, predicted) : count}
|
||||
:param model_errors: List of ModelError
|
||||
:param text: sample's full text (if used for one sample)
|
||||
:type results: Counter
|
||||
:type model_errors : List[ModelError]
|
||||
:type text: str
|
||||
"""
|
||||
self.results = results
|
||||
self.model_errors = model_errors
|
||||
self.text = text
|
||||
|
||||
self.pii_recall = None
|
||||
self.pii_precision = None
|
||||
self.pii_f = None
|
||||
self.entity_recall_dict = None
|
||||
self.entity_precision_dict = None
|
||||
|
||||
def print(self):
|
||||
recall_dict = self.entity_recall_dict
|
||||
precision_dict = self.entity_precision_dict
|
||||
|
||||
recall_dict["PII"] = self.pii_recall
|
||||
precision_dict["PII"] = self.pii_precision
|
||||
|
||||
entities = recall_dict.keys()
|
||||
recall = recall_dict.values()
|
||||
precision = precision_dict.values()
|
||||
|
||||
row_format = "{:>30}{:>30.2%}{:>30.2%}"
|
||||
header_format = "{:>30}" * 3
|
||||
print(header_format.format(*("Entity", "Precision", "Recall")))
|
||||
for entity, precision, recall in zip(entities, precision, recall):
|
||||
print(row_format.format(entity, precision, recall))
|
||||
|
||||
print("PII F measure: {}".format(self.pii_f))
|
||||
|
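The tag handling in `translate_tag` and `bilou_to_bio` above can be tried in isolation. The following is a standalone sketch that restates the same logic as free functions, assuming a small hypothetical mapping rather than the full `PRESIDIO_SPACY_ENTITIES` dictionary; it does not import `presidio_evaluator`.

```python
from typing import Dict, List


def translate_tag(tag: str, mapping: Dict[str, str], ignore_unknown: bool = True) -> str:
    """Map the entity part of a tag, keeping a scheme prefix such as 'B-' if present."""
    has_prefix = len(tag) > 2 and tag[1] == "-"
    entity = tag[2:] if has_prefix else tag
    if entity in mapping:
        return tag[:2] + mapping[entity] if has_prefix else mapping[entity]
    # Unknown entities become "O" when ignore_unknown is set, else stay unchanged.
    return "O" if ignore_unknown else tag


def bilou_to_bio(tags: List[str]) -> List[str]:
    """Collapse BILOU prefixes into BIO: U- becomes B-, L- becomes I-."""
    converted = []
    for tag in tags:
        if len(tag) > 2 and tag[1] == "-":
            if tag[0] == "U":
                tag = "B" + tag[1:]
            elif tag[0] == "L":
                tag = "I" + tag[1:]
        converted.append(tag)
    return converted


# Hypothetical subset of the mapping, for illustration only.
mapping = {"FIRST_NAME": "PERSON", "CITY": "GPE"}
tags = ["U-FIRST_NAME", "O", "B-CITY", "L-CITY", "U-EMAIL"]

bio = bilou_to_bio(tags)
translated = [translate_tag(t, mapping) for t in bio]
print(bio)
print(translated)
```

Note that with `ignore_unknown=True` an entity missing from the mapping (here `EMAIL`) is silently relabelled as `O`, which removes it from any later evaluation.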
|
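`EvaluationResult` stores a `Counter` keyed by `(actual, predicted)` pairs, but the code that fills `pii_precision` and `pii_recall` is not part of this file. As a hedged illustration of how such a counter could be reduced to PII-level scores, the sketch below treats every tag other than `O` as PII. The function name `pii_precision_recall` and this definition of a true positive are assumptions, not necessarily the evaluator's actual method.

```python
from collections import Counter
from typing import Tuple


def pii_precision_recall(results: Counter) -> Tuple[float, float]:
    """Reduce {(actual, predicted): count} to PII-vs-non-PII precision and recall."""
    tp = fp = fn = 0
    for (actual, predicted), count in results.items():
        if actual != "O" and predicted != "O":
            tp += count  # PII detected, even if the entity type differs
        elif actual == "O" and predicted != "O":
            fp += count  # non-PII token flagged as PII
        elif actual != "O" and predicted == "O":
            fn += count  # PII token missed
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Made-up counts, for illustration only.
results = Counter({
    ("PERSON", "PERSON"): 7,
    ("LOCATION", "PERSON"): 1,
    ("O", "PERSON"): 2,
    ("PERSON", "O"): 2,
    ("O", "O"): 50,
})
precision, recall = pii_precision_recall(results)
print("precision={:.2%} recall={:.2%}".format(precision, recall))
```

Counting a type confusion such as `("LOCATION", "PERSON")` as a hit is a design choice suited to anonymization, where detecting the token matters more than naming its type.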
@ -0,0 +1,74 @@
|
|||
from typing import List
|
||||
|
||||
try:
|
||||
from flair.data import Sentence, build_spacy_tokenizer
|
||||
from flair.models import SequenceTagger
|
||||
except ImportError:
|
||||
print("Flair is not installed by default")
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
import spacy
|
||||
|
||||
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
|
||||
|
||||
|
||||
class FlairEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model=None,
|
||||
model_path: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True,
|
||||
translate_to_spacy_entities=True):
|
||||
"""
|
||||
Evaluator for Flair models
|
||||
:param model: model of type SequenceTagger
|
||||
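:param model_path: Path of a model to load with SequenceTagger.load, used when model is None
|
||||
:param entities_to_keep: Entities to evaluate; all others are ignored. If None, none are filtered
|
||||
:param verbose: Whether to print more debug info
|
||||
:param labeling_scheme: Type of scheme used for labeling (BILOU, BIO/IOB or IO)
|
||||
:param compare_by_io: True if comparison should be done on the entity level and not the sub-entity level
|
||||
:param translate_to_spacy_entities: Whether to translate the sample's tags using PRESIDIO_SPACY_ENTITIES before predicting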
|
||||
"""
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model is None:
|
||||
if model_path is None:
|
||||
raise ValueError("Either model_path or model object must be supplied")
|
||||
self.model = SequenceTagger.load(model_path)
|
||||
else:
|
||||
self.model = model
|
||||
|
||||
self.spacy_tokenizer = build_spacy_tokenizer(model=spacy.blank('en'))
|
||||
self.translate_to_spacy_entities = translate_to_spacy_entities
|
||||
|
||||
if self.translate_to_spacy_entities:
|
||||
print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.translate_to_spacy_entities:
|
||||
sample.translate_input_sample_tags()
|
||||
sentence = Sentence(text=sample.full_text, use_tokenizer=self.spacy_tokenizer)
|
||||
self.model.predict(sentence)
|
||||
|
||||
tags = self.get_tags_from_sentence(sentence)
|
||||
if len(tags) != len(sample.tokens):
|
||||
print("mismatch between previous tokens and new tokens")
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def get_tags_from_sentence(sentence):
|
||||
tags = []
|
||||
for token in sentence:
|
||||
tags.append(token.get_tag('ner').value)
|
||||
|
||||
new_tags = []
|
||||
for tag in tags:
|
||||
new_tags.append("PERSON" if tag == "PER" else tag)
|
||||
|
||||
return new_tags
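The tag clean-up in get_tags_from_sentence above hard-codes a single rename (PER to PERSON). A minimal sketch of the same idea driven by a dictionary follows; `LABEL_MAP` and `normalize_tags` are hypothetical names that are not part of this commit, and the prefix handling assumes single-letter prefixes such as B-, I-, L- or U-.

```python
from typing import Dict, List

# Hypothetical mapping; only PER -> PERSON appears in the evaluator above.
LABEL_MAP: Dict[str, str] = {"PER": "PERSON"}


def normalize_tags(tags: List[str], label_map: Dict[str, str] = LABEL_MAP) -> List[str]:
    """Rename model-specific labels, keeping a one-letter scheme prefix if present."""
    normalized = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and len(prefix) == 1:
            # Prefixed tag such as "B-PER": rename only the label part.
            normalized.append(prefix + "-" + label_map.get(label, label))
        else:
            # Bare tag such as "PER" or "O".
            normalized.append(label_map.get(tag, tag))
    return normalized


print(normalize_tags(["O", "PER", "B-PER", "I-LOC"]))
```

Keeping the renames in one dictionary means further labels can be added without touching the loop.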
|
|
@ -0,0 +1,398 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from typing import List, Tuple, Dict
|
||||
from collections import Counter
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from presidio_evaluator import InputSample, EvaluationResult, ModelError
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
class ModelEvaluator(ABC):
|
||||
|
||||
def __init__(self, entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
use_spans: bool = False, labeling_scheme="BIO",
|
||||
compare_by_io=True):
|
||||
|
||||
"""
|
||||
Abstract class for evaluating NER models and others
|
||||
:param entities_to_keep: Which entities should be evaluated? All other
|
||||
entities are ignored. If None, none are filtered
|
||||
:param verbose: Whether to print more debug info
|
||||
:param labeling_scheme: Type of scheme used for labeling (BILOU,
|
||||
BIO/LOB or IO)
|
||||
:param compare_by_io: True if comparison should be done on the entity
|
||||
level and not the sub-entity level
|
||||
|
||||
"""
|
||||
self.entities = entities_to_keep
|
||||
self.verbose = verbose
|
||||
self.use_spans = use_spans
|
||||
self.compare_by_io = compare_by_io
|
||||
self.labeling_scheme = labeling_scheme
|
||||
|
||||
@abstractmethod
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
"""
|
||||
Abstract. Returns the predicted tokens/spans from the evaluated model
|
||||
:param sample: Sample to be evaluated
|
||||
        :return: if self.use_spans: list of spans
|
||||
if not self.use_spans: tags in self.labeling_scheme format
|
||||
"""
|
||||
pass
|
||||
|
||||
def compare(self, input_sample: InputSample, prediction: List[str]):
|
||||
|
||||
"""
|
||||
        Compares ground truth tags (annotation) and predicted (prediction)
        :param input_sample: input sample containing list of tags with scheme
        self.labeling_scheme
        :param prediction: predicted value for each token
|
||||
|
||||
"""
|
||||
annotation = input_sample.tags
|
||||
tokens = input_sample.tokens
|
||||
|
||||
if len(annotation) != len(prediction):
|
||||
            print("Annotation and prediction do not have the "
                  "same length. Sample={}".format(input_sample))
|
||||
return Counter(), []
|
||||
|
||||
results = Counter()
|
||||
mistakes = []
|
||||
|
||||
new_annotation = annotation.copy()
|
||||
|
||||
if self.compare_by_io:
|
||||
new_annotation = self._to_io(new_annotation)
|
||||
prediction = self._to_io(prediction)
|
||||
|
||||
# Ignore annotations that aren't in the list of
|
||||
# requested entities.
|
||||
if self.entities:
|
||||
prediction = self._adjust_per_entities(prediction)
|
||||
new_annotation = self._adjust_per_entities(new_annotation)
|
||||
for i in range(0, len(new_annotation)):
|
||||
results[(new_annotation[i], prediction[i])] += 1
|
||||
|
||||
if self.verbose:
|
||||
print('Annotation:', new_annotation[i])
|
||||
print('Prediction:', prediction[i])
|
||||
print(results)
|
||||
|
||||
# check if there was an error
|
||||
is_error = (new_annotation[i] != prediction[i])
|
||||
if is_error:
|
||||
if prediction[i] == 'O':
|
||||
mistakes.append(ModelError("FN",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
elif new_annotation[i] == 'O':
|
||||
mistakes.append(ModelError("FP",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
else:
|
||||
mistakes.append(ModelError("Wrong entity",
|
||||
new_annotation[i],
|
||||
prediction[i],
|
||||
tokens[i],
|
||||
input_sample.full_text,
|
||||
input_sample.metadata))
|
||||
|
||||
return results, mistakes
|
||||
|
||||
    def _adjust_per_entities(self, tags):
        if self.entities:
            return [tag if tag in self.entities else 'O' for tag in tags]
        # No entity filter configured: return the tags unchanged instead of None
        return tags
|
||||
|
||||
@staticmethod
|
||||
def _to_io(tags):
|
||||
"""
|
||||
Translates BILOU/BIO/IOB to IO - only In or Out of entity.
|
||||
['B-PERSON','I-PERSON','L-PERSON'] is translated into
|
||||
['PERSON','PERSON','PERSON']
|
||||
:param tags: the input tags in BILOU/IOB/BIO format
|
||||
:return: a new list of IO tags
|
||||
"""
|
||||
return [tag[2:] if '-' in tag else tag for tag in tags]
|
||||
|
||||
def evaluate_sample(self, sample: InputSample) -> EvaluationResult:
|
||||
if self.verbose:
|
||||
print("Input sentence: {}".format(sample.full_text))
|
||||
|
||||
prediction = self.predict(sample)
|
||||
results, mistakes = self.compare(
|
||||
input_sample=sample,
|
||||
prediction=prediction)
|
||||
return EvaluationResult(results, mistakes, sample.full_text)
|
||||
|
||||
def evaluate_all(self, dataset: List[InputSample]) -> List[EvaluationResult]:
|
||||
evaluation_results = []
|
||||
for sample in tqdm(dataset, desc='Evaluating {}'.format(self.__class__)):
|
||||
evaluation_result = self.evaluate_sample(sample)
|
||||
evaluation_results.append(evaluation_result)
|
||||
|
||||
return evaluation_results
|
||||
|
||||
def calculate_score(self, evaluation_results: List[
|
||||
EvaluationResult], beta: float = 1) \
|
||||
-> EvaluationResult:
|
||||
"""
|
||||
        Returns the pii_precision, pii_recall and f_measure over all entities,
        together with the precision and recall of each entity
        :param evaluation_results: List of EvaluationResult
        :param beta: F measure beta value
|
||||
:return: EvaluationResult with precision, recall and f measures
|
||||
"""
|
||||
|
||||
# aggregate results
|
||||
all_results = sum([er.results for er in evaluation_results], Counter())
|
||||
|
||||
# compute pii_recall per entity
|
||||
entity_recall = {}
|
||||
entity_precision = {}
|
||||
if self.entities:
|
||||
entities = self.entities
|
||||
else:
|
||||
entities = list(
|
||||
set([x[0] for x in all_results.keys() if x[0] != 'O']))
|
||||
|
||||
for entity in entities:
|
||||
# all annotation of given type
|
||||
annotated = sum(
|
||||
[all_results[x] for x in all_results if x[0] == entity])
|
||||
predicted = sum(
|
||||
[all_results[x] for x in all_results if x[1] == entity])
|
||||
tp = all_results[(entity, entity)]
|
||||
|
||||
if annotated > 0:
|
||||
entity_recall[entity] = tp / annotated
|
||||
else:
|
||||
                entity_recall[entity] = np.nan
|
||||
|
||||
if predicted > 0:
|
||||
per_entity_tp = all_results[(entity, entity)]
|
||||
entity_precision[entity] = per_entity_tp / predicted
|
||||
else:
|
||||
                entity_precision[entity] = np.nan
|
||||
|
||||
# compute pii_precision and pii_recall
|
||||
annotated_all = sum(
|
||||
[all_results[x] for x in all_results if x[0] != 'O'])
|
||||
predicted_all = sum(
|
||||
[all_results[x] for x in all_results if x[1] != 'O'])
|
||||
if annotated_all > 0:
|
||||
pii_recall = sum([all_results[x] for x in all_results if
|
||||
(x[0] != 'O' and x[1] != 'O')]) / annotated_all
|
||||
else:
|
||||
            pii_recall = np.nan
|
||||
if predicted_all > 0:
|
||||
pii_precision = sum([all_results[x] for x in all_results if
|
||||
(x[0] != 'O' and x[1] != 'O')]) / predicted_all
|
||||
else:
|
||||
            pii_precision = np.nan
|
||||
# compute pii_f_beta-score
|
||||
pii_f_beta = self.f_beta(pii_precision, pii_recall, beta)
|
||||
|
||||
# aggregate errors
|
||||
errors = []
|
||||
for res in evaluation_results:
|
||||
if res.model_errors:
|
||||
errors.extend(res.model_errors)
|
||||
|
||||
evaluation_result = EvaluationResult(results=all_results, model_errors=errors)
|
||||
evaluation_result.pii_precision = pii_precision
|
||||
evaluation_result.pii_recall = pii_recall
|
||||
evaluation_result.entity_recall_dict = entity_recall
|
||||
evaluation_result.entity_precision_dict = entity_precision
|
||||
evaluation_result.pii_f = pii_f_beta
|
||||
|
||||
return evaluation_result
|
||||
|
||||
@staticmethod
|
||||
def precision(tp: int, fp: int) -> float:
|
||||
return tp / (tp + fp + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def recall(tp: int, fn: int) -> float:
|
||||
return tp / (tp + fn + 1e-100)
|
||||
|
||||
@staticmethod
|
||||
def f_beta(precision: float, recall: float, beta: float) -> float:
|
||||
"""
|
||||
Returns the F score for precision, recall and a beta parameter
|
||||
:param precision: a float with the precision value
|
||||
:param recall: a float with the recall value
|
||||
:param beta: a float with the beta parameter of the F measure,
|
||||
which gives more or less weight to precision
|
||||
vs. recall
|
||||
:return: a float value of the f(beta) measure.
|
||||
"""
|
||||
if np.isnan(precision) or np.isnan(recall) or (
|
||||
precision == 0 and recall == 0):
|
||||
return np.nan
|
||||
|
||||
return ((1 + beta ** 2) * precision * recall) / (
|
||||
((beta ** 2) * precision) + recall)
|
||||
|
||||
@staticmethod
|
||||
def align_input_samples_to_presidio_analyzer(input_samples: List[InputSample],
|
||||
entities_mapping: Dict[str, str],
|
||||
presidio_fields: List[str]=None) \
|
||||
-> List[InputSample]:
|
||||
"""
|
||||
Change input samples to conform with Presidio's entities
|
||||
:return: new list of InputSample
|
||||
"""
|
||||
|
||||
new_input_samples = input_samples.copy()
|
||||
|
||||
# Match entity names to Presidio's
|
||||
if not presidio_fields:
|
||||
presidio_fields = ['CREDIT_CARD', 'CRYPTO', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IBAN_CODE',
|
||||
'IP_ADDRESS', 'NRP', 'LOCATION', 'PERSON', 'PHONE_NUMBER', 'US_SSN']
|
||||
|
||||
# A list that will contain updated input samples,
|
||||
new_list = []
|
||||
|
||||
# Iterate on all samples
|
||||
for input_sample in new_input_samples:
|
||||
contains_presidio_field = False
|
||||
new_spans = []
|
||||
# Update spans to match Presidio's entity name
|
||||
for span in input_sample.spans:
|
||||
in_presidio_field = False
|
||||
if span.entity_type in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(span.entity_type)
|
||||
span.entity_type = new_name
|
||||
contains_presidio_field = True
|
||||
|
||||
# Add to new span list, if the span contains an entity relevant to Presidio
|
||||
new_spans.append(span)
|
||||
input_sample.spans = new_spans
|
||||
|
||||
# Update tags in case this sample has relevant entities for evaluation
|
||||
if contains_presidio_field:
|
||||
for i, tag in enumerate(input_sample.tags):
|
||||
has_prefix = '-' in tag
|
||||
if has_prefix:
|
||||
prefix = tag[:2]
|
||||
clean = tag[2:]
|
||||
else:
|
||||
prefix = ""
|
||||
clean = tag
|
||||
|
||||
if clean in entities_mapping.keys():
|
||||
new_name = entities_mapping.get(clean)
|
||||
input_sample.tags[i] = "{}{}".format(prefix, new_name)
|
||||
else:
|
||||
input_sample.tags[i] = 'O'
|
||||
|
||||
new_list.append(input_sample)
|
||||
return new_list
|
||||
|
||||
@staticmethod
|
||||
    def get_false_positives(errors: List[ModelError], entity=None):
|
||||
"""
|
||||
Get a list of all false positive errors in the results
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
|
||||
if entity:
|
||||
return [model_error for model_error in errors if
|
||||
model_error.error_type == 'FP' and model_error.prediction in entity]
|
||||
else:
|
||||
return [model_error for model_error in errors if model_error.error_type == 'FP']
|
||||
|
||||
@staticmethod
|
||||
    def get_false_negatives(errors: List[ModelError], entity=None):
|
||||
"""
|
||||
        Get a list of all false negative errors in the results (false negatives and wrong entity detection)
|
||||
"""
|
||||
if isinstance(entity, str):
|
||||
entity = [entity]
|
||||
if entity:
|
||||
return [model_error for model_error in errors if
|
||||
model_error.error_type != 'FP' and model_error.annotation in entity]
|
||||
else:
|
||||
return [model_error for model_error in errors if model_error.error_type != 'FP']
|
||||
|
||||
@staticmethod
|
||||
    def most_common_fp_tokens(errors: List[ModelError], n: int = 10, entity=None):
|
||||
"""
|
||||
Print the n most common false positive tokens (tokens thought to be an entity)
|
||||
"""
|
||||
fps = ModelEvaluator.get_false_positives(errors, entity)
|
||||
|
||||
tokens = [err.token.text for err in fps]
|
||||
from collections import Counter
|
||||
by_frequency = Counter(tokens)
|
||||
most_common = by_frequency.most_common(n)
|
||||
print("Most common false positive tokens:")
|
||||
print(most_common)
|
||||
print("Example sentence with each FP token:")
|
||||
for tok, val in most_common:
|
||||
with_tok = [err for err in fps if err.token.text == tok]
|
||||
print(with_tok[0].full_text)
|
||||
|
||||
@staticmethod
|
||||
    def most_common_fn_tokens(errors: List[ModelError], n: int = 10, entity=None):
|
||||
"""
|
||||
Print all tokens that were missed by the model, including an example of the full text in which they appear
|
||||
"""
|
||||
fns = ModelEvaluator.get_false_negatives(errors, entity)
|
||||
|
||||
fns_tokens = [err.token.text for err in fns]
|
||||
from collections import Counter
|
||||
by_frequency_fns = Counter(fns_tokens)
|
||||
        most_common_fns = by_frequency_fns.most_common(n)
|
||||
print(most_common_fns)
|
||||
for tok, val in most_common_fns:
|
||||
with_tok = [err for err in fns if err.token.text == tok]
|
||||
print("Token: {}, Annotation: {}, Full text: {}".format(with_tok[0].token, with_tok[0].annotation,
|
||||
with_tok[0].full_text))
|
||||
|
||||
@staticmethod
|
||||
    def get_errors_df(errors: List[ModelError], entity: List[str] = None, error_type: str = 'FN'):
|
||||
"""
|
||||
Get ModelErrors as pd.DataFrame
|
||||
"""
|
||||
if error_type == 'FN':
|
||||
filtered_errors = ModelEvaluator.get_false_negatives(errors, entity)
|
||||
elif error_type == 'FP':
|
||||
filtered_errors = ModelEvaluator.get_false_positives(errors, entity)
|
||||
else:
|
||||
raise ValueError("error_type should be either FP or FN")
|
||||
|
||||
if len(filtered_errors) == 0:
|
||||
print("No errors of type {} and entity {} were found".format(error_type,entity))
|
||||
return None
|
||||
|
||||
errors_df = pd.DataFrame.from_records([error.__dict__ for error in filtered_errors])
|
||||
metadata_df = pd.DataFrame(errors_df['metadata'].tolist())
|
||||
errors_df.drop(['metadata'], axis=1, inplace=True)
|
||||
new_errors_df = pd.concat([errors_df, metadata_df], axis=1)
|
||||
return new_errors_df
|
||||
|
||||
@staticmethod
|
||||
    def get_fps_dataframe(errors: List[ModelError], entity: List[str] = None):
|
||||
"""
|
||||
Get false positive ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelEvaluator.get_errors_df(errors, entity, error_type='FP')
|
||||
|
||||
@staticmethod
|
||||
    def get_fns_dataframe(errors: List[ModelError], entity: List[str] = None):
|
||||
"""
|
||||
Get false negative ModelErrors as pd.DataFrame
|
||||
"""
|
||||
return ModelEvaluator.get_errors_df(errors, entity, error_type='FN')
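The scoring logic in compare and calculate_score above reduces to counting (annotation, prediction) pairs per token and reading precision and recall off that counter. The following is a minimal standalone sketch of that idea, not the class itself; `to_io`, `count_pairs` and `entity_scores` are illustrative names, and `math.nan` is used here in place of NumPy's NaN.

```python
import math
from collections import Counter
from typing import List, Tuple


def to_io(tags: List[str]) -> List[str]:
    """Strip a two-character scheme prefix (B-, I-, L-, U-) from each tag."""
    return [tag[2:] if "-" in tag else tag for tag in tags]


def count_pairs(annotation: List[str], prediction: List[str]) -> Counter:
    """Count (annotated, predicted) tag pairs token by token."""
    return Counter(zip(to_io(annotation), to_io(prediction)))


def entity_scores(results: Counter, entity: str) -> Tuple[float, float]:
    """Return (precision, recall) for one entity; NaN when undefined."""
    annotated = sum(n for (ann, _), n in results.items() if ann == entity)
    predicted = sum(n for (_, pred), n in results.items() if pred == entity)
    tp = results[(entity, entity)]  # Counter returns 0 for missing keys
    precision = tp / predicted if predicted else math.nan
    recall = tp / annotated if annotated else math.nan
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; NaN when either input is NaN or both are zero."""
    if math.isnan(precision) or math.isnan(recall) or (precision == 0 and recall == 0):
        return math.nan
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


annotation = ["O", "B-PERSON", "I-PERSON", "O", "U-LOCATION"]
prediction = ["O", "B-PERSON", "O", "U-PERSON", "U-LOCATION"]
results = count_pairs(annotation, prediction)
p, r = entity_scores(results, "PERSON")
print(p, r, f_beta(p, r, beta=2))
```

In this example one of the two annotated PERSON tokens is found and one of the two predicted PERSON tokens is correct, so precision and recall are both 0.5.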
|
|
@ -0,0 +1,136 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, cannot explicitly reference it
|
||||
'''
|
||||
|
||||
from typing import List, Dict
|
||||
#
|
||||
from presidio_evaluator import ModelEvaluator, InputSample, span_to_tag
|
||||
#
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
||||
|
||||
#
|
||||
#
|
||||
class PresidioAnalyzer(ModelEvaluator):
|
||||
|
||||
def __init__(self, analyzer,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme="BIO",
|
||||
compare_by_io=True,
|
||||
score_threshold=0.4
|
||||
):
|
||||
"""
|
||||
Evaluation wrapper for the Presidio Analyzer
|
||||
:param analyzer: object of type AnalyzerEngine (from presidio-analyzer)
|
||||
"""
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
self.analyzer = analyzer
|
||||
|
||||
self.score_threshold = score_threshold
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.entities is None or len(self.entities) == 0:
|
||||
all_fields = True
|
||||
else:
|
||||
all_fields = None
|
||||
results = self.analyzer.analyze(sample.full_text, self.entities,
|
||||
language='en', all_fields=all_fields)
|
||||
starts = []
|
||||
ends = []
|
||||
scores = []
|
||||
tags = []
|
||||
#
|
||||
for res in results:
|
||||
#
|
||||
if res.score >= self.score_threshold:
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
#
|
||||
response_tags = span_to_tag(scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
tag=tags)
|
||||
return response_tags
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Reading dataset")
|
||||
input_samples = read_synth_dataset("../data/generated_size_30000_date_July 24 2019.txt")
|
||||
|
||||
print("Preparing dataset by aligning entity names to Presidio's entity names")
|
||||
|
||||
# Mapping between dataset entities and Presidio entities. Key: Dataset entity, Value: Presidio entity
|
||||
entities_mapping = {
|
||||
'PERSON': 'PERSON',
|
||||
'EMAIL': 'EMAIL_ADDRESS',
|
||||
'CREDIT_CARD': 'CREDIT_CARD',
|
||||
'FIRST_NAME': 'PERSON',
|
||||
'PHONE_NUMBER': 'PHONE_NUMBER',
|
||||
'BIRTHDAY': 'DATE_TIME',
|
||||
'DATE': 'DATE_TIME',
|
||||
'DOMAIN': 'DOMAIN',
|
||||
'CITY': 'LOCATION',
|
||||
'ADDRESS': 'LOCATION',
|
||||
'IBAN': 'IBAN_CODE',
|
||||
'URL': 'DOMAIN_NAME',
|
||||
'US_SSN': 'US_SSN',
|
||||
'IP_ADDRESS': 'IP_ADDRESS',
|
||||
'ORGANIZATION': 'ORG',
|
||||
'O': 'O'
|
||||
}
|
||||
|
||||
updated_samples = ModelEvaluator.align_input_samples_to_presidio_analyzer(input_samples,
|
||||
entities_mapping)
|
||||
|
||||
flatten = lambda l: [item for sublist in l for item in sublist]
|
||||
from collections import Counter
|
||||
|
||||
count_per_entity = Counter(
|
||||
[span.entity_type for span in flatten([input_sample.spans for input_sample in updated_samples])])
|
||||
|
||||
print("Evaluating samples")
|
||||
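    # NOTE: PresidioAnalyzer requires an AnalyzerEngine instance (from presidio-analyzer)
    # as its `analyzer` argument. `analyzer_engine` is a placeholder name that
    # must be created by the caller before this line.
    analyzer = PresidioAnalyzer(analyzer=analyzer_engine,
                                entities_to_keep=list(count_per_entity.keys()))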
|
||||
evaluated_samples = analyzer.evaluate_all(updated_samples)
|
||||
#
|
||||
print("Estimating metrics")
|
||||
    # calculate_score returns a single EvaluationResult, not a tuple
    score = analyzer.calculate_score(evaluation_results=evaluated_samples, beta=2.5)
    #
    print("Precision: {}".format(score.pii_precision))
    print("Recall: {}".format(score.pii_recall))
    print("F 2.5: {}".format(score.pii_f))
    print("Precision per entity: {}".format(score.entity_precision_dict))
    print("Recall per entity: {}".format(score.entity_recall_dict))
    #
    # model_errors holds ModelError objects; split them by error_type
    errors = score.model_errors
    FN_mistakes = [str(mistake) for mistake in errors if mistake.error_type == 'FN']
    FP_mistakes = [str(mistake) for mistake in errors if mistake.error_type == 'FP']
    other_mistakes = [str(mistake) for mistake in errors if mistake.error_type == 'Wrong entity']

    with open('../data/fn_30000.txt', 'w+', encoding='utf-8') as fn:
        fn.write('\n'.join(FN_mistakes))

    with open('../data/fp_30000.txt', 'w+', encoding='utf-8') as fp:
        fp.write('\n'.join(FP_mistakes))

    with open('../data/mistakes_30000.txt', 'w+', encoding='utf-8') as mistakes_file:
        mistakes_file.write('\n'.join(other_mistakes))
|
||||
|
||||
from pickle import dump
|
||||
|
||||
dump(evaluated_samples, open("../data/evaluated_samples_30000.pickle", "wb"))
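The entities_mapping dictionary above is applied to token tags by align_input_samples_to_presidio_analyzer, which keeps the scheme prefix and replaces unmapped entities with 'O'. A minimal sketch of that tag translation is shown below; `map_tags` is a hypothetical helper, and unlike the method in this commit it returns a new list rather than modifying the sample in place.

```python
from typing import Dict, List


def map_tags(tags: List[str], mapping: Dict[str, str]) -> List[str]:
    """Translate dataset entity names to target names, keeping B-/I-/L-/U- prefixes."""
    mapped = []
    for tag in tags:
        if "-" in tag:
            prefix, name = tag[:2], tag[2:]
        else:
            prefix, name = "", tag
        if name in mapping:
            mapped.append(prefix + mapping[name])
        else:
            # Entities without a counterpart are ignored during evaluation.
            mapped.append("O")
    return mapped


example_mapping = {"EMAIL": "EMAIL_ADDRESS", "FIRST_NAME": "PERSON", "O": "O"}
print(map_tags(["O", "B-EMAIL", "I-EMAIL", "U-FIRST_NAME", "B-TITLE"], example_mapping))
```

The 'O' to 'O' entry matters: without it, every non-entity token would still end up as 'O', but only through the fallback branch.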
|
|
@ -0,0 +1,133 @@
|
|||
import json
|
||||
from typing import List
|
||||
|
||||
import requests
|
||||
|
||||
from presidio_evaluator import InputSample, ModelEvaluator
|
||||
from presidio_evaluator.span_to_tag import span_to_tag, tokenize
|
||||
|
||||
ENDPOINT = "http://40.113.201.221:8080/api/v1/projects/test/analyze"
|
||||
|
||||
|
||||
class PresidioAPIEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self, endpoint=None, all_fields=False, entities_to_keep=None,
|
||||
verbose=False, labeling_scheme="IO", **kwargs):
|
||||
"""
|
||||
        Evaluator for the Presidio API as a system
|
||||
:param endpoint: url of presidio API
|
||||
:param all_fields: boolean, true if no entities filtering should take
|
||||
place
|
||||
:param entities_to_keep: list of entities to return if found
|
||||
:param labeling_scheme: BIO/IOB or BILOU
|
||||
:param verbose:
|
||||
:param kwargs:
|
||||
"""
|
||||
|
||||
if not endpoint:
|
||||
print(
|
||||
"Endpoint is missing. using default presidio API at {}".format(
|
||||
ENDPOINT))
|
||||
self.endpoint = ENDPOINT
|
||||
else:
|
||||
self.endpoint = endpoint
|
||||
|
||||
if not entities_to_keep and not all_fields:
|
||||
            raise ValueError("Please provide either a list of entities or "
                             "all_fields=true")
|
||||
|
||||
if all_fields:
|
||||
entities_to_keep = None
|
||||
super().__init__(verbose=verbose, entities_to_keep=entities_to_keep,
|
||||
labeling_scheme=labeling_scheme, **kwargs)
|
||||
|
||||
self.set_analyze_template(all_fields=all_fields,
|
||||
entities=entities_to_keep)
|
||||
|
||||
def predict(self, sample: InputSample):
|
||||
text = sample.full_text
|
||||
request = {"text": text,
|
||||
"analyzeTemplate": self.analyze_template
|
||||
}
|
||||
# Call presidio API
|
||||
r = requests.post(self.endpoint, json=request)
|
||||
starts = []
|
||||
ends = []
|
||||
tags = []
|
||||
|
||||
if r.status_code == 200:
|
||||
analyzer_results = json.loads(r.text)
|
||||
if self.verbose:
|
||||
print(analyzer_results)
|
||||
|
||||
if analyzer_results:
|
||||
for res in analyzer_results:
|
||||
if not res['location'].get('start'):
|
||||
res['location']['start'] = 0
|
||||
starts.append(res['location']['start'])
|
||||
ends.append(res['location']['end'])
|
||||
tags.append(res['field']['name'])
|
||||
|
||||
response_tags = span_to_tag(scheme=self.labeling_scheme,
|
||||
text=text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tag=tags)
|
||||
|
||||
elif r.status_code == 400 or r.text == "":
|
||||
if self.verbose:
|
||||
print("Status 400 received")
|
||||
response_tags = ['O' for token in sample.tokens]
|
||||
else:
|
||||
print("Error getting result from Presidio API")
|
||||
print("Request = {}".format(request))
|
||||
print("Response = {}".format(r.text))
|
||||
raise Exception(r)
|
||||
|
||||
return response_tags
|
||||
|
||||
def set_analyze_template(self, all_fields: bool, entities: List[str]):
|
||||
template = {
|
||||
"fields": [{"name": "EMAIL_ADDRESS"}, {"name": "IP_ADDRESS"},
|
||||
{"name": "US_DRIVER_LICENSE"},
|
||||
{"name": "US_ITIN"}, {"name": "US_SSN"},
|
||||
{"name": "DOMAIN_NAME"},
|
||||
{"name": "IBAN_CODE"}, {"name": "PERSON"},
|
||||
{"name": "PHONE_NUMBER"},
|
||||
{"name": "US_BANK_NUMBER"}, {"name": "CRYPTO"},
|
||||
{"name": "NRP"},
|
||||
{"name": "UK_NHS"}, {"name": "CREDIT_CARD"},
|
||||
{"name": "DATE_TIME"},
|
||||
{"name": "LOCATION"}, {"name": "US_PASSPORT"}]}
|
||||
|
||||
if all_fields:
|
||||
self.analyze_template = template
|
||||
return
|
||||
|
||||
requested_fields = []
|
||||
for entity in entities:
|
||||
for field in template['fields']:
|
||||
if entity == field['name']:
|
||||
requested_fields.append(field)
|
||||
|
||||
new_template = {'fields': requested_fields}
|
||||
|
||||
self.analyze_template = new_template
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example:
|
||||
text = "My siblings are Dan and magen"
|
||||
bilou_tags = ['O', 'O', 'O', 'U-PERSON', 'O', 'U-PERSON']
|
||||
presidio = PresidioAPIEvaluator(verbose=True, all_fields=True, compare_by_io=True)
|
||||
tokens = tokenize(text)
|
||||
s = InputSample(text, masked=None, spans=None)
|
||||
s.tokens = tokens
|
||||
s.tags = bilou_tags
|
||||
|
||||
evaluated_sample = presidio.evaluate_sample(s)
|
||||
    # calculate_score returns an EvaluationResult; beta defaults to 1
    score = presidio.calculate_score([evaluated_sample])
    print("Precision = {}\n"
          "Recall = {}\n"
          "F_1 = {}\n"
          "Errors = {}".format(score.pii_precision, score.pii_recall,
                               score.pii_f, score.model_errors))
|
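The predict method of PresidioAPIEvaluator above reads each result's location and field name from the JSON response and treats a missing start offset as 0. A minimal sketch of that parsing step in isolation follows; `spans_from_response` is a hypothetical helper and `sample_body` is an invented payload shaped only after the keys the evaluator reads, not a captured API response.

```python
import json
from typing import List, Tuple


def spans_from_response(body: str) -> Tuple[List[int], List[int], List[str]]:
    """Extract parallel lists of start offsets, end offsets and entity names."""
    starts, ends, tags = [], [], []
    results = json.loads(body) if body else []
    for res in results or []:
        location = res["location"]
        # A missing (or zero) start is treated as offset 0, as in predict.
        starts.append(location.get("start") or 0)
        ends.append(location["end"])
        tags.append(res["field"]["name"])
    return starts, ends, tags


# Invented example payload; real responses may carry more keys.
sample_body = json.dumps([
    {"field": {"name": "PERSON"}, "location": {"end": 3}},
    {"field": {"name": "PERSON"}, "location": {"start": 8, "end": 13}},
])
print(spans_from_response(sample_body))
```

Separating the parsing from the HTTP call makes it possible to check the offset handling without a running endpoint.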
|
@ -0,0 +1,82 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, therefore it cannot be referenced explicitly
|
||||
'''
|
||||
|
||||
import math
|
||||
from typing import List, Tuple, Dict
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
from presidio_evaluator.span_to_tag import span_to_tag
|
||||
|
||||
|
||||
class PresidioRecognizerEvaluator(ModelEvaluator):
|
||||
def __init__(self, recognizer, nlp_engine, entities_to_keep=None,
|
||||
with_nlp_artifacts=False, verbose=False, compare_by_io=True,
|
||||
):
|
||||
"""
|
||||
Evaluator for one recognizer
|
||||
        :param recognizer: An object of type EntityRecognizer (in presidio-analyzer)
|
||||
:param nlp_engine: An object of type NlpEngine, e.g. SpacyNlpEngine (in presidio-analyzer)
|
||||
"""
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose, compare_by_io=compare_by_io)
|
||||
self.withNlpArtifacts = with_nlp_artifacts
|
||||
self.recognizer = recognizer
|
||||
self.nlp_engine = nlp_engine
|
||||
|
||||
#
|
||||
def __make_nlp_artifacts(self, text: str):
|
||||
return self.nlp_engine.process_text(text, 'en')
|
||||
|
||||
#
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
nlpArtifacts = None
|
||||
if self.withNlpArtifacts:
|
||||
nlpArtifacts = self.__make_nlp_artifacts(sample.full_text)
|
||||
results = self.recognizer.analyze(sample.full_text, self.entities,
|
||||
nlpArtifacts)
|
||||
starts = []
|
||||
ends = []
|
||||
tags = []
|
||||
scores = []
|
||||
for res in results:
|
||||
if not res.start:
|
||||
res.start = 0
|
||||
starts.append(res.start)
|
||||
ends.append(res.end)
|
||||
tags.append(res.entity_type)
|
||||
scores.append(res.score)
|
||||
response_tags = span_to_tag(scheme=self.labeling_scheme,
|
||||
text=sample.full_text,
|
||||
start=starts,
|
||||
end=ends,
|
||||
tag=tags,
|
||||
tokens=sample.tokens,
|
||||
scores=scores,
|
||||
io_tags_only=self.compare_by_io)
|
||||
if len(sample.tags) == 0:
|
||||
            sample.tags = ['O' for word in response_tags]
|
||||
return response_tags
|
||||
|
||||
|
||||
def score_presidio_recognizer(recognizer, entities_to_keep, input_samples,
                              withNlpArtifacts=False, nlp_engine=None) \
        -> Tuple[float, float, Dict[str, float], Dict[str, float], float, list]:
    # nlp_engine is a required argument of PresidioRecognizerEvaluator;
    # it is only used when withNlpArtifacts is True
    model = PresidioRecognizerEvaluator(recognizer=recognizer,
                                        nlp_engine=nlp_engine,
                                        entities_to_keep=entities_to_keep,
                                        with_nlp_artifacts=withNlpArtifacts)
    evaluated_samples = model.evaluate_all(input_samples[:])
    # calculate_score returns a single EvaluationResult, not a tuple
    score = model.calculate_score(evaluated_samples, beta=2.5)
    precision = score.pii_precision
    recall = score.pii_recall
    ent_recall = score.entity_recall_dict
    ent_precision = score.entity_precision_dict
    fscore = score.pii_f
    mistakes = score.model_errors
    print("p={precision}, r={recall},f={f},"
          "entity recall={ent},entity precision={prec}".format(
              precision=precision,
              recall=recall,
              f=fscore,
              ent=ent_recall,
              prec=ent_precision))
    if math.isnan(precision):
        precision = 0
    return precision, recall, ent_recall, ent_precision, fscore, mistakes
|
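The evaluators above hand character-offset spans to span_to_tag, which labels each token whose start offset falls inside a span. A minimal sketch of that conversion for IO tags follows. It assumes a plain whitespace tokenizer as a stand-in for the spaCy tokenizer used in this commit, so token boundaries around punctuation will differ; `tokenize_with_offsets` and `spans_to_io_tags` are hypothetical names.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int]]:
    """Whitespace tokenizer returning (token, start offset) pairs."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def spans_to_io_tags(text: str, starts: List[int], ends: List[int],
                     tags: List[str]) -> List[str]:
    """Label each token with the first span containing its start offset, else 'O'."""
    io_tags = []
    for _, idx in tokenize_with_offsets(text):
        label = "O"
        for start, end, tag in zip(starts, ends, tags):
            if start <= idx < end:
                label = tag
                break
        io_tags.append(label)
    return io_tags


text = "My siblings are Dan and magen"
print(spans_to_io_tags(text, starts=[16, 24], ends=[19, 29], tags=["PERSON", "PERSON"]))
```

This sketch does not resolve overlapping spans by score; it simply takes the first match, so callers would need to order or de-duplicate spans beforehand.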
|
@ -0,0 +1,52 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_evaluator import ModelEvaluator, InputSample
|
||||
import spacy
|
||||
|
||||
from spacy.language import Language
|
||||
|
||||
from presidio_evaluator.data_objects import PRESIDIO_SPACY_ENTITIES
|
||||
|
||||
|
||||
class SpacyEvaluator(ModelEvaluator):
|
||||
|
||||
def __init__(self,
|
||||
model: spacy.language.Language = None,
|
||||
model_name: str = None,
|
||||
entities_to_keep: List[str] = None,
|
||||
verbose: bool = False,
|
||||
labeling_scheme: str = "BIO",
|
||||
compare_by_io: bool = True,
|
||||
translate_to_spacy_ents = True):
|
||||
super().__init__(entities_to_keep=entities_to_keep,
|
||||
verbose=verbose,
|
||||
labeling_scheme=labeling_scheme,
|
||||
compare_by_io=compare_by_io)
|
||||
|
||||
if model is None:
|
||||
if model_name is None:
|
||||
raise ValueError("Either model_name or model object must be supplied")
|
||||
self.model = spacy.load(model_name)
|
||||
else:
|
||||
self.model = model
|
||||
|
||||
self.translate_to_spacy_ents = translate_to_spacy_ents
|
||||
if self.translate_to_spacy_ents:
|
||||
            print("Translating entities using this dictionary: {}".format(PRESIDIO_SPACY_ENTITIES))
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
if self.translate_to_spacy_ents:
|
||||
sample.translate_input_sample_tags()
|
||||
|
||||
doc = self.model(sample.full_text)
|
||||
tags = self.get_tags_from_doc(doc)
|
||||
if len(doc) != len(sample.tokens):
|
||||
print("mismatch between input tokens and new tokens")
|
||||
|
||||
return tags
|
||||
|
||||
@staticmethod
|
||||
def get_tags_from_doc(doc):
|
||||
tags = [token.ent_type_ if token.ent_type_ != "" else "O" for token in doc]
|
||||
return tags
|
||||
|
|
@ -0,0 +1,164 @@
|
|||
from collections import namedtuple
|
||||
from typing import List
|
||||
|
||||
import spacy
|
||||
|
||||
loaded_spacy = {}
|
||||
|
||||
|
||||
def get_spacy(loaded_spacy=loaded_spacy, model_version="en_core_web_lg"):
|
||||
if model_version not in loaded_spacy:
|
||||
disable = ['vectors', 'textcat', 'ner']
|
||||
print("loading model {}".format(model_version))
|
||||
loaded_spacy[model_version] = spacy.load(model_version, disable=disable)
|
||||
return loaded_spacy[model_version]
|
||||
|
||||
|
||||
def tokenize(text, model_version="en_core_web_lg"):
|
||||
return get_spacy(model_version=model_version)(text)
|
||||
|
||||
|
||||
def _get_detailed_tags(scheme, cur_tags):
|
||||
"""
|
||||
Replaces IO tags (e.g. PERSON PERSON) with IOB/BIO/BILOU tags
|
||||
:param cur_tags:
|
||||
:param scheme:
|
||||
:return:
|
||||
"""
|
||||
|
||||
if all([tag == 'O' for tag in cur_tags]):
|
||||
return cur_tags
|
||||
|
||||
return_tags = []
|
||||
if len(cur_tags) == 1:
|
||||
if scheme == "BILOU":
|
||||
return_tags.append("U-{}".format(cur_tags[0]))
|
||||
else:
|
||||
return_tags.append("I-{}".format(cur_tags[0]))
|
||||
elif len(cur_tags) > 0:
|
||||
tg = cur_tags[0]
|
||||
for j in range(0, len(cur_tags)):
|
||||
if j == 0:
|
||||
return_tags.append("B-{}".format(tg))
|
||||
elif j == len(cur_tags) - 1:
|
||||
if scheme == "BILOU":
|
||||
return_tags.append("L-{}".format(tg))
|
||||
else:
|
||||
return_tags.append("I-{}".format(tg))
|
||||
else:
|
||||
return_tags.append("I-{}".format(tg))
|
||||
return return_tags
|
||||
|
||||
|
||||
def _sort_spans(start, end, tag, score):
|
||||
if len(start) > 0:
|
||||
tpl = [(a, b, c, d) for a, b, c, d in sorted(zip(start, end, tag, score), key=lambda pair: pair[0])]
|
||||
start, end, tag, score = [[x[i] for x in tpl] for i in range(len(tpl[0]))]
|
||||
return start, end, tag, score
|
||||
|
||||
|
||||
def _handle_overlaps(start, end, tag, score):
|
||||
start, end, tag, score = _sort_spans(start, end, tag, score)
|
||||
if len(start) == 0:
|
||||
return start, end, tag, score
|
||||
max_end = max(end)
|
||||
index = min(start)
|
||||
number_of_spans = len(start)
|
||||
i = 0
|
||||
while i < number_of_spans-1:
|
||||
for j in range(i+1,number_of_spans):
|
||||
# Span j intersects with span i
|
||||
if start[i] <= start[j] <= end[i]:
|
||||
# i's score is higher, remove intersecting part
|
||||
if score[i] > score[j]:
|
||||
# j is contained within i but has lower score, remove
|
||||
                    if start[i] <= end[j] <= end[i]:
|
||||
score[j] = 0
|
||||
# else, j continues after i ended:
|
||||
else:
|
||||
start[j] = end[i] + 1
|
||||
# j's score is higher, break i
|
||||
else:
|
||||
# If i finishes after j ended, split i
|
||||
if end[j] < end[i]:
|
||||
# create new span at the end
|
||||
start.append(end[j] + 1)
|
||||
end.append(end[i])
|
||||
score.append(score[i])
|
||||
tag.append(tag[i])
|
||||
number_of_spans += 1
|
||||
# truncate the current i to end at start(j)
|
||||
end[i] = start[j] - 1
|
||||
# else, i finishes before j ended. truncate i
|
||||
else:
|
||||
end[i] = start[j] - 1
|
||||
|
||||
i += 1
|
||||
start, end, tag, score = _sort_spans(start, end, tag, score)
|
||||
return start, end, tag, score
|
||||
|
||||
|
||||
def span_to_tag(scheme: str,
|
||||
text: str,
|
||||
start: List[int],
|
||||
end: List[int],
|
||||
tag: List[str],
|
||||
scores: List[float] = None,
|
||||
tokens: List[spacy.tokens.Token] = None,
|
||||
io_tags_only=False) -> List[str]:
|
||||
"""
|
||||
Turns a list of start and end values with corresponding labels, into a NER
|
||||
tagging (BILOU,BIO/IOB)
|
||||
:param scheme: labeling scheme, either BILOU, BIO/IOB or IO
|
||||
:param text: input text
|
||||
:param tokens: text tokenized to tokens
|
||||
:param start: list of indices where entities in the text start
|
||||
:param end: list of indices where entities in the text end
|
||||
:param tag: list of entity names
|
||||
:param scores: score of tag (confidence)
|
||||
:param io_tags_only: Whether to return only I and O tags
|
||||
:return: list of strings, representing either BILOU or BIO for the input
|
||||
"""
|
||||
|
||||
if not scores:
|
||||
# assume all scores are of equal weight
|
||||
scores = [0.5 for start in start]
|
||||
|
||||
start, end, tag, scores = _handle_overlaps(start, end, tag, scores)
|
||||
|
||||
if not tokens:
|
||||
tokens = tokenize(text)
|
||||
|
||||
io_tags = []
|
||||
for token in tokens:
|
||||
found = False
|
||||
for span_index in range(0, len(start)):
|
||||
if start[span_index] <= token.idx < end[span_index]:
|
||||
io_tags.append(tag[span_index])
|
||||
found = True
|
||||
break
|
||||
|
||||
if not found:
|
||||
io_tags.append("O")
|
||||
|
||||
if io_tags_only or scheme == "IO":
|
||||
return io_tags
|
||||
|
||||
# Set tagging based on scheme (BIO/IOB or BILOU)
|
||||
current_tag = ""
|
||||
span_index = 0
|
||||
changes = []
|
||||
for io_tag in io_tags:
|
||||
if io_tag != current_tag:
|
||||
changes.append(span_index)
|
||||
span_index += 1
|
||||
current_tag = io_tag
|
||||
changes.append(len(io_tags))
|
||||
|
||||
new_return_tags = []
|
||||
for i in range(len(changes) - 1):
|
||||
new_return_tags.extend(
|
||||
_get_detailed_tags(scheme=scheme,
|
||||
cur_tags=io_tags[changes[i]:changes[i + 1]]))
|
||||
|
||||
return new_return_tags
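The last step of the tagging code above, expanding per-token IO tags into BIO or BILOU tags one run at a time, can be sketched in isolation. This is a minimal illustration, not the repository's implementation: `io_to_detailed_tags` is a hypothetical name, and it assumes the same prefix rules as the code above (a single-token entity becomes `U-` under BILOU and `I-` otherwise).

```python
from itertools import groupby
from typing import List


def io_to_detailed_tags(io_tags: List[str], scheme: str = "BILOU") -> List[str]:
    """Expand IO tags into BIO or BILOU tags (hypothetical helper for illustration)."""
    detailed = []
    # groupby yields one group per run of consecutive identical tags
    for label, run in groupby(io_tags):
        run_length = len(list(run))
        if label == "O":
            # tokens outside any entity stay untouched
            detailed.extend(["O"] * run_length)
        elif run_length == 1:
            # single-token entity: U- under BILOU, I- under any other scheme
            prefix = "U-" if scheme == "BILOU" else "I-"
            detailed.append(prefix + label)
        else:
            # multi-token entity: B- first, I- in the middle, L- (BILOU) or I- last
            last_prefix = "L-" if scheme == "BILOU" else "I-"
            detailed.append("B-" + label)
            detailed.extend(["I-" + label] * (run_length - 2))
            detailed.append(last_prefix + label)
    return detailed
```

As in the code above, two adjacent entities that share a label form a single run, so they are tagged as one entity.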
|
|
@ -0,0 +1,79 @@
|
|||
from collections import defaultdict
import random
import numpy as np
from typing import List, Dict
import json

from presidio_evaluator import InputSample


def split_dataset(dataset : List[InputSample], ratios):
    """
    Splits a provided dataset into n groups, by the Template# attribute in each sample's metadata
    :param dataset: List of InputSamples to be split
    :param ratios: list of percentages. The length of the list is the number of splits returned,
        e.g. [0.7,0.2,0.1] for train, test, validation
    """
    splits = []
    remaining_dataset = dataset
    remaining_ratio = 1.0

    if sum(ratios) > 1 or sum(ratios) < 0.999:
        raise ValueError("Ratios should sum to 1 and be in (0,1]")

    for ratio in ratios:
        if 1 >= ratio > 0:
            first_templates, second_templates = split_by_template(remaining_dataset, ratio/remaining_ratio)
            first_split = get_samples_by_pattern(remaining_dataset, first_templates)
            second_split = get_samples_by_pattern(remaining_dataset, second_templates)
            splits.append(first_split)
            remaining_dataset = second_split
            remaining_ratio -= ratio
        else:
            raise ValueError("Ratio needs to be in (0,1]")

    return tuple(splits)


def group_by_template(dataset: List[InputSample]) -> Dict[str, List[InputSample]]:
    """
    Creates a dict of key = template ID and value = List[InputSamples] for this template id
    """
    samples_pattern_tup = [(sample.metadata["Template#"],sample) for sample in dataset]

    group_by_template = defaultdict(list)
    for sample in samples_pattern_tup:
        group_by_template[sample[0]].append(sample[1])

    return group_by_template


def split_by_template(input_samples: List[InputSample], train_pct: float = 0.7):
    """
    Splits a dataset of type List[InputSample] into a tuple of train template IDs and test template IDs
    """
    samples_grpd = group_by_template(input_samples)

    templates = np.array(list(samples_grpd.keys()))
    train_ind = set(random.sample(range(len(templates)), round(train_pct * len(templates))))

    test_ind = set(range(len(templates))) - train_ind

    return templates[list(train_ind)], templates[list(test_ind)]


def get_samples_by_pattern(input_samples, patterns_list):
    samples_grpd = group_by_template(input_samples)
    dataset = []
    for pattern in patterns_list:
        dataset.extend(samples_grpd[pattern])
    random.shuffle(dataset)

    return dataset


def save_to_json(samples, output_file):
    examples_dict = [example.to_dict() for example in samples]

    with open("{}".format(output_file), 'w+', encoding='utf-8') as f:
        json.dump(examples_dict, f, ensure_ascii=False, indent=4)
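The point of splitting by template rather than by sample is that every sample generated from one template ends up in the same split, so no template leaks from train into test. A minimal standalone sketch of that idea follows. It uses plain dicts instead of `InputSample`, and the `template_id` key, the `seed` parameter, and the function name are assumptions made for illustration only.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_template_id(samples: List[dict],
                         train_pct: float = 0.7,
                         seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Split samples so that each template appears in exactly one split (illustrative)."""
    # group samples by the (hypothetical) "template_id" key
    grouped: Dict[object, List[dict]] = defaultdict(list)
    for sample in samples:
        grouped[sample["template_id"]].append(sample)

    # sample whole templates, not individual samples
    template_ids = sorted(grouped)
    rng = random.Random(seed)
    train_ids = set(rng.sample(template_ids, round(train_pct * len(template_ids))))

    train = [s for t in template_ids if t in train_ids for s in grouped[t]]
    test = [s for t in template_ids if t not in train_ids for s in grouped[t]]
    return train, test
```

Because the ratio is applied to the number of templates, the sample counts of the two splits only match the ratio when templates have similar numbers of samples.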
|
|
@ -0,0 +1,14 @@
|
|||
[pytest]
testpaths = .
markers =
    slow: marks tests as slow (deselect with '-m "not slow"')
    inconclusive: marks tests as those that may sometimes fail due to threshold
    none: regular tests
    serial

# Commented out to avoid performance test failures. Uncomment when debugging tests.
#log_cli = true
#log_level = DEBUG

filterwarnings =
    ignore::DeprecationWarning
|
|
@ -0,0 +1,17 @@
|
|||
spacy
requests==2.22.0
numpy==1.17.2
jupyter==1.0.0
pandas==0.25.1
tqdm
haikunator
schwifty
faker
sklearn
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz
regex
#azureml
#azureml-sdk
#flair
sklearn_crfsuite
pytest
|
|
@ -0,0 +1,38 @@
|
|||
from setuptools import setup
import os.path
# read the contents of the README file
from os import path

this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
    # print(long_description)
__version__ = ""

with open(os.path.join(this_directory, 'VERSION')) as version_file:
    __version__ = version_file.read().strip()

setup(
    name='presidio-evaluator',
    long_description=long_description,
    long_description_content_type='text/markdown',
    version=__version__,
    packages=['presidio_evaluator', 'presidio_evaluator.data_generator'
              ],
    url='https://www.github.com/microsoft/presidio',
    license='MIT',
    description='PII dataset generator, model evaluator for Presidio and PII data in general',
    install_requires=[
        'spacy>=2.2.0',
        'requests==2.22.0',
        'numpy==1.16.4',
        'pandas>=0.24.2',
        'tqdm>=4.32.1',
        'jupyter>=1.0.0',
        'pytest>=4.6.2',
        'haikunator',
        'schwifty',
        'faker',
        'sklearn_crfsuite']

)
|
|
@ -0,0 +1,31 @@
|
|||
import pytest

# pytest configuration file
# the configuration distinguishes the following kinds of tests:
# * unmarked tests run on every pytest execution
# * tests with large datasets or long running time are marked as "slow" and have to be run with pytest --runslow
# * tests with inconclusive results are marked as "inconclusive" and have to be run with pytest --runinconclusive
# * tests can be both slow and inconclusive and have to be run with pytest --runslow --runinconclusive

def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )
    parser.addoption(
        "--runinconclusive", action="store_true", default=False, help="run inconclusive tests"
    )


def pytest_collection_modifyitems(items, config):
    if not config.getoption("--runslow"):
        skip_slow = pytest.mark.skip(reason="need --runslow option to run")
        for item in items:
            if "slow" in item.keywords:
                item.add_marker(skip_slow)

    if not config.getoption("--runinconclusive"):
        skip_inconclusive = pytest.mark.skip(reason="need --runinconclusive option to run")
        for item in items:
            if "inconclusive" in item.keywords:
                item.add_marker(skip_inconclusive)
|
|
@ -0,0 +1,18 @@
|
|||
WORD,PARSING
a,()
a-,()
a 1,()
a b c,()
a cappella,()
a fortiori,()
a mensa et thoro,()
a posteriori,()
a priori,()
aam,(n.)
aard-vark,(n.)
aard-wolf,(n.)
aaronic,(a.)
aaronical,(a.)
aaron's rod,()
ab,(n.)
ab-,()
|
|
|
@ -0,0 +1,101 @@
|
|||
Number,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,State,StateFull,ZipCode,Country,CountryFull,EmailAddress,Username,Password,BrowserUserAgent,TelephoneNumber,TelephoneCountryCode,MothersMaiden,Birthday,Age,TropicalZodiac,CCType,CCNumber,CVV2,CCExpires,NationalID,UPS,WesternUnionMTCN,MoneyGramMTCN,Color,Occupation,Company,Vehicle,Domain,BloodType,Pounds,Kilograms,FeetInches,Centimeters,GUID,Latitude,Longitude
1,female,Czech,Mrs.,Marie,J,Hamanová,"P.O. Box 255",Kangerlussuaq,QE,Qeqqata,3910,GL,Greenland,MarieHamanova@armyspy.com,Wasco1982,eiZookooB7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","84 23 30",299,Kubíková,3/29/1982,37,Aries,MasterCard,5545634085461876,511,1/2020,,"1Z 789 686 82 8979 914 6",6945116246,34746079,Purple,"Surveillance officer","Simple Solutions","1995 Zastava 65",MarathonDancing.gl,O+,217.6,98.9,"5' 5""",164,6781b04d-7b5f-4c1a-bceb-b953e6ef70d7,77.377518,-67.015569
2,female,French,Ms.,Patricia,G,Desrosiers,"Avenida Noruega 42","Vila Real",VR,"Vila Real",5000-047,PT,Portugal,PatriciaDesrosiers@superrito.com,Fultses,eb6soCha4ae,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","21 259 903 5696",351,Daviau,2/28/1956,63,Pisces,MasterCard,5317250628844522,874,3/2022,,"1Z V38 747 73 7311 832 9",7398998399,18093674,Blue,"Vascular technologist","Formula Gray","2006 Lexus GS",LostMillions.com.pt,O+,118.1,53.7,"5' 0""",152,2b2e7e1a-855f-4089-a570-c0af2381a6d6,41.274541,-7.876658
3,female,American,Ms.,Debra,O,Neal,"1659 Hoog St",Brakpan,GA,Gauteng,1553,ZA,"South Africa",DebraONeal@fleckens.hu,Cognoy,sha3Sohzee,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/19.4.0.2397 Yowser/2.5 Safari/537.36","082 490 1693",27,Barrett,6/11/1957,62,Gemini,Visa,4916429195104076,315,5/2020,5706114632083,"1Z 061 1E5 71 3400 427 4",6186449862,58702271,Blue,"Information architect librarian",Dahlkemper's,"1993 Honda Prelude",MediumTube.co.za,A+,120.1,54.6,"5' 4""",162,2ef83f4c-3102-4f79-839d-c75bf6a06f0a,-26.22096,28.283398
4,male,French,Mr.,Peverell,C,Racine,"183 Epimenidou Street",Limassol,LI,Limassol,3041,CY,"Cyprus (Anglicized)",PeverellRacine@teleworm.us,Restlys,Aekie7ohs,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","25 470375",357,Grondin,6/14/1962,57,Gemini,Visa,4485421519226702,653,5/2023,,"1Z F44 91V 14 3570 491 2",0850016444,52534088,Blue,"Desk clerk",Quickbiz,"2008 Infiniti G35",ImproveLook.com.cy,B+,142.1,64.6,"5' 9""",174,bfb4be71-3710-4ffa-baaf-5af6aa4b339e,41.30296,-72.989066
5,female,Slovenian,Mrs.,Iolanda,S,Tratnik,"Karu põik 61",Pärnu,PR,Pärnumaa,80098,EE,Estonia,IolandaTratnik@teleworm.us,Trely1962,jeiziejohH3ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","445 6271",372,Korbun,1/23/1962,57,Aquarius,Visa,4532820383285186,893,4/2024,,"1Z 060 418 64 7516 574 4",1178606881,74806227,Purple,"Production assistant","Dubrow's Cafeteria","2007 Fiat Idea",PostTan.com.ee,O+,141.5,64.3,"5' 3""",160,0cbb7bf3-466f-4df6-bda3-9c9fe7bfc5c1,58.293395,24.434851
6,male,Italian,Mr.,Domenico,D,Pisano,"Via Pisanelli 104",Traversara,RA,Ravenna,48020,IT,Italy,DomenicoPisano@armyspy.com,Hatelt,lohhee8Zah,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36","0312 0828589",39,Conti,6/1/1979,40,Gemini,Visa,4532872142737056,237,6/2023,WK48391724,"1Z 175 1F5 29 1963 168 1",7448393148,31617424,Blue,"Professional scout",Littler's,"1998 Nissan Serena",HardDriveBlog.it,O+,247.5,112.5,"6' 0""",182,f4feeb24-e3b1-4d99-9c71-e8c6a95762fe,44.588081,12.055283
7,male,Greenland,Mr.,Pavia,A,Rosing,"29 Wattle St","King William's Town",EC,"Eastern Cape",5601,ZA,"South Africa",PaviaRosing@superrito.com,Thattere,aiCheed7tie,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 YaBrowser/19.4.0.2397 Yowser/2.5 Safari/537.36","082 692 3461",27,Lennert,5/5/1937,82,Taurus,Visa,4539980160229196,256,10/2020,3705057790082,"1Z 507 770 52 3012 473 1",1256867146,65720899,Green,"Chemical engineer","Pup 'N' Taco","2003 Peugeot Partner",RecruitSuit.co.za,O-,192.1,87.3,"6' 0""",182,b6b75cf9-dfbf-424d-a03c-90cdd859e9eb,-32.787712,27.343649
8,female,French,Mrs.,Ormazd,M,Jomphe,"Mattenstrasse 108",Sissach,,,4450,CH,Switzerland,OrmazdJomphe@rhyta.com,Deace1999,oochui5Eboe5T,"Mozilla/5.0 (Windows NT 6.1; rv:66.0) Gecko/20100101 Firefox/66.0","061 947 83 90",41,Busson,1/14/1999,20,Capricorn,Visa,4556603638439886,691,6/2024,,"1Z 091 192 83 9348 168 6",4380386435,24628087,Purple,"Clinical psychologist","Linens 'n Things","1996 Plymouth Neon",CyclingMonthly.ch,O+,115.3,52.4,"5' 1""",154,e5858bdb-9173-4991-9857-4e09b61e4e16,47.520557,7.863831
9,male,Norwegian,Mr.,Severin,L,Akhtar,"251 Charilaou Trikoupi Str.",Pigenia,NI,Nicosia,2962,CY,"Cyprus (Anglicized)",SeverinAkhtar@rhyta.com,Heremer,cieCipua8L,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 586625",357,Mathisen,4/30/1960,59,Taurus,MasterCard,5230940651584482,785,10/2022,,"1Z W81 228 52 4912 032 1",8897778249,55031915,Green,"Pump operator","Fragrant Flower Lawn Services","2005 Dodge Nitro",GainPain.com.cy,B+,155.3,70.6,"6' 0""",182,64383596-6dc8-4b77-9476-c1a8ef23ffc6,41.335894,-72.908321
10,female,Greenland,Mrs.,Margrethe,H,Kristiansen,"94 boulevard Amiral Courbet",ORLÉANS,CE,Centre,45100,FR,France,MargretheKristiansen@gustr.com,Theirturavid,Aiv4ohwae,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",02.64.62.52.37,33,Berthelsen,12/13/1979,40,Sagittarius,MasterCard,5306020150102745,915,12/2024,"2791269679323 49","1Z 987 E42 01 7982 218 2",0231937615,18687876,Purple,"Systems software engineer","Independent Wealth Management","2012 Porsche 911",ToyProtection.fr,A+,200.2,91.0,"5' 3""",159,ae7b1a56-0d6d-46ba-895e-75ee10482858,47.850047,1.875252
11,female,Hispanic,Mrs.,Myrna,G,Feliciano,"Männi 12",Mustoja,LV,Lääne-Virumaa,45429,EE,Estonia,MyrnaFelicianoCortes@superrito.com,Wilthe84,Chu6shiRees,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","329 3803",372,Cortés,1/23/1984,35,Aquarius,MasterCard,5402842306504596,788,7/2020,,"1Z 599 666 46 6430 018 3",8678427166,90687114,Blue,"Radiologic technician",Monmax,"2003 Mitsubishi Lancer",SharkStatistics.com.ee,O+,212.7,96.7,"5' 5""",164,4e8b5c6c-0b04-43c1-ad5b-2fc92365455c,59.638357,26.059683
12,male,Czech,Mr.,Michal,E,Horký,"Algade 33",Guldborg,SJ,"Region Sjælland",4862,DK,Denmark,MichalHorky@rhyta.com,Fiect1941,oxep7Aev,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",28-64-27-85,45,Siváková,6/19/1941,78,Gemini,MasterCard,5108116586316493,376,2/2021,190641-4941,"1Z 85E W86 50 8027 647 6",4547979244,81431928,Orange,"Dental assistant",Pointers,"1995 Daihatsu Rocky",ExShows.dk,B+,165.0,75.0,"5' 10""",177,a2aa0138-07d3-41e3-b3b7-455d9854e31f,54.815396,11.760822
13,male,French,Mr.,Donat,M,Lespérance,"96 rue de Penthièvre",PONTOISE,IL,Île-de-France,95000,FR,France,DonatLesperance@gustr.com,Sirep1950,re8ZieK4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",03.34.08.71.06,33,Dodier,11/12/1950,69,Scorpio,Visa,4556904288472270,512,9/2022,"1501143313127 93","1Z 070 9Y5 64 7265 236 0",1770634719,11974448,Black,Neurosonographer,"American Appliance","2009 Kia Cerato",BasketballBiz.fr,A+,180.8,82.2,"5' 7""",170,78b497ed-6d7d-4e5f-8150-1155abf9716e,48.977559,1.976986
14,female,"Japanese (Anglicized)",Ms.,Yuuka,M,Shimasaki,"Mjövattnet 1",NYLAND,,,"870 52",SE,Sweden,YuukaShimasaki@cuvox.de,Coun1976,ohleaT4ae,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",0613-9040212,46,Kimura,2/11/1976,43,Aquarius,MasterCard,5457403889440023,903,5/2021,760211-4105,"1Z 975 450 29 2316 562 4",8652144021,72590241,Blue,"Credit checker",Elek-Tek,"2006 Ford Territory",ClassInsider.se,A-,151.8,69.0,"5' 5""",165,df1a2d57-31d8-4a71-8ad6-cb687ee250d4,62.773416,17.853904
15,male,Swedish,Mr.,Wiktor,H,Ek,"Norðurbraut 27",Reykjavík,,,112,IS,Iceland,WiktorEk@rhyta.com,Boally,aigo2OoPhoi,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","434 6815",354,Göransson,5/4/1945,74,Taurus,Visa,4532063379896779,489,4/2024,,"1Z 278 965 48 6106 268 1",1158883056,79465382,White,"Legal secretary","Handy Andy Home Improvement Center","2014 Jaguar XF",SSLAlert.is,O+,227.0,103.2,"5' 10""",178,d39f58d7-7bb7-4f77-9956-9b505f8a4cc8,64.187422,-21.93344
16,female,Slovenian,Ms.,Polona,H,Ranković,"Õli 68",Himmiste,PL,Põlvamaa,64204,EE,Estonia,PolonaRankovic@rhyta.com,Whou1985,eeZae5oech,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36","798 9719",372,Orehek,7/18/1985,34,Cancer,MasterCard,5433975988486550,723,7/2022,,"1Z A77 0E9 59 0580 345 8",0984960146,75114603,Blue,"Personal trainer","Budget Tapes & Records","1995 Lancia Delta",TodayAlert.com.ee,A+,160.2,72.8,"5' 5""",165,c8e80b15-7ff4-4b21-bebf-5737d8133bdc,57.990467,27.141455
17,male,Scottish,Mr.,Ivan,M,King,"Pachergasse 64",BÜSCHENDORF,ST,Styria,8786,AT,Austria,IvanKing@einrot.com,Anempon,XooJoh0se5sh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0660 475 13 89",43,Watson,9/15/1994,25,Virgo,MasterCard,5257015834586726,714,12/2020,,"1Z 37A 329 60 9892 939 6",1176881962,86098320,Green,"Placement counselor",Opticomp,"2000 Citroen C 15",InvestmentBrowse.at,A+,130.2,59.2,"5' 7""",170,58aec14a-fc35-4e19-a61f-b8f170a9ec7d,47.492881,14.377639
18,female,Finnish,Mrs.,Nelma,M,Grönholm,"Rostsestraat 222",Froidchapelle,WLX,Luxembourg,6440,BE,Belgium,NelmaGronholm@jourrapide.com,Obect1946,aeJ6OhneiF3t,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0479 50 54 03",32,Pelkonen,9/15/1946,73,Virgo,MasterCard,5208559023644291,187,12/2021,,"1Z 416 A35 89 5065 644 8",4484163657,51232002,Green,"Claims adjuster","Starship Tapes & Records","2010 Land Rover Defender",MobileKicks.be,A+,216.5,98.4,"5' 7""",169,2f575438-1e23-4123-ab60-990725e8c08b,50.072755,4.344408
19,female,Hungarian,Ms.,Tünde,F,Hoffmann,"Via Nazario Sauro 112","Cusano Milanino",MI,Milano,20095,IT,Italy,HoffmannTunde@armyspy.com,Preacces,aob4eiteiL,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","0352 9353380",39,Bagi,11/1/1951,68,Scorpio,MasterCard,5559688562190559,258,7/2023,DA75119938,"1Z 959 98A 67 7929 896 3",2123939613,78515387,Orange,"Apparel worker","Hugh M. Woods","2005 BMW X5",CrabDealer.it,O+,186.1,84.6,"4' 11""",151,0a14766e-42f5-4943-ae10-884bcadf43f8,45.644863,9.128014
20,female,Finnish,Ms.,Riitta,N,Hahl,"ul. Elbląska 97",Olsztyn,,,10-672,PL,Poland,RiittaHahl@einrot.com,Thill1954,Aiwar2ooh1,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","67 788 02 83",48,Linna,3/2/1954,65,Pisces,MasterCard,5150663209146952,542,2/2021,54030268860,"1Z 626 853 00 4461 590 4",4657047492,90091015,Purple,"Medical secretary",Edwards,"2012 Toyota Prius",DustingSprays.pl,O+,187.7,85.3,"5' 7""",170,e7825dc5-9fdf-478c-ba72-ab25879699f1,53.768852,20.536572
21,male,Dutch,Mr.,Harwin,R,Galesloot,"Glynitveien 218",SKI,,,1400,NO,Norway,HarwinGalesloot@einrot.com,Saffive,wohHaix5fa,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","914 54 925",47,Ramakers,9/28/1994,25,Libra,MasterCard,5580997921941047,961,4/2022,,"1Z 761 410 55 6702 728 6",9467674600,08790710,Blue,"Electric motor repairer","Buena Vista Realty Service","2002 ZAZ Slavuta",SwankBlog.no,B+,217.1,98.7,"5' 6""",167,4029654d-5c8f-407e-b003-72bde9f593a1,59.814132,10.87172
22,male,Hispanic,Mr.,Azarías,A,Segovia,"Via Francesco Girardi 49","Carmignano Di Brenta",PD,Padova,35010,IT,Italy,AzariasSegoviaNava@cuvox.de,Portalime,duteiVeev1,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","0394 9130281",39,Nava,12/24/1949,70,Capricorn,MasterCard,5146736492498053,173,1/2022,SZ30479384,"1Z 931 W28 91 2882 876 3",2747014433,22062098,Blue,"Administrative office manager","Total Serve","2005 BMW 325",EmployeeVerified.it,B+,140.4,63.8,"6' 0""",184,916cbe50-55b2-4a11-acf4-b8d8d9cc9668,45.432966,12.000719
23,male,Hungarian,Mr.,Adelbert,A,Kuncz,"Turjaška 115","Rečica ob Savinji",,,3332,SI,Slovenia,KunczAdelbert@fleckens.hu,Firseten,woh3ejoo2Ei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",031-365-314,386,Pethô,10/17/1976,43,Libra,MasterCard,5431971800958886,663,4/2022,,"1Z 5V7 161 85 3863 932 0",7652480699,67841291,Blue,"Extruding, forming, pressing, and compacting machine setter","Red Owl","2002 Citroen C-Airdream",BankingDetective.si,O+,193.6,88.0,"6' 1""",186,6653d52f-870f-4956-8b1a-c1463007f387,46.357704,14.932894
24,female,England/Wales,Ms.,Charlie,S,Campbell,"Rue du Château 414",Limont,WLG,Liège,4357,BE,Belgium,CharlieCampbell@gustr.com,Lithatinquir,Kae4aetah,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0489 33 97 26",32,Tucker,10/16/1987,32,Libra,MasterCard,5268537479220623,072,1/2023,,"1Z 3V4 354 38 1677 342 8",0119239854,24858321,Purple,"Dry-cleaning worker","Red Robin Stores","2004 Audi S4",PinkCheek.be,O+,119.5,54.3,"5' 8""",173,f3a846e6-ec1c-4ccc-bf51-f66bb64ff559,50.616062,5.247283
25,female,American,Ms.,Thelma,K,Mitchell,"Väike-Laagri 80",Orissaare,SA,Saaremaa,94691,EE,Estonia,ThelmaKMitchell@einrot.com,Trind1979,eeThooy3ieph,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","455 6186",372,Rumbaugh,5/23/1979,40,Gemini,MasterCard,5252527082023181,249,9/2024,,"1Z 001 306 31 4745 481 7",4024903278,57842179,Black,"Camera repairer","Omni Superstore","2000 Ford Artic",BedroomRental.com.ee,A+,118.4,53.8,"5' 6""",168,6e2490c7-4a17-4ce8-9a80-4b82551d3180,58.540748,23.059808
26,male,Brazil,Mr.,Davi,G,Santos,"Rákóczi út 66.",Barnag,VE,Veszprém,8291,HU,Hungary,DaviGoncalvesSantos@superrito.com,Cumeneamord,AeYeRie2doo,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","(88) 158-170",36,Goncalves,3/15/1974,45,Pisces,Visa,4539342451489007,234,9/2021,,"1Z 598 735 11 9286 476 3",9550539865,84997161,Blue,"Clinical manager","Star Merchant Services","2011 Alfa Romeo Giulietta",ConfidentialCash.hu,A+,156.9,71.3,"5' 11""",180,95501fdd-a13f-43fd-981c-0dced4dd23e6,47.04597,17.745945
27,male,England/Wales,Mr.,Jonathan,A,Conway,"Rua Doutor Afrânio Junqueira 1460","São Paulo",SP,"São Paulo",04581-040,BR,Brazil,JonathanConway@jourrapide.com,Bessed,Egaez2Vuo,"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko","(11) 7113-8192",55,Humphries,1/20/1938,81,Aquarius,Visa,4916317241919037,680,3/2023,308.271.618-08,"1Z 02V 42E 21 2992 331 4",3494650897,26033304,Blue,"Electrical drafter","Pro Yard Services","2006 Dodge Caravan",AidsRate.com.br,B+,170.9,77.7,"6' 0""",183,0eb93cfe-1260-4caa-99d5-e0f31f73b1e9,-23.594567,-46.709971
28,male,French,Mr.,Guy,A,Migneault,"90 Petworth Rd",DUNSINNAN,,,"PH2 5HL",GB,"United Kingdom",GuyMigneault@teleworm.us,Enut1960,nahming4Oo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","077 5138 5842",44,Labrecque,4/28/1960,59,Taurus,MasterCard,5581802812256860,692,3/2022,"ZT 01 87 75","1Z 96V 661 27 4962 061 4",8863556519,92044686,Blue,"Technical trainer","Rogers Peet","1999 Fiat Siena",ReligiousCounselor.co.uk,B+,176.7,80.3,"5' 7""",171,eca54fd7-3576-426c-b4f9-b617a3662931,56.025747,-3.640577
29,male,Hispanic,Dr.,Breogan,J,Orosco,"Ηλίου 64",ΛΑΡΝΑΚΑ,LA,Λάρνακα,6031,CY,"Cyprus (Greek)",BreoganOroscoCeballos@teleworm.us,Priback,Quoo9choo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","97 687579",357,Ceballos,9/19/1957,62,Virgo,MasterCard,5567650882732569,392,10/2022,,"1Z A69 196 36 1803 521 0",6314780611,79148788,Green,"Machinery maintenance mechanic","Jack Lang","2010 Hyundai i30",TalkAbuse.com.cy,A+,155.1,70.5,"5' 10""",179,8dda77bb-d75c-40ef-b1c0-6381d305b053,41.348113,-72.957665
30,female,Czech,Mrs.,Jaroslava,M,Kindlová,"22 Rue de Sidi Bou Zid",Zouarine,33,"Governorate Kef",7170,TN,Tunisia,JaroslavaKindlova@armyspy.com,Tepen1939,thahShee7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763","78 427 062",216,Langová,11/28/1939,80,Sagittarius,Visa,4532368231815457,583,12/2022,,"1Z 349 036 22 9992 262 9",5257019048,20372734,Black,"Copy editor",Anthony's,"2005 Nissan Altima",ScanFund.tn,B+,108.7,49.4,"5' 4""",163,94e73608-33dd-4f7a-a86d-9575aa2e63e6,37.219459,9.881268
31,male,Croatian,Mr.,Stjepan,A,Perković,"Escuadro 26","Castelló de Rugat",V,Valencia,46841,ES,Spain,StjepanPerkovic@jourrapide.com,Spastry,uLiH7iech3,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","779 021 982",34,Topić,3/18/1978,41,Pisces,Visa,4539449132248031,692,6/2023,,"1Z V92 2E8 34 9201 447 0",3497779171,87502347,Orange,Geoscientist,Monmax,"2010 Chrysler PT Cruiser",AudioBoom.es,A+,175.8,79.9,"5' 10""",178,e49f858d-b218-4ce7-89ae-62599b3bb276,38.927331,-0.375157
32,male,Croatian,Mr.,Stanko,T,Crnić,"Avda. Alameda Sundheim 46",Benasque,HU,Huesca,22440,ES,Spain,StankoCrnic@fleckens.hu,Waskents,Piu4theeg1ae,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","793 358 347",34,Jozić,9/13/1974,45,Virgo,Visa,4929607743905830,463,2/2021,,"1Z 981 F24 67 9469 260 4",6183936244,94973462,Blue,"Studio camera operator","Wealthy Ideas","2004 Isuzu Axiom",PokerPortraits.es,B+,185.2,84.2,"6' 1""",186,6ee36d07-1397-43c8-86a4-455bce264fbc,42.623207,0.475571
33,female,Russian,Ms.,Marianne,I,Zhdanova,"18 Rue de bayrout","Cite Badrani",61,"Governorate Sfax",3083,TN,Tunisia,MarianneZhdanova@armyspy.com,Thock1968,uRai6thoh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","74 849 807",216,,11/30/1968,51,Sagittarius,MasterCard,5223193173825541,622,7/2020,,"1Z 60W 422 22 2116 147 5",6606006372,32841170,Blue,"Geographic information specialist","Star Interior Design","1996 Dodge Caravan",PreviewBuy.tn,B+,168.3,76.5,"5' 7""",169,eff7c2c8-85f4-45dc-8a4d-60770f4eda42,35.319852,9.785865
34,female,Hungarian,Ms.,Ferike,G,Jónás,"Brucker Bundesstrasse 31",FÜRLING,NO,"Lower Austria",4152,AT,Austria,JonasFerike@dayrep.com,Ofigaill49,ni8ooy1Thee,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763","0681 563 12 72",43,Tolnay,4/12/1949,70,Aries,Visa,4532628381402038,969,6/2021,,"1Z 099 5Y5 30 8995 126 3",7242712846,73346379,Red,"Computer systems administrator","ABCO Foods","1999 Land Rover Discovery",SemiCheap.at,O+,134.6,61.2,"5' 5""",166,4687baf3-0486-4466-a63b-d0dff8903249,48.633035,13.984742
35,female,Czech,Ms.,Lenka,T,Mizerová,"Rhinstrasse 91",München,BY,"Freistaat Bayern",80975,DE,Germany,LenkaMizerova@fleckens.hu,Daudgessed,keer4Ceej9j,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","089 60 25 65",49,Brožová,1/14/1944,75,Capricorn,Visa,4916250570933685,939,6/2023,,"1Z 580 A90 83 6948 520 2",9291535188,57982233,Red,Maid,AdventureSports!,"2000 Toyota Sienna",ExitMarketing.de,O+,184.1,83.7,"5' 1""",154,efb875ed-74dc-4ea3-b155-4a48cb61d187,48.189892,11.502201
36,female,Icelandic,Mrs.,Eyþóra,P,Runólfsdóttir,"Ditscheinergasse 80",HUNDSHAGEN,OO,"Upper Austria",4773,AT,Austria,EythoraRunolfsdottir@jourrapide.com,Expregiat,Beph8ieX,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","0699 830 58 07",43,,11/22/1960,59,Sagittarius,MasterCard,5426319483823638,482,11/2024,,"1Z 448 19V 41 8418 729 7",2455440003,04422500,Purple,"Building cleaning worker","Matrix Architectural Service ","2010 BMW 650",PaidValue.at,O-,130.7,59.4,"5' 6""",168,02a29393-6fdc-4120-adf0-568906c8c111,48.266115,13.568714
37,female,French,Mrs.,Rive,T,Lépicier,"144 Souniou Ave.",Menogeia,LA,Larnaca,7578,CY,"Cyprus (Anglicized)",RiveLepicier@teleworm.us,Carray,ieJij3no,"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0","24 102884",357,Lanteigne,4/15/1983,36,Aries,MasterCard,5494902843118711,376,8/2024,,"1Z W04 891 69 0373 898 8",7751332568,91800864,Yellow,"Payroll and benefits specialist","Lechters Housewares","2006 Nissan Pathfinder",MLSModels.com.cy,A+,119.7,54.4,"5' 4""",163,17afbc1b-4d95-4583-8602-9680b4fd7c5c,41.311392,-72.829123
38,male,Danish,Mr.,Marcus,O,Paulsen,"Plattenstrasse 57",Räterschen,,,8352,CH,Switzerland,MarcusOPaulsen@dayrep.com,Mesee1943,Fi7eiva8Ah,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","044 347 47 26",41,Simonsen,10/8/1943,76,Libra,MasterCard,5378357034830932,713,8/2021,,"1Z 387 E31 19 5962 225 9",2639724434,59408085,Black,"Management development specialist","Parts and Pieces","2006 BMW M3",MalpracticeAgents.ch,B+,207.5,94.3,"5' 10""",177,43ae4a6b-e1ca-4d5e-b6cd-3732697c9c71,47.488972,8.868299
39,female,"Chechen (Latin)",Mrs.,Zeliha,I,Sultygov,"Rookopli 96",Uralaane,VG,Valgamaa,68712,EE,Estonia,ZelihaSultygov@cuvox.de,Dary1953,nohgief2A,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","763 5734",372,Desheriyev,7/27/1953,66,Leo,Visa,4539696753097085,200,11/2021,,"1Z E17 641 44 8337 404 1",5174706823,32841274,Red,"Mental health assistant","Reliable Investments","2000 Opel Signum",MeDue.com.ee,O+,119.9,54.5,"5' 3""",160,68c2ef2c-9990-41bc-be94-afa07e6e2379,58.070181,26.064252
40,female,Russian,Mrs.,Ilona,B,Pirogova,"Αγ. Ανδρέα 130","ΒΑΣΑ ΚΟΙΛΑΝΙΟΥ",LI,Λεμεσός,4771,CY,"Cyprus (Greek)",IlonaPirogova@superrito.com,Hatiere,Ahsha4Ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","25 750307",357,,12/5/1970,49,Sagittarius,Visa,4539731441846112,542,4/2024,,"1Z 008 309 69 5575 108 1",3971097056,06463075,Purple,"Land acquisition manager","Wickes Furniture","1992 Ford Taurus",UGLive.com.cy,O-,135.1,61.4,"5' 1""",156,0c5899a0-ce9c-43a7-aba1-f279893620f9,41.295272,-72.961282
41,female,Croatian,Ms.,Aleksandra,K,Petković,"Binzmühlestrasse 30","San Bernardino",,,6565,CH,Switzerland,AleksandraPetkovic@fleckens.hu,Ramessanies1994,vie2quai7Ie8,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","091 808 37 22",41,Bašić,6/8/1994,25,Gemini,MasterCard,5560110586075796,991,3/2023,,"1Z 11E 310 61 4037 919 3",0620594797,44658507,Purple,"Personal banker",Ejecta,"2005 Kia Amanti",WirelessRelief.ch,A+,110.2,50.1,"5' 4""",163,cc560302-0d00-410c-9629-2e68bb4ef864,46.506804,9.159102
42,male,American,Mr.,Rogelio,A,Patrick,"Πλ Καραισκάκη 128","ΑΓΙΟΣ ΘΕΟ∆ΩΡΟΣ ΣΟΛΕΑΣ",NI,Λευκωσία,2823,CY,"Cyprus (Greek)",RogelioAPatrick@dayrep.com,Whortin1952,Yu7mah2z,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","95 432561",357,Thacker,10/25/1952,67,Scorpio,Visa,4716135824324942,242,9/2021,,"1Z 14A 327 32 1181 648 4",9342714384,12993510,Blue,"Typesetting machine tender","Vibrant Man","2001 Toyota MR2",MacroSigns.com.cy,B+,170.3,77.4,"5' 11""",180,76a566cc-be59-4327-862e-312da09e0c42,41.353523,-72.965839
43,female,American,Mrs.,Evelyn,R,Tucker,"Kringlan 66",Reykjavík,,,107,IS,Iceland,EvelynRTucker@armyspy.com,Arresplet,deiT0ahyu,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36","450 3756",354,Burton,9/25/1986,33,Libra,MasterCard,5592761939548814,873,9/2022,,"1Z 263 919 45 6552 555 7",5057305433,86266508,Purple,"Aquaculture farmer",Weatherill's,"2004 Mitsubishi Galant",TheyTell.is,O+,105.8,48.1,"5' 3""",159,716b5321-34bf-4514-8bca-fce5c482d8c3,64.159592,-21.928397
44,male,Icelandic,Mr.,Þorkell,H,Hallbjörnsson,"Školní 296","Kaplice 1",JC,"Jihoceský kraj","382 41",CZ,"Czech Republic",THorkellHallbjornsson@gustr.com,Dessesid,eeXahew1ui,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","772 616 930",420,,12/19/1957,62,Sagittarius,MasterCard,5534480983249093,443,6/2023,,"1Z 34E 320 47 9554 749 9",1825424609,81224507,Blue,"Financial aid director","Grand Union","2004 Ford Explorer",AttorneyBiographies.cz,A-,236.9,107.7,"6' 1""",186,393472fc-3454-4ba8-af3a-e1f7f626cfee,48.691433,14.516696
45,male,Greenland,Mr.,Jan,H,Geisler,"Bayerhamerstrasse 79",GLAUBENDORF,NO,"Lower Austria",3704,AT,Austria,JanGeisler@jourrapide.com,Subjecould,eepooz6U,"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko","0699 456 17 84",43,Lange,10/25/2000,19,Scorpio,MasterCard,5305776196130476,904,6/2020,,"1Z 084 34A 51 9322 259 5",7785136902,52035400,Blue,"Fire prevention specialist",Mikrotechnic,"2005 Bizzarrini BZ-2001",ProfilePeek.at,B+,203.1,92.3,"5' 9""",175,92630214-ba49-47e4-8717-93e4bba4262e,48.560667,15.917015
46,female,Norwegian,Mrs.,Caroline,M,Landmark,"Via Tasso 21",Perugia,PG,Perugia,06122,IT,Italy,CarolineLandmark@superrito.com,Sweves,ooNee0iechoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","0378 8718408",39,Benjaminsen,9/26/1975,44,Libra,MasterCard,5248190326919222,805,6/2024,PR74491787,"1Z 2Y4 773 67 4365 263 8",7709648575,99170582,Green,"Allopathic physician","Castro Convertibles","2012 Dodge Durango",CheapWarrants.it,A-,187.7,85.3,"5' 6""",168,dde3a962-10d8-4092-b9df-6da79f89f383,43.072973,12.459411
47,female,Swedish,Ms.,Lena,M,Andersson,"Parkring 7",STEINPARZ,OO,"Upper Austria",4730,AT,Austria,LenaAndersson@jourrapide.com,Freen1978,aeVaiHohy7,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","0650 858 08 11",43,Holm,10/24/1978,41,Scorpio,MasterCard,5451268671996177,795,8/2023,,"1Z 683 821 70 2253 409 0",7986278354,59684998,Blue,"Chemical technician","Wholesale Club, Inc.","2007 Kia Carnival",ProvidenceSold.at,AB+,180.2,81.9,"5' 1""",154,8d7f4c08-ee33-4024-9474-0beed004df45,48.234026,13.824536
48,male,Danish,Mr.,Elias,A,Jepsen,"Βασιλέως Αλεξάνδρου 195",ΦΑΡΜΑΚΑΣ,NI,Λευκωσία,2620,CY,"Cyprus (Greek)",EliasAJepsen@dayrep.com,Thenim,uki0Zae7l,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 228011",357,Olsen,9/10/1967,52,Virgo,Visa,4929867576889614,699,4/2023,,"1Z 348 2Y4 34 0337 091 2",3256227540,51352382,Blue,"Court, municipal, and license clerk","Golden's Distributors","2011 Volvo XC70",StudRules.com.cy,A+,214.5,97.5,"5' 9""",174,f06e06a7-59b0-4052-9928-92df476d7753,41.326352,-72.962624
49,male,French,Mr.,Honoré,N,Beaudouin,"13 Faubourg Saint Honoré",PAU,IL,Île-de-France,64000,FR,France,HonoreBeaudouin@superrito.com,Slise1955,Zei7phaeSuutu,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",05.39.21.79.15,33,Daoust,11/2/1955,64,Scorpio,Visa,4539798879618651,321,11/2020,"1551124428495 35","1Z 424 792 46 6757 249 6",9796303410,07585755,Blue,"Dairy scientist","York Steak House","1999 GAZ 3111",RankHunter.fr,O+,236.1,107.3,"5' 9""",175,00a9f1f4-bec6-4dda-ac22-97e2860b1662,43.241847,-0.41343
50,male,American,Mr.,Richard,K,Martinez,"Via Zannoni 49","Tiarno Di Sopra",TN,Trento,38060,IT,Italy,RichardKMartinez@rhyta.com,Himmest,Feitee9ien,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","0328 4921229",39,Dickenson,9/2/1985,34,Virgo,MasterCard,5258225275434802,882,2/2023,PJ53725800,"1Z 4V9 5Y7 22 3010 519 9",5935776326,45852443,Blue,Logistician,Macroserve,"1995 Fiat Bravo",YellowShoppers.it,O+,205.9,93.6,"5' 9""",176,fbe7a3e7-ace4-4495-b6e6-0c2cb4abcc88,45.964528,10.759331
51,male,"Chechen (Latin)",Mr.,Salambek,T,Melikov,"1678 Dorp St",Claremont,WC,"Western Cape",7740,ZA,"South Africa",SalambekMelikov@teleworm.us,Brint1956,ehahCh1xai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","083 792 9726",27,Gairbekov,7/13/1956,63,Cancer,MasterCard,5284109963636027,465,12/2023,5607139788084,"1Z 558 445 48 7417 922 3",1631262867,63160277,Green,"Telephone operator","Kinney Shoes","1998 Chevrolet Trans Sport",TripMetro.co.za,O+,211.9,96.3,"5' 8""",172,b1628f36-9fdd-487d-8297-a35ee6d72ebf,-33.89536,18.479041
52,female,Greenland,Mrs.,Mette,K,Olsen,"ul. Karpacka 69",Bydgoszcz,,,85-164,PL,Poland,MetteOlsen@cuvox.de,Liffew,queiy6ooGh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","88 165 40 96",48,Jeremiassen,1/30/1935,84,Aquarius,MasterCard,5353410735290150,175,3/2023,35013096720,"1Z 129 156 25 6468 002 5",2379843087,66415820,Yellow,Lawyer,MagnaSolution,"2005 Peugeot 107",ChildGaming.pl,O-,117.0,53.2,"5' 0""",152,01d98231-8e58-4e17-9c9a-bc5aa388928a,53.068638,18.093529
53,male,Russian,Mr.,Spartacus,N,Ignatieff,"Bahnhofstrasse 57",Glovelier,,,2855,CH,Switzerland,SpartacusIgnatieff@jourrapide.com,Imeting1968,yi2Eep8gieh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","032 803 90 31",41,,6/16/1968,51,Gemini,MasterCard,5399586162423418,375,3/2024,,"1Z 632 482 77 2949 860 1",2465813266,58937027,Blue,"Sales worker supervisor",Schweggmanns,"2008 Mazda 5",SoccerInstructor.ch,O+,152.9,69.5,"5' 8""",173,6365915b-b99c-426c-ae8e-698f846d3f03,47.232383,7.22689
54,male,Brazil,Mr.,Kauã,S,Cardoso,"P.O. Box 194",Upernavik,QA,Qaasuitsup,3962,GL,Greenland,KauaSantosCardoso@fleckens.hu,Searlitnot,caiT4reN,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","96 63 21",299,Castro,3/9/2000,19,Pisces,MasterCard,5301206781704786,476,5/2024,,"1Z 346 5Y6 16 7677 108 7",7407791874,63200812,Blue,"Press secretary","Dave Cooks","2002 Hyundai Elantra",SpaRules.gl,A+,155.5,70.7,"5' 10""",179,081371c1-479e-4055-95af-3110e72fc11a,72.786922,-56.131948
55,female,Brazil,Ms.,Fernanda,P,Cavalcanti,"Via degli Aldobrandeschi 3",Jelsi,CB,Campobasso,86015,IT,Italy,FernandaPereiraCavalcanti@superrito.com,Knour1941,ahChohqu4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0327 9982793",39,Souza,11/30/1941,78,Sagittarius,Visa,4929971103746071,969,10/2023,CI48765311,"1Z 223 435 25 6742 103 3",6973121025,59054247,Blue,"Budget analyst","Hughes & Hatcher","2001 Volkswagen Lupo",MartiniMobile.it,B+,106.3,48.3,"5' 5""",165,c75dc0e6-fe45-431f-8907-6e58db479a3d,41.444028,14.707643
56,female,Hungarian,Ms.,Mónika,Z,Göröncsér,"Nábřežní 243","Spálené Porící",PL,"Plzenský kraj","335 61",CZ,"Czech Republic",GoroncserMonika@fleckens.hu,Thenetiong,quohwae5Quoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","376 147 284",420,Szôts,6/11/1946,73,Gemini,Visa,4916007260260864,417,6/2024,,"1Z A43 364 39 5822 708 0",8027400869,93626602,Black,"Precision printing worker","Plunkett Home Furnishings","2001 Bugatti EB 118",LeftJournal.cz,B+,146.3,66.5,"5' 4""",162,13189ec1-db42-4f8c-b74e-e7a45d33a237,49.629,13.606864
57,female,Czech,Mrs.,Zuzana,M,Kozáková,"Via del Pontiere 101","Birgi Aerostazione",TP,Trapani,91020,IT,Italy,ZuzanaKozakova@fleckens.hu,Dinectich,mai5eiXaexai,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","0391 7843193",39,Minarčíková,1/21/1953,66,Aquarius,MasterCard,5164065137771907,924,5/2020,AT69882067,"1Z 736 067 39 5591 664 0",0937144872,77392590,Red,"Industrial engineering technician",Romp,"1998 Isuzu VX-02",EugeneTownhouse.it,A+,123.0,55.9,"5' 3""",161,c414c613-db2b-4f1a-8bf1-a91e518afb85,37.610445,12.42306
58,female,Russian,Mrs.,Eugene,R,Bykova,"Via Goffredo Mameli 149","Poggiovalle Di Borgorose",RI,Rieti,02020,IT,Italy,EugeneBykova@einrot.com,Ingentersed1943,aeN6eenul5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134","0366 9434948",39,,6/29/1943,76,Cancer,Visa,4556039653807659,972,5/2022,KX92160250,"1Z 975 21Y 29 8927 431 6",6861257852,04061254,Brown,"Executive secretary","Kinney Shoes","1994 Plymouth Voyager",GrandLunch.it,A+,195.8,89.0,"4' 11""",150,7eb2e374-3e73-41d3-95b6-f9bb93993872,42.079608,12.989079
59,female,Icelandic,Ms.,Nanna,S,Hallmundsdóttir,"85 Gimblett Street",Richmond,,Invercargill,9810,NZ,"New Zealand",NannaHallmundsdottir@cuvox.de,Wifflife1964,Deez4ooGi0,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","(022) 7929-466",64,,4/15/1964,55,Aries,MasterCard,5504023231705718,236,9/2020,,"1Z 668 366 39 7132 126 6",7750395260,67884791,Black,"Photographic process worker",Megatronic,"1999 Dodge Avenger",PrepaidCDs.co.nz,O+,170.1,77.3,"5' 0""",153,05974695-4733-4bd3-b1ac-bffb56992160,-46.301185,168.423853
60,female,Polish,Ms.,Halina,C,Zielinska,"1 Gloucester Road",CLACHANDHU,,,"PA68 7QD",GB,"United Kingdom",HalinaZielinska@jourrapide.com,Corsome74,Vaem2keeV6,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36","077 5149 6820",44,Zielinska,8/7/1974,45,Leo,Visa,4539422686560762,956,3/2022,"TB 10 69 23","1Z 189 A93 08 3744 015 7",0022083989,01394977,Blue,"Surgical technician",Monit,"2011 Renault Grand Scenic",CreditEducate.co.uk,AB+,208.3,94.7,"5' 7""",171,44fd6c97-052a-42a8-b2d2-fb8b3fc70ba8,55.954988,-5.872046
61,female,"Japanese (Anglicized)",Ms.,Hatsuho,K,Yoneda,"Dalmatinova 35",Žabnica,,,4209,SI,Slovenia,HatsuhoYoneda@fleckens.hu,Therl1988,ahteel2maeSh,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",051-632-354,386,Mikami,9/17/1988,31,Virgo,MasterCard,5104564324299139,201,4/2021,,"1Z 486 074 10 6995 124 2",8337173450,63167043,Purple,Treasurer,Practi-Plan,"2003 MG ZT",ConventionalMedicines.si,O+,134.4,61.1,"5' 8""",173,e51aa8e2-9fc6-43e7-8488-168ca87b75d7,46.108258,14.328302
62,male,Croatian,Mr.,Gojislav,V,Jukić,"Bahnhofstrasse 96",Gorgier,,,2023,CH,Switzerland,GojislavJukic@dayrep.com,Tinguen,GaiXa3ai,"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36","032 304 33 66",41,Pavić,9/21/1999,20,Virgo,MasterCard,5165115278847765,074,12/2022,,"1Z 385 466 21 1758 512 0",6790588061,83119520,Red,"Extractive metallurgical engineer","Value Giant","2011 Lexus LFA",AmericasFunny.ch,B+,155.5,70.7,"6' 1""",186,0165faab-d065-4a58-a543-057c540a6863,46.955216,6.830252
63,female,England/Wales,Mrs.,Megan,T,Swift,"Postbox 23",Maniitsoq,QE,Qeqqata,3912,GL,Greenland,MeganSwift@teleworm.us,Thersevere,ieGh5huoK6,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0","81 32 04",299,Poole,4/30/1964,55,Taurus,Visa,4929574688538812,336,4/2024,,"1Z E38 W63 04 0063 263 0",6213800707,38860322,Green,"Support service manager","Little Folk Shops","2007 Chevrolet Optra",MontereySea.gl,B+,168.7,76.7,"5' 4""",163,394b47d6-e3cb-4869-b4b5-c42a81ef9b00,65.395922,-52.878832
64,male,Slovenian,Mr.,"Milan Franc",S,Košelnik,"Na Výsluní 272",Primda,PL,"Plzenský kraj","348 06",CZ,"Czech Republic",MilanFrancKoselnik@teleworm.us,Lospay67,queeN0ies,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","733 162 319",420,Mankoč,1/14/1967,52,Capricorn,MasterCard,5453102419958199,879,12/2021,,"1Z A16 354 25 3838 551 9",0820470346,05905803,Brown,"Clinical laboratory technologist","Singer Lumber","2003 Holden UTE",TextFraud.cz,A+,214.5,97.5,"5' 9""",175,7234b959-896e-4335-8745-c5716b1c7638,49.619319,12.730847
65,female,German,Ms.,Johanna,P,Maurer,"Kaisergasse 64",KURZENKIRCHEN,OO,"Upper Austria",4770,AT,Austria,JohannaMaurer@armyspy.com,Reptaked1981,ohwae5Tee,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","0688 992 50 51",43,Schuhmacher,7/12/1981,38,Cancer,Visa,4916529470563530,481,4/2020,,"1Z Y23 375 24 5962 121 9",0306589986,74875435,Purple,"Elevator repairer","Solution Answers","2008 Rover Streetwise",PublicityAid.at,AB+,186.6,84.8,"5' 5""",164,30ce10a0-2ea3-4e24-871a-f72cb3beedfc,48.439595,13.547802
66,male,Norwegian,Mr.,Teodor,K,Aune,"Gl. Sygehusvej 153",Narsaq,KU,Kujalleq,3921,GL,Greenland,TeodorAune@superrito.com,Dickent,ooph5leiG,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","66 28 45",299,Arntzen,7/18/1998,21,Cancer,MasterCard,5296939640241254,624,2/2024,,"1Z 216 Y97 87 6791 863 9",0054598944,41175224,Blue,"Private household cook","Builders Emporium","2008 Renault Laguna",GraffitiRoom.gl,O+,239.1,108.7,"5' 9""",174,0562be22-b239-4dbc-a0f7-d64679ae153f,60.827346,-46.022413
67,male,Dutch,Mr.,Abderrahman,I,Kempers,"Hjellestadnipen 66",HJELLESTAD,,,5259,NO,Norway,AbderrahmanKempers@einrot.com,Ancery,eeC4tien9,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","460 73 493",47,Cuperus,7/17/1977,42,Cancer,MasterCard,5358951722758050,004,9/2024,,"1Z Y78 20V 88 9025 826 2",1986315622,50825375,Blue,"Time clerk","Carrols Restaurant Group","2008 Dacia Sandero",PlatinumVoice.no,O+,182.4,82.9,"5' 6""",168,49c9f2d1-5e74-4ef9-8c9c-c5e6338e1f6d,60.280666,5.157123
68,male,French,Mr.,Nicolas,R,Lebrun,"95 Burton Avenue",Okoia,,Wanganui,4500,NZ,"New Zealand",NicolasLebrun@armyspy.com,Wroing,iR2rahpaim2a,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","(027) 0336-972",64,Bondy,3/3/1964,55,Pisces,MasterCard,5167661122227231,094,5/2021,,"1Z 418 1A1 09 5878 510 1",4757008829,69457336,Blue,"Corporate accountant",Playworld,"1997 Citroen Rally Raid",PopularFlicks.co.nz,O+,209.4,95.2,"5' 10""",178,851fc065-0061-4754-b807-421e4242b5ba,-39.863379,174.967351
69,male,Slovenian,Mr.,"Ivan Martin",J,Bugarski,"Breivangvegen 38",TROMSØ,,,9010,NO,Norway,IvanMartinBugarski@einrot.com,Dercy1937,iey2Xoh8o,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","448 63 713",47,Riboli,11/14/1937,82,Scorpio,MasterCard,5364317183005716,388,12/2022,,"1Z 291 4Y2 69 2883 563 7",1469869904,37602111,Red,"Office clerk",Quickbiz,"2000 Chrysler Grand Voyager",StrictlyIdeas.no,O-,174.2,79.2,"5' 10""",179,ca8717a7-1eba-4f93-9f1d-970b2fa1a45e,69.651262,18.958466
70,female,Czech,Ms.,Jarmila,M,Chloupková,"729 Albert St",Germiston,GA,Gauteng,1419,ZA,"South Africa",JarmilaChloupkova@superrito.com,Wountim81,yie8Cees,"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","084 256 2607",27,Poláčková,2/5/1981,38,Aquarius,Visa,4716715994153559,960,3/2024,8102054742081,"1Z 454 A20 14 2101 291 2",8866810206,44986764,Red,"Industrial-organizational psychologist","Wells & Wade","1996 Mini MK VI",SleepsAround.co.za,B+,140.6,63.9,"5' 8""",172,940ec903-7fa0-421a-a1c5-7620bea7f7e0,-26.161314,28.133482
71,male,Italian,Dr.,Manlio,M,Capon,"Lützelflühstrasse 122",Wil,,,5300,CH,Switzerland,ManlioCapon@einrot.com,Theyear,ieThuo1fei,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","062 380 34 69",41,Folliero,9/17/1947,72,Virgo,Visa,4929441842722544,930,4/2021,,"1Z 15W 644 50 5740 843 7",5183466840,07258789,Black,"Billing and posting clerk","House Of Denmark","1997 Oldsmobile Eighty-Eight",PrepaidHoliday.ch,B+,162.1,73.7,"5' 10""",179,0e92196a-3502-41c6-83bc-9a3b43c49317,47.510722,8.303196
72,female,"Japanese (Anglicized)",Ms.,Tomomi,Y,Nishiyama,"Rua do Arenque 1634",Goiânia,GO,Goiás,74343-040,BR,Brazil,TomomiNishiyama@fleckens.hu,Mille1991,aeneThoh6x,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15","(62) 9976-7986",55,Ishizaki,8/23/1991,28,Virgo,Visa,4716240309647674,486,9/2020,406.537.117-19,"1Z 01V 07A 25 5957 147 4",4689360184,38520562,Purple,"Oxy-gas cutter","Lechters Housewares","2011 Chevrolet HHR",PharmacyFile.com.br,B+,213.2,96.9,"5' 2""",158,5141a9c2-cd45-4813-9cbe-12d630eefce4,-16.687195,-49.226261
73,female,Russian,Dr.,Esther,R,Kalinina,"Τρικάλων 248",ΛΕΥΚΩΣΙΑ,NI,Λευκωσία,1687,CY,"Cyprus (Greek)",EstherKalinina@dayrep.com,Hics1952,euG0Aiqu2,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","22 723018",357,,7/5/1952,67,Cancer,MasterCard,5563120249912803,542,11/2024,,"1Z 734 93Y 11 9330 585 6",6006786087,26151161,Blue,"Ambulatory care nurse","Waccamaw Pottery","1999 MCC Smart",ShapeConsultant.com.cy,A+,133.5,60.7,"5' 7""",170,642dbeb6-defe-4c89-bc0a-5c64ae807dcb,41.266749,-72.834759
74,male,Icelandic,Mr.,Boði,L,Zóphoníasson,"Árpád fejedelem útja 3.",Budapest,BU,Budapest,1184,HU,Hungary,BodiZophoniasson@jourrapide.com,Inart1990,nugheiZ0eig5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","(1) 941-2250",36,,6/12/1990,29,Gemini,MasterCard,5199467250444016,134,12/2020,,"1Z 563 9F0 30 4262 753 9",9294260413,82490560,Orange,"Mail processing machine operator","Bell Markets","1997 Lada Natacha",ReportDiscount.hu,B+,144.5,65.7,"5' 9""",174,a3912b2f-c2dc-4ff7-ab81-af1e047108c5,47.515035,19.146851
75,female,England/Wales,Mrs.,Grace,G,Boyle,"62 Mavrokordatou Street",Foinikaria,LI,Limassol,4530,CY,"Cyprus (Anglicized)",GraceBoyle@einrot.com,Fortat81,cahBoot3eH,"Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","25 880993",357,Thompson,9/27/1981,38,Libra,MasterCard,5597431608392937,582,5/2021,,"1Z 791 957 61 6161 657 2",2596874442,13926446,Blue,"Forensic technician","id Boutiques","1993 Bristol Beaufighter",BayNeck.com.cy,AB+,181.7,82.6,"5' 8""",173,149314e4-80a7-493a-8ece-9b6f0890fd5d,41.321303,-72.986114
76,female,England/Wales,Ms.,Naomi,S,Ryan,"Λ. Μιχαλακοπούλου 160",ΕΓΚΩΜΗ,NI,Λευκωσία,2417,CY,"Cyprus (Greek)",NaomiRyan@rhyta.com,Fien1988,Gaitha4Ei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","22 498317",357,Clayton,11/21/1988,31,Scorpio,MasterCard,5283115033422307,714,12/2023,,"1Z 587 192 06 3768 866 1",7494134303,25065703,Blue,"Sewer pipe cleaner","Monk Real Estate Service","1995 BMW Dinan",MobLag.com.cy,O+,205.0,93.2,"5' 2""",158,65219a9f-2c32-459a-98d2-7a7332e0f52f,41.385016,-72.962431
77,male,Polish,Mr.,Szymon,B,Walczak,"2347 Lauzon Parkway",Windsor,ON,Ontario,"N9A 7A2",CA,Canada,SzymonWalczak@teleworm.us,Dintep,oog4aize7Ai,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",519-566-3375,1,Pawłowska,9/18/1958,61,Virgo,MasterCard,5342019646824967,284,11/2023,"727 633 539","1Z 883 8V5 80 2897 856 7",2176879008,20843357,Silver,Neurosonographer,"De Pinna","2009 Nissan Frontier",MissingWeapons.ca,B+,222.2,101.0,"5' 10""",179,b2e01ab2-c265-42eb-90b8-e26a6361eed4,42.423583,-82.942171
78,female,Hungarian,Ms.,Mercédesz,S,Szôllôssy,"Atamaria 86","Fornelos de Montes",PO,Pontevedra,36847,ES,Spain,SzollossyMercedesz@jourrapide.com,Musere,Eif1ce0ee,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36","641 572 459",34,Hofmann,1/15/1944,75,Capricorn,MasterCard,5535165808125664,767,3/2023,,"1Z 681 2E4 44 0770 552 8",5631978558,99957155,Red,"Case management aide","White Hen Pantry","2000 Noble M12",NeedCharge.es,O+,213.6,97.1,"5' 7""",169,a5f515e0-310f-47ab-bc0c-a71bac531a1e,42.263983,-8.431245
79,male,Norwegian,Mr.,Edgar,E,Andreassen,"Zistelweg 32",UNTERLAND,SZ,Salzburg,5661,AT,Austria,EdgarAndreassen@fleckens.hu,Waakis2000,Iejeiz1oodei,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36","0664 701 04 17",43,Dybvik,6/29/2000,19,Cancer,Visa,4716078791994463,776,11/2020,,"1Z 311 159 63 7486 723 7",5685893521,99981816,Black,"Speech pathologist",Peaches,"2001 Pontiac Grand Am",WordRegistrar.at,A-,127.8,58.1,"5' 8""",172,15337913-059c-4dd1-9feb-5dc426abe8c7,47.203021,12.910163
80,male,Slovenian,Mr.,Šemsudin,M,Vrhovski,"ul. Dawida Jana 124",Wrocław,,,50-527,PL,Poland,SemsudinVrhovski@rhyta.com,Gother,keeLaz9lee0,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.132","67 534 85 44",48,Pataki,10/13/1955,64,Libra,Visa,4716877835558592,363,3/2022,55101320290,"1Z E40 449 57 5657 736 1",2373733334,21925106,Blue,"ABE teacher","Integra Wealth Planners","2015 BMW X5 M",VirginExpo.pl,B+,197.6,89.8,"5' 11""",180,5c11fc04-45f8-4989-801f-4102ff38d376,51.112923,17.027289
81,female,Finnish,Mrs.,Satu,A,Waltari,"2071 Maryland Avenue",Pinellas,FL,Florida,34624,US,"United States",SatuWaltari@teleworm.us,Stittair,jal6oNgoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:66.0) Gecko/20100101 Firefox/66.0",727-538-7059,1,Viitala,9/16/1995,24,Virgo,Visa,4556890465838575,158,12/2024,591-28-5104,"1Z 534 941 77 8508 193 2",5257097378,69898015,Yellow,"Soil scientist","White Hen Pantry","2003 Daihatsu Terios",kupitorta.com,O+,141.0,64.1,"5' 5""",164,37de7e34-2624-444a-978d-b1b758fbc993,27.864456,-82.748032
82,male,German,Mr.,Matthias,S,Himmel,"Degnehøjvej 45",Silkeborg,MI,"Region Midtjylland",8600,DK,Denmark,MatthiasHimmel@armyspy.com,Barted,Jemu5poosoo,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15",30-62-84-08,45,Trommler,8/23/1983,36,Virgo,MasterCard,5289947968601628,320,12/2024,230883-1143,"1Z 006 174 60 6563 087 1",3945260717,87205650,Blue,"Mental health social worker","Superior Appraisals","1998 Alpina B 12",StLouisLighting.dk,A+,143.4,65.2,"5' 10""",179,116742f3-f65e-45f1-a917-d13ad1db7bd4,56.199078,9.447827
83,female,Danish,Ms.,Mia,A,Frederiksen,"ul. Zuchów 65","Dąbrowa Górnicza",,,41-303,PL,Poland,MiaAFrederiksen@rhyta.com,Fance1958,buY5faij,"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0","53 459 91 54",48,Lauritsen,6/18/1958,61,Gemini,MasterCard,5454007072610160,208,4/2023,58061866242,"1Z 501 697 49 5209 014 8",9285893233,21439381,Blue,"Fine arts photographer","Coon Chicken Inn","2008 SSC Aero",WrestlingMonthly.pl,O+,181.1,82.3,"5' 2""",158,8969b475-9dce-4173-b060-32da08dbbf0d,50.417075,19.133549
84,male,Swedish,Mr.,Jesper,N,Lund,"Põllu 59",Kähu,VG,Valgamaa,68506,EE,Estonia,JesperLund@armyspy.com,Planstim,AeBeiNii0,"Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","763 2200",372,Lundgren,5/6/1938,81,Taurus,Visa,4539748641306150,938,3/2024,,"1Z 484 548 09 5749 331 5",6980674650,69979073,White,Photogrammetrist,"Expo Superstore","1997 Panoz AIV",CreditChaos.com.ee,A+,178.0,80.9,"5' 10""",178,32b3dfb9-2eaa-4af2-a012-2a044f866550,57.915676,26.169326
85,male,Icelandic,Mr.,Guðgeir,S,Bergsveinsson,"Rue du Centre 320",Marke,VWV,"West Flanders",8510,BE,Belgium,GudgeirBergsveinsson@armyspy.com,Ressen,phukieGae9c,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","0493 28 62 88",32,,4/2/1994,25,Aries,MasterCard,5432615789205137,688,11/2021,,"1Z 684 8A0 47 6831 298 9",0387412870,16497840,Orange,"Forging machine tender","Crafts & More","1994 Mitsubishi Sigma",SoldierResources.be,A+,165.9,75.4,"5' 8""",172,1016b5ad-56e3-4fb2-ba90-dc1f43694493,50.73779,3.22707
86,male,Icelandic,Mr.,Esjar,S,Sturluson,"Hauptstrasse 75","PUCH BEI HALLEIN",SZ,Salzburg,5412,AT,Austria,EsjarSturluson@teleworm.us,Conetund,taeWeF2Eeph4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","0699 465 17 25",43,,3/24/1970,49,Aries,MasterCard,5400451109415331,492,9/2021,,"1Z 807 0E7 80 7325 125 1",3810913760,52784081,Blue,"Dietetic technician","Endicott Johnson","2002 Smart ForFour",MicroLists.at,B+,171.6,78.0,"6' 0""",182,b63ece12-4e7f-4348-bce8-1d8d5dc31dff,47.741555,13.137162
87,male,England/Wales,Mr.,Zak,M,Leonard,"27 Stroud Rd",OCHTERTYRE,,,"PH7 6LF",GB,"United Kingdom",ZakLeonard@fleckens.hu,Surn1940,ohteeF5RaeM,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","078 3687 4061",44,Henry,4/4/1940,79,Aries,MasterCard,5336964062411492,178,11/2023,"KY 97 49 93 A","1Z 77F 43A 87 5585 068 2",5973298171,66698464,Blue,"Heat treating equipment tender","The Independent Planners","1996 Mitsubishi Verada",HumorVids.co.uk,O+,221.1,100.5,"5' 7""",171,d667dde7-082e-4480-98d9-5bdd383eb187,56.07981,-4.643057
88,female,"Chechen (Latin)",Mrs.,Ezinet,B,Umkhayev,"216 Karaiskaki Sq",Ineia,PA,Paphos,8704,CY,"Cyprus (Anglicized)",EzinetUmkhayev@dayrep.com,Mothen1991,IeH2ceebae,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","97 696060",357,Masaev,3/27/1991,28,Aries,Visa,4929316208182816,160,4/2021,,"1Z 77V 114 81 0072 703 9",4908986837,97732729,Purple,"Quality assurance inspector","Chief Auto Parts","2005 Jaguar XKR",WealthyGadgets.com.cy,B+,218.9,99.5,"5' 9""",175,7cb11d28-cbd3-49c2-be74-e6dcdca65cb4,41.270842,-72.883851
89,female,Russian,Mrs.,Lucia,V,Voronina,"75 Sale-Heyfield Road",KONGWAK,VIC,Victoria,3951,AU,Australia,LuciaVoronina@gustr.com,Riets1976,aY3ohbe8ai,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0","(03) 5371 4059",61,,7/25/1976,43,Leo,MasterCard,5459182457974252,335,5/2024,,"1Z 618 731 57 5565 866 4",4199244189,20580381,Blue,"Eligibility interviewer","William Wanamaker & Sons","2012 Tata Indica",CheatPrevention.com.au,O+,176.7,80.3,"5' 7""",169,bf8e68c3-2842-49da-8c4d-1ed4712a3852,-38.465215,145.830079
90,male,Slovenian,Mr.,Milorad,S,Musić,"Välja 61",Mustahamba,VR,Võrumaa,66258,EE,Estonia,MiloradMusic@dayrep.com,Entils,oan8Eiyoaz,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","789 0750",372,Flach,12/1/1968,51,Sagittarius,MasterCard,5211243470404849,924,2/2022,,"1Z F99 556 56 0740 270 3",0643392658,82292582,Blue,Housekeeper,"Cougar Investment","2000 Buick Rendezvous",StickerEmporium.com.ee,A+,244.4,111.1,"6' 1""",185,ea727fc4-6f40-412c-9574-b372a6aef26f,57.820634,26.970835
91,female,"Japanese (Anglicized)",Dr.,Chisaki,M,Fujimura,"1956 Uitsig St",Grahamstad,EC,"Eastern Cape",6139,ZA,"South Africa",ChisakiFujimura@cuvox.de,Rewhe1979,iayiQu9ahsie,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","082 875 2166",27,Wakabayashi,12/22/1979,40,Capricorn,Visa,4485068737963325,600,10/2020,7912221956187,"1Z 480 79V 08 4325 733 4",5381640942,65242007,Brown,"Extruding and drawing machine setters","Alert Alarm Company","2005 Porsche Cayenne",NoteBack.co.za,O+,216.5,98.4,"5' 2""",157,b1a8327e-794e-4685-ab8b-43d20c08ed68,-33.370929,26.578978
92,female,Russian,Mrs.,Inessa,D,Samoylova,"Bachloh 60",WATZING,OO,"Upper Austria",4673,AT,Austria,InessaSamoylova@fleckens.hu,Rolong,moot9aQu2d,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0","0664 396 27 04",43,,5/11/1956,63,Taurus,Visa,4556264124085418,064,5/2024,,"1Z 596 V55 76 1512 629 4",1323820614,72364512,Purple,Patternmaker,"Erb Lumber","2009 Honda CR-V",EthanolSpecialist.at,O+,135.3,61.5,"5' 3""",161,13f411d7-0754-4233-a902-37a68ce4bb45,48.118598,13.654018
93,female,England/Wales,Ms.,Elise,C,Pearson,"215 Andrew Street",Monaco,,Nelson,7011,NZ,"New Zealand",ElisePearson@rhyta.com,Norly1997,eeNgoes7aez,"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0","(027) 7329-039",64,Wilson,5/12/1997,22,Taurus,Visa,4929452838531450,902,4/2022,,"1Z 245 516 58 7073 695 7",9911299106,38307168,Blue,"Activity specialist","The Independent Planners","1995 Nissan President",USFirm.co.nz,O+,162.1,73.7,"5' 9""",175,962cba25-bb3d-4b3b-8aca-e4689ee69dd5,-41.333626,173.307741
94,male,Norwegian,Mr.,Herman,A,Johansen,"1324 Mosman Rd","Alexander Bay",NC,"Northern Cape",8294,ZA,"South Africa",HermanJohansen@gustr.com,Thinde,loaH6shiemoh,"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","083 779 9214",27,Smestad,7/29/1947,72,Leo,Visa,4929709070861949,812,9/2023,4707295736082,"1Z 2E1 Y32 51 4242 127 7",3751331085,25346448,Blue,"Cost estimator","Brown Derby","2014 Audi SQ5",TypoPro.co.za,O+,182.8,83.1,"5' 10""",179,377c5af3-3ab2-405b-bc80-1ddd4f06ecca,-28.511777,16.410349
95,male,Russian,Mr.,Armen,D,Balabanov,"Bavorovská 788",Stachy,JC,"Jihoceský kraj","384 73",CZ,"Czech Republic",ArmenBalabanov@teleworm.us,Gery1975,Gaeg6uchoh,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","606 972 932",420,,1/11/1975,44,Capricorn,Visa,4532025806629404,529,1/2021,,"1Z 245 330 29 3529 731 4",5526134624,86584996,Orange,Rigger,"Sun Foods","1999 Isuzu VX-02",MyBloggers.cz,A+,199.8,90.8,"5' 6""",168,ec3e15ac-78df-4844-a8e3-c6c491c6dd39,49.090909,13.642637
96,male,Russian,Mr.,Evdokim,Y,Bazarov,"Reykjarhóli 70",Fljót,,,570,IS,Iceland,EvdokimBazarov@einrot.com,Deet1996,aiRubie9Poqu,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36","413 4270",354,,10/5/1996,23,Libra,MasterCard,5413518495816218,546,9/2024,,"1Z 900 9A2 84 8206 503 2",0362518360,38724903,White,"Insurance investigator","Modern Realty","1996 ZAZ Wagon",WeekendScores.is,O+,212.1,96.4,"5' 8""",173,9112f339-d232-4fa7-a1f3-2e74571fd00a,66.154544,-17.801351
|
||||
97,female,Hungarian,Mrs.,Agoti,B,Gyarmaty,"793 Buena Vista Avenue",Corvallis,OR,Oregon,97330,US,"United States",GyarmatyAgoti@jourrapide.com,Eage1963,pheiSha1aqu,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15",541-714-1388,1,Cseh,12/7/1963,56,Sagittarius,MasterCard,5328604768229802,989,8/2024,543-24-6755,"1Z 238 019 37 1904 563 8",2038177985,21602780,Blue,"Licensed clinical social worker","Circuit Design","2001 Suzuki Covie",adrifza.com,A+,140.8,64.0,"5' 4""",163,b1f4eed7-ff6b-4671-bf5d-a04c6a7b4beb,44.597298,-123.334112
|
||||
98,male,England/Wales,Mr.,John,K,Carpenter,"Tavcarjeva 22",Senovo,,,8281,SI,Slovenia,JohnCarpenter@dayrep.com,Foris1988,el6xoh7Qu,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",070-783-977,386,Wheeler,1/10/1988,31,Capricorn,MasterCard,5174982341006037,269,3/2020,,"1Z E95 341 79 8897 978 9",0309942560,54166967,Blue,"Marketing coordinator","Balanced Fortune","1992 Mazda AZ-1",KeywordAlbum.si,O+,226.4,102.9,"5' 8""",173,d2754fd9-f1c9-47cd-b6e1-7c8d5c0eec30,46.102339,15.464625
|
||||
99,female,Hispanic,Mrs.,Maha,A,Cazares,"Reyes Católicos 75","Chiclana de la Frontera",CA,Cádiz,11130,ES,Spain,MahaCazaresMendez@superrito.com,Martrust57,Ohqu6achie,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36","624 412 511",34,Méndez,9/25/1957,62,Libra,Visa,4485545084297530,898,6/2020,,"1Z 508 603 87 4474 636 9",1330099115,99991665,Blue,"Diesel train engineer","Sew-Fro Fabrics","2001 Alfa Romeo GTV",WellnessPlant.es,A+,213.2,96.9,"5' 8""",172,d768e30b-a6de-4977-8bb8-0c8432acce44,36.447765,-6.204969
|
||||
100,female,American,Mrs.,Patricia,J,Nevels,"72 Acheron Road",BUNDALAGUAH,VIC,Victoria,3851,AU,Australia,PatriciaJNevels@rhyta.com,Butimis1962,eekea5Thoo,"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","(03) 5301 7984",61,Thomas,7/30/1962,57,Leo,Visa,4716717759727577,065,9/2024,,"1Z 449 366 30 8287 656 2",3560472157,65535722,Blue,"Identification clerk","PriceRite Warehouse Club","2005 Infiniti QX56",PlayDetails.com.au,A-,209.9,95.4,"5' 3""",161,6156ce11-c2a6-4266-bb4d-f47b06292e4e,-38.111213,147.271178
|
|
|
@ -0,0 +1,3 @@
|
|||
ROCKET
|
||||
rocket
|
||||
racket
|
|
|
@ -0,0 +1,5 @@
|
|||
ROCKET
|
||||
irocketiere
|
||||
rock
|
||||
pocket
|
||||
racket
|
|
|
@ -0,0 +1,4 @@
|
|||
ROCKET
|
||||
Rocket
|
||||
Rocket
|
||||
rocket
|
|
File diff suppressed because it is too large
File diff suppressed because it is too large
|
@ -0,0 +1,4 @@
|
|||
My name is [FIRST_NAME] [LAST_NAME] and I fly a [ROCKET]
|
||||
I'm [ROCKET]
|
||||
The customer's name is [LAST_NAME], [FIRST_NAME] where is my [ROCKET]
|
||||
The customer's name is [FIRST_NAME] [ROCKET]
|
|
@ -0,0 +1,15 @@
|
|||
My email is [EMAIL]
|
||||
My address is [ADDRESS]
|
||||
My first name is [FIRST_NAME] and my last is [LAST_NAME]
|
||||
My name is [PERSON]
|
||||
My zip is [ZIP]
|
||||
I live in [CITY]
|
||||
Here's my phone number: [PHONE_NUMBER]
|
||||
You want my credit card? No problem: [CREDIT_CARD]
|
||||
I was born on [BIRTHDAY]
|
||||
My full address is [FULL_ADDRESS]
|
||||
My kids are [PERSON] and [PERSON2]
|
||||
I either live on [ADDRESS] or [ADDRESS2]
|
||||
Our last names are [LAST_NAME] and [LAST_NAME2]
|
||||
My first name is [FIRST_NAME] and [FIRST_NAME2]
|
||||
My accounts are [ACCOUNT_NUMBER] and [ACCOUNT_NUMBER2]
|
|
@ -0,0 +1,428 @@
|
|||
[
|
||||
{
|
||||
"full_text": "My full address is Avda. Alameda Sundheim 46",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "FULL_ADDRESS",
|
||||
"entity_value": "Avda. Alameda Sundheim 46",
|
||||
"start_position": 19,
|
||||
"end_position": 44
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "My",
|
||||
"idx": 0,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "full",
|
||||
"idx": 3,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "amod",
|
||||
"lemma_": "full",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "address",
|
||||
"idx": 8,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "address",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 16,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "be",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Avda",
|
||||
"idx": 19,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Avda",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ".",
|
||||
"idx": 23,
|
||||
"tag_": ".",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ".",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Alameda",
|
||||
"idx": 25,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "Alameda",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Sundheim",
|
||||
"idx": 33,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "Sundheim",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "46",
|
||||
"idx": 42,
|
||||
"tag_": "CD",
|
||||
"pos_": "NUM",
|
||||
"dep_": "nummod",
|
||||
"lemma_": "46",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"B-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"I-FULL_ADDRESS",
|
||||
"L-FULL_ADDRESS"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "Croatian",
|
||||
"Country": "Uganda",
|
||||
"Lowercase": false,
|
||||
"Template#": 9
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "You want my credit card? No problem: 4532368231815457",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "CREDIT_CARD",
|
||||
"entity_value": "4532368231815457",
|
||||
"start_position": 37,
|
||||
"end_position": 53
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "You",
|
||||
"idx": 0,
|
||||
"tag_": "PRP",
|
||||
"pos_": "PRON",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "want",
|
||||
"idx": 4,
|
||||
"tag_": "VBP",
|
||||
"pos_": "VERB",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "want",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "my",
|
||||
"idx": 9,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "credit",
|
||||
"idx": 12,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "compound",
|
||||
"lemma_": "credit",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "card",
|
||||
"idx": 19,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "dobj",
|
||||
"lemma_": "card",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "?",
|
||||
"idx": 23,
|
||||
"tag_": ".",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": "?",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "No",
|
||||
"idx": 25,
|
||||
"tag_": "DT",
|
||||
"pos_": "DET",
|
||||
"dep_": "det",
|
||||
"lemma_": "no",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "problem",
|
||||
"idx": 28,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "problem",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": ":",
|
||||
"idx": 35,
|
||||
"tag_": ":",
|
||||
"pos_": "PUNCT",
|
||||
"dep_": "punct",
|
||||
"lemma_": ":",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "4532368231815457",
|
||||
"idx": 37,
|
||||
"tag_": "CD",
|
||||
"pos_": "NUM",
|
||||
"dep_": "appos",
|
||||
"lemma_": "4532368231815457",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-CREDIT_CARD"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "female",
|
||||
"NameSet": "Czech",
|
||||
"Country": "Austria",
|
||||
"Lowercase": false,
|
||||
"Template#": 7
|
||||
}
|
||||
},
|
||||
{
|
||||
"full_text": "My first name is Rogelio and my last is Patrick",
|
||||
"masked": null,
|
||||
"spans": [
|
||||
{
|
||||
"entity_type": "PERSON",
|
||||
"entity_value": "Rogelio",
|
||||
"start_position": 17,
|
||||
"end_position": 24
|
||||
},
|
||||
{
|
||||
"entity_type": "PERSON",
|
||||
"entity_value": "Patrick",
|
||||
"start_position": 40,
|
||||
"end_position": 47
|
||||
}
|
||||
],
|
||||
"tokens": [
|
||||
{
|
||||
"text": "My",
|
||||
"idx": 0,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "first",
|
||||
"idx": 3,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "amod",
|
||||
"lemma_": "first",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "name",
|
||||
"idx": 9,
|
||||
"tag_": "NN",
|
||||
"pos_": "NOUN",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "name",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 14,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "ROOT",
|
||||
"lemma_": "be",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Rogelio",
|
||||
"idx": 17,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Rogelio",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "and",
|
||||
"idx": 25,
|
||||
"tag_": "CC",
|
||||
"pos_": "CCONJ",
|
||||
"dep_": "cc",
|
||||
"lemma_": "and",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "my",
|
||||
"idx": 29,
|
||||
"tag_": "PRP$",
|
||||
"pos_": "DET",
|
||||
"dep_": "poss",
|
||||
"lemma_": "-PRON-",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "last",
|
||||
"idx": 32,
|
||||
"tag_": "JJ",
|
||||
"pos_": "ADJ",
|
||||
"dep_": "nsubj",
|
||||
"lemma_": "last",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "is",
|
||||
"idx": 37,
|
||||
"tag_": "VBZ",
|
||||
"pos_": "AUX",
|
||||
"dep_": "conj",
|
||||
"lemma_": "be",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Patrick",
|
||||
"idx": 40,
|
||||
"tag_": "NNP",
|
||||
"pos_": "PROPN",
|
||||
"dep_": "attr",
|
||||
"lemma_": "Patrick",
|
||||
"_": {
|
||||
"is_in_vocabulary": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-PERSON",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"O",
|
||||
"U-PERSON"
|
||||
],
|
||||
"template_id": null,
|
||||
"metadata": {
|
||||
"Gender": "male",
|
||||
"NameSet": "American",
|
||||
"Country": "California",
|
||||
"Lowercase": false,
|
||||
"Template#": 2
|
||||
}
|
||||
}
|
||||
]
|
|
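The fixture above pairs character-level `spans` with per-token BILOU `tags`. The following is a minimal sketch of how such tags could be derived from span offsets and token start indices. The helper name `spans_to_bilou` and the `(text, idx)` token representation are assumptions made for illustration and are not taken from the package; the real conversion may differ, for instance in how it treats tokens that only partially overlap a span.

```python
from typing import Dict, List, Tuple


def spans_to_bilou(tokens: List[Tuple[str, int]], spans: List[Dict]) -> List[str]:
    """Hypothetical helper: map character spans onto tokens as BILOU tags.

    tokens: (text, idx) pairs, where idx is the token's start offset.
    spans:  dicts with entity_type, start_position and end_position
            (end exclusive), as in the JSON fixture.
    """
    tags = ["O"] * len(tokens)
    for span in spans:
        # Indices of tokens that lie fully inside the span.
        covered = [
            i for i, (text, idx) in enumerate(tokens)
            if idx >= span["start_position"]
            and idx + len(text) <= span["end_position"]
        ]
        if not covered:
            continue
        label = span["entity_type"]
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label      # single-token entity
        else:
            tags[covered[0]] = "B-" + label      # first token
            for i in covered[1:-1]:
                tags[i] = "I-" + label           # inner tokens
            tags[covered[-1]] = "L-" + label     # last token
    return tags


address_tokens = [("My", 0), ("full", 3), ("address", 8), ("is", 16), ("Avda", 19),
                  (".", 23), ("Alameda", 25), ("Sundheim", 33), ("46", 42)]
address_spans = [{"entity_type": "FULL_ADDRESS",
                  "start_position": 19, "end_position": 44}]
address_tags = spans_to_bilou(address_tokens, address_spans)
```

For the first sample in the fixture, this sketch produces the same tag sequence as the recorded `tags` list (four `O` tags followed by `B-`, three `I-` and `L-FULL_ADDRESS`).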
@ -0,0 +1,3 @@
|
|||
from .model_mock import IdentityTokensMockModel, \
|
||||
FiftyFiftyIdentityTokensMockModel, \
|
||||
MockTokensModel
|
|
@ -0,0 +1,50 @@
|
|||
from typing import List
|
||||
|
||||
from presidio_evaluator import InputSample, ModelEvaluator
|
||||
|
||||
|
||||
class MockTokensModel(ModelEvaluator):
|
||||
"""
|
||||
Simulates a real model, returns the prediction given in the constructor
|
||||
"""
|
||||
|
||||
def __init__(self, prediction: List[str], entities_to_keep: List = None,
|
||||
verbose: bool = False, **kwargs):
|
||||
super().__init__(entities_to_keep=entities_to_keep, verbose=verbose,
|
||||
**kwargs)
|
||||
self.prediction = prediction
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
return self.prediction
|
||||
|
||||
|
||||
class IdentityTokensMockModel(ModelEvaluator):
|
||||
"""
|
||||
Simulates a real model, always returns the label as the prediction
|
||||
"""
|
||||
|
||||
def __init__(self, entities_to_keep: List = None,
|
||||
verbose: bool = False):
|
||||
super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
return sample.tags
|
||||
|
||||
|
||||
class FiftyFiftyIdentityTokensMockModel(ModelEvaluator):
|
||||
"""
|
||||
Simulates a real model, returns the label or no predictions (list of 'O')
|
||||
alternately
|
||||
"""
|
||||
|
||||
def __init__(self, entities_to_keep: List = None,
|
||||
verbose: bool = False):
|
||||
super().__init__(entities_to_keep=entities_to_keep, verbose=verbose)
|
||||
self.counter = 0
|
||||
|
||||
def predict(self, sample: InputSample) -> List[str]:
|
||||
self.counter += 1
|
||||
if self.counter % 2 == 0:
|
||||
return sample.tags
|
||||
else:
|
||||
return ["O"] * len(sample.tags)
|
|
@ -0,0 +1,22 @@
|
|||
import numpy as np
|
||||
|
||||
from presidio_evaluator.crf_evaluator import CRFEvaluator
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
||||
|
||||
# no_test since the CRF model is not supplied with the package
|
||||
def no_test_test_crf_simple():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
model_path = os.path.abspath(os.path.join(dir_path, "..", "model-outputs/crf.pickle"))
|
||||
|
||||
crf_evaluator = CRFEvaluator(model_pickle_path=model_path, entities_to_keep=['PERSON'])
|
||||
evaluation_results = crf_evaluator.evaluate_all(input_samples)
|
||||
scores = crf_evaluator.calculate_score(evaluation_results)
|
||||
|
||||
np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
|
||||
np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
|
||||
assert scores.pii_recall > 0
|
||||
assert scores.pii_precision > 0
|
|
@ -0,0 +1,48 @@
|
|||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
|
||||
|
||||
def test_to_conll():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
conll = InputSample.create_conll_dataset(input_samples)
|
||||
|
||||
sentences = conll['sentence'].unique()
|
||||
assert len(sentences) == len(input_samples)
|
||||
|
||||
|
||||
def test_to_spacy_all_entities():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
spacy_ver = InputSample.create_spacy_dataset(input_samples)
|
||||
|
||||
assert len(spacy_ver) == len(input_samples)
|
||||
|
||||
|
||||
def test_to_spacy_all_entities_specific_entities():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
spacy_ver = InputSample.create_spacy_dataset(input_samples, entities=['PERSON'])
|
||||
|
||||
spacy_ver_with_labels = [sample for sample in spacy_ver if len(sample[1]['entities'])]
|
||||
|
||||
assert len(spacy_ver_with_labels) < len(input_samples)
|
||||
assert len(spacy_ver_with_labels) > 0
|
||||
|
||||
|
||||
def test_to_spacy_json():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
spacy_ver = InputSample.create_spacy_json(input_samples)
|
||||
|
||||
assert len(spacy_ver) == len(input_samples)
|
||||
assert 'id' in spacy_ver[0]
|
||||
assert 'paragraphs' in spacy_ver[0]
|
|
@ -0,0 +1,26 @@
|
|||
try:
|
||||
from flair.models import SequenceTagger
|
||||
except ImportError:
|
||||
print("Flair is not installed by default")
|
||||
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from presidio_evaluator.flair_evaluator import FlairEvaluator
|
||||
|
||||
import numpy as np
|
||||
|
||||
# no-unit because flair is not a dependency by default
|
||||
def no_unit_test_flair_simple():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
model = SequenceTagger.load('ner-ontonotes-fast') # .load('ner')
|
||||
|
||||
flair_evaluator = FlairEvaluator(model=model, entities_to_keep=['PERSON'])
|
||||
evaluation_results = flair_evaluator.evaluate_all(input_samples)
|
||||
scores = flair_evaluator.calculate_score(evaluation_results)
|
||||
|
||||
np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
|
||||
np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
|
||||
assert scores.pii_recall > 0
|
||||
assert scores.pii_precision > 0
|
|
@ -0,0 +1,121 @@
|
|||
from presidio_evaluator.data_generator import generate, read_synth_dataset, FakeDataGenerator
|
||||
|
||||
|
||||
def get_fake_generator(template, fake_pii_df):
|
||||
class MockFakeGenerator(FakeDataGenerator):
|
||||
"""
|
||||
Mock class that doesn't add to the fake PII DataFrame, so you can inject entities yourself.
|
||||
"""
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
def prep_fake_pii(self, df):
|
||||
return df
|
||||
|
||||
return MockFakeGenerator(templates=[template],
|
||||
fake_pii_df=fake_pii_df,
|
||||
include_metadata=False,
|
||||
span_to_tag=False,
|
||||
dictionary_path=None,
|
||||
lower_case_ratio=0)
|
||||
|
||||
|
||||
def test_generator_correct_output():
|
||||
OUTPUT = "generated_test.txt"
|
||||
EXAMPLES = 3
|
||||
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
fake_pii_csv = "{}/data/FakeNameGenerator.com_100.csv".format(dir_path)
|
||||
utterances_file = "{}/data/templates.txt".format(dir_path)
|
||||
dictionary = "{}/data/Dictionary_test.csv".format(dir_path)
|
||||
|
||||
generate(fake_pii_csv=fake_pii_csv,
|
||||
utterances_file=utterances_file,
|
||||
dictionary_path=dictionary,
|
||||
output_file=OUTPUT,
|
||||
lower_case_ratio=0.3,
|
||||
num_of_examples=EXAMPLES)
|
||||
|
||||
input_samples = read_synth_dataset(OUTPUT)
|
||||
|
||||
for sample in input_samples:
|
||||
assert len(sample.tags) == len(sample.tokens)
|
||||
|
||||
|
||||
def test_a_turned_to_an():
|
||||
fake_pii_df = get_mock_fake_df(GENDER="Ale")
|
||||
template = "I am a [GENDER] living in [COUNTRY]"
|
||||
bracket_location = template.find("[")
|
||||
fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
|
||||
template=template)
|
||||
|
||||
examples = list(fake_generator.sample_examples(1))
|
||||
assert " an " in examples[0].full_text
|
||||
# entity location updated
|
||||
assert examples[0].spans[0].start_position == bracket_location + 1
|
||||
|
||||
|
||||
def test_a_not_turning_into_an():
|
||||
fake_pii_df = get_mock_fake_df(GENDER="Male")
|
||||
template = "I am a [GENDER] living in [COUNTRY]"
|
||||
previous_bracket = template.find("[")
|
||||
fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
|
||||
template=template)
|
||||
|
||||
examples = list(fake_generator.sample_examples(1))
|
||||
assert " an " not in examples[0].full_text
|
||||
assert examples[0].spans[0].start_position == previous_bracket
|
||||
|
||||
|
||||
def test_A_turning_into_An():
|
||||
fake_pii_df = get_mock_fake_df(GENDER="ale")
|
||||
template = "A [GENDER] living in [COUNTRY]"
|
||||
previous_bracket = template.find("[")
|
||||
fake_generator = get_fake_generator(fake_pii_df=fake_pii_df,
|
||||
template=template)
|
||||
|
||||
examples = list(fake_generator.sample_examples(1))
|
||||
assert "An " in examples[0].full_text
|
||||
assert examples[0].spans[0].start_position == previous_bracket + 1
|
||||
|
||||
|
||||
def get_mock_fake_df(**kwargs):
|
||||
dict = {
|
||||
"Number": 1,
|
||||
"Gender": "Male",
|
||||
"NameSet": "English",
|
||||
"Title": "Mr.",
|
||||
"GivenName": "Dondo",
|
||||
"MiddleInitial": "N",
|
||||
"Surname": "Mondo",
|
||||
"StreetAddress": "Where I live 15",
|
||||
"City": "Amsterdam",
|
||||
"State": "",
|
||||
"StateFull": "",
|
||||
"ZipCode": "12345",
|
||||
"Country": "Netherlands",
|
||||
"CountryFull": "Netherlands",
|
||||
"EmailAddress": "dondo@mondo.net",
|
||||
"Username": "Dondo12",
|
||||
"Password": "123456",
|
||||
"TelephoneNumber": "+1412391",
|
||||
"TelephoneCountryCode": "14",
|
||||
"MothersMaiden": "",
|
||||
"Birthday": "15 Aug 1966",
|
||||
"Age": "200",
|
||||
"CCType": "astercard",
|
||||
"CCNumber": "12371832821",
|
||||
"CVV2": "123",
|
||||
"CCExpires": "19-19",
|
||||
"NationalID": "14124",
|
||||
"Occupation": "Hunter",
|
||||
"Company": "Lolo and sons",
|
||||
"Domain": "lolo.com"}
|
||||
|
||||
dict.update(kwargs)
|
||||
|
||||
import pandas as pd
|
||||
fake_pii_df = pd.DataFrame(dict, index=[0])
|
||||
return fake_pii_df
|
|
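The three article tests above expect that when a value starting with a vowel replaces a placeholder preceded by "a" or "A", the article becomes "an" or "An" and the entity's start offset shifts by one. A minimal sketch of that behaviour follows, assuming a single placeholder substitution; `fill_with_article_fix` is a hypothetical helper written for illustration and is not the `FakeDataGenerator` implementation.

```python
import re
from typing import Tuple

VOWELS = "aeiouAEIOU"


def fill_with_article_fix(template: str, placeholder: str, value: str) -> Tuple[str, int]:
    """Hypothetical helper: replace [placeholder] with value, fixing "a" -> "an".

    Returns the filled text and the start offset of the inserted value.
    """
    token = "[" + placeholder + "]"
    start = template.find(token)
    if start == -1:
        raise ValueError("placeholder not found: " + token)
    prefix, suffix = template[:start], template[start + len(token):]
    if value and value[0] in VOWELS:
        # Look for a standalone "a"/"A" directly before the placeholder.
        match = re.search(r"(^|\s)([aA])\s$", prefix)
        if match:
            article_end = match.end(2)
            # Insert "n" after the article; this shifts the entity by one.
            prefix = prefix[:article_end] + "n" + prefix[article_end:]
    return prefix + value + suffix, len(prefix)


template = "I am a [GENDER] living in [COUNTRY]"
text_vowel, start_vowel = fill_with_article_fix(template, "GENDER", "Ale")
text_plain, start_plain = fill_with_article_fix(template, "GENDER", "Male")
```

With the first template used in the tests, a value of "Ale" moves the entity start from the bracket position to one character later, while "Male" leaves it unchanged.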
@ -0,0 +1,271 @@
|
|||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from presidio_evaluator import InputSample, EvaluationResult
|
||||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from tests.mocks import IdentityTokensMockModel, \
|
||||
FiftyFiftyIdentityTokensMockModel, MockTokensModel
|
||||
|
||||
|
||||
def test_evaluator_simple():
|
||||
prediction = ["O", "O", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
|
||||
|
||||
sample = InputSample(full_text="I am the walrus",
|
||||
masked="I am the [ANIMAL]",
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluated = model.evaluate_sample(sample)
|
||||
final_evaluation = model.calculate_score(
|
||||
[evaluated])
|
||||
|
||||
assert final_evaluation.pii_precision == 1
|
||||
assert final_evaluation.pii_recall == 1
|
||||
|
||||
|
||||
def test_evaluate_sample_wrong_entities_to_keep_correct_statistics():
|
||||
prediction = ["O", "O", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction,
|
||||
entities_to_keep=['SPACESHIP'])
|
||||
|
||||
sample = InputSample(full_text="I am the walrus",
|
||||
masked="I am the [ANIMAL]",
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluated = model.evaluate_sample(sample)
|
||||
assert evaluated.results[("O", "O")] == 4
|
||||
|
||||
|
||||
def test_evaluate_same_entity_correct_statistics():
|
||||
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
|
||||
|
||||
sample = InputSample(full_text="I dog the walrus",
|
||||
masked="I [ANIMAL] the [ANIMAL]",
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluation_result = model.evaluate_sample(sample)
|
||||
assert evaluation_result.results[("O", "O")] == 2
|
||||
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
|
||||
assert evaluation_result.results[("O", "ANIMAL")] == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_entities_to_keep_correct_statistics():
|
||||
prediction = ["O", "U-ANIMAL", "O", "U-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction, labeling_scheme='BIO',
|
||||
entities_to_keep=['ANIMAL', 'PLANT', 'SPACESHIP'])
|
||||
sample = InputSample(full_text="I dog the walrus",
|
||||
masked="I [ANIMAL] the [ANIMAL]",
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the", "walrus"]
|
||||
sample.tags = ["O", "O", "O", "U-ANIMAL"]
|
||||
|
||||
evaluation_result = model.evaluate_sample(sample)
|
||||
assert evaluation_result.results[("O", "O")] == 2
|
||||
assert evaluation_result.results[("ANIMAL", "ANIMAL")] == 1
|
||||
assert evaluation_result.results[("O", "ANIMAL")] == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
|
||||
|
||||
sample = InputSample("I am the walrus americanus magnifico", masked=None,
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the",
|
||||
"walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O",
|
||||
"B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = model.evaluate_sample(sample)
|
||||
evaluation = model.calculate_score(
|
||||
[evaluated])
|
||||
|
||||
assert evaluation.pii_precision == 1
|
||||
assert evaluation.pii_recall == 1
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_partial_match_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-ANIMAL", "L-ANIMAL", "O"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
|
||||
|
||||
sample = InputSample("I am the walrus americanus magnifico", masked=None,
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = model.evaluate_sample(sample)
|
||||
evaluation = model.calculate_score(
|
||||
[evaluated])
|
||||
|
||||
assert evaluation.pii_precision == 1
|
||||
assert evaluation.pii_recall == 4 / 6
|
||||
|
||||
|
||||
def test_evaluate_multiple_tokens_no_match_correct_statistics():
|
||||
prediction = ["O", "O", "O", "B-SPACESHIP", "L-SPACESHIP", "O"]
|
||||
model = MockTokensModel(prediction=prediction, entities_to_keep=['ANIMAL'])
|
||||
|
||||
sample = InputSample("I am the walrus americanus magnifico", masked=None,
|
||||
spans=None)
|
||||
sample.tokens = ["I", "am", "the", "walrus", "americanus", "magnifico"]
|
||||
sample.tags = ["O", "O", "O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL"]
|
||||
|
||||
evaluated = model.evaluate_sample(sample)
|
||||
evaluation = model.calculate_score(
|
||||
[evaluated])
|
||||
|
||||
assert np.isnan(evaluation.pii_precision)
|
||||
assert evaluation.pii_recall == 0
|
||||
|
||||
|
||||
def test_evaluate_multiple_examples_correct_statistics():
|
||||
prediction = ["U-PERSON", "O", "O", "U-PERSON", "O", "O"]
|
||||
model = MockTokensModel(prediction=prediction,
|
||||
labeling_scheme='BILOU',
|
||||
entities_to_keep=['PERSON'])
|
||||
input_sample = InputSample("My name is Raphael or David", masked=None,
|
||||
spans=None)
|
||||
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
|
||||
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
|
||||
|
||||
evaluated = model.evaluate_all(
|
||||
[input_sample, input_sample, input_sample, input_sample])
|
||||
scores = model.calculate_score(
|
||||
evaluated)
|
||||
assert scores.pii_precision == 0.5
|
||||
assert scores.pii_recall == 0.5
|
||||
|
||||
|
||||
def test_evaluate_multiple_examples_ignore_entity_correct_statistics():
|
||||
prediction = ["O", "O", "O", "U-PERSON", "O", "U-TENNIS_PLAYER"]
|
||||
model = MockTokensModel(prediction=prediction,
|
||||
labeling_scheme='BILOU',
|
||||
entities_to_keep=['PERSON', 'TENNIS_PLAYER'])
|
||||
input_sample = InputSample("My name is Raphael or David", masked=None,
|
||||
spans=None)
|
||||
input_sample.tokens = ["My", "name", "is", "Raphael", "or", "David"]
|
||||
input_sample.tags = ["O", "O", "O", "U-PERSON", "O", "U-PERSON"]
|
||||
|
||||
evaluated = model.evaluate_all(
|
||||
[input_sample, input_sample, input_sample, input_sample])
|
||||
scores = model.calculate_score(evaluated)
|
||||
assert scores.pii_precision == 1
|
||||
assert scores.pii_recall == 1
|
||||
|
||||
|
||||
def test_confusion_matrix_correct_metrics():
|
||||
from collections import Counter
|
||||
|
||||
evaluated = [EvaluationResult(results=Counter({
|
||||
('O', 'O'): 150,
|
||||
('O', 'PERSON'): 30,
|
||||
('O', 'COMPANY'): 30,
|
||||
('PERSON', 'PERSON'): 40,
|
||||
('COMPANY', 'COMPANY'): 40,
|
||||
('PERSON', 'COMPANY'): 10,
|
||||
('COMPANY', 'PERSON'): 10,
|
||||
('PERSON', 'O'): 30,
|
||||
('COMPANY', 'O'): 30}), model_errors=None, text=None)]
|
||||
|
||||
model = MockTokensModel(prediction=None,
|
||||
entities_to_keep=['PERSON', 'COMPANY'])
|
||||
|
||||
scores = model.calculate_score(evaluated, beta=2.5)
|
||||
|
||||
assert scores.pii_precision == 0.625
|
||||
assert scores.pii_recall == 0.625
|
||||
assert scores.entity_recall_dict['PERSON'] == 0.5
|
||||
assert scores.entity_precision_dict['PERSON'] == 0.5
|
||||
assert scores.entity_recall_dict['COMPANY'] == 0.5
|
||||
assert scores.entity_precision_dict['COMPANY'] == 0.5
|
||||
|
||||
|
||||
def test_confusion_matrix_2_correct_metrics():
|
||||
from collections import Counter
|
||||
|
||||
evaluated = [EvaluationResult(results=Counter(
|
||||
{('O', 'O'): 65467,
|
||||
('O', 'ORG'): 4189,
|
||||
('GPE', 'O'): 3370,
|
||||
('PERSON', 'PERSON'): 2024,
|
||||
('GPE', 'PERSON'): 1488,
|
||||
('GPE', 'GPE'): 1033,
|
||||
('O', 'GPE'): 964,
|
||||
('ORG', 'ORG'): 914,
|
||||
('O', 'PERSON'): 834,
|
||||
('GPE', 'ORG'): 401,
|
||||
('PERSON', 'ORG'): 35,
|
||||
('PERSON', 'O'): 33,
|
||||
('ORG', 'O'): 8,
|
||||
('PERSON', 'GPE'): 5,
|
||||
('ORG', 'PERSON'): 1}), model_errors=None, text=None)]
|
||||
|
||||
model = MockTokensModel(prediction=None)
|
||||
|
||||
scores = model.calculate_score(evaluated, beta=2.5)
|
||||
|
||||
pii_tp = evaluated[0].results[('PERSON', 'PERSON')] + \
|
||||
evaluated[0].results[('ORG', 'ORG')] + \
|
||||
evaluated[0].results[('GPE', 'GPE')] + \
|
||||
evaluated[0].results[('ORG', 'GPE')] + \
|
||||
evaluated[0].results[('ORG', 'PERSON')] + \
|
||||
evaluated[0].results[('GPE', 'ORG')] + \
|
||||
evaluated[0].results[('GPE', 'PERSON')] + \
|
||||
evaluated[0].results[('PERSON', 'GPE')] + \
|
||||
evaluated[0].results[('PERSON', 'ORG')]
|
||||
|
||||
pii_fp = evaluated[0].results[('O', 'PERSON')] + \
|
||||
evaluated[0].results[('O', 'GPE')] + \
|
||||
evaluated[0].results[('O', 'ORG')]
|
||||
|
||||
pii_fn = evaluated[0].results[('PERSON', 'O')] + \
|
||||
evaluated[0].results[('GPE', 'O')] + \
|
||||
evaluated[0].results[('ORG', 'O')]
|
||||
|
||||
assert scores.pii_precision == pii_tp / (pii_tp + pii_fp)
|
||||
assert scores.pii_recall == pii_tp / (pii_tp + pii_fn)
|
||||
|
||||
|
||||
def test_dataset_to_metric_identity_model():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(
|
||||
"{}/data/generated_small.txt".format(dir_path), length=10)
|
||||
|
||||
model = IdentityTokensMockModel()
|
||||
|
||||
evaluation_results = model.evaluate_all(input_samples)
|
||||
metrics = model.calculate_score(
|
||||
evaluation_results)
|
||||
|
||||
assert metrics.pii_precision == 1
|
||||
assert metrics.pii_recall == 1
|
||||
|
||||
|
||||
def test_dataset_to_metric_50_50_model():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(
|
||||
"{}/data/generated_small.txt".format(dir_path), length=100)
|
||||
|
||||
# Replace 50% of the predictions with a list of "O"
|
||||
model = FiftyFiftyIdentityTokensMockModel(entities_to_keep='PERSON')
|
||||
|
||||
evaluation_results = model.evaluate_all(input_samples)
|
||||
metrics = model.calculate_score(
|
||||
evaluation_results)
|
||||
|
||||
print(metrics.pii_precision)
|
||||
print(metrics.pii_recall)
|
||||
print(metrics.pii_f)
|
||||
|
||||
assert metrics.pii_precision == 1
|
||||
assert metrics.pii_recall < 0.75
|
||||
assert metrics.pii_recall > 0.25
|
|
@ -0,0 +1,80 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, ignoring temporarily
|
||||
'''
|
||||
#
|
||||
# import pytest
|
||||
#
|
||||
# from presidio_evaluator import InputSample, Span
|
||||
# from presidio_evaluator.data_generator import read_synth_dataset
|
||||
# from presidio_evaluator.presidio_analyzer import PresidioAnalyzer
|
||||
#
|
||||
#
|
||||
# class GeneratedTextTestCase:
|
||||
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
|
||||
# self.test_name = test_name
|
||||
# self.test_input = test_input
|
||||
# self.acceptance_threshold = acceptance_threshold
|
||||
# self.marks = marks
|
||||
#
|
||||
# def to_pytest_param(self):
|
||||
# return pytest.param(self.test_input, self.acceptance_threshold,
|
||||
# id=self.test_name, marks=self.marks)
|
||||
#
|
||||
#
|
||||
# # generated-text test cases
|
||||
# analyzer_test_generate_text_testdata = [
|
||||
# # small set fixture which expects all results.
|
||||
# GeneratedTextTestCase(
|
||||
# test_name="small-set",
|
||||
# test_input="{}/data/generated_small.txt",
|
||||
# acceptance_threshold=0.3,
|
||||
# marks=pytest.mark.none
|
||||
# )
|
||||
# ]
|
||||
#
|
||||
#
|
||||
# @pytest.mark.skip(reason="Presidio analyzer not on PyPI")
|
||||
# def test_analyzer_simple_input():
|
||||
# model = PresidioAnalyzer(entities_to_keep=['PERSON'])
|
||||
#
|
||||
# sample = InputSample(full_text="My name is Mike",
|
||||
# masked="My name is [PERSON]",
|
||||
# spans=[Span('PERSON', 'Mike', 10, 14)],
|
||||
# create_tags_from_span=True)
|
||||
#
|
||||
# evaluated = model.evaluate_sample(sample)
|
||||
# metrics = model.calculate_score(
|
||||
# [evaluated])
|
||||
#
|
||||
# assert metrics.pii_precision == 1
|
||||
# assert metrics.pii_recall == 1
|
||||
#
|
||||
#
|
||||
# # analyzer tests on generated data
|
||||
# @pytest.mark.skip(reason="Presidio analyzer not on PyPI")
|
||||
# @pytest.mark.parametrize("test_input,acceptance_threshold",
|
||||
# [testcase.to_pytest_param() for testcase in
|
||||
# analyzer_test_generate_text_testdata])
|
||||
# def test_analyzer_with_generated_text(test_input, acceptance_threshold):
|
||||
# """
|
||||
# Test analyzer with a generated dataset text file
|
||||
# :param test_input: input text file location
|
||||
# :param acceptance_threshold: minimum precision/recall
|
||||
# allowed for tests to pass
|
||||
# """
|
||||
# # read test input from generated file
|
||||
#
|
||||
# import os
|
||||
# dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
# input_samples = read_synth_dataset(
|
||||
# test_input.format(dir_path))
|
||||
#
|
||||
# updated_samples = PresidioAnalyzer. \
|
||||
# align_input_samples_to_presidio_analyzer(input_samples)
|
||||
#
|
||||
# analyzer = PresidioAnalyzer()
|
||||
# evaluated_samples = analyzer.evaluate_all(updated_samples)
|
||||
# scores = analyzer.calculate_score(evaluation_results=evaluated_samples)
|
||||
#
|
||||
# assert acceptance_threshold <= scores.pii_precision
|
||||
# assert acceptance_threshold <= scores.pii_recall
|
|
@ -0,0 +1,62 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, ignoring temporarily
|
||||
'''
|
||||
|
||||
# from presidio_evaluator.data_generator import read_synth_dataset
|
||||
# from presidio_evaluator.presidio_recognizer_evaluator import score_presidio_recognizer
|
||||
# import pytest
|
||||
#
|
||||
# from analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
|
||||
#
|
||||
# # test case parameters for tests with dataset which was previously generated.
|
||||
# class GeneratedTextTestCase:
|
||||
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
|
||||
# self.test_name = test_name
|
||||
# self.test_input = test_input
|
||||
# self.acceptance_threshold = acceptance_threshold
|
||||
# self.marks = marks
|
||||
#
|
||||
# def to_pytest_param(self):
|
||||
# return pytest.param(self.test_input, self.acceptance_threshold,
|
||||
# id=self.test_name, marks=self.marks)
|
||||
#
|
||||
#
|
||||
# # generated-text test cases
|
||||
# cc_test_generate_text_testdata = [
|
||||
# # small set fixture which expects all type results.
|
||||
# GeneratedTextTestCase(
|
||||
# test_name="small-set",
|
||||
# test_input="{}/data/generated_small.txt",
|
||||
# acceptance_threshold=1,
|
||||
# marks=pytest.mark.none
|
||||
# ),
|
||||
# # large set fixture which expects all type results. marked as "slow"
|
||||
# GeneratedTextTestCase(
|
||||
# test_name="large_set",
|
||||
# test_input="{}/data/generated_large.txt",
|
||||
# acceptance_threshold=1,
|
||||
# marks=pytest.mark.slow
|
||||
# )
|
||||
# ]
|
||||
#
|
||||
#
|
||||
# # credit card recognizer tests on generated data
|
||||
# @pytest.mark.parametrize("test_input,acceptance_threshold",
|
||||
# [testcase.to_pytest_param()
|
||||
# for testcase in cc_test_generate_text_testdata])
|
||||
# def test_credit_card_recognizer_with_generated_text(test_input, acceptance_threshold):
|
||||
# """
|
||||
# Test credit card recognizer with a generated dataset text file
|
||||
# :param test_input: input text file location
|
||||
# :param acceptance_threshold: minimum precision/recall
|
||||
# allowed for tests to pass
|
||||
# """
|
||||
#
|
||||
# # read test input from generated file
|
||||
# import os
|
||||
# dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
# input_samples = read_synth_dataset(
|
||||
# test_input.format(dir_path))
|
||||
# scores = score_presidio_recognizer(
|
||||
# CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
|
||||
# assert acceptance_threshold <= scores.pii_f
|
|
@ -0,0 +1,83 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, ignoring temporarily
|
||||
'''
|
||||
|
||||
# from presidio_evaluator.data_generator import generate
|
||||
# from presidio_evaluator.presidio_recognizer_evaluator import \
|
||||
# score_presidio_recognizer
|
||||
# import pytest
|
||||
# import numpy as np
|
||||
#
|
||||
# from analyzer.predefined_recognizers.credit_card_recognizer import CreditCardRecognizer
|
||||
#
|
||||
# # test case parameters for tests with dataset generated from a template and csv values
|
||||
# class TemplateTextTestCase:
|
||||
# def __init__(self, test_name, pii_csv, utterances, dictionary_path,
|
||||
# num_of_examples, acceptance_threshold, marks):
|
||||
# self.test_name = test_name
|
||||
# self.pii_csv = pii_csv
|
||||
# self.utterances = utterances
|
||||
# self.dictionary_path = dictionary_path
|
||||
# self.num_of_examples = num_of_examples
|
||||
# self.acceptance_threshold = acceptance_threshold
|
||||
# self.marks = marks
|
||||
#
|
||||
# def to_pytest_param(self):
|
||||
# return pytest.param(self.pii_csv, self.utterances, self.dictionary_path,
|
||||
# self.num_of_examples, self.acceptance_threshold,
|
||||
# id=self.test_name, marks=self.marks)
|
||||
#
|
||||
#
|
||||
# # template-dataset test cases
|
||||
# cc_test_template_testdata = [
|
||||
# # large dataset fixture. marked as slow
|
||||
# TemplateTextTestCase(
|
||||
# test_name="fake-names-100",
|
||||
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
|
||||
# utterances="{}/data/templates.txt",
|
||||
# dictionary_path="{}/data/Dictionary_test.csv",
|
||||
# num_of_examples=100,
|
||||
# acceptance_threshold=0.9,
|
||||
# marks=pytest.mark.slow
|
||||
# )
|
||||
# ]
|
||||
#
|
||||
#
|
||||
# # credit card recognizer tests on template-generated data
|
||||
# @pytest.mark.parametrize("pii_csv, "
|
||||
# "utterances, "
|
||||
# "dictionary_path, "
|
||||
# "num_of_examples, "
|
||||
# "acceptance_threshold",
|
||||
# [testcase.to_pytest_param()
|
||||
# for testcase in cc_test_template_testdata])
|
||||
# def test_credit_card_recognizer_with_template(pii_csv, utterances,
|
||||
# dictionary_path,
|
||||
# num_of_examples,
|
||||
# acceptance_threshold):
|
||||
# """
|
||||
# Test credit card recognizer with a dataset generated from
|
||||
# template and a CSV values file
|
||||
# :param pii_csv: input csv file location
|
||||
# :param utterances: template file location
|
||||
# :param dictionary_path: dictionary/vocabulary file location
|
||||
# :param num_of_examples: number of samples to be used from dataset
|
||||
# to test
|
||||
# :param acceptance_threshold: minimum precision/recall
|
||||
# allowed for tests to pass
|
||||
# """
|
||||
#
|
||||
# # read template and CSV files
|
||||
# import os
|
||||
# dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
#
|
||||
# input_samples = generate(fake_pii_csv=pii_csv.format(dir_path),
|
||||
# utterances_file=utterances.format(dir_path),
|
||||
# dictionary_path=dictionary_path.format(dir_path),
|
||||
# lower_case_ratio=0.5,
|
||||
# num_of_examples=num_of_examples)
|
||||
#
|
||||
# scores = score_presidio_recognizer(
|
||||
# CreditCardRecognizer(), 'CREDIT_CARD', input_samples)
|
||||
# if not np.isnan(scores.pii_f):
|
||||
# assert acceptance_threshold <= scores.pii_f
|
|
@ -0,0 +1,148 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, ignoring temporarily
|
||||
'''
|
||||
|
||||
# from presidio_evaluator.data_generator import FakeDataGenerator
|
||||
# from presidio_evaluator.presidio_recognizer_evaluator import \
|
||||
# score_presidio_recognizer
|
||||
# import pandas as pd
|
||||
# import pytest
|
||||
# import numpy as np
|
||||
#
|
||||
# from analyzer import Pattern, PatternRecognizer
|
||||
#
|
||||
# # test case parameters for tests with dataset generated from a template and
|
||||
# # two csv value files, one containing the common-entities and another one with custom entities
|
||||
# class PatternRecognizerTestCase:
|
||||
# def __init__(self, test_name, entity_name, pattern, score, pii_csv, ext_csv,
|
||||
# utterances, dictionary_path, num_of_examples, acceptance_threshold,
|
||||
# max_mistakes_number, marks):
|
||||
# self.test_name = test_name
|
||||
# self.entity_name = entity_name
|
||||
# self.pattern = pattern
|
||||
# self.score = score
|
||||
# self.pii_csv = pii_csv
|
||||
# self.ext_csv = ext_csv
|
||||
# self.utterances = utterances
|
||||
# self.dictionary_path = dictionary_path
|
||||
# self.num_of_examples = num_of_examples
|
||||
# self.acceptance_threshold = acceptance_threshold
|
||||
# self.max_mistakes_number = max_mistakes_number
|
||||
# self.marks = marks
|
||||
#
|
||||
# def to_pytest_param(self):
|
||||
# return pytest.param(self.pii_csv, self.ext_csv, self.utterances,
|
||||
# self.dictionary_path,
|
||||
# self.entity_name, self.pattern, self.score,
|
||||
# self.num_of_examples, self.acceptance_threshold,
|
||||
# self.max_mistakes_number, id=self.test_name,
|
||||
# marks=self.marks)
|
||||
#
|
||||
#
|
||||
# # template-dataset test cases
|
||||
# rocket_test_template_testdata = [
|
||||
# # large dataset fixture. marked as slow.
|
||||
# # all input is correct, test is conclusive
|
||||
# PatternRecognizerTestCase(
|
||||
# test_name="rocket-no-errors",
|
||||
# entity_name="ROCKET",
|
||||
# pattern=r'\W*(rocket)\W*',
|
||||
# score=0.8,
|
||||
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
|
||||
# ext_csv="{}/data/FakeRocketGenerator.csv",
|
||||
# utterances="{}/data/rocket_example_sentences.txt",
|
||||
# dictionary_path="{}/data/Dictionary_test.csv",
|
||||
# num_of_examples=100,
|
||||
# acceptance_threshold=1,
|
||||
# max_mistakes_number=0,
|
||||
# marks=pytest.mark.slow
|
||||
# ),
|
||||
# # large dataset fixture. marked as slow
|
||||
# # all input is incorrect, test is conclusive
|
||||
# PatternRecognizerTestCase(
|
||||
# test_name="rocket-all-errors",
|
||||
# entity_name="ROCKET",
|
||||
# pattern=r'\W*(rocket)\W*',
|
||||
# score=0.8,
|
||||
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
|
||||
# ext_csv="{}/data/FakeRocketErrorsGenerator.csv",
|
||||
# utterances="{}/data/rocket_example_sentences.txt",
|
||||
# dictionary_path="{}/data/Dictionary_test.csv",
|
||||
# num_of_examples=100,
|
||||
# acceptance_threshold=0,
|
||||
# max_mistakes_number=100,
|
||||
# marks=pytest.mark.slow
|
||||
# ),
|
||||
# # large dataset fixture. marked as slow
|
||||
# # some input is correct some is not, test is inconclusive
|
||||
# PatternRecognizerTestCase(
|
||||
# test_name="rocket-some-errors",
|
||||
# entity_name="ROCKET",
|
||||
# pattern=r'\W*(rocket)\W*',
|
||||
# score=0.8,
|
||||
# pii_csv="{}/data/FakeNameGenerator.com_100.csv",
|
||||
# ext_csv="{}/data/FakeRocket50PercentErrorsGenerator.csv",
|
||||
# utterances="{}/data/rocket_example_sentences.txt",
|
||||
# dictionary_path="{}/data/Dictionary_test.csv",
|
||||
# num_of_examples=100,
|
||||
# acceptance_threshold=0.3,
|
||||
# max_mistakes_number=70,
|
||||
# marks=[pytest.mark.slow, pytest.mark.inconclusive]
|
||||
# )
|
||||
# ]
|
||||
#
|
||||
#
|
||||
# @pytest.mark.parametrize(
|
||||
# "pii_csv, ext_csv, utterances, dictionary_path, "
|
||||
# "entity_name, pattern, score, num_of_examples, "
|
||||
# "acceptance_threshold, max_mistakes_number",
|
||||
# [testcase.to_pytest_param()
|
||||
# for testcase in rocket_test_template_testdata])
|
||||
# def test_pattern_recognizer(pii_csv, ext_csv, utterances, dictionary_path,
|
||||
# entity_name, pattern,
|
||||
# score, num_of_examples, acceptance_threshold,
|
||||
# max_mistakes_number):
|
||||
# """
|
||||
# Test generic pattern recognizer with a dataset generated from template, a CSV values file with common entities
|
||||
# and another CSV values file with a custom entity
|
||||
# :param pii_csv: input csv file location with the common entities
|
||||
# :param ext_csv: input csv file location with custom entities
|
||||
# :param utterances: template file location
|
||||
# :param dictionary_path: vocabulary/dictionary file location
|
||||
# :param entity_name: custom entity name
|
||||
# :param pattern: recognizer pattern
|
||||
# :param num_of_examples: number of samples to be used from dataset to test
|
||||
# :param acceptance_threshold: minimum precision/recall
|
||||
# allowed for tests to pass
|
||||
# """
|
||||
#
|
||||
# import os
|
||||
# dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
# dfpii = pd.read_csv(pii_csv.format(dir_path), encoding='utf-8')
|
||||
# dfext = pd.read_csv(ext_csv.format(dir_path), encoding='utf-8')
|
||||
# dictionary_path = dictionary_path.format(dir_path)
|
||||
# ext_column_name = dfext.columns[0]
|
||||
#
|
||||
# def get_from_ext(i):
|
||||
# index = i % dfext.shape[0]
|
||||
# return dfext.iat[index, 0]
|
||||
#
|
||||
# # extend pii with ext data
|
||||
# dfpii[ext_column_name] = [get_from_ext(i) for i in range(0, dfpii.shape[0])]
|
||||
#
|
||||
# # generate examples
|
||||
# generator = FakeDataGenerator(fake_pii_csv_file=dfpii,
|
||||
# utterances_file=utterances.format(dir_path),
|
||||
# dictionary_path=dictionary_path)
|
||||
# examples = generator.sample_examples(num_of_examples)
|
||||
#
|
||||
# pattern = Pattern("test pattern", pattern, score)
|
||||
# pattern_recognizer = PatternRecognizer(entity_name,
|
||||
# name="test recognizer",
|
||||
# patterns=[pattern])
|
||||
#
|
||||
# scores = score_presidio_recognizer(
|
||||
# pattern_recognizer, [entity_name], examples)
|
||||
# if not np.isnan(scores.pii_f):
|
||||
# assert acceptance_threshold <= scores.pii_f
|
||||
# assert max_mistakes_number >= len(scores.model_errors)
|
|
@ -0,0 +1,18 @@
|
|||
from presidio_evaluator.data_generator import read_synth_dataset
|
||||
from presidio_evaluator.spacy_evaluator import SpacyEvaluator
|
||||
import numpy as np
|
||||
|
||||
|
||||
def test_spacy_simple():
|
||||
import os
|
||||
dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
input_samples = read_synth_dataset(os.path.join(dir_path, "data/generated_small.txt"))
|
||||
|
||||
spacy_evaluator = SpacyEvaluator(model_name="en_core_web_lg", entities_to_keep=['PERSON'])
|
||||
evaluation_results = spacy_evaluator.evaluate_all(input_samples)
|
||||
scores = spacy_evaluator.calculate_score(evaluation_results)
|
||||
|
||||
np.testing.assert_almost_equal(scores.pii_precision, scores.entity_precision_dict['PERSON'])
|
||||
np.testing.assert_almost_equal(scores.pii_recall, scores.entity_recall_dict['PERSON'])
|
||||
assert scores.pii_recall > 0
|
||||
assert scores.pii_precision > 0
|
|
@ -0,0 +1,63 @@
|
|||
'''
|
||||
Presidio Analyzer not yet on PyPI, ignoring temporarily
|
||||
'''
|
||||
|
||||
# from presidio_evaluator.data_generator import read_synth_dataset
|
||||
# from presidio_evaluator.presidio_recognizer_evaluator import \
|
||||
# score_presidio_recognizer
|
||||
#
|
||||
# import pytest
|
||||
# from analyzer.predefined_recognizers.spacy_recognizer import SpacyRecognizer
|
||||
#
|
||||
# # test case parameters for tests with dataset which was previously generated.
|
||||
# class GeneratedTextTestCase:
|
||||
# def __init__(self, test_name, test_input, acceptance_threshold, marks):
|
||||
# self.test_name = test_name
|
||||
# self.test_input = test_input
|
||||
# self.acceptance_threshold = acceptance_threshold
|
||||
# self.marks = marks
|
||||
#
|
||||
# def to_pytest_param(self):
|
||||
# return pytest.param(self.test_input, self.acceptance_threshold,
|
||||
# id=self.test_name, marks=self.marks)
|
||||
#
|
||||
#
|
||||
# # generated-text test cases
|
||||
# cc_test_generate_text_testdata = [
|
||||
# # small dataset, inconclusive results
|
||||
# GeneratedTextTestCase(
|
||||
# test_name="small-set",
|
||||
# test_input="{}/data/generated_small.txt",
|
||||
# acceptance_threshold=0.5,
|
||||
# marks=pytest.mark.inconclusive
|
||||
# ),
|
||||
# # large dataset - test is slow and inconclusive
|
||||
# GeneratedTextTestCase(
|
||||
# test_name="large-set",
|
||||
# test_input="{}/data/generated_large.txt",
|
||||
# acceptance_threshold=0.5,
|
||||
# marks=pytest.mark.slow
|
||||
# )
|
||||
# ]
|
||||
#
|
||||
#
|
||||
# # spaCy recognizer tests on generated data
|
||||
# @pytest.mark.parametrize("test_input,acceptance_threshold",
|
||||
# [testcase.to_pytest_param() for testcase in
|
||||
# cc_test_generate_text_testdata])
|
||||
# def test_spacy_recognizer_with_generated_text(test_input, acceptance_threshold):
|
||||
# """
|
||||
# Test spacy recognizer with a generated dataset text file
|
||||
# :param test_input: input text file location
|
||||
# :param acceptance_threshold: minimum precision/recall
|
||||
# allowed for tests to pass
|
||||
# """
|
||||
#
|
||||
# # read test input from generated file
|
||||
# import os
|
||||
# dir_path = os.path.dirname(os.path.realpath(__file__))
|
||||
# input_samples = read_synth_dataset(
|
||||
# test_input.format(dir_path))
|
||||
# scores = score_presidio_recognizer(
|
||||
# SpacyRecognizer(), ['PERSON'], input_samples, True)
|
||||
# assert acceptance_threshold <= scores.pii_f
|
|
@ -0,0 +1,212 @@
|
|||
from presidio_evaluator import span_to_tag
|
||||
|
||||
BILOU_SCHEME = "BILOU"
|
||||
BIO_SCHEME = "BIO"
|
||||
|
||||
|
||||
def test_span_to_bio_multiple_tokens():
|
||||
text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
|
||||
start = 14
|
||||
end = 38
|
||||
tag = "ADDRESS"
|
||||
|
||||
bio = span_to_tag(BIO_SCHEME, text, [start], [end], [tag])
|
||||
|
||||
print(bio)
|
||||
|
||||
expected = ['O', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS',
|
||||
'I-ADDRESS', 'I-ADDRESS', 'I-ADDRESS', 'O', 'O', 'O', 'O', 'O']
|
||||
assert bio == expected
|
||||
|
||||
|
||||
def test_span_to_bio_single_at_end():
|
||||
text = "My name is Josh"
|
||||
start = 11
|
||||
end = 15
|
||||
tag = "NAME"
|
||||
|
||||
bilou = span_to_tag(BIO_SCHEME, text, [start], [end], [tag])
|
||||
|
||||
print(bilou)
|
||||
|
||||
expected = ['O', 'O', 'O', 'I-NAME']
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bilou_multiple_tokens():
|
||||
text = "My Address is 409 Bob st. Manhattan NY. I just moved in"
|
||||
start = 14
|
||||
end = 38
|
||||
tag = "ADDRESS"
|
||||
|
||||
bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], [tag])
|
||||
|
||||
print(bilou)
|
||||
|
||||
expected = ['O', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS',
|
||||
'I-ADDRESS', 'I-ADDRESS', 'L-ADDRESS', 'O', 'O', 'O', 'O', 'O']
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bilou_adjacent_entities():
|
||||
text = "Mr. Tree"
|
||||
start1 = 0
|
||||
end1 = 2
|
||||
start2 = 4
|
||||
end2 = 8
|
||||
|
||||
start = [start1, start2]
|
||||
end = [end1, end2]
|
||||
|
||||
tag = ["TITLE", "NAME"]
|
||||
|
||||
bilou = span_to_tag(BILOU_SCHEME, text, start, end, tag)
|
||||
|
||||
print(bilou)
|
||||
|
||||
expected = ['U-TITLE', 'U-NAME']
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bilou_single_at_end():
|
||||
text = "My name is Josh"
|
||||
start = 11
|
||||
end = 15
|
||||
tag = "NAME"
|
||||
|
||||
bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], [tag])
|
||||
|
||||
print(bilou)
|
||||
|
||||
expected = ['O', 'O', 'O', 'U-NAME']
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bilou_multiple_entities():
|
||||
text = "My name is Josh or David"
|
||||
start1 = 11
|
||||
end1 = 15
|
||||
start2 = 19
|
||||
end2 = 26
|
||||
|
||||
start = [start1, start2]
|
||||
end = [end1, end2]
|
||||
|
||||
tag = ["NAME", "NAME"]
|
||||
|
||||
bilou = span_to_tag(BILOU_SCHEME, text, start, end, tag)
|
||||
|
||||
print(bilou)
|
||||
|
||||
expected = ['O', 'O', 'O', 'U-NAME', 'O', 'U-NAME']
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bio_multiple_entities():
|
||||
text = "My name is Josh or David"
|
||||
start1 = 11
|
||||
end1 = 15
|
||||
start2 = 19
|
||||
end2 = 26
|
||||
|
||||
start = [start1, start2]
|
||||
end = [end1, end2]
|
||||
|
||||
tag = ["NAME", "NAME"]
|
||||
|
||||
bilou = span_to_tag(scheme=BIO_SCHEME, text=text, start=start,
|
||||
end=end, tag=tag)
|
||||
|
||||
print(bilou)
|
||||
|
||||
expected = ['O', 'O', 'O', 'I-NAME', 'O', 'I-NAME']
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bio_specific_input():
|
||||
text = "Someone stole my credit card. The number is 5277716201469117 and " \
|
||||
"the my name is Mary Anguiano"
|
||||
start = 80
|
||||
end = 93
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
|
||||
'O', 'O', 'B-PERSON', 'I-PERSON']
|
||||
tag = ["PERSON"]
|
||||
bilou = span_to_tag(BIO_SCHEME, text, [start], [end], tag)
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bilou_specific_input():
|
||||
text = "Someone stole my credit card. The number is 5277716201469117 and " \
|
||||
"the my name is Mary Anguiano"
|
||||
start = 80
|
||||
end = 93
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
|
||||
'O', 'O', 'B-PERSON', 'L-PERSON']
|
||||
tag = ["PERSON"]
|
||||
bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], tag)
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_span_to_bilou_adjacent_identical_entities():
|
||||
text = "May I get access to Jessica Gump's account?"
|
||||
start = 20
|
||||
end = 32
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O']
|
||||
tag = ["PERSON"]
|
||||
bilou = span_to_tag(BILOU_SCHEME, text, [start], [end], tag)
|
||||
assert bilou == expected
|
||||
|
||||
|
||||
def test_overlapping_entities_first_ends_in_mid_second():
|
||||
text = "My new phone number is 1 705 774 8720. Thanks, man"
|
||||
start = [22, 25]
|
||||
end = [37, 37]
|
||||
scores = [0.6, 0.6]
|
||||
tag = ["PHONE_NUMBER", "US_PHONE_NUMBER"]
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
|
||||
'US_PHONE_NUMBER', 'US_PHONE_NUMBER',
|
||||
'O', 'O', 'O', 'O']
|
||||
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
|
||||
io_tags_only=True)
|
||||
assert io == expected
|
||||
|
||||
|
||||
def test_overlapping_entities_second_embedded_in_first_with_lower_score():
|
||||
text = "My new phone number is 1 705 774 8720. Thanks, man"
|
||||
start = [22, 25]
|
||||
end = [37, 33]
|
||||
scores = [0.6, 0.5]
|
||||
tag = ["PHONE_NUMBER", "US_PHONE_NUMBER"]
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'PHONE_NUMBER',
|
||||
'PHONE_NUMBER', 'PHONE_NUMBER',
|
||||
'O', 'O', 'O', 'O']
|
||||
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
|
||||
io_tags_only=True)
|
||||
assert io == expected
|
||||
|
||||
|
||||
def test_overlapping_entities_second_embedded_in_first_has_higher_score():
|
||||
text = "My new phone number is 1 705 774 8720. Thanks, man"
|
||||
start = [23, 25]
|
||||
end = [37, 28]
|
||||
scores = [0.6, 0.7]
|
||||
tag = ["PHONE_NUMBER", "US_PHONE_NUMBER"]
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'PHONE_NUMBER', 'US_PHONE_NUMBER',
|
||||
'PHONE_NUMBER', 'PHONE_NUMBER',
|
||||
'O', 'O', 'O', 'O']
|
||||
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
|
||||
io_tags_only=True)
|
||||
assert io == expected
|
||||
|
||||
|
||||
def test_overlapping_entities_pyramid():
|
||||
text = "My new phone number is 1 705 999 774 8720. Thanks, cya"
|
||||
start = [23, 25, 29]
|
||||
end = [41, 36, 32]
|
||||
scores = [0.6, 0.7, 0.8]
|
||||
tag = ["A1", "B2", "C3"]
|
||||
expected = ['O', 'O', 'O', 'O', 'O', 'A1', 'B2', 'C3', 'B2',
|
||||
'A1', 'O', 'O', 'O', 'O']
|
||||
io = span_to_tag(BIO_SCHEME, text, start, end, tag, scores,
|
||||
io_tags_only=True)
|
||||
assert io == expected
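The overlap tests above suggest that each token takes the tag of the highest-scoring span covering it. A minimal sketch of that rule follows. It is not the library's `span_to_tag`. It assumes whitespace tokenization instead of the spaCy tokenizer, so punctuation stays attached to words and the token count differs from the expected lists in the tests. The name `spans_to_io_tags` is hypothetical, and breaking score ties in favor of the later-starting span is an assumption chosen to match the equal-score test.

```python
import re


def spans_to_io_tags(text, starts, ends, tags, scores):
    """Label whitespace tokens with the best-scoring span overlapping each one.

    Hypothetical helper: spans are [start, end) character offsets.
    """
    labels = []
    for match in re.finditer(r"\S+", text):
        best = None  # (score, start, tag) of the winning span for this token
        for start, end, tag, score in zip(starts, ends, tags, scores):
            overlaps = start < match.end() and end > match.start()
            # Higher score wins; on equal scores the later-starting span wins.
            if overlaps and (best is None or (score, start) > best[:2]):
                best = (score, start, tag)
        labels.append(best[2] if best else "O")
    return labels


pyramid = spans_to_io_tags(
    "My new phone number is 1 705 999 774 8720. Thanks, cya",
    starts=[23, 25, 29], ends=[41, 36, 32],
    tags=["A1", "B2", "C3"], scores=[0.6, 0.7, 0.8])
```

For the pyramid input, the innermost span wins where all three overlap, and the outer spans reappear on the tokens the inner ones do not cover.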
|
|
@ -0,0 +1,98 @@
|
|||
import pytest
|
||||
|
||||
from presidio_evaluator import InputSample
|
||||
from presidio_evaluator.validation import split_by_template, get_samples_by_pattern, split_dataset
|
||||
|
||||
|
||||
def get_mock_dataset():
|
||||
sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample5 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
|
||||
sample6 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
|
||||
sample7 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
|
||||
sample8 = InputSample("Bye there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
|
||||
|
||||
return [sample1, sample2, sample3, sample4, sample5, sample6, sample7, sample8]
|
||||
|
||||
|
||||
def test_split_by_template():
|
||||
dataset = get_mock_dataset()
|
||||
train_templates, test_templates = split_by_template(dataset, 0.5)
|
||||
assert len(train_templates) == 2
|
||||
assert len(test_templates) == 2
|
||||
|
||||
|
||||
def test_get_samples_by_pattern():
|
||||
dataset = get_mock_dataset()
|
||||
train_templates, test_templates = split_by_template(dataset, 0.5)
|
||||
train_samples = get_samples_by_pattern(dataset, train_templates)
|
||||
test_samples = get_samples_by_pattern(dataset, test_templates)
|
||||
|
||||
dataset_templates = set([sample.metadata['Template#'] for sample in dataset])
|
||||
train_samples_templates = set([sample.metadata['Template#'] for sample in train_samples])
|
||||
test_samples_templates = set([sample.metadata['Template#'] for sample in test_samples])
|
||||
|
||||
assert len(train_samples) + len(test_samples) == len(dataset)
|
||||
assert dataset_templates == train_samples_templates | test_samples_templates
|
||||
assert train_samples_templates & test_samples_templates == set()
|
||||
assert train_samples_templates == set(train_templates)
|
||||
assert test_samples_templates == set(test_templates)
|
||||
|
||||
|
||||
def test_split_dataset_two_sets():
|
||||
sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
|
||||
sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
|
||||
sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
|
||||
train, test = split_dataset([sample1, sample2, sample3, sample4], [0.5, 0.5])
|
||||
assert len(train) == 2
|
||||
assert len(test) == 2
|
||||
|
||||
|
||||
def test_split_dataset_four_sets():
|
||||
sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
|
||||
sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
|
||||
sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
|
||||
dataset = [sample1, sample2, sample3, sample4]
|
||||
train, test, val, dev = split_dataset(dataset, [0.25, 0.25, 0.25, 0.25])
|
||||
assert len(train) == 1
|
||||
assert len(test) == 1
|
||||
assert len(val) == 1
|
||||
assert len(dev) == 1
|
||||
|
||||
|
||||
# make sure all original template IDs are in the new sets
|
||||
|
||||
original_keys = set([1, 2, 3, 4])
|
||||
t1 = set([sample.metadata['Template#'] for sample in train])
|
||||
t2 = set([sample.metadata['Template#'] for sample in test])
|
||||
t3 = set([sample.metadata['Template#'] for sample in dev])
|
||||
t4 = set([sample.metadata['Template#'] for sample in val])
|
||||
|
||||
assert original_keys == t1 | t2 | t3 | t4
|
||||
|
||||
|
||||
def test_split_dataset_test_with_0_ratio():
|
||||
sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
|
||||
sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
|
||||
sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
|
||||
dataset = [sample1, sample2, sample3, sample4]
|
||||
with pytest.raises(ValueError):
|
||||
train, test, zero = split_dataset(dataset, [0.5, 0.5, 0])
|
||||
|
||||
|
||||
def test_split_dataset_test_with_smallish_ratio():
|
||||
sample1 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 1})
|
||||
sample2 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 2})
|
||||
sample3 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 3})
|
||||
sample4 = InputSample("Hi there", masked=None, spans=None, create_tags_from_span=False, metadata={"Template#": 4})
|
||||
dataset = [sample1, sample2, sample3, sample4]
|
||||
|
||||
train, test, zero = split_dataset(dataset, [0.5, 0.4999995, 0.0000005])
|
||||
assert len(train) == 2
|
||||
assert len(test) == 2
|
||||
assert len(zero) == 0
|