
Presidio structured

Status

Alpha: This package is currently in alpha, meaning it is in its early stages of development. Features and functionality may change as the project evolves.

Description

The Presidio structured package is a flexible, customizable framework for identifying and protecting sensitive data in structured form. It extends Presidio to tabular data and to semi-structured formats such as JSON. It uses the detection capabilities of Presidio-Analyzer to identify columns or keys that contain personally identifiable information (PII), and builds a mapping between those column or key names and the detected PII entities. After detection, Presidio-Anonymizer applies de-identification operators to each value in the columns or keys identified as containing PII.
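The two-step flow described above (map each column to an entity type, then apply an operator to every value in the mapped columns) can be sketched in plain Python. This is a conceptual illustration only, not the package's implementation: the regex `DETECTORS`, the majority-vote rule, and the helper names `detect_entity`, `build_entity_mapping`, and `anonymize_table` are all hypothetical stand-ins for what Presidio-Analyzer and Presidio-Anonymizer do.

```python
import re
from typing import Callable, Dict, List, Optional

# Toy detectors standing in for Presidio-Analyzer recognizers (hypothetical).
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PHONE_NUMBER": re.compile(r"^\+?\d[\d\-\s]{6,}\d$"),
}


def detect_entity(values: List[str]) -> Optional[str]:
    """Return the entity type matched by a majority of the values, or None."""
    best, best_hits = None, 0
    for entity, pattern in DETECTORS.items():
        hits = sum(1 for value in values if pattern.match(value))
        if hits > best_hits:
            best, best_hits = entity, hits
    # Assumption: a column counts as PII only if more than half its values match.
    return best if best_hits * 2 > len(values) else None


def build_entity_mapping(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each column name to the entity type detected in its values."""
    mapping = {}
    for column, values in table.items():
        entity = detect_entity(values)
        if entity is not None:
            mapping[column] = entity
    return mapping


def anonymize_table(
    table: Dict[str, List[str]],
    mapping: Dict[str, str],
    operators: Dict[str, Callable[[str], str]],
) -> Dict[str, List[str]]:
    """Apply the operator for each mapped column; copy other columns unchanged."""
    result = {}
    for column, values in table.items():
        operator = operators.get(mapping.get(column))
        result[column] = [operator(v) for v in values] if operator else list(values)
    return result


table = {
    "email": ["john.doe@example.com", "jane.smith@example.com"],
    "city": ["Oslo", "Lima"],
}
mapping = build_entity_mapping(table)
operators = {"EMAIL_ADDRESS": lambda value: "<EMAIL_ADDRESS>"}
anonymized = anonymize_table(table, mapping, operators)
print(mapping)
print(anonymized)
```

Keeping the mapping as a separate object means it can be inspected or corrected by hand before any value is changed.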

Installation

As a Python package

To install the presidio-structured package, run the following command:

pip install presidio-structured

Getting started

Anonymizing DataFrames:

import pandas as pd
from presidio_structured import StructuredEngine, PandasAnalysisBuilder
from presidio_anonymizer.entities import OperatorConfig
from faker import Faker  # optional: Faker is used here only to generate replacement values

# Initialize the engine with a Pandas data processor (default)
pandas_engine = StructuredEngine()

# Create a sample DataFrame
sample_df = pd.DataFrame(
    {
        "name": ["John Doe", "Jane Smith"],
        "email": ["john.doe@example.com", "jane.smith@example.com"],
    }
)

# Generate a tabular analysis which describes PII entities in the DataFrame.
tabular_analysis = PandasAnalysisBuilder().generate_analysis(sample_df)

# Define anonymization operators
fake = Faker()
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()})
}

# Anonymize DataFrame
anonymized_df = pandas_engine.anonymize(sample_df, tabular_analysis, operators=operators)
print(anonymized_df)
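The description also mentions semi-structured JSON. The sketch below shows how a key-to-entity mapping could be applied to a nested dictionary. It is a standard-library illustration, not the package's API: the function `anonymize_json` is hypothetical, and addressing nested keys with dotted paths such as `user.name` is an assumption made for this example.

```python
from copy import deepcopy
from typing import Any, Callable, Dict


def anonymize_json(
    data: Dict[str, Any],
    entity_mapping: Dict[str, str],
    operators: Dict[str, Callable[[Any], Any]],
) -> Dict[str, Any]:
    """Return a copy of data with mapped keys passed through their operator."""
    result = deepcopy(data)  # leave the caller's data untouched
    for path, entity in entity_mapping.items():
        operator = operators.get(entity)
        if operator is None:
            continue  # no operator configured for this entity type
        # Walk the dotted path (assumed convention) down to the parent node.
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node[key]
        node[leaf] = operator(node[leaf])
    return result


sample_json = {
    "user": {"name": "John Doe", "email": "john.doe@example.com", "id": 7}
}
entity_mapping = {"user.name": "PERSON", "user.email": "EMAIL_ADDRESS"}
json_operators = {
    "PERSON": lambda value: "REDACTED",
    "EMAIL_ADDRESS": lambda value: "<EMAIL_ADDRESS>",
}
anonymized_json = anonymize_json(sample_json, entity_mapping, json_operators)
print(anonymized_json)
```

Keys that are absent from the mapping, such as `id` here, pass through unchanged.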

More information