Implementation of Differentially Private n-gram Extraction (DPNE) paper

Differentially Private n-gram Extraction

This repository implements the Differentially Private n-gram Extraction (DPNE) paper (preprint version), which appeared at NeurIPS 2021 [1].

Directory structure

The code repository structure is as follows:

dpne: Python code for each step of extracting DP n-grams with PySpark.
- dpne_utils.py: generic utility functions used across different scripts
- extract_dpne.py: implements the main DPNE algorithm
- gaussian_process.py: calculates the Gaussian noise to be added for each n-gram
- k_anon_coverage.py: generates k-anonymized n-grams and calculates the coverage of DPNE n-grams against the k-anonymized n-grams
- split_ngrams.py: splits the tokenized n-grams into one subfolder per n-gram size (input preparation)
- tokenize_text.py: tokenizes text with the nltk tokenizer
scripts: scripts to run the code
- convert_msnbc.py: converts MSNBC data
- convert_reddit.py: converts Reddit data
DPNE Experiments.ipynb: Jupyter notebook for experimentation on a PySpark container; it also runs the steps from run.cmd (see below for instructions on how to run it)
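
To make the k-anonymity coverage idea concrete, here is a minimal, self-contained sketch. It is not the code in k_anon_coverage.py: it assumes that an n-gram is "k-anonymized" when at least k distinct authors use it, and that coverage is the fraction of those n-grams that also appear in the DP output. The function names are hypothetical, and whitespace splitting stands in for the nltk tokenizer.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def k_anonymous_ngrams(records, n, k):
    """Keep the n-grams used by at least k distinct authors (assumed definition)."""
    authors_per_ngram = defaultdict(set)
    for author, text in records:
        # Whitespace splitting stands in for the nltk tokenizer here.
        for gram in ngrams(text.lower().split(), n):
            authors_per_ngram[gram].add(author)
    return {g for g, authors in authors_per_ngram.items() if len(authors) >= k}


def coverage(dp_ngrams, k_anon_ngrams):
    """Fraction of k-anonymous n-grams that also appear in the DP output."""
    if not k_anon_ngrams:
        return 0.0
    return len(set(dp_ngrams) & k_anon_ngrams) / len(k_anon_ngrams)


# Toy (author, text) records, made up for illustration.
records = [
    ("alice", "the cat sat"),
    ("bob", "the cat ran"),
    ("carol", "a dog ran"),
]
k_anon = k_anonymous_ngrams(records, n=1, k=2)
# "the" and "cat" are used by alice and bob; "ran" by bob and carol.
print(sorted(k_anon))
# A hypothetical DP output that recovers two of the three k-anonymous 1-grams.
print(coverage({("the",), ("ran",)}, k_anon))
```

Counting distinct authors rather than raw occurrences matters here: one prolific author repeating a phrase should not make that phrase look common.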

Prerequisites

The code requires the following to be installed:

python >= 3.6
nltk
numpy
PySpark == 2.3
shrike

Preparing container to run experiments

Running this code in a container makes getting started easy and reliable. Follow these steps to get it running in a local container:

Make sure you have Docker installed and running.
Run docker pull jupyter/pyspark-notebook to pull the PySpark Jupyter image.
Run the container, mapping port 8888 locally so you can open the notebook on your machine, using docker run -p 8888:8888 --name jupyter-pyspark jupyter/pyspark-notebook. From the logs that appear, paste the URL into your browser to open the notebook; it looks something like http://127.0.0.1:8888/lab?token=<TOKEN>
Open a shell in the container by running docker exec -it jupyter-pyspark bash
Run git clone https://github.com/microsoft/differentially-private-ngram-extraction.git to pull this repo into the container.
Install the required libraries mentioned above:
```
pip install nltk
pip install pyspark
pip install shrike==1.31.18
```
Additionally, start a Python shell and run the following commands:
```
import nltk
nltk.download('punkt_tab')
```

At this point, you can either replicate the results from the paper or run DP n-gram extraction on your own dataset. See the instructions for each case below:

Replicate results from the paper

There are two data sources cited:

MSNBC: https://archive.ics.uci.edu/ml/datasets/msnbc.com+anonymous+web+data
Reddit: https://github.com/webis-de/webis-tldr-17-corpus, downloadable from https://zenodo.org/record/1043504/files/corpus-webis-tldr-17.zip

To prepare data from these sources in the right format, run the following scripts from the DPNE home directory:

python scripts/convert_msnbc.py --input_path [Input file path which has the downloaded file] --output_path [output directory, like /output]
python scripts/convert_reddit.py --input_path [Input file path which has the downloaded file] --output_path [output directory, like /output]

The attached notebook simplifies this. Follow these steps to run it:

With the container running, navigate to the notebook and run the code starting from the first cell, which downloads and prepares the data (for the Reddit case).
You may also change the default values of the variables DP_EPSILON and NGRAM_SIZE_LIMIT based on your needs. Running the cells eventually gives you the extracted DP n-grams in the DPNGRAMS dictionary: DPNGRAMS["1gram"] is a pandas dataframe with the extracted DP 1-grams, and so on.
Follow the steps in the subsequent cells, which split tokenization, n-gram splitting, and DP n-gram extraction into separate Spark sessions and cache the results locally.
Once these scripts have run successfully, the third cell reads the results into a dictionary of pandas dataframes, from which you can access the extracted DP n-grams.

Run on your own dataset

Copy your dataset into the differentially-private-ngram-extraction folder as a newline-delimited JSON file with the keys "author" and "content", holding the distinct author name/id and the content you want to extract DP n-grams from, respectively. From another terminal, you can use the command docker cp /path/to/file.json jupyter-pyspark:/home/jovyan/differentially-private-ngram-extraction/
Now navigate to the notebook and run the code, changing SOURCE_DATASET to the name of the JSON file you just copied. If you are using something other than JSON, change FILE_EXTENSION accordingly. You may also change the default values of the variables DP_EPSILON and NGRAM_SIZE_LIMIT based on your needs. Running the cells eventually gives you the extracted DP n-grams in the DPNGRAMS dictionary: DPNGRAMS["1gram"] is a pandas dataframe with the extracted DP 1-grams, and so on.
Follow the steps in the subsequent cells, which split tokenization, n-gram splitting, and DP n-gram extraction into separate Spark sessions and cache the results locally.
Once these scripts have run successfully, the third cell reads the results into a dictionary of pandas dataframes, from which you can access the extracted DP n-grams.
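
As an illustration of the expected input format, the sketch below writes and re-reads a newline-delimited JSON file with the "author" and "content" keys. The file name my_dataset.json, the sample records, and the helper functions are hypothetical; only the two key names come from the instructions above.

```python
import json

# Hypothetical records; each one pairs an author id with one piece of text.
records = [
    {"author": "user_1", "content": "The quick brown fox jumps over the lazy dog."},
    {"author": "user_2", "content": "A second author's post."},
]


def write_ndjson(rows, path):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Read the file back, checking that both required keys are present."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            row = json.loads(line)
            missing = {"author", "content"} - row.keys()
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {sorted(missing)}")
            rows.append(row)
    return rows


write_ndjson(records, "my_dataset.json")
print(len(read_ndjson("my_dataset.json")), "records written and validated")
```

Reading the file back before copying it into the container is a cheap way to catch a missing key or a malformed line early, before a Spark session is started.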

Run DPNE without the container

If you choose to run this in a shell or with local modifications, without using the container method described above, follow these steps:

If you made changes to any file in the dpne/ folder, re-archive the dpne directory to dpne.zip. PySpark needs this archive to use the Python scripts as a package.
Assuming you are on a Windows machine, use run.cmd. You will need to modify its first line so that DATA_HOME points to where your converted data is located. You can then run it from the DPNE home directory:

.\scripts\run.cmd

If you are in a Linux-based environment, see the corresponding shell scripts in the notebook.
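
The re-archiving step above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name rebuild_dpne_zip is hypothetical, and the choice to keep a top-level dpne/ folder inside the archive is an assumption that should be checked against the dpne.zip shipped with the repository.

```python
import shutil
from pathlib import Path


def rebuild_dpne_zip(repo_root="."):
    """Re-create dpne.zip from the dpne/ directory under repo_root."""
    repo_root = Path(repo_root)
    if not (repo_root / "dpne").is_dir():
        raise FileNotFoundError(f"no dpne/ directory under {repo_root}")
    # base_dir="dpne" keeps the top-level dpne/ folder inside the archive.
    # This layout is an assumption; compare it with the shipped dpne.zip.
    archive = shutil.make_archive(
        str(repo_root / "dpne"),  # archive path without the .zip suffix
        "zip",
        root_dir=str(repo_root),
        base_dir="dpne",
    )
    return Path(archive)
```

Calling rebuild_dpne_zip(".") from the DPNE home directory overwrites the existing dpne.zip with the current contents of dpne/.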

References

[1] Kunho Kim, Sivakanth Gopi, Janardhan Kulkarni, Sergey Yekhanin. Differentially Private n-gram Extraction. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.