mirror of https://github.com/mozilla/rappor.git
Initial import.
Parent: 236f930036
Commit: 761aa0bcd8
@ -0,0 +1,5 @@
*.pyc
*.swp
_tmp
client/python/build
client/python/_fastrand.so
@ -0,0 +1,202 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
136
README.md
@ -0,0 +1,136 @@
RAPPOR
======

RAPPOR is a novel privacy technology that allows inferring statistics of
populations while preserving the privacy of individual users.

This repository currently contains simulation and analysis code in Python and
R.

For a detailed description of the algorithm, see the
[paper](http://arxiv.org/abs/1407.6981) and links below.

<!-- TODO: We should have a more user friendly non-mathematical explanation?
-->

Running the Demo
----------------

Although the Python and R libraries should be portable to any platform, our
end-to-end demo has only been tested on Linux.

If you don't have a Linux box handy, you can [view the generated
output](report.html).

To get your feet wet, install the R dependencies (details below), which should
look something like this:

    $ R
    ...
    > install.packages(c('glmnet', 'optparse', 'ggplot2'))

Then run:

    $ ./demo.sh build  # optional speedup, it's OK for now if it fails
    $ ./demo.sh run

The `build` action compiles and tests the optional `fastrand` C extension
module for Python, which speeds up the simulation.

The `run` action strings together the Python and R code. It:

1. Generates simulated input data with different distributions
2. Runs it through the RAPPOR privacy algorithm
3. Analyzes and plots the obfuscated reports against the true input

The output is written to `_tmp/report.html`, and can be opened with a browser.
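
The three steps of the `run` action can be illustrated with a small, self-contained Python sketch. It is not the repository's `rappor.py`. It only mirrors the randomization in the R `Encode` function and the de-biasing formula in `EstimateBloomCounts` from this commit. The function names, the parameter values, and the simplification that every simulated client holds the same Bloom filter bits are assumptions made for illustration.

```python
import random


def permanent_rr(bloom_bits, f, rng):
    """Permanent randomized response: each bit is forced to 0 with
    probability f/2, forced to 1 with probability f/2, and kept otherwise."""
    noisy = []
    for bit in bloom_bits:
        u = rng.random()
        if u < 0.5 * f:
            noisy.append(0)
        elif u < f:
            noisy.append(1)
        else:
            noisy.append(bit)
    return noisy


def instantaneous_rr(prr_bits, p, q, rng):
    """Instantaneous randomized response: report 1 with probability q when
    the PRR bit is 1, and with probability p when it is 0."""
    return [1 if rng.random() < (q if bit == 1 else p) else 0
            for bit in prr_bits]


def estimate_true_counts(observed, n, p, q, f):
    """Invert the expected noise to estimate how many of the n clients had
    each bit set in their original Bloom filter."""
    p_eff = p + 0.5 * f * q - 0.5 * f * p  # P(report bit = 1 | true bit = 0)
    return [(c - p_eff * n) / ((1 - f) * (q - p)) for c in observed]


# Illustrative values only; none of these come from the repository.
P, Q, F = 0.25, 0.75, 0.5
TRUE_BITS = [1, 0, 0, 1, 0, 0, 0, 0]  # Bloom filter bits shared by all clients
NUM_CLIENTS = 20000

rng = random.Random(0)  # seeded so the simulation is repeatable
observed = [0] * len(TRUE_BITS)
for _ in range(NUM_CLIENTS):
    report = instantaneous_rr(permanent_rr(TRUE_BITS, F, rng), P, Q, rng)
    observed = [total + bit for total, bit in zip(observed, report)]

estimates = estimate_true_counts(observed, NUM_CLIENTS, P, Q, F)
print([round(e) for e in estimates])
```

No single report reveals the true bits, yet the aggregated estimates should land near `NUM_CLIENTS` for the bits that were set and near zero for the rest, up to sampling noise.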

<!-- TODO: Link to Github pages version of report.html. -->

Dependencies
------------

[R](http://r-project.org) analysis (`analysis/R`):

- [glmnet](http://cran.r-project.org/web/packages/glmnet/index.html)

Demo dependencies (`demo.sh`):

These are necessary if you want to test changes to the code.

- R libraries
  - [ggplot2](http://cran.r-project.org/web/packages/ggplot2/index.html)
  - [optparse](http://cran.r-project.org/web/packages/optparse/index.html)
- bash shell / coreutils: to run tests

Python client (`client/python`):

- None. You should be able to just import the `rappor.py` file.

Platform:

- R: tested on R 3.0.
- Python: tested on Python 2.7.
- OS: the shell script tests have been tested on Linux, but may work on
  Mac/Cygwin. The R and Python code should work on any OS.

API
---

`rappor.py` is a tiny standalone Python file, and you can easily copy it into a
Python program.

NOTE: Its interface is subject to change. We are in the demo stage now, but if
there's demand, we will document and publish the interface.

The R interface is also subject to change.

<!-- TODO: Add links to interface docs when available. -->

The `fastrand` C module is optional. It's likely only useful for simulation of
thousands of clients. It doesn't use cryptographically strong randomness, and
thus should **not** be used in production.
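
To make the distinction concrete, here is a minimal sketch of how a caller might choose a randomness source in Python. It is not part of `rappor.py` or `fastrand`, and the helper names `make_rng` and `random_bits` are hypothetical. It relies only on the standard library: `random.Random` is a seedable Mersenne Twister generator suited to repeatable simulations, while `random.SystemRandom` draws from the operating system's entropy source.

```python
import random


def make_rng(production):
    """Return an OS-backed generator for real reports, or a seeded
    Mersenne Twister generator for repeatable simulations."""
    if production:
        # random.SystemRandom draws from os.urandom and cannot be seeded.
        return random.SystemRandom()
    return random.Random(42)


def random_bits(num_bits, prob_one, rng):
    """Draw num_bits independent bits, each equal to 1 with prob_one."""
    return [1 if rng.random() < prob_one else 0 for _ in range(num_bits)]


sim_bits = random_bits(16, 0.5, make_rng(production=False))
prod_bits = random_bits(16, 0.5, make_rng(production=True))
print(sim_bits)
print(prod_bits)
```

The simulation bits are identical on every run because of the fixed seed, which is convenient for tests but is exactly the property that makes such a generator unsuitable for privacy-sensitive reports.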

Directory Structure
-------------------

    client/                 # client libraries
      python/
        rappor.py
        rappor_test.py      # unit tests next to files
      cpp/                  # placeholder
    analysis/
      R/
        # R code for analysis.
    tests/                  # for system tests. Unit tests should go next to the
                            # source file.
      gen_sim_input.py      # generate test input data
      rappor_sim.py         # run simulation
      run.sh                # driver for unit tests, lint, statistical tests,
                            # end to end demo with Python/R
    doc/
    build.sh                # build docs or C code
    demo.sh                 # run demo

<!--
TODO: add apps?

    apps/
      # Shiny apps for demo. Depends on the analysis code.
-->

Links
-----

<!-- TODO: link back to blog post -->

- [Tutorial](doc/tutorial.html) - More details about the tools here.
- [RAPPOR paper](http://arxiv.org/abs/1407.6981)
- [RAPPOR implementation in Chrome](http://www.chromium.org/developers/design-documents/rappor)
  - This is a production quality C++ implementation, but it's somewhat tied to
    Chrome, and doesn't support all privacy parameters (e.g. only a few values
    of p and q). On the other hand, the code in this repo is not yet
    production quality, but supports experimentation with different parameters
    and data sets. Of course, anyone is free to implement RAPPOR independently
    as well.
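
To see how the privacy parameters interact when experimenting, the sketch below ports the arithmetic of the R function `ComputePrivacyGuarantees` from this commit to Python. The Python function name, the dictionary keys, and the example parameter values are assumptions for illustration. The quantities `e_1` and `e_inf` keep the labels used in the R code, which appear to correspond to the privacy bounds for a single report and for an unlimited number of reports in the RAPPOR paper.

```python
import math


def privacy_guarantees(p, q, f, h):
    """Compute effective bit-flip probabilities and log privacy ratios
    from RAPPOR parameters p, q, f and the number of hash functions h."""
    # Probability that a reported bit is 1 given the true bit is 1 or 0,
    # after both the permanent and the instantaneous randomization.
    q_eff = 0.5 * f * (p + q) + (1 - f) * q
    p_eff = 0.5 * f * (p + q) + (1 - f) * p

    exp_e_one = ((q_eff * (1 - p_eff)) / (p_eff * (1 - q_eff))) ** h
    if exp_e_one < 1:
        exp_e_one = 1 / exp_e_one

    exp_e_inf = ((1 - 0.5 * f) / (0.5 * f)) ** (2 * h)

    return {
        "effective_p": p_eff,
        "effective_q": q_eff,
        "e_1": math.log(exp_e_one),
        "e_inf": math.log(exp_e_inf),
    }


# Example parameter values chosen for illustration only.
result = privacy_guarantees(p=0.25, q=0.75, f=0.5, h=2)
for name in sorted(result):
    print(name, round(result[name], 4))
```

Raising `f` or moving `p` and `q` closer together shrinks both values, which means stronger privacy at the cost of noisier estimates.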
@ -0,0 +1,85 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


GetFN <- function(name) {
  # Helper function to strip extension from the filename.
  strsplit(basename(name), ".", fixed = TRUE)[[1]][1]
}

ValidateInput <- function(params, counts, map) {
  val <- "valid"
  if (is.null(counts)) {
    val <- "No counts file found. Skipping"
    return(val)
  }

  if (nrow(map) != (params$m * params$k)) {
    val <- paste("Map does not match the counts file!",
                 "mk = ", params$m * params$k,
                 "nrow(map):", nrow(map),
                 collapse = " ")
  }

  if ((ncol(counts) - 1) != params$k) {
    val <- paste("Dimensions of counts file do not match:",
                 "m =", params$m, "counts rows: ", nrow(counts),
                 "k =", params$k, "counts cols: ", ncol(counts) - 1,
                 collapse = " ")
  }
  val
}

AnalyzeRAPPOR <- function(params, counts, map, correction, alpha, cv_step,
                          experiment_name = "", map_name = "", config_name = "",
                          date = NULL, date_num = NULL, ...) {
  val <- ValidateInput(params, counts, map)
  if (val != "valid") {
    cat(val, "\n")
    return(NULL)
  }

  cat("Sample Size: ", sum(counts[, 1]), "\n",
      "Number of cohorts: ", nrow(counts), "\n", sep = "")

  fit <- Decode(counts, map, params, correction = correction,
                alpha = alpha, cv_step = cv_step, ...)

  if (nrow(fit$fit) > 0) {
    res <- fit$fit

    res$rank <- 1:nrow(fit$fit)
    res$detected <- fit$summary[2, 2]
    res$sample_size <- fit$summary[3, 2]
    res$detected_prop <- fit$summary[4, 2]
    res$explained_var <- fit$summary[5, 2]
    res$missing_var <- fit$summary[6, 2]

    res$exp_e_1 <- fit$privacy[3, 2]
    res$exp_e_inf <- fit$privacy[5, 2]
    res$detection_freq <- fit$privacy[7, 2]
    res$correction <- correction
    res$alpha <- alpha

    res$experiment <- experiment_name
    res$map <- map_name
    res$config <- config_name
    res$date <- date
    res$date_num <- date_num
  } else {
    return(NULL)
  }

  res
}
@ -0,0 +1,290 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#
# This library implements RAPPOR, an anonymous collection mechanism.

library(glmnet)

EstimateBloomCounts <- function(params, obs_counts) {
  # Estimates the number of times each bit in each cohort was set in original
  # Bloom filters.
  #
  # Input:
  #    params: a list of RAPPOR parameters:
  #            k - size of a Bloom filter
  #            h - number of hash functions
  #            m - number of cohorts
  #            p - P(IRR = 1 | PRR = 0)
  #            q - P(IRR = 1 | PRR = 1)
  #            f - Proportion of bits in the Bloom filter that are set randomly
  #                to 0 or 1 regardless of the underlying true bit value
  #    obs_counts: a matrix of size m by (k + 1). Column one contains sample
  #                sizes for each cohort. The other columns indicate how many
  #                times each bit was set in each cohort.
  #
  # Output:
  #    ests: a matrix of size m by k with estimated counts for the number of
  #          times each bit was set in the true Bloom filter.

  p <- params$p
  q <- params$q
  f <- params$f

  # N = x[1] is the sample size for cohort i.
  ests <- t(apply(obs_counts, 1, function(x) {
    (x[-1] - (p + .5 * f * q - .5 * f * p) * x[1]) / ((1 - f) * (q - p))
  }))

  ests
}

FitLasso <- function(X, Y, intercept = TRUE, cv_step = 1, max_lambda = 100) {
  # Fits a Lasso model to select a subset of columns of X.
  #
  # Input:
  #    X: a design matrix of size km by M (the number of candidate strings).
  #    Y: a vector of size km with estimated counts from EstimateBloomCounts().
  #
  # Output:
  #    a list with:
  #      fit: a cross-validated Lasso object (NULL if fitting fails).
  #      coefs: coefficients at the optimal lambda (all zeros if fitting
  #             fails).
  #      intercept, resid: intercept and residuals of the selected model.

  zero_coefs <- rep(0, ncol(X))
  names(zero_coefs) <- colnames(X)

  lambdas <- seq(0, max_lambda, cv_step)
  mod <- try(cv.glmnet(X, Y, standardize = FALSE, intercept = intercept,
                       lambda = lambdas,
                       type.measure = "mae", nfolds = 10), silent = TRUE)

  # If fitting fails, return a NULL fit with all-zero coefficients.
  if (inherits(mod, "try-error")) {
    return(list(fit = NULL, coefs = zero_coefs))
  }

  # More refined lambda's based on the first coarse run.
  if ((as.numeric(ncol(X)) * as.numeric(nrow(X))) < 10^7) {
    min_lambda <- mod$lambda.min
    if (min_lambda == max(lambdas)) {
      lambdas <- seq(301, 500, cv_step)
    } else if (min_lambda == min(lambdas)) {
      lambdas <- seq(0, 1, .01)
    } else {
      lambdas <- c(seq(0, max(0, min_lambda - 2), cv_step),
                   seq(max(0, min_lambda - 2), max(min_lambda + 2, 0), .01),
                   seq(max(0, min_lambda + 2), 500, cv_step))
      lambdas <- sort(unique(lambdas[lambdas > 0]))
    }
    mod <- try(cv.glmnet(X, Y, standardize = FALSE, intercept = intercept,
                         lambda = lambdas,
                         type.measure = "mae", nfolds = 10), silent = TRUE)
    # If fitting fails, return a NULL fit with all-zero coefficients.
    if (inherits(mod, "try-error")) {
      return(list(fit = NULL, coefs = zero_coefs))
    }
  }

  # Select the best model based on cross-validation.
  coefs <- coef(mod, s = mod$lambda.min)
  resid <- Y - predict(mod, X, s = mod$lambda.min, type = "response")

  list(fit = mod, coefs = coefs[-1, ], intercept = coefs[1, 1], resid = resid)
}

CustomLM <- function(X, Y) {
  if (inherits(X, "ngCMatrix")) {
    X <- as.data.frame(apply(as.matrix(X), 2, as.numeric))
  }
  mod <- lm(Y ~ ., data = X)
  resid <- Y - predict(mod, X)
  list(fit = mod, coefs = coef(mod)[-1], intercept = coef(mod)[1],
       resid = resid)
}

PerformInference <- function(X, Y, N, mod, params, alpha, correction) {
  m <- params$m
  p <- params$p
  q <- params$q
  f <- params$f
  h <- params$h

  q2 <- .5 * f * (p + q) + (1 - f) * q
  p2 <- .5 * f * (p + q) + (1 - f) * p
  resid_var <- p2 * (1 - p2) * (N / m) / (q2 - p2)^2

  # Total Sum of Squares (SS).
  TSS <- sum((Y - mean(Y))^2)
  # Error Sum of Squares (ESS).
  ESS <- resid_var * nrow(X)

  betas <- matrix(mod$coefs, ncol = 1)
  mod_var <- summary(mod$fit)$sigma^2
  betas_sd <- rep(sqrt(max(resid_var, mod_var) / (m * h)), length(betas))
  z_values <- betas / betas_sd

  # One-sided z-test.
  p_values <- pnorm(z_values, lower.tail = FALSE)

  fit <- data.frame(String = colnames(X), Estimate = betas,
                    SD = betas_sd, z_stat = z_values, pvalue = p_values,
                    stringsAsFactors = FALSE)

  if (correction == "FDR") {
    fit <- fit[order(fit$pvalue, decreasing = FALSE), ]
    ind <- which(fit$pvalue < (1:nrow(fit)) * alpha / nrow(fit))
    if (length(ind) > 0) {
      fit <- fit[1:max(ind), ]
    } else {
      fit <- fit[numeric(0), ]
    }
  } else {
    fit <- fit[fit$pvalue < alpha, ]
  }

  fit <- fit[order(fit$Estimate, decreasing = TRUE), ]

  if (nrow(fit) > 0) {
    str_names <- fit$String
    if (length(str_names) > 0 && length(str_names) < nrow(X)) {
      this_data <- as.data.frame(as.matrix(X[, str_names]))
      Y_hat <- predict(lm(Y ~ ., data = this_data))
      RSS <- sum((Y_hat - mean(Y))^2)
    } else {
      RSS <- NA
    }
  } else {
    RSS <- 0
  }

  USS <- TSS - ESS - RSS
  SS <- c(RSS, USS, ESS) / TSS

  list(fit = fit, SS = SS, resid_sigma = sqrt(resid_var))
}

ComputePrivacyGuarantees <- function(params, alpha, N) {
  # Compute privacy parameters and guarantees.
  p <- params$p
  q <- params$q
  f <- params$f
  h <- params$h

  q2 <- .5 * f * (p + q) + (1 - f) * q
  p2 <- .5 * f * (p + q) + (1 - f) * p

  exp_e_one <- ((q2 * (1 - p2)) / (p2 * (1 - q2)))^h
  if (exp_e_one < 1) {
    exp_e_one <- 1 / exp_e_one
  }
  e_one <- log(exp_e_one)

  exp_e_inf <- ((1 - .5 * f) / (.5 * f))^(2 * h)
  e_inf <- log(exp_e_inf)

  std_dev_counts <- sqrt(p2 * (1 - p2) * N) / (q2 - p2)
  detection_freq <- qnorm(1 - alpha) * std_dev_counts / N

  privacy_names <- c("Effective p", "Effective q", "exp(e_1)",
                     "e_1", "exp(e_inf)", "e_inf", "Detection frequency")
  privacy_vals <- c(p2, q2, exp_e_one, e_one, exp_e_inf, e_inf, detection_freq)

  privacy <- data.frame(parameters = privacy_names,
                        values = privacy_vals)
  privacy
}

Decode <- function(counts, map, params, alpha = 0.05,
                   correction = c("Bonferroni"), ...) {
  k <- params$k
  p <- params$p
  q <- params$q
  f <- params$f
  h <- params$h
  m <- params$m

  strs <- colnames(map)
  ests <- EstimateBloomCounts(params, counts)
  N <- sum(counts[, 1])
  Y <- as.vector(t(ests))

  if (ncol(map) > (k * m * .8) ||
      (as.numeric(ncol(map)) * as.numeric(nrow(map))) > 10^6) {
    mod_lasso <- FitLasso(map, Y, ...)
    lasso <- mod_lasso$fit

    # Select non-zero coefficients.
    coefs <- sort(mod_lasso$coefs, decreasing = TRUE)
    non_zero <- sum(coefs > 0)
    if (non_zero > 0) {
      coefs <- names(coefs[1:min(non_zero, k * m * .9)])
    } else {
      coefs <- names(coefs[1:2])
    }
    ind <- match(coefs, names(mod_lasso$coefs))

    # Fit regular linear model to obtain unbiased estimates.
    X <- as.data.frame(apply(as.matrix(map[, coefs]), 2, as.numeric))
    mod <- CustomLM(X, Y)

    # Return complete vector of coefficients with 0's.
    coefs <- rep(0, length(mod_lasso$coefs))
    names(coefs) <- names(mod_lasso$coefs)
    coefs[ind] <- mod$coefs
    mod$coefs <- coefs
  } else {
    mod <- CustomLM(as.data.frame(as.matrix(map)), Y)
    lasso <- NULL
  }

  if (correction == "Bonferroni") {
    alpha <- alpha / length(strs)
  }

  inf <- PerformInference(map, Y, N, mod, params, alpha, correction)
  fit <- inf$fit
  resid <- mod$resid / inf$resid_sigma

  # Estimates from the model are per cohort, so they must be multiplied by m.
  # Standard errors are also adjusted.
  fit$Total_Est <- floor(fit$Estimate * m)
  fit$Total_SD <- floor(fit$SD * m)
  fit$Prop <- fit$Total_Est / N
  fit$LPB <- fit$Prop - 1.96 * fit$Total_SD / N
  fit$UPB <- fit$Prop + 1.96 * fit$Total_SD / N

  fit <- fit[, c("String", "Total_Est", "Total_SD", "Prop", "LPB", "UPB")]
  colnames(fit) <- c("strings", "estimate", "std_dev", "proportion",
                     "lower_bound", "upper_bound")

  # Compute summary of the fit.
  parameters <-
      c("Candidate strings", "Detected strings",
        "Sample size (N)", "Discovered Prop (out of N)",
        "Explained Variance", "Missing Variance", "Noise Variance",
        "Theoretical Noise Std. Dev.")
  values <- c(length(strs), nrow(fit), N, round(sum(fit[, 2]) / N, 3),
              round(inf$SS, 3),
              round(inf$resid_sigma, 3))
  res_summary <- data.frame(parameters = parameters, values = values)

  privacy <- ComputePrivacyGuarantees(params, alpha, N)
  params <- data.frame(parameters =
                           c("k", "h", "m", "p", "q", "f", "N", "alpha"),
                       values = c(k, h, m, p, q, f, N, alpha))

  list(fit = fit, summary = res_summary, privacy = privacy, params = params,
       lasso = lasso, ests = ests, counts = counts[, -1], resid = resid)
}
@ -0,0 +1,128 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Encode <- function(value, map, strs, params, N, id = NULL,
                   cohort = NULL, B = NULL, BP = NULL) {
  # Encode value to RAPPOR and return a report.
  #
  # Input:
  #    value: value to be encoded
  #    map: a mapping matrix describing where each element of strs maps in
  #         each cohort
  #    strs: a vector of possible values with value being one of them
  #    params: a list of RAPPOR parameters described in decode.R
  #    N: sample size
  # Optional parameters:
  #    id: user ID (smaller than N)
  #    cohort: specifies cohort number (smaller than m)
  #    B: input Bloom filter itself, in which case value is ignored
  #    BP: input Permanent Randomized Response (memoized for multiple
  #        collections from the same user)

  k <- params$k
  p <- params$p
  q <- params$q
  f <- params$f
  h <- params$h
  m <- params$m
  if (is.null(cohort)) {
    cohort <- sample(1:m, 1)
  }

  if (is.null(id)) {
    id <- sample(N, 1)
  }

  ind <- which(value == strs)

  if (is.null(B)) {
    B <- as.numeric(map[[cohort]][, ind])
  }

  if (is.null(BP)) {
    BP <- sapply(B, function(x) sample(c(0, 1, x), 1,
                                       prob = c(0.5 * f, 0.5 * f, 1 - f)))
  }
  rappor <- sapply(BP, function(x) rbinom(1, 1, ifelse(x == 1, q, p)))

  list(value = value, rappor = rappor, B = B, BP = BP, cohort = cohort, id = id)
}

ExamplePlot <- function(res, k, ebs = 1, title = "", title_cex = 4,
                        voff = .17, acex = 1.5, posa = 2, ymin = 1,
                        horiz = FALSE) {
  PC <- function(k, report) {
    char <- as.character(report)
    if (k > 128) {
      char[char != ""] <- "|"
    }
    char
  }

  # Annotation settings
  anc <- "darkorange2"
  colors <- c("lavenderblush3", "maroon4")

  par(omi = c(0, .55, 0, 0))
  # Setup plotting.
  plot(1:k, rep(1, k), ylim = c(ymin, 4), type = "n",
       xlab = "Bloom filter bits",
       yaxt = "n", ylab = "", xlim = c(0, k), bty = "n", xaxt = "n")
  mtext(paste0("Participant ", res$id, " in cohort ", res$cohort), 3, 2,
        adj = 1, col = anc, cex = acex)
  axis(1, 2^(0:15), 2^(0:15))
  abline(v = which(res$B == 1), lty = 2, col = "grey")

  # First row with the true value.
  text(k / 2, 4, paste0('"', paste0(title, as.character(res$value)), '"'),
       cex = title_cex, col = colors[2], xpd = NA)

  # Second row with BF: B.
  points(1:k, rep(3, k), pch = PC(k, res$B), col = colors[res$B + 1],
         cex = res$B + 1)
  text(k, 3 + voff, paste0(sum(res$B), " signal bits"), cex = acex,
       col = anc, pos = posa)

  # Third row: B'.
  points(1:k, rep(2, k), pch = PC(k, res$BP), col = colors[res$BP + 1],
         cex = res$BP + 1)
  text(k, 2 + voff, paste0(sum(res$BP), " bits on"),
       cex = acex, col = anc, pos = posa)

  # Row 4: actual RAPPOR report.
  report <- res$rappor
points(1:k, rep(1, k), pch = PC(k, as.character(report)),
|
||||
col = colors[report + 1], cex = report + 1)
|
||||
text(k, 1 + voff, paste0(sum(res$rappor), " bits on"), cex = acex,
|
||||
col = anc, pos = posa)
|
||||
|
||||
mtext(c("True value:", "Bloom filter (B):",
|
||||
"Fake Bloom \n filter (B'):", "Report sent\n to server:"),
|
||||
2, 1, at = 4:1, las = 2)
|
||||
legend("topright", legend = c("0", "1"), fill = colors, bty = "n",
|
||||
cex = 1.5, horiz = horiz)
|
||||
legend("topleft", legend = ebs, plot = FALSE)
|
||||
}
|
||||
|
||||
PlotPopulation <- function(probs, detected, detection_frequency) {
|
||||
cc <- c("gray80", "darkred")
|
||||
color <- rep(cc[1], length(probs))
|
||||
color[detected] <- cc[2]
|
||||
bp <- barplot(probs, col = color, border = color)
|
||||
inds <- c(1, c(max(which(probs > 0)), length(probs)))
|
||||
axis(1, bp[inds], inds)
|
||||
legend("topright", legend = c("Detected", "Not-detected"),
|
||||
fill = rev(cc), bty = "n")
|
||||
abline(h = detection_frequency, lty = 2, col = "grey")
|
||||
}
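
The per-bit logic of Encode() above can be sketched in Python 3 for readers following the Python client in this commit. This is an illustrative sketch and not part of the original commit; `encode_bits` and its `rng` argument are hypothetical names. It applies the Permanent Randomized Response (each bit becomes 0 or 1 with probability f/2 each, otherwise it is kept) and then the report step (output 1 with probability q if the PRR bit is 1, else with probability p).

```python
import random


def encode_bits(bloom, f, p, q, rng=None):
    """Return (prr, irr) bit lists for a list of 0/1 Bloom filter bits."""
    if rng is None:
        rng = random.Random()  # not cryptographically strong; simulation only
    prr = []
    for bit in bloom:
        u = rng.random()
        if u < 0.5 * f:
            prr.append(0)      # forced to 0 with probability f/2
        elif u < f:
            prr.append(1)      # forced to 1 with probability f/2
        else:
            prr.append(bit)    # true bit kept with probability 1 - f
    # Report 1 with probability q where the PRR bit is 1, else with probability p.
    irr = [1 if rng.random() < (q if b == 1 else p) else 0 for b in prr]
    return prr, irr
```

With f = 0, p = 0 and q = 1 no noise is added, so both returned lists equal the input Bloom filter, which is a convenient sanity check.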
|
|
@ -0,0 +1,122 @@
|
|||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
#
|
||||
# Read parameter, counts and map files.
|
||||
|
||||
gfile <- function(str) { str } # NOTE: gfile will be identity function in open source version
|
||||
library(Matrix)
|
||||
|
||||
ReadParameterFile <- function(params_file) {
|
||||
# Read parameter file. Format:
|
||||
# k, h, m, p, q, f
|
||||
# 128, 2, 8, 0.5, 0.75, 0.75
|
||||
|
||||
params <- as.list(read.csv(gfile(params_file)))
|
||||
if (length(params) != 6) {
|
||||
stop("There should be exactly 6 columns in the parameter file.")
|
||||
}
|
||||
if (any(names(params) != c("k", "h", "m", "p", "q", "f"))) {
|
||||
stop("Parameter names must be k,h,m,p,q,f.")
|
||||
}
|
||||
params
|
||||
}
|
||||
|
||||
ReadCountsFile <- function(counts_file, params = NULL) {
|
||||
# Read in the counts file.
|
||||
if (!file.exists(counts_file)) {
|
||||
return(NULL)
|
||||
}
|
||||
counts <- read.csv(gfile(counts_file), header = FALSE)
|
||||
|
||||
if (!is.null(params)) {
|
||||
if (nrow(counts) != params$m) {
|
||||
stop("Counts file: number of rows should equal number of cohorts (m).")
|
||||
}
|
||||
|
||||
if ((ncol(counts) - 1) != params$k) {
|
||||
      stop(paste0("Counts file: number of columns should equal k + 1: ",
|
||||
ncol(counts)))
|
||||
}
|
||||
}
|
||||
|
||||
if (any(counts < 0)) {
|
||||
    stop("Counts file: all counts must be non-negative.")
|
||||
}
|
||||
|
||||
counts
|
||||
}
|
||||
|
||||
ReadMapFile <- function(map_file, params = NULL, quote = "") {
|
||||
# Read in the map file which is in the following format (two hash functions):
|
||||
# str1, h11, h12, h21 + k, h22 + k, h31 + 2k, h32 + 2k ...
|
||||
# str2, ...
|
||||
# Output:
|
||||
# map: a sparse representation of set bits for each candidate string.
|
||||
# strs: a vector of all candidate strings.
|
||||
|
||||
map_pos <- read.csv(gfile(map_file), header = FALSE, as.is = TRUE,
|
||||
quote = quote)
|
||||
strs <- map_pos[, 1]
|
||||
strs[strs == ""] <- "Empty"
|
||||
|
||||
# Remove duplicated strings.
|
||||
ind <- which(!duplicated(strs))
|
||||
strs <- strs[ind]
|
||||
map_pos <- map_pos[ind, ]
|
||||
|
||||
if (!is.null(params)) {
|
||||
n <- ncol(map_pos) - 1
|
||||
if (n != (params$h * params$m)) {
|
||||
stop(paste0("Map file: number of columns should equal hm + 1:",
|
||||
n, "_", params$h * params$m))
|
||||
}
|
||||
}
|
||||
row_pos <- unlist(map_pos[, -1])
|
||||
col_pos <- rep(1:nrow(map_pos), times = ncol(map_pos) - 1)
|
||||
removed <- which(is.na(row_pos))
|
||||
if (length(removed) > 0) {
|
||||
row_pos <- row_pos[-removed]
|
||||
col_pos <- col_pos[-removed]
|
||||
}
|
||||
|
||||
if (!is.null(params)) {
|
||||
map <- sparseMatrix(row_pos, col_pos,
|
||||
dims = c(params$m * params$k, length(strs)))
|
||||
} else {
|
||||
map <- sparseMatrix(row_pos, col_pos)
|
||||
}
|
||||
colnames(map) <- strs
|
||||
list(map = map, strs = strs, map_pos = map_pos)
|
||||
}
|
||||
|
||||
LoadMapFile <- function(map_file, params = NULL, quote = "") {
|
||||
# Reads the map file and creates an R binary .rda.
|
||||
# If .rda file already exists, just loads that file.
|
||||
|
||||
rda_file <- sub(".csv", ".rda", map_file, fixed = TRUE)
|
||||
|
||||
# file.info() is not implemented yet by the gfile package. One must delete
|
||||
# the .rda file manually when the .csv file is updated.
|
||||
# csv_updated <- file.info(map_file)$mtime > file.info(rda_file)$mtime
|
||||
|
||||
if (!file.exists(rda_file)) {
|
||||
cat("Parsing", map_file, "...\n")
|
||||
map <- ReadMapFile(map_file, params = params, quote = quote)
|
||||
save(map, file = file.path(tempdir(), basename(rda_file)))
|
||||
file.copy(file.path(tempdir(), basename(rda_file)), rda_file,
|
||||
overwrite = TRUE)
|
||||
}
|
||||
load(gfile(rda_file), .GlobalEnv)
|
||||
}
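
The map file layout read by ReadMapFile() above (one candidate string per row, followed by h * m bit positions offset by k per cohort) can be illustrated with a small Python 3 sketch. It is not part of the original commit: `read_map` and `split_by_cohort` are hypothetical helpers, and stripping whitespace and treating "NA" as missing are assumptions. Note that Python's csv module handles quotes, whereas the R code disables quoting by default.

```python
import csv
import io


def read_map(text, k, h, m):
    """Parse map-file text into {string: set of 1-based row positions}."""
    mapping = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        name = row[0].strip() or "Empty"  # empty strings are renamed
        if name in mapping:
            continue  # keep only the first occurrence of a string
        cells = [c.strip() for c in row[1:]]
        if len(cells) != h * m:
            raise ValueError("expected h * m position columns, got %d" % len(cells))
        positions = {int(c) for c in cells if c not in ("", "NA")}
        if any(pos < 1 or pos > m * k for pos in positions):
            raise ValueError("position outside 1..m*k for %r" % name)
        mapping[name] = positions
    return mapping


def split_by_cohort(positions, k):
    """Convert global positions to sorted 1-based (cohort, bit) pairs."""
    return sorted(((pos - 1) // k + 1, (pos - 1) % k + 1) for pos in positions)
```

For k = 4, position 6 belongs to cohort 2, bit 2, because the second cohort's bits are stored at positions k + 1 through 2k.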
|
|
@ -0,0 +1,219 @@
|
|||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
#
|
||||
# RAPPOR simulation library.
|
||||
|
||||
library(glmnet)
|
||||
|
||||
SetOfStrings <- function(num_strings = 100) {
|
||||
# Generates a set of strings for simulation purposes.
|
||||
strs <- paste0("V_", as.character(1:num_strings))
|
||||
strs
|
||||
}
|
||||
|
||||
GetSampleProbs <- function(params) {
|
||||
  # Generate different underlying distributions for simulation purposes.
|
||||
# Args:
|
||||
# - params: a list describing the shape of the true distribution:
|
||||
# c(num_strings, prop_nonzero_strings, decay_type,
|
||||
  #             rate_exponential, background).
|
||||
nstrs <- params[[1]]
|
||||
nonzero <- params[[2]]
|
||||
decay <- params[[3]]
|
||||
expo <- params[[4]]
|
||||
background <- params[[5]]
|
||||
|
||||
probs <- rep(0, nstrs)
|
||||
ind <- floor(nstrs * nonzero)
|
||||
if (decay == "Linear") {
|
||||
probs[1:ind] <- (ind:1) / sum(1:ind)
|
||||
} else if (decay == "Constant") {
|
||||
probs[1:ind] <- 1 / ind
|
||||
} else if (decay == "Exponential") {
|
||||
temp <- seq(0, nonzero, length.out = ind)
|
||||
temp <- exp(-temp * expo)
|
||||
temp <- temp + background
|
||||
temp <- temp / sum(temp)
|
||||
probs[1:ind] <- temp
|
||||
} else {
|
||||
    stop('params[[3]] must be in c("Linear", "Exponential", "Constant")')
|
||||
}
|
||||
probs
|
||||
}
|
||||
|
||||
CreateMap <- function(strs, params, generate_pos = TRUE) {
|
||||
# Creates a list of 0/1 matrices corresponding to mapping between the strs and
|
||||
# Bloom filters for each instance of the RAPPOR.
|
||||
# Ex. for 3 strings, 2 instances, 1 hash function and Bloom filter of size 4,
|
||||
  # the result could look like this:
|
||||
# [[1]]
|
||||
# 1 0 0 0
|
||||
# 0 1 0 0
|
||||
# 0 0 0 1
|
||||
# [[2]]
|
||||
# 0 1 0 0
|
||||
# 0 0 0 1
|
||||
# 0 0 1 0
|
||||
#
|
||||
# Args:
|
||||
# - strs: a vector of strings
|
||||
# - params: a list of parameters in the following format:
|
||||
# (k, h, m, p, q, f).
|
||||
|
||||
M <- length(strs)
|
||||
map <- list()
|
||||
k <- params$k
|
||||
h <- params$h
|
||||
m <- params$m
|
||||
|
||||
for (i in 1:m) {
|
||||
ones <- sample(1:k, M * h, replace = TRUE)
|
||||
cols <- rep(1:M, each = h)
|
||||
map[[i]] <- sparseMatrix(ones, cols, dims = c(k, M))
|
||||
colnames(map[[i]]) <- strs
|
||||
}
|
||||
|
||||
rmap <- do.call("rBind", map)
|
||||
if (generate_pos) {
|
||||
map_pos <- t(apply(rmap, 2, function(x) {
|
||||
ind <- which(x == 1)
|
||||
n <- length(ind)
|
||||
if (n < h * m) {
|
||||
ind <- c(ind, rep(NA, h * m - n))
|
||||
}
|
||||
ind
|
||||
}))
|
||||
} else {
|
||||
map_pos <- NULL
|
||||
}
|
||||
|
||||
list(map = map, rmap = rmap, map_pos = map_pos)
|
||||
}
|
||||
|
||||
GetSample <- function(N, strs, probs) {
|
||||
  # Sample from the strs population with distribution probs.
|
||||
sample(strs, N, replace = TRUE, prob = probs)
|
||||
}
|
||||
|
||||
GetTrueBits <- function(samp, map, params) {
|
||||
# Convert sample generated by GetSample() to Bloom filters where mapping
|
||||
# is defined in map.
|
||||
# Output:
|
||||
# - reports: a matrix of size [num_instances x size] where each row
|
||||
# represents the number of times each bit in the Bloom filter
|
||||
# was set for a particular instance.
|
||||
  # Note: reports[, 1] contains the sample size for each instance.
|
||||
|
||||
N <- length(samp)
|
||||
k <- params$k
|
||||
m <- params$m
|
||||
strs <- colnames(map[[1]])
|
||||
reports <- matrix(0, m, k + 1)
|
||||
inst <- sample(1:m, N, replace = TRUE)
|
||||
for (i in 1:m) {
|
||||
tab <- table(samp[inst == i])
|
||||
tab2 <- rep(0, length(strs))
|
||||
tab2[match(names(tab), strs)] <- tab
|
||||
counts <- apply(map[[i]], 1, function(x) x * tab2)
|
||||
# cat(length(tab2), dim(map[[i]]), dim(counts), "\n")
|
||||
reports[i, ] <- c(sum(tab2), apply(counts, 2, sum))
|
||||
}
|
||||
reports
|
||||
}
|
||||
|
||||
GetNoisyBits <- function(truth, params) {
|
||||
# Applies RAPPOR to the Bloom filters.
|
||||
# Args:
|
||||
# - truth: a matrix generated by GetTrueBits().
|
||||
|
||||
k <- params$k
|
||||
p <- params$p
|
||||
q <- params$q
|
||||
f <- params$f
|
||||
|
||||
rappors <- apply(truth, 1, function(x) {
|
||||
    # The sampling below considers 4 cases:
|
||||
# 1. Signal and we lie on the bit.
|
||||
# 2. Signal and we tell the truth.
|
||||
# 3. Noise and we lie.
|
||||
# 4. Noise and we tell the truth.
|
||||
|
||||
# Lies when signal sampled from the binomial distribution.
|
||||
lied_signal <- rbinom(k, x[-1], f)
|
||||
|
||||
# Remaining must be the non-lying bits when signal. Sampled with q.
|
||||
truth_signal <- x[-1] - lied_signal
|
||||
|
||||
# Lies when there is no signal which happens x[1] - x[-1] times.
|
||||
lied_nosignal <- rbinom(k, x[1] - x[-1], f)
|
||||
|
||||
    # Truth when there's no signal. These are sampled with p.
|
||||
truth_nosignal <- x[1] - x[-1] - lied_nosignal
|
||||
|
||||
# Total lies and sampling lies with 50/50 for either p or q.
|
||||
lied <- lied_signal + lied_nosignal
|
||||
lied_p <- rbinom(k, lied, .5)
|
||||
lied_q <- lied - lied_p
|
||||
|
||||
# Generating the report where sampling of either p or q occurs.
|
||||
rbinom(k, lied_q + truth_signal, q) + rbinom(k, lied_p + truth_nosignal, p)
|
||||
})
|
||||
|
||||
cbind(truth[, 1], t(rappors))
|
||||
}
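
The four cases sampled in GetNoisyBits() above imply a fixed per-bit reporting probability, which gives the expected column sums of the noisy counts. The Python 3 sketch below works this out; it is an illustration rather than code from the commit, and `effective_probs` and `expected_count` are hypothetical names. A bit is reported as 1 with probability q* = f(p + q)/2 + (1 - f)q when the true bit is set, and p* = f(p + q)/2 + (1 - f)p when it is not.

```python
def effective_probs(f, p, q):
    """Return (p_star, q_star): Pr[report = 1] for an unset / set true bit."""
    lied = 0.5 * f * (p + q)   # lied bits are sampled with p or q, 50/50
    q_star = lied + (1 - f) * q  # truthful signal bits are sampled with q
    p_star = lied + (1 - f) * p  # truthful non-signal bits are sampled with p
    return p_star, q_star


def expected_count(n, c, f, p, q):
    """Expected number of 1s for a bit set in c out of n true Bloom filters."""
    p_star, q_star = effective_probs(f, p, q)
    return c * q_star + (n - c) * p_star
```

For f = 0.5, p = 0.5 and q = 0.75 this gives p* = 0.5625 and q* = 0.6875, so a bit set in 20 of 100 reports is expected to be observed 58.75 times.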
|
||||
|
||||
GenerateSamples <- function(N = 10^5, params, pop_params, alpha = .05,
|
||||
prop_missing = 0,
|
||||
correction = "Bonferroni") {
|
||||
# Simulate N reports with pop_params describing the population and
|
||||
# params describing the RAPPOR configuration.
|
||||
  num_strings <- pop_params[[1]]
|
||||
|
||||
strs <- SetOfStrings(num_strings)
|
||||
probs <- GetSampleProbs(pop_params)
|
||||
samp <- GetSample(N, strs, probs)
|
||||
map <- CreateMap(strs, params)
|
||||
truth <- GetTrueBits(samp, map$map, params)
|
||||
rappors <- GetNoisyBits(truth, params)
|
||||
|
||||
strs_apprx <- strs
|
||||
map_apprx <- map$rmap
|
||||
# Remove % of strings to simulate missing variables.
|
||||
if (prop_missing > 0) {
|
||||
ind <- which(probs > 0)
|
||||
removed <- sample(ind, ceiling(prop_missing * length(ind)))
|
||||
map_apprx <- map$rmap[, -removed]
|
||||
strs_apprx <- strs[-removed]
|
||||
}
|
||||
|
||||
# Randomize the columns.
|
||||
ind <- sample(1:length(strs_apprx), length(strs_apprx))
|
||||
map_apprx <- map_apprx[, ind]
|
||||
strs_apprx <- strs_apprx[ind]
|
||||
|
||||
fit <- Decode(rappors, map_apprx, params, alpha = alpha,
|
||||
correction = correction)
|
||||
|
||||
# Add truth column.
|
||||
fit$fit$Truth <- table(samp)[fit$fit$strings]
|
||||
fit$fit$Truth[is.na(fit$fit$Truth)] <- 0
|
||||
|
||||
fit$map <- map$map
|
||||
fit$truth <- truth
|
||||
fit$strs <- strs
|
||||
fit$probs <- probs
|
||||
|
||||
fit
|
||||
}
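
The three decay shapes produced by GetSampleProbs() in this file can be reproduced with a short Python 3 sketch, which may help when checking simulation inputs outside R. It is an approximation of the R logic under stated assumptions: `sample_probs` is a hypothetical name, the arguments are passed individually rather than as a positional list, and at least one string is required to have nonzero probability.

```python
import math


def sample_probs(nstrs, nonzero, decay, expo=1.0, background=0.0):
    """Return a list of nstrs probabilities; only the first ones are nonzero."""
    ind = int(math.floor(nstrs * nonzero))
    if ind < 1:
        raise ValueError("need at least one string with nonzero probability")
    if decay == "Linear":
        total = ind * (ind + 1) / 2.0
        head = [(ind - i) / total for i in range(ind)]
    elif decay == "Constant":
        head = [1.0 / ind] * ind
    elif decay == "Exponential":
        # Mirrors seq(0, nonzero, length.out = ind) in R.
        xs = [0.0] if ind == 1 else [nonzero * i / (ind - 1) for i in range(ind)]
        weights = [math.exp(-x * expo) + background for x in xs]
        total = sum(weights)
        head = [w / total for w in weights]
    else:
        raise ValueError('decay must be "Linear", "Exponential" or "Constant"')
    return head + [0.0] * (nstrs - ind)
```

In every case the nonzero part sums to 1, so the result can be passed directly as sampling weights.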
|
|
@ -0,0 +1,85 @@
|
|||
#!/bin/bash
|
||||
#
|
||||
# Build automation.
|
||||
#
|
||||
# Usage:
|
||||
# ./build.sh <function name>
|
||||
#
|
||||
# Important targets are:
|
||||
# doc: build docs with Markdown
|
||||
# fastrand: build Python extension module to speed up the client simulation
|
||||
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
set -o errexit
|
||||
|
||||
log() {
|
||||
echo 1>&2 "$@"
|
||||
}
|
||||
|
||||
die() {
|
||||
  log "FATAL: $*"
|
||||
exit 1
|
||||
}
|
||||
|
||||
run-markdown() {
|
||||
which markdown >/dev/null || die "Markdown not installed"
|
||||
|
||||
# Markdown is output unstyled; make it a little more readable.
|
||||
cat <<EOF
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<style>
|
||||
code { color: green }
|
||||
</style>
|
||||
</head>
|
||||
<body style="margin: 0 auto; width: 40em; text-align: left;">
|
||||
<p>
|
||||
EOF
|
||||
|
||||
markdown "$@"
|
||||
|
||||
cat <<EOF
|
||||
</p>
|
||||
</body>
|
||||
</html>
|
||||
EOF
|
||||
}
|
||||
|
||||
# Scan for TODOs. Does this belong somewhere else?
|
||||
todo() {
|
||||
find . -name \*.py -o -name \*.R -o -name \*.sh -o -name \*.md \
|
||||
| xargs --verbose -- grep -w TODO
|
||||
}
|
||||
|
||||
#
|
||||
# Targets: build "doc" or "fastrand"
|
||||
#
|
||||
|
||||
# Build dependencies: markdown tool.
|
||||
doc() {
|
||||
mkdir -p _tmp _tmp/doc
|
||||
|
||||
# For now, just one file.
|
||||
# TODO: generated docs
|
||||
run-markdown <README.md >_tmp/README.html
|
||||
run-markdown <doc/tutorial.md >_tmp/doc/tutorial.html
|
||||
|
||||
log 'Wrote docs to _tmp'
|
||||
}
|
||||
|
||||
# Build dependencies: Python development headers. Most systems should have
|
||||
# this. On Ubuntu/Debian, the 'python-dev' package contains headers.
|
||||
fastrand() {
|
||||
pushd client/python >/dev/null
|
||||
python setup.py build
|
||||
# So we can 'import _fastrand' without installing
|
||||
ln -s --force build/*/_fastrand.so .
|
||||
./fastrand_test.py
|
||||
|
||||
log 'fastrand built and tests PASSED'
|
||||
popd >/dev/null
|
||||
}
|
||||
|
||||
"$@"
|
|
@ -0,0 +1,2 @@
|
|||
Placeholder for the C++ client.
|
||||
|
|
@ -0,0 +1,86 @@
|
|||
/*
|
||||
Copyright 2014 Google Inc. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
*/
|
||||
|
||||
/*
|
||||
* _fastrand.c -- Python extension module to generate random bit vectors
|
||||
* quickly.
|
||||
*
|
||||
 * IMPORTANT: This module does not use cryptographically strong randomness. It
 * should ONLY be used to speed up the simulation. Don't use it in
 * production.
|
||||
*
|
||||
* If an adversary can predict which random bits are flipped, then RAPPOR's
|
||||
* privacy is compromised.
|
||||
*
|
||||
*/
|
||||
|
||||
#include <stdint.h> // uint64_t
|
||||
#include <stdio.h> // printf
|
||||
#include <stdlib.h> // srand
|
||||
#include <time.h> // time
|
||||
|
||||
#include <Python.h>
|
||||
|
||||
uint64_t randbits(float p1, int num_bits) {
|
||||
uint64_t result = 0;
|
||||
int i;
|
||||
for (i = 0; i < num_bits; ++i) {
|
||||
float r = (float)rand() / RAND_MAX;
|
||||
uint64_t bit = (r < p1);
|
||||
result |= (bit << i);
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
static PyObject *
|
||||
func_randbits(PyObject *self, PyObject *args) {
|
||||
float p1;
|
||||
int num_bits;
|
||||
|
||||
if (!PyArg_ParseTuple(args, "fi", &p1, &num_bits)) {
|
||||
return NULL;
|
||||
}
|
||||
if (p1 < 0.0 || p1 > 1.0) {
|
||||
printf("p1 must be between 0.0 and 1.0\n");
|
||||
// return None for now; easier than raising ValueError
|
||||
Py_INCREF(Py_None);
|
||||
return Py_None;
|
||||
}
|
||||
if (num_bits < 0 || num_bits > 64) {
|
||||
    printf("num_bits must be between 0 and 64\n");
|
||||
// return None for now; easier than raising ValueError
|
||||
Py_INCREF(Py_None);
|
||||
return Py_None;
|
||||
}
|
||||
|
||||
//printf("p: %f\n", p);
|
||||
uint64_t r = randbits(p1, num_bits);
|
||||
return PyLong_FromUnsignedLongLong(r);
|
||||
}
|
||||
|
||||
PyMethodDef methods[] = {
|
||||
{"randbits", func_randbits, METH_VARARGS,
|
||||
"Get a 64 bit number where each bit is 1 with probability p."},
|
||||
{NULL, NULL},
|
||||
};
|
||||
|
||||
void init_fastrand() {
|
||||
Py_InitModule("_fastrand", methods);
|
||||
|
||||
// Just seed it here; we don't give the application any control.
|
||||
int seed = time(NULL);
|
||||
srand(seed);
|
||||
}
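
For comparison, the behavior of randbits() in this C module can be sketched in pure Python 3. This is an illustration only: it raises ValueError on bad arguments (the C code above prints a message and returns None instead), the `rng` parameter is a hypothetical addition for reproducible tests, and Python's `random` module is, like rand(), not cryptographically strong.

```python
import random


def randbits(p1, num_bits, rng=None):
    """Return an integer whose low num_bits bits are each 1 with probability p1."""
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must be between 0.0 and 1.0")
    if not 0 <= num_bits <= 64:
        raise ValueError("num_bits must be between 0 and 64")
    if rng is None:
        rng = random.Random()
    result = 0
    for i in range(num_bits):
        if rng.random() < p1:  # random() is in [0, 1), so p1 == 1.0 always sets the bit
            result |= 1 << i
    return result
```

The edge cases are easy to check by hand: p1 = 0.0 always yields 0, and p1 = 1.0 yields a value with all num_bits bits set.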
|
|
@ -0,0 +1,34 @@
|
|||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""fastrand.py - Python wrapper for _fastrand."""
|
||||
|
||||
import random
|
||||
|
||||
import _fastrand
|
||||
|
||||
|
||||
class FastRandFuncs(object):
|
||||
|
||||
def __init__(self, params):
|
||||
# NOTE: no rand attribute, so no seeding or getstate/setstate.
|
||||
# Also duplicating some of rappor._RandFuncs.
|
||||
self.cohort_rand_fn = random.randint
|
||||
|
||||
randbits = _fastrand.randbits
|
||||
num_bits = params.num_bloombits
|
||||
self.f_gen = lambda: randbits(params.prob_f, num_bits)
|
||||
self.p_gen = lambda: randbits(params.prob_p, num_bits)
|
||||
self.q_gen = lambda: randbits(params.prob_q, num_bits)
|
||||
self.uniform_gen = lambda: randbits(0.5, num_bits)
|
|
@ -0,0 +1,53 @@
|
|||
#!/usr/bin/python -S
|
||||
#
|
||||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""
|
||||
fastrand_test.py: Tests for _fastrand extension module.
|
||||
"""
|
||||
|
||||
import unittest
|
||||
|
||||
import _fastrand # module under test
|
||||
|
||||
|
||||
class FastRandTest(unittest.TestCase):
|
||||
|
||||
def testRandbits64(self):
|
||||
for n in [8, 16, 32, 64]:
|
||||
#print '== %d' % n
|
||||
for p1 in [0.1, 0.5, 0.9]:
|
||||
#print '-- %f' % p1
|
||||
for i in xrange(5):
|
||||
r = _fastrand.randbits(p1, n)
|
||||
# Rough sanity check
|
||||
self.assertLess(r, 2 ** n)
|
||||
|
||||
# Visual check
|
||||
#b = bin(r)
|
||||
#print b
|
||||
#print b.count('1')
|
||||
|
||||
def testRandbitsError(self):
|
||||
r = _fastrand.randbits(-1, 64)
|
||||
# TODO: Should probably raise exceptions
|
||||
self.assertEqual(None, r)
|
||||
|
||||
r = _fastrand.randbits(0.0, 65)
|
||||
self.assertEqual(None, r)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
|
@ -0,0 +1,281 @@
|
|||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""RAPPOR client library.
|
||||
|
||||
Privacy is ensured without a third party by only sending RAPPOR'd data over the
|
||||
network (as opposed to raw client data).
|
||||
|
||||
Note that we use SHA1 for the Bloom filter hash function.
|
||||
"""
|
||||
import hashlib
|
||||
import random
|
||||
|
||||
|
||||
class Params(object):
|
||||
"""RAPPOR encoding parameters.
|
||||
|
||||
These affect privacy/anonymity. See the paper for details.
|
||||
"""
|
||||
def __init__(self):
|
||||
self.num_bloombits = 16 # Number of bloom filter bits (k)
|
||||
self.num_hashes = 2 # Number of bloom filter hashes (h)
|
||||
self.num_cohorts = 64 # Number of cohorts (m)
|
||||
self.prob_p = 0.50 # Probability p
|
||||
self.prob_q = 0.75 # Probability q
|
||||
self.prob_f = 0.50 # Probability f
|
||||
|
||||
self.flag_oneprr = False # One PRR for each user/word pair
|
||||
|
||||
# For testing
|
||||
def __eq__(self, other):
|
||||
return self.__dict__ == other.__dict__
|
||||
|
||||
def __repr__(self):
|
||||
return repr(self.__dict__)
|
||||
|
||||
|
||||
class SimpleRandom(object):
  """Returns an integer of num_bits bits, each 1 with probability prob_one."""
|
||||
|
||||
def __init__(self, prob_one, num_bits, rand=None):
|
||||
self.prob_one = prob_one
|
||||
self.num_bits = num_bits
|
||||
self.rand = rand or random.Random()
|
||||
|
||||
def __call__(self):
|
||||
p = self.prob_one
|
||||
rand_fn = self.rand.random # cache it for speed
|
||||
|
||||
r = 0
|
||||
for i in xrange(self.num_bits):
|
||||
bit = rand_fn() < p
|
||||
r |= (bit << i) # using bool as int
|
||||
return r
|
||||
|
||||
|
||||
# NOTE: This doesn't seem faster.
|
||||
|
||||
class ApproxRandom(object):
|
||||
"""Like SimpleRandom, but tries to make fewer random calls.
|
||||
|
||||
Represent prob_one in base 2 repr (up to 6 bits = 2^-6 accuracy)
|
||||
If X is a random bit with Pr[b=1] = p
|
||||
X & uniform is a random bit with Pr[b=1] = p/2
|
||||
X | uniform is a random bit with Pr[b=1] = p/2+1/2
|
||||
Read prob_one from LSB and do & or | operations depending on
|
||||
whether the bit is set or not a la repeated-squaring.
|
||||
#
|
||||
Eg. 0.3 = (0.010011...)_2 ~
|
||||
unif & (unif | (unif & (unif & (unif | unif))))
|
||||
0 1 0 0 1 1
|
||||
|
||||
Takes as input Pr[b=1], length of random bits, and a randomness
|
||||
function that outputs 32 bits. When not debugging, set rand_fn
|
||||
to random.getrandbits(32)
|
||||
"""
|
||||
|
||||
def __init__(self, prob_one, num_bits, rand=None):
|
||||
"""
|
||||
Args:
|
||||
rand: object satisfying Python random.Random() interface.
|
||||
"""
|
||||
if not isinstance(prob_one, float):
|
||||
raise RuntimeError('Probability must be a float')
|
||||
|
||||
if not (0 <= prob_one <= 1):
|
||||
raise RuntimeError('Probability not in [0,1]: %s' % prob_one)
|
||||
|
||||
self.num_bits = num_bits
|
||||
self.rand = rand or random.Random()
|
||||
|
||||
# This calculation depends on prob_one, but not the actual randomness.
|
||||
self.bits_in_prob_one = [0] * 6 # Store prob_one in bits
|
||||
for i in xrange(0, 6): # Loop at most six times
|
||||
if prob_one < 0.5:
|
||||
self.bits_in_prob_one[i] = 0
|
||||
prob_one *= 2
|
||||
else:
|
||||
self.bits_in_prob_one[i] = 1
|
||||
prob_one = prob_one * 2 - 1
|
||||
|
||||
if prob_one <= 0.01: # Finish loop early if less than 1% already
|
||||
break
|
||||
|
||||
def __call__(self):
|
||||
num_bits = self.num_bits
|
||||
rand_fn = lambda: self.rand.getrandbits(self.num_bits)
|
||||
|
||||
# We could special case these to be exact, but we're not using them for f,
|
||||
# p, q. Better to use the non-approximate method.
|
||||
|
||||
#if self.prob_one == 0:
|
||||
# return [0] * self.num_bits
|
||||
#if self.prob_one == 1:
|
||||
# return [0xffffffff] * self.num_bits
|
||||
|
||||
rand_bits = 0
|
||||
and_or = self.bits_in_prob_one
|
||||
|
||||
for i in xrange(5, -1, -1): # Count down from 5 to 0
|
||||
if and_or[i] == 0: # Corresponds to X & uniform
|
||||
rand_bits &= rand_fn()
|
||||
else:
|
||||
rand_bits |= rand_fn()
|
||||
|
||||
return rand_bits
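
The AND/OR construction used by ApproxRandom above can be checked numerically. The Python 3 sketch below is illustrative and not part of the commit; `prob_to_bits` and `approximated_prob` are hypothetical names. It extracts the same six binary digits as the constructor and then computes the probability that the AND/OR chain actually realizes: starting from 0, an AND with a uniform bit halves the probability and an OR maps p to p/2 + 1/2.

```python
def prob_to_bits(prob_one, max_bits=6):
    """Binary digits of prob_one after the point, most significant first."""
    bits = [0] * max_bits
    for i in range(max_bits):
        if prob_one < 0.5:
            bits[i] = 0
            prob_one *= 2
        else:
            bits[i] = 1
            prob_one = prob_one * 2 - 1
        if prob_one <= 0.01:  # remaining digits contribute less than 1%
            break
    return bits


def approximated_prob(bits):
    """Pr[bit = 1] realized by applying & / | from the last digit to the first."""
    p = 0.0
    for bit in reversed(bits):
        p = p / 2 + (0.5 if bit else 0.0)
    return p
```

For prob_one = 0.3 the digits are 0, 1, 0, 0, 1, 1 and the realized probability is 19/64 = 0.296875, which is within the stated 2^-6 accuracy.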
|
||||
|
||||
|
||||
class _RandFuncs(object):
|
||||
"""Base class for randomness."""
|
||||
|
||||
def __init__(self, params, rand):
|
||||
"""
|
||||
Args:
|
||||
params: RAPPOR parameters
|
||||
rand: object satisfying random.Random() interface.
|
||||
"""
|
||||
self.rand = rand
|
||||
self.num_bits = params.num_bloombits
|
||||
self.cohort_rand_fn = rand.randint
|
||||
|
||||
|
||||
class SimpleRandFuncs(_RandFuncs):
|
||||
|
||||
def __init__(self, params, rand):
|
||||
_RandFuncs.__init__(self, params, rand)
|
||||
|
||||
self.f_gen = SimpleRandom(params.prob_f, self.num_bits, rand)
|
||||
self.p_gen = SimpleRandom(params.prob_p, self.num_bits, rand)
|
||||
self.q_gen = SimpleRandom(params.prob_q, self.num_bits, rand)
|
||||
self.uniform_gen = SimpleRandom(0.5, self.num_bits, rand)
|
||||
|
||||
|
||||
class ApproxRandFuncs(_RandFuncs):
|
||||
|
||||
def __init__(self, params, rand):
|
||||
_RandFuncs.__init__(self, params, rand)
|
||||
|
||||
self.f_gen = ApproxRandom(params.prob_f, self.num_bits, rand)
|
||||
self.p_gen = ApproxRandom(params.prob_p, self.num_bits, rand)
|
||||
self.q_gen = ApproxRandom(params.prob_q, self.num_bits, rand)
|
||||
# uniform generator (NOTE: could special case this)
|
||||
self.uniform_gen = ApproxRandom(0.5, self.num_bits, rand)
|
||||
|
||||
|
||||
# Compute masks for RAPPOR's Permanent Randomized Response (PRR).
# The i-th Bloom filter bit B_i is replaced by B'_i, which equals
#   1   w/ prob f/2 -- (*) -- f_bits
#   0   w/ prob f/2
#   B_i w/ prob 1-f -- (&) -- mask_indices set to 0 here, i.e., no mask
# Output bit indices corresponding to (&) and bits 0/1 corresponding to (*)
|
||||
def get_rappor_masks(user_id, word, params, rand_funcs):
|
||||
"""
|
||||
Call 3 random functions. Seed deterministically beforehand if oneprr.
|
||||
TODO:
|
||||
- Rewrite this to be clearer. We can use a completely different Random()
|
||||
instance in the case of oneprr.
|
||||
- Expose it in the simulation. It doesn't appear to be exercised now.
|
||||
"""
|
||||
if params.flag_oneprr:
|
||||
stored_state = rand_funcs.rand.getstate() # Store state
|
||||
rand_funcs.rand.seed(user_id + word) # Consistently seeded
|
||||
|
||||
assigned_cohort = rand_funcs.cohort_rand_fn(0, params.num_cohorts - 1)
|
||||
# Uniform bits for (*)
|
||||
f_bits = rand_funcs.uniform_gen()
|
||||
# Mask indices are 1 with probability f.
|
||||
mask_indices = rand_funcs.f_gen()
|
||||
|
||||
if params.flag_oneprr: # Restore state
|
||||
rand_funcs.rand.setstate(stored_state)
|
||||
|
||||
return assigned_cohort, f_bits, mask_indices
|
||||
|
||||
|
||||
def get_bf_bit(input_word, cohort, hash_no, num_bloombits):
|
||||
"""Compute Bloom Filter bits to set."""
|
||||
h = '%s%s%s' % (cohort, hash_no, input_word)
|
||||
sha1 = hashlib.sha1(h).digest()
|
||||
  # Use the first two bytes of the digest to get a Bloom filter output. NOTE:
  # This is only valid for 16 bits (default num_bloombits). Should use the
  # struct module to get arbitrary numbers of bits.
|
||||
a, b = sha1[0], sha1[1]
|
||||
return (ord(a) + ord(b) * 256) % num_bloombits
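
The code above targets Python 2, where a SHA-1 digest is a str and hashlib accepts text directly. A Python 3 version of the same computation might look like the sketch below. It is an assumption-laden port, not the project's code: in particular, encoding the key as UTF-8 is a choice made here for illustration.

```python
import hashlib


def get_bf_bit(input_word, cohort, hash_no, num_bloombits):
    """Return the Bloom filter bit index for one hash function."""
    key = ("%s%s%s" % (cohort, hash_no, input_word)).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    # First two digest bytes, little-endian: digest[0] + digest[1] * 256.
    return int.from_bytes(digest[:2], "little") % num_bloombits
```

Because the cohort and hash number are part of the hashed key, the same word generally maps to different bit positions in different cohorts.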
|
||||
|
||||
|
||||
class Encoder(object):
|
||||
"""Obfuscates values for a given user using the RAPPOR privacy algorithm."""
|
||||
|
||||
def __init__(self, params, user_id, rand_funcs=None):
|
||||
"""
|
||||
Args:
|
||||
params: RAPPOR Params() controlling privacy
|
||||
user_id: user ID, for generating cohort. (In the simulator, each user
|
||||
gets its own Encoder instance.)
|
||||
rand_funcs: randomness, can be deterministic for testing.
|
||||
"""
|
||||
self.params = params # RAPPOR params
|
||||
self.user_id = user_id
|
||||
|
||||
self.rand_funcs = rand_funcs
|
||||
self.p_gen = rand_funcs.p_gen
|
||||
self.q_gen = rand_funcs.q_gen
|
||||
|
||||
def encode(self, word):
|
||||
"""Compute rappor (Instantaneous Randomized Response)."""
|
||||
params = self.params
|
||||
|
||||
cohort, f_bits, mask_indices = get_rappor_masks(self.user_id, word,
|
||||
params,
|
||||
self.rand_funcs)
|
||||
|
||||
bloom_bits_array = 0
|
||||
# Compute Bloom Filter
|
||||
for hash_no in xrange(params.num_hashes):
|
||||
bit_to_set = get_bf_bit(word, cohort, hash_no, params.num_bloombits)
|
||||
bloom_bits_array |= (1 << bit_to_set)
|
||||
|
||||
# Both bit manipulations below use the following fact:
|
||||
# To set c = a if m = 0 or b if m = 1
|
||||
# c = (a & not m) | (b & m)
|
||||
#
|
||||
# Compute PRR as
|
||||
# f_bits if mask_indices = 1
|
||||
# bloom_bits_array if mask_indices = 0
|
||||
|
||||
# ~mask_indices keeps the Bloom filter bits wherever the mask bit is 0.
|
||||
prr = (f_bits & mask_indices) | (bloom_bits_array & ~mask_indices)
|
||||
#print 'prr', bin(prr)
|
||||
|
||||
# Compute instantaneous randomized response:
|
||||
# If PRR bit is set, output 1 with probability q
|
||||
# If PRR bit is not set, output 1 with probability p
|
||||
p_bits = self.p_gen()
|
||||
q_bits = self.q_gen()
|
||||
|
||||
#print bin(f_bits), bin(mask_indices), bin(p_bits), bin(q_bits)
|
||||
|
||||
irr = (p_bits & ~prr) | (q_bits & prr)
|
||||
#print 'irr', bin(irr)
|
||||
|
||||
return cohort, irr # irr is the rappor
|
||||
|
||||
|
||||
# Update rappor sum
|
||||
def update_rappor_sums(rappor_sum, rappor, cohort, params):
|
||||
for bit_num in xrange(params.num_bloombits):
|
||||
if rappor & (1 << bit_num):
|
||||
rappor_sum[cohort][1 + bit_num] += 1
|
||||
rappor_sum[cohort][0] += 1 # The 0^th entry contains total reports in cohort
|
|
@ -0,0 +1,289 @@
|
|||
#!/usr/bin/python
|
||||
#
|
||||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""
|
||||
rappor_test.py: Tests for rappor.py
|
||||
|
||||
NOTE! This contains tests that might fail with very small
|
||||
probability (< 1 in 10,000 times). This is implicitly required
|
||||
for testing probability. Such tests start with the string "testProbFailure".
|
||||
"""
|
||||
|
||||
import copy
|
||||
import math
|
||||
import random
|
||||
import unittest
|
||||
|
||||
import rappor # module under test
|
||||
|
||||
|
||||
class RapporParamsTest(unittest.TestCase):
|
||||
|
||||
def setUp(self):
|
||||
self.typical_instance = rappor.Params()
|
||||
ti = self.typical_instance # For convenience
|
||||
ti.num_cohorts = 64 # Number of cohorts
|
||||
ti.num_hashes = 2 # Number of bloom filter hashes
|
||||
ti.num_bloombits = 16 # Number of bloom filter bits
|
||||
ti.prob_p = 0.40 # Probability p
|
||||
ti.prob_q = 0.70 # Probability q
|
||||
ti.prob_f = 0.30 # Probability f
|
||||
|
||||
# TODO: Move this to constructor, or add a different constructor
|
||||
ti.flag_oneprr = False # One PRR for each user/word pair
|
||||
|
||||
def tearDown(self):
|
||||
pass
|
||||
|
||||
def testApproxRandom(self):
|
||||
get_rand_bits = rappor.ApproxRandom(0.1, 2)
|
||||
r = get_rand_bits()
|
||||
print r, bin(r)
|
||||
|
||||
def testSimpleRandom(self):
|
||||
# TODO: measure speed of naive implementation
|
||||
return
|
||||
for i in xrange(100000):
|
||||
r = rappor.get_rand_bits2(0.1, 2, lambda: random.getrandbits(32))
|
||||
if i % 10000 == 0:
|
||||
print i
|
||||
#print r, [bin(a) for a in r]
|
||||
|
||||
def testProbFailureWeakStatisticalTestForGetRandBits(self):
|
||||
"""Tests whether get_rand_bits outputs correctly biased random bits.
|
||||
|
||||
NOTE! This is a test with a small failure probability.
|
||||
The test succeeds with very very high probability and should only fail
|
||||
1 in 10,000 times or less.
|
||||
|
||||
Samples 256 bits of randomness 1000 times and checks to see that the
|
||||
cumulative number of bits set in each of the 256 positions is within
|
||||
3 \sigma of the mean
|
||||
|
||||
Repeats this experiment with several probability values
|
||||
"""
|
||||
length_in_words = 8 # A good sample size to test; 256 bits
|
||||
rand_fn = (lambda: random.getrandbits(32))
|
||||
# NOTE: 0.0 and 1.0 are not handled exactly.
|
||||
p_values = [0.5, 0.36, 0.9]
|
||||
|
||||
# Trials with different probabilities from p[]
|
||||
for p in p_values:
|
||||
get_rand_bits = rappor.ApproxRandom(p, length_in_words)
|
||||
|
||||
set_bit_count = [0] * 256
|
||||
for _ in xrange(1000):
|
||||
rand_sample = get_rand_bits()
|
||||
|
||||
bin_str = bin(rand_sample)[2:] # i^th word in binary as a str
|
||||
# +2 for the 0b prefix
|
||||
#print bin_str
|
||||
|
||||
# Prefix with leading zeroes
|
||||
bin_str = "0" * (32 - len(bin_str)) + bin_str
|
||||
for j in xrange(32):
|
||||
if bin_str[j] == "1":
|
||||
set_bit_count[32 + j] += 1
|
||||
|
||||
mean = int(1000 * p)
|
||||
# variance of N samples = Np(1-p)
|
||||
stddev = math.sqrt(1000 * p * (1 - p))
|
||||
num_infractions = 0 # Number of values over 3 \sigma
|
||||
infractions = []
|
||||
for i in xrange(length_in_words):
|
||||
for j in xrange(32):
|
||||
s = set_bit_count[i * 32 + j]
|
||||
if s > (mean + 3 * stddev) or s < (mean - 3 * stddev):
|
||||
num_infractions += 1
|
||||
infractions.append(s)
|
||||
|
||||
# 99% confidence for 3 \sigma implies less than 10 errors in 1000
|
||||
# Factor 2 to avoid flakiness as there is a 1% sampling rate error
|
||||
self.assertTrue(
|
||||
num_infractions <= 20, '%s %s' % (num_infractions, infractions))
|
||||
|
||||
def testUpdateRapporSumsWithLessThan32BitBloomFilter(self):
|
||||
report = 0x1d # From LSB, bits 1, 3, 4, 5 are set
|
||||
# Empty rappor_sum
|
||||
rappor_sum = [[0] * (self.typical_instance.num_bloombits + 1)
|
||||
for _ in xrange(self.typical_instance.num_cohorts)]
|
||||
# A random cohort number
|
||||
cohort = 42
|
||||
|
||||
# Setting up expected rappor sum
|
||||
expected_rappor_sum = [[0] * (self.typical_instance.num_bloombits + 1)
|
||||
for _ in xrange(self.typical_instance.num_cohorts)]
|
||||
expected_rappor_sum[42][0] = 1
|
||||
expected_rappor_sum[42][1] = 1
|
||||
expected_rappor_sum[42][3] = 1
|
||||
expected_rappor_sum[42][4] = 1
|
||||
expected_rappor_sum[42][5] = 1
|
||||
|
||||
rappor.update_rappor_sums(rappor_sum, report, cohort,
|
||||
self.typical_instance)
|
||||
self.assertEquals(expected_rappor_sum, rappor_sum)
|
||||
|
||||
def testGetRapporMasksWithoutOnePRR(self):
|
||||
params = copy.copy(self.typical_instance)
|
||||
params.prob_f = 0.5 # For simplicity
|
||||
|
||||
num_words = params.num_bloombits // 32 + 1
|
||||
rand = MockRandom()
|
||||
uniform_gen = rappor.ApproxRandom(0.5, num_words, rand=rand)
|
||||
f_gen = rappor.ApproxRandom(params.prob_f, num_words, rand=rand)
|
||||
rand_funcs = rappor.ApproxRandFuncs(params, rand)
|
||||
rand_funcs.cohort_rand_fn = (lambda a, b: a)
|
||||
|
||||
assigned_cohort, f_bits, mask_indices = rappor.get_rappor_masks(
|
||||
0, ["abc"], params, rand_funcs)
|
||||
|
||||
self.assertEquals(0, assigned_cohort)
|
||||
self.assertEquals(0xfff0000f, f_bits)
|
||||
self.assertEquals(0x0ffff000, mask_indices)
|
||||
|
||||
def testGetBFBit(self):
|
||||
cohort = 0
|
||||
hash_no = 0
|
||||
input_word = "abc"
|
||||
ti = self.typical_instance
|
||||
# expected_hash = ("\x13O\x0b\xa0\xcc\xc5\x89\x01oI\x85\xc8\xc3P\xfe\xa7 H"
|
||||
# "\xb0m")
|
||||
# Output should be
|
||||
# (ord(expected_hash[0]) + ord(expected_hash[1])*256) % 16
|
||||
expected_output = 3
|
||||
actual = rappor.get_bf_bit(input_word, cohort, hash_no, ti.num_bloombits)
|
||||
self.assertEquals(expected_output, actual)
|
||||
|
||||
hash_no = 1
|
||||
# expected_hash = ("\xb6\xcc\x7f\xee@\x95\xb0\xdb\xf5\xf1z\xc7\xdaPM"
|
||||
# "\xd4\xd6u\xed3")
|
||||
expected_output = 6
|
||||
actual = rappor.get_bf_bit(input_word, cohort, hash_no, ti.num_bloombits)
|
||||
self.assertEquals(expected_output, actual)
|
||||
|
||||
def testGetRapporMasksWithOnePRR(self):
|
||||
# Set randomness function to be used to sample 32 random bits
|
||||
# Set randomness function that takes two integers and returns a
|
||||
# random integer cohort in [a, b]
|
||||
|
||||
params = copy.copy(self.typical_instance)
|
||||
params.flag_oneprr = True
|
||||
|
||||
num_words = params.num_bloombits // 32 + 1
|
||||
rand = MockRandom()
|
||||
rand_funcs = rappor.ApproxRandFuncs(params, rand)
|
||||
|
||||
# First two calls to get_rappor_masks for identical inputs
|
||||
# Third call for a different input
|
||||
print '\tget_rappor_masks 1'
|
||||
cohort_1, f_bits_1, mask_indices_1 = rappor.get_rappor_masks(
|
||||
"0", "abc", params, rand_funcs)
|
||||
print '\tget_rappor_masks 2'
|
||||
cohort_2, f_bits_2, mask_indices_2 = rappor.get_rappor_masks(
|
||||
"0", "abc", params, rand_funcs)
|
||||
print '\tget_rappor_masks 3'
|
||||
cohort_3, f_bits_3, mask_indices_3 = rappor.get_rappor_masks(
|
||||
"0", "abcd", params, rand_funcs)
|
||||
|
||||
# First two outputs should be identical, i.e., identical PRRs
|
||||
self.assertEquals(f_bits_1, f_bits_2)
|
||||
self.assertEquals(mask_indices_1, mask_indices_2)
|
||||
self.assertEquals(cohort_1, cohort_2)
|
||||
|
||||
# Third PRR should be different from the first PRR
|
||||
self.assertNotEqual(f_bits_1, f_bits_3)
|
||||
self.assertNotEqual(mask_indices_1, mask_indices_3)
|
||||
self.assertNotEqual(cohort_1, cohort_3)
|
||||
|
||||
# Now testing with flag_oneprr false
|
||||
params.flag_oneprr = False
|
||||
cohort_1, f_bits_1, mask_indices_1 = rappor.get_rappor_masks(
|
||||
"0", "abc", params, rand_funcs)
|
||||
cohort_2, f_bits_2, mask_indices_2 = rappor.get_rappor_masks(
|
||||
"0", "abc", params, rand_funcs)
|
||||
|
||||
self.assertNotEqual(f_bits_1, f_bits_2)
|
||||
self.assertNotEqual(mask_indices_1, mask_indices_2)
|
||||
self.assertNotEqual(cohort_1, cohort_2)
|
||||
|
||||
def testEncoder(self):
|
||||
"""Expected bloom bits is computed as follows.
|
||||
|
||||
f_bits = 0xfff0000f and mask_indices = 0x0ffff000 from
|
||||
testGetRapporMasksWithoutOnePRR()
|
||||
|
||||
q_bits = 0xfffff0ff from mock_rand.randomness[] and how get_rand_bits works
|
||||
p_bits = 0x000ffff0 from -- do --
|
||||
|
||||
bloom_bits_array is 0x0000 0048 (3rd bit and 6th bit, from
|
||||
testGetBFBit, are set)
|
||||
|
||||
Bit arithmetic ends up computing
|
||||
bloom_bits_prr = 0x0ff00048
|
||||
bloom_bits_irr = 0x0ffffff8
|
||||
"""
|
||||
params = copy.copy(self.typical_instance)
|
||||
params.prob_f = 0.5
|
||||
params.prob_p = 0.5
|
||||
params.prob_q = 0.75
|
||||
|
||||
rand_funcs = rappor.ApproxRandFuncs(params, MockRandom())
|
||||
rand_funcs.cohort_rand_fn = lambda a, b: a
|
||||
e = rappor.Encoder(params, 0, rand_funcs=rand_funcs)
|
||||
|
||||
cohort, bloom_bits_irr = e.encode("abc")
|
||||
|
||||
self.assertEquals(0, cohort)
|
||||
self.assertEquals(0x0ffffff8, bloom_bits_irr)
|
||||
|
||||
|
||||
class MockRandom(object):
|
||||
"""Returns one of eight random strings in a cyclic manner.
|
||||
|
||||
Mock random function that involves *some* state, as needed for tests
|
||||
that call randomness several times. This makes it difficult to deal
|
||||
exclusively with stubs for testing purposes.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self.counter = 0
|
||||
self.randomness = [0x0000ffff, 0x000ffff0, 0x00ffff00, 0x0ffff000,
|
||||
0xfff000f0, 0xfff0000f, 0xf0f0f0f0, 0xff0f00ff]
|
||||
|
||||
def seed(self, seed):
|
||||
self.counter = hash(seed) % 8
|
||||
#print 'SEED', self.counter
|
||||
|
||||
def getstate(self):
|
||||
#print 'GET STATE', self.counter
|
||||
return self.counter
|
||||
|
||||
def setstate(self, state):
|
||||
#print 'SET STATE', state
|
||||
self.counter = state
|
||||
|
||||
def getrandbits(self, unused_num_bits):
|
||||
#print 'GETRAND', self.counter
|
||||
rand_val = self.randomness[self.counter]
|
||||
self.counter = (self.counter + 1) % 8
|
||||
return rand_val
|
||||
|
||||
def randint(self, a, b):
|
||||
return a + self.counter
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
|
@ -0,0 +1,26 @@
|
|||
#!/usr/bin/python
|
||||
#
|
||||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
from distutils.core import setup, Extension
|
||||
|
||||
module = Extension('_fastrand',
|
||||
sources = ['_fastrand.c'])
|
||||
|
||||
setup(name = '_fastrand',
|
||||
version = '1.0',
|
||||
description = 'Module to speed up RAPPOR simulation',
|
||||
ext_modules = [module])
|
|
@ -0,0 +1,188 @@
|
|||
#!/bin/bash
|
||||
#
|
||||
# Demo of RAPPOR. Automating Python and R scripts. See README.
|
||||
#
|
||||
# Usage:
|
||||
# ./demo.sh <function name>
|
||||
#
|
||||
# End to end demo for 3 distributions:
|
||||
#
|
||||
# $ ./demo.sh run
|
||||
#
|
||||
# (This takes a minute or so)
|
||||
#
|
||||
# To use a different R interpreter, set R_PREFIX: e.g.
|
||||
#
|
||||
# $ export R_PREFIX=/usr/local/bin/Rscript
|
||||
# $ ./demo.sh run
|
||||
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
set -o errexit
|
||||
|
||||
readonly THIS_DIR=$(dirname $0)
|
||||
readonly REPO_ROOT=$THIS_DIR
|
||||
readonly CLIENT_DIR=$REPO_ROOT/client/python
|
||||
|
||||
#
|
||||
# Utility functions
|
||||
#
|
||||
|
||||
banner() {
|
||||
echo
|
||||
echo "----- $@"
|
||||
echo
|
||||
}
|
||||
|
||||
log() {
|
||||
echo 1>&2 "$@"
|
||||
}
|
||||
|
||||
die() {
|
||||
log "$0: $@"
|
||||
exit 1
|
||||
}
|
||||
|
||||
#
|
||||
# Semi-automated demos
|
||||
#
|
||||
|
||||
# This generates the simulated input s1 .. s<n> with 3 different distributions.
|
||||
gen-sim-input() {
|
||||
local dist=$1
|
||||
local num_clients=$2
|
||||
|
||||
local flag=''
|
||||
case $dist in
|
||||
exp)
|
||||
flag=-e
|
||||
;;
|
||||
gauss)
|
||||
flag=-g
|
||||
;;
|
||||
unif)
|
||||
flag=-u
|
||||
;;
|
||||
*)
|
||||
die "Invalid distribution '$dist'"
|
||||
esac
|
||||
|
||||
mkdir -p _tmp
|
||||
|
||||
# Simulating 10,000 clients runs reasonably fast but the results look poor.
|
||||
# 100,000 is slow but looks better.
|
||||
# 50 different client values are easier to plot (default is 100)
|
||||
time tests/gen_sim_input.py $flag \
|
||||
-n $num_clients \
|
||||
-r 50 \
|
||||
-o _tmp/$dist.csv
|
||||
}
|
||||
|
||||
# Do the RAPPOR transformation on our simulated input.
|
||||
rappor-sim() {
|
||||
local dist=$1
|
||||
shift
|
||||
PYTHONPATH=$CLIENT_DIR time $REPO_ROOT/tests/rappor_sim.py \
|
||||
-i _tmp/$dist.csv \
|
||||
"$@"
|
||||
#-s 0 # deterministic seed
|
||||
}
|
||||
|
||||
# Like rappor-sim, but run it through the Python profiler.
|
||||
rappor-sim-profile() {
|
||||
local dist=$1
|
||||
shift
|
||||
|
||||
export PYTHONPATH=$CLIENT_DIR
|
||||
# For now, just dump it to a text file. Sort by cumulative time.
|
||||
time python -m cProfile -s cumulative \
|
||||
tests/rappor_sim.py \
|
||||
-i _tmp/$dist.csv \
|
||||
"$@" \
|
||||
| tee _tmp/profile.txt
|
||||
}
|
||||
|
||||
# Analyze output of Python client library.
|
||||
analyze() {
|
||||
local dist=$1
|
||||
local title=$2
|
||||
local prefix=_tmp/$dist
|
||||
|
||||
# Workaround to use a different R interpreter. 'env' is a no-op.
|
||||
local r_prefix=${R_PREFIX:-env}
|
||||
|
||||
local out_dir=_tmp/${dist}_report
|
||||
mkdir -p $out_dir
|
||||
|
||||
time $r_prefix tests/analyze.R -t "$title" $prefix $out_dir
|
||||
}
|
||||
|
||||
# Use locally compiled R. This is useful for Google computers, i.e. instead of
|
||||
# using the Google R build.
|
||||
analyze2() {
|
||||
R_PREFIX=/usr/local/bin/Rscript analyze "$@"
|
||||
}
|
||||
|
||||
# Run end to end for one distribution.
|
||||
run-dist() {
|
||||
local dist=$1
|
||||
# TODO: parameterize output dirs by num_clients
|
||||
local num_clients=${2:-100000}
|
||||
|
||||
banner "Generating simulated input data ($dist)"
|
||||
gen-sim-input $dist $num_clients
|
||||
|
||||
banner "Running RAPPOR ($dist)"
|
||||
rappor-sim $dist
|
||||
|
||||
banner "Analyzing RAPPOR output ($dist)"
|
||||
analyze $dist "Distribution Comparison ($dist)"
|
||||
}
|
||||
|
||||
expand-html() {
|
||||
local template=${1:-../tests/report.html}
|
||||
local out_dir=${2:-_tmp}
|
||||
|
||||
pushd $out_dir >/dev/null
|
||||
|
||||
# NOTE: We're arbitrarily using the "exp" values since params are all
|
||||
# independent of distribution.
|
||||
|
||||
cat $template \
|
||||
| sed -e '/SIM_PARAMS/ r exp_sim_params.html' \
|
||||
-e '/RAPPOR_PARAMS/ r exp_params.html' \
|
||||
> report.html
|
||||
|
||||
log "Wrote $out_dir/report.html. Open this in your browser."
|
||||
|
||||
popd >/dev/null
|
||||
}
|
||||
|
||||
# Build prerequisites for the demo.
|
||||
build() {
|
||||
# This is optional now.
|
||||
./build.sh fastrand
|
||||
}
|
||||
|
||||
_run() {
|
||||
local num_clients=${1:-100000}
|
||||
for dist in exp gauss unif; do
|
||||
run-dist $dist $num_clients
|
||||
done
|
||||
# Link the HTML skeleton
|
||||
#
|
||||
# TODO:
|
||||
# - gen_sim_input output sim_params.html
|
||||
# - read params rappor_params.html
|
||||
|
||||
expand-html ../tests/report.html _tmp
|
||||
|
||||
wc -l _tmp/*.csv
|
||||
}
|
||||
|
||||
# Main entry point. Run it for all distributions, and time the result.
|
||||
run() {
|
||||
time _run "$@"
|
||||
}
|
||||
|
||||
"$@"
|
|
@ -0,0 +1,105 @@
|
|||
RAPPOR Tutorial
|
||||
===============
|
||||
|
||||
This doc explains the simulation tools for RAPPOR. For a detailed description
|
||||
of the algorithm, see the [paper](http://arxiv.org/abs/1407.6981).
|
||||
|
||||
Start with this command:
|
||||
|
||||
$ ./demo.sh run
|
||||
|
||||
It currently takes 45 seconds or so to run.
|
||||
|
||||
As described in the [README](../README.html), this command generates simulated
|
||||
input data with different distributions, runs it through RAPPOR, then analyzes
|
||||
and plots the output.
|
||||
|
||||
(The dependencies listed in the README must be installed.)
|
||||
|
||||
The command is composed of several parts.
|
||||
|
||||
1. Generating Simulated Input Data
|
||||
----------------------------------
|
||||
|
||||
`gen_sim_input.py` generates test data. Each row contains a client ID, and a
|
||||
space separated list of reported values -- the true values we wish to keep
|
||||
private.
|
||||
|
||||
By default, we generate 5-9 values per client, out of 50 unique values, so the
|
||||
output may look something like this:
|
||||
|
||||
1,s10 s55 s1 s15 s29 s57 s6
|
||||
2,s20 s61 s9 s21 s39 s32 s32 s6 s49
|
||||
...
|
||||
<client N>,<client N's space-separated raw data>
|
||||
|
||||
You can select the distribution of the `sN` values by passing a flag. The
|
||||
shell script loops through 3 distributions: exponential, normal/gaussian, and
|
||||
uniform.
|
||||
|
||||
You can also write a script to generate a file in this format and pass it to
|
||||
the next two stages.
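
As a minimal sketch of such a script, the following writes and parses rows in the format shown above. The helper names (`format_row`, `parse_row`, `generate_rows`) are hypothetical and are not part of `gen_sim_input.py`; the uniform sampling and the 5-9 values per client simply mirror the defaults described in this section. It is written for Python 3.

```python
import random


def format_row(client_id, values):
    """Format one row as '<client id>,<space-separated values>'."""
    return "%d,%s" % (client_id, " ".join(values))


def parse_row(line):
    """Split a row back into (client_id, list of value strings)."""
    client_id, _, raw_values = line.rstrip("\n").partition(",")
    return int(client_id), raw_values.split()


def generate_rows(num_clients, num_unique_values, rng):
    """Yield one formatted row per client, with 5-9 uniform values each.

    Assumption: values are named 's1' .. 's<num_unique_values>'.
    """
    for client_id in range(1, num_clients + 1):
        num_reports = rng.randint(5, 9)  # randint includes both endpoints
        values = ["s%d" % rng.randint(1, num_unique_values)
                  for _ in range(num_reports)]
        yield format_row(client_id, values)


if __name__ == "__main__":
    for row in generate_rows(3, 50, random.Random(0)):
        print(row)
```

Passing an explicit `random.Random` instance keeps the output reproducible, which is convenient when comparing runs.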
|
||||
|
||||
2. RAPPOR Transformation
|
||||
------------------------
|
||||
|
||||
`tests/rappor_sim.py` uses the Python client library
|
||||
(`client/python/rappor.py`) to obfuscate the `s1` .. `sN` strings.
|
||||
|
||||
To preserve the user's privacy, we add random noise by flipping bits in two
|
||||
different ways.
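
The two steps can be sketched with plain integer bit masks, following the bit arithmetic used in `client/python/rappor.py`. This is an illustration rather than the library's API: the function names are hypothetical, and the random masks are fixed constants borrowed from the mock randomness in `rappor_test.py` instead of being drawn at random.

```python
def select_bits(a, b, mask):
    """Per bit position: take b where mask is 1, and a where mask is 0."""
    return (a & ~mask) | (b & mask)


def permanent_rr(bloom_bits, f_bits, mask_indices):
    """First step: replace the Bloom filter bits with uniform bits (f_bits)
    wherever mask_indices is 1."""
    return select_bits(bloom_bits, f_bits, mask_indices)


def instantaneous_rr(prr, p_bits, q_bits):
    """Second step: where a PRR bit is 1, report the bit from q_bits;
    where it is 0, report the bit from p_bits."""
    return select_bits(p_bits, q_bits, prr)


# Fixed example masks (assumed values taken from the unit test, not random).
bloom_bits = 0x00000048  # Bloom filter with bits 3 and 6 set
prr = permanent_rr(bloom_bits, f_bits=0xfff0000f, mask_indices=0x0ffff000)
irr = instantaneous_rr(prr, p_bits=0x000ffff0, q_bits=0xfffff0ff)
print(hex(prr), hex(irr))
```

In the real client, each bit of `mask_indices` is 1 with probability `f`, and each bit of `p_bits` and `q_bits` is 1 with probability `p` and `q` respectively.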
|
||||
|
||||
<!-- TODO: a realistic data set would be nice? How could we generate one? -->
|
||||
|
||||
It generates 4 files:
|
||||
|
||||
- Counts (`exp_out.csv`) -- This currently is the sum of what will be sent over
|
||||
the network. TODO: change it to output individual reports. Then have a
|
||||
separate tool that does the summing.
|
||||
|
||||
- Parameters (`exp_params.csv`) -- This is a 1-row CSV file with the 6 privacy parameters
|
||||
`k,h,m,p,q,f`. (The [report.html](../report.html) file and the paper both
|
||||
describe these parameters). This should be sent over the network along with
|
||||
the counts. When the raw RAPPOR data is persisted, this should also form
|
||||
part of the "schema", as the data can't be decoded correctly without it.
|
||||
|
||||
- True histogram of input values (`exp_hist.csv`) -- This is for debugging /
|
||||
comparison. You won't have this in a real setting, of course.
|
||||
|
||||
- Map file (`exp_map.csv`) -- Hashed candidates.
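
Since the parameters must travel with the counts, a consumer will want to validate them on load. The sketch below is one way to do that in Python 3. It assumes the file has a header line `k,h,m,p,q,f` followed by a single data row, that `k`, `h`, `m` are integers, and that `p`, `q`, `f` are probabilities; none of this is confirmed by the tools themselves, and `read_params` is a hypothetical helper.

```python
import csv
import io

INT_FIELDS = ("k", "h", "m")
PROB_FIELDS = ("p", "q", "f")


def read_params(f):
    """Read a params CSV from a file-like object into a dict.

    Assumption: one header line naming the six fields, then one data row.
    """
    rows = list(csv.DictReader(f))
    if len(rows) != 1:
        raise ValueError("expected exactly one data row, got %d" % len(rows))
    row = rows[0]
    missing = [name for name in INT_FIELDS + PROB_FIELDS if name not in row]
    if missing:
        raise ValueError("missing fields: %s" % ", ".join(missing))

    params = {name: int(row[name]) for name in INT_FIELDS}
    for name in PROB_FIELDS:
        value = float(row[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError("%s must be in [0, 1], got %r" % (name, value))
        params[name] = value
    return params


# Example contents (assumed layout); a real file would be opened from disk.
SAMPLE = "k,h,m,p,q,f\n16,2,64,0.4,0.7,0.3\n"
print(read_params(io.StringIO(SAMPLE)))
```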
|
||||
|
||||
|
||||
3. RAPPOR Analysis
|
||||
------------------
|
||||
|
||||
Once you have the `counts`, `params`, and `map` files, you can pass them to the
|
||||
`tests/analyze.R` tool, which is a small wrapper around the `analysis/R`
|
||||
library.
|
||||
|
||||
Then you will get a plot of the true distribution vs. the distribution
|
||||
recovered from data obfuscated with the RAPPOR privacy algorithm.
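
The "true distribution" side of the plot is just each value's share of the total count. A minimal Python 3 sketch of that calculation follows; it assumes the histogram CSV has `string` and `count` columns (the column names `tests/analyze.R` reads), and `true_proportions` is a hypothetical helper.

```python
import csv
import io


def true_proportions(f):
    """Map each reported string to its share of all reported values.

    Assumption: the CSV has a header with 'string' and 'count' columns.
    """
    rows = list(csv.DictReader(f))
    total = sum(int(row["count"]) for row in rows)
    if total == 0:
        raise ValueError("histogram is empty")
    return {row["string"]: int(row["count"]) / float(total) for row in rows}


# Example contents (made up for illustration).
SAMPLE = "string,count\ns1,3\ns2,1\n"
print(true_proportions(io.StringIO(SAMPLE)))
```

These proportions are what the recovered RAPPOR estimates are compared against, value by value.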
|
||||
|
||||
[View the example output](../report.html).
|
||||
|
||||
You can change the simulation or RAPPOR parameters via flags, and compare the
|
||||
resulting distributions.
|
||||
|
||||
TODO
|
||||
----
|
||||
|
||||
The user should provide candidates, and we should have a tool to hash them. This
|
||||
is like the gen_map tool.
|
||||
|
||||
$ hash_candidates.py <candidates>
|
||||
(Writes <map file>)
|
||||
|
||||
Tool to extract candidates from the input file.
|
||||
|
||||
$ ./demo.sh cheat-candidates <raw input>
|
||||
|
||||
In the real setting, it can be nontrivial to enumerate the candidates.
|
||||
|
||||
To simulate this, filter the list with `grep`.
|
||||
|
||||
Show more detailed command lines, --help?
|
||||
|
|
@ -0,0 +1,135 @@
|
|||
#!/usr/bin/Rscript --vanilla
|
||||
#
|
||||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
# Simple tool that wraps the analysis/R library.
|
||||
#
|
||||
# To run this you need:
|
||||
# - ggplot2
|
||||
# - optparse
|
||||
# - glmnet -- dependency of analysis library
|
||||
|
||||
library(optparse)
|
||||
|
||||
# Do command line parsing first to catch errors. Loading libraries in R is
|
||||
# slow.
|
||||
if (!interactive()) {
|
||||
option_list <- list(
|
||||
make_option(c("-t", "--title"), help="Plot Title")
|
||||
)
|
||||
parsed <- parse_args(OptionParser(option_list = option_list),
|
||||
positional_arguments = 2) # input and output
|
||||
}
|
||||
|
||||
library(ggplot2)
|
||||
|
||||
source("analysis/R/analysis_lib.R")
|
||||
source("analysis/R/read_input.R")
|
||||
source("analysis/R/decode.R")
|
||||
|
||||
Log <- function(...) {
|
||||
cat('analyze.R: ')
|
||||
cat(sprintf(...))
|
||||
cat('\n')
|
||||
}
|
||||
|
||||
LoadInputs <- function(prefix, ctx) {
|
||||
# prefix: path prefix, e.g. '_tmp/exp'
|
||||
p <- paste0(prefix, '_params.csv')
|
||||
c <- paste0(prefix, '_out.csv')
|
||||
m <- paste0(prefix, '_map.csv')
|
||||
h <- paste0(prefix, '_hist.csv')
|
||||
|
||||
# Calls AnalyzeRAPPOR to run the analysis code
|
||||
# Date(s) are some dummy dates
|
||||
ctx$rappor <- AnalyzeRAPPOR(ReadParameterFile(p),
|
||||
ReadCountsFile(c),
|
||||
ReadMapFile(m)$map, "FDR", 0.05, 1,
|
||||
date="01/01/01", date_num="100001")
|
||||
if (is.null(ctx$rappor)) {
|
||||
stop("RAPPOR analysis failed.")
|
||||
}
|
||||
ctx$actual <- read.csv(h)
|
||||
}
|
||||
|
||||
# Prepare input data to be plotted.
|
||||
ProcessAll <- function(ctx) {
|
||||
actual <- ctx$actual
|
||||
rappor <- ctx$rappor
|
||||
|
||||
# "s12" -> 12, for graphing
|
||||
StringToInt <- function(x) as.integer(substring(x, 2))
|
||||
|
||||
total <- sum(actual$count)
|
||||
a <- data.frame(index = StringToInt(actual$string),
|
||||
# Calculate the true proportion
|
||||
proportion = actual$count / total,
|
||||
dist = "actual")
|
||||
|
||||
r <- data.frame(index = StringToInt(rappor$strings),
|
||||
proportion = rappor$proportion,
|
||||
dist = "rappor")
|
||||
|
||||
# Fill in zeros for values missing in RAPPOR. It makes the ggplot bar plot
|
||||
# look better.
|
||||
fill <- setdiff(actual$string, rappor$strings)
|
||||
if (length(fill) > 0) {
|
||||
z <- data.frame(index = StringToInt(fill),
|
||||
proportion = 0.0,
|
||||
dist = "rappor")
|
||||
} else {
|
||||
z <- data.frame()
|
||||
}
|
||||
|
||||
rbind(r, a, z)
|
||||
}
|
||||
|
||||
PlotAll <- function(d, title) {
|
||||
# NOTE: geom_bar makes a histogram by default; need stat = "identity"
|
||||
g <- ggplot(d, aes(x = index, y = proportion, fill = factor(dist)))
|
||||
b <- geom_bar(stat = "identity", position = "dodge")
|
||||
t <- ggtitle(title)
|
||||
g + b + t
|
||||
}
|
||||
|
||||
WritePlot <- function(p, outdir, width = 800, height = 600) {
|
||||
filename <- file.path(outdir, 'dist.png')
|
||||
png(filename, width=width, height=height)
|
||||
plot(p)
|
||||
dev.off()
|
||||
Log('Wrote %s', filename)
|
||||
}
|
||||
|
||||
main <- function(parsed) {
|
||||
args <- parsed$args
|
||||
options <- parsed$options
|
||||
|
||||
input_prefix <- args[[1]]
|
||||
output_dir <- args[[2]]
|
||||
|
||||
# increase ggplot font size globally
|
||||
theme_set(theme_grey(base_size = 16))
|
||||
|
||||
ctx <- new.env()
|
||||
|
||||
LoadInputs(input_prefix, ctx)
|
||||
d <- ProcessAll(ctx)
|
||||
p <- PlotAll(d, options$title)
|
||||
WritePlot(p, output_dir)
|
||||
}
|
||||
|
||||
if (!interactive()) {
|
||||
main(parsed)
|
||||
}
|
|
@ -0,0 +1,242 @@
|
|||
#!/usr/bin/python
|
||||
#
|
||||
# Copyright 2014 Google Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
# Copyright 2014 Google Inc. All Rights Reserved.
|
||||
"""Tool to generate simulated input data for RAPPOR.
|
||||
|
||||
We can output data in the following distributions:
|
||||
|
||||
a. Uniform
|
||||
b. Gaussian
|
||||
c. Exponential
|
||||
|
||||
After it goes through RAPPOR, we should be able to see the distribution, but not
|
||||
any user's particular input data.
|
||||
"""
|
||||
|
||||
import getopt
|
||||
import math
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
|
||||
# Distributions
|
||||
DISTR_UNIF = 1 # Uniform
|
||||
DISTR_GAUSS = 2 # Gaussian
|
||||
DISTR_EXP = 3 # Exponential
|
||||
|
||||
|
||||
# Command line arguments
|
||||
OUTFILE = "" # Output file name
|
||||
DISTR = DISTR_UNIF # Distribution: default is uniform
|
||||
NUM_UNIQUE_VALUES = 100 # Range of client's values in reports
|
||||
# The default is strings "s1" ... "s100"
|
||||
DIST_PARAM = None # Parameter to pass to distribution
|
||||
NUM_CLIENTS = 100000 # Number of simulated clients
|
||||
|
||||
|
||||
# NOTE: unused. This is hard-coded now.
|
||||
LOG_NUM_UNIQUE_VALUES = 30 # Something like 4-5xlog(NUM_UNIQUE_VALUES) bits
|
||||
# should give enough entropy for good samples
|
||||
|
||||
ONE_MINUS_EXP_LAMBDA = 0 # 1-e^-lambda
|
||||
|
||||
|
||||
def log(msg, *args):
|
||||
if args:
|
||||
msg = msg % args
|
||||
print >>sys.stderr, msg
|
||||
|
||||
|
||||
# Script usage scenario
|
||||
def usage(script_name):
|
||||
sys.stdout.write("Usage: " + script_name + " -o <output file name>")
|
||||
sys.stdout.write(" -r <range of values \"s1\"-\"sXX\">")
|
||||
sys.stdout.write(" [-u|g|e|n|p]")
|
||||
|
||||
sys.stdout.write("""
|
||||
|
||||
-u Uniform distribution (default)
|
||||
-g Gaussian distribution
|
||||
-e Exponential distribution
|
||||
-n Number of users (default = 100,000)
|
||||
-p Parameter
|
||||
Ignored for uniform
|
||||
Std-dev for Gaussian
|
||||
Lambda for Exponential
|
||||
|
||||
""")
|
||||
|
||||
|
||||
def init_rand_precompute():
|
||||
global ONE_MINUS_EXP_LAMBDA
|
||||
if DISTR == DISTR_EXP:
|
||||
ONE_MINUS_EXP_LAMBDA = 1 - math.exp(-DIST_PARAM)
|
||||
|
||||
|
||||
def rand_sample_unif():
|
||||
return random.randint(1, NUM_UNIQUE_VALUES)  # inclusive of both endpoints
|
||||
|
||||
|
||||
def rand_sample_gauss():
|
||||
"""Returns a value in [1, NUM_UNIQUE_VALUES] drawn from a Gaussian."""
|
||||
mean = float(NUM_UNIQUE_VALUES + 1) / 2
|
||||
while True:
|
||||
r = random.normalvariate(mean, DIST_PARAM)
|
||||
value = int(round(r))
|
||||
# Rejection sampling to cut off Gaussian to within [1, NUM_UNIQUE_VALUES]
|
||||
if 1 <= value <= NUM_UNIQUE_VALUES:
|
||||
break
|
||||
|
||||
return value # true client value
|
||||
|
||||
|
||||
def rand_sample_exp():
|
||||
"""Returns a random sample in [1, NUM_UNIQUE_VALUES] drawn from an
|
||||
exponential distribution.
|
||||
"""
|
||||
rand_in_cf = random.random()
|
||||
# Val sampled from exp distr in [0,1] is CDF^{-1}(unif in [0,1))
|
||||
rand_sample_in_01 = (
|
||||
-math.log(1 - rand_in_cf * ONE_MINUS_EXP_LAMBDA) / DIST_PARAM)
|
||||
# Scale up to NUM_UNIQUE_VALUES and floor to integer
|
||||
rand_val = int((rand_sample_in_01 * NUM_UNIQUE_VALUES) + 1)
|
||||
return rand_val
|
||||
|
||||
|
||||
PARAMS_HTML = """
|
||||
<h3>Simulation Input</h3>
|
||||
<table align="center">
|
||||
<tr>
|
||||
<td>Number of clients</td>
|
||||
<td align="right">{num_clients:,}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Total values reported / obfuscated</td>
|
||||
<td align="right">{num_values:,}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Unique values reported / obfuscated</td>
|
||||
<td align="right">{num_unique_values}</td>
|
||||
</tr>
|
||||
</table>
|
||||
"""
|
||||
|
||||
|
||||
def WriteParamsHtml(num_values, f):
|
||||
d = {
|
||||
'num_clients': NUM_CLIENTS,
|
||||
'num_unique_values': NUM_UNIQUE_VALUES,
|
||||
'num_values': num_values,
|
||||
}
|
||||
# NOTE: No HTML escaping since we're writing numbers
|
||||
print >>f, PARAMS_HTML.format(**d)
|
||||
|
||||
|
||||
def main(argv):
  # All command line arguments are placed into global vars
  global OUTFILE, NUM_UNIQUE_VALUES, DISTR, DIST_PARAM, NUM_CLIENTS

  # Get arguments
  try:
    opts, args = getopt.getopt(argv[1:], "ugen:p:o:r:")
  except getopt.GetoptError:
    usage(argv[0])
    sys.exit(2)

  # Parsing arguments
  for opt, arg in opts:
    if opt == "-o":
      OUTFILE = arg
    elif opt == "-r":
      NUM_UNIQUE_VALUES = int(arg)
    elif opt == "-u":
      DISTR = DISTR_UNIF
    elif opt == "-g":
      DISTR = DISTR_GAUSS
    elif opt == "-e":
      DISTR = DISTR_EXP
    elif opt == "-p":
      DIST_PARAM = float(arg)
    elif opt == "-n":
      NUM_CLIENTS = int(arg)

  # Some sanity checking
  if not OUTFILE:
    sys.stdout.write("Output file is required.\n")
    usage(argv[0])
    sys.exit(2)

  if NUM_UNIQUE_VALUES < 2:
    sys.stdout.write("Range should be at least 2. Setting to default 100.\n")
    NUM_UNIQUE_VALUES = 100

  if DIST_PARAM is None:
    if DISTR == DISTR_GAUSS:
      DIST_PARAM = float(NUM_UNIQUE_VALUES) / 6
    elif DISTR == DISTR_EXP:
      DIST_PARAM = float(NUM_UNIQUE_VALUES) / 5

  if NUM_CLIENTS < 10:
    sys.stdout.write("RAPPOR typically works with a much larger number of users.")
    sys.stdout.write(" Setting number of users to 10.\n")
    NUM_CLIENTS = 10

  random.seed()

  # Precompute and initialize constants needed for random samples
  init_rand_precompute()

  # Choose a function that yields the desired distribution.  Each of these
  # functions returns a randomly sampled integer between 1 and
  # NUM_UNIQUE_VALUES.  The functions use some globals.
  if DISTR == DISTR_UNIF:
    rand_sample = rand_sample_unif
  elif DISTR == DISTR_GAUSS:
    rand_sample = rand_sample_gauss
  elif DISTR == DISTR_EXP:
    rand_sample = rand_sample_exp

  start_time = time.time()

  # Printing values into file OUTFILE
  num_values = 0
  with open(OUTFILE, "w") as f:
    for i in xrange(1, NUM_CLIENTS + 1):
      if i % 10000 == 0:
        elapsed = time.time() - start_time
        log('Generated %d rows in %.2f seconds', i, elapsed)

      f.write('%d,' % i)
      # Generates between 5 and 9 values for each user/client.  This is hard
      # coded for now -- could be set by flags.
      values = [rand_sample() for _ in xrange(random.randint(5, 9))]
      f.write(' '.join('s%d' % v for v in values))
      f.write("\n")
      num_values += len(values)
  log('Wrote %s', OUTFILE)

  prefix, _ = os.path.splitext(OUTFILE)
  params_filename = prefix + '_sim_params.html'
  # TODO: This should take 'opts'
  with open(params_filename, 'w') as f:
    WriteParamsHtml(num_values, f)
  log('Wrote %s', params_filename)


if __name__ == "__main__":
  main(sys.argv)
|
|
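The row format written by the loop in `main` above (`<client id>,s<v1> s<v2> ...`, with 5 to 9 values per client) can be sketched in isolation. This is a minimal Python 3 sketch under stated assumptions, not the script's own code: `format_row`, `parse_row` and `generate_rows` are hypothetical helper names, and the `sample` callable stands in for whichever `rand_sample_*` function the script selects.

```python
import random


def format_row(client_id, values):
  """Format one client's values as '<id>,s<v1> s<v2> ...'."""
  return '%d,%s' % (client_id, ' '.join('s%d' % v for v in values))


def parse_row(line):
  """Inverse of format_row: return (client_id, list of value strings)."""
  client_id, words = line.strip().split(',')
  return int(client_id), words.split()


def generate_rows(num_clients, sample, rng):
  """Yield one row per client, each with between 5 and 9 sampled values."""
  for i in range(1, num_clients + 1):
    values = [sample() for _ in range(rng.randint(5, 9))]
    yield format_row(i, values)


if __name__ == '__main__':
  rng = random.Random(0)  # fixed seed only so the demo is repeatable
  for row in generate_rows(3, lambda: rng.randint(1, 100), rng):
    print(row)
```

Keeping formatting and parsing as a matched pair makes it easy to check that a consumer of the file reads back exactly what the generator wrote.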
@ -0,0 +1,361 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tool to run RAPPOR on simulated client input.

It takes a 2-column CSV file as generated by gen_sim_data.py.  Example:

    1,s10 s55 s1 s15 s29 s57 s6
    2,s20 s61 s9 s21 s39 s64 s32 s6 s49
    ...
    <client N>,<client N's space-separated raw data>

We output 4 files:
    - params: RAPPOR parameters, needed to recover distributions from the output
    - out: total counts of the Bloom filter bits set by RAPPOR on the input
      data
    - map file: candidate strings and hashes; required by the RAPPOR analysis
    - hist: histogram of actual input values.  Compare this with the histogram
      the RAPPOR analysis infers from the first 3 files.
"""

import collections
import getopt
import os
import random
import sys
import time

import rappor  # client library
try:
  import fastrand
except ImportError:
  print >>sys.stderr, (
      "Native fastrand module not imported; see README for speedups")
  fastrand = None


# Error flags
PARSE_SUCCESS = 0
PARSE_ERROR = 1


def log(msg, *args):
  if args:
    msg = msg % args
  print >>sys.stderr, msg


class RapporInstance(object):
  """Simple class to create a RAPPOR instance with specific default params."""
  def __init__(self):
    self.params = rappor.Params()

    self.infile = ""  # Input file name; must be user-provided
    self.outfile = ""  # Output file name
    self.histfile = ""  # Output histogram file
    self.mapfile = ""  # Output BF map file
    self.paramsfile = ""  # Output params file
    self.randomness_seed = None  # Randomness seed; for debugging purposes only

    # TODO: Add orthogonal flag for cryptographic randomness.
    self.random_mode = 'fast'  # simple/approx/fast.

  # For testing
  def __eq__(self, other):
    return self.__dict__ == other.__dict__

  def __repr__(self):
    return repr(self.__dict__)


def parse_args(argv):
  """Parse and validate flags."""
  # NOTE: getopt short options are single characters, so the multi-letter
  # spellings -nh, -hf and -pf never match the checks below; use the long
  # options (--hashes, --mapfile, ...) where they exist.  "b:" is listed so
  # that -b works as documented in usage().
  try:
    opts, args = getopt.getopt(
        argv[1:], "i:o:p:q:f:c:b:nh:hf:m:pf:s:r:",
        ["input=", "output=", "cohorts=",
         "hashes=", "bloombits=", "oneprr",
         "mapfile=", "rseed="])
  except getopt.GetoptError:
    usage(argv[0])
    sys.exit(2)

  inst = RapporInstance()
  for opt, arg in opts:
    if opt in ("-i", "--input"):
      inst.infile = arg
    elif opt in ("-o", "--output"):
      inst.outfile = arg

    # Privacy params
    elif opt in ("-b", "--bloombits"):
      inst.params.num_bloombits = int(arg)
    elif opt in ("-nh", "--hashes"):
      inst.params.num_hashes = int(arg)
    elif opt in ("-c", "--cohorts"):
      inst.params.num_cohorts = int(arg)
    elif opt == "-p":
      inst.params.prob_p = float(arg)
    elif opt == "-q":
      inst.params.prob_q = float(arg)
    elif opt == "-f":
      inst.params.prob_f = float(arg)
    # Pseudo-param
    elif opt == "--oneprr":
      inst.params.flag_oneprr = True

    elif opt == "-r":
      VALID = ('simple', 'approx', 'fast')
      arg = arg.strip()
      if arg not in VALID:
        raise RuntimeError('random mode must be one of: %s' % ' '.join(VALID))
      inst.random_mode = arg
    elif opt == "-hf":
      inst.histfile = arg
    elif opt in ("-m", "--mapfile"):
      inst.mapfile = arg
    elif opt == "-pf":
      inst.paramsfile = arg
    elif opt in ("-s", "--rseed"):
      inst.randomness_seed = arg

  # Warn anyone that accidentally turns on the flag
  if inst.randomness_seed is not None:
    sys.stdout.write("""

WARNING! Randomness should be seeded with time or good entropy sources to
ensure freshness. -s/--rseed command line flag is for debugging purposes
only.

\n""")

  if not inst.infile:
    return inst, PARSE_ERROR

  prefix, _ = os.path.splitext(inst.infile)
  inst.outfile = inst.outfile or (prefix + "_out.csv")
  inst.histfile = inst.histfile or (prefix + "_hist.csv")
  inst.mapfile = inst.mapfile or (prefix + "_map.csv")
  inst.paramsfile = inst.paramsfile or (prefix + "_params.csv")

  return inst, PARSE_SUCCESS


def usage(script_name):
  sys.stdout.write("Usage: " + script_name + " --input/-i <input file name>")
  sys.stdout.write(" [-o|c|nh|p|q|f|b] [--oneprr]")

  sys.stdout.write("""

  -o or --output          Output file name
  -r simple/approx/fast   Random algorithm
  -c or --cohorts         Number of cohorts
  -nh or --hashes         Number of hashes
  -p                      Probability p
  -q                      Probability q
  -f                      Probability f
  -b or --bloombits       Size of bloom filter in bits
  -pf                     Parameters file
  -m or --mapfile         Bloom filter map file
  --oneprr                Include flag to set one PRR for each (user,word)

""")


PARAMS_HTML = """
<h3>RAPPOR Parameters</h3>
<table align="center">
  <tr>
    <td><b>k</b></td>
    <td>Size of Bloom filter in bits</td>
    <td align="right">{}</td>
  </tr>
  <tr>
    <td><b>h</b></td>
    <td>Hash functions in Bloom filter</td>
    <td align="right">{}</td>
  </tr>
  <tr>
    <td><b>m</b></td>
    <td>Number of Cohorts</td>
    <td align="right">{}</td>
  </tr>
  <tr>
    <td><b>p</b></td>
    <td>Probability p</td>
    <td align="right">{}</td>
  </tr>
  <tr>
    <td><b>q</b></td>
    <td>Probability q</td>
    <td align="right">{}</td>
  </tr>
  <tr>
    <td><b>f</b></td>
    <td>Probability f</td>
    <td align="right">{}</td>
  </tr>
</table>
"""


def print_params(params, csv_out, html_out):
  """Print RAPPOR parameters as a CSV file and as an HTML table."""
  row = (
      params.num_bloombits,
      params.num_hashes,
      params.num_cohorts,
      params.prob_p,
      params.prob_q,
      params.prob_f
  )
  print >>csv_out, "k,h,m,p,q,f\n"  # CSV header
  print >>csv_out, "%s,%s,%s,%s,%s,%s\n" % row

  # NOTE: No HTML escaping since we're writing numbers
  print >>html_out, PARAMS_HTML.format(*row)


def make_histogram(infile):
  """Make a histogram of the simulated input file."""
  # TODO: It would be better to share parsing with rappor_encode()
  words_counter = collections.Counter()
  for line in infile:
    _, words = line.strip().split(",")
    words_counter.update(words.split())
  return dict(words_counter.most_common())


def print_map(all_words, params, mapfile):
  """Print Bloom Filter map of values from infile."""
  # Print maps of distributions
  # Required by the R analysis tool
  k = params.num_bloombits
  for word in all_words:
    mapfile.write(word)
    for cohort in xrange(params.num_cohorts):
      for hash_no in xrange(params.num_hashes):
        bf_bit = rappor.get_bf_bit(word, cohort, hash_no, k) + 1
        mapfile.write("," + str(cohort * k + bf_bit))
    mapfile.write("\n")


def print_histogram(word_hist, histfile):
  """Write histogram of infile to histfile."""
  # Print histograms of distributions
  sorted_words = sorted(word_hist.iteritems(), key=lambda pair: pair[1],
                        reverse=True)
  fmt = "%s,%s"
  print >>histfile, fmt % ("string", "count")
  for pair in sorted_words:
    print >>histfile, fmt % pair


def rappor_encode(params, rand_funcs, infile):
  """Encode every word of every client; return the per-cohort sums."""
  # Initializing array to capture sums of rappors.
  rappor_sums = [[0] * (params.num_bloombits + 1)
                 for _ in xrange(params.num_cohorts)]

  start_time = time.time()
  for i, line in enumerate(infile):
    user_id, words = line.strip().split(",")

    if i % 1000 == 0:
      elapsed = time.time() - start_time
      log('Processed %d inputs in %.2f seconds', i, elapsed)

    # New encoder instance for each user.
    e = rappor.Encoder(params, user_id, rand_funcs=rand_funcs)
    for word in words.split():
      cohort, r = e.encode(word)
      # Sum rappors.  TODO: move this to separate tool.
      rappor.update_rappor_sums(rappor_sums, r, cohort, params)
  return rappor_sums


def main(argv):
  inst, ret_val = parse_args(argv)
  if ret_val == PARSE_ERROR:
    usage(argv[0])
    sys.exit(2)

  params = inst.params

  params_csv = inst.paramsfile
  base, _ = os.path.splitext(params_csv)
  params_html = base + '.html'

  # Print parameters to parameters file -- needed for the R analysis tool.
  with open(params_csv, 'w') as csv_out:
    with open(params_html, 'w') as html_out:
      print_params(params, csv_out, html_out)

  with open(inst.infile) as f:
    word_hist = make_histogram(f)

  # Print true histograms.
  with open(inst.histfile, 'w') as f:
    print_histogram(word_hist, f)

  # Print maps to map file -- needed for the R analysis tool.
  all_words = sorted(word_hist)  # unique words
  with open(inst.mapfile, 'w') as f:
    print_map(all_words, params, f)

  rand = random.Random()  # default Mersenne Twister randomness
  #rand = random.SystemRandom()  # cryptographic randomness from OS

  if inst.randomness_seed is not None:
    rand.seed(inst.randomness_seed)  # Seed with cmd line arg
    log('Seeded to %r', inst.randomness_seed)
  else:
    rand.seed()  # Default: seed with sys time

  if inst.random_mode == 'simple':
    rand_funcs = rappor.SimpleRandFuncs(params, rand)
  elif inst.random_mode == 'approx':
    rand_funcs = rappor.ApproxRandFuncs(params, rand)
  elif inst.random_mode == 'fast':
    if fastrand:
      log('Using fastrand extension')
      # NOTE: This doesn't take 'rand'
      rand_funcs = fastrand.FastRandFuncs(params)
    else:
      log('Warning: fastrand module not importable; see README for build '
          'instructions.  Falling back to simple randomness.')
      rand_funcs = rappor.SimpleRandFuncs(params, rand)
  else:
    raise AssertionError

  # Do RAPPOR transformation.
  with open(inst.infile) as f:
    rappor_sums = rappor_encode(params, rand_funcs, f)

  # Print sums of all rappor bits into output file
  with open(inst.outfile, 'w') as f:
    for row in xrange(params.num_cohorts):
      for col in xrange(params.num_bloombits):
        f.write(str(rappor_sums[row][col]) + ",")
      f.write(str(rappor_sums[row][params.num_bloombits]) + "\n")


if __name__ == "__main__":
  try:
    main(sys.argv)
  except RuntimeError as e:
    log('rappor_sim.py: FATAL: %s', e)
    sys.exit(1)
|
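The histogram and map-file steps above reduce to two small computations: counting words across the second CSV column, and turning each (cohort, hash) pair into a 1-based global bit index `cohort * k + bit + 1`. The following is a rough Python 3 sketch of that arithmetic only. `toy_bf_bit` is a hypothetical stand-in invented for this example; it is not the hash used by `rappor.get_bf_bit`, whose implementation is not shown here.

```python
import collections
import hashlib


def make_histogram(lines):
  """Count how often each word appears in '<id>,<space-separated words>' rows."""
  counter = collections.Counter()
  for line in lines:
    _, words = line.strip().split(',')
    counter.update(words.split())
  return counter


def toy_bf_bit(word, cohort, hash_no, num_bloombits):
  """Hypothetical stand-in for rappor.get_bf_bit; NOT the real hash."""
  data = ('%d:%d:%s' % (cohort, hash_no, word)).encode('utf-8')
  return int(hashlib.md5(data).hexdigest(), 16) % num_bloombits


def map_row(word, num_cohorts, num_hashes, num_bloombits, bf_bit=toy_bf_bit):
  """Return the 1-based global bit indices for one candidate word."""
  indices = []
  for cohort in range(num_cohorts):
    for hash_no in range(num_hashes):
      bit = bf_bit(word, cohort, hash_no, num_bloombits) + 1  # 1-based
      indices.append(cohort * num_bloombits + bit)
  return indices


if __name__ == '__main__':
  print(make_histogram(['1,s1 s2 s1', '2,s2 s1']))
  print(map_row('s1', num_cohorts=2, num_hashes=2, num_bloombits=16))
```

With `k` bits per cohort, cohort `c` owns indices `c * k + 1` through `(c + 1) * k`, so each row of the map file has `num_cohorts * num_hashes` indices after the word itself.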
|
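The short-option string passed to `getopt.getopt` in `parse_args` is read one character at a time, which is why spellings such as `-nh` do not behave as single flags. A minimal Python 3 sketch of that behaviour, using only the standard `getopt` module (the argument lists here are made up for illustration):

```python
import getopt

# The spec "nh:" declares -n (no argument) and -h (takes an argument).
# It does not declare a two-letter option "-nh".
short_opts, short_rest = getopt.getopt(['-nh', '2'], 'nh:')
print(short_opts)

# Long options are whole words, so "--hashes 2" and "--bloombits=16"
# each come back as a single (name, value) pair.
long_opts, long_rest = getopt.getopt(
    ['--hashes', '2', '--bloombits=16'], '', ['hashes=', 'bloombits='])
print(long_opts)
```

In this sketch the `-nh 2` input is split into `-n` and `-h 2`, so a check like `opt in ("-nh", "--hashes")` can only ever match through its long form.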
@ -0,0 +1,60 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
rappor_params_test.py: Tests for flag parsing in rappor_sim.py
"""

import unittest

import rappor_sim  # module under test


class RapporParamsTest(unittest.TestCase):
  def setUp(self):
    pass

  def tearDown(self):
    pass

  def testParseArgs(self):
    expected = rappor_sim.RapporInstance()
    p = expected.params
    p.num_bloombits = 16  # Number of bloom filter bits
    p.num_hashes = 2  # Number of bloom filter hashes
    p.num_cohorts = 64  # Number of cohorts
    p.prob_p = 0.40  # Probability p
    p.prob_q = 0.70  # Probability q
    p.prob_f = 0.30  # Probability f
    p.flag_oneprr = False  # One PRR for each user/word pair

    expected.infile = "test.txt"  # Input file name
    expected.outfile = "test_out.csv"  # Output file name
    expected.histfile = "test_hist.csv"  # Output histogram file
    expected.mapfile = "test_map.csv"  # Output BF map file
    expected.paramsfile = "test_params.csv"  # Output params file

    arg_string = ("script --cohorts 64 --hashes 2 --bloombits 16 -p 0.4"
                  " -q 0.7 -f 0.3 -i test.txt")
    arg = arg_string.strip().split()
    result, error = rappor_sim.parse_args(arg)

    self.assertEqual(expected, result)
    self.assertEqual(error, rappor_sim.PARSE_SUCCESS)


if __name__ == "__main__":
  unittest.main()
|
|
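The test above compares two instances directly, which relies on the class defining `__eq__` over `__dict__`. A small Python 3 sketch of that pattern, where `Instance` is a hypothetical stand-in class and not `rappor_sim.RapporInstance` itself:

```python
class Instance(object):
  """Hypothetical stand-in with attribute-wise equality, as used for testing."""

  def __init__(self):
    self.infile = ''
    self.outfile = ''

  def __eq__(self, other):
    # Equal when every attribute has an equal value.
    return self.__dict__ == other.__dict__

  def __repr__(self):
    # Makes assertion failure messages show the differing attributes.
    return repr(self.__dict__)


a = Instance()
b = Instance()
same_before = (a == b)
b.infile = 'test.txt'
same_after = (a == b)
print(same_before, same_after)
```

Note that nested attributes are compared with their own `==`, so an attribute holding an object without value-based equality would only compare equal to itself.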
@ -0,0 +1,23 @@
<!DOCTYPE html>
<html>
  <head>
    <title>RAPPOR Demo</title>
  </head>

  <body style="text-align: center">
    <h2>RAPPOR Demo</h2>

    <!-- These strings will be replaced by a sed script. -->

    <!-- SIM_PARAMS -->

    <!-- RAPPOR_PARAMS -->

    <hr/>

    <img src="exp_report/dist.png" alt="exponential distribution" />
    <img src="gauss_report/dist.png" alt="gauss distribution" />
    <img src="unif_report/dist.png" alt="uniform distribution" />
  </body>

</html>
|
|
@ -0,0 +1,102 @@
#!/bin/bash
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Test automation script.
#
# Usage:
#   run.sh <function name>
#
# Examples:
#   $ tests/run.sh py-unit  # run Python unit tests
#   $ tests/run.sh all      # all tests

set -o nounset
set -o pipefail
set -o errexit

# Quoted so that a checkout path containing spaces still resolves.
readonly THIS_DIR="$(dirname "$0")"
readonly REPO_ROOT="$THIS_DIR/.."
readonly CLIENT_DIR="$REPO_ROOT/client/python"

#
# Utility functions
#

die() {
  echo 1>&2 "$0: $@"
  exit 1
}

#
# Fully Automated Tests
#

# Python unit tests.
#
# TODO: Separate out deterministic tests from statistical tests (which may
# rarely fail)
py-unit() {
  export PYTHONPATH=$CLIENT_DIR  # to find client library

  set +o errexit
  # -e: exit at first failure
  find $REPO_ROOT -name \*_test.py | sh -x -e
  local exit_code=$?
  if test $exit_code -eq 0; then
    echo 'ALL PASSED'
  else
    echo 'FAIL'
    exit 1
  fi
  set -o errexit
}

# All tests
all() {
  py-unit
  py-lint

  # TODO: Add R tests, end to end demo
}

#
# Lint
#

python-lint() {
  # E111: indent not a multiple of 4.  We are following the Google/Chrome style
  # and using 2 space indents.
  if pep8 --ignore=E111 "$@"; then
    echo
    echo 'LINT PASSED'
  else
    echo
    echo 'LINT FAILED'
    exit 1
  fi
}

py-lint() {
  which pep8 || die "pep8 not installed ('sudo apt-get install pep8' on Ubuntu)"

  # Excluding setup.py, because it's a config file and uses "invalid" 'name =
  # 1' style (spaces around =).
  find $REPO_ROOT -name \*.py \
    | grep -v /setup.py \
    | xargs --verbose -- $0 python-lint
}

"$@"