This commit is contained in:
Andy Chu 2014-10-17 16:17:57 -07:00
Parent 236f930036
Commit 761aa0bcd8
24 changed files: 3259 additions and 0 deletions

5
.gitignore vendored Normal file

@@ -0,0 +1,5 @@
*.pyc
*.swp
_tmp
client/python/build
client/python/_fastrand.so

202
LICENSE Normal file

@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

136
README.md

@@ -0,0 +1,136 @@
RAPPOR
======
RAPPOR is a novel privacy technology that allows inferring statistics of
populations while preserving the privacy of individual users.
This repository currently contains simulation and analysis code in Python and
R.
For a detailed description of the algorithm, see the
[paper](http://arxiv.org/abs/1407.6981) and links below.
<!-- TODO: We should have a more user-friendly, non-mathematical explanation.
-->
Running the Demo
----------------
Although the Python and R libraries should be portable to any platform, our
end-to-end demo has only been tested on Linux.
If you don't have a Linux box handy, you can [view the generated
output](report.html).
To get your feet wet, install the R dependencies (details below), which should
look something like this:
$ R
...
> install.packages(c('glmnet', 'optparse', 'ggplot2'))
Then run:
$ ./demo.sh build # optional speedup, it's OK for now if it fails
$ ./demo.sh run
The `build` action compiles and tests the optional `fastrand` C extension
module for Python, which speeds up the simulation.
The `run` action strings together the Python and R code. It:
1. Generates simulated input data with different distributions
2. Runs it through the RAPPOR privacy algorithm
3. Analyzes and plots the obfuscated reports against the true input
The output is written to `_tmp/report.html`, and can be opened with a browser.
<!-- TODO: Link to Github pages version of report.html. -->
Dependencies
------------
[R](http://r-project.org) analysis (`analysis/R`):
- [glmnet](http://cran.r-project.org/web/packages/glmnet/index.html)
Demo dependencies (`demo.sh`):
These are necessary if you want to test changes to the code.
- R libraries
- [ggplot2](http://cran.r-project.org/web/packages/ggplot2/index.html)
- [optparse](http://cran.r-project.org/web/packages/optparse/index.html)
- bash shell / coreutils: to run tests
Python client (`client/python`):
- None. You should be able to just import the `rappor.py` file.
Platform:
- R: tested on R 3.0.
- Python: tested on Python 2.7.
- OS: the shell scripts have been tested on Linux, but may work on
  Mac/Cygwin. The R and Python code should work on any OS.
API
---
`rappor.py` is a tiny standalone Python file, and you can easily copy it into a
Python program.
NOTE: Its interface is subject to change. We are in the demo stage now, but if
there's demand, we will document and publish the interface.
The R interface is also subject to change.
<!-- TODO: Add links to interface docs when available. -->
The `fastrand` C module is optional. It's likely only useful for simulating
thousands of clients. It doesn't use cryptographically strong randomness, and
thus should **not** be used in production.
Directory Structure
-------------------
client/ # client libraries
python/
rappor.py
rappor_test.py # unit tests next to files
cpp/ # placeholder
analysis/
R/
# R code for analysis.
tests/ # for system tests. Unit tests should go next to the
# source file.
gen_sim_input.py # generate test input data
rappor_sim.py # run simulation
run.sh # driver for unit tests, lint, statistical tests,
# end to end demo with Python/R
doc/
build.sh # build docs or C code
demo.sh # run demo
<!--
TODO: add apps?
apps/
# Shiny apps for demo. Depends on the analysis code.
-->
Links
-----
<!-- TODO: link back to blog post -->
- [Tutorial](doc/tutorial.html) - More details about the tools here.
- [RAPPOR paper](http://arxiv.org/abs/1407.6981)
- [RAPPOR implementation in Chrome](http://www.chromium.org/developers/design-documents/rappor)
- This is a production quality C++ implementation, but it's somewhat tied to
Chrome, and doesn't support all privacy parameters (e.g. only a few values
of p and q). On the other hand, the code in this repo is not yet
production quality, but supports experimentation with different parameters
and data sets. Of course, anyone is free to implement RAPPOR independently
as well.

85
analysis/R/analysis_lib.R Normal file

@@ -0,0 +1,85 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
GetFN <- function(name) {
# Helper function to strip extension from the filename.
strsplit(basename(name), ".", fixed = TRUE)[[1]][1]
}
ValidateInput <- function(params, counts, map) {
val <- "valid"
if (is.null(counts)) {
val <- "No counts file found. Skipping"
return(val)
}
if (nrow(map) != (params$m * params$k)) {
val <- paste("Map does not match the counts file!",
"mk = ", params$m * params$k,
"nrow(map):", nrow(map),
collapse = " ")
}
if ((ncol(counts) - 1) != params$k) {
val <- paste("Dimensions of counts file do not match:",
"m =", params$m, "counts rows: ", nrow(counts),
"k =", params$k, "counts cols: ", ncol(counts) - 1,
collapse = " ")
}
val
}
AnalyzeRAPPOR <- function(params, counts, map, correction, alpha, cv_step,
experiment_name = "", map_name = "", config_name = "",
date = NULL, date_num = NULL, ...) {
val <- ValidateInput(params, counts, map)
if (val != "valid") {
cat(val, "\n")
return(NULL)
}
cat("Sample Size: ", sum(counts[, 1]), "\n",
"Number of cohorts: ", nrow(counts), "\n", sep = "")
fit <- Decode(counts, map, params, correction = correction,
alpha = alpha, cv_step = cv_step, ...)
if (nrow(fit$fit) > 0) {
res <- fit$fit
res$rank <- 1:nrow(fit$fit)
res$detected <- fit$summary[2, 2]
res$sample_size <- fit$summary[3, 2]
res$detected_prop <- fit$summary[4, 2]
res$explained_var <- fit$summary[5, 2]
res$missing_var <- fit$summary[6, 2]
res$exp_e_1 <- fit$privacy[3, 2]
res$exp_e_inf <- fit$privacy[5, 2]
res$detection_freq <- fit$privacy[7, 2]
res$correction <- correction
res$alpha <- alpha
res$experiment <- experiment_name
res$map <- map_name
res$config <- config_name
res$date <- date
res$date_num <- date_num
} else {
return(NULL)
}
res
}

290
analysis/R/decode.R Normal file

@@ -0,0 +1,290 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This library implements RAPPOR, an anonymous data collection mechanism.
library(glmnet)
EstimateBloomCounts <- function(params, obs_counts) {
# Estimates the number of times each bit in each cohort was set in original
# Bloom filters.
#
# Input:
# params: a list of RAPPOR parameters:
# k - size of a Bloom filter
# h - number of hash functions
# m - number of cohorts
# p - P(IRR = 1 | PRR = 0)
# q - P(IRR = 1 | PRR = 1)
# f - Proportion of bits in the Bloom filter that are set randomly
# to 0 or 1 regardless of the underlying true bit value
# obs_counts: a matrix of size m by (k + 1). Column one contains sample
# sizes for each cohort. The remaining columns indicate how many times
# each bit was set in each cohort.
#
# Output:
# ests: a matrix of size m by k with estimated counts for the number of
# times each bit was set in the true Bloom filter.
p <- params$p
q <- params$q
f <- params$f
# N = x[1] is the sample size for cohort i.
ests <- t(apply(obs_counts, 1, function(x) {
(x[-1] - (p + .5 * f * q - .5 * f * p) * x[1]) / ((1 - f) * (q - p))
}))
ests
}
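The estimator above inverts the randomized response: for each cohort, the observed count of each bit is debiased by the expected noise floor and rescaled by the attenuation of the true signal. A minimal Python sketch of the same formula (the function and variable names are ours, not an API in this repo):

```python
def estimate_bloom_counts(obs_counts, p, q, f):
    """Debias observed per-bit counts, mirroring the R estimator.

    obs_counts: one row per cohort, [N, c_1, ..., c_k], where N is the
    cohort's sample size and c_j the observed count of set bit j.
    """
    bias = p + 0.5 * f * q - 0.5 * f * p  # expected rate of spurious ones
    scale = (1 - f) * (q - p)             # attenuation of the true signal
    ests = []
    for row in obs_counts:
        n, counts = row[0], row[1:]       # n = cohort sample size
        ests.append([(c - bias * n) / scale for c in counts])
    return ests
```

With no noise (p = 0, q = 1, f = 0) the estimates reduce to the raw observed counts.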
FitLasso <- function(X, Y, intercept = TRUE, cv_step = 1, max_lambda = 100) {
# Fits a Lasso model to select a subset of columns of X.
#
# Input:
# X: a design matrix of size km by M (the number of candidate strings).
# Y: a vector of size km with estimated counts from EstimateBloomCounts().
#
# Output:
# lasso: a cross-validated Lasso object.
# non_zero: indices of non-zero coefficients for optimal selection of
# lambda.
zero_coefs <- rep(0, ncol(X))
names(zero_coefs) <- colnames(X)
lambdas <- seq(0, max_lambda, cv_step)
mod <- try(cv.glmnet(X, Y, standardize = FALSE, intercept = intercept,
lambda = lambdas,
type.measure = "mae", nfolds = 10), silent = TRUE)
# If fitting fails, return a NULL fit with zero coefficients.
if (inherits(mod, "try-error")) {
return(list(fit = NULL, coefs = zero_coefs))
}
# More refined lambda's based on the first coarse run.
if ((as.numeric(ncol(X)) * as.numeric(nrow(X))) < 10^7) {
min_lambda <- mod$lambda.min
if (min_lambda == max(lambdas)) {
lambdas <- seq(301, 500, cv_step)
} else if (min_lambda == min(lambdas)) {
lambdas <- seq(0, 1, .01)
} else {
lambdas <- c(seq(0, max(0, min_lambda - 2), cv_step),
seq(max(0, min_lambda - 2), max(min_lambda + 2, 0), .01),
seq(max(0, min_lambda + 2), 500, cv_step))
lambdas <- sort(unique(lambdas[lambdas > 0]))
}
mod <- try(cv.glmnet(X, Y, standardize = FALSE, intercept = intercept,
lambda = lambdas,
type.measure = "mae", nfolds = 10), silent = TRUE)
# If fitting fails, return a NULL fit with zero coefficients.
if (inherits(mod, "try-error")) {
return(list(fit = NULL, coefs = zero_coefs))
}
}
# Select the best model based on cross-validation.
coefs <- coef(mod, s = mod$lambda.min)
resid <- Y - predict(mod, X, s = mod$lambda.min, type = "response")
list(fit = mod, coefs = coefs[-1, ], intercept = coefs[1, 1], resid = resid)
}
CustomLM <- function(X, Y) {
if (class(X) == "ngCMatrix") {
X <- as.data.frame(apply(as.matrix(X), 2, as.numeric))
}
mod <- lm(Y ~ ., data = X)
resid <- Y - predict(mod, X)
list(fit = mod, coefs = coef(mod)[-1], intercept = coef(mod)[1],
resid = resid)
}
PerformInference <- function(X, Y, N, mod, params, alpha, correction) {
m <- params$m
p <- params$p
q <- params$q
f <- params$f
h <- params$h
q2 <- .5 * f * (p + q) + (1 - f) * q
p2 <- .5 * f * (p + q) + (1 - f) * p
resid_var <- p2 * (1 - p2) * (N / m) / (q2 - p2)^2
# Total Sum of Squares (SS).
TSS <- sum((Y - mean(Y))^2)
# Error Sum of Squares (ESS).
ESS <- resid_var * nrow(X)
betas <- matrix(mod$coefs, ncol = 1)
mod_var <- summary(mod$fit)$sigma^2
betas_sd <- rep(sqrt(max(resid_var, mod_var) / (m * h)), length(betas))
z_values <- betas / betas_sd
# 1-sided t-test.
p_values <- pnorm(z_values, lower.tail = FALSE)
fit <- data.frame(String = colnames(X), Estimate = betas,
SD = betas_sd, z_stat = z_values, pvalue = p_values,
stringsAsFactors = FALSE)
if (correction == "FDR") {
fit <- fit[order(fit$pvalue, decreasing = FALSE), ]
ind <- which(fit$pvalue < (1:nrow(fit)) * alpha / nrow(fit))
if (length(ind) > 0) {
fit <- fit[1:max(ind), ]
} else {
fit <- fit[numeric(0), ]
}
} else {
fit <- fit[fit$pvalue < alpha, ]
}
fit <- fit[order(fit$Estimate, decreasing = TRUE), ]
if (nrow(fit) > 0) {
str_names <- fit$String
if (length(str_names) > 0 && length(str_names) < nrow(X)) {
this_data <- as.data.frame(as.matrix(X[, str_names]))
Y_hat <- predict(lm(Y ~ ., data = this_data))
RSS <- sum((Y_hat - mean(Y))^2)
} else {
RSS <- NA
}
} else {
RSS <- 0
}
USS <- TSS - ESS - RSS
SS <- c(RSS, USS, ESS) / TSS
list(fit = fit, SS = SS, resid_sigma = sqrt(resid_var))
}
ComputePrivacyGuarantees <- function(params, alpha, N) {
# Compute privacy parameters and guarantees.
p <- params$p
q <- params$q
f <- params$f
h <- params$h
q2 <- .5 * f * (p + q) + (1 - f) * q
p2 <- .5 * f * (p + q) + (1 - f) * p
exp_e_one <- ((q2 * (1 - p2)) / (p2 * (1 - q2)))^h
if (exp_e_one < 1) {
exp_e_one <- 1 / exp_e_one
}
e_one <- log(exp_e_one)
exp_e_inf <- ((1 - .5 * f) / (.5 * f))^(2 * h)
e_inf <- log(exp_e_inf)
std_dev_counts <- sqrt(p2 * (1 - p2) * N) / (q2 - p2)
detection_freq <- qnorm(1 - alpha) * std_dev_counts / N
privacy_names <- c("Effective p", "Effective q", "exp(e_1)",
"e_1", "exp(e_inf)", "e_inf", "Detection frequency")
privacy_vals <- c(p2, q2, exp_e_one, e_one, exp_e_inf, e_inf, detection_freq)
privacy <- data.frame(parameters = privacy_names,
values = privacy_vals)
privacy
}
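The effective bit-flip probabilities p*, q* and the resulting privacy bounds can be computed directly from the formulas above. A Python sketch of the same arithmetic (names are ours, not part of this repo):

```python
import math

def privacy_guarantees(p, q, f, h):
    """Effective flip probabilities and epsilon bounds, per the R formulas."""
    q2 = 0.5 * f * (p + q) + (1 - f) * q   # effective q (true bit is 1)
    p2 = 0.5 * f * (p + q) + (1 - f) * p   # effective p (true bit is 0)
    exp_e_one = ((q2 * (1 - p2)) / (p2 * (1 - q2))) ** h
    if exp_e_one < 1:
        exp_e_one = 1 / exp_e_one
    e_one = math.log(exp_e_one)            # epsilon for a single report
    e_inf = 2 * h * math.log((1 - 0.5 * f) / (0.5 * f))  # lifetime epsilon
    return {"p2": p2, "q2": q2, "e_one": e_one, "e_inf": e_inf}
```

For the example parameters in read_input.R (p = 0.5, q = 0.75, f = 0.75, h = 2), the lifetime bound is e_inf = 4·ln(5/3) ≈ 2.04.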
Decode <- function(counts, map, params, alpha = 0.05,
correction = c("Bonferroni"), ...) {
k <- params$k
p <- params$p
q <- params$q
f <- params$f
h <- params$h
m <- params$m
strs <- colnames(map)
ests <- EstimateBloomCounts(params, counts)
N <- sum(counts[, 1])
Y <- as.vector(t(ests))
if (ncol(map) > (k * m * .8) ||
(as.numeric(ncol(map)) * as.numeric(nrow(map))) > 10^6) {
mod_lasso <- FitLasso(map, Y, ...)
lasso <- mod_lasso$fit
# Select non-zero coefficients.
coefs <- sort(mod_lasso$coef, decreasing = TRUE)
non_zero <- sum(coefs > 0)
if (non_zero > 0) {
coefs <- names(coefs[1:min(non_zero, k * m * .9)])
} else {
coefs <- names(coefs[1:2])
}
ind <- match(coefs, names(mod_lasso$coefs))
# Fit regular linear model to obtain unbiased estimates.
X <- as.data.frame(apply(as.matrix(map[, coefs]), 2, as.numeric))
mod <- CustomLM(X, Y)
# Return complete vector of coefficients with 0's.
coefs <- rep(0, length(mod_lasso$coefs))
names(coefs) <- names(mod_lasso$coefs)
coefs[ind] <- mod$coef
mod$coefs <- coefs
} else {
mod <- CustomLM(as.data.frame(as.matrix(map)), Y)
lasso <- NULL
}
if (correction == "Bonferroni") {
alpha <- alpha / length(strs)
}
inf <- PerformInference(map, Y, N, mod, params, alpha, correction)
fit <- inf$fit
resid <- mod$resid / inf$resid_sigma
# Estimates from the model are per instance so must be multiplied by h.
# Standard errors are also adjusted.
fit$Total_Est <- floor(fit$Estimate * m)
fit$Total_SD <- floor(fit$SD * m)
fit$Prop <- fit$Total_Est / N
fit$LPB <- fit$Prop - 1.96 * fit$Total_SD / N
fit$UPB <- fit$Prop + 1.96 * fit$Total_SD / N
fit <- fit[, c("String", "Total_Est", "Total_SD", "Prop", "LPB", "UPB")]
colnames(fit) <- c("strings", "estimate", "std_dev", "proportion",
"lower_bound", "upper_bound")
# Compute summary of the fit.
parameters <-
c("Candidate strings", "Detected strings",
"Sample size (N)", "Discovered Prop (out of N)",
"Explained Variance", "Missing Variance", "Noise Variance",
"Theoretical Noise Std. Dev.")
values <- c(length(strs), nrow(fit), N, round(sum(fit[, 2]) / N, 3),
round(inf$SS, 3),
round(inf$resid_sigma, 3))
res_summary <- data.frame(parameters = parameters, values = values)
privacy <- ComputePrivacyGuarantees(params, alpha, N)
params <- data.frame(parameters =
c("k", "h", "m", "p", "q", "f", "N", "alpha"),
values = c(k, h, m, p, q, f, N, alpha))
list(fit = fit, summary = res_summary, privacy = privacy, params = params,
lasso = lasso, ests = ests, counts = counts[, -1], resid = resid)
}

128
analysis/R/encode.R Normal file

@@ -0,0 +1,128 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Encode <- function(value, map, strs, params, N, id = NULL,
cohort = NULL, B = NULL, BP = NULL) {
# Encode value to RAPPOR and return a report.
#
# Input:
# value: value to be encoded
# map: a mapping matrix describing where each element of strs map in
# each cohort
# strs: a vector of possible values with value being one of them
# params: a list of RAPPOR parameters described in decode.R
# N: sample size
# Optional parameters:
# id: user ID (smaller than N)
# cohort: specifies cohort number (smaller than m)
# B: input Bloom filter itself, in which case value is ignored
# BP: input Permanent Randomized Response (memoized for multiple collections
# from the same user)
k <- params$k
p <- params$p
q <- params$q
f <- params$f
h <- params$h
m <- params$m
if (is.null(cohort)) {
cohort <- sample(1:m, 1)
}
if (is.null(id)) {
id <- sample(N, 1)
}
ind <- which(value == strs)
if (is.null(B)) {
B <- as.numeric(map[[cohort]][, ind])
}
if (is.null(BP)) {
BP <- sapply(B, function(x) sample(c(0, 1, x), 1,
prob = c(0.5 * f, 0.5 * f, 1 - f)))
}
rappor <- sapply(BP, function(x) rbinom(1, 1, ifelse(x == 1, q, p)))
list(value = value, rappor = rappor, B = B, BP = BP, cohort = cohort, id = id)
}
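A report bit passes through two stages: a permanent randomized response B' (flip to 1 or 0 with probability f/2 each, memoized per user) and an instantaneous randomized response drawn fresh per report. A minimal Python sketch of one bit (names are ours):

```python
import random

def encode_bit(b, p, q, f, rng=random):
    """Apply RAPPOR's two randomization stages to one Bloom-filter bit."""
    # Permanent randomized response: 1 w.p. f/2, 0 w.p. f/2, else keep b.
    r = rng.random()
    if r < 0.5 * f:
        prr = 1
    elif r < f:
        prr = 0
    else:
        prr = b
    # Instantaneous randomized response: report 1 w.p. q if prr is 1, else p.
    irr = 1 if rng.random() < (q if prr == 1 else p) else 0
    return prr, irr
```

With f = 0, p = 0, q = 1 there is no noise and the report equals the input bit.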
ExamplePlot <- function(res, k, ebs = 1, title = "", title_cex = 4,
voff = .17, acex = 1.5, posa = 2, ymin = 1,
horiz = FALSE) {
PC <- function(k, report) {
char <- as.character(report)
if (k > 128) {
char[char != ""] <- "|"
}
char
}
# Annotation settings
anc <- "darkorange2"
colors <- c("lavenderblush3", "maroon4")
par(omi = c(0, .55, 0, 0))
# Setup plotting.
plot(1:k, rep(1, k), ylim = c(ymin, 4), type = "n",
xlab = "Bloom filter bits",
yaxt = "n", ylab = "", xlim = c(0, k), bty = "n", xaxt = "n")
mtext(paste0("Participant ", res$id, " in cohort ", res$cohort), 3, 2,
adj = 1, col = anc, cex = acex)
axis(1, 2^(0:15), 2^(0:15))
abline(v = which(res$B == 1), lty = 2, col = "grey")
# First row with the true value.
text(k / 2, 4, paste0('"', paste0(title, as.character(res$value)), '"'),
cex = title_cex, col = colors[2], xpd = NA)
# Second row with BF: B.
points(1:k, rep(3, k), pch = PC(k, res$B), col = colors[res$B + 1],
cex = res$B + 1)
text(k, 3 + voff, paste0(sum(res$B), " signal bits"), cex = acex,
col = anc, pos = posa)
# Third row: B'.
points(1:k, rep(2, k), pch = PC(k, res$BP), col = colors[res$BP + 1],
cex = res$BP + 1)
text(k, 2 + voff, paste0(sum(res$BP), " bits on"),
cex = acex, col = anc, pos = posa)
# Row 4: actual RAPPOR report.
report <- res$rappor
points(1:k, rep(1, k), pch = PC(k, as.character(report)),
col = colors[report + 1], cex = report + 1)
text(k, 1 + voff, paste0(sum(res$rappor), " bits on"), cex = acex,
col = anc, pos = posa)
mtext(c("True value:", "Bloom filter (B):",
"Fake Bloom \n filter (B'):", "Report sent\n to server:"),
2, 1, at = 4:1, las = 2)
legend("topright", legend = c("0", "1"), fill = colors, bty = "n",
cex = 1.5, horiz = horiz)
legend("topleft", legend = ebs, plot = FALSE)
}
PlotPopulation <- function(probs, detected, detection_frequency) {
cc <- c("gray80", "darkred")
color <- rep(cc[1], length(probs))
color[detected] <- cc[2]
bp <- barplot(probs, col = color, border = color)
inds <- c(1, c(max(which(probs > 0)), length(probs)))
axis(1, bp[inds], inds)
legend("topright", legend = c("Detected", "Not-detected"),
fill = rev(cc), bty = "n")
abline(h = detection_frequency, lty = 2, col = "grey")
}

122
analysis/R/read_input.R Normal file

@@ -0,0 +1,122 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Read parameter, counts and map files.
gfile <- function(str) { str } # NOTE: gfile will be identity function in open source version
library(Matrix)
ReadParameterFile <- function(params_file) {
# Read parameter file. Format:
# k, h, m, p, q, f
# 128, 2, 8, 0.5, 0.75, 0.75
params <- as.list(read.csv(gfile(params_file)))
if (length(params) != 6) {
stop("There should be exactly 6 columns in the parameter file.")
}
if (any(names(params) != c("k", "h", "m", "p", "q", "f"))) {
stop("Parameter names must be k,h,m,p,q,f.")
}
params
}
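The two-row CSV format expected above (a header row `k,h,m,p,q,f`, then one row of values) is easy to parse from any language; here is a hedged Python equivalent (our helper, not part of this repo):

```python
import csv
import io

PARAM_NAMES = ["k", "h", "m", "p", "q", "f"]

def read_params(text):
    """Parse a RAPPOR parameter file: a header row then one value row."""
    rows = list(csv.reader(io.StringIO(text)))
    names = [n.strip() for n in rows[0]]
    if names != PARAM_NAMES:
        raise ValueError("Parameter names must be k,h,m,p,q,f")
    return dict(zip(names, [float(v) for v in rows[1]]))
```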
ReadCountsFile <- function(counts_file, params = NULL) {
# Read in the counts file.
if (!file.exists(counts_file)) {
return(NULL)
}
counts <- read.csv(gfile(counts_file), header = FALSE)
if (!is.null(params)) {
if (nrow(counts) != params$m) {
stop("Counts file: number of rows should equal number of cohorts (m).")
}
if ((ncol(counts) - 1) != params$k) {
stop(paste0("Counts file: number of columns should equal k + 1: ",
ncol(counts)))
}
}
if (any(counts < 0)) {
stop("Counts file: all counts must be non-negative.")
}
counts
}
ReadMapFile <- function(map_file, params = NULL, quote = "") {
# Read in the map file which is in the following format (two hash functions):
# str1, h11, h12, h21 + k, h22 + k, h31 + 2k, h32 + 2k ...
# str2, ...
# Output:
# map: a sparse representation of set bits for each candidate string.
# strs: a vector of all candidate strings.
map_pos <- read.csv(gfile(map_file), header = FALSE, as.is = TRUE,
quote = quote)
strs <- map_pos[, 1]
strs[strs == ""] <- "Empty"
# Remove duplicated strings.
ind <- which(!duplicated(strs))
strs <- strs[ind]
map_pos <- map_pos[ind, ]
if (!is.null(params)) {
n <- ncol(map_pos) - 1
if (n != (params$h * params$m)) {
stop(paste0("Map file: number of columns should equal hm + 1:",
n, "_", params$h * params$m))
}
}
row_pos <- unlist(map_pos[, -1])
col_pos <- rep(1:nrow(map_pos), times = ncol(map_pos) - 1)
removed <- which(is.na(row_pos))
if (length(removed) > 0) {
row_pos <- row_pos[-removed]
col_pos <- col_pos[-removed]
}
if (!is.null(params)) {
map <- sparseMatrix(row_pos, col_pos,
dims = c(params$m * params$k, length(strs)))
} else {
map <- sparseMatrix(row_pos, col_pos)
}
colnames(map) <- strs
list(map = map, strs = strs, map_pos = map_pos)
}
LoadMapFile <- function(map_file, params = NULL, quote = "") {
# Reads the map file and creates an R binary .rda.
# If .rda file already exists, just loads that file.
rda_file <- sub(".csv", ".rda", map_file, fixed = TRUE)
# file.info() is not implemented yet by the gfile package. One must delete
# the .rda file manually when the .csv file is updated.
# csv_updated <- file.info(map_file)$mtime > file.info(rda_file)$mtime
if (!file.exists(rda_file)) {
cat("Parsing", map_file, "...\n")
map <- ReadMapFile(map_file, params = params, quote = quote)
save(map, file = file.path(tempdir(), basename(rda_file)))
file.copy(file.path(tempdir(), basename(rda_file)), rda_file,
overwrite = TRUE)
}
load(gfile(rda_file), .GlobalEnv)
}
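The caching pattern in LoadMapFile (parse the CSV once, then reuse a binary snapshot until it is deleted by hand) can be sketched in Python. All names here are illustrative, not part of the repository:

```python
import os
import pickle
import tempfile

def load_cached(csv_path, parse_fn):
    """Parse csv_path with parse_fn once, caching the result in a binary
    file next to it; later calls load the cache. As with LoadMapFile,
    the cache must be deleted by hand when the CSV changes (no mtime check)."""
    base, _ = os.path.splitext(csv_path)
    cache_path = base + '.pkl'
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    obj = parse_fn(csv_path)
    with open(cache_path, 'wb') as f:
        pickle.dump(obj, f)
    return obj

tmp_dir = tempfile.mkdtemp()
csv_file = os.path.join(tmp_dir, 'map.csv')
with open(csv_file, 'w') as f:
    f.write('s1,1,2\n')

calls = []
def parse(path):
    calls.append(path)  # record each actual parse
    with open(path) as f:
        return f.read().strip().split(',')

first = load_cached(csv_file, parse)
second = load_cached(csv_file, parse)  # served from the .pkl cache
```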

219
analysis/R/simulation.R Normal file

@ -0,0 +1,219 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# RAPPOR simulation library.
library(glmnet)
SetOfStrings <- function(num_strings = 100) {
# Generates a set of strings for simulation purposes.
strs <- paste0("V_", as.character(1:num_strings))
strs
}
GetSampleProbs <- function(params) {
# Generate different underlying distributions for simulations purposes.
# Args:
# - params: a list describing the shape of the true distribution:
  #           c(num_strings, prop_nonzero_strings, decay_type,
  #             rate_exponential, background).
nstrs <- params[[1]]
nonzero <- params[[2]]
decay <- params[[3]]
expo <- params[[4]]
background <- params[[5]]
probs <- rep(0, nstrs)
ind <- floor(nstrs * nonzero)
if (decay == "Linear") {
probs[1:ind] <- (ind:1) / sum(1:ind)
} else if (decay == "Constant") {
probs[1:ind] <- 1 / ind
} else if (decay == "Exponential") {
temp <- seq(0, nonzero, length.out = ind)
temp <- exp(-temp * expo)
temp <- temp + background
temp <- temp / sum(temp)
probs[1:ind] <- temp
} else {
    stop('params[[3]] (decay) must be in c("Linear", "Exponential", "Constant")')
}
probs
}
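The three decay shapes above can be sketched outside of R. The helper below is hypothetical (names and default rates are illustrative) but follows the same construction: only the first floor(num_strings * prop_nonzero) candidates get nonzero mass, and the vector always sums to 1:

```python
import math

def sample_probs(num_strings, prop_nonzero, decay, expo=10.0, background=0.05):
    """Build a probability vector over num_strings candidate strings with a
    Linear, Constant, or Exponential decay over the nonzero prefix."""
    probs = [0.0] * num_strings
    ind = int(num_strings * prop_nonzero)
    if decay == "Linear":
        total = ind * (ind + 1) / 2.0  # sum of 1..ind
        for i in range(ind):
            probs[i] = (ind - i) / total
    elif decay == "Constant":
        for i in range(ind):
            probs[i] = 1.0 / ind
    elif decay == "Exponential":
        temp = [math.exp(-expo * prop_nonzero * i / (ind - 1)) + background
                for i in range(ind)]
        s = sum(temp)
        for i in range(ind):
            probs[i] = temp[i] / s
    else:
        raise ValueError("decay must be Linear, Constant or Exponential")
    return probs

linear = sample_probs(100, 0.2, "Linear")
expon = sample_probs(100, 0.2, "Exponential")
```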
CreateMap <- function(strs, params, generate_pos = TRUE) {
# Creates a list of 0/1 matrices corresponding to mapping between the strs and
# Bloom filters for each instance of the RAPPOR.
# Ex. for 3 strings, 2 instances, 1 hash function and Bloom filter of size 4,
  # the result could look like this:
# [[1]]
# 1 0 0 0
# 0 1 0 0
# 0 0 0 1
# [[2]]
# 0 1 0 0
# 0 0 0 1
# 0 0 1 0
#
# Args:
# - strs: a vector of strings
# - params: a list of parameters in the following format:
# (k, h, m, p, q, f).
M <- length(strs)
map <- list()
k <- params$k
h <- params$h
m <- params$m
for (i in 1:m) {
ones <- sample(1:k, M * h, replace = TRUE)
cols <- rep(1:M, each = h)
map[[i]] <- sparseMatrix(ones, cols, dims = c(k, M))
colnames(map[[i]]) <- strs
}
rmap <- do.call("rBind", map)
if (generate_pos) {
map_pos <- t(apply(rmap, 2, function(x) {
ind <- which(x == 1)
n <- length(ind)
if (n < h * m) {
ind <- c(ind, rep(NA, h * m - n))
}
ind
}))
} else {
map_pos <- NULL
}
list(map = map, rmap = rmap, map_pos = map_pos)
}
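The per-cohort mapping built above can be sketched in Python as a dictionary per cohort: each candidate string hashes to h random positions in a k-bit Bloom filter. Note that, like R's `sample(1:k, M * h, replace = TRUE)`, the same bit can be drawn twice for one string. This is a hypothetical illustration, not the repository's client code:

```python
import random

def create_map(strs, k, h, m, seed=0):
    """For each of m cohorts, map each candidate string to h (possibly
    repeated) bit positions in a k-bit Bloom filter."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    cohort_maps = []
    for _ in range(m):
        cohort_maps.append({s: [rng.randrange(k) for _ in range(h)]
                            for s in strs})
    return cohort_maps

maps = create_map(["V_1", "V_2", "V_3"], k=16, h=2, m=4)
```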
GetSample <- function(N, strs, probs) {
  # Sample from the strs population with distribution probs.
sample(strs, N, replace = TRUE, prob = probs)
}
GetTrueBits <- function(samp, map, params) {
# Convert sample generated by GetSample() to Bloom filters where mapping
# is defined in map.
  # Output:
  #   - reports: a matrix of size [m x (k + 1)] where reports[i, 1] is the
  #              total number of reports in cohort i and the remaining k
  #              entries count how many times each Bloom filter bit was set.
N <- length(samp)
k <- params$k
m <- params$m
strs <- colnames(map[[1]])
reports <- matrix(0, m, k + 1)
inst <- sample(1:m, N, replace = TRUE)
for (i in 1:m) {
tab <- table(samp[inst == i])
tab2 <- rep(0, length(strs))
tab2[match(names(tab), strs)] <- tab
counts <- apply(map[[i]], 1, function(x) x * tab2)
# cat(length(tab2), dim(map[[i]]), dim(counts), "\n")
reports[i, ] <- c(sum(tab2), apply(counts, 2, sum))
}
reports
}
GetNoisyBits <- function(truth, params) {
# Applies RAPPOR to the Bloom filters.
# Args:
# - truth: a matrix generated by GetTrueBits().
k <- params$k
p <- params$p
q <- params$q
f <- params$f
rappors <- apply(truth, 1, function(x) {
# The following samples considering 4 cases:
# 1. Signal and we lie on the bit.
# 2. Signal and we tell the truth.
# 3. Noise and we lie.
# 4. Noise and we tell the truth.
# Lies when signal sampled from the binomial distribution.
lied_signal <- rbinom(k, x[-1], f)
# Remaining must be the non-lying bits when signal. Sampled with q.
truth_signal <- x[-1] - lied_signal
# Lies when there is no signal which happens x[1] - x[-1] times.
lied_nosignal <- rbinom(k, x[1] - x[-1], f)
    # Truth when there's no signal. These are sampled with p.
truth_nosignal <- x[1] - x[-1] - lied_nosignal
# Total lies and sampling lies with 50/50 for either p or q.
lied <- lied_signal + lied_nosignal
lied_p <- rbinom(k, lied, .5)
lied_q <- lied - lied_p
# Generating the report where sampling of either p or q occurs.
rbinom(k, lied_q + truth_signal, q) + rbinom(k, lied_p + truth_nosignal, p)
})
cbind(truth[, 1], t(rappors))
}
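The four cases sampled above collapse into closed-form marginal probabilities: a true Bloom filter bit is reported as 1 with probability q* = f/2 (p + q) + (1 - f) q, and a zero bit with probability p* = f/2 (p + q) + (1 - f) p, since a "lie" is sent through p or q with equal probability. This hypothetical helper just evaluates that arithmetic, consistent with the sampling in GetNoisyBits:

```python
def marginal_probs(p, q, f):
    """Expected probability that a reported bit is 1, given the true
    Bloom filter bit, after f/p/q randomization."""
    lie = 0.5 * f * (p + q)      # lied-about bit, then sent via p or q 50/50
    q_star = lie + (1 - f) * q   # true bit is 1
    p_star = lie + (1 - f) * p   # true bit is 0
    return p_star, q_star

# The defaults used in the reader above: p = 0.5, q = 0.75, f = 0.5.
p_star, q_star = marginal_probs(p=0.5, q=0.75, f=0.5)
```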
GenerateSamples <- function(N = 10^5, params, pop_params, alpha = .05,
prop_missing = 0,
correction = "Bonferroni") {
# Simulate N reports with pop_params describing the population and
# params describing the RAPPOR configuration.
  num_strings <- pop_params[[1]]
strs <- SetOfStrings(num_strings)
probs <- GetSampleProbs(pop_params)
samp <- GetSample(N, strs, probs)
map <- CreateMap(strs, params)
truth <- GetTrueBits(samp, map$map, params)
rappors <- GetNoisyBits(truth, params)
strs_apprx <- strs
map_apprx <- map$rmap
# Remove % of strings to simulate missing variables.
if (prop_missing > 0) {
ind <- which(probs > 0)
removed <- sample(ind, ceiling(prop_missing * length(ind)))
map_apprx <- map$rmap[, -removed]
strs_apprx <- strs[-removed]
}
# Randomize the columns.
ind <- sample(1:length(strs_apprx), length(strs_apprx))
map_apprx <- map_apprx[, ind]
strs_apprx <- strs_apprx[ind]
fit <- Decode(rappors, map_apprx, params, alpha = alpha,
correction = correction)
# Add truth column.
fit$fit$Truth <- table(samp)[fit$fit$strings]
fit$fit$Truth[is.na(fit$fit$Truth)] <- 0
fit$map <- map$map
fit$truth <- truth
fit$strs <- strs
fit$probs <- probs
fit
}

85
build.sh Executable file

@ -0,0 +1,85 @@
#!/bin/bash
#
# Build automation.
#
# Usage:
# ./build.sh <function name>
#
# Important targets are:
# doc: build docs with Markdown
# fastrand: build Python extension module to speed up the client simulation
set -o nounset
set -o pipefail
set -o errexit
log() {
echo 1>&2 "$@"
}
die() {
log "FATAL: $@"
exit 1
}
run-markdown() {
which markdown >/dev/null || die "Markdown not installed"
# Markdown is output unstyled; make it a little more readable.
cat <<EOF
<!DOCTYPE html>
<html>
<head>
<style>
code { color: green }
</style>
</head>
<body style="margin: 0 auto; width: 40em; text-align: left;">
<p>
EOF
markdown "$@"
cat <<EOF
</p>
</body>
</html>
EOF
}
# Scan for TODOs. Does this belong somewhere else?
todo() {
find . -name \*.py -o -name \*.R -o -name \*.sh -o -name \*.md \
| xargs --verbose -- grep -w TODO
}
#
# Targets: build "doc" or "fastrand"
#
# Build dependencies: markdown tool.
doc() {
mkdir -p _tmp _tmp/doc
# For now, just one file.
# TODO: generated docs
run-markdown <README.md >_tmp/README.html
run-markdown <doc/tutorial.md >_tmp/doc/tutorial.html
log 'Wrote docs to _tmp'
}
# Build dependencies: Python development headers. Most systems should have
# this. On Ubuntu/Debian, the 'python-dev' package contains headers.
fastrand() {
pushd client/python >/dev/null
python setup.py build
# So we can 'import _fastrand' without installing
ln -s --force build/*/_fastrand.so .
./fastrand_test.py
log 'fastrand built and tests PASSED'
popd >/dev/null
}
"$@"

2
client/cpp/README.md Normal file

@ -0,0 +1,2 @@
Placeholder for the C++ client.

86
client/python/_fastrand.c Normal file

@ -0,0 +1,86 @@
/*
Copyright 2014 Google Inc. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
/*
* _fastrand.c -- Python extension module to generate random bit vectors
* quickly.
*
 * IMPORTANT: This module does not use cryptographically strong randomness. It
 * should ONLY be used to speed up the simulation. Don't use it in
* production.
*
* If an adversary can predict which random bits are flipped, then RAPPOR's
* privacy is compromised.
*
*/
#include <stdint.h> // uint64_t
#include <stdio.h> // printf
#include <stdlib.h> // srand
#include <time.h> // time
#include <Python.h>
uint64_t randbits(float p1, int num_bits) {
uint64_t result = 0;
int i;
for (i = 0; i < num_bits; ++i) {
float r = (float)rand() / RAND_MAX;
uint64_t bit = (r < p1);
result |= (bit << i);
}
return result;
}
static PyObject *
func_randbits(PyObject *self, PyObject *args) {
float p1;
int num_bits;
if (!PyArg_ParseTuple(args, "fi", &p1, &num_bits)) {
return NULL;
}
if (p1 < 0.0 || p1 > 1.0) {
printf("p1 must be between 0.0 and 1.0\n");
// return None for now; easier than raising ValueError
Py_INCREF(Py_None);
return Py_None;
}
if (num_bits < 0 || num_bits > 64) {
    printf("num_bits must be between 0 and 64\n");
// return None for now; easier than raising ValueError
Py_INCREF(Py_None);
return Py_None;
}
//printf("p: %f\n", p);
uint64_t r = randbits(p1, num_bits);
return PyLong_FromUnsignedLongLong(r);
}
PyMethodDef methods[] = {
{"randbits", func_randbits, METH_VARARGS,
"Get a 64 bit number where each bit is 1 with probability p."},
{NULL, NULL},
};
void init_fastrand() {
Py_InitModule("_fastrand", methods);
// Just seed it here; we don't give the application any control.
int seed = time(NULL);
srand(seed);
}
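A pure-Python equivalent of the C randbits() above is useful for sanity-checking the extension (and is essentially what rappor.py's SimpleRandom does bit by bit). This sketch is illustrative, not part of the repository:

```python
import random

def randbits(p1, num_bits, rng=random):
    """Return an integer whose low num_bits bits are each 1 with
    probability p1 -- a pure-Python mirror of the C randbits()."""
    if not (0.0 <= p1 <= 1.0):
        raise ValueError("p1 must be between 0.0 and 1.0")
    if not (0 <= num_bits <= 64):
        raise ValueError("num_bits must be between 0 and 64")
    result = 0
    for i in range(num_bits):
        if rng.random() < p1:  # random() is in [0, 1)
            result |= (1 << i)
    return result

r = randbits(0.5, 16)
```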

34
client/python/fastrand.py Normal file

@ -0,0 +1,34 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""fastrand.py - Python wrapper for _fastrand."""
import random
import _fastrand
class FastRandFuncs(object):
def __init__(self, params):
# NOTE: no rand attribute, so no seeding or getstate/setstate.
# Also duplicating some of rappor._RandFuncs.
self.cohort_rand_fn = random.randint
randbits = _fastrand.randbits
num_bits = params.num_bloombits
self.f_gen = lambda: randbits(params.prob_f, num_bits)
self.p_gen = lambda: randbits(params.prob_p, num_bits)
self.q_gen = lambda: randbits(params.prob_q, num_bits)
self.uniform_gen = lambda: randbits(0.5, num_bits)

53
client/python/fastrand_test.py Executable file

@ -0,0 +1,53 @@
#!/usr/bin/python -S
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
fastrand_test.py: Tests for _fastrand extension module.
"""
import unittest
import _fastrand # module under test
class FastRandTest(unittest.TestCase):
def testRandbits64(self):
for n in [8, 16, 32, 64]:
#print '== %d' % n
for p1 in [0.1, 0.5, 0.9]:
#print '-- %f' % p1
for i in xrange(5):
r = _fastrand.randbits(p1, n)
# Rough sanity check
self.assertLess(r, 2 ** n)
# Visual check
#b = bin(r)
#print b
#print b.count('1')
def testRandbitsError(self):
r = _fastrand.randbits(-1, 64)
# TODO: Should probably raise exceptions
self.assertEqual(None, r)
r = _fastrand.randbits(0.0, 65)
self.assertEqual(None, r)
if __name__ == '__main__':
unittest.main()

281
client/python/rappor.py Normal file

@ -0,0 +1,281 @@
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RAPPOR client library.
Privacy is ensured without a third party by only sending RAPPOR'd data over the
network (as opposed to raw client data).
Note that we use SHA1 for the Bloom filter hash function.
"""
import hashlib
import random
class Params(object):
"""RAPPOR encoding parameters.
These affect privacy/anonymity. See the paper for details.
"""
def __init__(self):
self.num_bloombits = 16 # Number of bloom filter bits (k)
self.num_hashes = 2 # Number of bloom filter hashes (h)
self.num_cohorts = 64 # Number of cohorts (m)
self.prob_p = 0.50 # Probability p
self.prob_q = 0.75 # Probability q
self.prob_f = 0.50 # Probability f
self.flag_oneprr = False # One PRR for each user/word pair
# For testing
def __eq__(self, other):
return self.__dict__ == other.__dict__
def __repr__(self):
return repr(self.__dict__)
class SimpleRandom(object):
  """Returns a num_bits-wide integer where each bit has probability p of being 1."""
def __init__(self, prob_one, num_bits, rand=None):
self.prob_one = prob_one
self.num_bits = num_bits
self.rand = rand or random.Random()
def __call__(self):
p = self.prob_one
rand_fn = self.rand.random # cache it for speed
r = 0
for i in xrange(self.num_bits):
bit = rand_fn() < p
r |= (bit << i) # using bool as int
return r
# NOTE: This doesn't seem faster.
class ApproxRandom(object):
"""Like SimpleRandom, but tries to make fewer random calls.
Represent prob_one in base 2 repr (up to 6 bits = 2^-6 accuracy)
If X is a random bit with Pr[b=1] = p
X & uniform is a random bit with Pr[b=1] = p/2
X | uniform is a random bit with Pr[b=1] = p/2+1/2
Read prob_one from LSB and do & or | operations depending on
whether the bit is set or not a la repeated-squaring.
#
Eg. 0.3 = (0.010011...)_2 ~
unif & (unif | (unif & (unif & (unif | unif))))
0 1 0 0 1 1
Takes as input Pr[b=1], length of random bits, and a randomness
function that outputs 32 bits. When not debugging, set rand_fn
to random.getrandbits(32)
"""
def __init__(self, prob_one, num_bits, rand=None):
"""
Args:
rand: object satisfying Python random.Random() interface.
"""
if not isinstance(prob_one, float):
raise RuntimeError('Probability must be a float')
if not (0 <= prob_one <= 1):
raise RuntimeError('Probability not in [0,1]: %s' % prob_one)
self.num_bits = num_bits
self.rand = rand or random.Random()
# This calculation depends on prob_one, but not the actual randomness.
self.bits_in_prob_one = [0] * 6 # Store prob_one in bits
for i in xrange(0, 6): # Loop at most six times
if prob_one < 0.5:
self.bits_in_prob_one[i] = 0
prob_one *= 2
else:
self.bits_in_prob_one[i] = 1
prob_one = prob_one * 2 - 1
if prob_one <= 0.01: # Finish loop early if less than 1% already
break
def __call__(self):
num_bits = self.num_bits
rand_fn = lambda: self.rand.getrandbits(self.num_bits)
# We could special case these to be exact, but we're not using them for f,
# p, q. Better to use the non-approximate method.
#if self.prob_one == 0:
# return [0] * self.num_bits
#if self.prob_one == 1:
# return [0xffffffff] * self.num_bits
rand_bits = 0
and_or = self.bits_in_prob_one
for i in xrange(5, -1, -1): # Count down from 5 to 0
if and_or[i] == 0: # Corresponds to X & uniform
rand_bits &= rand_fn()
else:
rand_bits |= rand_fn()
return rand_bits
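The base-2 decomposition that ApproxRandom precomputes in its constructor can be isolated: bit 0 of the expansion means "AND with a uniform word" (halving the probability) and bit 1 means "OR with one" (halving, then adding 1/2). This standalone Python 3 sketch reproduces only that expansion step:

```python
def prob_bits(prob_one, max_bits=6):
    """Binary expansion of prob_one (MSB first), up to 2^-6 accuracy,
    stopping early once the residual is under 1%."""
    bits = [0] * max_bits
    for i in range(max_bits):
        if prob_one < 0.5:
            bits[i] = 0
            prob_one *= 2
        else:
            bits[i] = 1
            prob_one = prob_one * 2 - 1
        if prob_one <= 0.01:
            break
    return bits

half = prob_bits(0.5)      # 0.5  = (0.1)_2
quarter = prob_bits(0.25)  # 0.25 = (0.01)_2
```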
class _RandFuncs(object):
"""Base class for randomness."""
def __init__(self, params, rand):
"""
Args:
params: RAPPOR parameters
rand: object satisfying random.Random() interface.
"""
self.rand = rand
self.num_bits = params.num_bloombits
self.cohort_rand_fn = rand.randint
class SimpleRandFuncs(_RandFuncs):
def __init__(self, params, rand):
_RandFuncs.__init__(self, params, rand)
self.f_gen = SimpleRandom(params.prob_f, self.num_bits, rand)
self.p_gen = SimpleRandom(params.prob_p, self.num_bits, rand)
self.q_gen = SimpleRandom(params.prob_q, self.num_bits, rand)
self.uniform_gen = SimpleRandom(0.5, self.num_bits, rand)
class ApproxRandFuncs(_RandFuncs):
def __init__(self, params, rand):
_RandFuncs.__init__(self, params, rand)
self.f_gen = ApproxRandom(params.prob_f, self.num_bits, rand)
self.p_gen = ApproxRandom(params.prob_p, self.num_bits, rand)
self.q_gen = ApproxRandom(params.prob_q, self.num_bits, rand)
# uniform generator (NOTE: could special case this)
self.uniform_gen = ApproxRandom(0.5, self.num_bits, rand)
# Compute masks for rappor's Permanent Randomized Response
# The i^th Bloom Filter bit B_i is set to be B'_i equals
# 1 w/ prob f/2 -- (*) -- f_bits
# 0 w/ prob f/2
# B_i w/ prob 1-f -- (&) -- mask_indices set to 0 here, i.e., no mask
# Output bit indices corresponding to (&) and bits 0/1 corresponding to (*)
def get_rappor_masks(user_id, word, params, rand_funcs):
"""
Call 3 random functions. Seed deterministically beforehand if oneprr.
TODO:
- Rewrite this to be clearer. We can use a completely different Random()
instance in the case of oneprr.
- Expose it in the simulation. It doesn't appear to be exercised now.
"""
if params.flag_oneprr:
stored_state = rand_funcs.rand.getstate() # Store state
rand_funcs.rand.seed(user_id + word) # Consistently seeded
assigned_cohort = rand_funcs.cohort_rand_fn(0, params.num_cohorts - 1)
# Uniform bits for (*)
f_bits = rand_funcs.uniform_gen()
# Mask indices are 1 with probability f.
mask_indices = rand_funcs.f_gen()
if params.flag_oneprr: # Restore state
rand_funcs.rand.setstate(stored_state)
return assigned_cohort, f_bits, mask_indices
def get_bf_bit(input_word, cohort, hash_no, num_bloombits):
"""Compute Bloom Filter bits to set."""
h = '%s%s%s' % (cohort, hash_no, input_word)
sha1 = hashlib.sha1(h).digest()
  # Use the first two bytes of the digest to get a bloom filter index.
  # NOTE: This is only valid for 16 bits (default num_bloombits). Should use
  # the struct module to get arbitrary numbers of bits.
a, b = sha1[0], sha1[1]
return (ord(a) + ord(b) * 256) % num_bloombits
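For illustration, here is a Python 3 rendering of get_bf_bit (the module above is Python 2, where digest bytes need ord()). The expected indices in the test below are taken from the repository's own rappor_test.py testGetBFBit, not computed independently:

```python
import hashlib

def get_bf_bit(word, cohort, hash_no, num_bloombits):
    """Bloom filter bit index from SHA1(cohort + hash_no + word);
    only valid for num_bloombits up to 16, as noted above."""
    h = ('%s%s%s' % (cohort, hash_no, word)).encode('utf-8')
    digest = hashlib.sha1(h).digest()
    # Little-endian 16-bit value from the first two digest bytes.
    return (digest[0] + digest[1] * 256) % num_bloombits

bit0 = get_bf_bit('abc', 0, 0, 16)
bit1 = get_bf_bit('abc', 0, 1, 16)
```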
class Encoder(object):
"""Obfuscates values for a given user using the RAPPOR privacy algorithm."""
def __init__(self, params, user_id, rand_funcs=None):
"""
Args:
params: RAPPOR Params() controlling privacy
user_id: user ID, for generating cohort. (In the simulator, each user
gets its own Encoder instance.)
rand_funcs: randomness, can be deterministic for testing.
"""
self.params = params # RAPPOR params
self.user_id = user_id
self.rand_funcs = rand_funcs
self.p_gen = rand_funcs.p_gen
self.q_gen = rand_funcs.q_gen
def encode(self, word):
"""Compute rappor (Instantaneous Randomized Response)."""
params = self.params
cohort, f_bits, mask_indices = get_rappor_masks(self.user_id, word,
params,
self.rand_funcs)
bloom_bits_array = 0
# Compute Bloom Filter
for hash_no in xrange(params.num_hashes):
bit_to_set = get_bf_bit(word, cohort, hash_no, params.num_bloombits)
bloom_bits_array |= (1 << bit_to_set)
# Both bit manipulations below use the following fact:
# To set c = a if m = 0 or b if m = 1
# c = (a & not m) | (b & m)
#
# Compute PRR as
# f_bits if mask_indices = 1
# bloom_bits_array if mask_indices = 0
prr = (f_bits & mask_indices) | (bloom_bits_array & ~mask_indices)
#print 'prr', bin(prr)
# Compute instantaneous randomized response:
# If PRR bit is set, output 1 with probability q
# If PRR bit is not set, output 1 with probability p
p_bits = self.p_gen()
q_bits = self.q_gen()
#print bin(f_bits), bin(mask_indices), bin(p_bits), bin(q_bits)
irr = (p_bits & ~prr) | (q_bits & prr)
#print 'irr', bin(irr)
return cohort, irr # irr is the rappor
# Update rappor sum
def update_rappor_sums(rappor_sum, rappor, cohort, params):
for bit_num in xrange(params.num_bloombits):
if rappor & (1 << bit_num):
rappor_sum[cohort][1 + bit_num] += 1
rappor_sum[cohort][0] += 1 # The 0^th entry contains total reports in cohort
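The bit-select identity used twice in encode() -- c = (a & ~m) | (b & m) takes a's bit where the mask is 0 and b's bit where it is 1 -- can be checked directly. The values below are illustrative, not taken from a real report:

```python
def select_bits(a, b, mask):
    """Per-bit select: a's bit where mask is 0, b's bit where mask is 1.
    This is the identity behind both the PRR and IRR steps in encode()."""
    return (a & ~mask) | (b & mask)

# PRR-style example: keep Bloom filter bits where the mask is 0,
# substitute uniform bits where the mask is 1.
prr = select_bits(0x0048, 0xFFF0, 0xFF00) & 0xFFFF
```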

289
client/python/rappor_test.py Executable file

@ -0,0 +1,289 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
rappor_test.py: Tests for rappor.py
NOTE! This contains tests that might fail with very small
probability (< 1 in 10,000 times). This is implicitly required
for testing probability. Such tests start with the string "testProbFailure".
"""
import copy
import math
import random
import unittest
import rappor # module under test
class RapporParamsTest(unittest.TestCase):
def setUp(self):
self.typical_instance = rappor.Params()
ti = self.typical_instance # For convenience
ti.num_cohorts = 64 # Number of cohorts
ti.num_hashes = 2 # Number of bloom filter hashes
ti.num_bloombits = 16 # Number of bloom filter bits
ti.prob_p = 0.40 # Probability p
ti.prob_q = 0.70 # Probability q
ti.prob_f = 0.30 # Probability f
# TODO: Move this to constructor, or add a different constructor
ti.flag_oneprr = False # One PRR for each user/word pair
def tearDown(self):
pass
def testApproxRandom(self):
get_rand_bits = rappor.ApproxRandom(0.1, 2)
r = get_rand_bits()
print r, bin(r)
def testSimpleRandom(self):
# TODO: measure speed of naive implementation
return
for i in xrange(100000):
r = rappor.get_rand_bits2(0.1, 2, lambda: random.getrandbits(32))
if i % 10000 == 0:
print i
#print r, [bin(a) for a in r]
def testProbFailureWeakStatisticalTestForGetRandBits(self):
"""Tests whether get_rand_bits outputs correctly biased random bits.
NOTE! This is a test with a small failure probability.
The test succeeds with very very high probability and should only fail
1 in 10,000 times or less.
Samples 256 bits of randomness 1000 times and checks to see that the
cumulative number of bits set in each of the 256 positions is within
3 \sigma of the mean
Repeats this experiment with several probability values
"""
length_in_words = 8 # A good sample size to test; 256 bits
rand_fn = (lambda: random.getrandbits(32))
# NOTE: 0.0 and 1.0 are not handled exactly.
p_values = [0.5, 0.36, 0.9]
# Trials with different probabilities from p[]
for p in p_values:
get_rand_bits = rappor.ApproxRandom(p, length_in_words)
set_bit_count = [0] * 256
for _ in xrange(1000):
rand_sample = get_rand_bits()
bin_str = bin(rand_sample)[2:] # i^th word in binary as a str
# +2 for the 0b prefix
#print bin_str
# Prefix with leading zeroes
bin_str = "0" * (32 - len(bin_str)) + bin_str
for j in xrange(32):
if bin_str[j] == "1":
set_bit_count[32 + j] += 1
mean = int(1000 * p)
# variance of N samples = Np(1-p)
stddev = math.sqrt(1000 * p * (1 - p))
num_infractions = 0 # Number of values over 3 \sigma
infractions = []
for i in xrange(length_in_words):
for j in xrange(32):
s = set_bit_count[i * 32 + j]
if s > (mean + 3 * stddev) or s < (mean - 3 * stddev):
num_infractions += 1
infractions.append(s)
# 99% confidence for 3 \sigma implies less than 10 errors in 1000
# Factor 2 to avoid flakiness as there is a 1% sampling rate error
self.assertTrue(
num_infractions <= 20, '%s %s' % (num_infractions, infractions))
def testUpdateRapporSumsWithLessThan32BitBloomFilter(self):
report = 0x1d # From LSB, bits 1, 3, 4, 5 are set
# Empty rappor_sum
rappor_sum = [[0] * (self.typical_instance.num_bloombits + 1)
for _ in xrange(self.typical_instance.num_cohorts)]
# A random cohort number
cohort = 42
# Setting up expected rappor sum
expected_rappor_sum = [[0] * (self.typical_instance.num_bloombits + 1)
for _ in xrange(self.typical_instance.num_cohorts)]
expected_rappor_sum[42][0] = 1
expected_rappor_sum[42][1] = 1
expected_rappor_sum[42][3] = 1
expected_rappor_sum[42][4] = 1
expected_rappor_sum[42][5] = 1
rappor.update_rappor_sums(rappor_sum, report, cohort,
self.typical_instance)
self.assertEquals(expected_rappor_sum, rappor_sum)
def testGetRapporMasksWithoutOnePRR(self):
params = copy.copy(self.typical_instance)
params.prob_f = 0.5 # For simplicity
num_words = params.num_bloombits // 32 + 1
rand = MockRandom()
uniform_gen = rappor.ApproxRandom(0.5, num_words, rand=rand)
f_gen = rappor.ApproxRandom(params.prob_f, num_words, rand=rand)
rand_funcs = rappor.ApproxRandFuncs(params, rand)
rand_funcs.cohort_rand_fn = (lambda a, b: a)
assigned_cohort, f_bits, mask_indices = rappor.get_rappor_masks(
0, ["abc"], params, rand_funcs)
self.assertEquals(0, assigned_cohort)
self.assertEquals(0xfff0000f, f_bits)
self.assertEquals(0x0ffff000, mask_indices)
def testGetBFBit(self):
cohort = 0
hash_no = 0
input_word = "abc"
ti = self.typical_instance
# expected_hash = ("\x13O\x0b\xa0\xcc\xc5\x89\x01oI\x85\xc8\xc3P\xfe\xa7 H"
# "\xb0m")
# Output should be
# (ord(expected_hash[0]) + ord(expected_hash[1])*256) % 16
expected_output = 3
actual = rappor.get_bf_bit(input_word, cohort, hash_no, ti.num_bloombits)
self.assertEquals(expected_output, actual)
hash_no = 1
# expected_hash = ("\xb6\xcc\x7f\xee@\x95\xb0\xdb\xf5\xf1z\xc7\xdaPM"
# "\xd4\xd6u\xed3")
expected_output = 6
actual = rappor.get_bf_bit(input_word, cohort, hash_no, ti.num_bloombits)
self.assertEquals(expected_output, actual)
def testGetRapporMasksWithOnePRR(self):
# Set randomness function to be used to sample 32 random bits
# Set randomness function that takes two integers and returns a
# random integer cohort in [a, b]
params = copy.copy(self.typical_instance)
params.flag_oneprr = True
num_words = params.num_bloombits // 32 + 1
rand = MockRandom()
rand_funcs = rappor.ApproxRandFuncs(params, rand)
# First two calls to get_rappor_masks for identical inputs
# Third call for a different input
print '\tget_rappor_masks 1'
cohort_1, f_bits_1, mask_indices_1 = rappor.get_rappor_masks(
"0", "abc", params, rand_funcs)
print '\tget_rappor_masks 2'
cohort_2, f_bits_2, mask_indices_2 = rappor.get_rappor_masks(
"0", "abc", params, rand_funcs)
print '\tget_rappor_masks 3'
cohort_3, f_bits_3, mask_indices_3 = rappor.get_rappor_masks(
"0", "abcd", params, rand_funcs)
# First two outputs should be identical, i.e., identical PRRs
self.assertEquals(f_bits_1, f_bits_2)
self.assertEquals(mask_indices_1, mask_indices_2)
self.assertEquals(cohort_1, cohort_2)
# Third PRR should be different from the first PRR
self.assertNotEqual(f_bits_1, f_bits_3)
self.assertNotEqual(mask_indices_1, mask_indices_3)
self.assertNotEqual(cohort_1, cohort_3)
# Now testing with flag_oneprr false
params.flag_oneprr = False
cohort_1, f_bits_1, mask_indices_1 = rappor.get_rappor_masks(
"0", "abc", params, rand_funcs)
cohort_2, f_bits_2, mask_indices_2 = rappor.get_rappor_masks(
"0", "abc", params, rand_funcs)
self.assertNotEqual(f_bits_1, f_bits_2)
self.assertNotEqual(mask_indices_1, mask_indices_2)
self.assertNotEqual(cohort_1, cohort_2)
def testEncoder(self):
"""Expected bloom bits is computed as follows.
f_bits = 0xfff0000f and mask_indices = 0x0ffff000 from
  testGetRapporMasksWithoutOnePRR()
q_bits = 0xfffff0ff from mock_rand.randomness[] and how get_rand_bits works
p_bits = 0x000ffff0 from -- do --
bloom_bits_array is 0x0000 0048 (3rd bit and 6th bit, from
testSetBloomArray, are set)
Bit arithmetic ends up computing
bloom_bits_prr = 0x0ff00048
    bloom_bits_irr = 0x0ffffff8
"""
params = copy.copy(self.typical_instance)
params.prob_f = 0.5
params.prob_p = 0.5
params.prob_q = 0.75
rand_funcs = rappor.ApproxRandFuncs(params, MockRandom())
rand_funcs.cohort_rand_fn = lambda a, b: a
e = rappor.Encoder(params, 0, rand_funcs=rand_funcs)
cohort, bloom_bits_irr = e.encode("abc")
self.assertEquals(0, cohort)
self.assertEquals(0x0ffffff8, bloom_bits_irr)
class MockRandom(object):
"""Returns one of eight random strings in a cyclic manner.
Mock random function that involves *some* state, as needed for tests
that call randomness several times. This makes it difficult to deal
exclusively with stubs for testing purposes.
"""
def __init__(self):
self.counter = 0
self.randomness = [0x0000ffff, 0x000ffff0, 0x00ffff00, 0x0ffff000,
0xfff000f0, 0xfff0000f, 0xf0f0f0f0, 0xff0f00ff]
def seed(self, seed):
self.counter = hash(seed) % 8
#print 'SEED', self.counter
def getstate(self):
#print 'GET STATE', self.counter
return self.counter
def setstate(self, state):
#print 'SET STATE', state
self.counter = state
def getrandbits(self, unused_num_bits):
#print 'GETRAND', self.counter
rand_val = self.randomness[self.counter]
self.counter = (self.counter + 1) % 8
return rand_val
def randint(self, a, b):
return a + self.counter
if __name__ == "__main__":
unittest.main()

26
client/python/setup.py Normal file

@ -0,0 +1,26 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from distutils.core import setup, Extension
module = Extension('_fastrand',
sources = ['_fastrand.c'])
setup(name = '_fastrand',
version = '1.0',
description = 'Module to speed up RAPPOR simulation',
ext_modules = [module])

188
demo.sh Executable file

@ -0,0 +1,188 @@
#!/bin/bash
#
# Demo of RAPPOR. Automating Python and R scripts. See README.
#
# Usage:
# ./demo.sh <function name>
#
# End to end demo for 3 distributions:
#
# $ tests/run.sh end-to-end-all
#
# (This takes a minute or so)
#
# To use a different R interpreter, set R_PREFIX: e.g.
#
# $ export R_PREFIX=/usr/local/bin/Rscript
# $ ./run.sh end-to-end-all
set -o nounset
set -o pipefail
set -o errexit
readonly THIS_DIR=$(dirname $0)
readonly REPO_ROOT=$THIS_DIR
readonly CLIENT_DIR=$REPO_ROOT/client/python
#
# Utility functions
#
banner() {
echo
echo "----- $@"
echo
}
log() {
echo 1>&2 "$@"
}
die() {
log "$0: $@"
exit 1
}
#
# Semi-automated demos
#
# This generates the simulated input s1 .. s<n> with 3 different distributions.
gen-sim-input() {
local dist=$1
local num_clients=$2
local flag=''
case $dist in
exp)
flag=-e
;;
gauss)
flag=-g
;;
unif)
flag=-u
;;
*)
die "Invalid distribution '$dist'"
esac
mkdir -p _tmp
# Simulating 10,000 clients runs reasonably fast but the results look poor.
# 100,000 is slow but looks better.
# 50 different client values are easier to plot (default is 100)
time tests/gen_sim_input.py $flag \
-n $num_clients \
-r 50 \
-o _tmp/$dist.csv
}
# Do the RAPPOR transformation on our simulated input.
rappor-sim() {
local dist=$1
shift
PYTHONPATH=$CLIENT_DIR time $REPO_ROOT/tests/rappor_sim.py \
-i _tmp/$dist.csv \
"$@"
#-s 0 # deterministic seed
}
# Like rappor-sim, but run it through the Python profiler.
rappor-sim-profile() {
local dist=$1
shift
export PYTHONPATH=$CLIENT_DIR
# For now, just dump it to a text file. Sort by cumulative time.
time python -m cProfile -s cumulative \
tests/rappor_sim.py \
-i _tmp/$dist.csv \
"$@" \
| tee _tmp/profile.txt
}
# Analyze output of Python client library.
analyze() {
local dist=$1
local title=$2
local prefix=_tmp/$dist
# Workaround to use a different R interpreter; 'env' is a no-op.
local r_prefix=${R_PREFIX:-env}
local out_dir=_tmp/${dist}_report
mkdir -p $out_dir
time $r_prefix tests/analyze.R -t "$title" $prefix $out_dir
}
# Use locally compiled R. This is useful on Google computers, e.g. instead of
# using the Google R build.
analyze2() {
R_PREFIX=/usr/local/bin/Rscript analyze "$@"
}
# Run end to end for one distribution.
run-dist() {
local dist=$1
# TODO: parameterize output dirs by num_clients
local num_clients=${2:-100000}
banner "Generating simulated input data ($dist)"
gen-sim-input $dist $num_clients
banner "Running RAPPOR ($dist)"
rappor-sim $dist
banner "Analyzing RAPPOR output ($dist)"
analyze $dist "Distribution Comparison ($dist)"
}
expand-html() {
local template=${1:-../tests/report.html}
local out_dir=${2:-_tmp}
pushd $out_dir >/dev/null
# NOTE: We're arbitrarily using the "exp" values since params are all
# independent of distribution.
cat $template \
| sed -e '/SIM_PARAMS/ r exp_sim_params.html' \
-e '/RAPPOR_PARAMS/ r exp_params.html' \
> report.html
log "Wrote $out_dir/report.html. Open this in your browser."
popd >/dev/null
}
# Build prerequisites for the demo.
build() {
# This is optional now.
./build.sh fastrand
}
_run() {
local num_clients=${1:-100000}
for dist in exp gauss unif; do
run-dist $dist $num_clients
done
# Link the HTML skeleton
#
# TODO:
# - gen_sim_input output sim_params.html
# - read params rappor_params.html
expand-html ../tests/report.html _tmp
wc -l _tmp/*.csv
}
# Main entry point. Run it for all distributions, and time the result.
run() {
time _run "$@"
}
"$@"

105
doc/tutorial.md Normal file

@ -0,0 +1,105 @@
RAPPOR Tutorial
===============
This doc explains the simulation tools for RAPPOR. For a detailed description
of the algorithm, see the [paper](http://arxiv.org/abs/1407.6981).
Start with this command:
$ ./demo.sh run
It currently takes 45 seconds or so to run.
As described in the [README](../README.html), this command generates simulated
input data with different distributions, runs it through RAPPOR, then analyzes
and plots the output.
(The dependencies listed in the README must be installed.)
The command is composed of several parts.
1. Generating Simulated Input Data
----------------------------------
`gen_sim_input.py` generates test data. Each row contains a client ID, and a
space separated list of reported values -- the true values we wish to keep
private.
By default, we generate 5-9 values per client, out of 50 unique values, so the
output may look something like this:
1,s10 s55 s1 s15 s29 s57 s6
2,s20 s61 s9 s21 s39 s32 s32 s6 s49
...
<client N>,<client N's space-separated raw data>
You can select the distribution of the `sN` values by passing a flag. The
shell script loops through 3 distributions: exponential, normal/gaussian, and
uniform.
You can also write a script to generate a file in this format and pass it to
the next two stages.
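As a minimal sketch of such a script (the function name, default counts, and output path below are illustrative, not part of the repo; `tests/gen_sim_input.py` is the real tool):

```python
import os
import random
import tempfile

def write_sim_input(path, num_clients=10, num_unique_values=50):
  # One row per client: "<client id>,<space-separated values>"
  with open(path, 'w') as f:
    for client_id in range(1, num_clients + 1):
      # 5-9 values per client, mirroring the default described above.
      values = [random.randint(1, num_unique_values)
                for _ in range(random.randint(5, 9))]
      f.write('%d,%s\n' % (client_id,
                           ' '.join('s%d' % v for v in values)))

out_path = os.path.join(tempfile.gettempdir(), 'custom_sim_input.csv')
write_sim_input(out_path)
```

A file written this way can be fed directly to the RAPPOR transformation and analysis stages below.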
2. RAPPOR Transformation
------------------------
`tests/rappor_sim.py` uses the Python client library
(`client/python/rappor.py`) to obfuscate the `s1` .. `sN` strings.
To preserve the user's privacy, we add random noise by flipping bits in two
different ways.
<!-- TODO: a realistic data set would be nice? How could we generate one? -->
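The two flips above are the paper's "permanent" and "instantaneous" randomized responses. A minimal sketch, assuming the probabilities from the paper (the parameter values and function names here are illustrative; `client/python/rappor.py` is the real implementation):

```python
import random

def permanent_rr(bloom_bits, f, rng=random):
  """First flip: memoized per value on a real client."""
  prr = []
  for b in bloom_bits:
    r = rng.random()
    if r < f / 2.0:
      prr.append(1)        # forced to 1 with probability f/2
    elif r < f:
      prr.append(0)        # forced to 0 with probability f/2
    else:
      prr.append(b)        # original bit kept with probability 1 - f
  return prr

def instantaneous_rr(prr_bits, p, q, rng=random):
  """Second flip: fresh randomness on every report."""
  return [1 if rng.random() < (q if b else p) else 0 for b in prr_bits]

bloom = [1, 0, 0, 1, 0, 0, 0, 1]   # an illustrative 8-bit Bloom filter
report = instantaneous_rr(permanent_rr(bloom, f=0.5), p=0.5, q=0.75)
```

The permanent response bounds what an attacker can ever learn about the true value; the instantaneous response protects each individual report.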
It generates 4 files:
- Counts (`exp_out.csv`) -- This currently is the sum of what will be sent over
the network. TODO: change it to output individual reports. Then have a
separate tool that does the summing.
- Parameters (`exp_params.csv`) -- This is a 1-row CSV file with the 6 privacy parameters
`k,h,m,p,q,f`. (The [report.html](../report.html) file and the paper both
describe these parameters). This should be sent over the network along with
the counts. When the raw RAPPOR data is persisted, this should also form
part of the "schema", as the data can't be decoded correctly without it.
- True histogram of input values (`exp_hist.csv`) -- This is for debugging /
comparison. You won't have this in a real setting, of course.
- Map file (`exp_map.csv`) -- Hashed candidates.
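The map file's row layout can be sketched as follows. The `bf_bit` hash here is a hypothetical md5-based stand-in (the real hash lives in the client library); the row layout -- candidate string, then one 1-based bit index per (cohort, hash), offset by `cohort * k` -- matches `print_map` in `tests/rappor_sim.py`:

```python
import hashlib

def bf_bit(word, cohort, hash_no, num_bloombits):
  """Hypothetical stand-in for the client library's Bloom filter hash."""
  digest = hashlib.md5(('%d %s' % (cohort, word)).encode()).digest()
  return ord(digest[hash_no:hash_no + 1]) % num_bloombits  # bit in [0, k)

def map_row(word, num_cohorts=4, num_hashes=2, k=16):
  # One CSV row: the candidate, then a 1-based bit index per (cohort, hash),
  # offset by cohort * k into the concatenated per-cohort bit vectors.
  cols = [word]
  for cohort in range(num_cohorts):
    for hash_no in range(num_hashes):
      cols.append(str(cohort * k + bf_bit(word, cohort, hash_no, k) + 1))
  return ','.join(cols)
```

So with `k = 16`, cohort 0's bits occupy columns 1-16, cohort 1's bits columns 17-32, and so on.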
3. RAPPOR Analysis
------------------
Once you have the `counts`, `params`, and `map` files, you can pass them to the
`tests/analyze.R` tool, which is a small wrapper around the `analysis/R`
library.
Then you will get a plot of the true distribution vs. the distribution
recovered from data obfuscated with the RAPPOR privacy algorithm.
[View the example output](../report.html).
You can change the simulation or RAPPOR parameters via flags, and compare the
resulting distributions.
TODO
----
The user should provide candidates, and we should have a tool to hash them.
This is like the gen_map tool.
$ hash_candidates.py <candidates>
(Writes <map file>)
Tool to extract candidates from the input file.
$ ./demo.sh cheat-candidates <raw input>
In the real setting, it can be nontrivial to enumerate the candidates.
To simulate this, filter the list with `grep`.
Show more detailed command lines, --help?

135
tests/analyze.R Executable file

@ -0,0 +1,135 @@
#!/usr/bin/Rscript --vanilla
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Simple tool that wraps the analysis/R library.
#
# To run this you need:
# - ggplot2
# - optparse
# - glmnet -- dependency of analysis library
library(optparse)
# Do command line parsing first to catch errors. Loading libraries in R is
# slow.
if (!interactive()) {
option_list <- list(
make_option(c("-t", "--title"), help="Plot Title")
)
parsed <- parse_args(OptionParser(option_list = option_list),
positional_arguments = 2) # input and output
}
library(ggplot2)
source("analysis/R/analysis_lib.R")
source("analysis/R/read_input.R")
source("analysis/R/decode.R")
Log <- function(...) {
cat('analyze.R: ')
cat(sprintf(...))
cat('\n')
}
LoadInputs <- function(prefix, ctx) {
# prefix: path prefix, e.g. '_tmp/exp'
p <- paste0(prefix, '_params.csv')
c <- paste0(prefix, '_out.csv')
m <- paste0(prefix, '_map.csv')
h <- paste0(prefix, '_hist.csv')
# Calls AnalyzeRAPPOR to run the analysis code
# Date(s) are some dummy dates
ctx$rappor <- AnalyzeRAPPOR(ReadParameterFile(p),
ReadCountsFile(c),
ReadMapFile(m)$map, "FDR", 0.05, 1,
date="01/01/01", date_num="100001")
if (is.null(ctx$rappor)) {
stop("RAPPOR analysis failed.")
}
ctx$actual <- read.csv(h)
}
# Prepare input data to be plotted.
ProcessAll = function(ctx) {
actual <- ctx$actual
rappor <- ctx$rappor
# "s12" -> 12, for graphing
StringToInt <- function(x) as.integer(substring(x, 2))
total <- sum(actual$count)
a <- data.frame(index = StringToInt(actual$string),
# Calculate the true proportion
proportion = actual$count / total,
dist = "actual")
r <- data.frame(index = StringToInt(rappor$strings),
proportion = rappor$proportion,
dist = "rappor")
# Fill in zeros for values missing in RAPPOR. It makes the ggplot bar plot
# look better.
fill <- setdiff(actual$string, rappor$strings)
if (length(fill) > 0) {
z <- data.frame(index = StringToInt(fill),
proportion = 0.0,
dist = "rappor")
} else {
z <- data.frame()
}
rbind(r, a, z)
}
PlotAll <- function(d, title) {
# NOTE: geom_bar makes a histogram by default; need stat = "identity"
g <- ggplot(d, aes(x = index, y = proportion, fill = factor(dist)))
b <- geom_bar(stat = "identity", position = "dodge")
t <- ggtitle(title)
g + b + t
}
WritePlot <- function(p, outdir, width = 800, height = 600) {
filename <- file.path(outdir, 'dist.png')
png(filename, width=width, height=height)
plot(p)
dev.off()
Log('Wrote %s', filename)
}
main <- function(parsed) {
args <- parsed$args
options <- parsed$options
input_prefix <- args[[1]]
output_dir <- args[[2]]
# increase ggplot font size globally
theme_set(theme_grey(base_size = 16))
ctx <- new.env()
LoadInputs(input_prefix, ctx)
d <- ProcessAll(ctx)
p <- PlotAll(d, options$title)
WritePlot(p, output_dir)
}
if (!interactive()) {
main(parsed)
}

242
tests/gen_sim_input.py Executable file

@ -0,0 +1,242 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tool to generate simulated input data for RAPPOR.
We can output data in the following distributions:
a. Uniform
b. Gaussian
c. Exponential
After it goes through RAPPOR, we should be able to see the distribution, but not
any user's particular input data.
"""
import getopt
import math
import os
import random
import sys
import time
# Distributions
DISTR_UNIF = 1 # Uniform
DISTR_GAUSS = 2 # Gaussian
DISTR_EXP = 3 # Exponential
# Command line arguments
OUTFILE = "" # Output file name
DISTR = DISTR_UNIF # Distribution: default is uniform
NUM_UNIQUE_VALUES = 100 # Range of client's values in reports
# The default is strings "1" ... "100"
DIST_PARAM = None # Parameter to pass to distribution
NUM_CLIENTS = 100000 # Number of simulated clients
# NOTE: unused. This is hard-coded now.
LOG_NUM_UNIQUE_VALUES = 30 # Something like 4-5xlog(NUM_UNIQUE_VALUES) bits
# should give enough entropy for good samples
ONE_MINUS_EXP_LAMBDA = 0 # 1-e^-lambda
def log(msg, *args):
if args:
msg = msg % args
print >>sys.stderr, msg
# Script usage scenario
def usage(script_name):
sys.stdout.write("Usage: " + script_name + " -o <output file name>")
sys.stdout.write(" -r <range of values \"s1\"-\"sXX\">")
sys.stdout.write(" [-u|g|e|n|p]")
sys.stdout.write("""
-u Uniform distribution (default)
-g Gaussian distribution
-e Exponential distribution
-n Number of users (default = 100,000)
-p Parameter
Ignored for uniform
Std-dev for Gaussian
Lambda for Exponential
""")
def init_rand_precompute():
global ONE_MINUS_EXP_LAMBDA
if DISTR == DISTR_EXP:
ONE_MINUS_EXP_LAMBDA = 1 - math.exp(-DIST_PARAM)
def rand_sample_unif():
"""Returns a uniform sample in [1, NUM_UNIQUE_VALUES]."""
# NOTE: randrange() excludes the upper bound, so use randint() here.
return random.randint(1, NUM_UNIQUE_VALUES)
def rand_sample_gauss():
"""Returns a value in [1, NUM_UNIQUE_VALUES] drawn from a Gaussian."""
mean = float(NUM_UNIQUE_VALUES + 1) / 2
while True:
r = random.normalvariate(mean, DIST_PARAM)
value = int(round(r))
# Rejection sampling to cut off Gaussian to within [1, NUM_UNIQUE_VALUES]
if 1 <= value <= NUM_UNIQUE_VALUES:
break
return value # true client value
def rand_sample_exp():
"""Returns a random sample in [1, NUM_UNIQUE_VALUES] drawn from an
exponential distribution.
"""
rand_in_cf = random.random()
# Val sampled from exp distr in [0,1] is CDF^{-1}(unif in [0,1))
rand_sample_in_01 = (
-math.log(1 - rand_in_cf * ONE_MINUS_EXP_LAMBDA) / DIST_PARAM)
# Scale up to NUM_UNIQUE_VALUES and floor to integer
rand_val = int((rand_sample_in_01 * NUM_UNIQUE_VALUES) + 1)
return rand_val
PARAMS_HTML = """
<h3>Simulation Input</h3>
<table align="center">
<tr>
<td>Number of clients</td>
<td align="right">{num_clients:,}</td>
</tr>
<tr>
<td>Total values reported / obfuscated</td>
<td align="right">{num_values:,}</td>
</tr>
<tr>
<td>Unique values reported / obfuscated</td>
<td align="right">{num_unique_values}</td>
</tr>
</table>
"""
def WriteParamsHtml(num_values, f):
d = {
'num_clients': NUM_CLIENTS,
'num_unique_values': NUM_UNIQUE_VALUES,
'num_values': num_values,
}
# NOTE: No HTML escaping since we're writing numbers
print >>f, PARAMS_HTML.format(**d)
def main(argv):
# All command line arguments are placed into global vars
global OUTFILE, NUM_UNIQUE_VALUES, DISTR, DIST_PARAM, NUM_CLIENTS
# Get arguments
try:
opts, args = getopt.getopt(argv[1:], "ugen:p:o:r:")
except getopt.GetoptError:
usage(argv[0])
sys.exit(2)
# Parsing arguments
for opt, arg in opts:
if opt == "-o":
OUTFILE = arg
elif opt == "-r":
NUM_UNIQUE_VALUES = int(arg)
elif opt == "-u":
DISTR = DISTR_UNIF
elif opt == "-g":
DISTR = DISTR_GAUSS
elif opt == "-e":
DISTR = DISTR_EXP
elif opt == "-p":
DIST_PARAM = float(arg)
elif opt == "-n":
NUM_CLIENTS = int(arg)
# Some sanity checking
if not OUTFILE:
sys.stdout.write("Output file is required.\n")
usage(argv[0])
sys.exit(2)
if NUM_UNIQUE_VALUES < 2:
sys.stdout.write("Range should be at least 2. Setting to default 100.\n")
NUM_UNIQUE_VALUES = 100
if DIST_PARAM is None:
if DISTR == DISTR_GAUSS:
DIST_PARAM = float(NUM_UNIQUE_VALUES) / 6
elif DISTR == DISTR_EXP:
DIST_PARAM = float(NUM_UNIQUE_VALUES) / 5
if NUM_CLIENTS < 10:
sys.stdout.write("RAPPOR typically works with much larger numbers of users.")
sys.stdout.write(" Setting number of users to 10.\n")
NUM_CLIENTS = 10
random.seed()
# Precompute and initialize constants needed for random samples
init_rand_precompute()
# Choose a function that yields the desired distribution. Each of these
# functions returns a randomly sampled integer between 1 and
# NUM_UNIQUE_VALUES. The functions use some globals.
if DISTR == DISTR_UNIF:
rand_sample = rand_sample_unif
elif DISTR == DISTR_GAUSS:
rand_sample = rand_sample_gauss
elif DISTR == DISTR_EXP:
rand_sample = rand_sample_exp
start_time = time.time()
# Printing values into file OUTFILE
num_values = 0
with open(OUTFILE, "w") as f:
for i in xrange(1, NUM_CLIENTS + 1):
if i % 10000 == 0:
elapsed = time.time() - start_time
log('Generated %d rows in %.2f seconds', i, elapsed)
f.write('%d,' % i)
# Generates between 5 and 9 values for each user/client. This is hard
# coded for now -- could be set by flags.
values = [rand_sample() for _ in xrange(random.randint(5, 9))]
f.write(' '.join('s%d' % v for v in values))
f.write("\n")
num_values += len(values)
log('Wrote %s', OUTFILE)
prefix, _ = os.path.splitext(OUTFILE)
params_filename = prefix + '_sim_params.html'
# TODO: This should take 'opts'
with open(params_filename, 'w') as f:
WriteParamsHtml(num_values, f)
log('Wrote %s', params_filename)
if __name__ == "__main__":
main(sys.argv)

361
tests/rappor_sim.py Executable file

@ -0,0 +1,361 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tool to run RAPPOR on simulated client input.
It takes a 2-column CSV file as generated by gen_sim_input.py. Example:
1,s10 s55 s1 s15 s29 s57 s6
2,s20 s61 s9 s21 s39 s64 s32 s6 s49
...
<client N>,<client N's space-separated raw data>
We output 4 files:
- params: RAPPOR parameters, needed to recover distributions from the output
- out: output the total counts of the bloom filter bits set by RAPPOR on
input data
- map file: candidate strings and hashes; required for RAPPOR
- hist: histogram of actual input values. Compare this with the histogram
the RAPPOR analysis infers from the first 3 files.
"""
import collections
import getopt
import os
import random
import sys
import time
import rappor # client library
try:
import fastrand
except ImportError:
print >>sys.stderr, (
"Native fastrand module not imported; see README for speedups")
fastrand = None
# Error flags
PARSE_SUCCESS = 0
PARSE_ERROR = 1
def log(msg, *args):
if args:
msg = msg % args
print >>sys.stderr, msg
class RapporInstance(object):
"""Simple class to create a RAPPOR instance with specific default params."""
def __init__(self):
self.params = rappor.Params()
self.infile = "" # Input file name; must be user-provided
self.outfile = "" # Output file name
self.histfile = "" # Output histogram file
self.mapfile = "" # Output BF map file
self.paramsfile = "" # Output params file
self.randomness_seed = None # Randomness seed
# For debugging purposes only
# TODO: Add orthogonal flag for cryptographic randomness.
self.random_mode = 'fast' # simple/approx/fast.
# For testing
def __eq__(self, other):
return self.__dict__ == other.__dict__
def __repr__(self):
return repr(self.__dict__)
def parse_args(argv):
"""Parse and validate flags."""
try:
opts, args = getopt.getopt(
argv[1:], "i:o:b:p:q:f:c:nh:hf:m:pf:s:r:",
["input=", "output=", "cohorts=",
"hashes=", "bloombits=", "oneprr",
"mapfile=", "rseed="])
except getopt.GetoptError:
usage(argv[0])
sys.exit(2)
inst = RapporInstance()
for opt, arg in opts:
if opt in ("-i", "--input"):
inst.infile = arg
elif opt in ("-o", "--output"):
inst.outfile = arg
# Privacy params
elif opt in ("-b", "--bloombits"):
inst.params.num_bloombits = int(arg)
elif opt in ("-nh", "--hashes"):
inst.params.num_hashes = int(arg)
elif opt in ("-c", "--cohorts"):
inst.params.num_cohorts = int(arg)
elif opt == "-p":
inst.params.prob_p = float(arg)
elif opt == "-q":
inst.params.prob_q = float(arg)
elif opt == "-f":
inst.params.prob_f = float(arg)
# Pseudo-param
elif opt == "--oneprr":
inst.params.flag_oneprr = True
elif opt == "-r":
VALID = ('simple', 'approx', 'fast')
arg = arg.strip()
if arg not in VALID:
raise RuntimeError('random mode must be one of: %s' % ' '.join(VALID))
inst.random_mode = arg
elif opt == "-hf":
inst.histfile = arg
elif opt in ("-m", "--mapfile"):
inst.mapfile = arg
elif opt == "-pf":
inst.paramsfile = arg
elif opt in ("-s", "--rseed"):
inst.randomness_seed = arg
# Warn anyone that accidentally turns on the flag
if inst.randomness_seed is not None:
sys.stdout.write("""
WARNING! Randomness should be seeded with time or good entropy sources to
ensure freshness. -s/--seed command line flag is for debugging purposes
only.
\n""")
if not inst.infile:
return inst, PARSE_ERROR
prefix, _ = os.path.splitext(inst.infile)
inst.outfile = inst.outfile or (prefix + "_out.csv")
inst.histfile = inst.histfile or (prefix + "_hist.csv")
inst.mapfile = inst.mapfile or (prefix + "_map.csv")
inst.paramsfile = inst.paramsfile or (prefix + "_params.csv")
return inst, PARSE_SUCCESS
def usage(script_name):
sys.stdout.write("Usage: " + script_name + " --input/-i <input file name>")
sys.stdout.write(" [-o|c|nh|p|q|f|b] [--oneprr]")
sys.stdout.write("""
-o or --output Output file name
-r simple/approx/fast Random algorithm
-c or --cohorts Number of cohorts
-nh or --hashes Number of hashes
-p Probability p
-q Probability q
-f Probability f
-b or --bloombits Size of bloom filter in bits
-pf Parameters file
-m or --mapfile Bloom filter map file
--oneprr Include flag to set one PRR for each (user,word)
""")
PARAMS_HTML = """
<h3>RAPPOR Parameters</h3>
<table align="center">
<tr>
<td><b>k</b></td>
<td>Size of Bloom filter in bits</td>
<td align="right">{}</td>
</tr>
<tr>
<td><b>h</b></td>
<td>Hash functions in Bloom filter</td>
<td align="right">{}</td>
</tr>
<tr>
<td><b>m</b></td>
<td>Number of Cohorts</td>
<td align="right">{}</td>
</tr>
<tr>
<td><b>p</b></td>
<td>Probability p</td>
<td align="right">{}</td>
</tr>
<tr>
<td><b>q</b></td>
<td>Probability q</td>
<td align="right">{}</td>
</tr>
<tr>
<td><b>f</b></td>
<td>Probability f</td>
<td align="right">{}</td>
</tr>
</table>
"""
def print_params(params, csv_out, html_out):
"""Print Rappor parameters to a text file."""
row = (
params.num_bloombits,
params.num_hashes,
params.num_cohorts,
params.prob_p,
params.prob_q,
params.prob_f
)
print >>csv_out, "k,h,m,p,q,f" # CSV header; print adds the newline
print >>csv_out, "%s,%s,%s,%s,%s,%s" % row
# NOTE: No HTML escaping since we're writing numbers
print >>html_out, PARAMS_HTML.format(*row)
def make_histogram(infile):
"""Make a histogram of the simulated input file."""
# TODO: It would be better to share parsing with rappor_encode()
words_counter = collections.Counter()
for line in infile:
_, words = line.strip().split(",")
words_counter.update(words.split())
return dict(words_counter.most_common())
def print_map(all_words, params, mapfile):
"""Print Bloom Filter map of values from infile."""
# Print maps of distributions
# Required by the R analysis tool
k = params.num_bloombits
for word in all_words:
mapfile.write(word)
for cohort in xrange(params.num_cohorts):
for hash_no in xrange(params.num_hashes):
bf_bit = rappor.get_bf_bit(word, cohort, hash_no, k) + 1
mapfile.write("," + str(cohort * k + bf_bit))
mapfile.write("\n")
def print_histogram(word_hist, histfile):
"""Write histogram of infile to histfile."""
# Print histograms of distributions
sorted_words = sorted(word_hist.iteritems(), key=lambda pair: pair[1],
reverse=True)
fmt = "%s,%s"
print >>histfile, fmt % ("string", "count")
for pair in sorted_words:
print >>histfile, fmt % pair
def rappor_encode(params, rand_funcs, infile):
# Initializing array to capture sums of rappors.
rappor_sums = [[0] * (params.num_bloombits + 1)
for _ in xrange(params.num_cohorts)]
start_time = time.time()
for i, line in enumerate(infile):
user_id, words = line.strip().split(",")
if i % 1000 == 0:
elapsed = time.time() - start_time
log('Processed %d inputs in %.2f seconds', i, elapsed)
# New encoder instance for each user.
e = rappor.Encoder(params, user_id, rand_funcs=rand_funcs)
for word in words.split():
cohort, r = e.encode(word)
# Sum rappors. TODO: move this to separate tool.
rappor.update_rappor_sums(rappor_sums, r, cohort, params)
return rappor_sums
def main(argv):
inst, ret_val = parse_args(argv)
if ret_val == PARSE_ERROR:
usage(argv[0])
sys.exit(2)
params = inst.params
params_csv = inst.paramsfile
base, _ = os.path.splitext(params_csv)
params_html = base + '.html'
# Print parameters to parameters file -- needed for the R analysis tool.
with open(params_csv, 'w') as csv_out:
with open(params_html, 'w') as html_out:
print_params(params, csv_out, html_out)
with open(inst.infile) as f:
word_hist = make_histogram(f)
# Print true histograms.
with open(inst.histfile, 'w') as f:
print_histogram(word_hist, f)
# Print maps to map file -- needed for the R analysis tool.
all_words = sorted(word_hist) # unique words
with open(inst.mapfile, 'w') as f:
print_map(all_words, params, f)
rand = random.Random() # default Mersenne Twister randomness
#rand = random.SystemRandom() # cryptographic randomness from OS
if inst.randomness_seed is not None:
rand.seed(inst.randomness_seed) # Seed with cmd line arg
log('Seeded to %r', inst.randomness_seed)
else:
rand.seed() # Default: seed with sys time
if inst.random_mode == 'simple':
rand_funcs = rappor.SimpleRandFuncs(params, rand)
elif inst.random_mode == 'approx':
rand_funcs = rappor.ApproxRandFuncs(params, rand)
elif inst.random_mode == 'fast':
if fastrand:
log('Using fastrand extension')
# NOTE: This doesn't take 'rand'
rand_funcs = fastrand.FastRandFuncs(params)
else:
log('Warning: fastrand module not importable; see README for build '
'instructions. Falling back to simple randomness.')
rand_funcs = rappor.SimpleRandFuncs(params, rand)
else:
raise AssertionError
# Do RAPPOR transformation.
with open(inst.infile) as f:
rappor_sums = rappor_encode(params, rand_funcs, f)
# Print sums of all rappor bits into output file
with open(inst.outfile, 'w') as f:
for row in xrange(params.num_cohorts):
for col in xrange(params.num_bloombits):
f.write(str(rappor_sums[row][col]) + ",")
f.write(str(rappor_sums[row][params.num_bloombits]) + "\n")
if __name__ == "__main__":
try:
main(sys.argv)
except RuntimeError, e:
log('rappor_sim.py: FATAL: %s', e)

60
tests/rappor_sim_test.py Executable file

@ -0,0 +1,60 @@
#!/usr/bin/python
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
rappor_sim_test.py: Tests for rappor_sim.py
"""
import unittest
import rappor_sim # module under test
class RapporParamsTest(unittest.TestCase):
def setUp(self):
pass
def tearDown(self):
pass
def testParseArgs(self):
expected = rappor_sim.RapporInstance()
p = expected.params
p.num_bloombits = 16 # Number of bloom filter bits
p.num_hashes = 2 # Number of bloom filter hashes
p.num_cohorts = 64 # Number of cohorts
p.prob_p = 0.40 # Probability p
p.prob_q = 0.70 # Probability q
p.prob_f = 0.30 # Probability f
p.flag_oneprr = False # One PRR for each user/word pair
expected.infile = "test.txt" # Input file name
expected.outfile = "test_out.csv" # Output file name
expected.histfile = "test_hist.csv" # Output histogram file
expected.mapfile = "test_map.csv" # Output BF map file
expected.paramsfile = "test_params.csv" # Output params file
arg_string = ("script --cohorts 64 --hashes 2 --bloombits 16 -p 0.4"
" -q 0.7 -f 0.3 -i test.txt")
arg = arg_string.strip().split()
result, error = rappor_sim.parse_args(arg)
self.assertEquals(expected, result)
self.assertEquals(error, rappor_sim.PARSE_SUCCESS)
if __name__ == "__main__":
unittest.main()

23
tests/report.html Normal file

@ -0,0 +1,23 @@
<!DOCTYPE html>
<html>
<head>
<title>RAPPOR Demo</title>
</head>
<body style="text-align: center">
<h2>RAPPOR Demo</h2>
<!-- These strings will be replaced by a sed script. -->
<!-- SIM_PARAMS -->
<!-- RAPPOR_PARAMS -->
<hr/>
<img src="exp_report/dist.png" alt="exponential distribution" />
<img src="gauss_report/dist.png" alt="gauss distribution" />
<img src="unif_report/dist.png" alt="uniform distribution" />
</body>
</html>

102
tests/run.sh Executable file

@ -0,0 +1,102 @@
#!/bin/bash
#
# Copyright 2014 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Test automation script.
#
# Usage:
# run.sh <function name>
#
# Examples:
# $ tests/run.sh py-unit # run Python unit tests
# $ tests/run.sh all # all tests
set -o nounset
set -o pipefail
set -o errexit
readonly THIS_DIR=$(dirname $0)
readonly REPO_ROOT=$THIS_DIR/..
readonly CLIENT_DIR=$REPO_ROOT/client/python
#
# Utility functions
#
die() {
echo 1>&2 "$0: $@"
exit 1
}
#
# Fully Automated Tests
#
# Python unit tests.
#
# TODO: Separate out deterministic tests from statistical tests (which may
# rarely fail)
py-unit() {
export PYTHONPATH=$CLIENT_DIR # to find client library
set +o errexit
# -e: exit at first failure
find $REPO_ROOT -name \*_test.py | sh -x -e
local exit_code=$?
if test $exit_code -eq 0; then
echo 'ALL PASSED'
else
echo 'FAIL'
exit 1
fi
set -o errexit
}
# All tests
all() {
py-unit
py-lint
# TODO: Add R tests, end to end demo
}
#
# Lint
#
python-lint() {
# E111: indent not a multiple of 4. We are following the Google/Chrome style
# and using 2 space indents.
if pep8 --ignore=E111 "$@"; then
echo
echo 'LINT PASSED'
else
echo
echo 'LINT FAILED'
exit 1
fi
}
py-lint() {
which pep8 || die "pep8 not installed ('sudo apt-get install pep8' on Ubuntu)"
# Excluding setup.py, because it's a config file and uses "invalid" 'name =
# 1' style (spaces around =).
find $REPO_ROOT -name \*.py \
| grep -v /setup.py \
| xargs --verbose -- $0 python-lint
}
"$@"