Commit graph

23 commits

Author SHA1 Message Date
Andy Chu 077fe1d70e Changes to export code to google3.
- Bug fix: if the map cache .rda file can't be written, log an error rather
  than stopping analysis
- bin/test.sh: for decode-dist and decode-assoc smoke tests, write the testdata
  in a layout that is more easily exported
- add DEP_PYTHON and DEP_FAST_EM environment variables
- scripts/g3export: script to export analysis code and generated testdata
2016-01-21 15:11:34 -08:00
Andy Chu 421d583e76 Fix non-deterministic map matrix dimension bug that causes bad Decode().
- test.sh: Change to h=1 to trigger the bug more reliably
- move validation of inputs to Decode(), instead of decode_dist.R, so we
  get it whenever we call Decode()
- Check dimensions in CreateAssocStringMap, for a better error message
- Require 'params' when calling ReadMapFile/LoadMapFile
- Log a message when entries are removed from the map
2016-01-08 12:43:14 -08:00
Andy Chu 2ee22e919f Merge branch 'master' into em-tensorflow 2015-12-21 12:52:28 -08:00
Andy Chu 6df7403e17 Fix column names again. The column names of the output should be taken
from the input, e.g. "domain","flags..HTTPS","proportion" rather than
literally "Var1","Var2","proportion".
2015-12-18 14:38:21 -08:00
Andy Chu 6f37ce1c3e Merge branch 'master' into em-tensorflow 2015-12-17 16:39:16 -08:00
Ananth Raghunathan bcc13e6b67 Merge branch 'master' into decode-assoc-errors 2015-12-16 23:47:03 +00:00
Ananth Raghunathan 6e7ce5425c Boolean decode without boolean map.
Fixed the boolean decode pipeline to avoid using the Boolean map; this was
causing a few errors in the association pipeline when TRUE values were close
to zero.
2015-12-14 19:56:40 +00:00
Andy Chu eff76ef163 Fix typo (again) 2015-12-10 21:52:59 -08:00
Andy Chu bee31cd91b Basic TensorFlow implementation of EM, and associated changes.
It can be selected by passing --em-executable as fast_em.sh (wrapper for
Python) instead of the binary compiled from fast_em.cc.

See comments at the top of fast_em.py.

- association_test.R: test all three implementations
- bin/test.sh
  - Add demo for TensorFlow
  - Add demo of early convergence
  - Put the output from each implementation in its own directory.
- fast_em.R: Test exit code when shelling out to --em-executable
- Write number of EM iterations in the assoc-metrics.json output (all
  implementations). This changes the protocol between the R driver and EM
  implementation.
- Clean up the log output from fast_em.cc
2015-12-10 21:49:44 -08:00
Andy Chu 10f51edf27 Error conditions in decode_assoc.R:
- if there are no reports, exit with code 9
- check the case where ComputeDistributionEM failed
2015-12-10 19:17:54 -08:00
Andy Chu 27847c4a61 Use Log() to record the timing instead of a one-off. 2015-12-02 01:19:37 -08:00
Andy Chu b80947bc64 Abort early if there are no rows 2015-11-24 23:46:34 +00:00
Andy Chu b4d5b9caf2 Fix typo 2015-11-24 23:39:32 +00:00
Andy Chu 5c0865d685 Add option to remove bad rows. This is off by default since typically
the data extraction process should make sure it's well-formed.
2015-11-24 14:44:18 -08:00
Andy Chu cbd63d4e27 Log argv for debugging purposes. 2015-11-24 13:23:58 -08:00
Andy Chu 623bed27a5 Fix quoting behavior of LoadMapFile/ReadMapFile. Added a test.
Also fix write-assoc-testdata.
2015-11-12 22:48:50 -08:00
Andy Chu 2882a4e6d2 Call the column "proportion" instead of "Freq" to be consistent with
decode-dist.
2015-11-05 13:49:47 -08:00
Andy Chu ca6e666376 Flatten the 2D result matrix into a data frame. Currently this assumes
that the first variable is a string and the second is a boolean.
2015-11-05 13:42:26 -08:00
Andy Chu 050a0cfa6b Add a test that the two EM implementations give the same results.
Remove the --test-em-executable option to decode_assoc.R, since this is
now in the unit tests.
2015-11-03 17:05:46 -08:00
Andy Chu 669db8d118 Make the map_by_cohort stuff a little more readable. 2015-11-03 15:34:04 -08:00
Andy Chu f0a2584223 - Clean up map creation for association. Rename rmap and map to
all_cohorts_map and map_by_cohort.
- Don't use mclapply to create map, since we're not ranging over all
  reports.
2015-11-03 15:14:51 -08:00
Andy Chu c66f04d96a Clean up logging and comments. 2015-11-03 14:53:27 -08:00
Andy Chu 663d5af77c Optimization of association analysis. Command line interface. Bug fixes.
Code cleanup.

- Added bin/decode-assoc command line tool for automation.  Define and
  read a new schema so we know whether variables are string or boolean,
  and what their encoding parameters are.

- bin/test.sh: Add tests for bin/decode-assoc, e.g. for the (string x
  boolean) problem
  - Add Boolean RAPPOR encoding (no hashing step) to the Python client
    with Encoder.encode_bits
  - Add association testdata generation to rappor_sim.py

- Optimizations
  - In association.R, remove from GetCondProb() any computation that
    doesn't depend on the report.  (e.g. 100x-1000x speedup for the
    joint conditional stage)
  - Use mclapply in R for the steps that did lapply() over all reports
    (N).  The number of cores is controlled by decode-assoc --num-cores.
  - Provide an alternative implementation of EM in C++ in
    analysis/cpp/fast_em.cc (analysis/R/fast_em.R is a wrapper that does
    serialization and a system() call)  ~100x speedup.

- Fixed bugs in the k=1 case, due to R matrix/vector confusion (adapted
  from Ananth's changes)

- Allow different variables in association analysis to have different
  parameters (adapted from Ananth's changes, but tests still pass)

- Add hacky support for boolean vars (adapted from Ananth's changes)

- Code cleanup in association.R.  Rename variables to be more clear.

- Minor refactoring of decode_dist.R to resemble decode_assoc.R
2015-11-02 14:11:49 -08:00
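
Several commits above (ca6e666376, 2882a4e6d2, 6df7403e17) concern flattening the 2D association result into a data frame with a "proportion" column and with column names taken from the input variables. As an illustration only, here is a minimal Python sketch of that reshaping. The function name `flatten_matrix` and the sample domains and proportions are hypothetical and not taken from the repository, whose own implementation is presumably in R. The row order (first variable varying fastest) is likewise an assumption.

```python
def flatten_matrix(matrix, row_labels, col_labels, row_name, col_name):
    """Flatten a 2D matrix of proportions into long-format rows.

    Hypothetical helper for illustration; not the repository's code.
    The output keys come from the caller (the input variable names),
    rather than generic names such as "Var1" and "Var2".
    """
    rows = []
    # Assumed ordering: iterate column by column, so the first
    # variable (row label) varies fastest.
    for j, col_label in enumerate(col_labels):
        for i, row_label in enumerate(row_labels):
            rows.append({
                row_name: row_label,
                col_name: col_label,
                "proportion": matrix[i][j],
            })
    return rows


# Made-up sample data: a string variable crossed with a boolean variable.
domains = ["google.com", "example.com"]
https_flags = [False, True]
joint = [
    [0.1, 0.4],  # google.com:  P(flag=False), P(flag=True)
    [0.2, 0.3],  # example.com: P(flag=False), P(flag=True)
]

rows = flatten_matrix(joint, domains, https_flags, "domain", "flags..HTTPS")
for row in rows:
    print(row)
```

Each output row carries the keys "domain", "flags..HTTPS", and "proportion", which matches the column naming described in commit 6df7403e17.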