Merge pull request #116 from microsoft/dev

Version 1.8.5
This commit is contained in:
Rodrigo Martins Racanicci 2022-12-19 10:35:19 -03:00 committed by GitHub
Parent e688ee1c9a 29bb60904e
Commit bf058366aa
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
14 changed files: 1278 additions and 1092 deletions

Cargo.lock (generated, 8 changed lines)
View file

@@ -765,7 +765,7 @@ checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd"
 [[package]]
 name = "sds-cli"
-version = "1.8.4"
+version = "1.8.5"
 dependencies = [
  "csv",
  "env_logger",
@@ -777,7 +777,7 @@ dependencies = [
 [[package]]
 name = "sds-core"
-version = "1.8.4"
+version = "1.8.5"
 dependencies = [
  "csv",
  "fnv",
@@ -796,7 +796,7 @@ dependencies = [
 [[package]]
 name = "sds-pyo3"
-version = "1.8.4"
+version = "1.8.5"
 dependencies = [
  "csv",
  "env_logger",
@@ -807,7 +807,7 @@ dependencies = [
 [[package]]
 name = "sds-wasm"
-version = "1.8.4"
+version = "1.8.5"
 dependencies = [
  "console_error_panic_hook",
  "csv",

View file

@@ -13,46 +13,46 @@
 In many cases, the best way to share sensitive datasets is not to share the actual sensitive datasets, but user interfaces to derived datasets that are inherently anonymous. Our name for such an interface is a _data showcase_. In this project, we provide an automated set of tools for generating the three elements of a _synthetic data showcase_:
 1. _Synthetic data_ representing the overall structure and statistics of the input data, without describing actual identifiable individuals.
-2. _Aggregate data_ reporting the number of individuals with different combinations of attributes, without disclosing precise counts.
+2. _Aggregate data_ reporting the number of individuals with different combinations of attributes, without disclosing exact counts.
 3. _Data dashboards_ enabling exploratory visual analysis of both datasets, without the need for custom data science or interface development.
-To generate such elements, our tools provide two approaches to anonymize data: (i) k-anonymity and (ii) differential privacy (DP).
-# K-anonymity
-## Privacy guarantees
-The main privacy control offered by the tools is based on the numbers of individuals described by different combinations of attributes. The `resolution` determines the minimum group size that will be (a) reported explicitly in the aggregate data and (b) represented implicitly by the records of the synthetic data. This makes it possible to offer privacy guarantees in clearly understandable terms, e.g.:
-"All attribute combinations in this synthetic dataset describe groups of 10 or more individuals in the original sensitive dataset, therefore may never be used to infer the presence of individuals or groups smaller than 10."
-Under such guarantees, it is impossible for attackers to infer the presence of groups whose size is below the `resolution`. For groups at or above this resolution, the 'safety in numbers' principle applies – the higher the limit, the harder it becomes to make inferences about the presence of known individuals.
-This anonymization method can be viewed as enforcing [k-anonymity](https://en.wikipedia.org/wiki/K-anonymity) across all columns of a sensitive dataset. While typical implementations of k-anonymity divide data columns into quasi-identifiers and sensitive attributes, only enforcing k-anonymity over quasi-identifiers leaves the remaining attributes open to linking attacks based on background knowledge. The data synthesis approach used to create a synthetic data showcase safeguards against such attacks while preserving the structure and statistics of the sensitive dataset.
-## Usage
-Use of k-anonymity synthesizers is recommended for **one-off data releases** where the accuracy of attribute counts is critical.
-These methods are designed to offer strong group-level protection against **membership inference**, i.e., preventing an adversary from inferring whether a known individual or small group of individuals is present in the sensitive dataset.
-They should not be used in situations where **attribute inference** from **homogeneity attacks** are a concern, i.e., when an adversary knows that a certain individual is present in the sensitive dataset, identifies them as part of a group sharing known attributes, and then infers previously unknown attributes of the individual because those attributes are common to the group.
+To generate these elements, our tool provides two approaches to create anonymous datasets that are safe to release: (i) differential privacy and (ii) k-anonymity.
 # Differential privacy
 ## Privacy guarantees
 Differential privacy is not a tool, but a set of mathematical techniques that can be used to protect data. Protection is accomplished by adding some uncertainty (noise) to the data, up to a level that achieves the protection desired by the user (privacy budget).
 The paradigm of differential privacy (DP) offers "safety in noise" – just enough calibrated noise is added to the data to control the maximum possible privacy loss, $\varepsilon$ (epsilon). When applied in the context of private data release, $\varepsilon$ bounds the ratio of probabilities of getting an arbitrary result to an arbitrary computation when using two synthetic datasets – one generated from the sensitive dataset itself and the other from a neighboring dataset missing a single arbitrary record.
-This tool, protects attribute combination counts in the aggregate data with differential privacy [**`(epsilon, delta)-DP`**](https://en.wikipedia.org/wiki/Differential_privacy), and then uses the resulting DP aggregate counts to derive synthetic records that retain differential privacy under the post-processing property.
+Our approach to synthesizing data with differential privacy first protects attribute combination counts in the aggregate data using our [DP Marginals](./docs/dp/dp_marginals.pdf) algorithm and then uses the resulting DP aggregate counts to derive synthetic records that retain differential privacy under the post-processing property.
 > For a detailed explanation of how SDS uses differential privacy, please check our [DP documentation](./docs/dp/README.md).
 ## Usage
-Use of differential privacy synthesizers is recommended for **repeated data releases** where cumulative privacy loss must be quantified and controlled, where **attribute inference** from **homogeneity attacks** is a concern, or where provable guarantees against all possible privacy attacks are desired.
+Use of our differential privacy synthesizer is recommended for **repeated data releases** where cumulative privacy loss must be quantified and controlled and where provable guarantees against all possible privacy attacks are desired.
-They should be used with caution, however, whenever missing, fabricated, or inaccurate counts of attribute combinations could trigger inappropriate downstream decisions or actions.
+Any differentially-private dataset should be evaluated for potential risks in situations where missing, fabricated, or inaccurate counts of attribute combinations could trigger inappropriate downstream decisions or actions. Our DP synthesizer prioritizes the release of accurate counts (with minimal noise) of actual combinations (with minimal fabrication).
+# K-anonymity
+## Privacy guarantees
+The paradigm of k-anonymity offers "safety in numbers" – combinations of attributes are only released when they occur at least k times in the sensitive dataset. When applied in the context of private data release, we interpret k as a privacy resolution determining the minimum group size that will be (a) reported explicitly in the aggregate dataset and (b) represented implicitly by the records of the synthetic dataset. This makes it possible to offer privacy guarantees in clearly understandable terms, e.g.:
+"All attribute combinations in this synthetic dataset describe groups of 10 or more individuals in the original sensitive dataset, and therefore may never be used to infer the presence of individuals or groups smaller than 10."
+Our approach to synthesizing data with k-anonymity overcomes many of the limitations of standard [k-anonymization](https://en.wikipedia.org/wiki/K-anonymity), in which attributes of sensitive data records are generalized and suppressed until k-anonymity is reached, and only for those attributes determined in advance to be potentially identifying when used in combination (so-called quasi-identifiers). In this standard approach, all remaining sensitive attributes are released so long as k-anonymity holds for the designated quasi-identifiers. This makes the records (and thus subjects) of k-anonymized datasets susceptible to linking attacks based on auxiliary data or background knowledge.
+In contrast, our k-anonymity synthesizers generate synthetic records that do not represent actual individuals, yet are composed exclusively from common combinations of attributes in the sensitive dataset. The k-anonymity guarantee therefore holds for all data columns and all combinations of attributes.
+## Usage
+Use of our k-anonymity synthesizers is recommended only for **one-off data releases** where there is a need for precise counts of attribute combinations (at a given privacy resolution).
+These synthesizers are designed to offer strong group-level protection against membership inference, i.e., preventing an adversary from inferring whether a known individual or small group of individuals is present in the sensitive dataset.
+They should not be used in situations where attribute inference from homogeneity attacks is a concern, i.e., when an adversary knows that a certain individual is present in the sensitive dataset, identifies them as part of a group sharing known attributes, and then infers previously unknown attributes of the individual because those attributes are common to the group.
 # Quick setup
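The epsilon bound that the README diff above describes verbally (a bound on the ratio of output probabilities between a dataset and a neighbor missing one record) is conventionally written as the standard (ε, δ)-DP definition. The formula below is supplied for reference and is not quoted verbatim from the repository docs:

```latex
% (epsilon, delta)-differential privacy: for every pair of neighboring
% datasets D and D' (differing in a single record) and every set S of
% possible outputs, a randomized mechanism M must satisfy:
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

Setting δ = 0 recovers pure ε-DP; the δ term allows a small probability of exceeding the e^ε ratio bound.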

View file

@@ -22,29 +22,29 @@
   "prettier": "@essex/prettier-config",
   "devDependencies": {
     "@essex/eslint-config": "^20.3.5",
-    "@essex/eslint-plugin": "^20.3.10",
-    "@essex/jest-config": "^21.0.15",
-    "@essex/prettier-config": "^18.0.3",
-    "@essex/scripts": "^22.0.8",
+    "@essex/eslint-plugin": "^20.3.12",
+    "@essex/jest-config": "^21.0.17",
+    "@essex/prettier-config": "^18.0.4",
+    "@essex/scripts": "^22.2.0",
     "@jsdevtools/npm-publish": "^1.4.3",
-    "@types/eslint": "^8.4.6",
+    "@types/eslint": "^8.4.10",
     "@types/prettier": "^2.7.1",
-    "@typescript-eslint/eslint-plugin": "^5.39.0",
-    "@typescript-eslint/parser": "^5.39.0",
-    "eslint": "^8.25.0",
+    "@typescript-eslint/eslint-plugin": "^5.46.1",
+    "@typescript-eslint/parser": "^5.46.1",
+    "eslint": "^8.29.0",
     "eslint-import-resolver-node": "^0.3.6",
     "eslint-plugin-header": "^3.1.1",
     "eslint-plugin-import": "^2.26.0",
-    "eslint-plugin-jest": "^27.1.1",
+    "eslint-plugin-jest": "^27.1.7",
     "eslint-plugin-jsx-a11y": "^6.6.1",
-    "eslint-plugin-react": "^7.31.9",
+    "eslint-plugin-react": "^7.31.11",
     "eslint-plugin-react-hooks": "^4.6.0",
     "eslint-plugin-simple-import-sort": "^8.0.0",
-    "husky": "^8.0.1",
-    "lint-staged": "^13.0.3",
+    "husky": "^8.0.2",
+    "lint-staged": "^13.1.0",
     "npm-run-all": "^4.1.5",
-    "prettier": "^2.7.1",
-    "replace": "^1.2.1",
+    "prettier": "^2.8.1",
+    "replace": "^1.2.2",
     "typescript": "^4.8.4"
   },
   "workspaces": [

View file

@@ -1,6 +1,6 @@
 [package]
 name = "sds-cli"
-version = "1.8.4"
+version = "1.8.5"
 license = "MIT"
 description = "Command line interface for the sds-core library"
 repository = "https://github.com/microsoft/synthetic-data-showcase"

View file

@@ -22,8 +22,8 @@
   "license": "MIT",
   "devDependencies": {
     "@types/mime": "^3.0.1",
-    "@types/node": "^16.11.64",
-    "@types/react": "^17.0.50",
+    "@types/node": "^16.18.9",
+    "@types/react": "^17.0.52",
     "npm-run-all": "^4.1.5",
     "react": "^17.0.2",
     "shx": "^0.3.4",
@@ -34,8 +34,8 @@
     "react": "^17.0.2"
   },
   "dependencies": {
-    "@griffel/react": "^1.4.0",
+    "@griffel/react": "^1.5.1",
     "mime": "^3.0.0",
-    "react-dropzone": "^14.2.2"
+    "react-dropzone": "^14.2.3"
   }
 }

View file

@@ -1,6 +1,6 @@
 [package]
 name = "sds-core"
-version = "1.8.4"
+version = "1.8.5"
 license = "MIT"
 description = "Synthetic data showcase core library"
 repository = "https://github.com/microsoft/synthetic-data-showcase"

View file

@@ -420,7 +420,7 @@ impl AggregatedData {
                 .entry(Arc::new(value_combination))
                 .or_insert_with(AggregatedCount::default);
-            (*max_count).count = max_count.count.max(count.count);
+            max_count.count = max_count.count.max(count.count);
         }
     }
 }
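The change in this hunk is purely cosmetic: `HashMap::entry(...).or_insert_with(...)` returns a `&mut` reference, and Rust auto-dereferences references on field access, so `(*max_count).count` and `max_count.count` compile to the same thing. A minimal standalone sketch of the pattern (the type and function names are illustrative, not the actual sds-core API):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct AggregatedCount {
    count: usize,
}

/// Track the maximum count observed per key, mirroring the merge logic
/// in the diff above. `or_insert_with` yields `&mut AggregatedCount`;
/// field access auto-derefs, so no explicit `(*...)` is needed.
fn merge_max(map: &mut HashMap<String, AggregatedCount>, key: &str, count: usize) {
    let max_count = map
        .entry(key.to_string())
        .or_insert_with(AggregatedCount::default);
    max_count.count = max_count.count.max(count);
}
```

Calling `merge_max` repeatedly for the same key keeps the largest count seen so far.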

View file

@@ -43,8 +43,8 @@ impl Evaluator {
                 .entry(sensitive_comb.len())
                 .or_insert((0.0, 0));
-            (*err_sum_count).0 += err as f64;
-            (*err_sum_count).1 += 1;
+            err_sum_count.0 += err as f64;
+            err_sum_count.1 += 1;
         }
     }
     error_sum_count_by_len
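This hunk applies the same auto-deref cleanup to tuple fields: the `&mut (f64, usize)` returned by `or_insert` can be used directly, since `(*err_sum_count).0` is equivalent to `err_sum_count.0`. A self-contained sketch of the accumulate-then-average pattern (illustrative names, not the actual Evaluator API):

```rust
use std::collections::HashMap;

/// Accumulate a running (error_sum, sample_count) pair per combination
/// length, as the evaluator hunk above does. Tuple-field access on the
/// `&mut (f64, usize)` entry auto-derefs, so no explicit `(*...)` is needed.
fn accumulate(by_len: &mut HashMap<usize, (f64, usize)>, len: usize, err: f64) {
    let err_sum_count = by_len.entry(len).or_insert((0.0, 0));
    err_sum_count.0 += err;
    err_sum_count.1 += 1;
}

/// Mean error for a given combination length (panics if `len` is absent).
fn mean_error(by_len: &HashMap<usize, (f64, usize)>, len: usize) -> f64 {
    let (sum, n) = by_len[&len];
    sum / n as f64
}
```

Keeping the sum and count separate lets the mean be derived lazily, only for the lengths that are actually queried.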

View file

@@ -1,6 +1,6 @@
 [package]
 name = "sds-pyo3"
-version = "1.8.4"
+version = "1.8.5"
 license = "MIT"
 description = "Python bindings for the sds-core library"
 repository = "https://github.com/microsoft/synthetic-data-showcase"

View file

@@ -1,6 +1,6 @@
 [package]
 name = "sds-wasm"
-version = "1.8.4"
+version = "1.8.5"
 license = "MIT"
 description = "Web Assembly bindings for the sds-core library"
 repository = "https://github.com/microsoft/synthetic-data-showcase"

View file

@@ -1,6 +1,6 @@
 {
   "name": "webapp",
-  "version": "1.8.4",
+  "version": "1.8.5",
   "private": true,
   "license": "MIT",
   "main": "src/index.ts",
@@ -22,51 +22,51 @@
     "@essex/arquero": "^2.0.3",
     "@essex/arquero-react": "^1.1.0",
     "@essex/sds-core": "workspace:^",
-    "@fluentui/font-icons-mdl2": "^8.5.1",
-    "@fluentui/react": "^8.98.0",
-    "@fluentui/react-hooks": "^8.6.11",
-    "@fluentui/utilities": "^8.13.1",
+    "@fluentui/font-icons-mdl2": "^8.5.4",
+    "@fluentui/react": "^8.103.9",
+    "@fluentui/react-hooks": "^8.6.14",
+    "@fluentui/utilities": "^8.13.4",
     "@sds/components": "workspace:^",
     "@thematic/core": "^3.1.0",
     "@thematic/d3": "^2.0.13",
     "@thematic/fluent": "^4.1.0",
     "@thematic/react": "^2.1.0",
     "@types/mime": "^3.0.1",
-    "@uifabric/icons": "7.9.4",
+    "@uifabric/icons": "7.9.5",
     "arquero": "^5.1.0",
     "chart.js": "^3.9.1",
-    "chartjs-plugin-datalabels": "^2.1.0",
+    "chartjs-plugin-datalabels": "^2.2.0",
     "comlink": "^4.3.1",
-    "dompurify": "^2.4.0",
+    "dompurify": "^2.4.1",
     "formik": "^2.2.9",
     "lodash": "^4.17.21",
-    "marked": "^4.1.1",
+    "marked": "^4.2.4",
     "mime": "^3.0.0",
     "react": "^17.0.2",
     "react-chartjs-2": "^4.3.1",
     "react-dom": "^17.0.2",
     "react-is": "^17.0.2",
-    "react-router-dom": "^6.4.2",
-    "recoil": "^0.7.5",
+    "react-router-dom": "^6.4.5",
+    "recoil": "^0.7.6",
     "styled-components": "^5.3.6",
     "uuid": "^9.0.0",
     "yup": "^0.32.11"
   },
   "devDependencies": {
-    "@types/dompurify": "^2.3.4",
-    "@types/lodash": "^4.14.186",
-    "@types/marked": "^4.0.7",
-    "@types/node": "^16.11.64",
-    "@types/react": "^17.0.50",
-    "@types/react-dom": "^17.0.17",
+    "@types/dompurify": "^2.4.0",
+    "@types/lodash": "^4.14.191",
+    "@types/marked": "^4.0.8",
+    "@types/node": "^16.18.9",
+    "@types/react": "^17.0.52",
+    "@types/react-dom": "^17.0.18",
     "@types/react-is": "^17.0.3",
     "@types/recoil": "^0.0.9",
     "@types/styled-components": "^5.1.26",
-    "@types/uuid": "^8.3.4",
-    "@vitejs/plugin-react": "^2.1.0",
+    "@types/uuid": "^9.0.0",
+    "@vitejs/plugin-react": "^3.0.0",
     "ts-node": "^10.9.1",
     "typescript": "^4.8.4",
-    "vite": "^3.1.7",
-    "vite-tsconfig-paths": "^3.5.1"
+    "vite": "^4.0.1",
+    "vite-tsconfig-paths": "^4.0.3"
   }
 }

Binary file not shown.

View file

@@ -4,11 +4,11 @@ In general, this set of methods proceeds by sampling attributes until the additi
 Since precise attribute counts constitute a privacy risk, it is advisable to create some uncertainty over the actual counts by adding noise to the synthetic data. The same **`privacy resolution`** is used again here to suppress attributes or synthesize additional records such that synthetic attribute counts are equal to the (already imprecise) reported aggregate count.
-Use of k-anonymity synthesizers is recommended for **one-off data releases** where the accuracy of attribute counts is critical.
+Use of our k-anonymity synthesizers is recommended only for **one-off data releases** where there is a need for precise counts of attribute combinations (at a given privacy resolution).
-These methods are designed to offer strong group-level protection against **membership inference**, i.e., preventing an adversary from inferring whether a known individual or small group of individuals is present in the sensitive dataset.
+These synthesizers are designed to offer strong group-level protection against membership inference, i.e., preventing an adversary from inferring whether a known individual or small group of individuals is present in the sensitive dataset.
-They should not be used in situations where **attribute inference** from **homogeneity attacks** are a concern, i.e., when an adversary knows that a certain individual is present in the sensitive dataset, identifies them as part of a group sharing known attributes, and then infers previously unknown attributes of the individual because those attributes are common to the group.
+They should not be used in situations where attribute inference from homogeneity attacks is a concern, i.e., when an adversary knows that a certain individual is present in the sensitive dataset, identifies them as part of a group sharing known attributes, and then infers previously unknown attributes of the individual because those attributes are common to the group.
 **`Row Seeded`**:
@@ -44,9 +44,9 @@ They should not be used in situations where **attribute inference** from **homog
 This method protects aggregate counts with differential privacy [**`(epsilon, delta)-DP`**] and then uses the resulting DP aggregate counts to derive synthetic records that retain differential privacy under the post-processing property.
-Use of differential privacy synthesizers is recommended for **repeated data releases** where cumulative privacy loss must be quantified and controlled, where **attribute inference** from **homogeneity attacks** is a concern, or where provable guarantees against all possible privacy attacks are desired.
+Use of our differential privacy synthesizer is recommended for **repeated data releases** where cumulative privacy loss must be quantified and controlled and where provable guarantees against all possible privacy attacks are desired.
-They should be used with caution, however, whenever missing, fabricated, or inaccurate counts of attribute combinations could trigger inappropriate downstream decisions or actions.
+Any differentially-private dataset should be evaluated for potential risks in situations where missing, fabricated, or inaccurate counts of attribute combinations could trigger inappropriate downstream decisions or actions. Our DP synthesizer prioritizes the release of accurate counts (with minimal noise) of actual combinations (with minimal fabrication).
 **`DP Aggregate Seeded`**:
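The `(epsilon, delta)-DP` count protection this hunk describes can be illustrated with a toy Laplace mechanism on a single aggregate count. This is a hedged sketch only: it uses a small deterministic LCG instead of a cryptographically secure RNG, ignores the delta/thresholding side entirely, and none of these names come from sds-core:

```rust
/// Tiny deterministic LCG returning a uniform sample in [0, 1).
/// For illustration only -- a real DP implementation needs a secure RNG.
fn lcg_uniform(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    // take the top 53 bits so the result fits an f64 mantissa exactly
    ((*state >> 11) as f64) / ((1u64 << 53) as f64)
}

/// Sample Laplace(0, scale) noise via the inverse-CDF transform.
fn laplace_noise(scale: f64, state: &mut u64) -> f64 {
    let u = lcg_uniform(state) - 0.5; // uniform in [-0.5, 0.5)
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

/// Add noise calibrated to sensitivity 1 and privacy budget epsilon,
/// so each released aggregate count is perturbed at scale 1/epsilon.
fn noisy_count(count: u64, epsilon: f64, state: &mut u64) -> f64 {
    count as f64 + laplace_noise(1.0 / epsilon, state)
}
```

Post-processing the noisy counts (rounding, or deriving synthetic records from them) does not consume additional budget, which is the post-processing property the synthesizer relies on.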

yarn.lock (2208 changed lines)
Diff not shown due to its large size.