A toolkit for conducting machine learning trials against confidential data

Secure Data Sandbox

SDS IS UNDER CONSTRUCTION AND NOT USABLE AT THIS POINT. THIS PAGE WILL BE UPDATED AS FUNCTIONALITY BECOMES AVAILABLE.

SDS is a secure execution environment for conducting machine learning trials against confidential data.

The goal of SDS is to enable collaboration between data scientists and organizations with interesting problems. The challenge is that interesting problems come with interesting data sets that are almost always proprietary. These data sets are rich with trade secrets and personally identifiable information, and are usually encumbered by contracts, regulated by statute, and subject to corporate data stewardship policies.

In-house data science departments know how to work with this data, but compliance issues make it hard for them to collaborate with third parties and experts from industry and academia.

SDS aims to solve this problem by creating a sandbox for machine learning experiments inside the environment that hosts sensitive data. With SDS, an organization can host machine learning challenges and invite third parties to submit solutions for evaluation against sensitive data that would otherwise be unavailable.

Try SDS

Building SDS

SDS is a Node.js project written in TypeScript. To build and run SDS you must have Node.js installed on your machine. SDS has been tested with Node version 12.16.3.
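If you want to check your Node version from a script, a minimal sketch follows. The version-parsing helper is our own illustration, not part of SDS:

```typescript
// Hypothetical helper (not part of SDS): extract the major version
// from a Node version string such as "v12.16.3".
function majorVersion(version: string): number {
  return Number(version.replace(/^v/, '').split('.')[0]);
}

// process.version reports the running Node version, e.g. "v12.16.3".
if (majorVersion(process.version) < 12) {
  console.error('SDS has been tested with Node 12.x; older versions may not work.');
}
```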

Here are the steps for cloning and building SDS:

% git clone https://github.com/microsoft/secure-data-sandbox.git
% cd secure-data-sandbox
% npm install
% npm run compile

Running SDS Locally

Now that we've built SDS, let's run a local instance of the Laboratory service. This local instance does not have a worker pool, so it won't be able to actually run tests, but it allows you to get a feel for the CLI commands. Note that the local instance does not run in a secure environment.

Open two shell windows. In the first window, start the laboratory service:

% npm run laboratory

We can run the CLI from the second shell window. Let's start with the help command:

% npm run cli help

Usage: sds [options] [command]

Secure Data Sandbox CLI

Options:
  -h, --help                   display help for command

Commands:
  connect [service]            Connect to a Laboratory [service] or print connection info.
  create <type> <spec>         Create a benchmark, candidate, or suite from a specification where <type> is either "benchmark", "candidate", or
                               "suite".
  demo                         Configures Laboratory service with demo data.
  deploy <server>              NOT YET IMPLEMENTED. Deploy a Laboratory service.
  examples                     Show usage examples.
  list <type>                  Display summary information about benchmarks, candidates, runs, and suites.
  results <benchmark> <suite>  Display the results of all runs against a named benchmark and suite.
  run <candidate> <suite>      Run a named <candidate> against a named <suite>.
  show <type> [name]           Display all benchmarks, candidates, suites, or runs. If optional [name] is specified, only show matching items.
  help [command]               display help for command

For more information and examples, see https://github.com/microsoft/secure-data-sandbox/blob/main/laboratory/README.md

The first thing we need to do is connect the CLI to the laboratory service we just started. Currently, the service (packages/server/dist/main.js) listens on port 3000 of localhost.

% npm run cli connect http://localhost:3000

Connected to http://localhost:3000/.

This writes the connection information to ~/.sds, which is consulted every time the CLI is run. If you don't connect to a Laboratory, you will get the following error:

% npm run cli list benchmark

Error: No laboratory connection. Use the "connect" command to specify a laboratory.
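The idea behind the connection file can be sketched as follows. This is a hypothetical illustration only; the helper names and the JSON layout are our assumptions, not the real ~/.sds format used by the SDS CLI:

```typescript
import { homedir } from 'os';
import { join } from 'path';
import { existsSync, readFileSync, writeFileSync } from 'fs';

// Hypothetical sketch of connection persistence; the real ~/.sds
// format used by the SDS CLI may differ.
const configPath = join(homedir(), '.sds');

// Record the laboratory endpoint so later CLI invocations can find it.
function connect(endpoint: string): void {
  writeFileSync(configPath, JSON.stringify({ endpoint }));
}

// Every other command consults the saved connection, failing if absent.
function currentEndpoint(): string {
  if (!existsSync(configPath)) {
    throw new Error(
      'No laboratory connection. Use the "connect" command to specify a laboratory.'
    );
  }
  return JSON.parse(readFileSync(configPath, 'utf8')).endpoint;
}
```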

Now that we're connected to a Laboratory service, we can use the demo command to populate the server with sample data, including:

  • A benchmark
  • A candidate
  • A suite
  • Two runs with results

% npm run cli demo

=== Sample benchmark ===
name: benchmark1
author: author1
stages:
  - name: candidate
    kind: candidate
    volumes:
      - volume: training
        path: /input
  - name: scoring
    image: benchmark-image
    kind: container
    volumes:
      - volume: reference
        path: /reference


=== Sample candidate ===
name: candidate1
author: author1
benchmark: benchmark1
image: candidate1-image


=== Sample suite ===
name: suite1
author: author1
benchmark: benchmark1
volumes:
  - name: training
    type: AzureBlob
    target: 'https://sample.blob.core.windows.net/training'
  - name: reference
    type: AzureBlob
    target: 'https://sample.blob.core.windows.net/reference'


Initiated run 0db6c510-d059-11ea-ab64-31e44163fc86
Initiated run 0dba4780-d059-11ea-ab64-31e44163fc86
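The YAML specifications above follow a consistent shape. They could be modeled with interfaces like these, inferred from the sample output; the authoritative type definitions live in the SDS packages and may differ:

```typescript
// Shapes inferred from the sample YAML above; hypothetical, not the
// real SDS type definitions.
interface VolumeMount {
  volume: string; // name of a volume declared by the suite
  path: string;   // mount point inside the container
}

interface Stage {
  name: string;
  kind: 'candidate' | 'container';
  image?: string; // container stages name a fixed image; the candidate stage runs the submitted one
  volumes?: VolumeMount[];
}

interface Benchmark {
  name: string;
  author: string;
  stages: Stage[];
}

// The sample benchmark from the demo output, expressed in these shapes.
const benchmark1: Benchmark = {
  name: 'benchmark1',
  author: 'author1',
  stages: [
    {
      name: 'candidate',
      kind: 'candidate',
      volumes: [{ volume: 'training', path: '/input' }],
    },
    {
      name: 'scoring',
      kind: 'container',
      image: 'benchmark-image',
      volumes: [{ volume: 'reference', path: '/reference' }],
    },
  ],
};

console.log(benchmark1.stages.length);
```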

If we didn't want to use the built-in demo command, we could have created the benchmark, candidate, suite, and runs manually as follows:

% npm run cli create benchmark sample-data/benchmark1.yaml
benchmark created

% npm run cli create candidate sample-data/candidate1.yaml
candidate created

% npm run cli create suite sample-data/suite1.yaml
suite created

% npm run cli run candidate1 suite1
Scheduling run 1dae9970-d059-11ea-ab64-31e44163fc86

% npm run cli run candidate1 suite1
Scheduling run 1fbe1880-d059-11ea-ab64-31e44163fc86

The demo command does one thing we can't do through the CLI: it pretends to be a worker and reports status for the runs.

List benchmarks, candidates, suites

% npm run cli list benchmark
name         submitter   date
benchmark1   author1     2020-07-27 22:32:28 UTC

% npm run cli list candidate
name         submitter   date  
candidate1   author1     2020-07-27 22:32:28 UTC

% npm run cli list suite
name     submitter   date
suite1   author1     2020-07-27 22:32:28 UTC

Show benchmarks, candidates, suites

% npm run cli show benchmark benchmark1
stages:
  - name: candidate
    kind: candidate
    volumes:
      - volume: training
        path: /input
  - name: scoring
    kind: container
    image: benchmark-image
    volumes:
      - volume: reference
        path: /reference
name: benchmark1
author: author1
createdAt: 2020-07-27T22:32:28.865Z
updatedAt: 2020-07-27T22:32:43.284Z


% npm run cli show candidate candidate1
name: candidate1
author: author1
benchmark: benchmark1
image: candidate1-image
createdAt: 2020-07-27T22:32:28.883Z
updatedAt: 2020-07-27T22:32:47.384Z


% npm run cli show suite suite1
volumes:
  - name: training
    type: AzureBlob
    target: 'https://sample.blob.core.windows.net/training'
  - name: reference
    type: AzureBlob
    target: 'https://sample.blob.core.windows.net/reference'
name: suite1
author: author1
benchmark: benchmark1
createdAt: 2020-07-27T22:32:28.889Z
updatedAt: 2020-07-27T22:32:50.623Z

List runs

% npm run cli list run
name                                   submitter   date                      candidate    suite    status   
0db6c510-d059-11ea-ab64-31e44163fc86   unknown     2020-07-27 22:32:28 UTC   candidate1   suite1   completed
0dba4780-d059-11ea-ab64-31e44163fc86   unknown     2020-07-27 22:32:28 UTC   candidate1   suite1   completed
1dae9970-d059-11ea-ab64-31e44163fc86   unknown     2020-07-27 22:32:55 UTC   candidate1   suite1   created  
1fbe1880-d059-11ea-ab64-31e44163fc86   unknown     2020-07-27 22:32:59 UTC   candidate1   suite1   created  

Displaying Run Results

% npm run cli results benchmark1 suite1

run                                    submitter   date                      passed   failed   skipped
0db6c510-d059-11ea-ab64-31e44163fc86   unknown     2020-07-27 22:32:28 UTC        5        6       ---
0dba4780-d059-11ea-ab64-31e44163fc86   unknown     2020-07-27 22:32:28 UTC        3      ---         7
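The passed/failed/skipped columns lend themselves to simple post-processing. A hypothetical sketch, where the record shape mirrors the table above and a '---' cell is treated as zero:

```typescript
// Hypothetical record mirroring the results columns above;
// a '---' in the table is represented here as 0.
interface RunResult {
  passed: number;
  failed: number;
  skipped: number;
}

// Fraction of attempted (non-skipped) tests that passed.
function passRate(r: RunResult): number {
  const attempted = r.passed + r.failed;
  return attempted === 0 ? 0 : r.passed / attempted;
}

console.log(passRate({ passed: 5, failed: 6, skipped: 0 }));
```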

Deploying SDS to the cloud

Requirements

Deployment requires the Azure CLI (az) and a bash shell. Run the following commands:

# Create a resource group to hold all the sandbox resources
az group create -n sandbox -l southcentralus

# Deploy an instance of the sandbox
./deploy/deploy.sh -g sandbox

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.