
Training Service

The training service is responsible for communicating with a third-party service to train a new ML model; in our case, Azure Databricks. Using a different third party would require an implementation change in this service, but the other parts of the system should remain the same.

Request Flows

New Training Request

New training requests go through the following steps:

  1. A new request (POST) for training includes the following body:

    {
        "modelType": "MODEL",
        "parameters": {
            // Parameters key-values
        }
    }
    
  2. Retrieve the notebook path in Databricks according to the modelType ('MODEL' in this case).

    Note: modelType -> 'Databricks notebook path' mappings are defined in the JSON value of the DATABRICKS_TYPE_MAPPING environment variable (more details below).

  3. Start the Databricks cluster if it is in 'TERMINATED' state.

  4. Send a request to Databricks to run the notebook with the specified parameters.

  5. The response from Databricks includes a runId, which is returned in the response JSON in the following structure:

    {
        "runId": 123
    }
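
The flow above can be sketched in Node.js as follows. This is an illustrative sketch, not the service's actual code: the helper name buildRunRequest is hypothetical, and the payload shape assumes the Databricks Runs Submit REST endpoint (jobs/runs/submit).

```javascript
// Sketch of steps 2 and 4: resolve the notebook path for a modelType and
// build the body of a Databricks run-submission request. All names here
// are illustrative, not taken from the service's source.
function buildRunRequest(modelType, parameters, typeMapping, clusterId, timeoutSeconds) {
    const notebookPath = typeMapping[modelType];
    if (!notebookPath) {
        throw new Error(`Unknown modelType: ${modelType}`);
    }
    // Payload shape assumed from the Databricks Runs Submit API.
    return {
        existing_cluster_id: clusterId,
        timeout_seconds: timeoutSeconds,
        notebook_task: {
            notebook_path: notebookPath,
            base_parameters: parameters
        }
    };
}

// The mapping mirrors the DATABRICKS_TYPE_MAPPING variable described below.
const mapping = { wine: '/shared/wine_notebook' };
const request = buildRunRequest('wine', { alpha: '0.5' }, mapping, '1234-123456-hurts123', 3600);
console.log(request.notebook_task.notebook_path); // '/shared/wine_notebook'
```

Resolving the notebook path before contacting Databricks lets the service reject an unknown modelType immediately, without starting the cluster (step 3).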
    

Get Run Status Request

  1. A new request for run status is received with runId in the request path. Example: GET /123

  2. The service checks the run status on Databricks and returns a response that includes state and message in the following structure:

    {
        "state": "<Run Status>",
        "message": "<Run Message>"
    }
    
    • 'Run Status' is one of the following values: pending, running or completed
    • 'Run Message' is generally empty, but will include an error message if the run finishes unsuccessfully.
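
For illustration, reducing the Databricks run state to these three values might look like the sketch below. The field names follow the Databricks Runs Get API (life_cycle_state, result_state, state_message); the service's actual mapping may differ.

```javascript
// Collapse a Databricks run state into the service's pending/running/completed
// values. This is an assumed mapping, shown only to clarify the response shape.
function toRunStatus(databricksState) {
    const {
        life_cycle_state: lifeCycle,
        result_state: result,
        state_message: message
    } = databricksState;
    if (lifeCycle === 'PENDING') {
        return { state: 'pending', message: '' };
    }
    if (lifeCycle === 'RUNNING') {
        return { state: 'running', message: '' };
    }
    // Terminal states: surface any error message from the run.
    return {
        state: 'completed',
        message: result === 'SUCCESS' ? '' : (message || '')
    };
}

console.log(toRunStatus({ life_cycle_state: 'RUNNING' }));
// { state: 'running', message: '' }
console.log(toRunStatus({
    life_cycle_state: 'TERMINATED',
    result_state: 'FAILED',
    state_message: 'Notebook error'
}));
// { state: 'completed', message: 'Notebook error' }
```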

Environment Variables

The service expects several environment variables to be set in order to run:

Var                                Required  Description
PORT                               yes       Service port. default=80
DATABRICKS_WORKSPACE_URL           yes       Databricks Workspace URL
DATABRICKS_AUTH_TOKEN              yes       Authentication token for Databricks
DATABRICKS_CLUSTER_ID              yes       Databricks cluster ID. More information can be found in the Databricks documentation
DATABRICKS_RUN_TIMEOUT             yes       Run timeout for notebook runs
DATABRICKS_TYPE_MAPPING            yes       JSON object mapping MODEL to NOTEBOOK_PATH
NODE_ENV                           no        Set to 'test' for unit testing
APP_INSIGHTS_INSTRUMENTATION_KEY   no        Application Insights instrumentation key
SERVICE_NAME                       no        Service name for Application Insights logging

Sample environment variables

PORT=3000
DATABRICKS_WORKSPACE_URL=https://westeurope.azuredatabricks.net
DATABRICKS_AUTH_TOKEN=abcdefghi123456a123a1234a123456abc12
DATABRICKS_CLUSTER_ID=1234-123456-hurts123
DATABRICKS_RUN_TIMEOUT=3600
DATABRICKS_TYPE_MAPPING={"wine":"/shared/wine_notebook","diabetes":"/shared/diabetes_notebook"}
APP_INSIGHTS_INSTRUMENTATION_KEY=01e9c546-1234-1234-cf56-7d6b49fc053a
SERVICE_NAME=training
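
Because DATABRICKS_TYPE_MAPPING is a JSON string, the service has to parse it at startup. A minimal sketch of such a parse step (the function name is hypothetical; only the variable's format comes from the table above):

```javascript
// Parse DATABRICKS_TYPE_MAPPING (a JSON string) into a lookup object,
// failing fast on misconfiguration so errors surface at startup rather
// than on the first training request.
function parseTypeMapping(raw) {
    let typeMapping;
    try {
        typeMapping = JSON.parse(raw);
    } catch (err) {
        throw new Error(`DATABRICKS_TYPE_MAPPING is not valid JSON: ${err.message}`);
    }
    if (typeof typeMapping !== 'object' || typeMapping === null || Array.isArray(typeMapping)) {
        throw new Error('DATABRICKS_TYPE_MAPPING must be a JSON object of MODEL:NOTEBOOK_PATH pairs');
    }
    return typeMapping;
}

// Using the sample value above:
const typeMapping = parseTypeMapping(
    '{"wine":"/shared/wine_notebook","diabetes":"/shared/diabetes_notebook"}'
);
console.log(typeMapping.diabetes); // '/shared/diabetes_notebook'
```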

Build and Run with Docker

docker build . -t training-service
docker run --env-file=.env-file -p 127.0.0.1:3000:3000 training-service