Mirror of https://github.com/microsoft/datadrivenmodel.git

UPDATE: initial readme plus improved tests for sklearn and add max_rows arg

Parent: ca732bec49
Commit: 9147cd243c

README.md (211 changed lines)

@@ -1,223 +1,54 @@
# Data driven model creation for simulators to train brains on Bonsai

# Training Data-Driven or Surrogate Simulators

Tooling to simplify the creation and use of data driven simulators using supervised learning with the purpose of training brains with Project Bonsai. It digests data as csv and will generate simulation models which can then be directly used to train a reinforcement learning agent.

This repository provides a template for training data-driven simulators that can then be leveraged for training brains (reinforcement learning agents) with [Project Bonsai](https://docs.bons.ai/).

> Disclaimer: This is not an official Microsoft product. This application is considered an experimental addition to Microsoft Project Bonsai's software toolchain. Its primary goal is to reduce barriers of entry to use Project Bonsai's core Machine Teaching. Pull requests for fixes and small enhancements are welcome, but we do expect this to be replaced by out-of-the-box features of Project Bonsai in the near future.

:warning: Disclaimer: This is not an official Microsoft product. This application is considered an experimental addition to Microsoft's Project Bonsai toolbox. Its primary goal is to reduce barriers of entry to use Project Bonsai's core Machine Teaching. Pull requests for fixes and small enhancements are welcome, but we expect this to be replaced by out-of-the-box features of Project Bonsai shortly.

## Dependencies

This repository leverages [Anaconda](https://docs.conda.io/en/latest/miniconda.html) for Python virtual environments and all dependencies. Please install Anaconda or miniconda first and then run the following:

```bash
conda env update -f environment.yml
conda activate datadriven
conda activate ddm
```

## Main steps to follow

## Tests

`Step 1.` Obtain historical or surrogate sim data in csv format.

To get an understanding of the package, you may want to look at the tests in [`tests`](./tests), and the configuration files in [`conf`](./conf).

- header names
- a single row should be a slice in time
- ensure data ranges cover what we might set reinforcement learning to explore
- smooth noisy data, get rid of outliers
- remove NaN or N/A values (see the cleaning sketch below)
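
As a rough illustration of that kind of clean-up, here is a minimal pandas sketch; the file name `raw.csv`, the `Vm` column, the 3-sigma outlier cut, and the smoothing window are all illustrative assumptions, not part of the tool:

```python
import pandas as pd

# Load the raw export; every row is one time slice (hypothetical file/column names).
df = pd.read_csv("raw.csv")

# Drop rows with missing (NaN/NA) values.
df = df.dropna()

# Remove gross outliers, e.g. keep Vm within 3 standard deviations of its mean.
vm_mean, vm_std = df["Vm"].mean(), df["Vm"].std()
df = df[(df["Vm"] - vm_mean).abs() <= 3 * vm_std]

# Smooth noisy columns with a short rolling mean (window size is arbitrary here).
df["Vm"] = df["Vm"].rolling(window=5, min_periods=1).mean()

df.to_csv("cleaned.csv", index=False)
```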
## Usage

Refer to later sections for help with checking data quality before using this tool.

The scripts in this package expect that you have a dataset of CSVs or numpy arrays. If you are using a CSV, you should ensure your CSV has a header with unique column names describing your inputs to the model and the outputs of the model. In addition, your CSV should have a column for the episode index and another column for the iteration index.
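
For illustration only, a hypothetical dataset following that layout could be written like this (the `theta`/`alpha`/`Vm` column names are borrowed from the examples later in this README; everything else is made up):

```python
import pandas as pd

# Hypothetical logged dataset: episode/iteration indices plus state and action columns.
df = pd.DataFrame(
    {
        "episode": [1, 1, 1, 2, 2],
        "iteration": [1, 2, 3, 1, 2],
        "theta": [0.01, 0.02, 0.00, 0.03, 0.01],   # state
        "alpha": [0.00, 0.01, 0.02, 0.00, 0.01],   # state
        "Vm": [0.5, -0.2, 0.1, 0.3, -0.4],         # action
    }
)
df.to_csv("example_episodes.csv", index=False)  # header row with unique column names
```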
`Step 2.` Change the `config_model.yml` file in the `config/` folder

### Generating Logs from an Existing Simulator

Enter the csv file name. Enter the timelag, i.e. the number of rows or iterations that define the state transition in the data. A timelag of `1` will use every row of data: a change made to the system/process shows its effect in the next sample measurement.
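
As a sketch of what a timelag of `1` implies (not the tool's internal code), the features at iteration `t` are paired with the measurements at iteration `t + 1`; the file and column names below are hypothetical:

```python
import pandas as pd

timelag = 1
df = pd.read_csv("example_episodes.csv")  # hypothetical logged/cleaned dataset

state_cols, action_cols = ["theta", "alpha"], ["Vm"]

# Features: the state and action observed at iteration t ...
X = df[state_cols + action_cols].iloc[:-timelag].reset_index(drop=True)
# ... targets: the state measured `timelag` rows later.
y = df[state_cols].iloc[timelag:].reset_index(drop=True)
# (A real pipeline would also avoid pairing rows across episode boundaries.)
```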
For an example of how to generate logged datasets from a simulator using the Python SDK, take a look at the examples in the [samples repository](https://github.com/microsoft/microsoft-bonsai-api/tree/main/Python/samples); in particular, you can use the flags `--test-local True --log-iteration True` to generate CSV data that matches the schema used in this repository.

Define the names of the features as input to the simulator model you will create. The names should match the headers of the csv file you provide. Set the values as either `state` or `action`. Define the `output_name` matching the headers in your csv.

### Training Your Models

Define the model type as either `gb, poly, nn, or lstm`. Depending on the specific model one chooses, alter the hyperparameters in this config file as well.
```YAML
# Define csv file path to train a simulator with
DATA:
  path: ./csv_data/example_data.csv
  timelag: 1
# Define the inputs and outputs of datadriven simulator
IO:
  feature_name:
    theta: state
    alpha: state
    theta_dot: state
    alpha_dot: state
    Vm: action
  output_name:
    - theta
    - alpha
    - theta_dot
    - alpha_dot
# Select the model type gb, poly, nn, or lstm
MODEL:
  type: gb
# Polynomial Regression hyperparameters
POLY:
  degree: 1
# Gradient Boost hyperparameters
GB:
  n_estimators: 100
  lr: 0.1
  max_depth: 3
# MLP Neural Network hyperparameters
NN:
  epochs: 100
  batch_size: 512
  activation: linear
  n_layer: 5
  n_neuron: 12
  lr: 0.00001
  decay: 0.0000003
  dropout: 0.5
# LSTM Neural Network hyperparameters
LSTM:
  epochs: 100
  batch_size: 512
  activation: linear
  num_hidden_layer: 5
  n_neuron: 12
  lr: 0.00001
  decay: 0.0000003
  dropout: 0.5
  markovian_order: 2
  num_lstm_units: 1
```
`Step 3.` Run the tool

The scripts in this package leverage the configuration files saved in the [`conf`](./conf) folder to load CSV files, train and save models, and interface them to the Bonsai service. The library comes with a default configuration set in [`conf/config.yaml`](conf/config.yaml).

```bash
python datamodeler.py
python datamodeler2.py
```

The tool will ingest your csv file as input and create a simulator model of the type you selected. The resultant model will be placed into `models/`.
`Step 4.` Use the model directly

An adaptor class is available for building custom integrations in the following way. We've already done this for you in `Step 5`, but it provides a good understanding. Initialize the class with the model type, which is one of `'gb', 'poly', 'nn', or 'lstm'`.

Specify a `noise_percentage` to optionally add noise to the states of the simulator; leaving it at zero will not add noise. Training a brain can benefit from adding noise to the states of an approximated simulator to promote robustness.

Define the `action_space_dimensions` and the `state_space_dimensions`. The `markovian_order` is needed when setting the sequence length of the features for an `LSTM`.

```python
from predictor import ModelPredictor

predictor = ModelPredictor(
    modeltype="gb",
    noise_percentage=0,
    state_space_dim=4,
    action_space_dim=1,
    markovian_order=0
)
```

Calculate the next state as a function of the current state and current action. IMPORTANT: the input state and action are arrays, so you need to convert the brain action (a dictionary) to an array before feeding it into the predictor class.
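
For example, a conversion along the following lines is needed; the key name `Vm` and its ordering are assumptions based on the sample config above, not a fixed API:

```python
import numpy as np

# Hypothetical brain action as delivered by the Bonsai platform (a dictionary).
brain_action = {"Vm": 0.83}

# The ordering must match the action features the model was trained with.
action_order = ["Vm"]
action = np.array([brain_action[name] for name in action_order])

# The state dictionary can be flattened the same way before calling predictor.predict(...).
```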
```python
next_state = predictor.predict(action=action, state=state)
```

One thing to watch out for with data-driven simulators is that the approximations cannot be trusted when the feature inputs fall outside the range the model was trained on, i.e. you may get erroneous results. You can optionally evaluate whether this occurs by using the `warn_limitation()` functionality.

```python
features = np.concatenate([state, action])
predictor.warn_limitation(features)
```

> Sim should not be necessarily trusted since predicting with the feature Vm outside of range it was trained on, i.e. extrapolating.
`Step 5.` Train with Bonsai

Create a brain and write Inkling with type definitions that match what the simulator can provide, which you defined in `config_model.yml`. Run the `train_bonsai_main.py` file to register your newly created simulator. The integration is already done! Then connect the simulator to your brain.

Be sure to specify `noise_percentage` in your Inkling's scenario. Training a brain can benefit from adding noise to the states of an approximated simulator to promote robustness.

> The episode_start in `train_bonsai_main.py` expects the initial conditions of the states defined in `config_model.yml` to match the scenario dictionary passed in. If you want to pass in other variables that are not modeled by the datadrivenmodel tool (except for noise_percentage), you'll likely have to modify `train_bonsai_main.py`.

```javascript
lesson `Start Inverted` {
    scenario {
        theta: number<-1.4 .. 1.4>,
        alpha: number<-0.05 .. 0.05>, # reset inverted
        theta_dot: number<-0.05 .. 0.05>,
        alpha_dot: number<-0.05 .. 0.05>,
        noise_percentage: 5,
    }
}
```

> Ensure the SimConfig in Inkling matches the names of the headers in the `config_model.yml` to allow `train_bonsai_main.py` to work.
You can change any configuration parameter by specifying the configuration file you would like to change and its new path, i.e.,

```bash
python train_bonsai_main.py --workspace <workspace-id> --accesskey <accesskey>
python datamodeler2.py data=cartpole_st_at
```

## Optional Flags

which will use the configuration file in [`conf/data/cartpole_st_at.yaml`](./conf/data/cartpole_st_at.yaml).

### Use pickle instead of csv as data input

Name your dataset as `x_set.pickle` and `y_set.pickle`.

You can also override parameters of the configuration file by specifying their name:

```bash
python datamodeler.py --pickle <foldername>
python datamodeler2.py data.path=csv_data/cartpole_at_st.csv data.iteration_order=1
```

For example, one might use a `<foldername>` of `env_data`.

### Hyperparameter tuning

Gradient Boost should not require much tuning at all. Polynomial Regression may benefit from changing the order. Neural Networks, however, may require significant hyperparameter tuning. Use the following flag to randomly search over the ranges specified in the `model_config.yml` file.

```bash
python datamodeler.py --tune-rs=True
```
## LSTM

After creating an LSTM model for your sim, you can use the predictor class in the same way as the other models. The predictor class initializes a sequence and will continue to stack a history of state transitions and pop off the oldest information. In order to maintain a sequence of valid distributions when starting a sim using the LSTM model, the predictor class takes a single timestep of initial conditions and will automatically step through the sim using the mean value of each of the actions captured from the data in `model_limits.yml`.

```python
from predictor import ModelPredictor
import numpy as np

predictor = ModelPredictor(
    modeltype="lstm",
    noise_percentage=0,
    state_space_dim=4,
    action_space_dim=1,
    markovian_order=3
)

config = {
    'theta': 0.01,
    'alpha': 0.02,
    'theta_dot': 0.04,
    'alpha_dot': 0.05,
    'noise_percentage': 5,
}

predictor.reset_state(config)

for i in range(1):
    next_state = predictor.predict(
        action=np.array([0.83076303]),
        state=np.array(
            [0.6157155, 0.19910571, 0.15492792, 0.09268583]
        )
    )
    print('next_state: ', next_state)

print('history state: ', predictor.state)
print('history action: ', predictor.action_history)
```
The code snippet using the LSTM results in the following history of states and actions. Take note that `reset_state(config)` will auto-populate realistic trajectories for the sequence using the mean action. This means that the `0th` iteration of the simulation does not necessarily start where the user specified in the `config`. Continuing to `predict()` will step through the sim and maintain a history of the state transitions automatically for you, matching the sequence of information required for the LSTM.

```bash
next_state: [ 0.41453919 0.07664483 -0.13645072 0.81335021]
history state: deque([0.6157155, 0.19910571, 0.15492792, 0.09268583, 0.338321704451164, 0.018040116405597596, -0.5707406858943783, 0.3023940018967715, 0.01, 0.02, 0.04, 0.05], maxlen=12)
history action: deque([0.83076303, -0.004065768182411825, -0.004065768182411825], maxlen=3)
```

The script automatically saves your model to the path specified by `model.saver.filename`. An `outputs` directory is also saved with your configuration file and logs.
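
For the sklearn-based models, the saved artifact can be reloaded with the same class used for training. A minimal sketch, assuming a gradient-boosting model was saved to `models/xgboost_model` as in the sample config, and that `SKModel` is importable from `skmodels.py`:

```python
from skmodels import SKModel  # assumed import path for the class defined in skmodels.py

# Reload a previously saved sklearn-based model; with separate_models=True the
# loader expects a directory of per-output model{i}.pkl files written by save_model.
model = SKModel()
model.load_model(dir_path="models/xgboost_model", separate_models=True)

# predictions = model.predict(X)  # X: feature array shaped like the training inputs
```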
## Build Simulator Package

base.py (2 changed lines)

@@ -74,6 +74,8 @@ class BaseModel(abc.ABC):
        if not os.path.exists(dataset_path):
            raise ValueError(f"No data found at {dataset_path}")
        else:
            if max_rows < 0:
                max_rows = None
            df = pd.read_csv(dataset_path, nrows=max_rows)
            if type(input_cols) == str:
                base_features = [str(col) for col in df if col.startswith(input_cols)]
@@ -7,4 +7,5 @@ data:
  iteration_order: -1
  episode_col: episode
  iteration_col: iteration
  max_rows: 1000
  scale_data: True

@@ -7,4 +7,5 @@ data:
  iteration_order: 1
  episode_col: episode
  iteration_col: iteration
  max_rows: -1
  scale_data: True
@@ -0,0 +1,19 @@
model:
  name: pytorch
  build_params:
    _target_: main.build
    network_class: MVRegressor
    num_units: 50
    dropout: 0.5
    num_layers: 10
    device: cpu
    batch_size: 128
    num_epochs: 10
    scale_data: True
  saver:
    - filename: models/torch_model
  sweep:
    - run: False
    - search_algorithm: bayesian
    - num_trials: 3
    - scoring_func: r2

@@ -0,0 +1,18 @@
model:
  name: gboost
  build_params:
    - network_class: MVRegressor
    - num_units: 50
    - dropout: 0.5
    - num_layers: 10
    - device: cpu
    - batch_size: 128
    - num_epochs: 10
    - scale_data: True
  saver:
    - filename: models/xgboost_model
  sweep:
    - run: False
    - search_algorithm: bayesian
    - num_trials: 3
    - scoring_func: r2
@@ -27,6 +27,7 @@ def main(cfg: DictConfig) -> None:
    episode_col = cfg["data"]["episode_col"]
    iteration_col = cfg["data"]["iteration_col"]
    dataset_path = cfg["data"]["path"]
    max_rows = cfg["data"]["max_rows"]
    save_path = cfg["model"]["saver"][0]["filename"]
    model_name = cfg["model"]["name"]
    Model = available_models[model_name]

@@ -52,6 +53,7 @@ def main(cfg: DictConfig) -> None:
        iteration_order=iteration_order,
        episode_col=episode_col,
        iteration_col=iteration_col,
        max_rows=max_rows,
    )
    logger.info("Building model...")
    model.build_model()
@@ -5,7 +5,6 @@ channels:
dependencies:
  - python=3.7.7
  - pip=19.1.1
  - pytorch=1.7.0
  - torchvision=0.8
  - cryptography=3.1.1

skmodels.py (56 changed lines)
@@ -11,6 +11,7 @@ from sklearn.preprocessing import PolynomialFeatures
from sklearn.multioutput import MultiOutputRegressor
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from natsort import natsorted

from tune_sklearn import TuneSearchCV
from tune_sklearn import TuneGridSearchCV
@@ -43,8 +44,6 @@ class SKModel(BaseModel):
            self.model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
        elif model_type == "GradientBoostingRegressor":
            self.model = GradientBoostingRegressor()
        elif model_type == "PCA":
            self.model = PCA()
        else:
            raise NotImplementedError("unknown model selected")
@@ -98,18 +97,22 @@ class SKModel(BaseModel):
        # preds_df.columns = label_col_names
        return preds

    def save_model(self, dir_path):
    def save_model(self, filename):

        dir_path = pathlib.Path(filename).parent
        if not pathlib.Path(dir_path).exists():
            pathlib.Path(dir_path).mkdir(parents=True, exist_ok=True)
        if self.separate_models:
            if not pathlib.Path(dir_path).exists():
                pathlib.Path(dir_path).mkdir(parents=True, exist_ok=True)
            # pickle.dump(self.models, open(filename, "wb"))
            if not pathlib.Path(filename).exists():
                pathlib.Path(filename).mkdir(parents=True, exist_ok=True)
            logger.info(f"Saving models to {filename}")
            for i in range(len(self.models)):
                pickle.dump(
                    self.models[i], open(os.path.join(dir_path, f"model{i}.pkl"), "wb")
                )
                save_path = os.path.join(filename, f"model{i}.pkl")
                logger.info(f"Saving model {i} to {save_path}")
                pickle.dump(self.models[i], open(save_path, "wb"))
        else:
            pickle.dump(self.model, open(dir_path, "wb"))
            pickle.dump(self.model, open(filename, "wb"))

    def load_model(
        self, dir_path: str, scale_data: bool = False, separate_models: bool = False
@@ -118,7 +121,7 @@ class SKModel(BaseModel):
        self.separate_models = separate_models
        if self.separate_models:
            all_models = os.listdir(dir_path)
            all_models.sort()
            all_models = natsorted(all_models)
            num_models = len(all_models)
            models = []
            for i in range(num_models):
@@ -148,39 +151,6 @@ class SKModel(BaseModel):
        return tune_search


# pipe = Pipeline(
#     [
#         # the reduce_dim stage is populated by the param_grid
#         ("reduce_dim", "passthrough"),
#         ("classify", LinearSVC(dual=False, max_iter=10000)),
#     ]
# )

# N_FEATURES_OPTIONS = [2, 4, 8]
# C_OPTIONS = [1, 10]
# param_grid = [
#     {
#         "reduce_dim": [PCA(iterated_power=7), NMF()],
#         "reduce_dim__n_components": N_FEATURES_OPTIONS,
#         "classify__C": C_OPTIONS,
#     },
#     {
#         "reduce_dim": [SelectKBest(chi2)],
#         "reduce_dim__k": N_FEATURES_OPTIONS,
#         "classify__C": C_OPTIONS,
#     },
# ]

# random = TuneSearchCV(pipe, param_grid, search_optimization="random")
# X, y = load_digits(return_X_y=True)
# random.fit(X, y)
# print(random.cv_results_)

# grid = TuneGridSearchCV(pipe, param_grid=param_grid)
# grid.fit(X, y)
# print(grid.cv_results_)


if __name__ == "__main__":

    """Example using an sklearn Pipeline with TuneGridSearchCV.
@@ -23,11 +23,11 @@ def test_svm_train():
    pathlib.Path("tmp").mkdir(parents=True, exist_ok=True)
    lsvm = SKModel()
    lsvm.build_model(model_type="SVR")
    lsvm.fit(X, y)
    lsvm.save_model(dir_path="tmp/lsvm_pole.pkl")
    lsvm.fit(X, y, fit_separate=True)
    lsvm.save_model(filename="tmp/lsvm_pole")

    lsvm2 = SKModel()
    lsvm2.load_model(dir_path="tmp/lsvm_pole.pkl", separate_models=True)
    lsvm2.load_model(dir_path="tmp/lsvm_pole", separate_models=True)

    yhat0 = lsvm.predict(X)
    yhat = lsvm2.predict(X)

@@ -42,7 +42,7 @@ def test_linear_train():
    linear = SKModel()
    linear.build_model(model_type="linear_model")
    linear.fit(X, y)
    linear.save_model(dir_path="tmp/linear_pole.pkl")
    linear.save_model(filename="tmp/linear_pole.pkl")

    linear2 = SKModel()
    linear2.load_model(dir_path="tmp/linear_pole.pkl")

@@ -61,7 +61,7 @@ def test_gbr_train():
    gbr = SKModel()
    gbr.build_model(model_type="GradientBoostingRegressor")
    gbr.fit(X, y)
    gbr.save_model(dir_path="tmp/gbr_pole.pkl")
    gbr.save_model(filename="tmp/gbr_pole.pkl")

    gbr2 = SKModel()
    gbr2.load_model(dir_path="tmp/gbr_pole.pkl", separate_models=True)