UPDATE: πŸ“š initial readme plus improved tests for sklearn and add max_rows arg

This commit is contained in:
Ali Zaidi 2021-01-15 11:48:54 -08:00
Parent ca732bec49
Commit 9147cd243c
10 changed files: 82 additions and 239 deletions

211
README.md
ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -1,223 +1,54 @@
# Data driven model creation for simulators to train brains on Bonsai
# Training Data-Driven or Surrogate Simulators
Tooling to simplify the creation and use of data-driven simulators using supervised learning, for the purpose of training brains with Project Bonsai. It ingests data as CSV and generates simulation models that can then be used directly to train a reinforcement learning agent.
This repository provides a template for training data-driven simulators that can then be leveraged for training brains (reinforcement learning agents) with [Project Bonsai](https://docs.bons.ai/).
>🚩 Disclaimer: This is not an official Microsoft product. This application is considered an experimental addition to Microsoft Project Bonsai's software toolchain. Its primary goal is to reduce barriers to entry for Project Bonsai's core Machine Teaching. Pull requests for fixes and small enhancements are welcome, but we do expect this to be replaced by out-of-the-box features of Project Bonsai in the near future.
:warning: Disclaimer: This is not an official Microsoft product. This application is considered an experimental addition to Microsoft's Project Bonsai toolbox. Its primary goal is to reduce barriers of entry to use Project Bonsai's core Machine Teaching. Pull requests for fixes and small enhancements are welcome, but we expect this to be replaced by out-of-the-box features of Project Bonsai shortly.
## Dependencies
This repository leverages [Anaconda](https://docs.conda.io/en/latest/miniconda.html) for Python virtual environments and all dependencies. Please install Anaconda or miniconda first and then run the following:
```bash
conda env update -f environment.yml
conda activate datadriven
conda activate ddm
```
## Main steps to follow
## Tests
`Step 1.` Obtain historical or surrogate sim data in csv format.
To get an understanding of the package, you may want to look at the tests in [`tests`](./tests), and the configuration files in [`conf`](./conf).
- include header names
- a single row should be a slice in time
- ensure the data ranges cover the region you expect reinforcement learning to explore
- smooth noisy data and remove outliers
- remove NaN or NA values
## Usage
Refer to later sections for help with checking data quality before using this tool.
The scripts in this package expect that you have a dataset of CSVs or numpy arrays. If you are using a CSV, ensure it has a header row with unique column names describing the inputs and outputs of the model. In addition, your CSV should have a column for the episode index and another column for the iteration index.
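As a rough illustration of that layout, the following sketch (assuming pandas is installed; the file path and feature column names are hypothetical) checks that the required index columns are present:
```python
# Minimal sketch: validate the expected CSV layout before training.
import pandas as pd

df = pd.read_csv("csv_data/example_data.csv")  # hypothetical path

# Each row is one iteration of one episode; the remaining columns hold the
# model's inputs (states and actions) and outputs (next states).
required = {"episode", "iteration"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing required columns: {missing}")

print(df.groupby("episode")["iteration"].max().head())  # iterations per episode
```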
`Step 2.` Change the `config_model.yml` file in the `config/` folder
### Generating Logs from an Existing Simulator
Enter the CSV file name. Enter the timelag, i.e. the number of rows (iterations) that define a state transition in the data. A timelag of `1` uses every row of data: a change made to the system/process shows its result in the next sample measurement.
For an example of how to generate logged datasets from a simulator using the Python SDK, take a look at the examples in the [samples repository](https://github.com/microsoft/microsoft-bonsai-api/tree/main/Python/samples). In particular, you can use the flags `--test-local True --log-iteration True` to generate CSV data that matches the schema used in this repository.
Define the names of the features used as inputs to the simulator model you will create. The names should match the headers of the CSV file you provide, and each value should be set to either `state` or `action`. Define the `output_name` entries to match the headers in your CSV.
### Training Your Models
Define the model type as one of `gb`, `poly`, `nn`, or `lstm`. Depending on the model you choose, adjust the corresponding hyperparameters in this config file as well.
```YAML
# Define csv file path to train a simulator with
DATA:
path: ./csv_data/example_data.csv
timelag: 1
# Define the inputs and outputs of datadriven simulator
IO:
feature_name:
theta: state
alpha: state
theta_dot: state
alpha_dot: state
Vm: action
output_name:
- theta
- alpha
- theta_dot
- alpha_dot
# Select the model type gb, poly, nn, or lstm
MODEL:
type: gb
# Polynomial Regression hyperparameters
POLY:
degree: 1
# Gradient Boost hyperparameters
GB:
n_estimators: 100
lr: 0.1
max_depth: 3
# MLP Neural Network hyperparameters
NN:
epochs: 100
batch_size: 512
activation: linear
n_layer: 5
n_neuron: 12
lr: 0.00001
decay: 0.0000003
dropout: 0.5
# LSTM Neural Network hyperparameters
LSTM:
epochs: 100
batch_size: 512
activation: linear
num_hidden_layer: 5
n_neuron: 12
lr: 0.00001
decay: 0.0000003
dropout: 0.5
markovian_order: 2
num_lstm_units: 1
```
`Step 3.` Run the tool
The scripts in this package leverage the configuration files saved in the [`conf`](./conf) folder to load CSV files, train and save models, and interface them to the Bonsai service. The library comes with a default configuration set in [`conf/config.yaml`](conf/config.yaml).
```bash
python datamodeler.py
python datamodeler2.py
```
The tool will ingest your CSV file as input and create a simulator model of the type you selected. The resulting model will be placed into `models/`.
`Step 4.` Use the model directly
An adaptor class is available for building custom integrations in the following way. We've already done this for you in `Step 5`, but it provides a good understanding of what happens under the hood. Initialize the class with the model type, which is one of `'gb'`, `'poly'`, `'nn'`, or `'lstm'`.
Optionally specify a `noise_percentage` to add noise to the states of the simulator; leaving it at zero adds no noise. Training a brain can benefit from adding noise to the states of an approximated simulator to promote robustness.
Define the `action_space_dim` and the `state_space_dim`. The `markovian_order` is needed to set the sequence length of the features for an `LSTM`.
```python
from predictor import ModelPredictor
predictor = ModelPredictor(
modeltype="gb",
noise_percentage=0,
state_space_dim=4,
action_space_dim=1,
markovian_order=0
)
```
Calculate the next state as a function of the current state and current action. IMPORTANT: the input state and action are arrays, so you need to convert the brain action (a dictionary) to an array before feeding it into the predictor class, as illustrated in the sketch after the snippet below.
```python
next_state = predictor.predict(action=action, state=state)
```
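For instance, if the brain sends its action as a dictionary, one way to flatten it and the state into arrays might look like the following (the key names and values are illustrative, and the order must match the features the model was trained on):
```python
import numpy as np

# Hypothetical brain action and sim state dictionaries.
brain_action = {"Vm": 0.83}
sim_state = {"theta": 0.01, "alpha": 0.02, "theta_dot": 0.04, "alpha_dot": 0.05}

# Flatten to arrays in the same order as the training features.
action = np.array(list(brain_action.values()))
state = np.array(list(sim_state.values()))

next_state = predictor.predict(action=action, state=state)
```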
One thing to watch out for with data-driven simulators is that the approximations cannot be trusted when the feature inputs fall outside the range the model was trained on, i.e. you may get erroneous results. You can optionally check whether this occurs by using the `warn_limitation()` functionality.
```python
features = np.concatenate([state, action])
predictor.warn_limitation(features)
```
> Sim should not be necessarily trusted since predicting with the feature Vm outside of range it was trained on, i.e. extrapolating.
`Step 5.` Train with Bonsai
Create a brain and write Inkling with type definitions that match what the simulator can provide, which you defined in `config_model.yml`. Run the `train_bonsai_main.py` file to register your newly created simulator. The integration is already done! Then connect the simulator to your brain.
Be sure to specify `noise_percentage` in your Inkling's scenario. Training a brain can benefit from adding noise to the states of an approximated simulator to promote robustness.
> The `episode_start` in `train_bonsai_main.py` expects the initial conditions of the states defined in `config_model.yml` to match the scenario dictionary passed in. If you want to pass in other variables that are not modeled by the datadrivenmodel tool (besides `noise_percentage`), you'll likely have to modify `train_bonsai_main.py`; see the sketch after the Inkling example below.
```javascript
lesson `Start Inverted` {
scenario {
theta: number<-1.4 .. 1.4>,
alpha: number<-0.05 .. 0.05>, # reset inverted
theta_dot: number <-0.05 .. 0.05>,
alpha_dot: number<-0.05 .. 0.05>,
noise_percentage: 5,
}
}
```
> Ensure the SimConfig in Inkling matches the names of the headers defined in `config_model.yml` so that `train_bonsai_main.py` works.
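As a rough sketch of that wiring (not the actual code in `train_bonsai_main.py`), an `episode_start` handler might pull the configured initial conditions and the noise setting out of the scenario dictionary roughly like this:
```python
import numpy as np

# Must match the state names defined in config_model.yml (illustrative).
STATE_KEYS = ["theta", "alpha", "theta_dot", "alpha_dot"]

def episode_start(config: dict):
    # noise_percentage is handled separately; any other unmodeled scenario
    # variable would require custom handling in train_bonsai_main.py.
    noise = config.get("noise_percentage", 0)
    initial_state = np.array([config[k] for k in STATE_KEYS])
    return initial_state, noise
```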
You can change which configuration file is used by specifying the configuration group and the name of the file you would like to use, e.g.,
```bash
python train_bonsai_main.py --workspace <workspace-id> --accesskey <accesskey>
python datamodeler2.py data=cartpole_st_at
```
## Optional Flags
which will use the configuration file in [`conf/data/cartpole_st_at.yaml`](./conf/data/cartpole_st_at.yaml).
### Use pickle instead of csv as data input
Name your dataset as `x_set.pickle` and `y_set.pickle`.
You can also override parameters of the configuration file by specifying their name:
```bash
python datamodeler.py --pickle <foldername>
python datamodeler2.py data.path=csv_data/cartpole_at_st.csv data.iteration_order=1
```
For example, one might use a `<foldername>` of `env_data`; a sketch of producing these pickle files follows.
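If you are starting from numpy arrays, a minimal sketch for writing the expected `x_set.pickle` and `y_set.pickle` files into such a folder (the array shapes and contents below are placeholders) could be:
```python
import os
import pickle
import numpy as np

# Placeholder arrays: x_set holds feature rows (states + actions),
# y_set holds the corresponding outputs (next states).
x_set = np.random.rand(1000, 5)  # e.g. theta, alpha, theta_dot, alpha_dot, Vm
y_set = np.random.rand(1000, 4)  # e.g. next theta, alpha, theta_dot, alpha_dot

os.makedirs("env_data", exist_ok=True)
with open("env_data/x_set.pickle", "wb") as f:
    pickle.dump(x_set, f)
with open("env_data/y_set.pickle", "wb") as f:
    pickle.dump(y_set, f)
```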
### Hyperparameter tuning
Gradient Boost should not require much tuning at all. Polynomial Regression may benefit from changing the order. Neural Networks, however, may require significant hyperparameter tuning. Use the flag below to randomly search over the ranges specified in the `model_config.yml` file.
```bash
python datamodeler.py --tune-rs=True
```
## LSTM
After creating an LSTM model for your sim, you can use the predictor class in the same way as the other models. The predictor class initializes a sequence and keeps stacking a history of state transitions, popping off the oldest information. To maintain a sequence of valid distributions when starting a sim using the LSTM model, the predictor class takes a single timestep of initial conditions and automatically steps through the sim using the mean value of each action captured from the data in `model_limits.yml`.
```python
from predictor import ModelPredictor
import numpy as np
predictor = ModelPredictor(
modeltype="lstm",
noise_percentage=0,
state_space_dim=4,
action_space_dim=1,
markovian_order=3
)
config = {
'theta': 0.01,
'alpha': 0.02,
'theta_dot': 0.04,
'alpha_dot': 0.05,
'noise_percentage': 5,
}
predictor.reset_state(config)
for i in range(1):
next_state = predictor.predict(
action=np.array([0.83076303]),
state=np.array(
[ 0.6157155 , 0.19910571 , 0.15492792 , 0.09268583]
)
)
print('next_state: ', next_state)
print('history state: ', predictor.state)
print('history action: ', predictor.action_history)
```
The code snippet using the LSTM results in the following history of states and actions. Take note that `reset_state(config)` will auto-populate realistic trajectories for the sequence using the mean action. This means that the `0th` iteration of the simulation does not necessarily start where the user specified in the `config`. Continuing to call `predict()` will step through the sim and maintain a history of the state transitions automatically for you, matching the sequence of information required by the LSTM.
```bash
next_state: [ 0.41453919 0.07664483 -0.13645072 0.81335021]
history state: deque([0.6157155, 0.19910571, 0.15492792, 0.09268583, 0.338321704451164, 0.018040116405597596, -0.5707406858943783, 0.3023940018967715, 0.01, 0.02, 0.04, 0.05], maxlen=12)
history action: deque([0.83076303, -0.004065768182411825, -0.004065768182411825], maxlen=3)
```
The script automatically saves your model to the path specified by `model.saver.filename`. An `outputs` directory is also created containing your configuration file and logs.
## Build Simulator Package

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -74,6 +74,8 @@ class BaseModel(abc.ABC):
if not os.path.exists(dataset_path):
raise ValueError(f"No data found at {dataset_path}")
else:
if max_rows < 0:
max_rows = None
df = pd.read_csv(dataset_path, nrows=max_rows)
if type(input_cols) == str:
base_features = [str(col) for col in df if col.startswith(input_cols)]
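The convention introduced here is that a negative `max_rows` means β€œread everything”: it is mapped to `None`, which `pandas.read_csv` interprets as no row limit. A small standalone illustration (with a hypothetical CSV path):
```python
import pandas as pd

def read_limited(path: str, max_rows: int = -1) -> pd.DataFrame:
    # Mirror the convention above: max_rows < 0 means "no limit".
    nrows = None if max_rows < 0 else max_rows
    return pd.read_csv(path, nrows=nrows)

df_all = read_limited("csv_data/example_data.csv")                 # full dataset
df_1k = read_limited("csv_data/example_data.csv", max_rows=1000)   # first 1000 rows
```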

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -7,4 +7,5 @@ data:
iteration_order: -1
episode_col: episode
iteration_col: iteration
max_rows: 1000
scale_data: True

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -7,4 +7,5 @@ data:
iteration_order: 1
episode_col: episode
iteration_col: iteration
max_rows: -1
scale_data: True

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -0,0 +1,19 @@
model:
name: pytorch
build_params:
_target_: main.build
network_class: MVRegressor
num_units: 50
dropout: 0.5
num_layers: 10
device: cpu
batch_size: 128
num_epochs: 10
scale_data: True
saver:
- filename: models/torch_model
sweep:
- run: False
- search_algorithm: bayesian
- num_trials: 3
- scoring_func: r2

18
conf/model/xgboost.yaml Normal file
ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -0,0 +1,18 @@
model:
name: gboost
build_params:
- network_class: MVRegressor
- num_units: 50
- dropout: 0.5
- num_layers: 10
- device: cpu
- batch_size: 128
- num_epochs: 10
- scale_data: True
saver:
- filename: models/xgboost_model
sweep:
- run: False
- search_algorithm: bayesian
- num_trials: 3
- scoring_func: r2

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -27,6 +27,7 @@ def main(cfg: DictConfig) -> None:
episode_col = cfg["data"]["episode_col"]
iteration_col = cfg["data"]["iteration_col"]
dataset_path = cfg["data"]["path"]
max_rows = cfg["data"]["max_rows"]
save_path = cfg["model"]["saver"][0]["filename"]
model_name = cfg["model"]["name"]
Model = available_models[model_name]
@@ -52,6 +53,7 @@ def main(cfg: DictConfig) -> None:
iteration_order=iteration_order,
episode_col=episode_col,
iteration_col=iteration_col,
max_rows=max_rows,
)
logger.info("Building model...")
model.build_model()

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -5,7 +5,6 @@ channels:
dependencies:
- python=3.7.7
- pip=19.1.1
- pytorch=1.7.0
- torchvision=0.8
- cryptography=3.1.1

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -11,6 +11,7 @@ from sklearn.preprocessing import PolynomialFeatures
from sklearn.multioutput import MultiOutputRegressor
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from natsort import natsorted
from tune_sklearn import TuneSearchCV
from tune_sklearn import TuneGridSearchCV
@@ -43,8 +44,6 @@ class SKModel(BaseModel):
self.model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
elif model_type == "GradientBoostingRegressor":
self.model = GradientBoostingRegressor()
elif model_type == "PCA":
self.model = PCA()
else:
raise NotImplementedError("unknown model selected")
@@ -98,18 +97,22 @@ class SKModel(BaseModel):
# preds_df.columns = label_col_names
return preds
def save_model(self, dir_path):
def save_model(self, filename):
dir_path = pathlib.Path(filename).parent
if not pathlib.Path(dir_path).exists():
pathlib.Path(dir_path).mkdir(parents=True, exist_ok=True)
if self.separate_models:
if not pathlib.Path(dir_path).exists():
pathlib.Path(dir_path).mkdir(parents=True, exist_ok=True)
# pickle.dump(self.models, open(filename, "wb"))
if not pathlib.Path(filename).exists():
pathlib.Path(filename).mkdir(parents=True, exist_ok=True)
logger.info(f"Saving models to {filename}")
for i in range(len(self.models)):
pickle.dump(
self.models[i], open(os.path.join(dir_path, f"model{i}.pkl"), "wb")
)
save_path = os.path.join(filename, f"model{i}.pkl")
logger.info(f"Saving model {i} to {save_path}")
pickle.dump(self.models[i], open(save_path, "wb"))
else:
pickle.dump(self.model, open(dir_path, "wb"))
pickle.dump(self.model, open(filename, "wb"))
def load_model(
self, dir_path: str, scale_data: bool = False, separate_models: bool = False
@@ -118,7 +121,7 @@ class SKModel(BaseModel):
self.separate_models = separate_models
if self.separate_models:
all_models = os.listdir(dir_path)
all_models.sort()
all_models = natsorted(all_models)
num_models = len(all_models)
models = []
for i in range(num_models):
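The switch from `list.sort()` to `natsorted` matters once there are ten or more per-output models: plain lexicographic sorting puts `model10.pkl` before `model2.pkl`, while natural sorting preserves numeric order. A quick illustration (assuming the `natsort` package is installed):
```python
from natsort import natsorted

files = ["model0.pkl", "model10.pkl", "model2.pkl", "model1.pkl"]

print(sorted(files))     # ['model0.pkl', 'model1.pkl', 'model10.pkl', 'model2.pkl']
print(natsorted(files))  # ['model0.pkl', 'model1.pkl', 'model2.pkl', 'model10.pkl']
```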
@@ -148,39 +151,6 @@
return tune_search
# pipe = Pipeline(
# [
# # the reduce_dim stage is populated by the param_grid
# ("reduce_dim", "passthrough"),
# ("classify", LinearSVC(dual=False, max_iter=10000)),
# ]
# )
# N_FEATURES_OPTIONS = [2, 4, 8]
# C_OPTIONS = [1, 10]
# param_grid = [
# {
# "reduce_dim": [PCA(iterated_power=7), NMF()],
# "reduce_dim__n_components": N_FEATURES_OPTIONS,
# "classify__C": C_OPTIONS,
# },
# {
# "reduce_dim": [SelectKBest(chi2)],
# "reduce_dim__k": N_FEATURES_OPTIONS,
# "classify__C": C_OPTIONS,
# },
# ]
# random = TuneSearchCV(pipe, param_grid, search_optimization="random")
# X, y = load_digits(return_X_y=True)
# random.fit(X, y)
# print(random.cv_results_)
# grid = TuneGridSearchCV(pipe, param_grid=param_grid)
# grid.fit(X, y)
# print(grid.cv_results_)
if __name__ == "__main__":
"""Example using an sklearn Pipeline with TuneGridSearchCV.

ΠŸΡ€ΠΎΡΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„Π°ΠΉΠ»

@ -23,11 +23,11 @@ def test_svm_train():
pathlib.Path("tmp").mkdir(parents=True, exist_ok=True)
lsvm = SKModel()
lsvm.build_model(model_type="SVR")
lsvm.fit(X, y)
lsvm.save_model(dir_path="tmp/lsvm_pole.pkl")
lsvm.fit(X, y, fit_separate=True)
lsvm.save_model(filename="tmp/lsvm_pole")
lsvm2 = SKModel()
lsvm2.load_model(dir_path="tmp/lsvm_pole.pkl", separate_models=True)
lsvm2.load_model(dir_path="tmp/lsvm_pole", separate_models=True)
yhat0 = lsvm.predict(X)
yhat = lsvm2.predict(X)
@@ -42,7 +42,7 @@ def test_linear_train():
linear = SKModel()
linear.build_model(model_type="linear_model")
linear.fit(X, y)
linear.save_model(dir_path="tmp/linear_pole.pkl")
linear.save_model(filename="tmp/linear_pole.pkl")
linear2 = SKModel()
linear2.load_model(dir_path="tmp/linear_pole.pkl")
@@ -61,7 +61,7 @@ def test_gbr_train():
gbr = SKModel()
gbr.build_model(model_type="GradientBoostingRegressor")
gbr.fit(X, y)
gbr.save_model(dir_path="tmp/gbr_pole.pkl")
gbr.save_model(filename="tmp/gbr_pole.pkl")
gbr2 = SKModel()
gbr2.load_model(dir_path="tmp/gbr_pole.pkl", separate_models=True)