"Reinforcement learning (RL) is an area of machine learning inspired by behaviorist psychology, concerned with how [software agents](https://en.wikipedia.org/wiki/Software_agent) ought to take [actions](https://en.wikipedia.org/wiki/Action_selection) in an environment so as to maximize some notion of cumulative reward. In machine learning, the environment is typically formulated as a [Markov decision process](https://en.wikipedia.org/wiki/Markov_decision_process) (MDP) as many reinforcement learning algorithms for this context utilize [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming) techniques.\n",
"In some machine learning settings, we do not have immediate access to labels, so we cannot rely on supervised learning techniques. If, however, there is something we can interact with and thereby get some feedback that tells us occasionally, whether our previous behavior was good or not, we can use RL to learn how to improve our behavior.\n",
"Unlike in supervised learning, in RL, labeled correct input/output pairs are never presented and sub-optimal actions are never explicitly corrected. This mimics many of the online learning paradigms which involves finding a balance between exploration (of conditions or actions never learnt before) and exploitation (of already learnt conditions or actions from previous encounters). Multi-arm bandit problems is one of the category of RL algorithms where exploration vs. exploitation trade-off have been thoroughly studied. See figure below for [reference](http://www.simongrant.org/pubs/thesis/3/2.html).\n",
"We will use the [CartPole](https://gym.openai.com/envs/CartPole-v0) environment from OpenAI's [gym](https://github.com/openai/gym) simulator to teach a cart to balance a pole. As described in the link above, in the CartPole example, a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. See figure below for reference.\n",
"Our goal is to prevent the pole from falling over as the cart moves with the pole in upright position (perpendicular to the cart) as the starting state. More specifically if the pole is less than 15 degrees from vertical while the cart is within 2.4 units of the center we will collect reward. In this tutorial, we will train till we learn a set of actions (policies) that lead to an average reward of 200 or more over last 50 batches.\n",
"In, RL terminology, the goal is to find _policies_ $a$, that maximize the _reward_ $r$ (feedback) through interaction with some environment (in this case the pole being balanced on the cart). So given a series of experiences $$s \\xrightarrow{a} r, s'$$ we then can learn how to choose action $a$ in a given state $s$ to maximize the accumulated reward $r$ over time:\n",
"where $\\gamma \\in [0,1)$ is the discount factor that controls how much we should value reward that is further away. This is called the [*Bellmann*-equation](https://en.wikipedia.org/wiki/Bellman_equation).\n",
"In this tutorial we will show how to model the state space, how to use the received reward to figure out which action yields the highest future reward.\n",
"We present two different popular approaches here:\n",
"\n",
"**Deep Q-Networks**: DQNs have become famous in 2015 when they were successfully used to train how to play Atari just form raw pixels. We train neural network to learn the $Q(s,a)$ values (thus _Q-Network _). From these $Q$ functions values we choose the best action.\n",
"\n",
"**Policy gradient**: This method directly estimates the policy (set of actions) in the network. The outcome is a learning of an ordered set of actions which leads to maximize reward by probabilistically choosing a subset of actions. In this tutorial, we learn the actions using a gradient descent approach to learn the policies.\n",
"In this tutorial, we focus how to implement RL in CNTK. We choose a straight forward shallow network. One can extend the approaches by replacing our shallow model with deeper networks that are introduced in other CNTK tutorials.\n",
"Please run the following cell from the menu above or select the cell below and hit `Shift + Enter` to ensure the environment is ready. Verify that the following imports work in your notebook."
"We use the following construct to install the OpenAI gym package if it is not installed. For users new to Jupyter environment, this construct can be used to install any python package. "
"- *Fast mode*: `isFast` is set to `True`. This is the default mode for the notebooks, which means we train for fewer iterations or train / test on limited data. This ensures functional correctness of the notebook though the models produced are far from what a completed training would produce.\n",
"\n",
"- *Slow mode*: We recommend the user to set this flag to `False` once the user has gained familiarity with the notebook content and wants to gain insight from running the notebooks for a longer period with different parameters for training. "
"We will use the [CartPole](https://gym.openai.com/envs/CartPole-v0) environment from OpenAI's [gym](https://github.com/openai/gym) simulator to teach a cart to balance a pole. Please follow the links to get more details.\n",
"\n",
"In every time step, the agent\n",
" * gets an observation $(x, \\dot{x}, \\theta, \\dot{\\theta})$, corresponding to *cart position*, *cart velocity*, *pole angle with the vertical*, *pole angular velocity*,\n",
" * the agent achieved and averaged reward of 200 over the last 50 episodes (if you manage to get a reward of 200 averaged over the last 100 episode you can consider submitting it to OpenAI).\n",
"After a transition $(s,a,r,s')$, we are trying to move our value function $Q(s,a)$ closer to our target $r+\\gamma \\max_{a'}Q(s',a')$, where $\\gamma$ is a discount factor for future rewards and ranges in value between 0 and 1.\n",
"We will start with a slightly modified version for Keras, https://github.com/jaara/AI-blog/blob/master/CartPole-basic.py, published by Jaromír Janisch in his [AI blog](https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/), and will then incrementally convert it to use CNTK.\n",
"Note: in the cell below we highlight how one would do it in Keras. And a marked similarity with CNTK. While CNTK allows for more compact representation, we present a slightly verbose illustration for ease of learning.\n",
"\n",
"Additionally, you will note that, CNTK model doesn't need to be compiled explicitly and is implicitly done when data is processed during training.\n",
"CNTK effectively uses available memory on the system between minibatch execution. Thus the learning rates are stated as **rates per sample** instead of **rates per minibatch** (as with other toolkits)."
"As any learning experiences, we expect to see the initial state of actions to be wild exploratory and over the iterations the system learns the range of actions that yield longer runs and collect more rewards. The tutorial below implements the [$\\epsilon$-greedy](https://en.wikipedia.org/wiki/Reinforcement_learning) approach. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"def plot_weights(weights, figsize=(7,5)):\n",
" '''Heat map of weights to see which neurons play which role'''\n",
"Note that the initial $\\epsilon$ is set to 1 which implies we are entirely exploring but as steps increase we reduce exploration and start leveraging the learnt space to collect rewards (a.k.a. exploitation) as well."
"plt.plot(range(10000), [epsilon(x) for x in range(10000)], 'r')\n",
"plt.xlabel('step');plt.ylabel('$\\epsilon$')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to train our agent using **DQN**. Note this would take anywhere between 2-10 min and we stop whenever the learner hits the average reward of 200 over past 50 batches. One would get better results if they could train the learner until say one hits a reward of 200 or higher for say larger number of runs. This is left as an exercise."
"Play with different [learners](https://cntk.ai/pythondocs/cntk.learner.html#module-cntk.learner). Which one works better? Worse? Think about which parameters you would need to adapt when switching from one learner to the other."
"The problem: we normally do not know, which action led to a continuation of the game, and which was actually a bad one. Our simple heuristic: actions in the beginning of the episode are good, and those towards the end are likely bad (they led to losing the game after all)."
"**Policy Search**: The optimal policy search can be carried out with either gradient free approaches or by computing gradients over the policy space ($\\pi_\\theta$) which is parameterized by $\\theta$. In this tutorial, we use the classic forward (`loss.forward`) and back (`loss.backward`) propagation of errors over the parameterized space $\\theta$. In this case, $\\theta = \\{W_1, b_1, W_2, b_2\\}$, our model parameters. "