Merge pull request #1039 from tuzzer/patch-2

Fixed formatting errors in CNTK_203_Reinforcement_Learning_Basics.ipynb
This commit is contained in:
Philipp Kranen 2016-11-15 12:20:13 +01:00 committed by GitHub
Parents 20da1c7157 99389b3032
Commit 2340b27379
1 changed file with 2 additions and 2 deletions


@@ -27,7 +27,7 @@
"Q(s,a) &= r_0 + \\gamma r_1 + \\gamma^2 r_2 + \\ldots \\newline\n",
"&= r_0 + \\gamma \\max_a Q^*(s',a)\n",
"\\end{align}\n",
"where $\\gamma \\in [0,1)$ is the discount factor that controls how much we should value reward that is further away. This is called the [$Bellmann$-equation](https://en.wikipedia.org/wiki/Bellman_equation). \n",
"where $\\gamma \\in [0,1)$ is the discount factor that controls how much we should value reward that is further away. This is called the [*Bellmann*-equation](https://en.wikipedia.org/wiki/Bellman_equation). \n",
"\n",
"In this tutorial we will show how to model the state space, how to use the received reward to figure out which action yields the highest future reward. \n",
"\n",
@@ -126,7 +126,7 @@
"source": [
"# Part 1: DQN\n",
"\n",
"After a transition $(s,a,r,s)$, we are trying to move our value function $Q(s,a)$ closer to our target $r+\\gamma \\max_{a}Q(s,a)$, where $\\gamma$ is a discount factor for future rewards and ranges in value between 0 and 1.\n",
"After a transition $(s,a,r,s')$, we are trying to move our value function $Q(s,a)$ closer to our target $r+\\gamma \\max_{a'}Q(s',a')$, where $\\gamma$ is a discount factor for future rewards and ranges in value between 0 and 1.\n",
"\n",
"DQNs\n",
" * learn the _Q-function_ that maps observation (state, action) to a `score`\n",