{"componentChunkName":"component---src-templates-blog-post-js","path":"/does-dqn-coverge/","result":{"data":{"site":{"siteMetadata":{"title":"Hi, Blog","author":"Yoni Schirris"}},"markdownRemark":{"id":"fc788dbc-e290-5085-beab-9c271c3a6835","excerpt":"Introduction Reinforcement Learning (RL), the idea of having an agent learn by\ndirectly perceiving its environment and acting to manipulate it, is one\nof the…","html":"<h2>Introduction</h2>\n<p>Reinforcement Learning (RL), the idea of having an agent learn by\ndirectly perceiving its environment and acting to manipulate it, is one\nof the great branches of Artificial Intelligence. In this blog post you\nwill read about a specific breakthrough by DeepMind: its success in\ncreating a single deep RL architecture that was able to achieve gameplay\nin Atari games comparable to that of humans across almost all the <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>49</mn></mrow><annotation encoding=\"application/x-tex\">49</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">4</span><span class=\"mord\">9</span></span></span></span>\ngames <a href=\"#references\">[1]</a>. They called it DQN, which stands for “Deep\nQ-Network”. When its predecessor was first introduced in 2013 <a href=\"#references\">[2]</a>,\nit was, according to the authors, the first system that was able to\nlearn control policies “directly from high-dimensional sensory input\nusing reinforcement learning”.<br>\n<br>\nIn the coming sections, we will first study Q-learning, a specific RL\nalgorithm, in its tabular form. 
Then we will learn how Deep Q-learning\nmakes it possible, in principle, to apply RL to more complex problems and\nmeet the so-called deadly triad: three properties that are\noften present in these problems and that can lead to divergence. In the section\nthat follows, we will briefly introduce the two main tricks that DQN\nuses for stability, namely experience replay and a fixed target\nnetwork. Finally, we investigate experimentally whether these tricks\ncan overcome the divergence problems that partly stem from the deadly\ntriad, evaluating performance in the OpenAI Gym CartPole\nenvironment. Concretely, our research questions are:</p>\n<ul>\n<li>Can we find an environment where DQN diverges without the two main\ntricks employed in that paper, but converges when they are\nincluded?</li>\n<li>What are the individual contributions of\nexperience replay and a fixed target network?</li>\n</ul>\n<p>We will answer these questions by investigating the return, i.e. the\ncumulative reward, achieved by the agent over time. We expect that\nadding both tricks allows the method to converge on the chosen\nenvironment, while using only one of the tricks does not. Additionally,\nwe will look at the phenomenon of “soft divergence”, i.e.\nunrealistically large Q-values. We hope to shed some light on the\neffectiveness of these tricks in environments less complex than Atari\ngames, allowing beginners in the field of RL to get an intuition of why\nDQN works so well, and to think of ways to improve them even\nfurther.<br>\n<br>\nThis post assumes some familiarity with RL. 
If you feel that you lack prior\nknowledge while reading the post, we recommend reading parts of the\nclassic introduction to Reinforcement Learning\n<a href=\"#references\">[3]</a> or watching the great (available online)\n<a href=\"https://www.youtube.com/watch?v=2pWv7GOvuf0&#x26;list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ\">lecture by David\nSilver</a>,\none of the creators of the famous\n<a href=\"https://en.wikipedia.org/wiki/AlphaZero\">AlphaZero</a>.</p>\n<h1>Q-Functions and Q-Learning</h1>\n<p>In the explanations that follow, there is always the implicit notion of\na <a href=\"https://en.wikipedia.org/wiki/Markov_decision_process\">Markov decision process (MDP)</a>: it consists of a finite state space\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi mathvariant=\"script\">S</mi></mrow><annotation encoding=\"application/x-tex\">\\mathcal{S}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathcal\" style=\"margin-right:0.075em;\">S</span></span></span></span></span>, a finite action space <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi mathvariant=\"script\">A</mi></mrow><annotation encoding=\"application/x-tex\">\\mathcal{A}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathcal\">A</span></span></span></span></span>, a reward space\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>R</mi><mo>⊆</mo><mi mathvariant=\"double-struck\">R</mi></mrow><annotation encoding=\"application/x-tex\">R 
\\subseteq \\mathbb{R}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8193em;vertical-align:-0.13597em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.00773em;\">R</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">⊆</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.68889em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathbb\">R</span></span></span></span></span>, a discount factor <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>γ</mi></mrow><annotation encoding=\"application/x-tex\">\\gamma</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span></span></span></span> and transition\ndynamics <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>p</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><mi>r</mi><mo>∣</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">p(s&#x27;, r \\mid s, a)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">p</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span 
class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∣</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">)</span></span></span></span> that define the probability of arriving\nin a next state with a certain reward, given the current state and\naction.<br>\n<br>\nThe agent’s goal is to maximize the expected return <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>G</mi></mrow><annotation encoding=\"application/x-tex\">G</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">G</span></span></span></span>, which is the\ncumulative discounted reward. 
This can be modeled by <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-functions,\nwhich represent the expected return for each state-action pair\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo><mo>=</mo><mi>E</mi><mrow><mo fence=\"true\">[</mo><mi>G</mi><mo>∣</mo><mi>S</mi><mo>=</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>A</mi><mo>=</mo><mi>a</mi><mo fence=\"true\">]</mo></mrow><mo>=</mo><mi>E</mi><mo stretchy=\"false\">[</mo><msub><mi>R</mi><mn>1</mn></msub><mo>+</mo><mi>γ</mi><msub><mi>R</mi><mn>2</mn></msub><mo>+</mo><msup><mi>γ</mi><mn>2</mn></msup><msub><mi>R</mi><mn>3</mn></msub><mo>+</mo><mo>⋯</mo><mo>∣</mo><mi>S</mi><mo>=</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>A</mi><mo>=</mo><mi>a</mi><mo stretchy=\"false\">]</mo><mo separator=\"true\">,</mo></mrow><annotation encoding=\"application/x-tex\">Q(s,a) = E\\left[G \\mid S = s, A = a  \\right] = E[R_1 + \\gamma R_2 + \\gamma^2R_3 + \\dots \\mid S = s, A = a],</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span 
class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05764em;\">E</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"minner\"><span class=\"mopen delimcenter\" style=\"top:0em;\">[</span><span class=\"mord mathdefault\">G</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∣</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05764em;\">S</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">A</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose delimcenter\" style=\"top:0em;\">]</span></span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05764em;\">E</span><span class=\"mopen\">[</span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.00773em;\">R</span><span 
class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.30110799999999993em;\"><span style=\"top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">1</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.00773em;\">R</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.30110799999999993em;\"><span style=\"top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.008548em;vertical-align:-0.19444em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"msupsub\"><span class=\"vlist-t\"><span 
class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.8141079999999999em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2</span></span></span></span></span></span></span></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.00773em;\">R</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.30110799999999993em;\"><span style=\"top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">3</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"minner\">⋯</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∣</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05764em;\">S</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">s</span><span 
class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">A</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">]</span><span class=\"mpunct\">,</span></span></span></span>\nwhere the rewards are stochastically determined by the actions according\nto the agent’s <em>policy</em> (i.e. a distribution over actions\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>π</mi><mo stretchy=\"false\">(</mo><mi>a</mi><mo>∣</mo><mi>s</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">\\pi(a \\mid s)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.03588em;\">π</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">a</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∣</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">s</span><span class=\"mclose\">)</span></span></span></span> for each state <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>s</mi><mo>∈</mo><mi mathvariant=\"script\">S</mi></mrow><annotation encoding=\"application/x-tex\">s \\in \\mathcal{S}</annotation></semantics></math></span><span class=\"katex-html\" 
aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.5782em;vertical-align:-0.0391em;\"></span><span class=\"mord mathdefault\">s</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∈</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathcal\" style=\"margin-right:0.075em;\">S</span></span></span></span></span>) and the environment\ndynamics <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>p</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><mi>r</mi><mo>∣</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">p(s&#x27;, r \\mid s, a)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">p</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" 
style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∣</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">)</span></span></span></span>. In particular, the <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-function provides\nthe information on which action leads to the highest value, given a\nstate, when following the given behaviour afterwards.<br>\n<br>\nQ-learning is the process of finding the <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-function of the optimal\npolicy while following a behaviour policy, e.g. 
<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ϵ</mi></mrow><annotation encoding=\"application/x-tex\">\\epsilon</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">ϵ</span></span></span></span>-greedy. This\ncan be done either in a tabular fashion or using function approximation:</p>\n<h2>Tabular Q-Learning</h2>\n<p>In tabular Q-learning, we use the following update rule:</p>\n<p><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo><mo>←</mo><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo><mo>+</mo><mi>α</mi><mo>⋅</mo><mrow><mo fence=\"true\">(</mo><mi>r</mi><mo>+</mo><mi>γ</mi><msub><mo><mi>max</mi><mo>⁡</mo></mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></msub><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo stretchy=\"false\">)</mo><mo>−</mo><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo><mo fence=\"true\">)</mo></mrow><mi mathvariant=\"normal\">.</mi></mrow><annotation encoding=\"application/x-tex\">Q(s, a) \\leftarrow Q(s, a) + \\alpha \\cdot \\left(r + \\gamma \\max_{a&#x27;} Q(s&#x27;, a&#x27;) - Q(s,a) \\right).</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" 
style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">←</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.44445em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.0037em;\">α</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">⋅</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"minner\"><span class=\"mopen delimcenter\" style=\"top:0em;\">(</span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mspace\" 
style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mop\"><span class=\"mop\">max</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.32797999999999994em;\"><span style=\"top:-2.5500000000000003em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.6828285714285715em;\"><span style=\"top:-2.786em;margin-right:0.07142857142857144em;\"><span class=\"pstrut\" style=\"height:2.5em;\"></span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" 
style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">−</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">)</span><span class=\"mclose delimcenter\" style=\"top:0em;\">)</span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">.</span></span></span></span></p>\n<p>Here, <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></mrow><annotation encoding=\"application/x-tex\">s&#x27;</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.751892em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span> is the next state and <span class=\"katex\"><span 
class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>α</mi></mrow><annotation encoding=\"application/x-tex\">\\alpha</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.0037em;\">α</span></span></span></span> is a hyperparameter, namely\nthe learning rate. Under the assumption that each state-action pair\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">(s,a)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mclose\">)</span></span></span></span> is visited infinitely often and that the learning rate is decayed appropriately, this algorithm converges to\nthe <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-function of the optimal policy <a href=\"#references\">[2]</a>.</p>\n<p>Note that the number of\npossible states grows exponentially with the dimension of the state space. 
This is known as the <em>curse of\ndimensionality</em>, since it makes tabular <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-learning and similar tabular\nmethods intractable in high-dimensional spaces. Therefore, we need a way\nto approximate the Q-function.</p>\n<h2>Q-Learning with Function Approximation</h2>\n<p>To handle these more complex scenarios, we model the Q-function\nas a parametrised differentiable function <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s, a; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> that takes\nin the state and action and, depending on the parameter values, outputs\na single value. 
In practice, this is often modelled as a function\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> that only takes in the state and outputs a whole array of\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-values, one for each action. 
For this blog post, you can assume that\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> is a neural network with <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>θ</mi></mrow><annotation encoding=\"application/x-tex\">\\theta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span></span></span></span> as its weights. Then,\nwhat is learned are the parameters <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>θ</mi></mrow><annotation encoding=\"application/x-tex\">\\theta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span></span></span></span>, of which there are fewer\nthan there are states in the state space. 
The update rule of\nsemi-gradient Q-learning is then the following:</p>\n<span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mtable width=\"100%\"><mtr><mtd width=\"50%\"></mtd><mtd><mrow><mi mathvariant=\"normal\">Δ</mi><mi>θ</mi><mo>∝</mo><mrow><mo fence=\"true\">(</mo><mi>r</mi><mo>+</mo><mi>γ</mi><munder><mo><mi>max</mi><mo>⁡</mo></mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></munder><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo><mo>−</mo><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo><mo fence=\"true\">)</mo></mrow><msub><mi mathvariant=\"normal\">∇</mi><mi>θ</mi></msub><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow></mtd><mtd width=\"50%\"></mtd><mtd><mtext>(1)</mtext></mtd></mtr></mtable><annotation encoding=\"application/x-tex\">\\Delta\\theta \\propto \\left(r + \\gamma \\max_{a&#x27;} Q(s&#x27;, a&#x27;; \\theta) - Q(s, a; \\theta)\\right) \\nabla_{\\theta} Q(s, a; \\theta) \\tag{1}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord\">Δ</span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∝</span><span class=\"mspace\" 
style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.89398em;vertical-align:-0.7439800000000001em;\"></span><span class=\"minner\"><span class=\"mopen delimcenter\" style=\"top:0em;\"><span class=\"delimsizing size2\">(</span></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.43055999999999983em;\"><span style=\"top:-2.35602em;margin-left:0em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.6828285714285715em;\"><span style=\"top:-2.786em;margin-right:0.07142857142857144em;\"><span class=\"pstrut\" style=\"height:2.5em;\"></span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span><span style=\"top:-3em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span><span class=\"mop\">max</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.7439800000000001em;\"><span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span 
class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.801892em;\"><span style=\"top:-3.113em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.801892em;\"><span style=\"top:-3.113em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">−</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span><span class=\"mclose delimcenter\" style=\"top:0em;\"><span class=\"delimsizing size2\">)</span></span></span><span 
class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord\">∇</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.33610799999999996em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\" style=\"margin-right:0.02778em;\">θ</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span><span class=\"tag\"><span class=\"strut\" style=\"height:1.89398em;vertical-align:-0.7439800000000001em;\"></span><span class=\"mord text\"><span class=\"mord\">(</span><span class=\"mord\"><span class=\"mord\">1</span></span><span class=\"mord\">)</span></span></span></span></span></span>\n<p>where <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>s</mi></mrow><annotation encoding=\"application/x-tex\">s</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">s</span></span></span></span> is your last state, <span class=\"katex\"><span 
class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>a</mi></mrow><annotation encoding=\"application/x-tex\">a</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">a</span></span></span></span> is the action you chose in that state,\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>r</mi></mrow><annotation encoding=\"application/x-tex\">r</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span></span></span></span> is the reward that followed it and <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></mrow><annotation encoding=\"application/x-tex\">s&#x27;</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.751892em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span> is the new state you found\nyourself in. 
It is a “semi-gradient method” because Equation (1) is not\nthe full gradient of the loss\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>L</mi><mo>=</mo><msup><mrow><mo fence=\"true\">(</mo><mi>r</mi><mo>+</mo><mi>γ</mi><msub><mo><mi>max</mi><mo>⁡</mo></mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></msub><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo><mo>−</mo><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo><mo fence=\"true\">)</mo></mrow><mn>2</mn></msup><mi mathvariant=\"normal\">.</mi></mrow><annotation encoding=\"application/x-tex\">L = \\left(r + \\gamma \\max_{a&#x27;} Q(s&#x27;, a&#x27;; \\theta) - Q(s, a; \\theta)\\right)^2.</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">L</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.2059em;vertical-align:-0.25em;\"></span><span class=\"minner\"><span class=\"minner\"><span class=\"mopen delimcenter\" style=\"top:0em;\">(</span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" 
style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mop\"><span class=\"mop\">max</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.32797999999999994em;\"><span style=\"top:-2.5500000000000003em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.6828285714285715em;\"><span style=\"top:-2.786em;margin-right:0.07142857142857144em;\"><span class=\"pstrut\" style=\"height:2.5em;\"></span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span 
class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">−</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span><span class=\"mclose delimcenter\" style=\"top:0em;\">)</span></span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.9559em;\"><span style=\"top:-3.2047920000000003em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2</span></span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">.</span></span></span></span>\nInstead, the target <span class=\"katex\"><span class=\"katex-mathml\"><math 
xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>r</mi><mo>+</mo><mi>γ</mi><msub><mo><mi>max</mi><mo>⁡</mo></mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></msub><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">r + \\gamma \\max_{a&#x27;} Q(s&#x27;, a&#x27;; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.66666em;vertical-align:-0.08333em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mop\"><span class=\"mop\">max</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.32797999999999994em;\"><span style=\"top:-2.5500000000000003em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.6828285714285715em;\"><span 
style=\"top:-2.786em;margin-right:0.07142857142857144em;\"><span class=\"pstrut\" style=\"height:2.5em;\"></span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> is treated\nas a constant.</p>\n<h2>The Deadly Triad</h2>\n<p>The resulting method now has the following 
properties:</p>\n<ol>\n<li>The method is <em>off-policy</em>. That is, no matter what the behaviour\npolicy is, the maximization in the update rule means that we\ndirectly learn the state-action values of the optimal\npolicy. This allows the agent to explore all possible actions, while\nat the same time learning what the optimal actions are.</li>\n<li>It is a <em>bootstrapping</em> method. That is, instead of updating its\nQ-value with the actual observed outcome of its behaviour, the agent uses an\nestimate <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s&#x27;, a&#x27;; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> for the update, which is itself\nimperfect and may be biased.</li>\n<li>The method uses <em>function approximation</em> as explained before.</li>\n</ol>\n<p>The first two properties were already present in tabular <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-learning,\nbut together with function approximation they form the so-called\n“deadly triad”, which can lead to divergence of the algorithm<sup id=\"fnref-1\"><a href=\"#fn-1\" class=\"footnote-ref\">1</a></sup>.<br>\nIn the following section, we explain the tricks that <a href=\"#references\">[1]</a>\nintroduced to tackle the deadly triad and other convergence problems\nthat may arise in Deep RL. How much the deadly triad contributes to\ndivergence problems in Deep RL is largely an open research question,\nbut the fixed target network in particular seems to be aimed at\nprecisely this problem, as described in <a href=\"#references\">[4]</a>. 
Other problems stem from the\nnature of RL itself, with its highly non-<a href=\"https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables\">i.i.d</a>. data.</p>\n<h2>DQN and its Tricks</h2>\n<p>In principle, the method is just ordinary semi-gradient <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-learning with\ndeep neural networks as function approximators. However, the authors employ\ntwo important tricks:</p>\n<h3>1. Experience Replay</h3>\n<p>Experience replay is the process of storing transitions\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mo stretchy=\"false\">(</mo><msub><mi>s</mi><mi>t</mi></msub><mo separator=\"true\">,</mo><msub><mi>a</mi><mi>t</mi></msub><mo separator=\"true\">,</mo><msub><mi>r</mi><mi>t</mi></msub><mo separator=\"true\">,</mo><msub><mi>s</mi><mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">(s_t, a_t, r_t, s_{t+1})</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.2805559999999999em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing 
reset-size6 size3 mtight\"><span class=\"mord mathdefault mtight\">t</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.2805559999999999em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathdefault mtight\">t</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.2805559999999999em;\"><span style=\"top:-2.5500000000000003em;margin-left:-0.02778em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathdefault mtight\">t</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height:0.301108em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">t</span><span class=\"mbin mtight\">+</span><span class=\"mord mtight\">1</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.208331em;\"><span></span></span></span></span></span></span><span class=\"mclose\">)</span></span></span></span> in a memory <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>D</mi></mrow><annotation encoding=\"application/x-tex\">D</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">D</span></span></span></span> from which minibatches are\nlater sampled during training.</p>\n<p>That process has several advantages:</p>\n<ul>\n<li>It breaks correlations, so that the data is more i.i.d. This reduces\nvariance of the updates and increases stability, since correlated\ndata might lead to too large steps in one direction.</li>\n<li>One has gains in efficiency due to re-used data. 
Note that\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-learning is an off-policy method, and so using data from the\npast, when the behaviour policy was still quite different, does not\nlead to problems.</li>\n</ul>\n<h3>2. Fixed Target Network</h3>\n<p>Recall that the update in semi-gradient Q-learning is given by Equation (1), where\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">,</mo><mi>r</mi><mo separator=\"true\">,</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">(s, a, r, s&#x27;)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span 
class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mclose\">)</span></span></span></span> is a saved transition and <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>θ</mi></mrow><annotation encoding=\"application/x-tex\">\\theta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span></span></span></span> is the current\nparameter vector.</p>\n<p>There can be the following problem with this framework. First, some\nnotation: Denote the target by\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>T</mi><mo stretchy=\"false\">(</mo><mi>θ</mi><mo stretchy=\"false\">)</mo><mo>=</mo><mi>r</mi><mo>+</mo><mi>γ</mi><msub><mo><mi>max</mi><mo>⁡</mo></mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></msub><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">T(\\theta) = r + \\gamma \\max_{a&#x27;} Q(s&#x27;, a&#x27;; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" 
style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.13889em;\">T</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.66666em;vertical-align:-0.08333em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">r</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">+</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mop\"><span class=\"mop\">max</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.32797999999999994em;\"><span style=\"top:-2.5500000000000003em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.6828285714285715em;\"><span style=\"top:-2.786em;margin-right:0.07142857142857144em;\"><span class=\"pstrut\" style=\"height:2.5em;\"></span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span></span><span 
class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> and imagine it is\nlarger by <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>1</mn></mrow><annotation encoding=\"application/x-tex\">1</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span></span></span></span> 
compared to <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s, a; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span>. 
Then we obtain\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi mathvariant=\"normal\">Δ</mi><mi>θ</mi><mo>∝</mo><msub><mi mathvariant=\"normal\">∇</mi><mi>θ</mi></msub><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">\\Delta \\theta \\propto \\nabla_{\\theta} Q(s, a; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord\">Δ</span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∝</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord\"><span class=\"mord\">∇</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.33610799999999996em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\" style=\"margin-right:0.02778em;\">θ</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord 
mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> and move in the\ndirection of increasing <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s, a; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span>. But <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span> is a function\napproximator, so this is not the only value that changes and similar\nstate-action pairs will change their value as well. 
Now, <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></mrow><annotation encoding=\"application/x-tex\">s&#x27;</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.751892em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span> and <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup></mrow><annotation encoding=\"application/x-tex\">a&#x27;</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.751892em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span></span></span></span>\noccur only one time-step after <span class=\"katex\"><span class=\"katex-mathml\"><math 
xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>s</mi></mrow><annotation encoding=\"application/x-tex\">s</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">s</span></span></span></span> and <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>a</mi></mrow><annotation encoding=\"application/x-tex\">a</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">a</span></span></span></span>, so they may be perceived as\nvery similar when processed by the <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi></mrow><annotation encoding=\"application/x-tex\">Q</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8777699999999999em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">Q</span></span></span></span>-network. 
Consequently,\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s&#x27;, a&#x27;; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span 
class=\"mclose\">)</span></span></span></span> may increase as much as <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s, a; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span>. 
The\nresult: <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>T</mi><mo stretchy=\"false\">(</mo><msup><mi>θ</mi><mtext>new</mtext></msup><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">T(\\theta^{\\text{new}})</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.13889em;\">T</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.664392em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">new</span></span></span></span></span></span></span></span></span></span><span class=\"mclose\">)</span></span></span></span> may be one larger than\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><mi>s</mi><mo separator=\"true\">,</mo><mi>a</mi><mo separator=\"true\">;</mo><msup><mi>θ</mi><mtext>new</mtext></msup><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s, a; \\theta^{\\text{new}})</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord mathdefault\">s</span><span class=\"mpunct\">,</span><span class=\"mspace\" 
style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\">a</span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.664392em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">new</span></span></span></span></span></span></span></span></span></span><span class=\"mclose\">)</span></span></span></span> again!</p>\n<p>I guess you already see the problem: assuming we would run this\nindefinitely, the Q-value could explode. This happens due to the\noff-policy nature of the algorithm which does not ensure that\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>Q</mi><mo stretchy=\"false\">(</mo><msup><mi>s</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">,</mo><msup><mi>a</mi><mo mathvariant=\"normal\" lspace=\"0em\" rspace=\"0em\">′</mo></msup><mo separator=\"true\">;</mo><mi>θ</mi><mo stretchy=\"false\">)</mo></mrow><annotation encoding=\"application/x-tex\">Q(s&#x27;, a&#x27;; \\theta)</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.001892em;vertical-align:-0.25em;\"></span><span class=\"mord mathdefault\">Q</span><span class=\"mopen\">(</span><span class=\"mord\"><span class=\"mord mathdefault\">s</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span 
style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">a</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.751892em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">′</span></span></span></span></span></span></span></span></span><span class=\"mpunct\">;</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"mclose\">)</span></span></span></span> gets the relevant feedback later on which would make\nit smaller again.</p>\n<p>For this problem, it may help to <em>fix the target network</em>, which was\nproposed in <a href=\"#references\">[1]</a> for the first time. 
In that framework, you\nhave parameters <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>θ</mi></mrow><annotation encoding=\"application/x-tex\">\\theta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span></span></span></span> which you update constantly, and parameters\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msup><mi>θ</mi><mo lspace=\"0em\" rspace=\"0em\">−</mo></msup></mrow><annotation encoding=\"application/x-tex\">\\theta^{-}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.771331em;vertical-align:0em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.771331em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">−</span></span></span></span></span></span></span></span></span></span></span></span> which you hold constant for <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>C</mi></mrow><annotation encoding=\"application/x-tex\">C</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.07153em;\">C</span></span></span></span> timesteps, until 
you update\nthem to <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>θ</mi></mrow><annotation encoding=\"application/x-tex\">\\theta</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.69444em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.02778em;\">θ</span></span></span></span> and hold them constant again. This is claimed to\nprevent the spiraling that we describe here.</p>\n<h2>The Experiment</h2>\n<h3>Aim of the Experiments</h3>\n<p>The experiments we run aim to visualize the impact that the tricks of\nDQN have. Note that in <a href=\"#references\">[1]</a>, the authors find that each of\nthe two tricks, experience replay and a fixed target network, makes its\nown contribution to the success of the agent and that their contributions\namplify when they are combined. We are interested in whether we can\nconfirm these results in a simpler environment and in highlighting the\nindividual contribution of each trick.</p>\n<p>The self-written source code of the experiments can be found on\nGitHub<sup id=\"fnref-2\"><a href=\"#fn-2\" class=\"footnote-ref\">2</a></sup>, where we credit others whose work inspired\nours.</p>\n<h3>The Environment</h3>\n<p>The desiderata for our environment were the following:</p>\n<ul>\n<li>It is simple enough to allow easy experimentation.</li>\n<li>It is complex and difficult enough so that DQN without experience\nreplay and a fixed target network is not able to solve it.</li>\n</ul>\n<p>With these criteria in mind, we will see if the tricks proposed in DQN\nmake it solvable.</p>\n<p>We qualitatively reviewed the performance of our model trained on all\nclassic control environments of OpenAI Gym<sup id=\"fnref-3\"><a href=\"#fn-3\" class=\"footnote-ref\">3</a></sup>. 
Performance is measured\nas the (negative) episode duration (sign dependent on whether the\nagent is tasked with producing long or short episodes), and plotted over the\ntraining period. We found that the Acrobot-v1 and MountainCar-v0\nenvironments would require further modifications to DQN to be solvable,\ni.e. even with both tricks the agent did not converge. However, the\n<a href=\"https://gym.openai.com/envs/CartPole-v1/\">CartPole-v1 environment</a>\ndisplayed exactly the required properties: it diverged when using\nstandard semi-gradient Q-learning with function approximation, but\nconverged when adding experience replay and a target network. In this\nenvironment the return is the total number of steps that the pole stays\nupright, with an upper limit of 500 steps per episode.</p>\n<h3>Methods</h3>\n<p>We have trained the network in the following four versions:</p>\n<ol>\n<li>vanilla semi-gradient Q-learning with function approximation.</li>\n<li>the same as (1), but with experience replay.</li>\n<li>the same as (1), but with a fixed target network which is\nupdated every <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>c</mi><mo>∈</mo><mi>C</mi></mrow><annotation encoding=\"application/x-tex\">c \\in C</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.5782em;vertical-align:-0.0391em;\"></span><span class=\"mord mathdefault\">c</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">∈</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.07153em;\">C</span></span></span></span> steps. 
We experiment with different values\nof <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>C</mi></mrow><annotation encoding=\"application/x-tex\">C</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.07153em;\">C</span></span></span></span> in the set <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mo stretchy=\"false\">{</mo><mn>1</mn><mo separator=\"true\">,</mo><mn>50</mn><mo separator=\"true\">,</mo><mn>100</mn><mo separator=\"true\">,</mo><mn>200</mn><mo separator=\"true\">,</mo><mn>500</mn><mo stretchy=\"false\">}</mo></mrow><annotation encoding=\"application/x-tex\">\\{1, 50, 100, 200, 500\\}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mopen\">{</span><span class=\"mord\">1</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">5</span><span class=\"mord\">0</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">1</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">2</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">5</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mclose\">}</span></span></span></span>.</li>\n<li>a 
combination of (2) and (3). This is the full training process\nproposed in <a href=\"#references\">[1]</a>.</li>\n</ol>\n<p>Each variant uses the same simple neural network implemented in PyTorch:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">import torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass QNetwork(nn.Module):\n    # Small MLP: state vector in, one Q-value per action out.\n\n    def __init__(self, num_s, num_a, num_hidden=128):\n        super().__init__()\n        self.l1 = nn.Linear(num_s, num_hidden)  # input to hidden layer\n        self.l2 = nn.Linear(num_hidden, num_a)  # hidden layer to Q-values\n\n    def forward(self, x):\n        x = F.relu(self.l1(x))\n        x = self.l2(x)\n        return x</code></pre></div>\n<p>All variants use the same static set of hyperparameters listed in the table below, where we also explain the\nreasoning behind their choice.</p>\n<table>\n<thead>\n<tr>\n<th>Hyperparameter</th>\n<th>Value</th>\n<th>Rationale</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Number of episodes</td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>100</mn></mrow><annotation encoding=\"application/x-tex\">100</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span class=\"mord\">0</span><span class=\"mord\">0</span></span></span></span></td>\n<td>Variants that converge at all do so after around 50 episodes.</td>\n</tr>\n<tr>\n<td>Batch size</td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>64</mn></mrow><annotation encoding=\"application/x-tex\">64</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">6</span><span 
class=\"mord\">4</span></span></span></span></td>\n<td>Standard power-of-two value with no particular meaning.</td>\n</tr>\n<tr>\n<td>Discount Factor</td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>0.8</mn></mrow><annotation encoding=\"application/x-tex\">0.8</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">0</span><span class=\"mord\">.</span><span class=\"mord\">8</span></span></span></span></td>\n<td>Hand-picked during the experimentation and implementation process.</td>\n</tr>\n<tr>\n<td>Learning Rate</td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>1</mn><msup><mn>0</mn><mrow><mo>−</mo><mn>3</mn></mrow></msup></mrow><annotation encoding=\"application/x-tex\">10^{-3}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.8141079999999999em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span class=\"mord\"><span class=\"mord\">0</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.8141079999999999em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">−</span><span class=\"mord mtight\">3</span></span></span></span></span></span></span></span></span></span></span></span></td>\n<td>Hand-picked during the experimentation and implementation process.</td>\n</tr>\n<tr>\n<td>Memory Capacity</td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math 
xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>10000</mn></mrow><annotation encoding=\"application/x-tex\">10000</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mord\">0</span></span></span></span></td>\n<td>Hand-picked during the experimentation and implementation process.</td>\n</tr>\n<tr>\n<td>Replay Start Size</td>\n<td>Batch Size</td>\n<td>After <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>64</mn></mrow><annotation encoding=\"application/x-tex\">64</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">6</span><span class=\"mord\">4</span></span></span></span> state transitions, we begin training, since the memory then contains enough transitions to fill a batch.</td>\n</tr>\n<tr>\n<td>Seed</td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>0</mn><mo>−</mo><mn>10</mn></mrow><annotation encoding=\"application/x-tex\">0-10</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.72777em;vertical-align:-0.08333em;\"></span><span class=\"mord\">0</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">−</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span 
class=\"mord\">0</span></span></span></span></td>\n<td>Set of different seeds used to ensure reproducibility and to get uncorrelated results.</td>\n</tr>\n<tr>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msub><mi>ϵ</mi><mtext>start</mtext></msub></mrow><annotation encoding=\"application/x-tex\">\\epsilon_{\\text{start}}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.58056em;vertical-align:-0.15em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">ϵ</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.2805559999999999em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">start</span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span></span></span></span></td>\n<td>1</td>\n<td>At the beginning we explore maximally, since the policy has not yet found good behaviour.</td>\n</tr>\n<tr>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msub><mi>ϵ</mi><mtext>end</mtext></msub></mrow><annotation encoding=\"application/x-tex\">\\epsilon_{\\text{end}}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.58056em;vertical-align:-0.15em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">ϵ</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span 
class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.33610799999999996em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">end</span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span></span></span></span></td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>0.05</mn></mrow><annotation encoding=\"application/x-tex\">0.05</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">0</span><span class=\"mord\">.</span><span class=\"mord\">0</span><span class=\"mord\">5</span></span></span></span></td>\n<td>We don’t act greedily even during validation, so that the agent cannot overfit on the training experience</td>\n</tr>\n<tr>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><msub><mi>ϵ</mi><mtext>steps</mtext></msub></mrow><annotation encoding=\"application/x-tex\">\\epsilon_{\\text{steps}}</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.716668em;vertical-align:-0.286108em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\">ϵ</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.28055599999999997em;\"><span style=\"top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" 
style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">steps</span></span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.286108em;\"><span></span></span></span></span></span></span></span></span></span></td>\n<td><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>1000</mn></mrow><annotation encoding=\"application/x-tex\">1000</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mord\">0</span></span></span></span></td>\n<td>Within <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>1000</mn></mrow><annotation encoding=\"application/x-tex\">1000</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span class=\"mord\">0</span><span class=\"mord\">0</span><span class=\"mord\">0</span></span></span></span> updates, <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ϵ</mi></mrow><annotation encoding=\"application/x-tex\">\\epsilon</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">ϵ</span></span></span></span> is linearly annealed to <span class=\"katex\"><span class=\"katex-mathml\"><math 
xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>0.05</mn></mrow><annotation encoding=\"application/x-tex\">0.05</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">0</span><span class=\"mord\">.</span><span class=\"mord\">0</span><span class=\"mord\">5</span></span></span></span>. Hand-picked during implementation as it gave fast convergence.</td>\n</tr>\n<tr>\n<td>Optimizer</td>\n<td>Adam (default parameters)</td>\n<td>State-of-the-art optimizer suitable for most applications.</td>\n</tr>\n<tr>\n<td>NN architecture</td>\n<td>Input dim: dimension of the state vector, i.e. 4; Hidden dim: 128 followed by ReLU; Output dim: number of actions, i.e. 2</td>\n<td>Hand-picked based on problem and input complexity.</td>\n</tr>\n</tbody>\n</table>\n<h3>The Implementation</h3>\n<p>We used the code from the lab sessions of the\nReinforcement Learning course at the University of Amsterdam as the skeleton of our\ncode, adding a flag to enable or disable experience replay\nand an integer parameter that defines for how many steps the target\nnetwork stays fixed (1 means it is updated at every step, i.e. it is not\nfixed). All the environments used are from the <a href=\"https://gym.openai.com/\">OpenAI Gym\nlibrary</a>. For the implementation we use several\nstandard open-source machine learning Python packages (for details see\n<a href=\"https://github.com/YoniSchirris/rl-lab3\">the source code</a>). The\nexperiments were run on a regular laptop with an Intel i5-7200U CPU and 8GB\nof RAM, on which a single run of 100 episodes takes about 30 seconds.\nThis means that the results we obtained here can easily be checked on\nyour computer in a few minutes.</p>\n<h3>Results</h3>\n<p>To assess the performance of our agent, we look both at the return\nand at so-called soft divergence. 
Note that with return, we always mean\nthe undiscounted return, even if the agent discounts, as in our case,\nwith <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>γ</mi><mo>=</mo><mn>0.8</mn></mrow><annotation encoding=\"application/x-tex\">\\gamma = 0.8</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">0</span><span class=\"mord\">.</span><span class=\"mord\">8</span></span></span></span>. That is, we treat <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>γ</mi></mrow><annotation encoding=\"application/x-tex\">\\gamma</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span></span></span></span> not as part of the\nobjective, but as a hyperparameter. We will explain soft divergence in\nthe corresponding subsection.</p>\n<h4>Return</h4>\n<p>We will present the results of the experiments with plots in which the\nx-axis represents the number of the episodes and the y-axis represents\nthe return obtained during validation. 
This means that every 5 episodes\nwe take the trained model, fix <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>ϵ</mi></mrow><annotation encoding=\"application/x-tex\">\\epsilon</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">ϵ</span></span></span></span> to 0.05 and evaluate the\nperformance of this model, without training it during this process.</p>\n<p>Each curve belongs to one variant of the algorithm. For each variant, we\ntrain the agent from scratch 10 times with different seeds and create\nthe line by averaging the returns. The shaded area around each curve is\nthe <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>95</mn><mi mathvariant=\"normal\">%</mi></mrow><annotation encoding=\"application/x-tex\">95\\%</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.80556em;vertical-align:-0.05556em;\"></span><span class=\"mord\">9</span><span class=\"mord\">5</span><span class=\"mord\">%</span></span></span></span> confidence interval (i.e. 
it represents <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>μ</mi><mo>±</mo><mn>2</mn><mi>σ</mi></mrow><annotation encoding=\"application/x-tex\">\\mu \\pm 2\\sigma</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.7777700000000001em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\">μ</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span><span class=\"mbin\">±</span><span class=\"mspace\" style=\"margin-right:0.2222222222222222em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">2</span><span class=\"mord mathdefault\" style=\"margin-right:0.03588em;\">σ</span></span></span></span>).</p>\n<figure name=\"fig:1\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/6878cefe0f31bb2480b896acee49d14b/e49a9/yoni-fig-1.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: 
url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACVElEQVQ4y62UaXOiQBCG8///0W7lw+5WJdGo8RbxABSUQ0FOOfVN9ygeW/txqRq6h5l65u1jeAE95/MZ/+NhzksNq05nZFmJPC/IFjebZbnw87y8zXmU5YlGhaKohK05AhiGAQ6HCKvVDqPRHIOBTGOKfn8q/F5PwmSyxHA4I3+CbncCTTNhWQfsdiH2++gZWJYlojCBZdhQ1S2mUxWyrNFYQZIUzGYrAipQlI3wJUlFpytDN3ZwnFBAn4AUPeI4g23zYkQnxnDdWNhHn9c8L8HWDNAZm2J+AT4o5FcQBPD9WJzEG+pTHSf4p6+s9vj5toS+OWCh7WHZwR1YVZVwjsecchKQyuAGZb+e75wLqD3aoDUw8Pqh4Gu8RWtoQNO9e5Vt26Zw9hRy+qSuVst2Y/owLR+DqYlmX8fvlibAbP+0Nahr9w6Mogg8GMg5fISyOq4ghzVdOvgkZQx5766pIAc0CP7rUxXrN2DdlEmSP4VYh+wSUJrbeP9aCzUfBOtLpjhQ0106RH9WyK8oCkVROId1qAzinMmKgyYp+/G2QKOnYyhbVAxP7OPqX9IR/N02VOkwxWTuYEft4VxbpzXa4rWhirBaIxOa4YsWEe3C+0wXDqfFTXA63YAXqVF0xGSsQlcM2LoFdWGg2aKb0R5j0OhgOZ7B0VawZjNs50vYhgVLlrCVJDiWh4uuaw65dfh+JlGMPM0Qk/XcAwq6uwkXzaeQioKsj8Dj75n4FtDI0xRZmt+B9z/NmWSfhMeWr+PlKyi3Nt0OEymBqusetmVVPv5qBPQbEARt7yrKUOkAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"yoni fig 1\" title=\"yoni fig 1\" src=\"/static/6878cefe0f31bb2480b896acee49d14b/b9e4f/yoni-fig-1.png\" srcset=\"/static/6878cefe0f31bb2480b896acee49d14b/cf440/yoni-fig-1.png 148w,\n/static/6878cefe0f31bb2480b896acee49d14b/d2d38/yoni-fig-1.png 295w,\n/static/6878cefe0f31bb2480b896acee49d14b/b9e4f/yoni-fig-1.png 590w,\n/static/6878cefe0f31bb2480b896acee49d14b/e49a9/yoni-fig-1.png 640w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption>Fig 1: Comparison between DQN with and without experience\nreplay</figcaption>\n</figure>\n<p><a href=\"#fig:1\">Figure 1</a> presents the comparison between\nthe models with and without experience replay. In both cases, we did not\nuse the fixed target network. The picture clearly displays the\ndifference in quality of the two different methods showing that\nexperience replay is a very important factor of convergence in this\nenvironment. 
We also ran the method without experience replay for 500\nepisodes, but still did not see any clearly improving values as seen in\n<a href=\"#fig:apx-1\">Figure 6</a> in the appendix.</p>\n<figure name=\"fig:2\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/249b0315293573094a93345de3254dac/e49a9/no_experience_varying_target.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACdklEQVQ4y62UaVPbMBCG+f+/pNMvdKbQ0iEchoSEkNNJCCYhJ5blI5ft+IidvF0pBwVm+FTPrLVaaR/trlc+2mw2+F+PYB0JZbVaIUlSRHGCKFohDCMa408Sxyu5LvQwjMkvlX5C9oFJ4HLpYzp1MRo5aLdfUKu10Wx20Gp15fj4+AJV1aBpA9TrT9LWaGgwjCksy4XjeG9AoYRhiNlsQcCxdMzlSigUKlLy+TKKxTruizWUSg3c3hSkXdj6fQO27ZG474EiwvnCB+NzMDYF0yfydMYmcr7VScYm9IEOZsyg6w44rX+KULyCIIDrLneneXKD4/j/6GR3aJ3ZsMcMziSAzacwtTYsPqO5j816B4zjGL7vETCQp+1FpHHQxWj7MMcGzMEQFgGFbjSr4CNxwJKA6y2Qc04ROPC8L4BCp4gtcjZ7vS2wP4TRqIB3uxK43kfouq6soeeFXwJNSp8NRjAJIGyc0uXNGvSWCpM72JZw14dRFMmPws0ZpfYRuIBNdk4pP7UoolYd/JWD1YpglQKaD5foDfsfGpvqOFt4GBo2Jpa3BdnbunHdAuu/glPhK9VrMLUErVmGUS2gX60hf/uT2qfzHhiJPiSgqqnQ1DylMMGEUpwSpNd+hpq9R7VRwln2GMr5MS4uvqGjZnGnKPhx/p2Azx+vXiy/cv6hgIxygsZTCx2ScrmIXCGHy6sMzjK/cHV9gRMlgz+3Cq5o7++7LE6VG3QpgwNw35DiXi7cEGG0pnqKvpvA8wPSXcxcD6IrfG/br3G8wXweyCCieI0gXL1vbDERsl6nh5PSND38RUxqLcNg8oqmaSJtYm+SJG9/mp3fX0z5Zh4WcYt/AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"no experience varying target\" title=\"no experience varying target\" src=\"/static/249b0315293573094a93345de3254dac/b9e4f/no_experience_varying_target.png\" 
srcset=\"/static/249b0315293573094a93345de3254dac/cf440/no_experience_varying_target.png 148w,\n/static/249b0315293573094a93345de3254dac/d2d38/no_experience_varying_target.png 295w,\n/static/249b0315293573094a93345de3254dac/b9e4f/no_experience_varying_target.png 590w,\n/static/249b0315293573094a93345de3254dac/e49a9/no_experience_varying_target.png 640w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption>Fig 2: Comparison of DQN without experience replay for different update rates\nof the target\nnetwork.</figcaption>\n</figure>\n<p>In <a href=\"#fig:2\">Figure 2</a> we display the effect of\nthe target network with different update rates on the obtained returns\nwith experience replay turned off. We observe that the version with an\nupdate rate of <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>50</mn></mrow><annotation encoding=\"application/x-tex\">50</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span><span class=\"mord\">0</span></span></span></span> is often in the lead. However, we do not see an\nobvious statistically significant effect of the target network alone. 
To keep the\ngraphs uncluttered, we do not show the values of 100 and 500 which\nperform roughly equal to the 50 value.</p>\n<figure name=\"fig:3\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/5cd90ce8574bab9cbce7c0f096459fd0/e49a9/experience_varying_target.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACl0lEQVQ4y61U7VLaUBD1/V+i05m2qLVW6ShaRUVBBQGLRhpBAgbJDYaEfAIhgdO9CQH53zuz2UNYzt09u8sG6Mznc/yPw3k2ErJwNsN4EmAymWI89sn7kU+w708jn9h0Gi4tCMJlUhGhbdswTQeyPEC93ors6ellYU0IQgOi2EatJhJuRu8VZmIwcPDetyK/Ruh5HoZDiwjf6Ed/cX9fR6FQRqlUw81NFdWqgGLxHlf5En2uoFwR8PyiYKDZeH93iNDFbLYg5A/btjCkDFnfhKIYkTFmLLAe4V5Ph8qG6LypuH16wFm+AFXugZH1mYY4wXlM6LoOLMuDpjlL42V8xDwLxgZ4aD7ivHKJ1I8tlEtZXBR+Q5KkVVN83480tO0RpW8vTdM+YLKB7uFOqCF9foD9k5/Y3PuC9FEK+5kUxMbzipAxhn5fheMsCLWYjGeWECsq6dvp4ejqGHsHX7GX3kYmd4qdk2NspVOoN5orQsviHTbhOhN0qSRFNRal2nRR3EGprVEzLrF7+A27RJbNZvHaUXFTk5E5zaHR6KwIOXBIQ4dKbvdUvCr9SK+4ZAuGMcJjXcLOwSb2D3eRu65CqMvQSYLXroGq0ENH1tcHezTy4NoT1KQWGl0K1lwitdHqaHiWDGQLJXze/oQcjZDYos4zK9aVsufzyEiS5RzyByc0hy4ylRL+dBow9TF0Cq4IDGe3bez8+o6LixP0FDO6KGlY4nXdWZ/DIAioyx6u74ooPlQhU9mi9IZcWaQS88jnz9ElzRIpuLZJ41Q1xmsZzmiP+Q577gjj0QS2QyUbQ3o3iUaKTwCP596yHNrfGWGPvnNpx0OKC9ZXLxE0nIUxpgtCyjo5qspoYxT6oxgjDOMY7oMPMQnHP7usXwP43tK3AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"experience varying target\" title=\"experience varying target\" src=\"/static/5cd90ce8574bab9cbce7c0f096459fd0/b9e4f/experience_varying_target.png\" srcset=\"/static/5cd90ce8574bab9cbce7c0f096459fd0/cf440/experience_varying_target.png 
148w,\n/static/5cd90ce8574bab9cbce7c0f096459fd0/d2d38/experience_varying_target.png 295w,\n/static/5cd90ce8574bab9cbce7c0f096459fd0/b9e4f/experience_varying_target.png 590w,\n/static/5cd90ce8574bab9cbce7c0f096459fd0/e49a9/experience_varying_target.png 640w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption>Fig 3: Comparison of DQN with experience replay for different update rates\nof the target\nnetwork.</figcaption>\n</figure>\n<p>In <a href=\"#fig:3\">Figure 3</a> we illustrate the effect of the\ncombination of both tricks. With experience replay enabled, over\nlarge parts of the training process the version with an update rate of\n<span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>200</mn></mrow><annotation encoding=\"application/x-tex\">200</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">2</span><span class=\"mord\">0</span><span class=\"mord\">0</span></span></span></span> steps is leading, although no statistically significant\ndifference can be inferred in comparison to the update rates of <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>50</mn></mrow><annotation encoding=\"application/x-tex\">50</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span><span class=\"mord\">0</span></span></span></span> and <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>1</mn></mrow><annotation encoding=\"application/x-tex\">1</annotation></semantics></math></span><span class=\"katex-html\" 
aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span></span></span></span>\n(no target network).</p>\n<p>Finally, in Figures <a href=\"#fig:1\">1</a> and <a href=\"#fig:3\">3</a>, we observe that the average episode\nduration decreases after 75 episodes. To investigate whether this is a\nsystematic or purely random effect, we trained the model for 300\nepisodes; the result can be seen in <a href=\"#fig:apx-2\">Figure 7</a> in the appendix. In both\nsettings the performance seems to decrease slightly after 100 episodes.\nTo be on the safe side, we therefore suggest periodically taking\nsnapshots of the weights and in the end using the model with the highest\naverage episode duration.</p>\n<h3>Soft Divergence</h3>\n<p>We now turn to assessing the so-called soft divergence. Soft divergence\nwas first studied in <a href=\"#references\">[4]</a> as follows:</p>\n<p>Let <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>γ</mi></mrow><annotation encoding=\"application/x-tex\">\\gamma</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.625em;vertical-align:-0.19444em;\"></span><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span></span></span></span> be our discount rate, in our case <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>0.8</mn></mrow><annotation encoding=\"application/x-tex\">0.8</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">0</span><span class=\"mord\">.</span><span class=\"mord\">8</span></span></span></span>. 
Since our rewards\nare all in the interval <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mo stretchy=\"false\">[</mo><mo>−</mo><mn>1</mn><mo separator=\"true\">,</mo><mn>1</mn><mo stretchy=\"false\">]</mo></mrow><annotation encoding=\"application/x-tex\">[-1, 1]</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"></span><span class=\"mopen\">[</span><span class=\"mord\">−</span><span class=\"mord\">1</span><span class=\"mpunct\">,</span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\">1</span><span class=\"mclose\">]</span></span></span></span> (in fact, the reward is always <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>1</mn></mrow><annotation encoding=\"application/x-tex\">1</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span></span></span></span> until the\nagent loses or the episode is over), there is an upper bound on the\ntrue Q-values. 
We can see this by bounding the discounted\nreturn:</p>\n<p><span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>G</mi><mo>=</mo><msubsup><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>0</mn></mrow><mi mathvariant=\"normal\">∞</mi></msubsup><msup><mi>γ</mi><mi>n</mi></msup><msub><mi>R</mi><mi>n</mi></msub><mo>≤</mo><msubsup><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>0</mn></mrow><mi mathvariant=\"normal\">∞</mi></msubsup><msup><mi>γ</mi><mi>n</mi></msup><mo>=</mo><mfrac><mn>1</mn><mrow><mn>1</mn><mo>−</mo><mi>γ</mi></mrow></mfrac><mi mathvariant=\"normal\">.</mi></mrow><annotation encoding=\"application/x-tex\">G = \\sum_{n = 0}^{\\infty} \\gamma^n R_n \\leq \\sum_{n = 0}^{\\infty} \\gamma^n  = \\frac{1}{1 - \\gamma}.</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.68333em;vertical-align:0em;\"></span><span class=\"mord mathdefault\">G</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.104002em;vertical-align:-0.29971000000000003em;\"></span><span class=\"mop\"><span class=\"mop op-symbol small-op\" style=\"position:relative;top:-0.0000050000000000050004em;\">∑</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.804292em;\"><span style=\"top:-2.40029em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">n</span><span class=\"mrel mtight\">=</span><span class=\"mord mtight\">0</span></span></span></span><span style=\"top:-3.2029em;margin-right:0.05em;\"><span class=\"pstrut\" 
style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">∞</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.29971000000000003em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.664392em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathdefault mtight\">n</span></span></span></span></span></span></span></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.00773em;\">R</span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.151392em;\"><span style=\"top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathdefault mtight\">n</span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.15em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">≤</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.104002em;vertical-align:-0.29971000000000003em;\"></span><span class=\"mop\"><span class=\"mop op-symbol small-op\" style=\"position:relative;top:-0.0000050000000000050004em;\">∑</span><span class=\"msupsub\"><span 
class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.804292em;\"><span style=\"top:-2.40029em;margin-left:0em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathdefault mtight\">n</span><span class=\"mrel mtight\">=</span><span class=\"mord mtight\">0</span></span></span></span><span style=\"top:-3.2029em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">∞</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.29971000000000003em;\"><span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.16666666666666666em;\"></span><span class=\"mord\"><span class=\"mord mathdefault\" style=\"margin-right:0.05556em;\">γ</span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.664392em;\"><span style=\"top:-3.063em;margin-right:0.05em;\"><span class=\"pstrut\" style=\"height:2.7em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathdefault mtight\">n</span></span></span></span></span></span></span></span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:1.326216em;vertical-align:-0.481108em;\"></span><span class=\"mord\"><span class=\"mopen nulldelimiter\"></span><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.845108em;\"><span style=\"top:-2.6550000000000002em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span 
class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">1</span><span class=\"mbin mtight\">−</span><span class=\"mord mathdefault mtight\" style=\"margin-right:0.05556em;\">γ</span></span></span></span><span style=\"top:-3.23em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"frac-line\" style=\"border-bottom-width:0.04em;\"></span></span><span style=\"top:-3.394em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">1</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.481108em;\"><span></span></span></span></span></span><span class=\"mclose nulldelimiter\"></span></span><span class=\"mord\">.</span></span></span></span></p>\n<p>In our case, this bound is <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mfrac><mn>1</mn><mrow><mn>1</mn><mo>−</mo><mn>0.8</mn></mrow></mfrac><mo>=</mo><mn>5</mn></mrow><annotation encoding=\"application/x-tex\">\\frac{1}{1 - 0.8} = 5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:1.2484389999999999em;vertical-align:-0.403331em;\"></span><span class=\"mord\"><span class=\"mopen nulldelimiter\"></span><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.845108em;\"><span style=\"top:-2.655em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">1</span><span class=\"mbin mtight\">−</span><span class=\"mord mtight\">0</span><span class=\"mord mtight\">.</span><span class=\"mord mtight\">8</span></span></span></span><span style=\"top:-3.23em;\"><span 
class=\"pstrut\" style=\"height:3em;\"></span><span class=\"frac-line\" style=\"border-bottom-width:0.04em;\"></span></span><span style=\"top:-3.394em;\"><span class=\"pstrut\" style=\"height:3em;\"></span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">1</span></span></span></span></span><span class=\"vlist-s\">​</span></span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height:0.403331em;\"><span></span></span></span></span></span><span class=\"mclose nulldelimiter\"></span></span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span><span class=\"mrel\">=</span><span class=\"mspace\" style=\"margin-right:0.2777777777777778em;\"></span></span><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span></span></span></span>. Now, <em>soft\ndivergence</em> is the phenomenon of learned Q-values that exceed this\nbound. This is connected to the deadly triad: off-policy learning with\nbootstrapping and function approximation can lead to a situation in\nwhich the Q-values chase the targets, which themselves can get bigger\nand bigger until they exceed realistic bounds.</p>\n<figure name=\"fig:4\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/fe26e7ac01107c6354f621ae80e9dc8c/df67e/q-values-without_exp.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 70.68376068376068%; position: relative; bottom: 0; left: 0; background-image: 
url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACPElEQVQ4y61TyXbaMBTN//9DTzfZtZvu2pMTSJkCZjB2gDIYAgHMIGzAswH79knGBOi29rnW07N0dd+ghziO8b8ezvWQGsfjCUFwIBwRhgnSOYfrBrBtl0YfjpMgsT34fohUmCA8Ho/Y712sVnus1xb6I4a5vhV2gj36/TGKxSokSUGlIhOaqFYV8tXQbmuIojMhJ47jCJbtic1zfYdCY4LBO8NmYwsf4/4pw3A4w/u7foPhcI7pdEWE0afCKDoJhRtmY0DqstIYancBxqyLSkb/NhvnHzDmwDTda4UxTqcTLMsThK3eArKsQVLnWCyTcDl4OpKU3NpLGvlhF4XcsCyL4NJPC52ejlWrCYny1NXm2G49rGjDPSEf2Tm/PDUXwkSdBcf2Rf5a9QYa0k/k8j+QeX6CPl7CmC1gGBTaLqQDArIpGsKasFjaZ4VxqjDG4XCA44Zis1R6hqy1UWuV8fT9K349fkHx2yOa2RfIVNlyqYqi9IZyQcFrtoxMXoH2we77MIJNhKPOAGqvA2Z6mDTa6BVKGDSbaMg11CsltMqvUIt51PO/Iede0M5loBRzmE31T0L+4XCdAP3ukPpNg670MClUsZ6bMChMk4dJECHzuenDoMoaWxqp0sZ1DkUf0rvb2vijdjCsqfiQVOgjKg4lfbncXQqS2Lsb34Js3l43fchD9rxAkDqk1CLsqY0cm2zLp0LYBEe0Fu9XseZs27SGz2+uXmLH17f8ciJ/wjCk++yLNak/vluT+v4CiYQf3LsFMo0AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"q values without exp\" title=\"q values without exp\" src=\"/static/fe26e7ac01107c6354f621ae80e9dc8c/b9e4f/q-values-without_exp.png\" srcset=\"/static/fe26e7ac01107c6354f621ae80e9dc8c/cf440/q-values-without_exp.png 148w,\n/static/fe26e7ac01107c6354f621ae80e9dc8c/d2d38/q-values-without_exp.png 295w,\n/static/fe26e7ac01107c6354f621ae80e9dc8c/b9e4f/q-values-without_exp.png 590w,\n/static/fe26e7ac01107c6354f621ae80e9dc8c/f9b6a/q-values-without_exp.png 885w,\n/static/fe26e7ac01107c6354f621ae80e9dc8c/df67e/q-values-without_exp.png 1170w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption>Fig 4: Max Q-Values within evaluation episodes without experience\nreplay.</figcaption>\n</figure>\n<figure name=\"fig:5\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" 
href=\"/static/1631068184d67d379a3cbca5f9f30f7c/df67e/q-values-with_exp.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 70.68376068376068%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACIklEQVQ4y61Ua2/aMBTt//8j1T5N2ySkThudprHyKFtbBqHlEQjEeZmQh5OQbOTs2oGoAz4u0omvHeX43HuPfVWWJf7XI7muKr4SRfEHu91vQlGN+auYkGU5hEgVkiSjMTuMqfp2FFYThmECxwnhuBFc24fLPBXLNZfG1cpFvz/A3V0fnc6DQrf7iFbrHsPhFPt9TVgFQuzg2AE8noDpBsyXKTxPwD0Q2paPxcLCfG5C11mN2cykzRwi3L9WiIPCiEgiaL0+Hlpt2EQm55JQgnNxBrnpZpOcKiwh4oxSjGEtGbqfmvj+sYnxeE2KhSJT5VBqw7NYblorlK8kEfB9UuCnmA5G6H1tk8Iebr/1sLaoDIe0L0ESch6fEiYqZVmv0f0jjJWHwfAJb27eo/f4Ame8gOPF9HNQqzrWtlJ4QlgUBcKImrH2MB3PMVgSFlPcfmng7fU1Zs0O+JqDB7uqBETgHeBSmf5ReOxykuZgSwfafI4Zs8DNLQyySLt5g88f3uFXo4GlNgEj+9gmh818BYtij1SfN4VsszZsaEudFEQwyGdssoIf5tBNi0qg4fnHE4zJgja2FDEju5gUe25wwYfU5efhmHxmgI10GD812HZICMCVNTJqWqZ8qgxPqcrUbWW1Cymn6Q7+JoCIUoTbmGqaKtVRlCkHbOUaNS4MRLUepmoex5ma10fveKhPL4my3Ndxnud0njNVmqOS6p/92eXwF3P3JHn48pWmAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"q values with exp\" title=\"q values with exp\" src=\"/static/1631068184d67d379a3cbca5f9f30f7c/b9e4f/q-values-with_exp.png\" srcset=\"/static/1631068184d67d379a3cbca5f9f30f7c/cf440/q-values-with_exp.png 148w,\n/static/1631068184d67d379a3cbca5f9f30f7c/d2d38/q-values-with_exp.png 295w,\n/static/1631068184d67d379a3cbca5f9f30f7c/b9e4f/q-values-with_exp.png 590w,\n/static/1631068184d67d379a3cbca5f9f30f7c/f9b6a/q-values-with_exp.png 885w,\n/static/1631068184d67d379a3cbca5f9f30f7c/df67e/q-values-with_exp.png 1170w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption>Fig 5: Max Q-Values within evaluation episodes with experience\nreplay.</figcaption>\n</figure>\n<p>In order to 
assess the presence of soft divergence, we measure the\nmaximal Q-values that appear within evaluation episodes. In <a href=\"#fig:4\">Figure 4</a>, you can find this for the case\nwithout experience replay, and in <a href=\"#fig:5\">Figure 5</a> for the case with experience replay.</p>\n<p>First, we can see that with experience replay, the maximal\nQ-value converges precisely to <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>5</mn></mrow><annotation encoding=\"application/x-tex\">5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span></span></span></span>. This is expected, since we already\nknow that an agent using experience replay achieves high return, and so\nif the Q-network estimates the return correctly, it should output something\nclose to the optimal value, which is <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>5</mn></mrow><annotation encoding=\"application/x-tex\">5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span></span></span></span>.</p>\n<p>However, the version without experience replay actually shows soft\ndivergence, with Q-values that are consistently too large at the end of\ntraining, ranging between <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>5</mn></mrow><annotation encoding=\"application/x-tex\">5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span 
class=\"mord\">5</span></span></span></span> and <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>8</mn></mrow><annotation encoding=\"application/x-tex\">8</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">8</span></span></span></span>. Additionally, since we\nalready know that the agent performs very badly, we would actually expect\nQ-values lower than <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>5</mn></mrow><annotation encoding=\"application/x-tex\">5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span></span></span></span>. This gives an explanation for why the agent is\nnot able to perform properly: the Q-value estimates are\npartly too high during the entire training process and therefore\ninaccurate, so they cannot guide the agent towards the best\nactions.</p>\n<p>Another interesting phenomenon can be observed by looking at the\ndifferences between the different target network update rates. 
In both\nplots we see that, at the beginning of the training process, the\nQ-values get considerably larger than <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>5</mn></mrow><annotation encoding=\"application/x-tex\">5</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">5</span></span></span></span>, ranging up to <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>25</mn></mrow><annotation encoding=\"application/x-tex\">25</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">2</span><span class=\"mord\">5</span></span></span></span>. That is,\nall versions of the algorithm show soft divergence in the first 50\nepisodes, with a peak after <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>20</mn></mrow><annotation encoding=\"application/x-tex\">20</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">2</span><span class=\"mord\">0</span></span></span></span> episodes. One explanation might be the\n<em>maximization bias</em>, which is present in Q-learning. 
Maximization bias\noccurs when the algorithm overestimates the Q-values\nas a consequence of the <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mi>max</mi><mo>⁡</mo></mrow><annotation encoding=\"application/x-tex\">\\max</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.43056em;vertical-align:0em;\"></span><span class=\"mop\">max</span></span></span></span> operator in the update rule. This bias\nhas been studied and can be mitigated, for example, with\ndouble Q-learning <a href=\"#references\">[5]</a>. However, since we did not\ninclude experiments using double Q-learning, we cannot assess the\nvalidity of this explanation.</p>\n<p>A second and complementary explanation might be the following:\nat the beginning of training, the learned representations of the\nQ-network are still poor and might not distinguish\nbetween different state-action pairs. If that happens, then increasing\none Q-value might increase the Q-values of all other state-action pairs in\nthe same way, which creates the already explained phenomenon that\nQ-values can grow beyond realistic bounds. This would also explain why the problem\nis less severe when using the fixed target network, and even less severe\nif the target network is updated less often.</p>\n<h2>Conclusion</h2>\n<p>In this blog post we analyzed the effect that the two tricks, experience\nreplay and a fixed target network, have on the convergence of the DQN\nalgorithm. Our experiments showed that the DQN algorithm\nwithout experience replay and a fixed target network fails to converge.</p>\n<p>Furthermore, we observed that experience replay has a significant impact\non the objective, i.e. the return, whereas the fixed target network showed\nalmost no effect on it. 
This differs from our initial expectation that\nonly the addition of <em>both</em> tricks would allow\nthe agent to perform optimally.</p>\n<p>We believe the large effect of experience replay on the return\narises because, in a physical system like the CartPole environment,\nsubsequent state observations are deterministically dependent. This is\nthe strongest form of correlation and might lead to overly large steps in\none direction. This can be particularly critical, because a fast move of\nthe cart in one direction can tip the\npole in the other direction and end the episode.</p>\n<p>The small effect of the fixed target network on the return\ncontrasts, however, with its effect on soft divergence. We\nsaw that the fixed target network actually plays an essential role in\nmitigating the initially very high Q-values seen in all variants of the\nalgorithm. This does not itself change the return of the agent after\nconvergence, since in all cases the algorithm is able to correct for\nthe vastly too large initial Q-values. Still, these results\nhint that in more complex environments, we would also see an influence\nof the fixed target network on the return achieved by agents, as updating\nthe target network less often can help fight soft divergence.</p>\n<p>We conclude that DQN generally contains strong and theoretically sound\nideas, and we saw that both tricks can, in different ways, help avoid\ndivergence. 
However, training it successfully and making good use of\nthe tricks still requires practice and careful tuning.</p>\n<h2>Appendix</h2>\n<h3>DQN without Experience Replay after 500 Episodes</h3>\n<figure name=\"fig:apx-1\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/1008484e72fa85266ada94ec04c5e84a/e49a9/fig-for-non-converging-dqn-over-500-episodes.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACUUlEQVQ4y61U127bUAzN/39HH4oCRVAgKTqStE3apgmcwHHlKc/I1rSsZUvWsE9JSnLGcyUckyLvPTy8w0f7/R7/62GuI3bSNEWW59imhG2KOE6QJFtCWtnS5xz7cVzGUhqf54WgFnbEdr2OsPIiTB+XGKgzKIqKbndEGKPTGR7sYFDm2u0hfY+gaTZsJ4LrRtjtDoTs7Il0C9sOCCEcGuQ4YeWX9jV4jGUF0A0f7vIZIf8EgU9YS7WnCaU1zeBFjGESkWGWEEJ3/VxhuaCbTSqTuWpNwP5MWwmBbZU5XffxOF8JODdfeKL2QFgUBcIwKBVWKoS0UjieuTBIRa14NF0KZkSoThxoc0/GHjbFsizasZiQiRKHJi1IBbfCxDKJVBimL/nu0MZg7GBChZS+KeRc6EAYRWt4nkcqNweFc90TEv5WBqa0VhfpEeF9WxdcNzU8dI2XhOWB3MkuW5VCXhdeI17T8z8TUcExLnTbmuPzr5Hg9FLFzYNGcV8I9zVhlqWkMBZCPgIqtdQeWKLo+LyPVs9Ef2SLurefOnh/1sPHq6Hg5IeKHuVe3JQ8z+AHm2rHQjSpDa7Oyt6cKji7HuPr7zGuGjO8+9LFh299HF/0cdmYSv5O0Z8I696DMCEVDqkx0Pi7wMXNVAhuWwt8v2V/ip/3GrU4R7Ojo6Pa0smdssBotnqtMJe7GUa029scnh/RMYrobhdkQ2zWsUzwVj7iTYI8K2hpVnSXM4Hreq+vnvBTsBCPzyava/3HYZgGnUUdUcQHeCfxJEmwr17xK55/optqq02Pg1MAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"Results\" title=\"Results\" src=\"/static/1008484e72fa85266ada94ec04c5e84a/b9e4f/fig-for-non-converging-dqn-over-500-episodes.png\" 
srcset=\"/static/1008484e72fa85266ada94ec04c5e84a/cf440/fig-for-non-converging-dqn-over-500-episodes.png 148w,\n/static/1008484e72fa85266ada94ec04c5e84a/d2d38/fig-for-non-converging-dqn-over-500-episodes.png 295w,\n/static/1008484e72fa85266ada94ec04c5e84a/b9e4f/fig-for-non-converging-dqn-over-500-episodes.png 590w,\n/static/1008484e72fa85266ada94ec04c5e84a/e49a9/fig-for-non-converging-dqn-over-500-episodes.png 640w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption> This result shows that even after 500 episodes, the version without\nexperience replay fails to converge in the CartPole environment. There\nis a slight upward trend, but it is not very\nstable.</figcaption>  \n</figure>\n<h3>DQN with Experience Replay after 300 Episodes</h3>\n<figure name=\"fig:apx-2\">\n<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;\">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/84d54717d1c5845cdb2ae7ef5caf9bac/e49a9/appendix-fig-converge-to-diverge.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 75%; position: relative; bottom: 0; left: 0; background-image: 
url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAA9hAAAPYQGoP6dpAAACyElEQVQ4y61Ua3eaQBD1//+SPk5PkzZqYx6miRrSNBrfgCKiILKACGgU0dzOrjU59nP3nNkZluXO6w4Z0Hp9fcX/WBwnwzcu2+0O6yRFkmywXm/+6kQItw9yONtstkjT7Zs+BJU5GFH0guGQodXqodlU0WgopBXU6108P3fR7Q5Qq3WEzbVl+XDdGI4TwvPiY8B5OEc4j8EsBl2fYDAwocgagffRaGlkD6BplgBVVQOKOoJtBwKITefw/wXc7bZYLNbkLSKvC3guadNGp6mhrdiYWi7caUAfB5iYHkxrTvciMIpwbEzpfohDGzJ8S9MN4nglwndYBMf24WgapEodD48yhqqOQacHo6tCVcboKBO4LMSUxVBlg5z52ONRU3gz4jhGGC7B2N4rsxzY3RYqd1UUrqpQa3X0mjKMVhsylUH63YdtebCcGF2qt2dO9oC8y47jYLV6IdnAoXowSpmNqY53FyieZJHLVVA7z6IlPUK5L+O+cI1iqQu9P8ZgNEPpsgymD98BFwuqmeeKCD2KzqD6mH0dynUO5Wwe999OcfPxAyrfs2gUcrgmJ+eXVdQeW5BqQ5yfFOCosgAUPDyQcrlcw3XmUIYBZElC9eIc1TsJcvkWt19PUcxeQfpRQPnLZ+Q/nYroC2c3uDzJYyj3jonNazgn2vgsgKL7eLopoVKUYI8dTPt9aPUGftV04l8fzVIJD5c/0ag8oCeVUD7LE6Umx4C73Q4RATrjKZ7aNrJnt3iq6/D8paAMm7jQx3OMzACGbmM0cmHaIWZeBEMzYVPt8c7Dfcqc2DoRuFnv4+mxDc3wBWHZYRq4noZiOjgbXC7iXYTZbPlObL5xYvP5jOMl1i9rSj+i55TGcSEcJckWQRAJ7fsB1Xsl7gZBSGcpNXUmsjya5UPqfG02CfaET+mnkVKkHo3jQAxAkiTiHj/nNl+r1eotwj87HVyIIiiA5AAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;\"></span>\n  <img class=\"gatsby-resp-image-image\" alt=\"Results\" title=\"Results\" src=\"/static/84d54717d1c5845cdb2ae7ef5caf9bac/b9e4f/appendix-fig-converge-to-diverge.png\" srcset=\"/static/84d54717d1c5845cdb2ae7ef5caf9bac/cf440/appendix-fig-converge-to-diverge.png 148w,\n/static/84d54717d1c5845cdb2ae7ef5caf9bac/d2d38/appendix-fig-converge-to-diverge.png 295w,\n/static/84d54717d1c5845cdb2ae7ef5caf9bac/b9e4f/appendix-fig-converge-to-diverge.png 590w,\n/static/84d54717d1c5845cdb2ae7ef5caf9bac/e49a9/appendix-fig-converge-to-diverge.png 640w\" sizes=\"(max-width: 590px) 100vw, 590px\" loading=\"lazy\">\n  </a>\n    </span>\n<figcaption> This result shows that DQN with experience replay and with a fixed\ntarget network for 50 steps tends to reduce performance when training\nlonger. 
Note that the shaded area here is the 95% confidence interval\nover 5 runs from scratch, fewer than the 10 runs used in the other\nexperiments.</figcaption>  \n</figure>\n<h4>References</h4>\n<p>[1] <a href=\"https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf\">Human-level control through deep reinforcement learning</a> <br>\n[2] <a href=\"https://ui.adsabs.harvard.edu/abs/2013arXiv1312.5602M\">Playing Atari with Deep Reinforcement Learning</a> <br>\n[3] <a href=\"http://incompleteideas.net/book/RLbook2018.pdf\">Sutton &#x26; Barto: Reinforcement Learning Book</a> <br>\n[4] <a href=\"https://arxiv.org/pdf/1812.02648.pdf\">Deep Reinforcement Learning and the Deadly Triad</a> <br>\n[5] <a href=\"https://papers.nips.cc/paper/3964-double-q-learning.pdf\">Double Q-learning</a> <br></p>\n<div class=\"footnotes\">\n<hr>\n<ol>\n<li id=\"fn-1\">\n<p>It is beyond the scope of this blog post to go into the details of why the deadly triad can lead to divergence. 
We refer the interested reader to chapter <span class=\"katex\"><span class=\"katex-mathml\"><math xmlns=\"http://www.w3.org/1998/Math/MathML\"><semantics><mrow><mn>11</mn></mrow><annotation encoding=\"application/x-tex\">11</annotation></semantics></math></span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"strut\" style=\"height:0.64444em;vertical-align:0em;\"></span><span class=\"mord\">1</span><span class=\"mord\">1</span></span></span></span> of <a href=\"#references\">[3]</a>.</p>\n<a href=\"#fnref-1\" class=\"footnote-backref\">↩</a>\n</li>\n<li id=\"fn-2\">\n<p><a href=\"https://github.com/YoniSchirris/rl-lab3\">https://github.com/YoniSchirris/rl-lab3</a></p>\n<a href=\"#fnref-2\" class=\"footnote-backref\">↩</a>\n</li>\n<li id=\"fn-3\">\n<p><a href=\"https://gym.openai.com/\">https://gym.openai.com/</a></p>\n<a href=\"#fnref-3\" class=\"footnote-backref\">↩</a>\n</li>\n</ol>\n</div>","frontmatter":{"title":"Diving into the atari-game playing algorithm - Deep Q-Networks","date":"December 01, 2019","description":"A small collaborative research about the convergence properties of the Deep Q Network. Written by Leon Lang, Igor Pejic, and Simon Passenheim, and Yoni Schirris."}}},"pageContext":{"isCreatedByStatefulCreatePages":false,"slug":"/does-dqn-coverge/","previous":{"fields":{"slug":"/ideas-for-future-posts/"},"frontmatter":{"title":"Ideas for future posts"}},"next":null}}}