---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
# 13 Policy Gradient Methods

- Value-based methods: learn action-value estimates.
- Policy gradient methods: learn **parameterized policies**, $\pi(a \mid s, \boldsymbol{\theta})$, that can select actions without consulting a value function.
- Actor-critic methods: learn both.
    - Actor $\to$ policy
    - Critic $\to$ value function

Notation:
- $\boldsymbol{\theta}$: policy parameters
- $\mathbf{w}$: value function parameters

> [!NOTE] Equation 13.1: Gradient ascent update of policy parameters
>
> $$
> \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta})} \tag{13.1}
> $$

## 13.1 Policy Approximation and its Advantages
> [!FAQ]- What is enforced in practice to ensure exploration for PGMs?
> That the policy never becomes deterministic.

> [!NOTE] Equation 13.2: Soft-max in action preferences
>
> $$
> \pi(a \mid s, \boldsymbol{\theta}) \doteq \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_{b} e^{h(s, b, \boldsymbol{\theta})}} \tag{13.2}
> $$

> [!FAQ]- Which conditions must the problem fulfill for the *soft-max in action preferences* parametrization to be reasonable?
> The action space must be discrete and not too large.

> [!NOTE] Equation 13.3: Linear parametrization of action preferences $h$
>
> $$
> h(s, a, \boldsymbol{\theta}) \doteq \boldsymbol{\theta}^\intercal \mathbf{x}(s, a) \tag{13.3}
> $$
>
> where $\mathbf{x}(s, a)$ is a feature vector describing the state-action pair.
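A minimal code sketch (not from the book) may help make Equations 13.2 and 13.3 concrete. The feature function `x`, the action list `actions`, and the parameter vector `theta` below are hypothetical placeholders.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """pi(a | s, theta) for a soft-max over linear preferences
    h(s, a, theta) = theta^T x(s, a)  (Eqs. 13.2, 13.3)."""
    h = np.array([theta @ x(s, a) for a in actions])
    h -= h.max()                      # shift preferences for numerical stability
    e = np.exp(h)
    return e / e.sum()

# Hypothetical usage: 3 features, 2 actions; all-zero parameters give a uniform policy.
x = lambda s, a: np.array([1.0, float(s), float(a)])
theta = np.zeros(3)
print(softmax_policy(theta, x, s=0, actions=[0, 1]))   # -> [0.5 0.5]
```

Shifting the preferences by their maximum leaves the probabilities unchanged (the factor cancels in numerator and denominator) but keeps the exponentials from overflowing.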
> [!FAQ]- What are the advantages of the *soft-max in action preferences* parametrization compared with $\epsilon$-greedy policies?
> 1. It enables the policy to approach a deterministic policy, whereas $\epsilon$-greedy policies always maintain a minimum non-greedy probability $\epsilon$.
> 2. It enables truly stochastic policies. $\epsilon$-greedy policies force the policy to be almost greedy, but sometimes the best policy is to do $x$ with probability $p$ and $y$ with probability $1-p$ (e.g. bluffing in poker).

> [!FAQ]- What is the most important reason for using policy gradient methods instead of value-based methods?
> They allow you to inject prior knowledge about the desired form of the policy.
## 13.2 The Policy Gradient Theorem

> [!FAQ]- How does continuous policy parametrization help convergence, compared with VBMs?
> With VBMs, a small change in the value function can drastically change the policy.
> With PGMs, a small change in the policy parameters only changes the policy slightly, which helps optimization.

> [!NOTE] Equation 13.4: Performance $J(\boldsymbol{\theta})$ for the episodic case
>
> $$
> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) \tag{13.4}
> $$

> [!FAQ]- What problem/question does the Policy Gradient Theorem answer?
> How can we estimate the performance gradient with respect to the policy parameters when the gradient depends on the unknown effect of policy changes on the state distribution?
>
> **The policy gradient theorem gives an expression for the performance gradient that does not involve the derivative of the state distribution.**

> [!NOTE] Equation 13.5: Policy gradient theorem
>
> $$
> \nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla \pi(a \mid s, \boldsymbol{\theta}) \tag{13.5}
> $$
## 13.3 REINFORCE: Monte Carlo Policy Gradient

> [!NOTE] Equation 13.6: *All-actions* policy gradient
>
> $$
> \begin{align}
> \nabla J(\boldsymbol{\theta}) &\propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla \pi(a \mid s, \boldsymbol{\theta}) \\
> &= \mathbb{E}_{\pi} \left[ \sum_a q_{\pi}(S_t, a) \nabla \pi(a \mid S_t, \boldsymbol{\theta}) \right] \tag{13.6}
> \end{align}
> $$
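The note stops at the all-actions form. REINFORCE itself samples $A_t \sim \pi$ and uses the return $G_t$ as an unbiased sample of $q_{\pi}(S_t, A_t)$, which leads to the update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \gamma^t G_t \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$. Below is a rough sketch of that update for the linear soft-max policy of Eqs. 13.2 and 13.3 (restated so the snippet runs on its own); the episode format and helper names are assumptions, not the book's pseudocode.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    h = np.array([theta @ x(s, a) for a in actions])
    e = np.exp(h - h.max())
    return e / e.sum()

def grad_log_pi(theta, x, s, a, actions):
    # For the linear soft-max policy:
    # grad ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b)
    probs = softmax_policy(theta, x, s, actions)
    return x(s, a) - sum(p * x(s, b) for p, b in zip(probs, actions))

def reinforce_update(theta, x, episode, actions, alpha, gamma=1.0):
    """One Monte Carlo policy-gradient pass over a completed episode,
    given as [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)]."""
    G, returns = 0.0, []
    for (_, _, r) in reversed(episode):        # returns G_t, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, (s, a, _) in enumerate(episode):    # theta <- theta + alpha * gamma^t * G_t * grad ln pi
        theta = theta + alpha * gamma**t * returns[t] * grad_log_pi(theta, x, s, a, actions)
    return theta
```

Each increment uses only the sampled action and its return, so its expectation matches the all-actions expression above.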
## 13.4 REINFORCE with Baseline

#todo
## 13.6 Policy Gradient for Continuing Problems

> [!NOTE] Equation 13.15: Average rate of reward per time step
>
> $$
> \begin{align}
> J(\boldsymbol{\theta}) &\doteq r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E} \left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right] \tag{13.15} \\
> &= \lim_{t \to \infty} \mathbb{E} \left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right] \\
> &= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \, r
> \end{align}
> $$
>
> where
> - $\mu(s) \doteq \lim_{t\to \infty} \mathbb{P} \left[ S_t = s \mid A_{0:t} \sim \pi \right]$ is the steady-state distribution of states under $\pi$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption).
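As a numerical illustration of the last line of Eq. 13.15, the sketch below computes $r(\pi)$ for a made-up two-state, two-action MDP: it builds the state-transition matrix under $\pi$, takes the eigenvector for eigenvalue 1 as the steady-state distribution $\mu$ (relying on the ergodicity assumption above), and then sums $\mu(s)\,\pi(a \mid s)\,p(s' \mid s, a)\,r(s, a, s')$. All numbers are invented.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: p[s, a, s'] are transition probabilities,
# r[s, a, s'] are expected rewards per transition (equivalent to summing
# p(s', r | s, a) * r over r), and pi[s, a] is a fixed policy.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [3.0, 0.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# State-to-state transition matrix under pi: P[s, s'] = sum_a pi(a|s) p(s'|s,a)
P = np.einsum('sa,sax->sx', pi, p)

# Steady-state distribution mu: left eigenvector of P for eigenvalue 1,
# normalised to sum to 1 (exists by the ergodicity assumption).
w, v = np.linalg.eig(P.T)
mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
mu = mu / mu.sum()

# r(pi) = sum_s mu(s) sum_a pi(a|s) sum_{s'} p(s'|s,a) r(s,a,s')
r_pi = np.einsum('s,sa,sax,sax->', mu, pi, p, r)
print("mu =", mu, " r(pi) =", r_pi)
```

In the continuing case this $\mu$ is also the state weighting that appears in the policy gradient theorem (Eq. 13.5).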