diff --git a/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 9.md b/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 9.md
index 6582b416..8b5f8592 100644
--- a/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 9.md
+++ b/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 9.md
@@ -41,7 +41,7 @@ New notation! ($s\to u$ is an update rule for $v(s)$ using new expression $u$)
 > 
 > $$
 > \begin{align}
-> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && (9.1)
+> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && \tag{9.1}
 > \end{align}
 > $$
 
@@ -54,13 +54,13 @@ For on-policy episodic tasks, $\mu(s)$ is called the *on-policy distribution*, w
 > 
 > $$
 > \begin{align}
-> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})p(s \mid \bar{s}, a), && \text{for all } s \in S && (9.2)
+> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})p(s \mid \bar{s}, a), && \text{for all } s \in S && \tag{9.2}
 > \end{align}
 > $$
 > 
 > $$
 > \begin{align}
-> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && (9.3)
+> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && \tag{9.3}
 > \end{align}
 > $$
 
@@ -73,4 +73,105 @@ Where:
 
 - $\overline{VE}$ only guarantees local optimality.
 
-## 9.3 Stochastic-gradient and Semi-gradient Methods
\ No newline at end of file
+## 9.3 Stochastic-gradient and Semi-gradient Methods
+
+> [!NOTE] Equations 9.4 and 9.5
+>
+> $$
+> \begin{align}
+> \mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla \left[v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right]^2 && \tag{9.4} \\
+> &= \mathbf{w}_t + \alpha \left[v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t) && \tag{9.5}
+> \end{align}
+> $$
+
+However, since we don't know the true $v_\pi(s)$, we replace it with a *target output* $U_t$:
+
+> [!NOTE] Equation 9.7
+>
+> $$
+> \begin{align}
+> \mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \left[U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t) && \tag{9.7}
+> \end{align}
+> $$
+
+Where:
+- $U_t$ *should* be an unbiased estimate of $v_\pi(s)$, that is:
+	- $\mathbb{E}[U_t \mid S_t=s] = v_\pi(s)$
+	- In that case, $\mathbf{w}_t$ converges to a local optimum (under the usual stochastic step-size conditions).
+
+![[Pasted image 20240923171752.png|Pasted image 20240923171752.png]]
+
+Examples of $U_t$:
+- Monte Carlo target: $U_t = G_t$ (the return observed from $t$ until the end of the episode), unbiased.
+- Bootstrapping targets (e.g. $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$) are biased because they depend on $\mathbf{w}$ through $\hat{v}(S_{t+1}, \mathbf{w})$.
+	- Using such a target in (9.7) while treating it as a constant (stopping the gradient flow through it) yields *semi-gradient methods*, which are not true gradient-descent methods.
+
+*Semi-gradient methods*:
+- Do not converge as robustly as gradient methods, but they do converge reliably in important cases such as the linear case.
+- Are typically faster and enable online/continual learning.
+
+![[Pasted image 20240923172823.png|Pasted image 20240923172823.png]]
+
+## 9.4 Linear Methods
+
+> [!NOTE] Equation 9.8
+>
+> $$
+> \begin{align}
+> \hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^\intercal \mathbf{x}(s) = \sum_{i=1}^d w_i x_i(s) && \tag{9.8}
+> \end{align}
+> $$
+>
+> Where:
+> - $\mathbf{x}(s) = \left(x_1(s), \dots, x_d(s)\right)^\intercal$
+
+- The chapter also explores the convergence of semi-gradient TD(0) with linear approximation and finds that it converges to the *TD fixed point* (Eqs. 
9.11, 9.12), $\mathbf{w}_{TD}$.
+
+
+> [!NOTE] Equation 9.14
+>
+> Interpretation: The asymptotic error of the TD method is no more than $\frac{1}{1-\gamma}$ times the *smallest possible error*.
+>
+> $$
+> \begin{align}
+> \overline{VE}(\mathbf{w}_{TD}) & \leq \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w}) \tag{9.14}
+> \end{align}
+> $$
+
+
+![[Pasted image 20240923173826.png|Pasted image 20240923173826.png]]
+
+> [!NOTE] Equation 9.15
+>
+> $$
+> \mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t < T, \tag{9.15}
+> $$
+
+> [!NOTE] Equation 9.16
+>
+> $$
+> G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T - n. \tag{9.16}
+> $$
+
+
+## 9.5 Feature Construction for Linear Methods
+
+- 9.5.1 Polynomials
+- 9.5.2 Fourier Basis
+- 9.5.3 Coarse Coding
+- 9.5.4 Tile Coding
+- 9.5.5 Radial Basis Functions
+
+## 9.6 Selecting Step-Size Parameters Manually
+
+
+> [!NOTE] Equation 9.19
+> A good rule of thumb for setting the step-size parameter of *linear SGD methods*, where $\tau$ is the number of experiences (with substantially the same feature vector) within which you would like to learn, is:
+>
+> $$
+> \begin{align}
+> \alpha \doteq \left(\tau \mathbb{E}\left[\mathbf{x}^\intercal\mathbf{x}\right]\right)^{-1} \tag{9.19}
+> \end{align}
+> $$
+
+
diff --git a/docs/images/Pasted image 20240923171752.png b/docs/images/Pasted image 20240923171752.png
new file mode 100644
index 00000000..e890ea66
Binary files /dev/null and b/docs/images/Pasted image 20240923171752.png differ
diff --git a/docs/images/Pasted image 20240923172823.png b/docs/images/Pasted image 20240923172823.png
new file mode 100644
index 00000000..6c3e7b57
Binary files /dev/null and b/docs/images/Pasted image 20240923172823.png differ
diff --git a/docs/images/Pasted image 20240923173826.png b/docs/images/Pasted image 20240923173826.png
new file mode 100644
index 00000000..cafdce6c
Binary files /dev/null and b/docs/images/Pasted image 20240923173826.png differ
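To make the linear semi-gradient TD(0) update of Sections 9.3–9.4 concrete (Eq. 9.7 with the bootstrapping target treated as a constant, and the linear form of Eq. 9.8), here is a minimal NumPy sketch. The environment interface (`env.reset()` / `env.step(a)` returning `(next_state, reward, done)`), the feature map `x(s)`, and the `policy` callable are illustrative assumptions for this sketch, not part of the book's pseudocode.

```python
import numpy as np


def semi_gradient_td0(env, x, policy, num_episodes=100, alpha=0.05, gamma=0.99):
    """Semi-gradient TD(0) for estimating v_pi with a linear approximator.

    v_hat(s, w) = w^T x(s) (Eq. 9.8). The update follows Eq. 9.7 with the
    bootstrapping target U_t = R_{t+1} + gamma * v_hat(S_{t+1}, w), whose own
    dependence on w is ignored -- that is what makes the method "semi-gradient".
    """
    w = np.zeros(len(x(env.reset())))  # one weight per feature

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_s = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)  # terminal states have value 0
            target = r + gamma * v_next              # TD(0) target, treated as a constant
            # For a linear v_hat, grad_w v_hat(s, w) = x(s), so Eq. 9.7 becomes:
            w += alpha * (target - v_s) * x(s)
            s = s_next
    return w
```

Because the gradient of a linear $\hat{v}$ is just the feature vector, the whole update reduces to "TD error times features", which is why the linear case is both fast and the one with the strongest convergence guarantees.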
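The n-step quantities (Eqs. 9.15, 9.16) and the step-size rule of thumb (Eq. 9.19) can be sketched the same way. The snippet below assumes a stored episode where `states[k]` is $S_k$ and `rewards[k]` is $R_{k+1}$, plus the same hypothetical feature map `x(s)`; none of these names come from the book.

```python
import numpy as np


def n_step_return(rewards, states, t, n, w, x, gamma, T):
    """G_{t:t+n} from Eq. 9.16: n discounted rewards plus a bootstrapped tail.

    states[k] is S_k and rewards[k] is R_{k+1}; T is the episode length. If the
    episode ends before t + n, this falls back to the full Monte Carlo return.
    """
    h = min(t + n, T)  # truncate at the end of the episode
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:      # bootstrap only from a non-terminal S_{t+n}
        G += gamma ** n * (w @ x(states[t + n]))
    return G


def n_step_update(w, rewards, states, t, n, x, gamma, alpha, T):
    """One application of Eq. 9.15 for linear v_hat(s, w) = w^T x(s),
    whose gradient with respect to w is simply x(s)."""
    G = n_step_return(rewards, states, t, n, w, x, gamma, T)
    return w + alpha * (G - w @ x(states[t])) * x(states[t])


def rule_of_thumb_alpha(feature_vectors, tau):
    """Eq. 9.19: alpha = (tau * E[x^T x])^(-1), where tau is the number of
    experiences (with substantially the same feature vector) to learn within."""
    X = np.asarray(feature_vectors, dtype=float)
    return 1.0 / (tau * np.mean(np.sum(X * X, axis=1)))
```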