Commit

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 4.md
* PUSH ATTACHMENT : Pasted image 20240917104341.png
* PUSH ATTACHMENT : Pasted image 20240917104526.png
* PUSH ATTACHMENT : Pasted image 20240917104618.png
* PUSH NOTE : Reinforcement Learning - An Introduction.md
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md
* PUSH ATTACHMENT : Pasted image 20240917101226.png
* PUSH ATTACHMENT : Pasted image 20240917101536.png
Showing 9 changed files with 224 additions and 0 deletions.
66 changes: 66 additions & 0 deletions
...ference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 4.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 4.1 Policy evaluation

> [!NOTE] Equations 4.3 and 4.4
>
> $$
> \begin{align}
> v_{\pi}(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] && (\text{from (3.9)})\\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] && (4.3)\\
> &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right] && (4.4),
> \end{align}
> $$

> [!NOTE] Equation 4.5
>
> $$
> \begin{align}
> v_{k+1}(s) &\doteq \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s ] \\
> &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right] && (4.5),
> \end{align}
> $$

![[Pasted image 20240917104341.png|600]]
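
A minimal sketch of this backup (equation 4.5), assuming a small finite MDP given as explicit NumPy arrays; the array layout, the `theta` threshold, and the synchronous (two-array) sweep are illustrative choices, not the in-place pseudocode shown above.

```python
import numpy as np

def iterative_policy_evaluation(p, r, pi, gamma=0.9, theta=1e-8):
    """Sweep v_{k+1}(s) = sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')].

    p:  (S, A, S) transition probabilities p(s' | s, a)
    r:  (S, A, S) expected rewards r(s, a, s')
    pi: (S, A)    policy probabilities pi(a | s)
    """
    S = p.shape[0]
    v = np.zeros(S)
    while True:
        # Back up every state using the current estimate v_k.
        q = (p * (r + gamma * v)).sum(axis=2)      # q[s, a]
        v_new = (pi * q).sum(axis=1)               # expectation over actions under pi
        if np.max(np.abs(v_new - v)) < theta:      # stop when a full sweep barely changes v
            return v_new
        v = v_new
```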
## 4.2 Policy Improvement

> [!NOTE] Equation 4.6
>
> $$
> \begin{align}
> q_\pi(s, a) &\doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] && (4.6)\\
> &= \sum_{s', r}p(s', r \mid s, a)[r + \gamma v_\pi(s')] \\
> \end{align}
> $$
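
Under the same assumed array layout as the evaluation sketch, policy improvement just computes $q_\pi$ from $v_\pi$ via equation 4.6 and acts greedily:

```python
import numpy as np

def greedy_policy_improvement(p, r, v, gamma=0.9):
    """Return a deterministic greedy policy pi'(s) = argmax_a q_pi(s, a), where
    q_pi(s, a) = sum_s' p(s'|s,a) [r(s,a,s') + gamma v(s')]   (equation 4.6)."""
    q = (p * (r + gamma * v)).sum(axis=2)   # q[s, a]
    return q.argmax(axis=1)                 # greedy action per state
```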
## 4.3 Policy Iteration

![[Pasted image 20240917104526.png|600]]
## 4.4 Value Iteration

> "This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps."

> [!NOTE] Equation 4.10
>
> $$
> \begin{align}
> v_{k+1}(s) &\doteq \max_{a} \mathbb{E} [R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\
> &= \max_{a} \sum_{s', r} p(s', r \mid s, a)[r + \gamma v_k(s')] && (4.10)
> \end{align}
> $$

![[Pasted image 20240917104618.png|600]]
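
A matching value-iteration sketch under the same assumed array layout (`p`, `r`, `gamma`, `theta` are placeholders); the only change from policy evaluation is that the expectation over actions becomes a max.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, theta=1e-8):
    """Equation 4.10 in expected-reward form:
    v_{k+1}(s) = max_a sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')]."""
    S = p.shape[0]
    v = np.zeros(S)
    while True:
        q = (p * (r + gamma * v)).sum(axis=2)   # q[s, a] under the current estimate
        v_new = q.max(axis=1)                   # greedy backup instead of an expectation under pi
        if np.max(np.abs(v_new - v)) < theta:
            v = v_new
            break
        v = v_new
    greedy = (p * (r + gamma * v)).sum(axis=2).argmax(axis=1)  # policy implied by the converged values
    return v, greedy
```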

## 4.5 Asynchronous Dynamic Programming

> "These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. [...] To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to update."
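
A rough illustration of that flexibility, again assuming the array-style MDP of the earlier sketches: backups are applied in place to one randomly chosen state at a time instead of in full synchronous sweeps. The state-selection rule and update budget are arbitrary placeholders.

```python
import numpy as np

def asynchronous_value_iteration(p, r, gamma=0.9, num_updates=100_000, seed=0):
    """In-place value-iteration backups applied to one randomly chosen state at a time.
    Any selection rule works, as long as no state stops being updated."""
    rng = np.random.default_rng(seed)
    S = p.shape[0]
    v = np.zeros(S)
    for _ in range(num_updates):
        s = rng.integers(S)                                    # pick any state
        v[s] = (p[s] * (r[s] + gamma * v)).sum(axis=1).max()   # in-place greedy backup for that state
    return v
```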
68 changes: 68 additions & 0 deletions
...ference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 6.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 6.4 Sarsa: On-policy TD Control

> [!NOTE] Equation 6.7
>
> $$
> \begin{align}
> Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
> \end{align}
> $$

![[Pasted image 20240917101226.png|900]]
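
A possible tabular Sarsa episode loop corresponding to equation 6.7; the `env` object with `reset()` and `step(a) -> (s, r, done)`, and the hyperparameter values, are assumptions for illustration rather than an API from the book.

```python
import numpy as np

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """One episode of tabular Sarsa: update toward R + gamma * Q(S', A') (equation 6.7)."""
    rng = np.random.default_rng(seed)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(Q[s].argmax())

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)
        target = r if done else r + gamma * Q[s_next, a_next]   # no bootstrapping past termination
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
    return Q
```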
## 6.5 Q-learning: Off-policy TD Control

> [!NOTE] Equation 6.8
>
> $$
> \begin{align}
> Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]
> \end{align}
> $$

![[Pasted image 20240917101536.png|700]]
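
The same skeleton with the Q-learning target of equation 6.8: the update bootstraps from the greedy action in the next state regardless of what the behaviour policy actually does (same assumed `env` interface as above).

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """One episode of tabular Q-learning: update toward R + gamma * max_a Q(S', a) (equation 6.8)."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behaviour policy.
        a = int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```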

## 6.6 Expected SARSA

> [!NOTE] Equation 6.9
>
> $$
> \begin{align}
> Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \mathbb{E}_\pi [Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \right] \\
> &= Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right] && (6.9)
> \end{align}
> $$
It is more computationally demanding, but it eliminates the variance due to the random selection of $A_{t+1}$, so it is more stable and generally fares better than Q-learning and Sarsa.
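
A single Expected Sarsa backup (equation 6.9) as a sketch; the tabular `Q` array and the explicit target-policy vector `pi_next` are assumed inputs, not anything prescribed by the text.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.99):
    """Tabular Expected Sarsa backup: pi_next[a'] is the target policy's probability
    of action a' in s_next, so the bootstrap term is E_pi[Q(S_{t+1}, .)] instead of
    a single sampled Q(S_{t+1}, A_{t+1})."""
    expected_q = float((pi_next * Q[s_next]).sum())
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```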

## 6.7 Maximization Bias and Double Learning

> "All the control algorithms that we have discussed so far involve maximization in the construction of their target policies"

This causes *maximization bias*:
- think of estimating the mean of a $\mathcal{N}(-0.1, 1)$ reward from samples
- at some point this estimate might be $0.1$, while another action's value is correctly estimated as $0$
- the optimal choice is the action worth $0$, but because we take the max over noisy estimates, we bias ourselves towards the positive error

The general way to solve this is to estimate two independent value functions: one for obtaining the best action ($Q_1$) and the other for evaluating it ($Q_2$).

$$
\begin{align}
A^* &= \text{argmax}_a Q_1(a) \\
Q_2(A^*) &= Q_2(\text{argmax}_a Q_1(a))
\end{align}
$$

This yields an unbiased estimate: $\mathbb{E}[Q_2(A^*)] = q(A^*)$.
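
A minimal sketch of double Q-learning, the control algorithm built on this idea in section 6.7: each step updates one of the two tables, using the other table to evaluate the action the first one selects. The `env` interface and hyperparameters are again illustrative assumptions.

```python
import numpy as np

def double_q_learning_episode(env, Q1, Q2, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """One episode of tabular Double Q-learning: select with one table, evaluate with the other."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    done = False
    while not done:
        # Behave epsilon-greedily with respect to the combined estimate.
        q_sum = Q1[s] + Q2[s]
        a = int(rng.integers(len(q_sum))) if rng.random() < epsilon else int(q_sum.argmax())
        s_next, r, done = env.step(a)
        if rng.random() < 0.5:                      # update one table at random
            a_star = int(Q1[s_next].argmax())       # Q1 picks the action ...
            target = r if done else r + gamma * Q2[s_next, a_star]   # ... Q2 evaluates it
            Q1[s, a] += alpha * (target - Q1[s, a])
        else:
            a_star = int(Q2[s_next].argmax())
            target = r if done else r + gamma * Q1[s_next, a_star]
            Q2[s, a] += alpha * (target - Q2[s, a])
        s = s_next
    return Q1, Q2
```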
76 changes: 76 additions & 0 deletions
...ference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 9.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
# 9. On-policy prediction with approximation

**Problem setting**: In most real scenarios, the number of states is too large for tabular learning algorithms, so we approximate the value function by a learned, parametrized function:
$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$
- Examples of possible modelling choices for this function are linear functions of features, non-linear functions, neural networks, etc. (a minimal linear example is sketched below)
- $\mathbf{w} \in \mathbb{R}^d$ with $d \ll |\mathcal{S}|$, which means that updating one state affects many others: *generalization*
- This formulation allows for partially observable states.
- Side note: not all convergence proofs apply to all function classes (for more info see [UCL x DeepMind 7/13](https://youtu.be/ook46h2Jfb4?si=uEHHuZOyzDgkmLqr))
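
As a concrete (assumed) instance of such a parametrization: a linear value estimate over a hand-rolled feature map. Both the feature map and the dimensionality are placeholders chosen for illustration.

```python
import numpy as np

def features(s, d=8):
    """Hypothetical feature map x(s) in R^d; in practice this encodes domain knowledge."""
    x = np.zeros(d)
    x[s % d] = 1.0          # toy one-hot-style encoding of the state index
    return x

def v_hat(s, w):
    """Linear value approximation: v_hat(s, w) = w^T x(s)."""
    return float(w @ features(s, d=len(w)))
```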
## 9.1 Value-function approximation

New notation: $s \to u$ denotes an individual update, where $s$ is the state whose value is updated and $u$ is the update target that $\hat{v}(s, \mathbf{w})$ is shifted towards.

> [!WARNING]- How does the learning setting differ between neural networks (supervised) and reinforcement learning?
> RL requires modeling that allows:
> - online learning (while interacting with the environment), incrementally acquiring data
>     - Remember that supervised learning suffers from catastrophic forgetting
> - non-stationary target functions
>
> Supervised learning assumes i.i.d. sampling from a fixed but unknown data distribution.
## 9.2 The Prediction Objective ($\overline{VE}$)

> [!FAQ]- Why do we need a prediction objective now? What has changed?
> In the tabular setting we had two nice properties:
> - the learned value function could actually converge exactly to the true value function
> - the value of a state was decoupled from other states
>
> Without these two, we must say which states are most important to us.

> [!NOTE] Equation 9.1
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && (9.1)
> \end{align}
> $$
Where:
- $\mu(s)$ is the *state distribution* (reminder: non-negative, sums to one); a small numeric sketch follows

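On a finite state space the objective is just a weighted squared error. A tiny sketch, assuming the arrays `mu`, `v_pi`, and the approximate values are all available (in practice $v_\pi$ is unknown, which is the whole point of learning):

```python
import numpy as np

def ve_objective(mu, v_pi, v_hat_all):
    """Mean squared value error VE(w) = sum_s mu(s) [v_pi(s) - v_hat(s, w)]^2  (equation 9.1).
    All three arguments are arrays indexed by state; mu is non-negative and sums to one."""
    return float(np.sum(mu * (v_pi - v_hat_all) ** 2))
```
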
For on-policy episodic tasks, $\mu(s)$ is called the *on-policy distribution*, which can be defined as follows:

> [!NOTE] Equations 9.2 and 9.3
>
> $$
> \begin{align}
> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) p(s \mid \bar{s}, a), && \text{for all } s \in \mathcal{S} && (9.2)
> \end{align}
> $$
>
> $$
> \begin{align}
> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && (9.3)
> \end{align}
> $$
Where:
- $h(s)$ is the probability that an episode begins in state $s$.
- $\eta(s)$ is the average number of time steps spent in state $s$ during a single episode.
- Interpretation of the two terms: time is spent in $s$ either because an episode starts in $s$ or because some preceding state $\bar{s}$ transitions into $s$ (see the sketch below).
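
On a small episodic MDP where the policy-induced transition matrix is known explicitly, $\eta$ and $\mu$ could be computed directly; a sketch with assumed inputs (`h`, `p_pi`), relying on episodes terminating so that the linear system is solvable:

```python
import numpy as np

def on_policy_distribution(h, p_pi):
    """Solve eta = h + P_pi^T eta for the expected visit counts, then normalize (equations 9.2, 9.3).

    h:    (S,)   start-state probabilities h(s)
    p_pi: (S, S) transitions under pi, p_pi[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar, a)
    """
    S = len(h)
    eta = np.linalg.solve(np.eye(S) - p_pi.T, h)   # (I - P_pi^T) eta = h
    return eta / eta.sum()                         # mu(s) = eta(s) / sum_s' eta(s')
```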

- $\overline{VE}$ only guarantees local optimality.

## 9.3 Stochastic-gradient and Semi-gradient Methods
14 changes: 14 additions & 0 deletions
docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---

- [[Reinforcement Learning - An Introduction - Chapter 4|Reinforcement Learning - An Introduction - Chapter 4]]
- [[Reinforcement Learning - An Introduction - Chapter 6|Reinforcement Learning - An Introduction - Chapter 6]]