[PUBLISHER] Merge #38
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 4.md

* PUSH ATTACHMENT : Pasted image 20240917104341.png

* PUSH ATTACHMENT : Pasted image 20240917104526.png

* PUSH ATTACHMENT : Pasted image 20240917104618.png

* PUSH NOTE : Reinforcement Learning - An Introduction.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md

* PUSH ATTACHMENT : Pasted image 20240917101226.png

* PUSH ATTACHMENT : Pasted image 20240917101536.png
dgcnz authored Sep 23, 2024
1 parent b3c7602 commit c7ebaf6
Showing 9 changed files with 224 additions and 0 deletions.

Reinforcement Learning - An Introduction - Chapter 4.md
@@ -0,0 +1,66 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 4.1 Policy evaluation

> [!NOTE] Equations 4.3 and 4.4
>
> $$
> \begin{align}
> v_{\pi}(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] && (\text{from (3.9)})\\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] && (4.3)\\
> &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right] && (4.4),
> \end{align}
> $$

> [!NOTE] Equation 4.5
>
> $$
> \begin{align}
> v_{k+1}(s) &\doteq \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s ] \\
> &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right] && (4.5),
> \end{align}
> $$

![[Pasted image 20240917104341.png|600]]
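
Below is a minimal sketch of the expected-update sweep in (4.5), assuming a tabular model `P[s][a]` given as a list of `(prob, next_state, reward)` tuples and a stochastic policy `pi[s][a]`; these names are illustrative, not from the book.

```python
import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: sweep form of Eq. (4.5).

    P[s][a]  : list of (prob, next_state, reward) transition tuples.
    pi[s][a] : probability of taking action a in state s.
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s_next])
                               for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place sweep, as in the boxed algorithm above
        if delta < theta:
            return V
```
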
## 4.2 Policy Improvement

> [!NOTE] Equation 4.6
>
> $$
> \begin{align}
> q_\pi(s, a) &\doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] && (4.6)\\
> &= \sum_{s', r}p(s', r \mid s, a)[r + \gamma v_\pi(s')] \\
> \end{align}
> $$
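
As a hedged illustration of policy improvement, the sketch below computes $q_\pi(s, a)$ from $v_\pi$ via (4.6) and then acts greedily; it assumes the same `P[s][a]` transition structure as the evaluation sketch above.

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.9):
    """Action values q_pi(s, a) recovered from v_pi, as in Eq. (4.6)."""
    return np.array([
        sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
        for a in range(len(P[s]))
    ])

def greedy_improvement(P, V, gamma=0.9):
    """Deterministic greedy policy: pi'(s) = argmax_a q_pi(s, a)."""
    return np.array([int(np.argmax(q_from_v(P, V, s, gamma))) for s in range(len(P))])
```
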
## 4.3 Policy Iteration

![[Pasted image 20240917104526.png|600]]
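
The loop below is a sketch of full policy iteration, reusing the `policy_evaluation` and `greedy_improvement` sketches from the previous sections; the one-hot encoding of a deterministic policy and the `n_actions` argument are assumptions for illustration.

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(len(P), dtype=int)              # arbitrary initial deterministic policy
    while True:
        pi = np.eye(n_actions)[policy]                # one-hot rows: pi[s][a] = 1 iff a == policy[s]
        V = policy_evaluation(P, pi, gamma)           # evaluation sketch from 4.1
        new_policy = greedy_improvement(P, V, gamma)  # improvement sketch from 4.2
        if np.array_equal(new_policy, policy):        # stable greedy policy: stop
            return V, policy
        policy = new_policy
```
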
## 4.4 Value Iteration

> "This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps."

> [!NOTE] Equation 4.10
>
> $$
> \begin{align}
> v_{k+1}(s) &\doteq \max_{a} \mathbb{E} [R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\
> &= \max_{a} \sum_{s', r}p(s', r \mid s, a)[r + \gamma v_k(s')] && (4.10) \\
> \end{align}
> $$

![[Pasted image 20240917104618.png|600]]
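
A minimal value-iteration sketch following (4.10), under the same assumed `P[s][a]` transition structure; the final policy extraction mirrors the boxed algorithm's last step.

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration (Eq. 4.10): each sweep combines improvement and truncated evaluation."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # extract a deterministic greedy policy from the converged V
    policy = [
        int(np.argmax([sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                       for a in range(len(P[s]))]))
        for s in range(len(P))
    ]
    return V, policy
```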

## 4.5 Asynchronous Dynamic Programming

> "These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. [...] To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to update."

Reinforcement Learning - An Introduction - Chapter 6.md
@@ -0,0 +1,68 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 6.4 Sarsa: On-policy TD Control

> [!NOTE] Equation 6.7
>
> $$
> \begin{align}
> Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
> \end{align}
> $$

![[Pasted image 20240917101226.png|900]]
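
As a sketch, the update (6.7) reduces to one line inside the episode loop; here `Q` is assumed to be a 2-D array indexed by `(state, action)`, and both `a` and `a_next` are chosen by the ε-greedy behaviour policy.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One Sarsa update (Eq. 6.7): bootstrap off the action actually taken next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```
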
## 6.5 Q-learning: Off-policy TD Control

> [!NOTE] Equation 6.8
>
> $$
> \begin{align}
> Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]
> \end{align}
> $$

![[Pasted image 20240917101536.png|700]]
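
The corresponding Q-learning step (6.8) differs only in the target, which bootstraps off $\max_a Q(S_{t+1}, a)$ rather than the action actually taken; `Q` is again assumed to be a 2-D `(state, action)` array.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning update (Eq. 6.8): off-policy target via the max over next actions."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```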

## 6.6 Expected SARSA

> [!NOTE] Equation 6.9
>
> $$
> \begin{align}
> Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \mathbb{E}_\pi [Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \right] \\
> &= Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right] && (6.9)
> \end{align}
> $$

Expected Sarsa is more computationally demanding than Sarsa, but it eliminates the variance due to the random selection of $A_{t+1}$; it is more stable and generally performs better than both Q-learning and Sarsa.
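
A sketch of the Expected Sarsa update (6.9), assuming `pi[s]` gives the vector of action probabilities under the target policy and `Q` is a 2-D `(state, action)` array.

```python
import numpy as np

def expected_sarsa_update(Q, pi, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Expected Sarsa update (Eq. 6.9): bootstrap off the expectation over next actions."""
    expected_q = np.dot(pi[s_next], Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```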

## 6.7 Maximization Bias and Double Learning


> "All the control algorithms that we have discussed so far involve maximization in the construction of their target policies"

This causes *maximization bias*:
- Think of two actions: one whose true value is $0$, and one whose rewards are drawn from $\mathcal{N}(-0.1, 1)$.
- At some point the sample estimate of the noisy action might be $0.1$, while the other is correctly estimated at $0$.
- The optimal choice is the action worth $0$, but because we take the max over estimates, we positively bias ourselves toward the noisy one.

The general way to solve this is to learn two independent value estimates: one ($Q_1$) to select the best action and the other ($Q_2$) to evaluate it.

$$
\begin{align}
A^* &= \text{argmax}_a Q_1(a) \\
Q_2(A^*) &= Q_2(\text{argmax}_a Q_1(a))
\end{align}
$$

This effectively debiases the estimate: $\mathbb{E}[Q_2(A^*)] = q(A^*)$.
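
A hedged sketch of the double Q-learning step built on this idea: on each update a coin flip decides which estimate selects the greedy next action and which evaluates it; the array shapes and argument names are assumptions for illustration.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    """One double Q-learning update: the selector and the evaluator are kept separate."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))   # Q1 selects the action ...
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])  # ... Q2 evaluates it
    else:
        a_star = int(np.argmax(Q2[s_next]))   # and symmetrically the other way around
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
```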



Reinforcement Learning - An Introduction - Chapter 9.md
@@ -0,0 +1,76 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
# 9. On-policy prediction with approximation

**Problem setting**: In most real scenarios, the number of states is too large for tabular learning algorithms, so we will approximate the value function by a learned, parametrized function:
$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$
- Possible modelling choices for this function include linear functions, nonlinear functions, neural networks, etc. (a minimal linear sketch follows this list)
- $\mathbf{w} \in \mathbb{R}^d$, with $d \ll |\mathcal{S}|$, which means that updating one state affects the values of many others: *generalization*
- This formulation allows for partially observable states.
- Side note: not all convergence proofs apply to all function classes (for more info see [UCL x DeepMind 7/13](https://youtu.be/ook46h2Jfb4?si=uEHHuZOyzDgkmLqr))
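
A minimal sketch of the simplest such parametrization, a linear value function over a hand-crafted feature vector `x_s` (a NumPy array); the feature encoding and the semi-gradient TD(0) step that Section 9.3 develops are illustrative assumptions here.

```python
import numpy as np

def v_hat(w, x_s):
    """Linear approximation: v_hat(s, w) = w . x(s), with w of length d << |S|."""
    return np.dot(w, x_s)

def semi_gradient_td0_step(w, x_s, r, x_s_next, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step; for a linear v_hat the gradient w.r.t. w is x(s)."""
    td_error = r + gamma * v_hat(w, x_s_next) - v_hat(w, x_s)
    return w + alpha * td_error * x_s
```

Because many states share features, a single step on `w` moves the estimates of all of them at once, which is exactly the generalization noted above.
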
## 9.1 Value-function approximation

New notation: $s \mapsto u$ denotes an individual update, where $s$ is the state updated and $u$ is the update target its estimated value is shifted toward.

> [!WARNING]- How does the learning setting differ between neural networks (supervised) and reinforcement learning?
> RL requires modeling to allow:
> - online learning: data is acquired incrementally while interacting with the environment
> - Remember that supervised learning suffers from catastrophic forgetting
> - Non-stationary target functions
>
> Supervised Learning assumes iid sampling from a fixed but unknown data distribution
## 9.2 The Prediction Objective ($\overline{VE}$)

> [!FAQ]- Why do we need a prediction objective now? What has changed?
> In the tabular setting we had two nice properties:
> - the learned value function could actually converge exactly to the true value function
> - the value of a state was decoupled from other states
>
> Without these two, we must say which states are most important to us.

> [!NOTE] Equation 9.1
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && (9.1)
> \end{align}
> $$

Where:
- $\mu(s)$ is the *state distribution* (reminder: non-negative, sums to one)

For on-policy episodic tasks, $\mu(s)$ is called the *on-policy distribution*, which can be defined as follows:

> [!NOTE] Equations 9.2 and 9.3
>
> $$
> \begin{align}
> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})p(s \mid \bar{s}, a), && \text{for all } s \in \mathcal{S} && (9.2)
> \end{align}
> $$
>
> $$
> \begin{align}
> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && (9.3)
> \end{align}
> $$

Where:
- $h(s)$ is the probability that an episode begins in state $s$.
- $\eta(s)$ is the average number of time steps spent in state $s$ during a single episode.
- Interpretation of the two terms: time is spent in $s$ either because an episode starts in $s$ or because another state transitions into $s$ (a small sketch solving (9.2) and (9.3) follows this list).
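
The sketch below solves (9.2) as a linear system and normalizes via (9.3), then plugs the result into the objective (9.1); the matrix encoding `P_pi[s_bar, s]` (restricted to non-terminal states so the system is solvable) is an assumption for illustration.

```python
import numpy as np

def on_policy_distribution(h, P_pi):
    """Solve Eq. (9.2) for eta and normalize to mu (Eq. 9.3).

    h    : h[s], probability that an episode starts in s.
    P_pi : P_pi[s_bar, s] = sum_a pi(a | s_bar) * p(s | s_bar, a),
           over non-terminal states only.
    """
    eta = np.linalg.solve(np.eye(len(h)) - P_pi.T, h)   # eta = h + P_pi^T eta
    mu = eta / eta.sum()
    return eta, mu

def ve_objective(mu, v_pi, v_approx):
    """Mean-squared value error (Eq. 9.1) under the state distribution mu."""
    return float(np.sum(mu * (v_pi - v_approx) ** 2))
```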


- Minimizing $\overline{VE}$ only guarantees local optimality.


## 9.3 Stochastic-gradient and Semi-gradient Methods

Reinforcement Learning - An Introduction.md
@@ -0,0 +1,14 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---

- [[Reinforcement Learning - An Introduction - Chapter 4|Reinforcement Learning - An Introduction - Chapter 4]]
- [[Reinforcement Learning - An Introduction - Chapter 6|Reinforcement Learning - An Introduction - Chapter 6]]

Binary file added docs/images/Pasted image 20240917101226.png
Binary file added docs/images/Pasted image 20240917101536.png
Binary file added docs/images/Pasted image 20240917104341.png
Binary file added docs/images/Pasted image 20240917104526.png
Binary file added docs/images/Pasted image 20240917104618.png
