[PUBLISHER] Merge #50
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 2.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 16.md

* PUSH ATTACHMENT : Pasted image 20241020203656.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 5.md

* PUSH ATTACHMENT : Pasted image 20241021121638.png

* PUSH ATTACHMENT : Pasted image 20241021121518.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 4.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 3.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 13.md

* PUSH ATTACHMENT : Pasted image 20241020212513.png

* PUSH ATTACHMENT : Pasted image 20241020213317.png

* PUSH ATTACHMENT : Pasted image 20241020213339.png

* PUSH ATTACHMENT : Pasted image 20241020213450.png

* PUSH ATTACHMENT : Pasted image 20241020213623.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md
dgcnz authored Oct 21, 2024
1 parent 77d3236 commit c869a2f
Showing 16 changed files with 195 additions and 6 deletions.
@@ -97,6 +97,29 @@ TLDR: Using the geometry of the value function, we find that $\overline{BE}$ mea
> \overline{BE}({\mathbf{w}}) \doteq || \bar{\delta} ||^2_\mu
> $$
> With linear function approximation there always exists an approximate value function (within the subspace) with zero $\overline{PBE}$; this is the TD fixed point, $\mathbf{w}_{TD}$

> [!NOTE] Equation 11.13: Projection matrix for linear function approximation
>
> $$
> \Pi \doteq \mathbf{X} (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D}
> $$
>
> Where:
> - $\mathbf{X} \in \mathbb{R}^{|\mathcal{S}| \times d}$ is the matrix of feature vectors
> - $\mathbf{D} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ is a diagonal matrix with $\mu(s)$ on the diagonal

> [!NOTE] Equation 11.22: Mean square Projected Bellman error
>
> $$
> \overline{PBE}(\mathbf{w}) \doteq || \Pi \bar{\delta}_{\mathbf{w}} ||^2_\mu
> $$
>
> Where:
> - $\Pi$ is the projection matrix
> - $\bar{\delta}_{\mathbf{w}}$ is the Bellman error vector (one entry per state)
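
A minimal NumPy sketch of Equations 11.13 and 11.22 together (the feature matrix `X`, the distribution `mu`, and the Bellman error vector `delta_bar` are made-up toy values):

```python
import numpy as np

n_states, d = 5, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(n_states, d))        # feature matrix, one row x(s)^T per state
mu = np.full(n_states, 1.0 / n_states)    # state distribution mu(s)
D = np.diag(mu)

# Projection onto the representable subspace, weighted by mu (Eq. 11.13)
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D
assert np.allclose(Pi @ Pi, Pi)           # a projection matrix is idempotent

# Mean square Projected Bellman error (Eq. 11.22); delta_bar stands in for the
# Bellman error vector, which would normally come from the MDP model and v_w.
delta_bar = rng.normal(size=n_states)
proj = Pi @ delta_bar
pbe = proj @ D @ proj                     # ||Pi delta_bar||^2_mu
```
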
## 11.5 Gradient Descent in the Bellman Error

TLDR: Semi-gradient methods might diverge, but true SGD doesn't! Sadly, both $\overline{TDE}$ and $\overline{BE}$ yield bad minima.
@@ -138,7 +161,9 @@ TLDR: $\overline{BE}$ is not learnable but $\overline{TDE}$ and $\overline{PBE}
## 11.7 Gradient-TD Methods

TLDR: To minimize $\overline{PBE}$ using SGD efficiently we use two separate estimates for dependent expectations. This yields two algorithms: GTD2 and TDC.

**DISCLAIMER: These methods only work with linear function approximation.**
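
As a rough illustration of the "two separate estimates" idea, here is a per-step TDC update sketch with linear features (all names, step sizes, and the importance ratio `rho` are placeholders; the secondary weights `v` estimate the expectation that depends on `w`):

```python
import numpy as np

def tdc_step(w, v, x, x_next, reward, rho=1.0, gamma=0.99, alpha=0.01, beta=0.1):
    """One TDC update with linear value function v_hat(s) = w @ x(s)."""
    delta = reward + gamma * (w @ x_next) - (w @ x)                 # TD error
    w = w + alpha * rho * (delta * x - gamma * x_next * (x @ v))    # primary weights
    v = v + beta * rho * (delta - (x @ v)) * x                      # secondary estimator
    return w, v
```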

> [!NOTE] Equation 11.27: Gradient of $\overline{PBE}$
>
@@ -153,6 +153,7 @@ Note:
> [!FAQ]- Why does REINFORCE yield slow learning?
> Because as a Monte Carlo method, it has high variance.

**DISCLAIMER: For uva-rl1, this method is called REINFORCE v2. REINFORCE v1 uses the full return at each step.**
## 13.4 REINFORCE with Baseline

The policy gradient can be generalized to include any baseline function $b(s)$, as long as it is independent of the action.
@@ -186,9 +187,23 @@

## 13.5 Actor-Critic Methods

TLDR: Expand the use of the baseline/value function to bootstrapped targets (one-step returns, multi-step returns, TD($\lambda$), etc.). This helps with variance; it adds bias, but the bias can be controlled via the degree of bootstrapping (e.g., $\lambda$).

Example:
- Use value function as baseline
- Use one-step returns with value bootstrap as target

> [!NOTE] Equation 13.12, 13.13 and 13.14: One-step Actor-critic update rule
>
> $$
> \begin{align}
> \theta_{t+1} &\doteq \theta_t + \alpha \left( G_{t:t+1} - \hat{v}(S_t, \mathbf{w}) \right) \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \quad \tag{13.12} \\
> &= \theta_t + \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right) \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \quad \tag{13.13} \\
> &= \theta_t + \alpha \delta_t \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \quad \tag{13.14}
> \end{align}
> $$
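
A minimal sketch of one step of Equations 13.12-13.14 (the callables `v_hat`, `grad_v_hat`, and `grad_log_pi` are assumed to come from whatever function approximator and policy parameterization are in use; the book's extra $\gamma^t$ factor is omitted here):

```python
def actor_critic_step(theta, w, s, a, r, s_next, done,
                      v_hat, grad_v_hat, grad_log_pi,
                      alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    """One-step actor-critic update: critic on the TD error, actor on delta * grad log pi."""
    target = r + (0.0 if done else gamma * v_hat(s_next, w))   # G_{t:t+1}
    delta = target - v_hat(s, w)                               # TD error (Eq. 13.13 / 13.14)
    w = w + alpha_w * delta * grad_v_hat(s, w)                 # critic update
    theta = theta + alpha_theta * delta * grad_log_pi(s, a, theta)  # actor update
    return theta, w
```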

![[Pasted image 20241020212513.png|800]]
## 13.6 Policy Gradient for Continuing Problems

> [!NOTE] Equation 13.15: Average rate of reward per time step
@@ -215,3 +230,28 @@ Note: not part of the course readings, missing remaining notes for this subsecti
*todo: add notes*

$d \tau$

## Extra: Deterministic Policy Gradients

- Use deterministic policy as target policy
- Use stochastic policy as behavior policy (example: target + noise)


![[Pasted image 20241020213317.png|500]]

![[Pasted image 20241020213339.png|500]]

DPG with Q-learning update
![[Pasted image 20241020213450.png|500]]

Only works with continuous actions.
Discrete actions break the gradient flow through the critic, since the update differentiates $\hat{q}(s, a)$ with respect to the action.

![[Pasted image 20241020213623.png|500]]


Deep DPG (DDPG) = DPG + modifications to make it work with neural networks as function approximators:
- Use experience replay
- Use a target network ("double Q-learning", as in DQN)


@@ -0,0 +1,29 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
- rl1
url:
share: true
---
# 16 Applications and Case Studies
## 16.5 Human-level Video Game Play

DQN's network architecture: Conv2d + ReLU blocks for feature extraction and linear layers for the output.
![[Pasted image 20241020203656.png|Pasted image 20241020203656.png]]


> [!NOTE] Equation 16.3: DQN Semi-Gradient update rule
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha \left[ R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a; \mathbf{w}_{t}) - \hat{q}(S_t, A_t; \mathbf{w}_{t}) \right] \nabla \hat{q}(S_t, A_t; \mathbf{w}_{t})
> $$

> [!FAQ]- What are the three modifications to Q-learning that make DQN?
> 1. Experience Replay: Useful to use data better and remove the dependence of successive experiences on the current weights.
> 2. "Double Q-learning": Keep a copy of the network at the previous step to provide targets to avoid divergence and oscillations.
> 3. Clip the error term $R_{t+1} + \gamma \max_{a} q(S_{t+1}, a; \mathbf{w}_{t}) - q(S_t, A_t; \mathbf{w}_{t})$ to $[-1, 1]$ to improve stability.
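
A rough PyTorch sketch of how the target-network copy and the error clipping enter a single update (the networks, optimizer, and replay-buffer batch are placeholders):

```python
import torch

def dqn_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One semi-gradient step in the spirit of Eq. 16.3, with a frozen target network and clipped error."""
    s, a, r, s_next, done = batch                                  # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # q(S_t, A_t; w)
    with torch.no_grad():                                          # targets come from the frozen copy
        target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
        delta = (target - q_sa).clamp(-1.0, 1.0)                   # clip the error term to [-1, 1]
    loss = (-delta * q_sa).mean()                                  # its gradient is -delta * grad q(S_t, A_t; w)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
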
@@ -0,0 +1,69 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
- rl1
url:
share: true
---
# 2 Multi-armed Bandits

## 2.2 Action-value Methods


> [!NOTE] Equation 2.1: Sample-average Method
>
> $$
> Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} \tag{2.1}
> $$

> [!NOTE] Equation 2.2: Greedy Action Selection
>
> $$
> A_t \doteq \underset{a}{\arg\max} Q_t(a) \tag{2.2}
> $$
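
A tiny sketch of Equations 2.1 and 2.2 using running sums and counts (the 10-armed setup is a placeholder):

```python
import numpy as np

k = 10                                   # number of arms
reward_sum = np.zeros(k)                 # sum of rewards per action
count = np.zeros(k)                      # times each action has been taken

def q_estimate():
    """Sample-average estimates (Eq. 2.1); untried actions default to 0."""
    return np.divide(reward_sum, count, out=np.zeros(k), where=count > 0)

def greedy_action():
    return int(np.argmax(q_estimate()))  # Eq. 2.2 (ties broken by lowest index)
```
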
## 2.4 Incremental Implementation


> [!NOTE] Equation 2.4: Incremental Sample-average method
>
> $$
> Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n] \tag{2.4}
> $$
>
> Where:
> - $Q_1$ is usually initialized to zero.
> - $R_n$ is the reward received after the $n$-th selection of action $a$
> - $Q_n$ denotes the estimate of the action's value after it has been selected $n-1$ times
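
A minimal sketch of the incremental form, tracking a single action's estimate:

```python
def incremental_update(q_n, r_n, n):
    """Eq. 2.4: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n), after the n-th selection of this action."""
    return q_n + (r_n - q_n) / n

q, n = 0.0, 0                  # Q_1 initialized to zero
for r in [1.0, 0.0, 2.0]:      # placeholder rewards for this action
    n += 1
    q = incremental_update(q, r, n)
```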

## 2.5 Tracking a Nonstationary Problem


> [!NOTE] Equation 2.7: Two learning rate conditions to ensure convergence
>
> $$
> \sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty \tag{2.7}
> $$
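
The sample-average step size $\alpha_n = 1/n$ satisfies both conditions; a constant $\alpha$ satisfies the first but violates the second, so the estimates never fully converge, which is exactly what we want when the target keeps moving. A minimal sketch of the constant step-size (exponential recency-weighted) update discussed in this section:

```python
def constant_alpha_update(q, r, alpha=0.1):
    """Q <- Q + alpha * (R - Q): recent rewards weigh more, older ones decay geometrically."""
    return q + alpha * (r - q)
```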

## 2.6 Optimistic Initial Values

TLDR: Initializing $Q_1(a)$ optimistically (higher than any realistic reward) encourages exploration: early rewards are "disappointing", so the learner keeps switching to other, still-optimistic actions.

## 2.7 Upper-Confidence-Bound Action Selection


> [!NOTE] Equation 2.10: UCB action selection
>
> $$
> A_t \doteq \underset{a}{\arg\max} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] \tag{2.10}
> $$
>
> Where:
> - $c > 0$ controls the degree of exploration.
> - $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$ (if $N_t(a) = 0$, $a$ is considered a maximizing action).
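
A minimal sketch of the UCB rule (following the convention that an action with $N_t(a) = 0$ is treated as maximizing):

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """q: value estimates, n: per-action selection counts, t: current time step (t >= 1)."""
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:                 # untried actions are treated as maximizing
        return int(untried[0])
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```
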
Original file line number Diff line number Diff line change
@@ -73,19 +73,28 @@ You can obtain the expected reward also.

A policy $\pi(a \mid s)$ is a probability distribution over actions given states.


> [!NOTE] Equation 3.12: State-value function
>
> $$
> v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \;\; \forall s \in \mathcal{S} \tag{3.12}
> $$

> [!NOTE] Equation 3.13: Action-value function
>
> $$
> q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \;\; \forall s \in \mathcal{S}, a \in \mathcal{A} \tag{3.13}
> $$

> [!NOTE] Writing $v_{\pi}$ in terms of $q_{\pi}$
>
> $$
> v_{\pi}(s) = \sum_{a} \pi(a \mid s) q_{\pi}(s, a)
> $$

> [!NOTE] Equation 3.14: Bellman equation for $v_{\pi}$
>
> $$
@@ -99,6 +108,12 @@ A policy $\pi(a \mid s)$ is a probability distribution over actions given states
## 3.6 Optimal Policies and Optimal Value Functions

A policy $\pi^*$ is the optimal policy if:

$$
v_{\pi^*}(s) \geq v_{\pi'}(s) \quad \forall s \in \mathcal{S}, \; \forall \pi' \in \Pi
$$

> [!NOTE] Equation 3.15: Optimal state-value function
>
> $$
Expand Down
Original file line number Diff line number Diff line change
@@ -9,6 +9,7 @@ tags:
url:
share: true
---
# 4 Dynamic Programming
## 4.1 Policy evaluation

> [!NOTE] Equations 4.3 and 4.4
Expand Down
Original file line number Diff line number Diff line change
@@ -16,14 +16,18 @@ first-visit mc
every-visit mc

- [ ] TODO: finish notes

## 5.3 Monte Carlo Control

![[Pasted image 20241021121638.png|600]]
## 5.4 Monte Carlo Control without Exploring Starts

- $\epsilon$-greedy policy
- All non-greedy actions have minimum probability of $\frac{\epsilon}{|\mathcal{A}|}$
- Greedy action has probability $(1 - \epsilon) + \frac{\epsilon}{|\mathcal{A}|}$
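
A small sketch of the resulting action distribution for one state (the estimates `q_s` are placeholders):

```python
import numpy as np

def epsilon_greedy_probs(q_s, epsilon=0.1):
    """pi(a|s) for an epsilon-greedy policy with respect to the estimates q_s."""
    n = len(q_s)
    probs = np.full(n, epsilon / n)               # every action gets at least epsilon / |A|
    probs[int(np.argmax(q_s))] += 1.0 - epsilon   # the greedy action gets the remaining mass
    return probs
```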

- [ ] TODO: finish notes

![[Pasted image 20241021121518.png|600]]
## 5.5 Off-policy Prediction via Importance Sampling

Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t, S_{t+1}, A_{t+1}, \dots, S_T$, under the policy $\pi$ is given by:
@@ -63,7 +67,7 @@ $$
> V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1}} \tag{5.6}
> $$
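
A minimal sketch of the weighted estimator for a single state, next to the ordinary one (the per-visit returns and ratios are placeholder values):

```python
import numpy as np

returns = np.array([1.0, 0.0, 2.0])      # G_t for each visit t in T(s)
rhos = np.array([0.5, 2.0, 1.5])         # importance-sampling ratios rho_{t:T-1}

v_ordinary = (rhos * returns).sum() / len(returns)   # ordinary importance sampling
v_weighted = (rhos * returns).sum() / rhos.sum()     # weighted importance sampling (Eq. 5.6)
```
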
![[Pasted image 20240929183258.png|600]]

In practice, weighted importance sampling has much lower error at the beginning.

@@ -9,6 +9,8 @@ tags:
url:
share: true
---
# 6 Temporal-Difference Learning

## 6.1 TD Prediction

> [!NOTE] Equation 6.2: TD(0) update
@@ -84,6 +86,10 @@ share: true
It's more computationally demanding, but it's more stable and fares better than Q-learning and Sarsa.

It can also be used as-is in the off-policy case.
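
A minimal sketch of the Expected Sarsa target (a tabular `Q` and an explicit policy table `pi` are placeholders):

```python
import numpy as np

def expected_sarsa_target(r, s_next, Q, pi, gamma=0.99):
    """r + gamma * sum_a pi(a|s') Q(s', a): the expectation over pi is computed, not sampled."""
    return r + gamma * float(np.dot(pi[s_next], Q[s_next]))
```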

> [!FAQ]- Why doesn't Expected SARSA off-policy need importance sampling?
> I wasn't convinced by the slides' explanation, so I'll have to check a proper one later. A standard argument: the target computes the expectation over $\pi$ explicitly rather than sampling the next action, so there is no next-action sample from the behavior policy that would need reweighting.
## 6.7 Maximization Bias and Double Learning


Binary file added docs/images/Pasted image 20241020203656.png
Binary file added docs/images/Pasted image 20241020212513.png
Binary file added docs/images/Pasted image 20241020213317.png
Binary file added docs/images/Pasted image 20241020213339.png
Binary file added docs/images/Pasted image 20241020213450.png
Binary file added docs/images/Pasted image 20241020213623.png
Binary file added docs/images/Pasted image 20241021121518.png
Binary file added docs/images/Pasted image 20241021121638.png
