[PUBLISHER] Merge #50
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 2.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 16.md

* PUSH ATTACHMENT : Pasted image 20241020203656.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 5.md

* PUSH ATTACHMENT : Pasted image 20241021121638.png

* PUSH ATTACHMENT : Pasted image 20241021121518.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 4.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 3.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 13.md

* PUSH ATTACHMENT : Pasted image 20241020212513.png

* PUSH ATTACHMENT : Pasted image 20241020213317.png

* PUSH ATTACHMENT : Pasted image 20241020213339.png

* PUSH ATTACHMENT : Pasted image 20241020213450.png

* PUSH ATTACHMENT : Pasted image 20241020213623.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md
dgcnz authored Oct 21, 2024
1 parent 77d3236 commit c869a2f
Showing 16 changed files with 195 additions and 6 deletions.
@@ -97,6 +97,29 @@ TLDR: Using the geometry of the value function, we find that $\overline{BE}$ mea
> \overline{BE}({\mathbf{w}}) \doteq || \bar{\delta} ||^2_\mu
> $$
> With linear function approximation there always exists an approximate value function (within the subspace) with zero $\overline{PBE}$; this is the TD fixed point, $\mathbf{w}_{TD}$

> [!NOTE] Equation 11.13: Projection matrix for linear function approximation
>
> $$
> \Pi \doteq \mathbf{X} (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D}
> $$
>
> Where:
> - $\mathbf{X} \in \mathbb{R}^{|\mathcal{S}| \times d}$ is the matrix of feature vectors
> - $\mathbf{D} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ is a diagonal matrix with $\mu(s)$ on the diagonal

> [!NOTE] Equation 11.22: Mean square Projected Bellman error
>
> $$
> \overline{PBE}(\mathbf{w}) \doteq || \Pi \bar{\delta}_{\mathbf{w}} ||^2_\mu
> $$
>
> Where:
> - $\Pi$ is the projection matrix
> - $\bar{\delta}_{\mathbf{w}}$ is the Bellman error vector (one entry per state)
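
A minimal NumPy sketch of Equations 11.13 and 11.22 together (the feature matrix `X`, the distribution `mu`, and the Bellman error vector `delta_bar` are made-up toy values):

```python
import numpy as np

n_states, d = 5, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(n_states, d))        # feature matrix, one row x(s)^T per state
mu = np.full(n_states, 1.0 / n_states)    # state distribution mu(s)
D = np.diag(mu)

# Projection onto the representable subspace, weighted by mu (Eq. 11.13)
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D
assert np.allclose(Pi @ Pi, Pi)           # a projection matrix is idempotent

# Mean square Projected Bellman error (Eq. 11.22); delta_bar stands in for the
# Bellman error vector, which would normally come from the MDP model and v_w.
delta_bar = rng.normal(size=n_states)
proj = Pi @ delta_bar
pbe = proj @ D @ proj                     # ||Pi delta_bar||^2_mu
```
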
## 11.5 Gradient Descent in the Bellman Error

TLDR: Semi-gradient methods might diverge, but true SGD doesn't! Sadly, both $\overline{TDE}$ and $\overline{BE}$ yield bad minima.
@@ -138,7 +161,9 @@ TLDR: $\overline{BE}$ is not learnable but $\overline{TDE}$ and $\overline{PBE}
## 11.7 Gradient-TD Methods

TLDR: To minimize $\overline{PBE}$ using SGD efficiently we use two separate estimates for dependent expectations. This yields two algorithms: GTD2 and TDC.

**DISCLAIMER: These methods only work with linear function approximation.**
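
As a rough illustration of the "two separate estimates" idea, here is a per-step TDC update sketch with linear features (all names, step sizes, and the importance ratio `rho` are placeholders; the secondary weights `v` estimate the expectation that depends on `w`):

```python
import numpy as np

def tdc_step(w, v, x, x_next, reward, rho=1.0, gamma=0.99, alpha=0.01, beta=0.1):
    """One TDC update with linear value function v_hat(s) = w @ x(s)."""
    delta = reward + gamma * (w @ x_next) - (w @ x)                 # TD error
    w = w + alpha * rho * (delta * x - gamma * x_next * (x @ v))    # primary weights
    v = v + beta * rho * (delta - (x @ v)) * x                      # secondary estimator
    return w, v
```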

> [!NOTE] Equation 11.27: Gradient of $\overline{PBE}$
>
@@ -153,6 +153,7 @@ Note:
> [!FAQ]- Why does REINFORCE yield slow learning?
> Because as a Monte Carlo method, it has high variance.

**DISCLAIMER: For uva-rl1, this method is called REINFORCE v2. REINFORCE v1 uses the full return at each step.**
## 13.4 REINFORCE with Baseline

The policy gradient can be generalized to include any baseline function $b(s)$, as long as it is independent of the action.
@@ -186,9 +187,23 @@

## 13.5 Actor-Critic Methods

TLDR: Expand the use of the baseline/value function to bootstrapped targets (one-step returns, multi-step returns, TD($\lambda$), etc.). This helps with variance; it adds bias, but the bias can be controlled via the degree of bootstrapping (e.g., $\lambda$).

Example:
- Use value function as baseline
- Use one-step returns with value bootstrap as target

> [!NOTE] Equation 13.12, 13.13 and 13.14: One-step Actor-critic update rule
>
> $$
> \begin{align}
> \theta_{t+1} &\doteq \theta_t + \alpha \left( G_{t:t+1} - \hat{v}(S_t, \mathbf{w}) \right) \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \quad \tag{13.12} \\
> &= \theta_t + \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right) \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \quad \tag{13.13} \\
> &= \theta_t + \alpha \delta_t \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \quad \tag{13.14}
> \end{align}
> $$
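
A minimal sketch of one step of Equations 13.12-13.14 (the callables `v_hat`, `grad_v_hat`, and `grad_log_pi` are assumed to come from whatever function approximator and policy parameterization are in use; the book's extra $\gamma^t$ factor is omitted here):

```python
def actor_critic_step(theta, w, s, a, r, s_next, done,
                      v_hat, grad_v_hat, grad_log_pi,
                      alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    """One-step actor-critic update: critic on the TD error, actor on delta * grad log pi."""
    target = r + (0.0 if done else gamma * v_hat(s_next, w))   # G_{t:t+1}
    delta = target - v_hat(s, w)                               # TD error (Eq. 13.13 / 13.14)
    w = w + alpha_w * delta * grad_v_hat(s, w)                 # critic update
    theta = theta + alpha_theta * delta * grad_log_pi(s, a, theta)  # actor update
    return theta, w
```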

![[Pasted image 20241020212513.png|800]]
## 13.6 Policy Gradient for Continuing Problems

> [!NOTE] Equation 13.15: Average rate of reward per time step
@@ -215,3 +230,28 @@ Note: not part of the course readings, missing remaining notes for this subsecti
*todo: add notes*

$d \tau$

## Extra: Deterministic Policy Gradients

- Use deterministic policy as target policy
- Use stochastic policy as behavior policy (example: target + noise)


![[Pasted image 20241020213317.png|500]]

![[Pasted image 20241020213339.png|500]]

DPG with Q-learning update
![[Pasted image 20241020213450.png|500]]

Only works with continuous actions.
Discrete actions break the gradient flow through the critic, since the update differentiates $\hat{q}(s, a)$ with respect to the action.

![[Pasted image 20241020213623.png|500]]


Deep DPG (DDPG) = DPG + modifications to make it work with neural networks as function approximators:
- Use experience replay
- Use a target network ("double Q-learning", as in DQN)


@@ -0,0 +1,29 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
- rl1
url:
share: true
---
# 16 Applications and Case Studies
## 16.5 Human-level Video Game Play

DQN's network architecture: Conv2d + ReLU blocks for feature extraction and linear layers for the output.
![[Pasted image 20241020203656.png|Pasted image 20241020203656.png]]


> [!NOTE] Equation 16.3: DQN Semi-Gradient update rule
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha \left[ R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a; \mathbf{w}_{t}) - \hat{q}(S_t, A_t; \mathbf{w}_{t}) \right] \nabla \hat{q}(S_t, A_t; \mathbf{w}_{t})
> $$

> [!FAQ]- What are the three modifications to Q-learning that make DQN?
> 1. Experience Replay: Useful to use data better and remove the dependence of successive experiences on the current weights.
> 2. "Double Q-learning": Keep a copy of the network at the previous step to provide targets to avoid divergence and oscillations.
> 3. Clip the error term $R_{t+1} + \gamma \max_{a} q(S_{t+1}, a; \mathbf{w}_{t}) - q(S_t, A_t; \mathbf{w}_{t})$ to $[-1, 1]$ to improve stability.
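
A rough PyTorch sketch of how the target-network copy and the error clipping enter a single update (the networks, optimizer, and replay-buffer batch are placeholders):

```python
import torch

def dqn_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One semi-gradient step in the spirit of Eq. 16.3, with a frozen target network and clipped error."""
    s, a, r, s_next, done = batch                                  # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # q(S_t, A_t; w)
    with torch.no_grad():                                          # targets come from the frozen copy
        target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
        delta = (target - q_sa).clamp(-1.0, 1.0)                   # clip the error term to [-1, 1]
    loss = (-delta * q_sa).mean()                                  # its gradient is -delta * grad q(S_t, A_t; w)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
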
@@ -0,0 +1,69 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
- rl1
url:
share: true
---
# 2 Multi-armed Bandits

## 2.2 Action-value Methods


> [!NOTE] Equation 2.1: Sample-average Method
>
> $$
> Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} \tag{2.1}
> $$

> [!NOTE] Equation 2.2: Greedy Action Selection
>
> $$
> A_t \doteq \underset{a}{\arg\max} Q_t(a) \tag{2.2}
> $$
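
A tiny sketch of Equations 2.1 and 2.2 using running sums and counts (the 10-armed setup is a placeholder):

```python
import numpy as np

k = 10                                   # number of arms
reward_sum = np.zeros(k)                 # sum of rewards per action
count = np.zeros(k)                      # times each action has been taken

def q_estimate():
    """Sample-average estimates (Eq. 2.1); untried actions default to 0."""
    return np.divide(reward_sum, count, out=np.zeros(k), where=count > 0)

def greedy_action():
    return int(np.argmax(q_estimate()))  # Eq. 2.2 (ties broken by lowest index)
```
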
## 2.4 Incremental Implementation


> [!NOTE] Equation 2.4: Incremental Sample-average method
>
> $$
> Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n] \tag{2.4}
> $$
>
> Where:
> - $Q_1$ is usually initialized to zero.
> - $R_n$ is the reward received after the $n$-th selection of action $a$
> - $Q_n$ denotes the estimate of the action's value after it has been selected $n-1$ times
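
A minimal sketch of the incremental form, tracking a single action's estimate:

```python
def incremental_update(q_n, r_n, n):
    """Eq. 2.4: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n), after the n-th selection of this action."""
    return q_n + (r_n - q_n) / n

q, n = 0.0, 0                  # Q_1 initialized to zero
for r in [1.0, 0.0, 2.0]:      # placeholder rewards for this action
    n += 1
    q = incremental_update(q, r, n)
```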

## 2.5 Tracking a Nonstationary Problem


> [!NOTE] Equation 2.7: Two learning rate conditions to ensure convergence
>
> $$
> \sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty \tag{2.7}
> $$
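
The sample-average step size $\alpha_n = 1/n$ satisfies both conditions; a constant $\alpha$ satisfies the first but violates the second, so the estimates never fully converge, which is exactly what we want when the target keeps moving. A minimal sketch of the constant step-size (exponential recency-weighted) update discussed in this section:

```python
def constant_alpha_update(q, r, alpha=0.1):
    """Q <- Q + alpha * (R - Q): recent rewards weigh more, older ones decay geometrically."""
    return q + alpha * (r - q)
```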

## 2.6 Optimistic Initial Values

TLDR: Initializing $Q_1(a)$ optimistically (higher than any realistic reward) encourages exploration: early rewards are "disappointing", so the learner keeps switching to other, still-optimistic actions.

## 2.7 Upper-Confidence-Bound Action Selection


> [!NOTE] Equation 2.10: UCB action selection
>
> $$
> A_t \doteq \underset{a}{\arg\max} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] \tag{2.10}
> $$
>
> Where:
> - $c > 0$ controls the degree of exploration.
> - $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$ (if $N_t(a) = 0$, $a$ is considered a maximizing action).
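
A minimal sketch of the UCB rule (following the convention that an action with $N_t(a) = 0$ is treated as maximizing):

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """q: value estimates, n: per-action selection counts, t: current time step (t >= 1)."""
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:                 # untried actions are treated as maximizing
        return int(untried[0])
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```
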
Original file line number Diff line number Diff line change
@@ -73,19 +73,28 @@ You can obtain the expected reward also.

A policy $\pi(a \mid s)$ is a probability distribution over actions given states.


> [!NOTE] Equation 3.12: State-value function
>
> $$
> v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \;\; \forall s \in \mathcal{S} \tag{3.12}
> $$

> [!NOTE] Equation 3.13: Action-value function
>
> $$
> q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \;\; \forall s \in \mathcal{S}, a \in \mathcal{A} \tag{3.13}
> $$

> [!NOTE] Writing $v_{\pi}$ in terms of $q_{\pi}$
>
> $$
> v_{\pi}(s) = \sum_{a} \pi(a \mid s) q_{\pi}(s, a)
> $$

> [!NOTE] Equation 3.14: Bellman equation for $v_{\pi}$
>
> $$
@@ -99,6 +108,12 @@ A policy $\pi(a \mid s)$ is a probability distribution over actions given states
## 3.6 Optimal Policies and Optimal Value Functions

A policy $\pi^*$ is the optimal policy if:

$$
v_{\pi^*}(s) \geq v_{\pi'}(s) \quad \forall s \in \mathcal{S}, \; \forall \pi' \in \Pi
$$

> [!NOTE] Equation 3.15: Optimal state-value function
>
> $$
Expand Down
Original file line number Diff line number Diff line change
@@ -9,6 +9,7 @@ tags:
url:
share: true
---
# 4 Dynamic Programming
## 4.1 Policy evaluation

> [!NOTE] Equations 4.3 and 4.4
Expand Down
Original file line number Diff line number Diff line change
@@ -16,14 +16,18 @@ first-visit mc
every-visit mc

- [ ] TODO: finish notes

## 5.3 Monte Carlo Control

![[Pasted image 20241021121638.png|600]]
## 5.4 Monte Carlo Control without Exploring Starts

- $\epsilon$-greedy policy
- All non-greedy actions have minimum probability of $\frac{\epsilon}{|\mathcal{A}|}$
- Greedy action has probability $(1 - \epsilon) + \frac{\epsilon}{|\mathcal{A}|}$
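
A small sketch of the resulting action distribution for one state (the estimates `q_s` are placeholders):

```python
import numpy as np

def epsilon_greedy_probs(q_s, epsilon=0.1):
    """pi(a|s) for an epsilon-greedy policy with respect to the estimates q_s."""
    n = len(q_s)
    probs = np.full(n, epsilon / n)               # every action gets at least epsilon / |A|
    probs[int(np.argmax(q_s))] += 1.0 - epsilon   # the greedy action gets the remaining mass
    return probs
```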

- [ ] TODO: finish notes

![[Pasted image 20241021121518.png|600]]
## 5.5 Off-policy Prediction via Importance Sampling

Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t, S_{t+1}, A_{t+1}, \dots, S_T$, under the policy $\pi$ is given by:
@@ -63,7 +67,7 @@ $$
> V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1}} \tag{5.6}
> $$
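
A minimal sketch of the weighted estimator for a single state, next to the ordinary one (the per-visit returns and ratios are placeholder values):

```python
import numpy as np

returns = np.array([1.0, 0.0, 2.0])      # G_t for each visit t in T(s)
rhos = np.array([0.5, 2.0, 1.5])         # importance-sampling ratios rho_{t:T-1}

v_ordinary = (rhos * returns).sum() / len(returns)   # ordinary importance sampling
v_weighted = (rhos * returns).sum() / rhos.sum()     # weighted importance sampling (Eq. 5.6)
```
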
![[Pasted image 20240929183258.png|600]]

In practice, weighted importance sampling has much lower error at the beginning.

@@ -9,6 +9,8 @@ tags:
url:
share: true
---
# 6 Temporal-Difference Learning

## 6.1 TD Prediction

> [!NOTE] Equation 6.2: TD(0) update
@@ -84,6 +86,10 @@ share: true
It's more computationally demanding, but it's more stable and fares better than Q-learning and Sarsa.

It can also be used as-is in the off-policy case.
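
A minimal sketch of the Expected Sarsa target (a tabular `Q` and an explicit policy table `pi` are placeholders):

```python
import numpy as np

def expected_sarsa_target(r, s_next, Q, pi, gamma=0.99):
    """r + gamma * sum_a pi(a|s') Q(s', a): the expectation over pi is computed, not sampled."""
    return r + gamma * float(np.dot(pi[s_next], Q[s_next]))
```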

> [!FAQ]- Why doesn't Expected SARSA off-policy need importance sampling?
> I wasn't convinced by the slides' explanation, so I'll have to check a proper one later. A standard argument: the target computes the expectation over $\pi$ explicitly rather than sampling the next action, so there is no next-action sample from the behavior policy that would need reweighting.
## 6.7 Maximization Bias and Double Learning


Binary file added docs/images/Pasted image 20241020203656.png
Binary file added docs/images/Pasted image 20241020212513.png
Binary file added docs/images/Pasted image 20241020213317.png
Binary file added docs/images/Pasted image 20241020213339.png
Binary file added docs/images/Pasted image 20241020213450.png
Binary file added docs/images/Pasted image 20241020213623.png
Binary file added docs/images/Pasted image 20241021121518.png
Binary file added docs/images/Pasted image 20241021121638.png
