Commit

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 4.md
* PUSH ATTACHMENT : Pasted image 20240917104341.png
* PUSH ATTACHMENT : Pasted image 20240917104526.png
* PUSH ATTACHMENT : Pasted image 20240917104618.png
* PUSH NOTE : Reinforcement Learning - An Introduction.md
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md
* PUSH ATTACHMENT : Pasted image 20240917101226.png
* PUSH ATTACHMENT : Pasted image 20240917101536.png
Showing 9 changed files with 224 additions and 0 deletions.
66 changes: 66 additions & 0 deletions
...ference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 4.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 4.1 Policy evaluation

> [!NOTE] Equations 4.3 and 4.4
>
> $$
> \begin{align}
> v_{\pi}(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] && (\text{from (3.9)})\\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] && (4.3)\\
> &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right] && (4.4),
> \end{align}
> $$

> [!NOTE] Equation 4.5
>
> $$
> \begin{align}
> v_{k+1}(s) &\doteq \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s ] \\
> &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right] && (4.5),
> \end{align}
> $$

![[Pasted image 20240917104341.png|600]]
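
A minimal sketch of this backup (equation 4.5), assuming a small finite MDP given as explicit NumPy arrays; the array layout, the `theta` threshold, and the synchronous (two-array) sweep are illustrative choices, not the in-place pseudocode shown above.

```python
import numpy as np

def iterative_policy_evaluation(p, r, pi, gamma=0.9, theta=1e-8):
    """Sweep v_{k+1}(s) = sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')].

    p:  (S, A, S) transition probabilities p(s' | s, a)
    r:  (S, A, S) expected rewards r(s, a, s')
    pi: (S, A)    policy probabilities pi(a | s)
    """
    S = p.shape[0]
    v = np.zeros(S)
    while True:
        # Back up every state using the current estimate v_k.
        q = (p * (r + gamma * v)).sum(axis=2)      # q[s, a]
        v_new = (pi * q).sum(axis=1)               # expectation over actions under pi
        if np.max(np.abs(v_new - v)) < theta:      # stop when a full sweep barely changes v
            return v_new
        v = v_new
```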
## 4.2 Policy Improvement

> [!NOTE] Equation 4.6
>
> $$
> \begin{align}
> q_\pi(s, a) &\doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] && (4.6)\\
> &= \sum_{s', r}p(s', r \mid s, a)[r + \gamma v_\pi(s')] \\
> \end{align}
> $$
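
Under the same assumed array layout as the evaluation sketch, policy improvement just computes $q_\pi$ from $v_\pi$ via equation 4.6 and acts greedily:

```python
import numpy as np

def greedy_policy_improvement(p, r, v, gamma=0.9):
    """Return a deterministic greedy policy pi'(s) = argmax_a q_pi(s, a), where
    q_pi(s, a) = sum_s' p(s'|s,a) [r(s,a,s') + gamma v(s')]   (equation 4.6)."""
    q = (p * (r + gamma * v)).sum(axis=2)   # q[s, a]
    return q.argmax(axis=1)                 # greedy action per state
```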
## 4.3 Policy Iteration

![[Pasted image 20240917104526.png|600]]
## 4.4 Value Iteration

> "This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps."

> [!NOTE] Equation 4.10
>
> $$
> \begin{align}
> v_{k+1}(s) &\doteq \max_{a} \mathbb{E} [R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\
> &= \max_{a} \sum_{s', r} p(s', r \mid s, a)[r + \gamma v_k(s')] && (4.10)
> \end{align}
> $$

![[Pasted image 20240917104618.png|600]]
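
A matching value-iteration sketch under the same assumed array layout (`p`, `r`, `gamma`, `theta` are placeholders); the only change from policy evaluation is that the expectation over actions becomes a max.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, theta=1e-8):
    """Equation 4.10 in expected-reward form:
    v_{k+1}(s) = max_a sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')]."""
    S = p.shape[0]
    v = np.zeros(S)
    while True:
        q = (p * (r + gamma * v)).sum(axis=2)   # q[s, a] under the current estimate
        v_new = q.max(axis=1)                   # greedy backup instead of an expectation under pi
        if np.max(np.abs(v_new - v)) < theta:
            v = v_new
            break
        v = v_new
    greedy = (p * (r + gamma * v)).sum(axis=2).argmax(axis=1)  # policy implied by the converged values
    return v, greedy
```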

## 4.5 Asynchronous Dynamic Programming

> "These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. [...] To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to update."
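
A rough illustration of that flexibility, again assuming the array-style MDP of the earlier sketches: backups are applied in place to one randomly chosen state at a time instead of in full synchronous sweeps. The state-selection rule and update budget are arbitrary placeholders.

```python
import numpy as np

def asynchronous_value_iteration(p, r, gamma=0.9, num_updates=100_000, seed=0):
    """In-place value-iteration backups applied to one randomly chosen state at a time.
    Any selection rule works, as long as no state stops being updated."""
    rng = np.random.default_rng(seed)
    S = p.shape[0]
    v = np.zeros(S)
    for _ in range(num_updates):
        s = rng.integers(S)                                    # pick any state
        v[s] = (p[s] * (r[s] + gamma * v)).sum(axis=1).max()   # in-place greedy backup for that state
    return v
```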
68 changes: 68 additions & 0 deletions
...ference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 6.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 6.4 Sarsa: On-policy TD Control

> [!NOTE] Equation 6.7
>
> $$
> \begin{align}
> Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
> \end{align}
> $$

![[Pasted image 20240917101226.png|900]]
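
A possible tabular Sarsa episode loop corresponding to equation 6.7; the `env` object with `reset()` and `step(a) -> (s, r, done)`, and the hyperparameter values, are assumptions for illustration rather than an API from the book.

```python
import numpy as np

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """One episode of tabular Sarsa: update toward R + gamma * Q(S', A') (equation 6.7)."""
    rng = np.random.default_rng(seed)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(Q[s].argmax())

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)
        target = r if done else r + gamma * Q[s_next, a_next]   # no bootstrapping past termination
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
    return Q
```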
## 6.5 Q-learning: Off-policy TD Control

> [!NOTE] Equation 6.8
>
> $$
> \begin{align}
> Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]
> \end{align}
> $$

![[Pasted image 20240917101536.png|700]]
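
The same skeleton with the Q-learning target of equation 6.8: the update bootstraps from the greedy action in the next state regardless of what the behaviour policy actually does (same assumed `env` interface as above).

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """One episode of tabular Q-learning: update toward R + gamma * max_a Q(S', a) (equation 6.8)."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behaviour policy.
        a = int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```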

## 6.6 Expected SARSA

> [!NOTE] Equation 6.9
>
> $$
> \begin{align}
> Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \mathbb{E}_\pi [Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \right] \\
> &= Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right] && (6.9)
> \end{align}
> $$
It is more computationally demanding, but it eliminates the variance due to the random selection of $A_{t+1}$, so it is more stable and generally fares better than Q-learning and Sarsa.
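
A single Expected Sarsa backup (equation 6.9) as a sketch; the tabular `Q` array and the explicit target-policy vector `pi_next` are assumed inputs, not anything prescribed by the text.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.99):
    """Tabular Expected Sarsa backup: pi_next[a'] is the target policy's probability
    of action a' in s_next, so the bootstrap term is E_pi[Q(S_{t+1}, .)] instead of
    a single sampled Q(S_{t+1}, A_{t+1})."""
    expected_q = float((pi_next * Q[s_next]).sum())
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```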

## 6.7 Maximization Bias and Double Learning

> "All the control algorithms that we have discussed so far involve maximization in the construction of their target policies"

This causes *maximization bias*:
- think of estimating the mean of a $\mathcal{N}(-0.1, 1)$ reward from samples
- at some point this estimate might be $0.1$, while another action's value is correctly estimated as $0$
- the optimal choice is the action worth $0$, but because we take the max over noisy estimates, we bias ourselves towards the positive error

The general way to solve this is to estimate two independent value functions: one for obtaining the best action ($Q_1$) and the other for evaluating it ($Q_2$).

$$
\begin{align}
A^* &= \text{argmax}_a Q_1(a) \\
Q_2(A^*) &= Q_2(\text{argmax}_a Q_1(a))
\end{align}
$$

This yields an unbiased estimate: $\mathbb{E}[Q_2(A^*)] = q(A^*)$.
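
A minimal sketch of double Q-learning, the control algorithm built on this idea in section 6.7: each step updates one of the two tables, using the other table to evaluate the action the first one selects. The `env` interface and hyperparameters are again illustrative assumptions.

```python
import numpy as np

def double_q_learning_episode(env, Q1, Q2, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """One episode of tabular Double Q-learning: select with one table, evaluate with the other."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    done = False
    while not done:
        # Behave epsilon-greedily with respect to the combined estimate.
        q_sum = Q1[s] + Q2[s]
        a = int(rng.integers(len(q_sum))) if rng.random() < epsilon else int(q_sum.argmax())
        s_next, r, done = env.step(a)
        if rng.random() < 0.5:                      # update one table at random
            a_star = int(Q1[s_next].argmax())       # Q1 picks the action ...
            target = r if done else r + gamma * Q2[s_next, a_star]   # ... Q2 evaluates it
            Q1[s, a] += alpha * (target - Q1[s, a])
        else:
            a_star = int(Q2[s_next].argmax())
            target = r if done else r + gamma * Q1[s_next, a_star]
            Q2[s, a] += alpha * (target - Q2[s, a])
        s = s_next
    return Q1, Q2
```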
76 changes: 76 additions & 0 deletions
...ference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 9.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---
# 9. On-policy prediction with approximation

**Problem setting**: In most real scenarios, the number of states is too large for tabular learning algorithms, so we approximate the value function by a learned, parametrized function:
$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$
- Examples of possible modelling choices for this function are linear functions of features, non-linear functions, neural networks, etc. (a minimal linear example is sketched below)
- $\mathbf{w} \in \mathbb{R}^d$ with $d \ll |\mathcal{S}|$, which means that updating one state affects many others: *generalization*
- This formulation allows for partially observable states.
- Side note: not all convergence proofs apply to all function classes (for more info see [UCL x DeepMind 7/13](https://youtu.be/ook46h2Jfb4?si=uEHHuZOyzDgkmLqr))
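
As a concrete (assumed) instance of such a parametrization: a linear value estimate over a hand-rolled feature map. Both the feature map and the dimensionality are placeholders chosen for illustration.

```python
import numpy as np

def features(s, d=8):
    """Hypothetical feature map x(s) in R^d; in practice this encodes domain knowledge."""
    x = np.zeros(d)
    x[s % d] = 1.0          # toy one-hot-style encoding of the state index
    return x

def v_hat(s, w):
    """Linear value approximation: v_hat(s, w) = w^T x(s)."""
    return float(w @ features(s, d=len(w)))
```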
## 9.1 Value-function approximation

New notation: $s \to u$ denotes an individual update, where $s$ is the state whose value is updated and $u$ is the update target that $\hat{v}(s, \mathbf{w})$ is shifted towards.

> [!WARNING]- How does the learning setting differ between neural networks (supervised) and reinforcement learning?
> RL requires modeling that allows:
> - online learning (while interacting with the environment), incrementally acquiring data
>     - Remember that supervised learning suffers from catastrophic forgetting
> - non-stationary target functions
>
> Supervised learning assumes i.i.d. sampling from a fixed but unknown data distribution.
## 9.2 The Prediction Objective ($\overline{VE}$)

> [!FAQ]- Why do we need a prediction objective now? What has changed?
> In the tabular setting we had two nice properties:
> - the learned value function could actually converge exactly to the true value function
> - the value of a state was decoupled from other states
>
> Without these two, we must say which states are most important to us.

> [!NOTE] Equation 9.1
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && (9.1)
> \end{align}
> $$
Where:
- $\mu(s)$ is the *state distribution* (reminder: non-negative, sums to one); a small numeric sketch follows

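On a finite state space the objective is just a weighted squared error. A tiny sketch, assuming the arrays `mu`, `v_pi`, and the approximate values are all available (in practice $v_\pi$ is unknown, which is the whole point of learning):

```python
import numpy as np

def ve_objective(mu, v_pi, v_hat_all):
    """Mean squared value error VE(w) = sum_s mu(s) [v_pi(s) - v_hat(s, w)]^2  (equation 9.1).
    All three arguments are arrays indexed by state; mu is non-negative and sums to one."""
    return float(np.sum(mu * (v_pi - v_hat_all) ** 2))
```
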
For on-policy episodic tasks, $\mu(s)$ is called the *on-policy distribution*, which can be defined as follows:

> [!NOTE] Equations 9.2 and 9.3
>
> $$
> \begin{align}
> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) p(s \mid \bar{s}, a), && \text{for all } s \in \mathcal{S} && (9.2)
> \end{align}
> $$
>
> $$
> \begin{align}
> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && (9.3)
> \end{align}
> $$
Where:
- $h(s)$ is the probability that an episode begins in state $s$.
- $\eta(s)$ is the average number of time steps spent in state $s$ during a single episode.
- Interpretation of the two terms: time is spent in $s$ either because an episode starts in $s$ or because some preceding state $\bar{s}$ transitions into $s$ (see the sketch below).
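
On a small episodic MDP where the policy-induced transition matrix is known explicitly, $\eta$ and $\mu$ could be computed directly; a sketch with assumed inputs (`h`, `p_pi`), relying on episodes terminating so that the linear system is solvable:

```python
import numpy as np

def on_policy_distribution(h, p_pi):
    """Solve eta = h + P_pi^T eta for the expected visit counts, then normalize (equations 9.2, 9.3).

    h:    (S,)   start-state probabilities h(s)
    p_pi: (S, S) transitions under pi, p_pi[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar, a)
    """
    S = len(h)
    eta = np.linalg.solve(np.eye(S) - p_pi.T, h)   # (I - P_pi^T) eta = h
    return eta / eta.sum()                         # mu(s) = eta(s) / sum_s' eta(s')
```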

- $\overline{VE}$ only guarantees local optimality.

## 9.3 Stochastic-gradient and Semi-gradient Methods
14 changes: 14 additions & 0 deletions
docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
url:
share: true
---

- [[Reinforcement Learning - An Introduction - Chapter 4|Reinforcement Learning - An Introduction - Chapter 4]]
- [[Reinforcement Learning - An Introduction - Chapter 6|Reinforcement Learning - An Introduction - Chapter 6]]