[PUBLISHER] Merge #45
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 13.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md
dgcnz authored Oct 6, 2024
1 parent 28a897f commit 8b44e44
Showing 3 changed files with 120 additions and 6 deletions.
@@ -24,7 +24,7 @@ share: true
TODO:
- [x] 11.11 mu norm equation ✅ 2024-10-01
- [x] 11.17 and 11.18 bellman error ✅ 2024-10-01
- [ ] 11.19 mean square bellman error
- [x] 11.19 mean square bellman error ✅ 2024-10-01

> [!NOTE] Equation 11.11: $\mu$-norm
>
@@ -41,9 +41,11 @@ TODO:
> \end{align}
> $$



> [!NOTE] Equation 11.19: Mean-square Bellman error
>
> $$
> \overline{BE}(\mathbf{w}) \doteq || \bar{\delta}_{\mathbf{w}} ||^2_\mu
> $$
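
To make Equations 11.11 and 11.19 concrete, here is a tiny numerical sketch for a tabular Markov reward process with linear values $v_{\mathbf{w}}(s) = \mathbf{w}^\intercal \mathbf{x}(s)$: compute the Bellman error $\bar{\delta}_{\mathbf{w}}(s)$ (the expected TD error) at every state, then take its $\mu$-weighted squared norm. All transition probabilities, rewards, features, and weights below are made-up placeholders, not values from the book.

```python
import numpy as np

# Illustrative 2-state MRP with linear values v_w(s) = w^T x(s) (all numbers assumed).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])          # state-to-state transition probabilities under pi
r = np.array([1.0, 0.0])            # expected one-step rewards
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # feature vectors x(s) as rows
mu = np.array([0.6, 0.4])           # state weighting mu (e.g. the on-policy distribution)
gamma = 0.9
w = np.array([5.0, 2.0])

v = X @ w                           # v_w(s) for every state
delta_bar = r + gamma * (P @ v) - v # Bellman error (expected TD error) at each state
BE = np.sum(mu * delta_bar ** 2)    # ||delta_bar||^2_mu, i.e. Eqs. 11.11 and 11.19
print(delta_bar, BE)
```
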
## 11.5 Gradient Descent in the Bellman Error

@@ -67,5 +69,3 @@ TODO:
> \end{align}
> $$


@@ -0,0 +1,112 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
# 13 Policy Gradient Methods

- Value-based methods: learn action-value estimates.
- Policy Gradient methods: learn **parameterized policies**, $\pi(a \mid s, \boldsymbol{\theta})$.
  - (i.e., policies that can select actions without consulting a value function).
- Actor-critic methods: learn both.
- Actor $\to$ policy
- Critic $\to$ value function

Notation:
- $\boldsymbol{\theta}$: policy parameters
- $\mathbf{w}$ : value function parameters

> [!NOTE] Equation 13.1: Gradient ascent update of policy parameters
>
> $$
> \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta})} \tag{13.1}
> $$

## 13.1 Policy Approximation and its Advantages

> [!FAQ]- What is enforced in practice to ensure exploration in PGMs?
> That the policy never becomes deterministic.
> [!NOTE] Equation 13.2: Soft-max in action preferences
>
> $$
> \pi(a \mid s, \boldsymbol{\theta}) \doteq \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_{b} e^{h(s, b, \boldsymbol{\theta})}} \tag{13.2}
> $$
> [!FAQ]- Which conditions must the problem fulfill in order to reasonably consider the *Soft-max in action preferences* parametrization?
> The action space must be discrete and small enough.
> [!NOTE] Equation 13.3: Linear parametrization of policy preferences $h$
>
> $$
> h(s, a, \boldsymbol{\theta}) \doteq \boldsymbol{\theta}^\intercal \mathbf{x}(s, a) \tag{13.3}
> $$
>
> where $\mathbf{x}(s, a)$ is a feature vector that describes the state-action pair (a short code sketch of this parametrization follows at the end of this section).
> [!FAQ]- What are the advantages of parametrizing policies according to the *soft-max in action preferences* compared with $\epsilon$-greedy policies?
> 1. Enables policies to approach a deterministic policy. $\epsilon$-greedy policies always maintain a minimum non-greedy probability $\epsilon$.
> 2. It enables truly stochastic policies. $\epsilon$-greedy policies force the policy to be almost greedy, but sometimes the best policy is to do $x$ with probability $p$ and $y$ with probability $1-p$ (e.g. poker bluffing).
> [!FAQ]- What is the most important reason for using policy gradient methods instead of value-based methods?
> They allow you to inject prior knowledge about the desired form of the policy.
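
A minimal sketch of the soft-max-in-action-preferences policy with linear preferences (Equations 13.2 and 13.3). The feature layout (one column of $\mathbf{x}(s, a)$ per action) and all dimensions are assumptions made for illustration, not something the book prescribes.

```python
import numpy as np

def policy(theta: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Soft-max in action preferences (Eq. 13.2) with linear preferences
    h(s, a, theta) = theta^T x(s, a) (Eq. 13.3).

    x holds one feature column per action, shape (d, num_actions).
    Returns pi(. | s, theta), a probability vector over the actions."""
    h = theta @ x            # preferences h(s, a, theta) for every action a
    e = np.exp(h - h.max())  # subtract the max for numerical stability
    return e / e.sum()

# Illustrative call: 4-dimensional features, 3 actions (assumed sizes).
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
x_s = rng.normal(size=(4, 3))    # x(s, a) for one state s, stacked column-wise
print(policy(theta, x_s))        # sums to 1 and is never exactly deterministic
```

Because every exponential is strictly positive, the policy stays stochastic (the exploration condition mentioned above), while large preference gaps can still push it arbitrarily close to deterministic.
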
## 13.2 The Policy Gradient Theorem

> [!FAQ]- How does continuous policy parametrization help convergence? Compare it to VBMs.
> With VBMs, a small change in value function could drastically change the policy.
> With PGMs, a small change in policy parameters will only change the policy slightly, which helps optimization.
> [!NOTE] Equation 13.4: Performance $J(\boldsymbol{\theta})$ for the episodic case
>
> $$
> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) \tag{13.4}
> $$
> [!FAQ]- What problem/question does the Policy Gradient Theorem answer?
> How can we estimate the performance gradient with respect to the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?
>
> **The policy gradient theorem allows for an estimate of the performance gradient irrespective of the derivative of the state distribution.**
> [!NOTE] Equation 13.5: Policy gradient theorem
>
> $$
> \nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla \pi(a \mid s, \boldsymbol{\theta}) \tag{13.5}
> $$
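
To make Equation 13.5 concrete, here is a sketch that evaluates its right-hand side (up to the constant of proportionality) for a small tabular problem with the soft-max-linear policy above. The state distribution $\mu$, the $q_\pi$ values, the features, and all sizes are assumed placeholders, not anything from the book.

```python
import numpy as np

def pi_and_grad(theta, x):
    """Soft-max-linear policy pi(. | s, theta) and its Jacobian w.r.t. theta.

    x: (d, num_actions) feature matrix for one state.
    Returns pi with shape (num_actions,) and dpi with shape (num_actions, d)."""
    h = theta @ x
    e = np.exp(h - h.max())
    pi = e / e.sum()
    xbar = x @ pi                              # policy-weighted average feature vector
    dpi = pi[:, None] * (x.T - xbar[None, :])  # d pi_a / d theta = pi_a (x_a - xbar)
    return pi, dpi

def grad_J_rhs(theta, features, mu, q):
    """Right-hand side of Eq. 13.5: sum_s mu(s) sum_a q_pi(s, a) grad pi(a | s, theta)."""
    g = np.zeros_like(theta)
    for s in range(len(mu)):
        _, dpi = pi_and_grad(theta, features[s])
        g += mu[s] * (q[s] @ dpi)              # inner sum over actions
    return g

# Illustrative tabular setup: 2 states, 3 actions, 4 features (all values assumed).
rng = np.random.default_rng(1)
theta = rng.normal(size=4)
features = rng.normal(size=(2, 4, 3))          # x(s, a), one column per action
mu = np.array([0.6, 0.4])                      # on-policy state distribution
q = rng.normal(size=(2, 3))                    # q_pi(s, a) estimates
print(grad_J_rhs(theta, features, mu, q))      # proportional to grad J(theta)
```
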
## 13.3 REINFORCE: Monte Carlo Policy Gradient

> [!NOTE] Equation 13.6: *All-actions* policy gradient
>
> $$
> \begin{align}
> \nabla J(\boldsymbol{\theta}) &\propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla \pi(a \mid s, \boldsymbol{\theta}) \\
> &= \mathbb{E}_{\pi} \left[ \sum_a q_{\pi}(S_t, a) \nabla \pi(a \mid S_t, \boldsymbol{\theta}) \right] \\
> \end{align}
> $$
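
Sampling the all-actions expectation with the action actually taken and the return $G_t$ gives the REINFORCE update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \gamma^t G_t \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$ (the $\gamma^t$ factor follows the book's discounted episodic pseudocode and is 1 when $\gamma = 1$). For the soft-max-linear policy, $\nabla \ln \pi(a \mid s, \boldsymbol{\theta}) = \mathbf{x}(s, a) - \sum_b \pi(b \mid s, \boldsymbol{\theta})\,\mathbf{x}(s, b)$. A minimal sketch under these assumptions; the episode format, step size, and all numbers are made up for illustration.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def grad_log_pi(theta, x, a):
    """grad ln pi(a | s, theta) for the soft-max-linear policy:
    x(s, a) minus the policy-weighted average feature vector."""
    pi = softmax(theta @ x)
    return x[:, a] - x @ pi

def reinforce(theta, episode, alpha, gamma):
    """One Monte Carlo policy-gradient pass over a finished episode.

    episode: list of (x_s, a, r), where x_s is the (d, num_actions) feature
    matrix of S_t, a the action taken, and r the reward R_{t+1}."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):   # returns G_t, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, (x_s, a, _) in enumerate(episode):
        # gamma**t follows the discounted episodic pseudocode; it is 1 when gamma = 1.
        theta = theta + alpha * gamma**t * returns[t] * grad_log_pi(theta, x_s, a)
    return theta

# Illustrative call with made-up episode data (shapes and numbers assumed).
rng = np.random.default_rng(2)
episode = [(rng.normal(size=(4, 3)), int(rng.integers(3)), float(rng.normal()))
           for _ in range(5)]
print(reinforce(np.zeros(4), episode, alpha=0.01, gamma=0.99))
```
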
## 13.4 REINFORCE with Baseline

#todo

## 13.6 Policy Gradient for Continuing Problems

> [!NOTE] Equation 13.15: Average rate of reward per time step
>
> $$
> \begin{align}
> J(\boldsymbol{\theta}) &\doteq r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E} \left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right] \tag{13.15} \\
> &= \lim_{t \to \infty} \mathbb{E} \left[R_t \mid S_0, A_{0:t-1} \sim \pi \right] \\
> &= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) r
> \end{align}
> $$
>
> Where:
> - $\mu(s) \doteq \lim_{t\to \infty} \mathbb{P} \left[S_t = s \mid A_{0:t-1} \sim \pi \right]$ is the steady-state distribution of states under $\pi$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption).
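
A small numerical sketch of the last line of Equation 13.15 for a tabular continuing MDP: compute the steady-state distribution $\mu$ as the stationary distribution of the policy-induced state-transition matrix, then average the expected one-step reward. Every number below (transitions, rewards, policy) is made up for illustration.

```python
import numpy as np

# Illustrative 2-state, 2-action continuing MDP (all numbers assumed).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = expected one-step reward
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],                 # pi(a | s), rows sum to 1
               [0.4, 0.6]])

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P[s, a, s'].
P_pi = np.einsum('sa,sat->st', pi, P)

# Steady-state distribution mu: the left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

# r(pi) = sum_s mu(s) sum_a pi(a|s) sum_{s', r} p(s', r | s, a) r   (Eq. 13.15)
r_pi = np.sum(mu[:, None] * pi * R)
print(mu, r_pi)
```
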

@@ -165,6 +165,8 @@ Examples of $U_t$:

## 9.6 Selecting Step-Size Parameters Manually

“The classical choice $\alpha_t = 1/t$, which produces sample averages in tabular MC methods, is not appropriate for TD methods, for nonstationary problems, or for any method using function approximation.” (Sutton and Barto, 2020, p. 244)


> [!NOTE] Equation 9.19
> Suppose you wanted to learn in about $\tau$ experiences with substantially the same feature vector. A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
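
The rule of thumb in Equation 9.19 works out to $\alpha \doteq \big(\tau\, \mathbb{E}[\mathbf{x}^\intercal \mathbf{x}]\big)^{-1}$, where $\mathbf{x}$ is a feature vector drawn from the same distribution as the inputs. A minimal sketch that estimates the expectation from a batch of observed feature vectors; the binary, fixed-number-of-active-components features below are an assumed example (the rule works best when $\mathbf{x}^\intercal \mathbf{x}$ is roughly constant), not part of the book's statement.

```python
import numpy as np

def rule_of_thumb_alpha(X: np.ndarray, tau: float) -> float:
    """alpha = (tau * E[x^T x])^{-1}, estimating E[x^T x] by the mean
    squared norm of a batch of observed feature vectors (rows of X)."""
    mean_sq_norm = np.mean(np.sum(X * X, axis=1))
    return 1.0 / (tau * mean_sq_norm)

# Illustrative binary features with exactly 8 active components per vector,
# so x^T x = 8 for every x and alpha = 1 / (tau * 8).
rng = np.random.default_rng(3)
X = np.zeros((1000, 64))
for row in X:
    row[rng.choice(64, size=8, replace=False)] = 1.0
print(rule_of_thumb_alpha(X, tau=10))   # -> 0.0125
```
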
