Commit

[PUBLISHER] Merge #48
* PUSH NOTE : Thomas Kipf.md

* PUSH NOTE : Tim R. Davidson.md

* PUSH NOTE : Simeng Sun.md

* PUSH NOTE : Ilya Loshchilov.md

* PUSH NOTE : Cheng-Ping Hsieh.md

* PUSH NOTE : Boris Ginsburg.md

* PUSH NOTE : Hyperspherical Variational Auto-Encoders.md

* PUSH ATTACHMENT : Pasted image 20241010115957.png

* PUSH ATTACHMENT : Pasted image 20241010115603.png

* PUSH NOTE : nGPT - Normalized Transformer with Representation Learning on the Hypersphere.md

* PUSH ATTACHMENT : Pasted image 20241010085554.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 13.md
dgcnz authored Oct 10, 2024
1 parent 7822a03 commit 17ca25b
Showing 12 changed files with 91 additions and 1 deletion.
@@ -0,0 +1,24 @@
---
authors:
- "[[Tim R. Davidson|Tim R. Davidson]]"
- "[[Luca Falorsi|Luca Falorsi]]"
- "[[Nicola de Cao|Nicola de Cao]]"
- "[[Thomas Kipf|Thomas Kipf]]"
- "[[Jakub M. Tomczak|Jakub M. Tomczak]]"
year: 2018
tags:
- paper
- geometric_dl
url: https://arxiv.org/abs/1804.00891
share: true
---
> [!tldr] Abstract
> The Variational Auto-Encoder (VAE) is one of the most used unsupervised machine learning models. But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or $\mathcal{S}$-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, $\mathcal{N}$-VAE, in low dimensions on other data types. Code at [this http URL](http://github.com/nicola-decao/s-vae-tf) and [this https URL](https://github.com/nicola-decao/s-vae-pytorch)
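
For reference, a hedged sketch (my own, not the authors' released code) of the vMF log-density on $\mathcal{S}^{m-1}$ that the abstract proposes, $f(\mathbf{x}; \boldsymbol{\mu}, \kappa) = \mathcal{C}_m(\kappa)\exp(\kappa\, \boldsymbol{\mu}^\top \mathbf{x})$:

```python
# Hedged sketch: log-density of the von Mises-Fisher distribution on the
# unit sphere S^{m-1} in R^m, with normalizer
#   C_m(kappa) = kappa^{m/2-1} / ((2*pi)^{m/2} * I_{m/2-1}(kappa)).
# mu and x are assumed to be unit vectors; kappa > 0 is the concentration.
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: I_v(z) * exp(-z)

def vmf_log_pdf(x: np.ndarray, mu: np.ndarray, kappa: float) -> float:
    m = mu.shape[0]
    # log I_v(kappa) = log ive(v, kappa) + kappa (stable for large kappa)
    log_c = ((m / 2 - 1) * np.log(kappa)
             - (m / 2) * np.log(2 * np.pi)
             - (np.log(ive(m / 2 - 1, kappa)) + kappa))
    return float(log_c + kappa * mu @ x)

mu = np.array([0.0, 0.0, 1.0])
x = np.array([0.6, 0.0, 0.8])  # both unit vectors on S^2
print(vmf_log_pdf(x, mu, kappa=10.0))
```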

# Notes

- "However, even for m>20 we observe a vanishing surface problem (see Figure [6](https://ar5iv.labs.arxiv.org/html/1804.00891#A5.F6 "Figure 6 ‣ Appendix E COLLAPSE OF THE SURFACE AREA ‣ Hyperspherical Variational Auto-Encoders") in Appendix [E](https://ar5iv.labs.arxiv.org/html/1804.00891#A5 "Appendix E COLLAPSE OF THE SURFACE AREA ‣ Hyperspherical Variational Auto-Encoders")). This could thus lead to unstable behavior of hyperspherical models in high dimensions."
- Basically, the hypersphere's surface area starts collapsing in high dimensions ($m > 20$), which makes it an unsuitable choice, as embeddings on this manifold lose discriminative power. This is backed by the paper's results, where the $\mathcal{S}$-VAE outperforms the $\mathcal{N}$-VAE up to $d=40$ (see the surface-area sketch below).
- ![[Pasted image 20241010115957.png|Pasted image 20241010115957.png]]
![[Pasted image 20241010115603.png|Pasted image 20241010115603.png]]
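
To make the collapse concrete, a minimal sketch (mine, not the paper's code) of the surface area of the unit $(m-1)$-sphere, $A(m) = 2\pi^{m/2}/\Gamma(m/2)$, which peaks near $m=7$ and then vanishes:

```python
# Surface area of the unit (m-1)-sphere in R^m: A(m) = 2*pi^(m/2)/Gamma(m/2).
# It peaks around m = 7 and collapses towards 0 as m grows -- the
# "vanishing surface problem" the paper points to.
import math

def sphere_surface_area(m: int) -> float:
    return 2 * math.pi ** (m / 2) / math.gamma(m / 2)

for m in (2, 7, 20, 40, 100):
    print(f"m={m:3d}  A={sphere_surface_area(m):.3e}")
# m=7 gives ~3.3e+01, m=20 already ~5.2e-01, m=100 ~2.4e-38.
```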
@@ -0,0 +1,23 @@
---
authors:
- "[[Ilya Loshchilov|Ilya Loshchilov]]"
- "[[Cheng-Ping Hsieh|Cheng-Ping Hsieh]]"
- "[[Simeng Sun|Simeng Sun]]"
- "[[Boris Ginsburg|Boris Ginsburg]]"
year: 2024
tags:
- paper
- efficient_dl
- transformers
url: https://arxiv.org/abs/2410.01131
share: true
---
> [!tldr] Abstract
> We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
![[Pasted image 20241010085554.png|Pasted image 20241010085554.png]]

# Notes
- Interesting, since [[Hyperspherical Variational Auto-Encoders|Hyperspherical Variational Auto-Encoders]] claims that high-dimensional hyperspheres are not well suited for embeddings due to the vanishing surface problem. However, the nGPT paper claims that hypersphere embeddings are beneficial for training transformers. There's some discussion at [Twitter](https://x.com/maksym_andr/status/1843923528502129122).
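
A rough sketch of the core mechanic as I read the abstract (not the released nGPT code; `alpha` and all shapes are illustrative placeholders): hidden states and weight rows are kept at unit L2 norm, and each block contributes a displacement on the sphere:

```python
# Hedged sketch of nGPT's hypersphere constraint as described in the abstract:
# hidden states and weight-matrix rows are renormalized to unit L2 norm, and a
# layer moves the hidden state along a displacement towards the block output.
import torch
import torch.nn.functional as F

def normalize_rows(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    return w / (w.norm(dim=-1, keepdim=True) + eps)

h = F.normalize(torch.randn(4, 512), dim=-1)          # states on the sphere
block_out = F.normalize(torch.randn(4, 512), dim=-1)  # e.g. attention output
alpha = 0.1                                           # step size (learned in the paper)
h = F.normalize(h + alpha * (block_out - h), dim=-1)  # displacement + retraction
W = normalize_rows(torch.randn(512, 512))             # weights renormalized after updates
```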


5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Boris Ginsburg.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Cheng-Ping Hsieh.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Ilya Loshchilov.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Simeng Sun.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
6 changes: 6 additions & 0 deletions docs/100 Reference notes/102 Authors/Thomas Kipf.md
@@ -0,0 +1,6 @@
---
affiliation:
- "[[University of Amsterdam|University of Amsterdam]]"
- "[[Google DeepMind|Google DeepMind]]"
share: true
---
6 changes: 6 additions & 0 deletions docs/100 Reference notes/102 Authors/Tim R. Davidson.md
@@ -0,0 +1,6 @@
---
affiliation:
- "[[University of Amsterdam|University of Amsterdam]]"
- "[[EPFL|EPFL]]"
share: true
---
@@ -22,6 +22,12 @@ Notation:
- $\boldsymbol{\theta}$: policy parameters
- $\mathbf{w}$ : value function parameters

> [!FAQ]- How do we learn parametrized policies?
> - 0th-order: random search, grid search, heuristics
> - 1st-order: use the first-order derivative (gradient)
> - 2nd-order: use second-order statistics (Hessian, etc.)
> 
> [ref:slides]

> [!NOTE] Equation 13.1: Gradient ascent update of policy parameters
>
> $$
@@ -60,6 +66,9 @@ Notation:
> 3. Allows continuous action spaces. (ref:slides)
> 4. Allows for stochastic policies. (ref:slides)

> [!FAQ]- How do we inject inductive biases into the policy parametrization?
> - Policy form (e.g., Gaussian)
> - Initialization

## 13.2 The Policy Gradient Theorem

> [!FAQ]- How does continuous policy parametrization help convergence? Compare it to VBMs.
@@ -69,7 +78,7 @@ Notation:
> [!NOTE] Equation 13.4: Performance $J(\boldsymbol{\theta})$ for the episodic case
>
> $$
> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) \tag{13.4}
> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}[G_0] \tag{13.4}
> $$

> [!FAQ]- What problem/question does the Policy Gradient Theorem answer?
@@ -204,3 +213,5 @@ Note: not part of the course readings, missing remaining notes for this subsection
> TLDR: parametrize the policy by distribution statistics, for example, the mean and variance of a Gaussian (see the sketch below).
*todo: add notes*
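
A minimal sketch of this Gaussian parametrization, following the book's linear form $\mu = \boldsymbol{\theta}_\mu^\top \mathbf{x}(s)$, $\sigma = \exp(\boldsymbol{\theta}_\sigma^\top \mathbf{x}(s))$; the feature vector and parameter names are illustrative assumptions, not the book's pseudocode:

```python
# Hedged sketch of the Section 13.7 Gaussian policy: the policy is a normal
# distribution over actions whose mean is linear, and stddev log-linear,
# in the state features x(s).
import numpy as np

def gaussian_policy_sample(x_s: np.ndarray,
                           theta_mu: np.ndarray,
                           theta_sigma: np.ndarray,
                           rng: np.random.Generator) -> float:
    mu = theta_mu @ x_s                  # mean: linear in features
    sigma = np.exp(theta_sigma @ x_s)    # exp keeps the stddev positive
    return float(rng.normal(mu, sigma))  # a ~ N(mu, sigma^2)

# usage: 4 state features, randomly initialized parameters
rng = np.random.default_rng(0)
x_s = np.array([1.0, 0.5, -0.2, 0.0])
a = gaussian_policy_sample(x_s, rng.normal(size=4), rng.normal(size=4), rng)
```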

$d \tau$
Binary file added docs/images/Pasted image 20241010085554.png
Binary file added docs/images/Pasted image 20241010115603.png
Binary file added docs/images/Pasted image 20241010115957.png
