Commit

[PUBLISHER] Merge #48
* PUSH NOTE : Thomas Kipf.md

* PUSH NOTE : Tim R. Davidson.md

* PUSH NOTE : Simeng Sun.md

* PUSH NOTE : Ilya Loshchilov.md

* PUSH NOTE : Cheng-Ping Hsieh.md

* PUSH NOTE : Boris Ginsburg.md

* PUSH NOTE : Hyperspherical Variational Auto-Encoders.md

* PUSH ATTACHMENT : Pasted image 20241010115957.png

* PUSH ATTACHMENT : Pasted image 20241010115603.png

* PUSH NOTE : nGPT - Normalized Transformer with Representation Learning on the Hypersphere.md

* PUSH ATTACHMENT : Pasted image 20241010085554.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 13.md
dgcnz authored Oct 10, 2024
1 parent 7822a03 commit 17ca25b
Showing 12 changed files with 91 additions and 1 deletion.
@@ -0,0 +1,24 @@
---
authors:
- "[[Tim R. Davidson|Tim R. Davidson]]"
- "[[Luca Falorsi|Luca Falorsi]]"
- "[[Nicola de Cao|Nicola de Cao]]"
- "[[Thomas Kipf|Thomas Kipf]]"
- "[[Jakub M. Tomczak|Jakub M. Tomczak]]"
year: 2018
tags:
- paper
- geometric_dl
url: https://arxiv.org/abs/1804.00891
share: true
---
> [!tldr] Abstract
> The Variational Auto-Encoder (VAE) is one of the most used unsupervised machine learning models. But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or $\mathcal{S}$-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, $\mathcal{N}$-VAE, in low dimensions on other data types. Code at [this http URL](http://github.com/nicola-decao/s-vae-tf) and [this https URL](https://github.com/nicola-decao/s-vae-pytorch)
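
For reference, a hedged sketch (my own, not the authors' released code) of the vMF log-density on $\mathcal{S}^{m-1}$ that the abstract proposes, $f(\mathbf{x}; \boldsymbol{\mu}, \kappa) = \mathcal{C}_m(\kappa)\exp(\kappa\, \boldsymbol{\mu}^\top \mathbf{x})$:

```python
# Hedged sketch: log-density of the von Mises-Fisher distribution on the
# unit sphere S^{m-1} in R^m, with normalizer
#   C_m(kappa) = kappa^{m/2-1} / ((2*pi)^{m/2} * I_{m/2-1}(kappa)).
# mu and x are assumed to be unit vectors; kappa > 0 is the concentration.
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: I_v(z) * exp(-z)

def vmf_log_pdf(x: np.ndarray, mu: np.ndarray, kappa: float) -> float:
    m = mu.shape[0]
    # log I_v(kappa) = log ive(v, kappa) + kappa (stable for large kappa)
    log_c = ((m / 2 - 1) * np.log(kappa)
             - (m / 2) * np.log(2 * np.pi)
             - (np.log(ive(m / 2 - 1, kappa)) + kappa))
    return float(log_c + kappa * mu @ x)

mu = np.array([0.0, 0.0, 1.0])
x = np.array([0.6, 0.0, 0.8])  # both unit vectors on S^2
print(vmf_log_pdf(x, mu, kappa=10.0))
```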

# Notes

- "However, even for m>20 we observe a vanishing surface problem (see Figure [6](https://ar5iv.labs.arxiv.org/html/1804.00891#A5.F6 "Figure 6 ‣ Appendix E COLLAPSE OF THE SURFACE AREA ‣ Hyperspherical Variational Auto-Encoders") in Appendix [E](https://ar5iv.labs.arxiv.org/html/1804.00891#A5 "Appendix E COLLAPSE OF THE SURFACE AREA ‣ Hyperspherical Variational Auto-Encoders")). This could thus lead to unstable behavior of hyperspherical models in high dimensions."
- Basically, the hypersphere's surface area starts collapsing in high dimensions ($m > 20$), which makes it an unsuitable choice, as embeddings on this manifold lose discriminative power. This is backed by the paper's results, where the $\mathcal{S}$-VAE outperforms the $\mathcal{N}$-VAE up to $d=40$ (see the surface-area sketch below).
- ![[Pasted image 20241010115957.png|Pasted image 20241010115957.png]]
![[Pasted image 20241010115603.png|Pasted image 20241010115603.png]]
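
To make the collapse concrete, a minimal sketch (mine, not the paper's code) of the surface area of the unit $(m-1)$-sphere, $A(m) = 2\pi^{m/2}/\Gamma(m/2)$, which peaks near $m=7$ and then vanishes:

```python
# Surface area of the unit (m-1)-sphere in R^m: A(m) = 2*pi^(m/2)/Gamma(m/2).
# It peaks around m = 7 and collapses towards 0 as m grows -- the
# "vanishing surface problem" the paper points to.
import math

def sphere_surface_area(m: int) -> float:
    return 2 * math.pi ** (m / 2) / math.gamma(m / 2)

for m in (2, 7, 20, 40, 100):
    print(f"m={m:3d}  A={sphere_surface_area(m):.3e}")
# m=7 gives ~3.3e+01, m=20 already ~5.2e-01, m=100 ~2.4e-38.
```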
@@ -0,0 +1,23 @@
---
authors:
- "[[Ilya Loshchilov|Ilya Loshchilov]]"
- "[[Cheng-Ping Hsieh|Cheng-Ping Hsieh]]"
- "[[Simeng Sun|Simeng Sun]]"
- "[[Boris Ginsburg|Boris Ginsburg]]"
year: 2024
tags:
- paper
- efficient_dl
- transformers
url: https://arxiv.org/abs/2410.01131
share: true
---
> [!tldr] Abstract
> We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
![[Pasted image 20241010085554.png|Pasted image 20241010085554.png]]

# Notes
- Interesting, since [[Hyperspherical Variational Auto-Encoders|Hyperspherical Variational Auto-Encoders]] claims that high-dimensional hyperspheres are not well suited for embeddings due to the vanishing surface problem. However, the nGPT paper claims that hypersphere embeddings are beneficial for training transformers. There's some discussion at [Twitter](https://x.com/maksym_andr/status/1843923528502129122).
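
A rough sketch of the core mechanic as I read the abstract (not the released nGPT code; `alpha` and all shapes are illustrative placeholders): hidden states and weight rows are kept at unit L2 norm, and each block contributes a displacement on the sphere:

```python
# Hedged sketch of nGPT's hypersphere constraint as described in the abstract:
# hidden states and weight-matrix rows are renormalized to unit L2 norm, and a
# layer moves the hidden state along a displacement towards the block output.
import torch
import torch.nn.functional as F

def normalize_rows(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    return w / (w.norm(dim=-1, keepdim=True) + eps)

h = F.normalize(torch.randn(4, 512), dim=-1)          # states on the sphere
block_out = F.normalize(torch.randn(4, 512), dim=-1)  # e.g. attention output
alpha = 0.1                                           # step size (learned in the paper)
h = F.normalize(h + alpha * (block_out - h), dim=-1)  # displacement + retraction
W = normalize_rows(torch.randn(512, 512))             # weights renormalized after updates
```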


5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Boris Ginsburg.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Cheng-Ping Hsieh.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Ilya Loshchilov.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
5 changes: 5 additions & 0 deletions docs/100 Reference notes/102 Authors/Simeng Sun.md
@@ -0,0 +1,5 @@
---
affiliation:
- "[[NVIDIA|NVIDIA]]"
share: true
---
6 changes: 6 additions & 0 deletions docs/100 Reference notes/102 Authors/Thomas Kipf.md
@@ -0,0 +1,6 @@
---
affiliation:
- "[[University of Amsterdam|University of Amsterdam]]"
- "[[Google DeepMind|Google DeepMind]]"
share: true
---
6 changes: 6 additions & 0 deletions docs/100 Reference notes/102 Authors/Tim R. Davidson.md
@@ -0,0 +1,6 @@
---
affiliation:
- "[[University of Amsterdam|University of Amsterdam]]"
- "[[EPFL|EPFL]]"
share: true
---
@@ -22,6 +22,12 @@ Notation:
- $\boldsymbol{\theta}$: policy parameters
- $\mathbf{w}$ : value function parameters

> [!FAQ]- How do we learn parametrized policies?
> - 0th-order: random search, grid search, heuristics
> - 1st-order: use the first-order derivative (gradient)
> - 2nd-order: use second-order statistics (Hessian, etc.)
> 
> [ref:slides]

> [!NOTE] Equation 13.1: Gradient ascent update of policy parameters
>
> $$
@@ -60,6 +66,9 @@ Notation:
> 3. Allows continuous action spaces. (ref:slides)
> 4. Allows for stochastic policies. (ref:slides)

> [!FAQ]- How do we inject inductive biases into the policy parametrization?
> - Policy form (e.g., Gaussian)
> - Initialization

## 13.2 The Policy Gradient Theorem

> [!FAQ]- How does continuous policy parametrization help convergence? Compare it to VBMs.
@@ -69,7 +78,7 @@ Notation:
> [!NOTE] Equation 13.4: Performance $J(\boldsymbol{\theta})$ for the episodic case
>
> $$
> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) \tag{13.4}
> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}[G_0] \tag{13.4}
> $$

> [!FAQ]- What problem/question does the Policy Gradient Theorem answer?
@@ -204,3 +213,5 @@ Note: not part of the course readings, missing remaining notes for this subsection
> TLDR: parametrize the policy by distribution statistics, for example, the mean and variance of a Gaussian (see the sketch below).
*todo: add notes*
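
A minimal sketch of this Gaussian parametrization, following the book's linear form $\mu = \boldsymbol{\theta}_\mu^\top \mathbf{x}(s)$, $\sigma = \exp(\boldsymbol{\theta}_\sigma^\top \mathbf{x}(s))$; the feature vector and parameter names are illustrative assumptions, not the book's pseudocode:

```python
# Hedged sketch of the Section 13.7 Gaussian policy: the policy is a normal
# distribution over actions whose mean is linear, and stddev log-linear,
# in the state features x(s).
import numpy as np

def gaussian_policy_sample(x_s: np.ndarray,
                           theta_mu: np.ndarray,
                           theta_sigma: np.ndarray,
                           rng: np.random.Generator) -> float:
    mu = theta_mu @ x_s                  # mean: linear in features
    sigma = np.exp(theta_sigma @ x_s)    # exp keeps the stddev positive
    return float(rng.normal(mu, sigma))  # a ~ N(mu, sigma^2)

# usage: 4 state features, randomly initialized parameters
rng = np.random.default_rng(0)
x_s = np.array([1.0, 0.5, -0.2, 0.0])
a = gaussian_policy_sample(x_s, rng.normal(size=4), rng.normal(size=4), rng)
```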

$d \tau$
Binary file added docs/images/Pasted image 20241010085554.png
Binary file added docs/images/Pasted image 20241010115603.png
Binary file added docs/images/Pasted image 20241010115957.png
