diff --git a/docs/100 Reference notes/101 Literature/Hyperspherical Variational Auto-Encoders.md b/docs/100 Reference notes/101 Literature/Hyperspherical Variational Auto-Encoders.md new file mode 100644 index 00000000..b9a54d39 --- /dev/null +++ b/docs/100 Reference notes/101 Literature/Hyperspherical Variational Auto-Encoders.md @@ -0,0 +1,24 @@ +--- +authors: + - "[[Tim R. Davidson|Tim R. Davidson]]" + - "[[Luca Falorsi|Luca Falorsi]]" + - "[[Nicola de Cao|Nicola de Cao]]" + - "[[Thomas Kipf|Thomas Kipf]]" + - "[[Jakub M. Tomczak|Jakub M. Tomczak]]" +year: 2018 +tags: + - paper + - geometric_dl +url: https://arxiv.org/abs/1804.00891 +share: true +--- +> [!tldr] Abstract +> The Variational Auto-Encoder (VAE) is one of the most used unsupervised machine learning models. But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or S-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, N-VAE, in low dimensions on other data types. Code at [this http URL](http://github.com/nicola-decao/s-vae-tf) and [this https URL](https://github.com/nicola-decao/s-vae-pytorch) + + +# Notes + +- "However, even for m>20 we observe a vanishing surface problem (see Figure [6](https://ar5iv.labs.arxiv.org/html/1804.00891#A5.F6 "Figure 6 ‣ Appendix E COLLAPSE OF THE SURFACE AREA ‣ Hyperspherical Variational Auto-Encoders") in Appendix [E](https://ar5iv.labs.arxiv.org/html/1804.00891#A5 "Appendix E COLLAPSE OF THE SURFACE AREA ‣ Hyperspherical Variational Auto-Encoders")). This could thus lead to unstable behavior of hyperspherical models in high dimensions." + - Basically, the hypersphere's surface area starts collapsing in high dimensions (m>20), which makes it an unsuitable choice, as embeddings on this manifold lose discriminative power. This is backed by the paper's results, where the S-VAE outperforms the N-VAE only up to d=40. + - ![[Pasted image 20241010115957.png|Pasted image 20241010115957.png]] +![[Pasted image 20241010115603.png|Pasted image 20241010115603.png]] \ No newline at end of file diff --git a/docs/100 Reference notes/101 Literature/nGPT - Normalized Transformer with Representation Learning on the Hypersphere.md b/docs/100 Reference notes/101 Literature/nGPT - Normalized Transformer with Representation Learning on the Hypersphere.md new file mode 100644 index 00000000..5d5e124c --- /dev/null +++ b/docs/100 Reference notes/101 Literature/nGPT - Normalized Transformer with Representation Learning on the Hypersphere.md @@ -0,0 +1,23 @@ +--- +authors: + - "[[Ilya Loshchilov|Ilya Loshchilov]]" + - "[[Cheng-Ping Hsieh|Cheng-Ping Hsieh]]" + - "[[Simeng Sun|Simeng Sun]]" + - "[[Boris Ginsburg|Boris Ginsburg]]" +year: 2024 +tags: + - paper + - efficient_dl + - transformers +url: https://arxiv.org/abs/2410.01131 +share: true +--- +> [!tldr] Abstract +> We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized.
The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length. + +![[Pasted image 20241010085554.png|Pasted image 20241010085554.png]] + +# Notes +- Interesting, since [[Hyperspherical Variational Auto-Encoders|Hyperspherical Variational Auto-Encoders]] claims that high-dimensional hyperspheres are not well suited for embeddings due to the vanishing surface problem. However, the nGPT paper claims that hypersphere embeddings are beneficial for training transformers. There's some discussion at [Twitter](https://x.com/maksym_andr/status/1843923528502129122). + + diff --git a/docs/100 Reference notes/102 Authors/Boris Ginsburg.md b/docs/100 Reference notes/102 Authors/Boris Ginsburg.md new file mode 100644 index 00000000..81589695 --- /dev/null +++ b/docs/100 Reference notes/102 Authors/Boris Ginsburg.md @@ -0,0 +1,5 @@ +--- +affiliation: + - "[[NVIDIA|NVIDIA]]" +share: true +--- diff --git a/docs/100 Reference notes/102 Authors/Cheng-Ping Hsieh.md b/docs/100 Reference notes/102 Authors/Cheng-Ping Hsieh.md new file mode 100644 index 00000000..81589695 --- /dev/null +++ b/docs/100 Reference notes/102 Authors/Cheng-Ping Hsieh.md @@ -0,0 +1,5 @@ +--- +affiliation: + - "[[NVIDIA|NVIDIA]]" +share: true +--- diff --git a/docs/100 Reference notes/102 Authors/Ilya Loshchilov.md b/docs/100 Reference notes/102 Authors/Ilya Loshchilov.md new file mode 100644 index 00000000..81589695 --- /dev/null +++ b/docs/100 Reference notes/102 Authors/Ilya Loshchilov.md @@ -0,0 +1,5 @@ +--- +affiliation: + - "[[NVIDIA|NVIDIA]]" +share: true +--- diff --git a/docs/100 Reference notes/102 Authors/Simeng Sun.md b/docs/100 Reference notes/102 Authors/Simeng Sun.md new file mode 100644 index 00000000..81589695 --- /dev/null +++ b/docs/100 Reference notes/102 Authors/Simeng Sun.md @@ -0,0 +1,5 @@ +--- +affiliation: + - "[[NVIDIA|NVIDIA]]" +share: true +--- diff --git a/docs/100 Reference notes/102 Authors/Thomas Kipf.md b/docs/100 Reference notes/102 Authors/Thomas Kipf.md new file mode 100644 index 00000000..2b84e68f --- /dev/null +++ b/docs/100 Reference notes/102 Authors/Thomas Kipf.md @@ -0,0 +1,6 @@ +--- +affiliation: + - "[[University of Amsterdam|University of Amsterdam]]" + - "[[Google DeepMind|Google DeepMind]]" +share: true +--- diff --git a/docs/100 Reference notes/102 Authors/Tim R. Davidson.md b/docs/100 Reference notes/102 Authors/Tim R. Davidson.md new file mode 100644 index 00000000..96bde7cf --- /dev/null +++ b/docs/100 Reference notes/102 Authors/Tim R. 
Davidson.md @@ -0,0 +1,6 @@ +--- +affiliation: + - "[[University of Amsterdam|University of Amsterdam]]" + - "[[EPFL|EPFL]]" +share: true +--- diff --git a/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 13.md b/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 13.md index e4bcb5fd..d94e9666 100644 --- a/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 13.md +++ b/docs/100 Reference notes/104 Other/Reinforcement Learning - An Introduction - Chapter 13.md @@ -22,6 +22,12 @@ Notation: - $\boldsymbol{\theta}$: policy parameters - $\mathbf{w}$ : value function parameters +> [!FAQ]- How do we learn parametrized policies? +> - 0-order: random search, grid search, heuristics +> - 1-order: use the first-order derivative (gradient) +> - 2-order: use second-order statistics (Hessian, etc.) +> [ref:slides] + > [!NOTE] Equation 13.1: Gradient ascent update of policy parameters > > $$ @@ -60,6 +66,9 @@ Notation: > 3. Allows continuous action spaces. (ref:slides) > 4. Allows for stochastic policies. (ref:slides) +> [!FAQ]- How do we inject inductive biases into the policy parametrization? +> - Policy form (e.g., Gaussian) +> - Initialization ## 13.2 The Policy Gradient Theorem > [!FAQ]- How does continuous policy parametrization help convergence? Compare it to VBMs. @@ -69,7 +78,7 @@ > [!NOTE] Equation 13.4: Performance $J(\boldsymbol{\theta})$ for the episodic case > > $$ -> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) \tag{13.4} +> J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}[G_0] \tag{13.4} > $$ > [!FAQ]- What problem/question does the Policy Gradient Theorem answer? @@ -204,3 +213,5 @@ Note: not part of the course readings, missing remaining notes for this subsecti > TLDR: parametrize policy by a distribution statistics, for example, mean and variance of gaussian. *todo: add notes* + +$d \tau$ diff --git a/docs/images/Pasted image 20241010085554.png b/docs/images/Pasted image 20241010085554.png new file mode 100644 index 00000000..5a010a99 Binary files /dev/null and b/docs/images/Pasted image 20241010085554.png differ diff --git a/docs/images/Pasted image 20241010115603.png b/docs/images/Pasted image 20241010115603.png new file mode 100644 index 00000000..769b7f95 Binary files /dev/null and b/docs/images/Pasted image 20241010115603.png differ diff --git a/docs/images/Pasted image 20241010115957.png b/docs/images/Pasted image 20241010115957.png new file mode 100644 index 00000000..6b383e55 Binary files /dev/null and b/docs/images/Pasted image 20241010115957.png differ
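A quick numeric illustration of the "vanishing surface problem" quoted in the Hyperspherical VAE note above. This is an editor's sketch, not code from either paper; the function name `unit_sphere_surface_area` is made up. The surface area of the unit m-sphere is $S_m = 2\pi^{(m+1)/2} / \Gamma\left(\tfrac{m+1}{2}\right)$, which peaks in single-digit dimensions and then decays towards zero; whatever dimension convention the paper uses for m, this is the collapse behind the m>20 instability it describes.

```python
import math

def unit_sphere_surface_area(m: int) -> float:
    """Surface area of the unit m-sphere S^m embedded in R^(m+1)."""
    # S_m = 2 * pi^((m+1)/2) / Gamma((m+1)/2)
    return 2.0 * math.pi ** ((m + 1) / 2) / math.gamma((m + 1) / 2)

# The area peaks around m = 6 (~33.1), drops below 1 before m = 20,
# and keeps shrinking by many orders of magnitude afterwards.
for m in (1, 2, 6, 10, 20, 40, 80):
    print(f"m = {m:3d}   area = {unit_sphere_surface_area(m):.3e}")
```

This only shows the absolute surface area shrinking; Appendix E of the paper gives the actual argument, and the nGPT note above is the counterpoint worth keeping in mind.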