[PUBLISHER] Merge #51
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 8.md

* PUSH ATTACHMENT : Pasted image 20241021182053.png

* PUSH NOTE : Masked Autoencoders Are Scalable Vision Learners.md

* PUSH ATTACHMENT : Pasted image 20250203104532.png

* PUSH NOTE : Unsupervised Visual Representation Learning by Context Prediction.md

* PUSH ATTACHMENT : Pasted image 20250113171820.png

* PUSH ATTACHMENT : Pasted image 20250113145443.png

* PUSH NOTE : Unsupervised Learning of Visual Features by Contrasting Cluster Assignments.md

* PUSH NOTE : Three things everyone should know about Vision Transformers.md

* PUSH ATTACHMENT : Pasted image 20250113160631.png

* PUSH NOTE : Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.md

* PUSH NOTE : Round and Round We Go! What makes Rotary Positional Encodings useful?.md

* PUSH NOTE : Rotary Position Embedding for Vision Transformer.md

* PUSH NOTE : Position Prediction as an Effective Pretraining Strategy.md

* PUSH NOTE : LoRA vs Full Fine-tuning - An Illusion of Equivalence.md

* PUSH NOTE : Location-Aware Self-Supervised Transformers for Semantic Segmentation.md

* PUSH NOTE : LieRE - Generalizing Rotary Position Encodings.md

* PUSH NOTE : Improving Self Consistency in LLMs through Probabilistic Tokenization.md

* PUSH NOTE : Fixing the train-test resolution discrepancy.md

* PUSH ATTACHMENT : Pasted image 20250113142231.png

* PUSH NOTE : How JEPA Avoids Noisy Features - The Implicit Bias of DeepLinear Self Distillation Networks.md

* PUSH NOTE : HoPE - A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation.md

* PUSH NOTE : DeiT III - Revenge of the ViT.md

* PUSH NOTE : DropPos - Pre-Training Vision Transformers by Reconstructing Dropped Positions.md

* PUSH ATTACHMENT : Pasted image 20250114202919.png

* PUSH NOTE : Curiosity-driven Exploration by Self-supervised Prediction.md

* PUSH NOTE : Piotr Dollár.md

* PUSH NOTE : Saining Xie.md

* PUSH NOTE : Wei Lu.md

* PUSH NOTE : Song Park.md

* PUSH NOTE : Petar Veličković.md

* PUSH NOTE : Lucas Beyer.md

* PUSH NOTE : Federico Barbero.md

* PUSH NOTE : Christos Perivolaropoulos.md

* PUSH NOTE : Alex Vitvitskyi.md
dgcnz authored Feb 8, 2025
1 parent c869a2f commit b1ff5aa
Showing 35 changed files with 461 additions and 0 deletions.
@@ -0,0 +1,40 @@
---
authors:
- "[[Deepak Pathak|Deepak Pathak]]"
- "[[Pulkit Agrawal|Pulkit Agrawal]]"
- "[[Alexei A. Efros|Alexei A. Efros]]"
- "[[Trevor Darrell|Trevor Darrell]]"
year: 2017
tags:
- paper
- rl
url: https://arxiv.org/abs/1705.05363
share: true
---
> [!tldr] Abstract
>
> In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch. Demo video and code available at [this https URL](https://pathak22.github.io/noreward-rl/)
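
As a reading aid, the formulation above (forward-model prediction error in a feature space shaped by a self-supervised inverse-dynamics model, used as an intrinsic reward on top of the extrinsic one) can be sketched roughly as follows. This is a minimal PyTorch sketch with illustrative layer sizes and names, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICMSketch(nn.Module):
    """Minimal sketch of an Intrinsic Curiosity Module (illustrative sizes)."""

    def __init__(self, obs_dim: int = 64, feat_dim: int = 32, n_actions: int = 4):
        super().__init__()
        # phi: maps observations to features (the paper uses a conv net on pixels)
        self.phi = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # inverse model: predicts a_t from (phi(s_t), phi(s_{t+1})); shapes the feature space
        self.inverse = nn.Linear(2 * feat_dim, n_actions)
        # forward model: predicts phi(s_{t+1}) from (phi(s_t), a_t)
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)

    def forward(self, s_t, s_tp1, a_t):
        f_t, f_tp1 = self.phi(s_t), self.phi(s_tp1)
        a_onehot = F.one_hot(a_t, self.inverse.out_features).float()
        # inverse-dynamics loss: keeps features focused on what the agent can influence
        inv_loss = F.cross_entropy(self.inverse(torch.cat([f_t, f_tp1], -1)), a_t)
        # forward-model error in feature space = curiosity (intrinsic) reward
        f_pred = self.forward_model(torch.cat([f_t, a_onehot], -1))
        fwd_err = 0.5 * (f_pred - f_tp1.detach()).pow(2).sum(-1)   # target detached here (one common choice)
        return fwd_err.detach(), inv_loss, fwd_err.mean()          # reward, plus the two ICM losses
```

The A3C agent is then trained on the sum of this intrinsic reward and any extrinsic reward, while the ICM losses train the feature space and the forward model.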

**(a) Which and how many environment(s) are the techniques tested on?**
1. VizDoom: 2 maps, 1 for training and 1 for testing. The maps differ in both layout and textures.
2. Super Mario Bros: 4 levels (1 for pre-training and 3 for testing). Levels may differ in lighting (day/night), etc.

**(b) What methods is the proposed technique compared to?**
- In all experiments, the technique is ablated as follows:
- A3C (baseline)
- A3C + ICM
- A3C + ICM (pixels)
- In experiment 3.1, ICM is compared to TRPO + VIME.

**(c) How are hyperparameters set? Does this result in a fair comparison?**
- For the comparison between ICM and TRPO (+ VIME), the hyperparameters for TRPO and TRPO + VIME are taken from a concurrent work that tests on the same environment. Unless we expect the other work to have chosen suboptimal hyperparameters in comparison to the current work, it doesn't strike me as unfair.
- For the ICM ablation, no mention of the hyperparameters is made.

**(d) Which quantities are measured?**
- For the comparison between ICM and TRPO (+ VIME), mean and median score (at convergence) are used. They state ICM is also better in terms of convergence rate but don't provide any quantitative measure because both algorithms have "different setups".
- For the ICM ablation, the extrinsic reward per episode is measured at each training iteration. This metric essentially reflects the agent's average success rate in reaching the goal.

**(e) Is it clear which experimental procedure was followed? (e.g. number of runs, ...)**
**(f) Is it clear what the spread is between such runs? How is this spread reported?**
**(g) Are the results presented in a clear and interpretable way?**
@@ -0,0 +1,27 @@
---
authors:
- "[[Hugo Touvron|Hugo Touvron]]"
- "[[Matthieu Cord|Matthieu Cord]]"
- "[[Hervé Jegou|Hervé Jegou]]"
year: 2022
tags:
- paper
- vit
url: https://arxiv.org/abs/2204.07118
share: true
---
> [!tldr] Abstract
> A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BerT-like pre-training like BeiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on Image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.
## Notes

"""
The main ingredients are as follows:
- We build upon the work of Wightman et al. [57] introduced for ResNet50. In particular we adopt a binary cross entropy loss for Imagenet1k only training. We adapt this method by including ingredients that significantly improve the training of large ViT [51], namely **stochastic depth** [24] and **LayerScale** [51].
- **3-Augment**: is a simple data augmentation inspired by that employed for self-supervised learning. Surprisingly, with ViT we observe that it works better than the usual automatic/learned data-augmentation employed to train vision transformers like RandAugment [6].
- **Simple Random Cropping** is more effective than Random Resize Cropping when pre-training on a larger set like ImageNet-21k.
- **A lower resolution at training time.** This choice reduces the train-test discrepancy [53] but has not been much exploited with ViT. We observe that it also has a regularizing effect for the largest models by preventing overfitting. For instance, for a target resolution of 224 × 224, a ViT-H pre-trained at resolution 126 × 126 (81 tokens) achieves a better performance on ImageNet-1k than when pre-training at resolution 224 × 224 (256 tokens). This is also less demanding at pre-training time, as there are 70% fewer tokens. From this perspective it offers similar scaling properties as mask-autoencoders [19].
"""



@@ -0,0 +1,25 @@
---
authors:
- "[[Haochen Wang|Haochen Wang]]"
- "[[Junsong Fan|Junsong Fan]]"
- "[[Yuxi Wang|Yuxi Wang]]"
- "[[Kaiyou Song|Kaiyou Song]]"
- "[[Tong Wang|Tong Wang]]"
- "[[Zhaoxiang Zhang|Zhaoxiang Zhang]]"
year: 2023
tags:
- paper
- foundation_models
- vit
url: https://arxiv.org/abs/2309.03576
share: true
---
> [!tldr] Abstract
> As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at [this https URL](https://github.com/Haochen-Wang409/DropPos).

# Notes

![[Pasted image 20250114202919.png|Pasted image 20250114202919.png]]

Maybe expected, but fixed position embeddings achieve better performance than learnable position embeddings.
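
A minimal sketch of the pretext objective as described in the abstract: each patch whose positional embedding was dropped is classified into one of the N possible positions. The paper's visible-patch masking, position smoothing, and attentive reconstruction are omitted here.

```python
import torch
import torch.nn.functional as F

def droppos_loss(encoder_out, head, dropped_mask):
    """Sketch of the DropPos objective.
    encoder_out:  [B, N, D] ViT outputs for N patches whose positional
                  embeddings were (partly) dropped before encoding.
    head:         nn.Linear(D, N) classifying each patch's position index.
    dropped_mask: [B, N] bool, True where the positional embedding was dropped."""
    B, N, _ = encoder_out.shape
    logits = head(encoder_out)                                   # [B, N, N]
    targets = torch.arange(N, device=encoder_out.device).expand(B, N)
    # cross-entropy only on patches whose position must be recovered from appearance
    return F.cross_entropy(logits[dropped_mask], targets[dropped_mask])
```

Here `head` would be a plain `nn.Linear(embed_dim, num_patches)` built by the caller.
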
@@ -0,0 +1,26 @@
---
authors:
- "[[Hugo Touvron|Hugo Touvron]]"
- "[[Andrea Vedaldi|Andrea Vedaldi]]"
- "[[Matthijs Douze|Matthijs Douze]]"
- "[[Hervé Jegou|Hervé Jegou]]"
year: 2019
tags:
- paper
- vit
url: https://arxiv.org/abs/1906.06423
share: true
---
> [!tldr] Abstract
> Data-augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the typical size of the objects seen by the classifier at train and test time. We experimentally validate that, for a target test resolution, using a lower train resolution offers better classification at test time.
> We then propose a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ. It involves only a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128x128 images, and 79.8% with one trained on 224x224 images. In addition, if we use extra training data we get 82.5% with the ResNet-50 trained with 224x224 images.
> Conversely, when training a ResNeXt-101 32x48d pre-trained in weakly-supervised fashion on 940 million public images at resolution 224x224 and further optimizing for test resolution 320x320, we obtain a test top-1 accuracy of 86.4% (top-5: 98.0%) (single-crop). To the best of our knowledge this is the highest ImageNet single-crop, top-1 and top-5 accuracy to date.
## Notes

train: RandomCrop + Resize
test: CenterCrop

This mismatch generates object-size discrepancies between train and test images (a sketch of both pipelines follows the figure).

![[Pasted image 20250113142231.png|Pasted image 20250113142231.png]]
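
The two pipelines the note refers to, written out with torchvision (the sizes are the usual ImageNet defaults, not necessarily the exact values used in the paper):

```python
from torchvision import transforms

# Train: sample a random region and resize it to the training resolution.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Test: resize then center-crop. The apparent object size therefore differs
# from training, which is the discrepancy the paper measures and then fixes
# by cheaply fine-tuning at the test resolution.
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```
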
@@ -0,0 +1,19 @@
---
authors:
- "[[Yuhan Chen|Yuhan Chen]]"
- "[[Ang Lv|Ang Lv]]"
- "[[Jian Luan|Jian Luan]]"
- "[[Bin Wang|Bin Wang]]"
- "[[Wei Lu|Wei Lu]]"
year: 2024
tags:
- paper
- dl_theory
- transformers
- posembed
url: https://arxiv.org/abs/2410.21216
share: true
---
> [!tldr] Abstract
> Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and found that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These enhance the model's context awareness and extrapolation, as validated by extensive experiments.
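
A loose illustration of the idea stated in the abstract: keep only the high-frequency rotary components and make the remaining ones position-independent. The split point and the "zero rotation" treatment of the low-frequency pairs are simplifications of mine, not the paper's construction.

```python
import torch

def hope_like_rotate(x, positions, base=10000.0, keep_frac=0.5):
    """Illustrative only. x: [B, T, D] with D even, positions: [T] token indices.
    High-frequency channel pairs get the usual rotary rotation; the rest are
    given zero frequency, i.e. left position-independent."""
    B, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # ordered high -> low frequency
    n_keep = int(keep_frac * half)
    freqs[n_keep:] = 0.0               # zero frequency == no positional rotation for low-freq pairs
    angles = positions[:, None].float() * freqs[None, :]                # [T, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```
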
@@ -0,0 +1,19 @@
---
authors:
- "[[Etai Littwin|Etai Littwin]]"
- "[[Omid Saremi|Omid Saremi]]"
- "[[Madhu Advani|Madhu Advani]]"
- "[[Vimal Thilak|Vimal Thilak]]"
- "[[Preetum Nakkiran|Preetum Nakkiran]]"
- "[[Chen Huang|Chen Huang]]"
- "[[Joshua Susskind|Joshua Susskind]]"
year: 2024
tags:
- paper
- dl_theory
url: https://arxiv.org/abs/2407.03475
share: true
---
> [!tldr] Abstract
> Two competing paradigms exist for self-supervised learning of data representations. Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other. A recent successful approach that falls under the JEPA framework is self-distillation, where an online encoder is trained to predict the output of the target encoder, sometimes using a lightweight predictor network. This is contrasted with the Masked AutoEncoder (MAE) paradigm, where an encoder and decoder are trained to reconstruct missing parts of the input in the data space, rather than its latent representation. A common motivation for using the JEPA approach over MAE is that the JEPA objective prioritizes abstract features over fine-grained pixel information (which can be unpredictable and uninformative). In this work, we seek to understand the mechanism behind this empirical observation by analyzing the training dynamics of deep linear models. We uncover a surprising mechanism: in a simplified linear setting where both approaches learn similar representations, JEPAs are biased to learn high-influence features, i.e., features characterized by having high regression coefficients. Our results point to a distinct implicit bias of predicting in latent space that may shed light on its success in practice.
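
To make the linear self-distillation setting concrete, here is a toy sketch with an online encoder, a lightweight predictor, and an EMA target (single linear maps instead of deep products, and all sizes and constants are mine, not the paper's):

```python
import torch

d, k, steps = 16, 8, 1000
torch.manual_seed(0)

W_online = torch.randn(k, d, requires_grad=True)    # online linear encoder
predictor = torch.randn(k, k, requires_grad=True)   # lightweight linear predictor
W_target = W_online.detach().clone()                # EMA target encoder (no gradients)

opt = torch.optim.SGD([W_online, predictor], lr=1e-2)
for _ in range(steps):
    x1 = torch.randn(64, d)
    x2 = x1 + 0.1 * torch.randn(64, d)              # two "views" of the same input
    pred = x1 @ W_online.T @ predictor.T            # predict the target representation
    target = (x2 @ W_target.T).detach()
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    W_target = 0.99 * W_target + 0.01 * W_online.detach()   # EMA update of the target
```
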
@@ -0,0 +1,23 @@
---
authors:
- "[[Ashutosh Sathe|Ashutosh Sathe]]"
- "[[Divyanshu Aggarwal|Divyanshu Aggarwal]]"
- "[[Sunayana Sitaram|Sunayana Sitaram]]"
year: 2024
tags:
- paper
- llm
- dl_theory
url: https://arxiv.org/abs/2407.03678v1
share: true
---
> [!tldr] Abstract
> Prior research has demonstrated noticeable performance gains through the use of probabilistic tokenizations, an approach that involves employing multiple tokenizations of the same input string during the training phase of a language model. Despite these promising findings, modern large language models (LLMs) have yet to be trained using probabilistic tokenizations. Interestingly, while the tokenizers of these contemporary LLMs have the capability to generate multiple tokenizations, this property remains underutilized.
>
> In this work, we propose a novel method to leverage the multiple tokenization capabilities of modern LLM tokenizers, aiming to enhance the self-consistency of LLMs in reasoning tasks. Our experiments indicate that when utilizing probabilistic tokenizations, LLMs generate logically diverse reasoning paths, moving beyond mere surface-level linguistic diversity. We carefully study probabilistic tokenization and offer insights to explain the self-consistency improvements it brings through extensive experimentation on 5 LLM families and 4 reasoning benchmarks.
## Notes

> "Our analysis shows that the primary reason for success of probabilistic tokenization on reasoning tasks is its ability to generate logically diverse reasoning paths."

@@ -0,0 +1,17 @@
---
authors:
- "[[Sophie Ostmeier|Sophie Ostmeier]]"
- "[[Brian Axelrod|Brian Axelrod]]"
- "[[Michael E. Moseley|Michael E. Moseley]]"
- "[[Akshay Chaudhari|Akshay Chaudhari]]"
- "[[Curtis Langlotz|Curtis Langlotz]]"
year: 2024
tags:
- paper
- dl_theory
- vit
url: https://arxiv.org/abs/2406.10322
share: true
---
> [!tldr] Abstract
> While Rotary Position Embeddings (RoPE) for large language models have become widely adopted, their application for other modalities has been slower. Here, we introduce Lie group Relative position Encodings (LieRE) that goes beyond RoPE in supporting n-dimensional inputs. We evaluate the performance of LieRE on 2D and 3D image classification tasks and observe that LieRE leads to marked relative improvements in performance (up to 9.7% for 2D and up to 25.5% for 3D), training efficiency (3.5x reduction), data efficiency (30%) compared to the baselines of DeiT III, RoPE-Mixed and Vision-Llama. [this https URL](https://github.com/Stanford-AIMI/LieRE)
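
A rough sketch of the LieRE idea as I read the abstract: each n-dimensional position is mapped to a rotation matrix via learned skew-symmetric generators and a matrix exponential, and that rotation is applied to queries/keys. The dimensions and the way generators are combined are my assumptions.

```python
import torch
import torch.nn as nn

class LieRESketch(nn.Module):
    """Illustrative: rotate key/query vectors by exp(sum_k p_k * A_k), where each
    A_k is a learned skew-symmetric generator and p is the n-dimensional patch
    position (e.g. (row, col) for 2D images)."""

    def __init__(self, head_dim: int, n_pos_dims: int = 2):
        super().__init__()
        self.generators = nn.Parameter(0.01 * torch.randn(n_pos_dims, head_dim, head_dim))

    def forward(self, x, positions):
        # x: [B, N, head_dim], positions: [N, n_pos_dims]
        A = self.generators - self.generators.transpose(-1, -2)   # skew-symmetric -> exp(A) is a rotation
        M = torch.einsum("nk,kij->nij", positions.float(), A)     # one generator combination per token
        R = torch.matrix_exp(M)                                   # [N, head_dim, head_dim]
        return torch.einsum("nij,bnj->bni", R, x)                 # rotate each token's q/k vector
```
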
@@ -0,0 +1,16 @@
---
authors:
- "[[Reece Shuttleworth|Reece Shuttleworth]]"
- "[[Jacob Andreas|Jacob Andreas]]"
- "[[Antonio Torralba|Antonio Torralba]]"
- "[[Pratyusha Sharma|Pratyusha Sharma]]"
year: 2024
tags:
- paper
- dl_theory
url: https://arxiv.org/abs/2410.21228
share: true
---
> [!tldr] Abstract
> Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, \emph{are their learned solutions really equivalent?} We study how different fine-tuning methods change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task's distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emph{intruder dimensions}. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.
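
A quick sketch of the kind of spectral check the abstract describes: flag high-ranking singular vectors of the fine-tuned weight that have no close counterpart among the pre-trained singular vectors (the thresholds here are arbitrary; the paper's exact criterion may differ).

```python
import torch

def intruder_dimensions(W_pre, W_ft, top_k=10, sim_threshold=0.6):
    """Return indices of top singular vectors of W_ft whose maximum cosine
    similarity to any pre-trained singular vector falls below the threshold.
    Mirrors the 'intruder dimension' analysis only in spirit."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    sims = (U_ft[:, :top_k].T @ U_pre).abs()       # [top_k, r] cosine similarities (columns are unit norm)
    max_sim = sims.max(dim=1).values
    return torch.nonzero(max_sim < sim_threshold).flatten()

# Example with a synthetic low-rank ("LoRA-like") update:
W0 = torch.randn(256, 256)
delta = torch.randn(256, 8) @ torch.randn(8, 256) * 0.5
print(intruder_dimensions(W0, W0 + delta))
```
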
@@ -0,0 +1,15 @@
---
authors:
- "[[Mathilde Caron|Mathilde Caron]]"
- "[[Neil Houlsby|Neil Houlsby]]"
- "[[Cordelia Schmid|Cordelia Schmid]]"
year: 2022
tags:
- paper
- vit
- thesis
url: https://arxiv.org/abs/2212.02400
share: true
---
> [!tldr] Abstract
> Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step to improve models on a task like semantic segmentation. However, prominent algorithms for pretraining neural networks use image-level objectives, e.g. image classification, image-text alignment a la CLIP, or self-supervised contrastive learning. These objectives do not model spatial information, which might be sub-optimal when finetuning on downstream tasks with spatial reasoning. In this work, we pretrain networks with a location-aware (LOCA) self-supervised method which fosters the emergence of strong dense features. Specifically, we use both a patch-level clustering scheme to mine dense pseudo-labels and a relative location prediction task to encourage learning about object parts and their spatial arrangements. Our experiments show that LOCA pretraining leads to representations that transfer competitively to challenging and diverse semantic segmentation datasets.