
Commit d3c2d1c

Merge branch 'main' into ds_tp_upstream2
2 parents: 24f1be4 + 706703b

File tree: 120 files changed (+4346, −2699 lines)


Diff for: .circleci/create_circleci_config.py

+1 −2

@@ -41,8 +41,7 @@
 "ConnectionError",  # Connection transient error
 "FileNotFoundError",  # Raised by `datasets` on Hub failures
 "PIL.UnidentifiedImageError",  # Raised by `PIL.Image.open` on connection issues
-"HTTPError.*502",  # Hub-related
-"HTTPError.*504",  # Hub-related
+"HTTPError",  # Also catches HfHubHTTPError
 "AssertionError: Tensor-likes are not close!",  # `torch.testing.assert_close`, we might have unlucky random values
 # TODO: error downloading tokenizer's `merged.txt` from hub can cause all the exceptions below. Throw and handle
 # them under a single message.

Diff for: .github/workflows/push-important-models.yml

+1 −1

@@ -27,7 +27,7 @@ jobs:
 - name: Get changed files
   id: changed-files
-  uses: tj-actions/changed-files@3f54ebb830831fc121d3263c1857cfbdc310cdb9 #v42
+  uses: tj-actions/changed-files@1c8e6069583811afb28f97afeaf8e7da80c6be5c
   with:
     files: src/transformers/models/**

Diff for: .github/workflows/self-push-caller.yml

+2 −2

@@ -25,7 +25,7 @@ jobs:
 - name: Get changed files
   id: changed-files
-  uses: tj-actions/changed-files@v41
+  uses: tj-actions/changed-files@1c8e6069583811afb28f97afeaf8e7da80c6be5c

 - name: Was setup changed
   id: was_changed
@@ -51,4 +51,4 @@ jobs:
 needs: build-docker-containers
 steps:
   - name: Trigger push CI via workflow_run
-    run: echo "Trigger push CI via workflow_run"
+    run: echo "Trigger push CI via workflow_run"

Diff for: README.md

+169 −189
Large diffs are not rendered by default.

Diff for: docs/source/en/_toctree.yml

+2
@@ -529,6 +529,8 @@
     title: MegatronGPT2
   - local: model_doc/mistral
     title: Mistral
+  - local: model_doc/mistral3
+    title: Mistral3
   - local: model_doc/mixtral
     title: Mixtral
   - local: model_doc/mluke

Diff for: docs/source/en/installation.md

+1 −1

@@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.

 # Installation

-Transformers works with [PyTorch](https://pytorch.org/get-started/locally/), [TensorFlow 2.0](https://www.tensorflow.org/install/pip), and [Flax](https://flax.readthedocs.io/en/latest/). It has been tested on Python 3.6+, PyTorch 1.1.0+, TensorFlow 2.0+, and Flax.
+Transformers works with [PyTorch](https://pytorch.org/get-started/locally/), [TensorFlow 2.0](https://www.tensorflow.org/install/pip), and [Flax](https://flax.readthedocs.io/en/latest/). It has been tested on Python 3.9+, PyTorch 2.0+, TensorFlow 2.6+, and Flax 0.4.1+.

 ## Virtual environment
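
As an illustrative aside (not part of the installation guide), one might sanity-check a local environment against the tested versions above like this:

```python
# Illustrative environment check, assuming PyTorch is the chosen backend.
import sys

import torch
import transformers

assert sys.version_info >= (3, 9), "Transformers is tested on Python 3.9+"
print("PyTorch:", torch.__version__)            # tested on 2.0+
print("Transformers:", transformers.__version__)
```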

Diff for: docs/source/en/model_doc/hubert.md

+3 −2

@@ -71,9 +71,10 @@ pip install -U flash-attn --no-build-isolation
 Below is an expected speedup diagram comparing the pure inference time between the native implementation in transformers of `facebook/hubert-large-ls960-ft`, the flash-attention-2 and the sdpa (scale-dot-product-attention) version. We show the average speedup obtained on the `librispeech_asr` `clean` validation split:

 ```python
->>> from transformers import Wav2Vec2Model
+>>> from transformers import HubertModel
+>>> import torch

-model = Wav2Vec2Model.from_pretrained("facebook/hubert-large-ls960-ft", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
+>>> model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to("cuda")
 ...
 ```
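
As a usage note that is not part of the diff, the SDPA variant referenced in the comparison above would presumably be loaded the same way, only with a different `attn_implementation`; a minimal sketch:

```python
import torch
from transformers import HubertModel

# Hedged sketch: same checkpoint as above, but with scaled-dot-product attention
# instead of flash_attention_2.
model = HubertModel.from_pretrained(
    "facebook/hubert-large-ls960-ft",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
```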

Diff for: docs/source/en/model_doc/mistral3.md

+234

@@ -0,0 +1,234 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Mistral3

## Overview

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long-context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.

This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez) and [yonigozlan](https://huggingface.co/yonigozlan).

The original code can be found [here](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/pixtral.py) and [here](https://github.com/mistralai/mistral-common).
## Usage example

### Inference with Pipeline

Here is how you can use the `image-text-to-text` pipeline to perform inference with the `Mistral3` models in just a few lines of code:

```python
>>> import torch
>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
```

### Inference on a single image

This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Describe this image"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
```

### Text-only generation

This example shows how to generate text using the Mistral3 model without providing any image input.

````python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

>>> messages = [
...     {"role": "system", "content": SYSTEM_PROMPT},
...     {"role": "user", "content": user_prompt},
... ]

>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(model.device)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!

```
/\_/\
( o.o )
> ^ <
```"
````

### Batched image and text inputs

Mistral3 models also support batched image and text inputs.

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path",
 "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
```

### Batched multi-image input and quantization with BitsAndBytes

This implementation of the Mistral3 models supports batched text and image inputs with a different number of images for each prompt. This example also shows how to use `BitsAndBytes` to load the model with 4-bit quantization.

```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
...     model_checkpoint, quantization_config=quantization_config
... )

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
```

## Mistral3Config

[[autodoc]] Mistral3Config

## Mistral3ForConditionalGeneration

[[autodoc]] Mistral3ForConditionalGeneration
    - forward

Diff for: docs/source/en/modular_transformers.md

+5 −3

@@ -78,7 +78,7 @@ class RobertaModel(BertModel):
     super().__init__(config)
     self.embeddings = RobertaEmbeddings(config)

-
+
 # The model heads now only need to redefine the model inside to `RobertaModel`
 class RobertaForMaskedLM(BertForMaskedLM):
     def __init__(self, config):
@@ -546,7 +546,7 @@ This makes it very easy to switch decorators and makes it explicit that the only
 ## Docstring variables

-If an object defined in both the modular and modeling file from which it inherits, the modular definition has precedence unless for assignments containing the pattern `DOCSTRING`. These variables are typically used in `MODEL_START_DOCSTRING` and `MODEL_INPUT_DOCSTRING` in the modeling files. They are big blocks of docstrings and the linter rewrites the names everywhere. For this reason, assignments containing the `DOCSTRING` variable always uses the definition found in the source file instead of the modular file.
+If an object is defined in both the modular file and the modeling file it inherits from, the modular definition takes precedence, except for assignments containing the pattern `DOCSTRING`. These variables are typically used in `MODEL_START_DOCSTRING` and `MODEL_INPUT_DOCSTRING` in the modeling files. They are big blocks of docstrings and the linter rewrites the names everywhere. For this reason, assignments containing the `DOCSTRING` variable can use the definition found in the source file without copying the whole docstring, by simply setting the variable to `None` in the modular file.

 This is very useful if you need the variable reference somewhere but you don't want to clutter the modular file with docstrings which are always the same. The example code below allows you to automatically use the same docstrings from [Mistral](./model_doc/mistral) in [Starcoder2](./model_doc/starcoder2).

@@ -561,6 +561,8 @@ class Starcoder2Model(MistralModel):
     ...
 ```

+Setting the variable to anything other than `None` will override the docstring, so that you can customize the docstrings if needed.
+
 ## Special naming

 The linter automatically renames everything when inheriting from a class. For consistency, you should always use the same class name prefix when inheriting from different classes from the same file.
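
To make the `DOCSTRING` behaviour described in the hunk above more concrete, here is a hedged reconstruction of the pattern (the variable name and import path are illustrative, not necessarily the exact snippet from the docs):

```python
# In a hypothetical modular_starcoder2.py: setting the docstring variable to `None`
# tells the linter to reuse the definition from the source modeling file (Mistral's)
# when generating the final modeling file for Starcoder2.
from transformers.models.mistral.modeling_mistral import MistralModel

STARCODER2_INPUTS_DOCSTRING = None  # picked up automatically from modeling_mistral.py


class Starcoder2Model(MistralModel):
    ...
```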
@@ -586,7 +588,7 @@ We detected multiple prefix names when inheriting from transformers.models.llama
 If there are automatic dependencies with a prefix, but you want another one, explicitly rename the classes locally with a `pass` class as shown in the following.

 ```py
-class Emu3TextMLP(LlamaMLP):
+class Emu3TextMLP(LlamaMLP):
     pass
 ```

Diff for: docs/source/ko/_toctree.yml

+2 −2

@@ -332,8 +332,8 @@
     title: (번역중) CamemBERT
   - local: in_translation
     title: (번역중) CANINE
-  - local: in_translation
-    title: (번역중) CodeGen
+  - local: model_doc/codegen
+    title: CodeGen
   - local: model_doc/cohere
     title: Cohere
   - local: model_doc/convbert
