
AutoTokenizer in Convert_vima notebook #10

Open
ManishGovind opened this issue Feb 4, 2025 · 9 comments

Comments

@ManishGovind

ManishGovind commented Feb 4, 2025

Hello @LostXine ,

When I try to generate the instruction-tuning data for BC (step 4), I get the error below:



      from transformers import AutoTokenizer
      import torch
----> tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
      print(tokenizer.vocab_size)
      # this is for RT-2

...

File ~/miniconda3/envs/llara/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:111
        # We have a serialization from tokenizers which let us directly build the backend
-->     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)

Exception: data did not match any variant of untagged enum ModelWrapper at line 277156 column 3

May I know which version of transformers should be used? I followed the instructions in README.md for setting up LLaRA. Thanks for your wonderful work!

@LostXine
Owner

LostXine commented Feb 4, 2025

Hi @ManishGovind ,

Thanks for your interest in our work. We are using "transformers==4.37.2", "tokenizers==0.15.1" as suggested by the original LLaVA project here

"transformers==4.37.2", "tokenizers==0.15.1", "sentencepiece==0.1.99", "shortuuid",

Could you help me confirm this?
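
For example, a minimal check from inside the llara environment (just a sketch; pip show transformers tokenizers gives the same information):

import tokenizers
import transformers

print(transformers.__version__)  # expected: 4.37.2
print(tokenizers.__version__)    # expected: 0.15.1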

Thanks,

Xiang

@ManishGovind
Author

ManishGovind commented Feb 4, 2025

Yes, I'm also using the same.

Name: transformers
Version: 4.37.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: #N/A
(llara) pip show tokenizers
Name: tokenizers
Version: 0.15.1
Summary:
Home-page: https://github.com/huggingface/tokenizers
Author: Anthony MOI [email protected]
Author-email: Nicolas Patry [email protected], Anthony Moi [email protected]
License:
Requires: huggingface_hub
Required-by: #N/A

@LostXine
Owner

LostXine commented Feb 4, 2025

I see, let me initialize a new environment and test it again. Thanks for bringing this to my attention.
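
A fresh test environment could be created roughly like this (just a sketch; the env name llara-test is only for illustration, and Python 3.10 matches the paths above):

conda create -n llara-test python=3.10 -y
conda activate llara-test
pip install transformers==4.37.2 tokenizers==0.15.1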

@ManishGovind
Author

Sure, no problem. Looking forward to your reply.

@LostXine
Owner

LostXine commented Feb 4, 2025

Hi @ManishGovind ,

I've reproduced the issue and found a temporary fix. You can bypass the current transformers package version requirement and upgrade to the latest version with the following command:

pip install -U transformers

After the upgrade, the tokenizer should work as expected. However, I haven't tested compatibility with other parts of the code, so if any issues arise, you may need to revert to version 4.37.2.
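
For reference, a minimal sanity check after the upgrade (just a sketch, reusing the snippet from the notebook cell above):

from transformers import AutoTokenizer

# With the upgraded packages, this load should no longer raise the
# "untagged enum ModelWrapper" exception.
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(tokenizer.vocab_size)

If other parts of the code break after the upgrade, pinning back should restore the original setup:

pip install transformers==4.37.2 tokenizers==0.15.1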

I'll implement this quick fix for now and work on a proper solution.

Thanks for your understanding.

Best,

Xiang

@ManishGovind
Author

I will also try upgrading the version and see if it works.
Thanks for your help!

Best,
Manish

@ManishGovind
Author

Hi @LostXine ,

I re-trained LLaRA with the D-InBC-Aux-D-80K instruction data and wanted to reproduce the results, but I ended up with the results below.

May I know what could be the issue?

[Image: evaluation results]

Thanks,
Manish

@LostXine
Owner

Hi @ManishGovind ,

I could not find D-InBC-Aux-D-80K in the image you posted. Could you provide more context?

Thanks,

@ManishGovind
Author

ManishGovind commented Feb 14, 2025

So the first two rows are just D-inBC + Aux-D. I used D-inBC-text-multi-train-80k-front.json (i.e., D-inBC + auxiliary tasks) for training. Do you want me to share my inference JSON?
