
AutoTokenizer in Convert_vima notebook #10

Open
ManishGovind opened this issue Feb 4, 2025 · 9 comments

Comments

@ManishGovind

ManishGovind commented Feb 4, 2025

Hello @LostXine ,

When I try to generate the instruction-tuning data for BC (step 4), I get the error below:



      from transformers import AutoTokenizer
      import torch
----> tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
      print(tokenizer.vocab_size)
      # this is for RT-2

...

File ~/miniconda3/envs/llara/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:111
        # We have a serialization from tokenizers which let us directly build the backend
-->     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)

Exception: data did not match any variant of untagged enum ModelWrapper at line 277156 column 3

May I know which version of transformers should be used? I followed the instructions in README.md for setting up LLaRA. Thanks for your wonderful work!

@LostXine
Owner

LostXine commented Feb 4, 2025

Hi @ManishGovind ,

Thanks for your interest in our work. We are using "transformers==4.37.2", "tokenizers==0.15.1" as suggested by the original LLaVA project here

"transformers==4.37.2", "tokenizers==0.15.1", "sentencepiece==0.1.99", "shortuuid",

Could you help me confirm this?
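
For example, a minimal check from inside the llara environment (just a sketch; pip show transformers tokenizers gives the same information):

import tokenizers
import transformers

print(transformers.__version__)  # expected: 4.37.2
print(tokenizers.__version__)    # expected: 0.15.1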

Thanks,

Xiang

@ManishGovind
Author

ManishGovind commented Feb 4, 2025

Yes, I'm also using the same.

Name: transformers
Version: 4.37.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: #N/A
(llara) pip show tokenizers
Name: tokenizers
Version: 0.15.1
Summary:
Home-page: https://github.com/huggingface/tokenizers
Author: Anthony MOI [email protected]
Author-email: Nicolas Patry [email protected], Anthony Moi [email protected]
License:
Requires: huggingface_hub
Required-by: #N/A

@LostXine
Owner

LostXine commented Feb 4, 2025

I see, let me initialize a new environment and test it again. Thanks for bringing this to my attention.
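
A fresh test environment could be created roughly like this (just a sketch; the env name llara-test is only for illustration, and Python 3.10 matches the paths above):

conda create -n llara-test python=3.10 -y
conda activate llara-test
pip install transformers==4.37.2 tokenizers==0.15.1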

@ManishGovind
Author

Sure, no problem. Looking forward to your reply.

@LostXine
Owner

LostXine commented Feb 4, 2025

Hi @ManishGovind ,

I've reproduced the issue and found a temporary fix. You can bypass the current transformers package version requirement and upgrade to the latest version with the following command:

pip install -U transformers

After the upgrade, the tokenizer should work as expected. However, I haven't tested compatibility with other parts of the code, so if any issues arise, you may need to revert to version 4.37.2.
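
For reference, a minimal sanity check after the upgrade (just a sketch, reusing the snippet from the notebook cell above):

from transformers import AutoTokenizer

# With the upgraded packages, this load should no longer raise the
# "untagged enum ModelWrapper" exception.
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(tokenizer.vocab_size)

If other parts of the code break after the upgrade, pinning back should restore the original setup:

pip install transformers==4.37.2 tokenizers==0.15.1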

I'll implement this quick fix for now and work on a proper solution.

Thanks for your understanding.

Best,

Xiang

@ManishGovind
Author

I will also try upgrading the version and see if it works.
Thanks for your help!

Best,
Manish

@ManishGovind
Author

Hi @LostXine ,

I re-trained LLaRA with the D-InBC-Aux-D-80K instruction data and wanted to reproduce the results, but I ended up with the results below.

May I know what could be the issue?

[Image: evaluation results]

Thanks,
Manish

@LostXine
Owner

Hi @ManishGovind ,

I could not find D-InBC-Aux-D-80K in the image you posted. Could you provide more context?

Thanks,

@ManishGovind
Author

ManishGovind commented Feb 14, 2025

So the first two rows are just D-inBC + Aux-D. I used D-inBC-text-multi-train-80k-front.json (i.e., D-inBC + auxiliary tasks) for training. Do you want me to share my inference JSON?
