
ValueError: offset must be non-negative and no greater than buffer length #5543

Open
LiYixuan727 opened this issue Sep 23, 2024 · 3 comments

Comments

@LiYixuan727

Hi,
I'm training fairseq with the following script and getting the error `ValueError: offset must be non-negative and no greater than buffer length`.

```
fairseq-train data-bin --arch transformer \
    --max-epoch 10 \
    --max-tokens 2048 \
    --num-workers 20 \
    --max-sentences 5000 \
    --fp16 \
    --optimizer adam --lr-scheduler inverse_sqrt --lr 0.0007 \
    --criterion label_smoothed_cross_entropy
```

@LiYixuan727 (Author)

And here is the whole traceback:

2024-09-23 14:53:13 | INFO | fairseq_cli.train | task: TranslationTask
2024-09-23 14:53:13 | INFO | fairseq_cli.train | model: TransformerModel
2024-09-23 14:53:13 | INFO | fairseq_cli.train | criterion: LabelSmoothedCrossEntropyCriterion
2024-09-23 14:53:13 | INFO | fairseq_cli.train | num. shared model params: 22,480,862,208 (num. trained: 22,480,862,208)
2024-09-23 14:53:13 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
2024-09-23 14:53:13 | INFO | fairseq.data.data_utils | loaded 51,352 examples from: data-bin/valid.en-es.en
2024-09-23 14:53:13 | INFO | fairseq.data.data_utils | loaded 51,352 examples from: data-bin/valid.en-es.es
2024-09-23 14:53:13 | INFO | fairseq.tasks.translation | data-bin valid en-es 51352 examples
2024-09-23 14:53:45 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2024-09-23 14:53:45 | INFO | fairseq.utils | rank 0: capabilities = 8.6 ; total memory = 47.431 GB ; name = NVIDIA RTX A6000
2024-09-23 14:53:45 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2024-09-23 14:53:45 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2024-09-23 14:53:45 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = 5000
2024-09-23 14:53:45 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
2024-09-23 14:53:45 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
2024-09-23 14:53:45 | INFO | fairseq.trainer | loading train data for epoch 1
2024-09-23 14:53:49 | INFO | fairseq.data.data_utils | loaded 51,249,574 examples from: data-bin/train.en-es.en
2024-09-23 14:53:53 | INFO | fairseq.data.data_utils | loaded 51,249,574 examples from: data-bin/train.en-es.es
2024-09-23 14:53:53 | INFO | fairseq.tasks.translation | data-bin train en-es 51249574 examples
```
Traceback (most recent call last):
  File "/home/ag/.local/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq_cli/train.py", line 557, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq_cli/train.py", line 164, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/checkpoint_utils.py", line 272, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/trainer.py", line 719, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/iterators.py", line 368, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/iterators.py", line 368, in <listcomp>
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/language_pair_dataset.py", line 305, in __getitem__
    tgt_item = self.tgt[index] if self.tgt is not None else None
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/indexed_dataset.py", line 523, in __getitem__
    np_array = np.frombuffer(
ValueError: offset must be non-negative and no greater than buffer length (6711936916)
```
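For reference, the `ValueError` at the bottom of the traceback comes directly from NumPy: `np.frombuffer` raises it whenever the requested byte offset falls outside the underlying buffer. A standalone reproduction, independent of fairseq:

```python
import numpy as np

buf = bytes(8)  # an 8-byte buffer

# A valid read: offset stays within the buffer.
ok = np.frombuffer(buf, dtype=np.int32, offset=4)

# An offset past the end of the buffer raises the same error seen above.
try:
    np.frombuffer(buf, dtype=np.int32, offset=16)
except ValueError as e:
    print(e)  # offset must be non-negative and no greater than buffer length (8)
```

So the fairseq failure means the dataset index computed an offset (6711936916 bytes here) that does not fit the memory-mapped buffer it is reading from.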

@Herostomo

I wanted to offer my assistance regarding the ValueError: offset must be non-negative and no greater than buffer length error you encountered while training with Fairseq.

Summary of the issue:
The error occurs during training, when the code attempts to access an index in the dataset that is out of range. This typically indicates a problem with the dataset's formatting or indexing.

Suggested approach:
1. Verify dataset integrity
2. Check data loading and indexing
3. Check consistency between source and target datasets
4. Adjust the worker count
5. Check configuration parameters
6. Inspect data paths

@dtamayo-nlp

Hi!

In my case this problem appeared because of an integer-precision issue when binarizing long files during preprocessing. It can be solved by adding the following lines in the binarization code:

```python
sizes = [np.int64(el) for el in sizes]
address = np.int64(0)
```

and then re-processing the corpus with fairseq-preprocess.

You could also avoid the problem by splitting your large files into smaller ones.
