Error with dimension size mismatch in step 3 #602

Closed

Linjiahua opened this issue Jun 21, 2023 · 2 comments

Comments


Linjiahua commented Jun 21, 2023

Traceback (most recent call last):
  File "main.py", line 520, in <module>
    main()
  File "main.py", line 428, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/root/ljh/lab/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 98, in generate_experience
    seq = self._generate_sequence(prompts, mask)
  File "/root/ljh/lab/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 207, in generate
    self._fuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 137, in _fuse_lora
    weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0

I have noticed that other people have hit the same problem: #337
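
For context, the failing line in hybrid_engine.py adds the LoRA update in place to the base weight, so both tensors must line up exactly. Below is a minimal standalone sketch that reproduces the same RuntimeError; the shapes are assumptions inferred from the numbers in the traceback and --inference_tp_size 2 (a 5120-wide per-GPU weight shard combined with LoRA factors still sized for the full 20480-wide projection), not DeepSpeed's actual parameter layout:

import torch

# Assumed shapes, chosen only to reproduce the error message above;
# this is an illustration, not DeepSpeed's actual _fuse_lora implementation.
lora_dim = 128                                   # --actor_lora_dim from the command below
lora_scaling = 1.0
weight = torch.zeros(5120, 5120)                 # base weight as sharded on one GPU (assumed)
lora_left_weight = torch.zeros(lora_dim, 20480)  # LoRA factor sized for the full projection (assumed)
lora_right_weight = torch.zeros(5120, lora_dim)

# Same in-place update as hybrid_engine.py line 137 in the traceback:
weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
# RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480)
# at non-singleton dimension 0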


Linjiahua commented Jun 21, 2023

deepspeed --master_port 12346 main.py \
   --data_path /xxx/DeepSpeedExamples/data/Dahoas/rm-static/data/ \
   --data_split 2,4,4 \
   --actor_model_name_or_path /xxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/output \
   --critic_model_name_or_path /xxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/output \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 16 \
   --per_device_mini_train_batch_size 16 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
    &> $OUTPUT/training.log

This is my startup command; it is barely changed from the example script.

@yanshanjing

Remove --enable_hybrid_engine \ from the launch command.
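
For what it's worth, the traceback shows that it is the hybrid engine's generate path that calls _fuse_lora, and --inference_tp_size appears to be a hybrid-engine-only option, so the workaround amounts to deleting these two lines from the command above:

   --enable_hybrid_engine \
   --inference_tp_size 2 \

Generation should then go through the regular (slower) non-hybrid path, and the 5120-vs-20480 mismatch should no longer be hit.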
