Error with dimension size mismatch in step 3 #602

Closed

Linjiahua opened this issue Jun 21, 2023 · 2 comments

Comments


Linjiahua commented Jun 21, 2023

Traceback (most recent call last):
  File "main.py", line 520, in <module>
    main()
  File "main.py", line 428, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
  File "/root/ljh/lab/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 98, in generate_experience
    seq = self._generate_sequence(prompts, mask)
  File "/root/ljh/lab/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 207, in generate
    self._fuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 137, in _fuse_lora
    weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0

I have noticed that other people have hit the same problem: #337
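
For context, the failing line in hybrid_engine.py adds the LoRA update in place to the base weight, so both tensors must line up exactly. Below is a minimal standalone sketch that reproduces the same RuntimeError; the shapes are assumptions inferred from the numbers in the traceback and --inference_tp_size 2 (a 5120-wide per-GPU weight shard combined with LoRA factors still sized for the full 20480-wide projection), not DeepSpeed's actual parameter layout:

import torch

# Assumed shapes, chosen only to reproduce the error message above;
# this is an illustration, not DeepSpeed's actual _fuse_lora implementation.
lora_dim = 128                                   # --actor_lora_dim from the command below
lora_scaling = 1.0
weight = torch.zeros(5120, 5120)                 # base weight as sharded on one GPU (assumed)
lora_left_weight = torch.zeros(lora_dim, 20480)  # LoRA factor sized for the full projection (assumed)
lora_right_weight = torch.zeros(5120, lora_dim)

# Same in-place update as hybrid_engine.py line 137 in the traceback:
weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
# RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480)
# at non-singleton dimension 0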


Linjiahua commented Jun 21, 2023

deepspeed --master_port 12346 main.py \
   --data_path /xxx/DeepSpeedExamples/data/Dahoas/rm-static/data/ \
   --data_split 2,4,4 \
   --actor_model_name_or_path /xxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/output \
   --critic_model_name_or_path /xxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/output \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 16 \
   --per_device_mini_train_batch_size 16 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
    &> $OUTPUT/training.log

This is my startup command; it is barely changed from the example script.

@yanshanjing

Remove --enable_hybrid_engine \ from the launch command.
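
For what it's worth, the traceback shows that it is the hybrid engine's generate path that calls _fuse_lora, and --inference_tp_size appears to be a hybrid-engine-only option, so the workaround amounts to deleting these two lines from the command above:

   --enable_hybrid_engine \
   --inference_tp_size 2 \

Generation should then go through the regular (slower) non-hybrid path, and the 5120-vs-20480 mismatch should no longer be hit.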
