Multi-node sequence-parallel training in version 3.0 hangs at the first step; possibly a Linux kernel version issue? #3115
pip versions: Package Version Editable project location absl-py 2.1.0 |
Launch parameters for 3.0:
export NCCL_DEBUG=DEBUG
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_P2P_LEVEL=NVL
NNODES=${WORLD_SIZE:-1}
NODE_RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
MASTER_PORT=${MASTER_PORT:-$RANDOM_PORT}
NPROC_PER_NODE=$nproc_per_node
swift sft \
    --model_type qwen2_5 \
    --model $model_dir \
    --train_type full \
    --torch_dtype bfloat16 \
    --output_dir $output_dir \
    --ddp_backend nccl \
    --dataset $data_path \
    --dataloader_num_workers 0 \
    --num_train_epochs 3 \
    --max_length 12000 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 1 \
    --enable_cache true \
    --weight_decay 0.1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 4 \
    --max_grad_norm 1.0 \
    --warmup_ratio 0.1 \
    --save_steps 10000 \
    --eval_steps 10000 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --save_only_model true \
    --deepspeed zero3 \
    --attn_impl flash_attn \
    --report_to none \
    --ddp_timeout 1800000000 2>&1 | tee $log_file
Error log:
/trainers/mixin.py:78: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Seq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
Detected kernel version 4.9.151, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The logging file will be saved in: /mnt4/ckpt/yintaoye.yty/MAPMixArrange-antfinix-72B-32K-cosine-swift-20250214-long-cot-withmap/v7-20250214-173448/logging.jsonl
Parameter Offload: Total persistent parameters: 2138112 in 401 params
Train: 0%| | 0/177 [00:00<?, ?it/s]
With the same parameters, swift 2.6 trains normally.
Upgrading the Linux kernel looks like a lot of trouble. Is there any other way to solve this?
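Since the warning in the log compares the running kernel (4.9.151) against a recommended minimum of 5.5.0, the same comparison can be run by hand on each node to see which machines are affected. This is only a minimal sketch, assuming GNU `sort -V` is available; `version_lt` and `kernel_version` are hypothetical helper names for illustration, not part of swift, and the script only reports the kernel version without changing the hang itself.

```shell
#!/usr/bin/env bash
# Sketch: compare this node's kernel version with the 5.5.0 minimum
# quoted in the warning. Helper names are hypothetical.

# Return 0 (true) when version $1 is strictly lower than version $2.
version_lt() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print the numeric part of `uname -r`, dropping any distribution suffix
# (for example "5.15.0-91-generic" becomes "5.15.0").
kernel_version() {
    uname -r | sed -E 's/^([0-9]+(\.[0-9]+)*).*/\1/'
}

min_kernel="5.5.0"   # threshold taken from the warning in the log
current="$(kernel_version)"

if version_lt "$current" "$min_kernel"; then
    echo "kernel $current is below the recommended minimum $min_kernel"
else
    echo "kernel $current meets the recommended minimum $min_kernel"
fi
```

Running it on every node participating in the job shows whether all of them, or only some, report a kernel older than the threshold.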