Multi-node sequence-parallel training hangs at the first step in version 3.0; possibly a Linux kernel version issue? #3115

Open
yty3805595 opened this issue Feb 14, 2025 · 1 comment


yty3805595 commented Feb 14, 2025

Launch parameters for 3.0:
export NCCL_DEBUG=DEBUG
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_P2P_LEVEL=NVL

NNODES=${WORLD_SIZE:-1} \
NODE_RANK=${RANK:-0} \
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1} \
MASTER_PORT=${MASTER_PORT:-$RANDOM_PORT} \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model_type qwen2_5 \
    --model $model_dir \
    --train_type full \
    --torch_dtype bfloat16 \
    --output_dir $output_dir \
    --ddp_backend nccl \
    --dataset $data_path \
    --dataloader_num_workers 0 \
    --num_train_epochs 3 \
    --max_length 12000 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 1 \
    --enable_cache true \
    --weight_decay 0.1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 4 \
    --max_grad_norm 1.0 \
    --warmup_ratio 0.1 \
    --save_steps 10000 \
    --eval_steps 10000 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --save_only_model true \
    --deepspeed zero3 \
    --attn_impl flash_attn \
    --report_to none \
    --ddp_timeout 1800000000 2>&1 | tee $log_file
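
When debugging a multi-node hang, it can help to print what the `${VAR:-default}` fallbacks in the launch script actually resolve to on each node. The following is only a minimal sketch: `resolve_launch_env` is a hypothetical helper name, not part of ms-swift, and it assumes `RANDOM_PORT` is set somewhere outside the snippet shown in this issue. If `WORLD_SIZE`, `RANK`, or `MASTER_ADDR` are not injected by the scheduler, every node would fall back to a single-node setup pointing at 127.0.0.1.

```shell
# Hypothetical helper (not part of ms-swift): print the values that the
# launch script's ${VAR:-default} expansions would resolve to on this node.
resolve_launch_env() {
  # ${VAR:-default} uses the default when VAR is unset or empty.
  echo "NNODES=${WORLD_SIZE:-1}"
  echo "NODE_RANK=${RANK:-0}"
  echo "MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}"
  # RANDOM_PORT is not defined in the posted snippet; show "<empty>" if
  # neither MASTER_PORT nor RANDOM_PORT is available.
  echo "MASTER_PORT=${MASTER_PORT:-${RANDOM_PORT:-<empty>}}"
}

resolve_launch_env
```

Running this on every node before `swift sft` and comparing the output is a cheap way to rule out a rendezvous misconfiguration before looking at the kernel.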

Error log:
/trainers/mixin.py:78: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Seq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(
Detected kernel version 4.9.151, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The logging file will be saved in: /mnt4/ckpt/yintaoye.yty/MAPMixArrange-antfinix-72B-32K-cosine-swift-20250214-long-cot-withmap/v7-20250214-173448/logging.jsonl
Parameter Offload: Total persistent parameters: 2138112 in 401 params
Train: 0%| | 0/177 [00:00<?, ?it/s]
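
The kernel warning in the log compares the running kernel against a 5.5.0 minimum. The comparison can be reproduced on each node with a small sketch like the one below. `version_ge` is a hypothetical helper name, and the sketch assumes GNU `sort` with `-V` (version sort) is available; it only reports the comparison and says nothing about whether the kernel is the actual cause of the hang.

```shell
# Hypothetical helper: return 0 if version $1 >= version $2.
# Relies on GNU sort's -V (version sort); the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

kernel="$(uname -r)"
kernel="${kernel%%-*}"   # strip any distro suffix after the first "-"

if version_ge "$kernel" "5.5.0"; then
  echo "kernel $kernel meets the 5.5.0 minimum"
else
  echo "kernel $kernel is below the 5.5.0 minimum"
fi
```

With the version from the log, `version_ge 4.9.151 5.5.0` fails, which matches the warning. Since the same hosts reportedly train fine with swift 2.6, the result is a data point to compare across nodes rather than proof of the root cause.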

With the same parameters, swift 2.6 trains normally.

Upgrading the Linux kernel seems like a lot of trouble. Is there any other way to solve this?

@yty3805595
Author

pip package versions:

Package Version Editable project location


absl-py 2.1.0
accelerate 1.3.0
addict 2.4.0
aenum 3.1.15
aii-pypai 0.1.40.73
aiofiles 23.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiohttp-cors 0.7.0
aiosignal 1.3.1
airportsdata 20241001
aistudio-analyzer 0.0.4.119
aistudio-checkpoint 0.1.250123
aistudio-common 0.0.28.60
aistudio-notebook 2.0.131
aistudio-serving 0.0.0.93
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-core-v3 2.13.33
aliyun-python-sdk-kms 2.16.5
alluxio 0.1.2
altair 5.5.0
anaconda-anon-usage 0.5.0
annotated-types 0.7.0
antlr4-python3-runtime 4.13.2
anyio 4.7.0
archspec 0.2.3
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
arxiv 2.1.3
astor 0.8.1
astroid 3.3.6
asttokens 3.0.0
async-lru 2.0.4
async-timeout 5.0.1
attrdict 2.0.1
attrs 24.2.0
autopep8 2.0.4
babel 2.16.0
beautifulsoup4 4.12.3
binpacking 1.5.2
bitsandbytes 0.45.1
blake3 1.0.4
bleach 6.2.0
blessed 1.20.0
blinker 1.9.0
blis 0.7.11
boltons 23.0.0
Brotli 1.0.9
cachetools 5.5.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2024.8.30
cffi 1.17.1
chardet 5.2.0
charset-normalizer 2.0.4
click 8.1.8
cloudpathlib 0.16.0
cloudpickle 3.1.0
colorama 0.4.6
coloredlogs 15.0.1
colorful 0.5.6
colorlog 6.9.0
comm 0.2.2
compressed-tensors 0.9.0
concurrent-log-handler 0.9.20
conda 24.11.1
conda-content-trust 0.2.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.4.0
conda_package_streaming 0.11.0
confection 0.1.5
configparser 7.1.0
contourpy 1.3.1
coverage 7.6.12
cpm-kernels 1.0.11
crcmod 1.7
cryptography 43.0.3
cycler 0.12.1
cymem 2.0.11
dacite 1.9.2
dataclasses-json 0.6.7
DataProperty 1.1.0
datasets 3.0.1
debugpy 1.8.9
decorator 5.1.1
decord 0.6.0
deepspeed 0.15.4
defusedxml 0.7.1
deprecation 2.1.0
depyf 0.18.0
diffusers 0.32.2
dill 0.3.8
diskcache 5.6.3
distlib 0.3.9
distro 1.9.0
docstring-to-markdown 0.15
duckduckgo_search 5.3.1b1
editdistance 0.8.1
einops 0.8.0
entrypoints 0.4
et_xmlfile 2.0.0
evalscope 0.10.1
evaluate 0.4.3
exceptiongroup 1.2.2
execnet 2.1.1
executing 2.1.0
fastapi 0.115.6
fastjsonschema 2.21.1
fe 0.3.33
feedparser 6.0.11
ffmpy 0.4.0
filelock 3.17.0
fire 0.7.0
flake8 7.1.1
flash_attn 2.7.4.post1
flatbuffers 24.3.25
fonttools 4.55.3
fqdn 1.5.1
frozendict 2.4.2
frozenlist 1.5.0
fsspec 2024.2.0
func_timeout 4.3.5
future 1.0.0
fuzzywuzzy 0.18.0
gguf 0.10.0
gitdb 4.0.12
GitPython 3.1.44
google-api-core 2.24.1
google-auth 2.36.0
googleapis-common-protos 1.66.0
gpustat 1.1.1
gradio 5.16.0
gradio_client 1.7.0
griffe 0.49.0
grpcio 1.68.1
grpcio-channelz 1.34.0
h11 0.14.0
h2 4.2.0
h5py 3.12.1
hf_transfer 0.1.9
hjson 3.1.0
hpack 4.1.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.28.1
human-eval 1.0.3
humanfriendly 10.0
hyperframe 6.1.0
idna 3.7
imageio 2.37.0
immutabledict 4.2.1
immutables 0.20
importlib_metadata 8.6.1
iniconfig 2.0.0
inquirerpy 0.3.4
interegular 0.3.3
ipykernel 6.29.5
ipython 8.30.0
ipython-genutils 0.2.0
ipywidgets 8.1.5
isoduration 20.11.0
isort 5.13.2
jedi 0.19.2
jedi-language-server 0.42.0
jieba 0.42.1
Jinja2 3.1.5
jinjasql 0.1.8
jiter 0.8.2
jmespath 0.10.0
joblib 1.4.2
json5 0.10.0
jsonlines 4.0.0
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.2
jupyter_server_terminals 0.5.3
jupyterlab 4.3.5
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.13
kiwisolver 1.4.7
kubemaker 0.2.17
kubernetes 9.0.0
lagent 0.2.4
langcodes 3.5.0
language_data 1.3.0
lark 1.2.2
latex2sympy2_extended 1.0.4
lazy_loader 0.4
Levenshtein 0.26.1
libmambapy 1.5.8
libro-core 0.1.6
libro-server 0.1.10
liger_kernel 0.5.2
lighteval 0.7.0
linkify-it-py 2.0.3
llm-blender 0.0.2
lm-format-enforcer 0.10.9
lsprotocol 2023.0.1
lxml 5.3.0
marisa-trie 1.2.1
Markdown 3.7
markdown-it-py 2.2.0
MarkupSafe 2.1.5
marshmallow 3.26.1
math-verify 0.5.1
matplotlib 3.9.3
matplotlib-inline 0.1.7
maya-tools 0.0.4
mbstrdecoder 1.1.4
mccabe 0.7.0
mdit-py-plugins 0.3.3
mdurl 0.1.2
menuinst 2.2.0
mergekit 0.0.6
mistral_common 1.5.2
mistune 0.8.4
mmengine 0.10.6
mmengine-lite 0.10.6
modelscope 1.22.3
mpmath 1.3.0
ms-opencompass 0.1.6
ms-swift 3.2.0.dev0 /mnt4/code/yintaoye.yty/ms-swift-0213
ms-vlmeval 0.0.13
msgpack 1.1.0
msgspec 0.19.0
multidict 6.1.0
multiprocess 0.70.16
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.17.0
nbclient 0.5.13
nbconvert 6.4.4
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.3
nltk 3.9.1
notebook 6.4.6
notebook_shim 0.2.4
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
oauthlib 3.2.2
omegaconf 2.3.0
open-r1 0.1.0.dev0 /mnt4/code/yintaoye.yty/open-r1-main/src
openai 1.58.1
OpenCC 1.1.9
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python 4.11.0.86
opencv-python-headless 4.11.0.86
openpyxl 3.1.5
orjson 3.10.12
oss2 2.6.0
osscmd 0.4.5
outlines 0.1.11
outlines_core 0.1.26
overrides 7.7.0
packaging 24.2
pandas 2.0.1
pandocfilters 1.5.1
parameterized 0.9.0
parso 0.8.4
partial-json-parser 0.2.1.1.post5
pathvalidate 3.2.3
peft 0.14.0
peppercorn 0.6
pexpect 4.9.0
pfzy 0.3.4
phx-class-registry 4.1.0
pillow 10.4.0
pip 24.2
platformdirs 3.11.0
plotly 6.0.0
pluggy 1.5.0
ply 3.11
portalocker 3.0.0
preshed 3.0.9
prettytable 3.14.0
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.0.2
prompt_toolkit 3.0.48
propcache 0.2.1
proto-plus 1.26.0
protobuf 3.20.3
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
py-spy 0.4.0
pyaml 21.10.1
pyarrow 18.1.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybind11 2.13.6
pycodestyle 2.12.1
pycosat 0.6.6
pycountry 24.6.1
pycparser 2.21
pycryptodome 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydeck 0.9.1
pydocstyle 6.3.0
pydub 0.25.1
pyflakes 3.2.0
pygls 1.3.1
Pygments 2.18.0
pyhocon 0.3.61
pyinotify 0.9.6
pylint 3.3.2
Pympler 1.1
pyodps 0.12.1.1
pyOpenSSL 24.3.0
pyparsing 3.2.0
pypinyin 0.53.0
PySocks 1.7.1
pytablewriter 1.2.1
pytest 8.3.4
pytest-cov 6.0.0
pytest-rerunfailures 15.0
pytest-xdist 3.6.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-json-logger 3.2.0
python-Levenshtein 0.26.1
python-lsp-jsonrpc 1.1.2
python-lsp-server 1.12.0
python-multipart 0.0.19
python-prctl 1.8.1
pytoolconfig 1.3.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
rank-bm25 0.2.2
RapidFuzz 3.12.1
ray 2.9.0
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
requests-toolbelt 1.0.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.9.4
rope 1.13.0
rouge 1.0.1
rouge-chinese 1.0.3
rouge_score 0.1.2
rpds-py 0.22.3
rsa 4.9
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
ruff 0.9.4
ruff-lsp 0.0.59
sacrebleu 2.5.1
safehttpx 0.1.6
safetensors 0.5.2
scikit-image 0.25.1
scikit-learn 1.6.1
scipy 1.15.1
seaborn 0.13.2
semantic-version 2.10.0
Send2Trash 1.8.3
sentence-transformers 3.4.1
sentencepiece 0.2.0
setuptools 69.5.1
setuptools-scm 8.1.0
sgmllib3k 1.0.0
shellingham 1.5.4
simple-ddl-parser 1.7.1
simplejson 3.19.3
six 1.17.0
smart-open 6.4.0
smmap 5.0.2
sniffio 1.3.1
snowballstemmer 2.2.0
socksio 1.0.0
sortedcontainers 2.4.0
soupsieve 2.6
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
sqlparse 0.5.3
srsly 2.5.1
stack-data 0.6.3
starlette 0.41.3
streamlit 1.42.0
sty 1.0.6
sympy 1.13.1
tabledata 1.3.4
tabulate 0.9.0
tcolorpy 0.1.7
tenacity 9.0.0
tensorboard 2.19.0
tensorboard-data-server 0.7.2
termcolor 2.3.0
terminado 0.18.1
testpath 0.6.0
thinc 8.2.5
threadpoolctl 3.5.0
tifffile 2025.1.10
tiktoken 0.7.0
timeout-decorator 0.5.0
tinycss2 1.4.0
tokenizers 0.21.0
toml 0.10.2
tomli 2.2.1
tomlkit 0.13.2
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.48.2
transformers-stream-generator 0.0.5
triton 3.1.0
trl 0.15.0.dev0 /mnt4/code/yintaoye.yty/trl
truststore 0.8.0
typepy 1.3.4
typer 0.15.1
types-python-dateutil 2.9.0.20241206
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2024.2
uc-micro-py 1.0.3
ujson 5.10.0
uri-template 1.3.0
urllib3 1.26.4
uvicorn 0.32.1
uvloop 0.21.0
validators 0.34.0
virtualenv 20.21.0
vllm 0.7.1.post0+cu124.ant
wasabi 1.1.3
watchdog 6.0.0
watchfiles 1.0.4
wcwidth 0.2.13
weasel 0.3.4
webcolors 24.11.1
webencodings 0.5.1
websocket-client 1.8.0
websockets 11.0.3
Werkzeug 3.1.3
wfbuilder 1.0.56.48
wget 3.2
whatthepatch 1.0.7
wheel 0.44.0
widgetsnbextension 4.0.13
wrapt 1.17.2
xformers 0.0.28.post3
xgrammar 0.1.11
XlsxWriter 3.2.2
xtuner 0.1.23
xxhash 3.5.0
yapf 0.43.0
yarl 1.18.3
zdfs-dfs 3.0.3
zipp 3.21.0
zstandard 0.23.0
