Version 3.0: multi-node sequence-parallel training hangs at the first step; possibly a Linux kernel version issue? #3115

yty3805595 opened this issue Feb 14, 2025 · 1 comment



yty3805595 commented Feb 14, 2025

export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64

swift sft \
    --model_type qwen2_5 \
    --model $model_dir \
    --train_type full \
    --torch_dtype bfloat16 \
    --output_dir $output_dir \
    --ddp_backend nccl \
    --dataset $data_path \
    --dataloader_num_workers 0 \
    --num_train_epochs 3 \
    --max_length 12000 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 1 \
    --enable_cache true \
    --weight_decay 0.1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 4 \
    --max_grad_norm 1.0 \
    --warmup_ratio 0.1 \
    --save_steps 10000 \
    --eval_steps 10000 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --save_only_model true \
    --deepspeed zero3 \
    --attn_impl flash_attn \
    --report_to none \
    --ddp_timeout 1800000000 2>&1 | tee $log_file

/trainers/ FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Seq2SeqTrainer.__init__. Use processing_class instead.
Detected kernel version 4.9.151, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The logging file will be saved in: /mnt4/ckpt/yintaoye.yty/MAPMixArrange-antfinix-72B-32K-cosine-swift-20250214-long-cot-withmap/v7-20250214-173448/logging.jsonl
Parameter Offload: Total persistent parameters: 2138112 in 401 params
Train: 0%| | 0/177 [00:00<?, ?it/s]

With the same parameters, swift 2.6 trains normally.
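For anyone trying to reproduce this, a minimal sketch for checking whether a node's kernel falls below the 5.5.0 minimum quoted in the warning above. The helper names `parse_kernel_version` and `is_below_minimum` are hypothetical and not part of swift or any library; the sketch only compares version numbers and does not show that the kernel is the cause of the hang.

```python
import platform
import re

# Minimum kernel version quoted in the warning in the log above.
MIN_KERNEL = (5, 5, 0)


def parse_kernel_version(release: str) -> tuple:
    """Extract the leading major.minor[.patch] numbers from a kernel release string.

    Distribution suffixes such as "-91-generic" are ignored; a missing patch
    component is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def is_below_minimum(release: str, minimum: tuple = MIN_KERNEL) -> bool:
    """Return True if the kernel release is older than the given minimum."""
    return parse_kernel_version(release) < minimum


if __name__ == "__main__" and platform.system() == "Linux":
    release = platform.release()  # same string that `uname -r` prints
    status = "below" if is_below_minimum(release) else "at or above"
    print(f"kernel {release} is {status} the 5.5.0 minimum")
```

Running this on every node of a multi-node job shows whether the machines differ in kernel version, which is worth ruling out before comparing swift 2.6 and 3.x behaviour.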




Package Version Editable project location

absl-py 2.1.0
accelerate 1.3.0
addict 2.4.0
aenum 3.1.15
aiofiles 23.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiohttp-cors 0.7.0
aiosignal 1.3.1
airportsdata 20241001
aistudio-checkpoint 0.1.250123
aistudio-notebook 2.0.131
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-core-v3 2.13.33
aliyun-python-sdk-kms 2.16.5
alluxio 0.1.2
altair 5.5.0
anaconda-anon-usage 0.5.0
annotated-types 0.7.0
antlr4-python3-runtime 4.13.2
anyio 4.7.0
archspec 0.2.3
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
arxiv 2.1.3
astor 0.8.1
astroid 3.3.6
asttokens 3.0.0
async-lru 2.0.4
async-timeout 5.0.1
attrdict 2.0.1
attrs 24.2.0
autopep8 2.0.4
babel 2.16.0
beautifulsoup4 4.12.3
binpacking 1.5.2
bitsandbytes 0.45.1
blake3 1.0.4
bleach 6.2.0
blessed 1.20.0
blinker 1.9.0
blis 0.7.11
boltons 23.0.0
Brotli 1.0.9
cachetools 5.5.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2024.8.30
cffi 1.17.1
chardet 5.2.0
charset-normalizer 2.0.4
click 8.1.8
cloudpathlib 0.16.0
cloudpickle 3.1.0
colorama 0.4.6
coloredlogs 15.0.1
colorful 0.5.6
colorlog 6.9.0
comm 0.2.2
compressed-tensors 0.9.0
concurrent-log-handler 0.9.20
conda 24.11.1
conda-content-trust 0.2.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.4.0
conda_package_streaming 0.11.0
confection 0.1.5
configparser 7.1.0
contourpy 1.3.1
coverage 7.6.12
cpm-kernels 1.0.11
crcmod 1.7
cryptography 43.0.3
cycler 0.12.1
cymem 2.0.11
dacite 1.9.2
dataclasses-json 0.6.7
DataProperty 1.1.0
datasets 3.0.1
debugpy 1.8.9
decorator 5.1.1
decord 0.6.0
deepspeed 0.15.4
defusedxml 0.7.1
deprecation 2.1.0
depyf 0.18.0
diffusers 0.32.2
dill 0.3.8
diskcache 5.6.3
distlib 0.3.9
distro 1.9.0
docstring-to-markdown 0.15
duckduckgo_search 5.3.1b1
editdistance 0.8.1
einops 0.8.0
entrypoints 0.4
et_xmlfile 2.0.0
evalscope 0.10.1
evaluate 0.4.3
exceptiongroup 1.2.2
execnet 2.1.1
executing 2.1.0
fastapi 0.115.6
fastjsonschema 2.21.1
fe 0.3.33
feedparser 6.0.11
ffmpy 0.4.0
filelock 3.17.0
fire 0.7.0
flake8 7.1.1
flash_attn 2.7.4.post1
flatbuffers 24.3.25
fonttools 4.55.3
fqdn 1.5.1
frozendict 2.4.2
frozenlist 1.5.0
fsspec 2024.2.0
func_timeout 4.3.5
future 1.0.0
fuzzywuzzy 0.18.0
gguf 0.10.0
gitdb 4.0.12
GitPython 3.1.44
google-api-core 2.24.1
google-auth 2.36.0
googleapis-common-protos 1.66.0
gpustat 1.1.1
gradio 5.16.0
gradio_client 1.7.0
griffe 0.49.0
grpcio 1.68.1
grpcio-channelz 1.34.0
h11 0.14.0
h2 4.2.0
h5py 3.12.1
hf_transfer 0.1.9
hjson 3.1.0
hpack 4.1.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.28.1
human-eval 1.0.3
humanfriendly 10.0
hyperframe 6.1.0
idna 3.7
imageio 2.37.0
immutabledict 4.2.1
immutables 0.20
importlib_metadata 8.6.1
iniconfig 2.0.0
inquirerpy 0.3.4
interegular 0.3.3
ipykernel 6.29.5
ipython 8.30.0
ipython-genutils 0.2.0
ipywidgets 8.1.5
isoduration 20.11.0
isort 5.13.2
jedi 0.19.2
jedi-language-server 0.42.0
jieba 0.42.1
Jinja2 3.1.5
jinjasql 0.1.8
jiter 0.8.2
jmespath 0.10.0
joblib 1.4.2
json5 0.10.0
jsonlines 4.0.0
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.2
jupyter_server_terminals 0.5.3
jupyterlab 4.3.5
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.13
kiwisolver 1.4.7
kubemaker 0.2.17
kubernetes 9.0.0
lagent 0.2.4
langcodes 3.5.0
language_data 1.3.0
lark 1.2.2
latex2sympy2_extended 1.0.4
lazy_loader 0.4
Levenshtein 0.26.1
libmambapy 1.5.8
libro-core 0.1.6
libro-server 0.1.10
liger_kernel 0.5.2
lighteval 0.7.0
linkify-it-py 2.0.3
llm-blender 0.0.2
lm-format-enforcer 0.10.9
lsprotocol 2023.0.1
lxml 5.3.0
marisa-trie 1.2.1
Markdown 3.7
markdown-it-py 2.2.0
MarkupSafe 2.1.5
marshmallow 3.26.1
math-verify 0.5.1
matplotlib 3.9.3
matplotlib-inline 0.1.7
maya-tools 0.0.4
mbstrdecoder 1.1.4
mccabe 0.7.0
mdit-py-plugins 0.3.3
mdurl 0.1.2
menuinst 2.2.0
mergekit 0.0.6
mistral_common 1.5.2
mistune 0.8.4
mmengine 0.10.6
mmengine-lite 0.10.6
modelscope 1.22.3
mpmath 1.3.0
ms-opencompass 0.1.6
ms-swift 3.2.0.dev0 /mnt4/code/yintaoye.yty/ms-swift-0213
ms-vlmeval 0.0.13
msgpack 1.1.0
msgspec 0.19.0
multidict 6.1.0
multiprocess 0.70.16
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.17.0
nbclient 0.5.13
nbconvert 6.4.4
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.2.1
nltk 3.9.1
notebook 6.4.6
notebook_shim 0.2.4
numpy 1.26.4
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
oauthlib 3.2.2
omegaconf 2.3.0
open-r1 0.1.0.dev0 /mnt4/code/yintaoye.yty/open-r1-main/src
openai 1.58.1
OpenCC 1.1.9
opencensus 0.11.4
opencensus-context 0.1.3
openpyxl 3.1.5
orjson 3.10.12
oss2 2.6.0
osscmd 0.4.5
outlines 0.1.11
outlines_core 0.1.26
overrides 7.7.0
packaging 24.2
pandas 2.0.1
pandocfilters 1.5.1
parameterized 0.9.0
parso 0.8.4
pathvalidate 3.2.3
peft 0.14.0
peppercorn 0.6
pexpect 4.9.0
pfzy 0.3.4
phx-class-registry 4.1.0
pillow 10.4.0
pip 24.2
platformdirs 3.11.0
plotly 6.0.0
pluggy 1.5.0
ply 3.11
portalocker 3.0.0
preshed 3.0.9
prettytable 3.14.0
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.0.2
prompt_toolkit 3.0.48
propcache 0.2.1
proto-plus 1.26.0
protobuf 3.20.3
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
py-spy 0.4.0
pyaml 21.10.1
pyarrow 18.1.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybind11 2.13.6
pycodestyle 2.12.1
pycosat 0.6.6
pycountry 24.6.1
pycparser 2.21
pycryptodome 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydeck 0.9.1
pydocstyle 6.3.0
pydub 0.25.1
pyflakes 3.2.0
pygls 1.3.1
Pygments 2.18.0
pyhocon 0.3.61
pyinotify 0.9.6
pylint 3.3.2
Pympler 1.1
pyOpenSSL 24.3.0
pyparsing 3.2.0
pypinyin 0.53.0
PySocks 1.7.1
pytablewriter 1.2.1
pytest 8.3.4
pytest-cov 6.0.0
pytest-rerunfailures 15.0
pytest-xdist 3.6.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-json-logger 3.2.0
python-Levenshtein 0.26.1
python-lsp-jsonrpc 1.1.2
python-lsp-server 1.12.0
python-multipart 0.0.19
python-prctl 1.8.1
pytoolconfig 1.3.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
rank-bm25 0.2.2
RapidFuzz 3.12.1
ray 2.9.0
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
requests-toolbelt 1.0.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.9.4
rope 1.13.0
rouge 1.0.1
rouge-chinese 1.0.3
rouge_score 0.1.2
rpds-py 0.22.3
rsa 4.9
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
ruff 0.9.4
ruff-lsp 0.0.59
sacrebleu 2.5.1
safehttpx 0.1.6
safetensors 0.5.2
scikit-image 0.25.1
scikit-learn 1.6.1
scipy 1.15.1
seaborn 0.13.2
semantic-version 2.10.0
Send2Trash 1.8.3
sentence-transformers 3.4.1
sentencepiece 0.2.0
setuptools 69.5.1
setuptools-scm 8.1.0
sgmllib3k 1.0.0
shellingham 1.5.4
simple-ddl-parser 1.7.1
simplejson 3.19.3
six 1.17.0
smart-open 6.4.0
smmap 5.0.2
sniffio 1.3.1
snowballstemmer 2.2.0
socksio 1.0.0
sortedcontainers 2.4.0
soupsieve 2.6
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
sqlparse 0.5.3
srsly 2.5.1
stack-data 0.6.3
starlette 0.41.3
streamlit 1.42.0
sty 1.0.6
sympy 1.13.1
tabledata 1.3.4
tabulate 0.9.0
tcolorpy 0.1.7
tenacity 9.0.0
tensorboard 2.19.0
tensorboard-data-server 0.7.2
termcolor 2.3.0
terminado 0.18.1
testpath 0.6.0
thinc 8.2.5
threadpoolctl 3.5.0
tifffile 2025.1.10
tiktoken 0.7.0
timeout-decorator 0.5.0
tinycss2 1.4.0
tokenizers 0.21.0
toml 0.10.2
tomli 2.2.1
tomlkit 0.13.2
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.48.2
transformers-stream-generator 0.0.5
triton 3.1.0
trl 0.15.0.dev0 /mnt4/code/yintaoye.yty/trl
truststore 0.8.0
typepy 1.3.4
typer 0.15.1
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2024.2
uc-micro-py 1.0.3
ujson 5.10.0
uri-template 1.3.0
urllib3 1.26.4
uvicorn 0.32.1
uvloop 0.21.0
validators 0.34.0
virtualenv 20.21.0
vllm 0.7.1.post0+cu124.ant
wasabi 1.1.3
watchdog 6.0.0
watchfiles 1.0.4
wcwidth 0.2.13
weasel 0.3.4
webcolors 24.11.1
webencodings 0.5.1
websocket-client 1.8.0
websockets 11.0.3
Werkzeug 3.1.3
wget 3.2
whatthepatch 1.0.7
wheel 0.44.0
widgetsnbextension 4.0.13
wrapt 1.17.2
xformers 0.0.28.post3
xgrammar 0.1.11
XlsxWriter 3.2.2
xtuner 0.1.23
xxhash 3.5.0
yapf 0.43.0
yarl 1.18.3
zdfs-dfs 3.0.3
zipp 3.21.0
zstandard 0.23.0
