[Bug] Model trained with InternEvo and converted to HF format intermittently produces NaN when evaluated with OpenCompass #266

Open
Cerberous opened this issue Jul 1, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@Cerberous

Describe the bug

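After training with InternEvo and converting the checkpoint to an HF model, evaluating PPL with OpenCompass intermittently returns NaN. OpenCompass evaluates in fp16 by default; is that the cause? Switching to bf16 resolves the problem, but other HF models do not show this issue under fp16. Could it be related to use_fp32_norm? Training was done in bf16.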
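A minimal sketch of why a single bad token can turn a whole PPL score into NaN, assuming perplexity is computed as the exponential of the mean negative log-likelihood. The helper `perplexity` and the sample values are hypothetical and are not taken from OpenCompass.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for one sample."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Hypothetical per-token log-probabilities for two samples.
clean = [-0.5, -1.2, -0.3]
poisoned = [-0.5, float("nan"), -0.3]  # one token produced NaN in the forward pass

print(perplexity(clean))     # a finite value
print(perplexity(poisoned))  # NaN: the single bad token poisons the mean
```

If only some inputs drive activations out of range, only those samples would report NaN, which would be consistent with the failure appearing intermittently.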

Environment

Official image

Other information

No response

@Cerberous Cerberous added the bug Something isn't working label Jul 1, 2024
@sunpengsdu
Contributor

@SolenoidWGT please take a look at this.

@Cerberous
Author

Cerberous commented Jul 11, 2024

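Let me restate my problem. I trained with InternEvo in bf16, converted the checkpoint to HF format, and then ran inference in fp16, which produced the following error: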

Traceback (most recent call last):
  File "/InternLM/hf_test.py", line 15, in <module>
    output = model.generate(**inputs, **gen_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
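One way such a probability tensor can end up containing NaN is an overflowed logit reaching the softmax. The sketch below uses a pure-Python softmax as a stand-in for the one that produces `probs` before `torch.multinomial`; it illustrates a possible mechanism and is not a confirmed diagnosis of this model.

```python
import math


def softmax(logits):
    """Numerically stabilised softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # inf - inf evaluates to NaN
    total = sum(exps)
    return [e / total for e in exps]


finite = softmax([2.0, 1.0, 0.0])
overflowed = softmax([float("inf"), 1.0, 0.0])  # one logit overflowed to inf

print(finite)      # valid probabilities that sum to 1
print(overflowed)  # every entry is NaN, so sampling from it would be rejected
```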

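This error depends on the torch_dtype used when the model is loaded. With torch_dtype=torch.bfloat16 or torch.float32 everything works, but torch.float16 triggers the problem. My understanding is that training in bf16 and running inference in fp16 inherently introduces a numerical mismatch: bf16 has more exponent bits than fp16, so values that are representable in bf16 can overflow in fp16, for example in the attention matrix multiply, and that leads to this error. However, the official InternLM code also uses torch.float16, so I would like to ask for advice on this.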
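The exponent-width argument above can be made concrete with a short sketch. It assumes the standard layouts (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 exponent bits and 7 mantissa bits). The helpers `max_finite` and `fits_in_fp16` are hypothetical names introduced for illustration, and the sketch only shows the range gap; it does not establish which operation overflows in this model.

```python
import struct


def max_finite(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias


FP16_MAX = max_finite(5, 10)  # 65504.0
BF16_MAX = max_finite(8, 7)   # roughly 3.39e38


def fits_in_fp16(x):
    """True if x can be packed as a finite half-precision float."""
    try:
        struct.pack("<e", x)  # "e" is the IEEE 754 binary16 format code
        return True
    except OverflowError:
        return False


activation = 1e5  # hypothetical activation magnitude, well inside bf16 range
print(FP16_MAX, BF16_MAX)
print(activation < BF16_MAX, fits_in_fp16(activation))  # True False
```

A value of this size is unremarkable in bf16 but has no finite fp16 representation, which is the kind of gap the hypothesis relies on.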


4 participants