
IGPU limits the inference speed of the entire system #12828

Open
dttprofessor opened this issue Feb 14, 2025 · 5 comments

@dttprofessor

dttprofessor commented Feb 14, 2025

System: U265K + 48 GB DDR5 + B580
ENV: Running Ollama Portable Zip on Intel GPU with IPEX-LLM
GPU driver: 6559

Question: The iGPU limits the inference speed of the entire system.

| ID | Device Type | Name | Version | units | group | group size | Global mem size | Driver version |
|----|-------------|------|---------|-------|-------|------------|-----------------|----------------|
| 0 | [level_zero:gpu:0] | Intel Graphics | 12.70 | 64 | 1024 | 32 | 26769M | 1.6.31441 |
| 1 | [level_zero:gpu:1] | Intel Arc B580 Graphics | 20.1 | 160 | 1024 | 32 | 12450M | 1.6.3 |

1/ When I load deepseek-r1:7b, the iGPU loads 4 GB and the B580 loads 3.2 GB; the iGPU limits the inference speed of the entire system.

2/ When I load deepseek-r1:32b, the iGPU loads 15.7 GB and the B580 loads 8.4 GB; the CPU is not used.

3/ When I shut off the iGPU and load deepseek-r1:32b, the B580 loads 25 GB and the CPU is not used. The large model is stuck and cannot perform inference.

@sgwhat
Contributor

sgwhat commented Feb 17, 2025

Setting ONEAPI_DEVICE_SELECTOR="level_zero:1" to enable only the B580 could help with inference speed, but I don't think the B580 has enough VRAM to load deepseek-r1:32b.
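For reference, a minimal sketch of applying this when launching the server from a Linux shell (assuming the portable zip's Ollama binary is on the path; on Windows, `set` would replace `export`):

```bash
# Restrict the SYCL / Level Zero runtime to device index 1 (the Arc B580),
# using the device ordering from the listing above (0 = iGPU, 1 = B580).
export ONEAPI_DEVICE_SELECTOR="level_zero:1"

# Start the Ollama server; only the B580 should now be visible to IPEX-LLM.
ollama serve
```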

@dttprofessor
Author

> Setting ONEAPI_DEVICE_SELECTOR="level_zero:1" to enable only the B580 could help with inference speed, but I don't think the B580 has enough VRAM to load deepseek-r1:32b.

Isn't SYCL CPU+GPU hybrid inference enabled by default? Do I need to set it up manually?

@sgwhat
Contributor

sgwhat commented Feb 18, 2025

> Isn't SYCL CPU+GPU hybrid inference enabled by default? Do I need to set it up manually?

No it's not, and CPU+B580 hybrid could be slower than iGPU+B580.
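One way to double-check which devices the runtime actually exposes (and therefore what the selector is doing) is `sycl-ls` from the oneAPI Base Toolkit; a small sketch, assuming it is installed and on the path:

```bash
# Without a selector, both the iGPU and the B580 are listed, and the model
# can be split across them.
sycl-ls

# With the selector exported, only the B580 (device index 1) should appear.
export ONEAPI_DEVICE_SELECTOR="level_zero:1"
sycl-ls
```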

@dttprofessor
Author

> No it's not, and CPU+B580 hybrid could be slower than iGPU+B580.

OK!
However, when I turned off the iGPU, the entire 32B model was loaded on the B580, the CPU was not used at all, and the model became almost unusable.

@sgwhat
Contributor

sgwhat commented Feb 19, 2025

The VRAM of the B580 is insufficient to load a 32B model. You can continue running the model using the iGPU + B580.
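If the selector was previously restricted to the B580, a small sketch of returning to the iGPU + B580 setup (unsetting the variable is the simple option; the comma-separated index form is my assumption about the ONEAPI_DEVICE_SELECTOR syntax):

```bash
# Option 1: drop the restriction so the runtime sees every Level Zero device again.
unset ONEAPI_DEVICE_SELECTOR

# Option 2: explicitly expose both the iGPU (0) and the B580 (1).
export ONEAPI_DEVICE_SELECTOR="level_zero:0,1"

ollama serve
```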
