
2 x A770 with Ollama on Linux, inference responses slow down dramatically #12852

Open
RobinJing opened this issue Feb 19, 2025 · 0 comments
Hi,
I am using the latest cpp Docker image and the latest i915 DKMS driver on Linux, running inference on 2x A770 with Ollama; the model is DeepSeek 32B int4. For the first 2 or 3 requests we get 15~20 tokens/s, but after that we only get 6-8 tokens/s. The speed pattern is also strange: generation looks quite fast at the beginning, then drops dramatically after roughly 50-100 tokens, and sometimes speeds up again after 10-20 seconds.

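For reference, the tokens/s figures above can be computed from the `eval_count` and `eval_duration` fields that Ollama's generate API returns. A minimal sketch (assuming a local Ollama server on the default port 11434; the helper names are mine):

```python
import json
import urllib.request

def tokens_per_second(eval_count, eval_duration_ns):
    """Convert Ollama's eval_count / eval_duration (nanoseconds) to tokens/s."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model, prompt, host="http://localhost:11434"):
    """Send one non-streaming /api/generate request and report decode speed."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])

# Example with hard-coded numbers: 300 tokens decoded in 20 s -> 15 tokens/s
print(tokens_per_second(300, 20_000_000_000))
```

Running `benchmark(...)` repeatedly against the same model should make the slowdown after the first 2-3 requests easy to reproduce and quantify.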
Also, vLLM is about 5x faster than Ollama on the same machine, which typically indicates a performance bug.
BR
