The error between LLM-viewer predicted results and TensorRT-LLM real performance is large. #4
Comments
You make a very good point. The time estimated by the roofline model is the fastest the hardware could possibly go: a theoretical lower bound on latency, not a prediction of the measured time. Our goal is to help people understand the factors that most affect how fast LLMs run on a given piece of hardware, so comparing how different settings change the estimated time is still useful. But the exact numbers it reports will always be faster than what really happens, because they describe the best possible case. Perhaps we should add a note reminding users that the report only shows the ceiling a perfect implementation could reach; real software never hits it, so the real hardware will always be somewhat slower. The main value of the tool is in showing how the different parts of an LLM contribute to the overall cost. Even if the absolute time is off, seeing how those parts interact gives good intuition, and your question is a helpful reminder that the output is a limit, not what will really happen every time.
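For concreteness, here is a minimal sketch of the kind of lower-bound estimate a roofline model gives (illustrative only; the function name and the hardware peak numbers are assumptions for this example, not LLM-Viewer's actual code):

```python
# Minimal roofline lower-bound sketch (not LLM-Viewer's actual implementation).

def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bandwidth: float) -> float:
    """Theoretical best-case time for one operator.

    The operator can never finish faster than its compute time
    (flops / peak_flops) or its memory time (bytes_moved / peak_bandwidth),
    so the lower bound is the larger of the two.
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)

# Example: one FP16 GEMM with M = N = K = 4096 on an A100-like device
# (peaks below are approximate datasheet values, used only for illustration).
peak_flops = 312e12       # ~312 TFLOPS FP16 Tensor Core peak
peak_bandwidth = 2.0e12   # ~2 TB/s HBM bandwidth

t_lower_bound = roofline_time(flops=2 * 4096 * 4096 * 4096,
                              bytes_moved=3 * 4096 * 4096 * 2,  # A, B, C in FP16
                              peak_flops=peak_flops,
                              peak_bandwidth=peak_bandwidth)
# Real kernels reach only a fraction of these peaks, so any measured time
# will be greater than or equal to t_lower_bound.
```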
I understand that software often fails to fully utilize HBM bandwidth or max out the Tensor Cores. I'm attempting to use this project to compare the performance of an LLM inference task on two types of hardware, but I often obtain results from LLM-Viewer that contradict the actual measurements. Please forgive me for not providing precise results, as some of the hardware information is confidential. I am currently using a very naive roofline model in my own project, so I'm particularly curious about the purpose of this one:
I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers published by NVIDIA.
The estimated generation throughput is higher than the measured throughput.
The estimated prefill time is lower than the measured prefill time.
This error makes it impossible to use LLM-Viewer to predict how two hardware devices will compare on the same task.
I feel that estimating precise computation time with an operator-level roofline model is very unreliable, and I would like to hear your opinion @hahnyuan .
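As a toy illustration of my concern (all numbers below are made-up placeholders, not my confidential measurements): if the two devices reach different fractions of their theoretical peaks, the roofline bounds can rank them in the opposite order from the real measurements.

```python
# Toy illustration with hypothetical numbers: the roofline lower bound can
# rank two devices differently from real measurements when their achievable
# efficiency (fraction of peak actually reached by the software stack) differs.

def modeled_real_time(roofline_bound: float, efficiency: float) -> float:
    """Crude model: real time = theoretical lower bound / achieved efficiency."""
    return roofline_bound / efficiency

# Hypothetical device A: faster on paper, but its software reaches only 40% of peak.
t_a = modeled_real_time(roofline_bound=1.0, efficiency=0.40)   # -> 2.50
# Hypothetical device B: slower on paper, but a mature stack reaches 70% of peak.
t_b = modeled_real_time(roofline_bound=1.3, efficiency=0.70)   # -> ~1.86

# The bounds say A is faster (1.0 < 1.3), yet the modeled real times say B is
# faster, which is why the bound alone may not be enough to compare hardware.
```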