the result of Ultravox-v0.5-LLaMA-3.1-8B #11
Comments
The last point of the SD-QA is |
Hi, I just noticed that using |
Could you please check if you observe the same discrepancy?
I am using qa_metrics==0.2.24. When I install qa_metrics==0.2.17 and run the evaluator, I get this error:
|
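To narrow down which installed version is producing which number, a minimal sketch like the one below may help. It assumes the PEDANT/PANDA interface described in the qa_metrics README (`PEDANT().evaluate(reference, candidate, question)`); the exact class and method names may differ between 0.2.17 and 0.2.24, and the question/answer strings are hypothetical, not taken from SD-QA.

```python
# Sketch: print the active qa_metrics version and run one answer match.
# Assumes the PEDANT interface from the qa_metrics README; treat the
# import path and method signature as assumptions, not a guarantee.
from importlib.metadata import version
from qa_metrics.pedant import PEDANT

print("qa_metrics version:", version("qa_metrics"))

pedant = PEDANT()
question = "What is the capital of France?"   # hypothetical example
reference = "Paris"
candidate = "The capital is Paris."

# Borderline pairs like this are where the relaxed matcher in >=0.2.17
# is most likely to disagree with older releases.
print("match:", pedant.evaluate(reference, candidate, question))
```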
By the way, when I evaluate IFEval, is the last point calculated by |
Yes, it's calculated as |
According to zli12321/qa_metrics#2, the PANDA score calculation method changed in version 0.2.17 and later, explaining the observed score discrepancy.
The PANDA score is tuned to be more correlated with GPT-4 in version 0.2.17 and later (0.2.30 is currently the latest). The jump in score is because PANDA gives a more relaxed match than the older version.
So, will the leaderboard be updated to use the latest version of qa_metrics?
The correlation is closer to LLM evaluation, but since the scores average out, the overall rankings on the existing leaderboard will mostly remain the same.
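As a toy illustration (made-up numbers, not actual leaderboard data) of why a uniformly more relaxed matcher lifts every model's average while typically leaving the ordering intact:

```python
# Toy numbers only: two models scored per example under a strict matcher
# (pre-0.2.17 behaviour) and a relaxed one (0.2.17+). Both averages rise,
# but the relative ranking is unchanged.
strict = {"model_a": [1, 0, 1, 0, 1], "model_b": [1, 0, 0, 0, 1]}
relaxed = {"model_a": [1, 1, 1, 0, 1], "model_b": [1, 0, 1, 0, 1]}

def average(xs):
    return sum(xs) / len(xs)

for name, scores in (("strict", strict), ("relaxed", relaxed)):
    means = {m: average(s) for m, s in scores.items()}
    ranking = sorted(means, key=means.get, reverse=True)
    print(name, means, "->", ranking)
```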
I have tested Ultravox-v0.5-LLaMA-3.1-8B too, but my test results are slightly different from yours, especially on the SD-QA dataset.