the result of Ultravox-v0.5-LLaMA-3.1-8B #11

Open
Gpwner opened this issue Mar 4, 2025 · 10 comments

Gpwner commented Mar 4, 2025

I have tested Ultravox-v0.5-LLaMA-3.1-8B too, but my test results are slightly different from yours, especially on the SD-QA dataset.

| | AlpacaEval | CommonEval | SD-QA | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- |
| Task | Open-Ended QA | Open-Ended QA | Reference-Based QA | Multiple-Choice QA | Instruction Following | Safety |
| Samples | 199 | 200 | 553 | 455 | 345 | 520 |
| Ultravox-v0.5 LLaMA-3.1-8B Instruct | 4.75 | 4.08 | 72.42 | 69.01 | 68.05 | 98.84 |



Gpwner commented Mar 4, 2025

The last number for SD-QA is (panda + gpt) / 2, right?

MatthewCYM (Owner) commented

Hi, I just noticed that using qa_metrics==0.2.17 (as specified in the requirements) results in a Panda score of 47.74, which aligns with the current performance reported on the leaderboard. However, when using the latest version, qa_metrics==0.2.30, the Panda score unexpectedly jumps to 74.50. I'm currently investigating this discrepancy.
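
For anyone reproducing this, a minimal sketch of how the Panda component can be recomputed under a pinned qa_metrics version. The PEDANT import path and the evaluate(references, candidate, question) call are taken from the VoiceBench traceback quoted later in this thread; the item keys ('reference', 'response', 'prompt') and the boolean return value are assumptions, not verified against the qa_metrics source.

```python
# Sketch: recompute the Panda match rate over the SD-QA results under
# whichever qa_metrics version is installed (e.g. 0.2.17 vs 0.2.30).
from qa_metrics.pedant import PEDANT  # import path assumed from the traceback below


def panda_score(data):
    """Mean Panda match rate (in %) over SD-QA items.

    `data` is assumed to be a list of dicts with 'reference',
    'response', and 'prompt' keys, mirroring src/evaluator/qa.py.
    """
    pedant = PEDANT()
    matches = [
        pedant.evaluate(
            [item['reference'].lower()],
            item['response'].lower(),
            item['prompt'].lower(),
        )
        for item in data
    ]
    # evaluate() is assumed to return a boolean match decision.
    return sum(matches) / len(matches) * 100
```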

MatthewCYM (Owner) commented

Could you please check if you observe the same discrepancy?


Gpwner commented Mar 13, 2025

I am using qa_metrics==0.2.24. When I install qa_metrics==0.2.17 and run the evaluator, I get this error:

    panda_results = [self.pedant.evaluate([item['reference'].lower()], item['response'].lower(), item['prompt'].lower()) for item in data]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/extendisk1/usr/bencheval/VoiceBench/src/evaluator/qa.py", line 22, in <listcomp>
    panda_results = [self.pedant.evaluate([item['reference'].lower()], item['response'].lower(), item['prompt'].lower()) for item in data]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/qa_metrics/pedant.py", line 267, in evaluate
    output = self.model.predict(result)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 451, in predict
    scores = self.decision_function(X)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 432, in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/base.py", line 626, in _validate_data
    self._check_n_features(X, reset=reset)
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/base.py", line 415, in _check_n_features
    raise ValueError(
ValueError: X has 82735 features, but SGDClassifier is expecting 82768 features as input.
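
A small diagnostic sketch for this error, assuming the PEDANT object keeps its sklearn classifier on a `model` attribute as the traceback suggests; `n_features_in_` is the standard sklearn attribute for the expected input width. A mismatch between the vectorized input and this number usually points at featurizer/classifier files coming from different package releases.

```python
# Diagnostic sketch: print the installed qa_metrics version and the feature
# width the bundled SGDClassifier expects. "X has 82735 features, but
# SGDClassifier is expecting 82768" means the text featurizer and the
# classifier were not produced by the same release.
from importlib.metadata import version
from qa_metrics.pedant import PEDANT  # import path assumed from the traceback

print("qa_metrics version:", version("qa_metrics"))

pedant = PEDANT()
# `model` is an internal attribute visible in the traceback and may change
# between releases; n_features_in_ is standard on fitted sklearn estimators.
print("classifier expects", pedant.model.n_features_in_, "input features")
```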


Gpwner commented Mar 13, 2025

By the way, when I evaluate IFEval, is the final score calculated as (strict-prompt + strict-instruction + loose-prompt + loose-instruction) / 4? Thanks.

MatthewCYM (Owner) commented

Yes, it's calculated as (strict-prompt + strict-instruction + loose-prompt + loose-instruction) / 4.
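
For reference, the aggregation in code (a trivial sketch; the sub-scores in the usage line are hypothetical):

```python
# IFEval final score: the plain mean of the four sub-scores, as confirmed above.
def ifeval_score(strict_prompt, strict_instruction, loose_prompt, loose_instruction):
    return (strict_prompt + strict_instruction + loose_prompt + loose_instruction) / 4


# Hypothetical sub-scores, for illustration only:
print(ifeval_score(0.70, 0.78, 0.72, 0.80))  # 0.75
```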

MatthewCYM (Owner) commented

According to zli12321/qa_metrics#2, the panda score calculation method changed in version 0.2.17 and later, explaining the observed score discrepancy.

zli12321 commented

The Panda score is tuned to be more correlated with GPT-4 in version 0.2.17 and later (0.2.30 is currently the latest). The jump in score is because Panda gives a more relaxed match than the older version.
The old Panda accuracy is 0.47 while GPT-4 is 0.68; the new Panda accuracy is 0.76, which is closer to GPT-4's.


Gpwner commented Mar 14, 2025

So, will the leaderboard be updated to use the latest qa_metrics==0.2.30? @MatthewCYM

zli12321 commented

The correlation is closer to LLM evaluation, but since the score is averaged with the GPT score, the overall rankings on the existing leaderboard will mostly remain the same.
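
To make that concrete, a small arithmetic sketch: with SD-QA reported as (panda + gpt) / 2, a roughly uniform jump in the Panda component shifts every model's SD-QA score by about the same amount, so relative rankings move little. The GPT value below is hypothetical; only the two Panda numbers come from this thread.

```python
# SD-QA combines the two judges as (panda + gpt) / 2.
def sdqa_score(panda, gpt):
    return (panda + gpt) / 2


gpt = 68.0                    # hypothetical GPT-based score, not from this thread
old = sdqa_score(47.74, gpt)  # Panda under qa_metrics==0.2.17 -> 57.87
new = sdqa_score(74.50, gpt)  # Panda under qa_metrics==0.2.30 -> 71.25
print(round(new - old, 2))    # +13.38; a similar offset for every model leaves rankings mostly intact
```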
