Inconsistent Behavior with max_tokens, Post-Processing, and Cache Options #2702

Open
ntlm1686 opened this issue Feb 15, 2025 · 1 comment

@ntlm1686

I've encountered several issues when running MMLU-Pro benchmarks that may affect reproducibility and expected behavior:

  1. Default max_tokens Handling:

    • Observation: When the max_tokens argument is not provided in model_args or gen_kwargs, the current implementation still sends a max_tokens value to the API.
      max_gen_toks: int = 256,
    • Expected Behavior: If no max_tokens is specified, no max_tokens value should be sent to the API, allowing the API to use its own default.
    • Impact: The model may not finish its answer within this limit, and the default is buried deeply enough in the code that it easily confuses users. Silently applying a default value here is surprising. (A workaround sketch is shown after this list.)
  2. Post-Process Function Inconsistency:

    • Observation: The post-processing function in the MMLU-Pro template does not match the official implementation as found here.
    • Expected Behavior: The template should be consistent with the official implementation to ensure consistent evaluation results.
    • Impact: Divergence in post-processing can lead to discrepancies in benchmark results and potentially incorrect interpretations of model performance.
  3. Confusing Cache Arguments (--use_cache and --cache_requests):

    • Observation: The --cache_requests flag, when enabled, caches the payloads of requests, but it does so in a location indicated by the --use_cache argument.
    • Expected Behavior: The naming and behavior of these arguments should be more intuitive. Ideally, --cache_requests should directly imply where and how the caching is done, or the two flags should be unified/renamed for clarity.
    • Impact: The current setup can confuse users about where cached data is stored and how caching is managed, leading to potential misuse or debugging difficulties.
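
As a workaround for item 1, the generation limit can be set explicitly so the hidden 256-token fallback never applies. The following is a minimal sketch using the Python entry point, assuming a recent lm-evaluation-harness release in which model_args is parsed as a comma-separated key=value string and an OpenAI-compatible completions backend; exact argument names may vary by version and backend, so verify them against your installed version.

    # Sketch only: explicitly override the generation limit instead of relying
    # on the buried max_gen_toks=256 fallback. Argument names assume a recent
    # lm-evaluation-harness release; check your installed version.
    from lm_eval import simple_evaluate

    results = simple_evaluate(
        model="local-completions",
        # max_gen_toks is the harness-side limit that gets forwarded to the
        # API as max_tokens; setting it explicitly avoids the silent default.
        model_args=(
            "model=my-model,"  # hypothetical model name
            "base_url=http://localhost:8000/v1/completions,"
            "max_gen_toks=4096"
        ),
        tasks=["mmlu_pro"],
    )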

Suggested Fixes

  • max_tokens: Modify lm_eval to omit the max_tokens parameter when it is not explicitly provided, letting the API apply its own default.
  • Post-Process Function: Update the MMLU-Pro template so its post-process function matches the official implementation (see the sketch below).
  • Cache Options: Review and refactor the caching arguments for clarity—consider renaming or consolidating the flags to reduce confusion.
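
For the post-process fix, the official MMLU-Pro repository extracts the predicted option with a regex pass over the generated text plus a looser fallback. The sketch below only illustrates that shape; the authoritative patterns are the ones in the linked official implementation and should be copied from there rather than from this sketch.

    import re
    from typing import Optional

    # Illustrative only: these patterns approximate the style of the official
    # MMLU-Pro answer extraction (an "answer is (X)" pass with a looser
    # fallback over option letters A-J). Copy the exact patterns from the
    # official repo when updating the template.
    def extract_answer(text: str) -> Optional[str]:
        match = re.search(r"answer is \(?([A-J])\)?", text)
        if match:
            return match.group(1)
        # Fallback: take the last standalone option letter in the response.
        letters = re.findall(r"\b([A-J])\b", text)
        return letters[-1] if letters else None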
@baberabb
Contributor

Hi! Thanks for the feedback! Couple of thoughts:

  1. max_gen_toks is usually defined in the task config, and we set 256 as a default fallback throughout the library. This also ensures that users do not waste API credits unnecessarily. I'll see about adding a warning or making this more explicit.
  2. Good catch! Would you be willing to make a PR?
  3. These two caching arguments serve distinct purposes. use_cache stores model outputs during generation, so if a run is interrupted we can resume from the last completed sample rather than regenerating earlier ones. Meanwhile, cache_requests stores the preprocessed inputs, allowing a faster restart of the evaluation (a usage sketch follows below). I'll update the interface doc to make this clear.
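
To make the distinction concrete, here is a sketch of how the two options sit side by side in the Python entry point, assuming the simple_evaluate signature of recent releases (use_cache takes a path prefix for a SQLite cache of model outputs, cache_requests toggles caching of the preprocessed requests); check lm_eval --help or the signature in your installed version before relying on these names.

    # Sketch, not authoritative: argument names assume a recent
    # lm-evaluation-harness release.
    from lm_eval import simple_evaluate

    results = simple_evaluate(
        model="hf",
        model_args="pretrained=my-org/my-model",  # hypothetical model name
        tasks=["mmlu_pro"],
        # Caches model *outputs* in a SQLite file under this path prefix, so
        # an interrupted run resumes without regenerating finished samples.
        use_cache="./lm_cache/mmlu_pro",
        # Caches the *preprocessed request payloads*, so a restarted run
        # skips the potentially slow request-building step.
        cache_requests=True,
    )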
