I've encountered several issues when running benchmarks on MMLU-Pro that may affect reproducibility and expected behavior. The issues are as follows:
Default max_tokens Handling:
Observation: When the max_tokens argument is not provided in model_args or gen_kwargs, the current implementation still sends a max_tokens value to the API (see lm_eval/models/api_models.py, line 71 at commit 41b952f).
Expected Behavior: If no max_tokens is specified, no max_tokens value should be passed to the API, allowing the API to use its own default.
Impact: The model may not finish its answer within this limit, and the default is buried deeply enough in the code that it can easily confuse users. Silently applying a default value here is misleading.
Post-Process Function Inconsistency:
Observation: The post-processing function in the MMLU-Pro template does not match the official implementation as found here.
Expected Behavior: The template's post-processing should match the official implementation so that evaluation results are comparable (a sketch of the official extraction logic follows this list).
Impact: Divergence in post-processing can lead to discrepancies in benchmark results and potentially incorrect interpretations of model performance.
Confusing Cache Arguments (--use_cache and --cache_requests):
Observation: The --cache_requests flag, when enabled, caches the payloads of requests, but it does so in a location indicated by the --use_cache argument.
Expected Behavior: The naming and behavior of these arguments should be more intuitive. Ideally, --cache_requests should directly imply where and how the caching is done, or the two flags should be unified/renamed for clarity.
Impact: The current setup can confuse users about where cached data is stored and how caching is managed, leading to potential misuse or debugging difficulties.
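For concreteness, the official MMLU-Pro evaluation script extracts the predicted option letter with a primary regex plus fallback patterns. The sketch below paraphrases that logic from memory; the exact patterns and function layout should be verified against the linked official implementation before porting them into the template:

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Pull the predicted option letter (A-J) out of a model response.

    Paraphrased sketch of the official MMLU-Pro extraction logic:
    look for "answer is (X)" first, then "Answer: X", then fall back
    to the last standalone option letter anywhere in the text.
    """
    match = re.search(r"answer is \(?([A-J])\)?", text)
    if match:
        return match.group(1)

    match = re.search(r"[aA]nswer:\s*([A-J])", text)
    if match:
        return match.group(1)

    # Last resort: the final standalone A-J letter, if any.
    match = re.search(r"\b[A-J]\b(?!.*\b[A-J]\b)", text, re.DOTALL)
    if match:
        return match.group(0)
    return None
```

If the harness applies only the first pattern, answers that the official script would still recover through the fallbacks get scored as failures, which is exactly the kind of discrepancy described above.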
Suggested Fixes
Max Tokens: Modify lm_eval to omit the max_tokens parameter from API requests when it is not explicitly provided, letting the API apply its own default (see the sketch after this list).
Post-Process Function: Update the MMLU-Pro template to align the post-process function with the official implementation.
Cache Options: Review and refactor the caching arguments for clarity—consider renaming or consolidating the flags to reduce confusion.
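To illustrate the first fix, the request payload can be assembled so that max_tokens is only forwarded when the user actually set it. This is a minimal sketch with hypothetical helper names, not the existing api_models.py code:

```python
from typing import Any, Optional


def build_completion_payload(
    prompt: str,
    model: str,
    max_tokens: Optional[int] = None,  # None means the user did not set a limit
    **extra_gen_kwargs: Any,
) -> dict:
    """Build an API request body, omitting max_tokens unless explicitly given.

    Hypothetical helper: the real change would live wherever lm_eval
    translates gen_kwargs into the provider's request format.
    """
    payload: dict = {"model": model, "prompt": prompt, **extra_gen_kwargs}
    if max_tokens is not None:
        # Forward the limit only when the user asked for one; otherwise let
        # the API apply its own default instead of a hard-coded fallback.
        payload["max_tokens"] = max_tokens
    return payload


# With no explicit limit, the key is simply absent from the request body.
assert "max_tokens" not in build_completion_payload("2+2=", model="some-model")
```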
max_gen_toks is usually defined in the task config, and we set 256 as a default fallback throughout the library. This also ensures that users do not waste API credits unnecessarily. I'll see about adding a warning or making this more explicit.
Good catch! Would you be willing to make a PR?
These two caching arguments serve distinct purposes. use_cache stores the model outputs during generation, so if an interruption occurs, we can resume from the last sample rather than regenerating all previous samples. Meanwhile, cache_requests stores the preprocessed inputs, allowing for faster restart of the evaluation. I'll update the interface doc to make this clear.
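To make the distinction concrete, here is a sketch of how the two options might be combined through the Python entry point. Argument names are assumed to mirror the CLI flags; verify against the simple_evaluate signature in your installed version:

```python
import lm_eval

# use_cache: path prefix for a SQLite database holding model *outputs*, so an
#   interrupted run can resume from the last completed sample.
# cache_requests: caches the *preprocessed request payloads*, so re-running
#   the same task skips the (potentially slow) request-building step.
results = lm_eval.simple_evaluate(
    model="local-completions",  # assumed API-backed model type for this example
    model_args="model=my-model,base_url=http://localhost:8000/v1/completions",
    tasks=["mmlu_pro"],
    use_cache="./lm_cache/mmlu_pro_run",  # where the output cache lives
    cache_requests=True,                  # enable the request-payload cache
)
```

Documenting the two flags side by side like this (model outputs vs. preprocessed requests) would resolve most of the confusion described above.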