I've encountered several issues when running benchmarks on MMLU-Pro that may affect reproducibility and expected behavior. The issues are as follows:
Default max_tokens Handling:
Observation: When the max_tokens argument is not provided in model_args or gen_kwargs, the current implementation still sends a max_tokens value to the API (see lm_eval/models/api_models.py, line 71 at commit 41b952f).
Expected Behavior: If no max_tokens is specified, no max_tokens value should be passed to the API, allowing the API to use its own default.
Impact: The model may not finish its answer within this limit, and the default is buried deeply enough in the code that it can easily confuse users. Silently applying a default value here is misleading.
Post-Process Function Inconsistency:
Observation: The post-processing function in the MMLU-Pro template does not match the official implementation as found here.
Expected Behavior: The template's post-processing should match the official implementation so that evaluation results are comparable (a sketch of the official extraction logic follows this list).
Impact: Divergence in post-processing can lead to discrepancies in benchmark results and potentially incorrect interpretations of model performance.
Confusing Cache Arguments (--use_cache and --cache_requests):
Observation: The --cache_requests flag, when enabled, caches the payloads of requests, but it does so in a location indicated by the --use_cache argument.
Expected Behavior: The naming and behavior of these arguments should be more intuitive. Ideally, --cache_requests should directly imply where and how the caching is done, or the two flags should be unified/renamed for clarity.
Impact: The current setup can confuse users about where cached data is stored and how caching is managed, leading to potential misuse or debugging difficulties.
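For concreteness, the official MMLU-Pro evaluation script extracts the predicted option letter with a primary regex plus fallback patterns. The sketch below paraphrases that logic from memory; the exact patterns and function layout should be verified against the linked official implementation before porting them into the template:

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Pull the predicted option letter (A-J) out of a model response.

    Paraphrased sketch of the official MMLU-Pro extraction logic:
    look for "answer is (X)" first, then "Answer: X", then fall back
    to the last standalone option letter anywhere in the text.
    """
    match = re.search(r"answer is \(?([A-J])\)?", text)
    if match:
        return match.group(1)

    match = re.search(r"[aA]nswer:\s*([A-J])", text)
    if match:
        return match.group(1)

    # Last resort: the final standalone A-J letter, if any.
    match = re.search(r"\b[A-J]\b(?!.*\b[A-J]\b)", text, re.DOTALL)
    if match:
        return match.group(0)
    return None
```

If the harness applies only the first pattern, answers that the official script would still recover through the fallbacks get scored as failures, which is exactly the kind of discrepancy described above.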
Suggested Fixes
Max Tokens: Modify lm_eval to omit the max_tokens parameter from API requests when it is not explicitly provided, letting the API apply its own default (see the sketch after this list).
Post-Process Function: Update the MMLU-Pro template to align the post-process function with the official implementation.
Cache Options: Review and refactor the caching arguments for clarity—consider renaming or consolidating the flags to reduce confusion.
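To illustrate the first fix, the request payload can be assembled so that max_tokens is only forwarded when the user actually set it. This is a minimal sketch with hypothetical helper names, not the existing api_models.py code:

```python
from typing import Any, Optional


def build_completion_payload(
    prompt: str,
    model: str,
    max_tokens: Optional[int] = None,  # None means the user did not set a limit
    **extra_gen_kwargs: Any,
) -> dict:
    """Build an API request body, omitting max_tokens unless explicitly given.

    Hypothetical helper: the real change would live wherever lm_eval
    translates gen_kwargs into the provider's request format.
    """
    payload: dict = {"model": model, "prompt": prompt, **extra_gen_kwargs}
    if max_tokens is not None:
        # Forward the limit only when the user asked for one; otherwise let
        # the API apply its own default instead of a hard-coded fallback.
        payload["max_tokens"] = max_tokens
    return payload


# With no explicit limit, the key is simply absent from the request body.
assert "max_tokens" not in build_completion_payload("2+2=", model="some-model")
```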
max_gen_toks is usually defined in the task config, and we set 256 as a default fallback throughout the library. This also ensures that users do not waste API credits unnecessarily. I'll see about adding a warning or making this more explicit.
Good catch! Would you be willing to make a PR?
These two caching arguments serve distinct purposes. use_cache stores the model outputs during generation, so if an interruption occurs, we can resume from the last sample rather than regenerating all previous samples. Meanwhile, cache_requests stores the preprocessed inputs, allowing for faster restart of the evaluation. I'll update the interface doc to make this clear.
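To make the distinction concrete, here is a sketch of how the two options might be combined through the Python entry point. Argument names are assumed to mirror the CLI flags; verify against the simple_evaluate signature in your installed version:

```python
import lm_eval

# use_cache: path prefix for a SQLite database holding model *outputs*, so an
#   interrupted run can resume from the last completed sample.
# cache_requests: caches the *preprocessed request payloads*, so re-running
#   the same task skips the (potentially slow) request-building step.
results = lm_eval.simple_evaluate(
    model="local-completions",  # assumed API-backed model type for this example
    model_args="model=my-model,base_url=http://localhost:8000/v1/completions",
    tasks=["mmlu_pro"],
    use_cache="./lm_cache/mmlu_pro_run",  # where the output cache lives
    cache_requests=True,                  # enable the request-payload cache
)
```

Documenting the two flags side by side like this (model outputs vs. preprocessed requests) would resolve most of the confusion described above.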