
Question about Request Count and Generative Sampling #528

Open
rolshoven opened this issue Jan 31, 2025 · 3 comments

@rolshoven

Hi everyone,

First off, thanks for this great library! I've been using it extensively recently and really appreciate the work that has gone into it. While analyzing the code, I came across a couple of questions and would love your insights.

  1. Request Count for Metrics
    I'm evaluating a custom community summarization task with the following metrics: BERTScore, METEOR, BLEU, ROUGE-1/2/L, Extractiveness, and a custom LLM-as-Judge metric.

Given a set of samples, I noticed that the number of requests being generated is twice the number of samples. I have two metric categories: MetricCategory.GENERATIVE and MetricCategory.LLM_AS_JUDGE. Does this mean the model gets called twice per sample, effectively doubling costs when using an endpoint? I see that LiteLLM has caching, but even when running the model locally, it seems like each sample is processed twice, thus generating overhead.

Would switching my judge metric to MetricCategory.GENERATIVE prevent this duplication, or is there a better way to ensure only one evaluation pass per sample? And would this change have any other side effects that I am not currently seeing?

  2. Generative Sampling Behavior
    I want to use sampling (do_sample=True) when generating locally, but the setting gets overwritten unless the metric category is MetricCategory.GENERATIVE_SAMPLING. To work around this, I currently instantiate my metrics and then dynamically change the category (sketched below), but this feels like a hack.

However, when I do this, I suddenly see three times the number of requests—one for GENERATIVE, one for GENERATIVE_SAMPLING, and one for LLM_AS_JUDGE. I'm guessing something is off in how I'm modifying the category dynamically.

Is there a cleaner way to enable sampling without unintended extra requests?
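
For context, the workaround currently looks roughly like this (a minimal sketch; the specific Metrics entries, attribute names, and import paths depend on the lighteval version, so treat them as illustrative):

from lighteval.metrics.metrics import Metrics
from lighteval.metrics.utils.metric_utils import MetricCategory  # import path may differ across versions

# Instantiate the built-in metrics for the task...
summarization_metrics = [Metrics.rouge1.value, Metrics.rouge2.value]

# ...and then force them into the sampling category so that do_sample=True is not overwritten.
# This is the "hack" mentioned above: mutating the category after instantiation.
for metric in summarization_metrics:
    metric.category = MetricCategory.GENERATIVE_SAMPLING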

Thanks in advance for your help! I really appreciate your time and insights.

@clefourrier
Member

  1. If the parameters for your 2 metrics (in terms of sampling/generation length/stop tokens/...) are the same, it shouldn't, but it's possible we messed up something. Good question for the category change, cc @NathanHB ?
  2. You need a GENERATIVE_SAMPLING metric to do sampling - GENERATIVE will do greedy, so if you ask for both you'll get both behaviors (one in greedy and one in sampling, possibly with temperature change, etc.). You could add your metrics as custom metrics, which would likely be the easiest. Can you give a code sample?

@rolshoven
Author

Thank you for your time!

  1. If the parameters for your 2 metrics (in terms of sampling/generation length/stop tokens/...) are the same, it shouldn't, but it's possible we messed up something. Good question for the category change, cc @NathanHB?
  2. You need a GENERATIVE_SAMPLING metric to do sampling - GENERATIVE will do greedy, so if you ask for both you'll get both behaviors (one in greedy and one in sampling, possibly with temperature change, etc.). You could add your metrics as custom metrics, which would likely be the easiest. Can you give a code sample?

Currently, I see that there are three different metric categories associated with the requests: the original GENERATIVE category of most of my metrics (even though I dynamically overwrite them), the GENERATIVE_SAMPLING category that I assign to metric.use_case, and the LLM_AS_JUDGE category of my judge metric. When running three tasks with a maximum of 10 samples each, I get 90 requests to LiteLLM.

I see why having both GENERATIVE and GENERATIVE_SAMPLING requests would result in different calls to the model, since one is greedy and the other uses sampling. However, I would like to run the judge on the already generated samples without re-generating the output. I will try to make all metrics, including the judge, custom metrics of type GENERATIVE_SAMPLING and report back whether that worked. Thank you for your advice.
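
Concretely, I am thinking of something along these lines (a rough sketch; the import path and the exact sample_level_fn signature may differ between lighteval versions, and the scoring function body is just a placeholder):

import numpy as np

from lighteval.metrics.utils.metric_utils import (  # import path may differ across versions
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
)

def rouge1_sampling_fn(predictions: list[str], formatted_doc, **kwargs) -> float:
    # Placeholder: compute ROUGE-1 between predictions[0] and the reference(s) in formatted_doc.
    return 0.0

custom_rouge1 = SampleLevelMetric(
    metric_name="rouge1_sampling",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE_SAMPLING,  # sampling category so do_sample=True is respected
    use_case=MetricUseCase.SUMMARIZATION,         # assuming this use case exists in the installed version
    sample_level_fn=rouge1_sampling_fn,
    corpus_level_fn=np.mean,
)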

@rolshoven
Author

rolshoven commented Feb 1, 2025

Update

Overwriting the MetricCategory dynamically does not work well

TL;DR: It is better not to overwrite the MetricCategory of the built-in metrics, as there might be a reset somewhere in the code and you will end up with the original category. It is better to implement it as a custom metric, as was also suggested by @clefourrier.

I managed to set all of my metrics to the metric category GENERATIVE_SAMPLING, and now the number of requests matches the number of samples in the pipeline. However, my LLM-as-Judge metric doesn't work anymore, since _get_metric_method_from_category [1] now returns apply_generative_metric instead of apply_llm_as_judge_metric, which doesn't work with the output of the judge. Therefore, I switched the judge metric back to LLM_AS_JUDGE.

As soon as I did this, I had 90 instead of 30 requests again. The metric categories of the requests, as observed in the _run_model function in the pipeline [2], look like this:

>>> pprint([r.metric_categories for r in requests])

[[<MetricCategory.GENERATIVE_SAMPLING: '5'>],
 [<MetricCategory.GENERATIVE: '3'>],
 [<MetricCategory.LLM_AS_JUDGE: '7'>],
 [<MetricCategory.GENERATIVE_SAMPLING: '5'>],
 [<MetricCategory.GENERATIVE: '3'>],
 [<MetricCategory.LLM_AS_JUDGE: '7'>],
...
 [<MetricCategory.LLM_AS_JUDGE: '7'>]]

Somehow, adding a metric with the LLM_AS_JUDGE category resets the metric categories of the built-in metrics back to their original values. Using custom metrics worked for me in the end, thank you!
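
For anyone who runs into the same issue, wiring the custom metrics into the task then looked roughly like this (again just a sketch; the placeholder names and the exact LightevalTaskConfig fields may differ in your setup and lighteval version):

from lighteval.tasks.lighteval_task import LightevalTaskConfig

summarization_task = LightevalTaskConfig(
    name="my_summarization_task",                 # placeholder task name
    prompt_function=summarization_prompt_fn,      # placeholder prompt function
    suite=["community"],
    hf_repo="my-org/my-summarization-dataset",    # placeholder dataset
    hf_subset="default",
    metric=[custom_rouge1, custom_judge_metric],  # the custom GENERATIVE_SAMPLING and LLM_AS_JUDGE metrics
    generation_size=512,
)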

LLM-as-Judge metric triggers additional requests to LLM

Now that I have only the two metric categories of interest (GENERATIVE_SAMPLING and LLM_AS_JUDGE), I would still like to have only one set of requests and not twice as many. For the LiteLLM experiments I should be fine because of the response caching feature. Is there something similar when using the TransformersModel?

Additionally, in the case of transformer models, I think this might lead to wrong results: if sampling is enabled, the second request to the LLM, which generates the output evaluated by the LLM-as-Judge metric, might produce a different output than the one that was evaluated with the GENERATIVE_SAMPLING metrics.


References:
[1] src/lighteval/tasks/lighteval_task.py#L508
[2] src/lighteval/pipeline.py#L442
