
Question about Request Count and Generative Sampling #528

Open
rolshoven opened this issue Jan 31, 2025 · 3 comments

@rolshoven

Hi everyone,

First off, thanks for this great library! I've been using it extensively recently and really appreciate the work that has gone into it. While analyzing the code, I came across a couple of questions and would love your insights.

  1. Request Count for Metrics
    I'm evaluating a custom community summarization task with the following metrics: BERTScore, METEOR, BLEU, ROUGE-1/2/L, Extractiveness, and a custom LLM-as-Judge metric.

Given a set of samples, I noticed that the number of requests being generated is twice the number of samples. I have two metric categories: MetricCategory.GENERATIVE and MetricCategory.LLM_AS_JUDGE. Does this mean the model gets called twice per sample, effectively doubling costs when using an endpoint? I see that LiteLLM has caching, but even when running the model locally, it seems like each sample is processed twice, thus generating overhead.

Would switching my judge metric to MetricCategory.GENERATIVE prevent this duplication, or is there a better way to ensure only one evaluation pass per sample? And would this change have any other side effects that I am not currently seeing?

  2. Generative Sampling Behavior
    I want to use sampling (do_sample=True) when generating locally, but the setting gets overwritten unless the metric category is MetricCategory.GENERATIVE_SAMPLING. To work around this, I currently instantiate my metrics and then dynamically change the category (sketched below), but this feels like a hack.

However, when I do this, I suddenly see three times the number of requests—one for GENERATIVE, one for GENERATIVE_SAMPLING, and one for LLM_AS_JUDGE. I'm guessing something is off in how I'm modifying the category dynamically.

Is there a cleaner way to enable sampling without unintended extra requests?
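
For context, the workaround currently looks roughly like this (a minimal sketch; the specific Metrics entries, attribute names, and import paths depend on the lighteval version, so treat them as illustrative):

from lighteval.metrics.metrics import Metrics
from lighteval.metrics.utils.metric_utils import MetricCategory  # import path may differ across versions

# Instantiate the built-in metrics for the task...
summarization_metrics = [Metrics.rouge1.value, Metrics.rouge2.value]

# ...and then force them into the sampling category so that do_sample=True is not overwritten.
# This is the "hack" mentioned above: mutating the category after instantiation.
for metric in summarization_metrics:
    metric.category = MetricCategory.GENERATIVE_SAMPLING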

Thanks in advance for your help! I really appreciate your time and insights.

@clefourrier
Member

  1. If the parameters for your 2 metrics (in terms of sampling/generation length/stop tokens/...) are the same, it shouldn't, but it's possible we messed up something. Good question for the category change, cc @NathanHB ?
  2. You need a GENERATIVE_SAMPLING metric to do sampling - GENERATIVE will do greedy, so if you ask for both you'll get both behaviors (one in greedy and one in sampling, possibly with temperature change, etc.). You could add your metrics as custom metrics, which would likely be the easiest. Can you give a code sample?

@rolshoven
Author

Thank you for your time!

  1. If the parameters for your 2 metrics (in terms of sampling/generation length/stop tokens/...) are the same, it shouldn't, but it's possible we messed up something. Good question for the category change, cc @NathanHB?
  2. You need a GENERATIVE_SAMPLING metric to do sampling - GENERATIVE will do greedy, so if you ask for both you'll get both behaviors (one in greedy and one in sampling, possibly with temperature change, etc.). You could add your metrics as custom metrics, which would likely be the easiest. Can you give a code sample?

Currently, I see that there are three different metric categories associated with the requests: the original GENERATIVE category of most of my metrics (even though I dynamically overwrite them), the GENERATIVE_SAMPLING category that I assign to metric.use_case, and the LLM_AS_JUDGE category of my judge metric. When running three tasks with a maximum of 10 samples each, I get 90 requests to LiteLLM.

I see why having both GENERATIVE and GENERATIVE_SAMPLING requests would result in different calls to the model, since one is greedy and the other uses sampling. However, I would like to run the judge on the already generated samples without re-generating the output. I will try to make all metrics, including the judge, custom metrics of type GENERATIVE_SAMPLING and report back whether that worked. Thank you for your advice.
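
Concretely, I am thinking of something along these lines (a rough sketch; the import path and the exact sample_level_fn signature may differ between lighteval versions, and the scoring function body is just a placeholder):

import numpy as np

from lighteval.metrics.utils.metric_utils import (  # import path may differ across versions
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
)

def rouge1_sampling_fn(predictions: list[str], formatted_doc, **kwargs) -> float:
    # Placeholder: compute ROUGE-1 between predictions[0] and the reference(s) in formatted_doc.
    return 0.0

custom_rouge1 = SampleLevelMetric(
    metric_name="rouge1_sampling",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE_SAMPLING,  # sampling category so do_sample=True is respected
    use_case=MetricUseCase.SUMMARIZATION,         # assuming this use case exists in the installed version
    sample_level_fn=rouge1_sampling_fn,
    corpus_level_fn=np.mean,
)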

@rolshoven
Author

rolshoven commented Feb 1, 2025

Update

Overwriting the MetricCategory dynamically does not work well

TL;DR: It is better not to overwrite the MetricCategory of the built-in metrics, as there might be a reset somewhere in the code and you will end up with the original category. It is better to implement it as a custom metric, as was also suggested by @clefourrier.

I managed to set all of my metrics to the metric category GENERATIVE_SAMPLING, and now the number of requests matches the number of samples in the pipeline. However, my LLM-as-Judge metric doesn't work anymore, since _get_metric_method_from_category [1] now returns apply_generative_metric instead of apply_llm_as_judge_metric, which doesn't work with the output of the judge. Therefore, I switched the judge metric back to LLM_AS_JUDGE.

As soon as I did this, I had 90 instead of 30 requests again. The metric categories of the requests, as observed in the _run_model function in the pipeline [2], look like this:

>>> pprint([r.metric_categories for r in requests])

[[<MetricCategory.GENERATIVE_SAMPLING: '5'>],
 [<MetricCategory.GENERATIVE: '3'>],
 [<MetricCategory.LLM_AS_JUDGE: '7'>],
 [<MetricCategory.GENERATIVE_SAMPLING: '5'>],
 [<MetricCategory.GENERATIVE: '3'>],
 [<MetricCategory.LLM_AS_JUDGE: '7'>],
...
 [<MetricCategory.LLM_AS_JUDGE: '7'>]]

Somehow, adding a metric with the LLM_AS_JUDGE category resets the metric categories of the built-in metrics back to their original values. Using custom metrics worked for me in the end, thank you!
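
For anyone who runs into the same issue, wiring the custom metrics into the task then looked roughly like this (again just a sketch; the placeholder names and the exact LightevalTaskConfig fields may differ in your setup and lighteval version):

from lighteval.tasks.lighteval_task import LightevalTaskConfig

summarization_task = LightevalTaskConfig(
    name="my_summarization_task",                 # placeholder task name
    prompt_function=summarization_prompt_fn,      # placeholder prompt function
    suite=["community"],
    hf_repo="my-org/my-summarization-dataset",    # placeholder dataset
    hf_subset="default",
    metric=[custom_rouge1, custom_judge_metric],  # the custom GENERATIVE_SAMPLING and LLM_AS_JUDGE metrics
    generation_size=512,
)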

LLM-as-Judge metric triggers additional requests to LLM

Now that I have only the two metric categories of interest (GENERATIVE_SAMPLING and LLM_AS_JUDGE), I would still like to have only one set of requests and not twice as many. For the LiteLLM experiments I should be fine because of the response caching feature. Is there something similar when using the TransformersModel?

Additionally, in the case of transformer models, I think this might lead to wrong results: if sampling is enabled, the second request to the LLM, which generates the output evaluated by the LLM-as-Judge metric, might produce a different output than the one that was evaluated with the GENERATIVE_SAMPLING metrics.


References:
[1] src/lighteval/tasks/lighteval_task.py#L508
[2] src/lighteval/pipeline.py#L442
