Question about Request Count and Generative Sampling #528
Comments
Thank you for your time!
Currently I see three different metric categories associated with the requests: the original GENERATIVE category of most of my metrics (even though I dynamically overwrite them), the GENERATIVE_SAMPLING category that I assign to metric.use_case, and the LLM_AS_JUDGE category of my judge metric. When running three tasks with a maximum of 10 samples each, I get 90 requests to LiteLLM. I understand why GENERATIVE and GENERATIVE_SAMPLING requests result in separate calls to the model, since one is greedy and the other uses sampling. However, I would like to run the judge on the already generated samples without re-generating the output. I will try to make all metrics, including the judge, custom metrics of type GENERATIVE_SAMPLING and report back whether that worked. Thank you for your advice.
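Concretely, I plan to declare them along these lines. This is only a sketch assuming lighteval's SampleLevelMetric API; the import path varies between versions, and judge_score is a placeholder with a simplified signature, not my real judging logic:

import numpy as np

# NOTE: import paths are version-dependent; in recent lighteval releases these
# live under lighteval.metrics.utils.metric_utils, in older ones under
# lighteval.metrics.utils.
from lighteval.metrics.utils.metric_utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
)

def judge_score(predictions, formatted_doc, **kwargs) -> float:
    # Placeholder scoring function; the exact signature lighteval expects
    # differs between versions, and the real judge logic would go here.
    return float(bool(predictions and predictions[0]))

my_judge_metric = SampleLevelMetric(
    metric_name="my_llm_judge",
    higher_is_better=True,
    # Declaring the category up front instead of overwriting it after
    # instantiation.
    category=MetricCategory.GENERATIVE_SAMPLING,
    use_case=MetricUseCase.SUMMARIZATION,
    sample_level_fn=judge_score,
    corpus_level_fn=np.mean,
)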
Update: overwriting the MetricCategory dynamically does not work well

TL;DR: It is better not to overwrite the MetricCategory dynamically.

I managed to set all of my metrics to the metric category GENERATIVE_SAMPLING, and now the number of requests matches the number of samples in the pipeline. However, my LLM-as-Judge metric doesn't work anymore with that category, so I gave it its LLM_AS_JUDGE category back. As soon as I did this, I had 90 instead of 30 requests again. The different request categories, as observed in the debugger:

>>> pprint([r.metric_categories for r in requests])
[[<MetricCategory.GENERATIVE_SAMPLING: '5'>],
[<MetricCategory.GENERATIVE: '3'>],
[<MetricCategory.LLM_AS_JUDGE: '7'>],
[<MetricCategory.GENERATIVE_SAMPLING: '5'>],
[<MetricCategory.GENERATIVE: '3'>],
[<MetricCategory.LLM_AS_JUDGE: '7'>],
...
 [<MetricCategory.LLM_AS_JUDGE: '7'>]]

Somehow, adding a metric with the LLM_AS_JUDGE category triggers additional requests to the LLM

Now that I have only the two metric categories of interest, GENERATIVE_SAMPLING and LLM_AS_JUDGE, I would expect two requests per sample, yet the output above shows that a GENERATIVE request is still created for each sample as well.

Additionally, in the case of transformer models, I think this might lead to wrong results: if sampling is enabled, the output of the second request to the LLM, which is used for generating the text evaluated by the LLM-as-Judge metric, might be different from the initial output that was evaluated using the other metrics.
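For completeness, here is how the 90 requests break down by category. This is just a quick check against the requests list from the pprint call above; the per-category counts follow from the alternating pattern shown there:

>>> from collections import Counter
>>> # One Counter entry per category occurrence across all requests.
>>> counts = Counter(cat for r in requests for cat in r.metric_categories)
>>> counts
Counter({<MetricCategory.GENERATIVE_SAMPLING: '5'>: 30,
         <MetricCategory.GENERATIVE: '3'>: 30,
         <MetricCategory.LLM_AS_JUDGE: '7'>: 30})
>>> sum(counts.values())
90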
Hi everyone,
First off, thanks for this great library! I've been using it extensively recently and really appreciate the work that has gone into it. While analyzing the code, I came across a couple of questions and would love your insights.
I'm evaluating a custom community summarization task with the following metrics: BERTScore, METEOR, BLEU, ROUGE-1/2/L, Extractiveness, and a custom LLM-as-Judge metric.
Given a set of samples, I noticed that the number of requests being generated is twice the number of samples. I have two metric categories: MetricCategory.GENERATIVE and MetricCategory.LLM_AS_JUDGE. Does this mean the model gets called twice per sample, effectively doubling costs when using an endpoint? I see that LiteLLM has caching, but even when running the model locally, it seems like each sample is processed twice, thus generating overhead. Would switching my judge metric to MetricCategory.GENERATIVE prevent this duplication, or is there a better way to ensure only one evaluation pass per sample? And would this change have any other side effects that I am not currently seeing?
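My mental model of this, sketched below purely for illustration (this is not lighteval's actual code), is that one request is built per sample and per distinct metric category, so the request count scales with the number of categories rather than the number of metrics:

# Illustrative sketch only, not lighteval's implementation: one request per
# (sample, metric category) pair.
samples = list(range(10))
categories = ["GENERATIVE", "LLM_AS_JUDGE"]  # two distinct categories across my metrics
requests = [(sample, category) for sample in samples for category in categories]
print(len(requests))  # 20, i.e. twice the number of samples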
I want to use sampling (do_sample=True) when generating locally, but the setting gets overwritten unless the metric category is MetricCategory.GENERATIVE_SAMPLING. To work around this, I currently instantiate my metrics and then dynamically change the category, but this feels like a hack. However, when I do this, I suddenly see three times the number of requests: one for GENERATIVE, one for GENERATIVE_SAMPLING, and one for LLM_AS_JUDGE. I'm guessing something is off in how I'm modifying the category dynamically. Is there a cleaner way to enable sampling without unintended extra requests?
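For reference, the override I am doing looks roughly like this. It is only a sketch: it assumes a predefined Metrics enum member (Metrics.bleu here) wrapping a Metric dataclass with a writable category field, and those names may differ between lighteval versions:

import dataclasses

from lighteval.metrics.metrics import Metrics
from lighteval.metrics.utils.metric_utils import MetricCategory

# The in-place override ("hack") described above; assumes Metrics.bleu.value
# is a mutable Metric object with a `category` attribute.
bleu = Metrics.bleu.value
bleu.category = MetricCategory.GENERATIVE_SAMPLING

# A copy-based variant that at least avoids mutating the shared predefined
# instance, assuming Metric is a dataclass:
bleu_sampling = dataclasses.replace(bleu, category=MetricCategory.GENERATIVE_SAMPLING)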
Thanks in advance for your help! I really appreciate your time and insights.