Create evals #11

homanp opened this issue Feb 7, 2025 · 4 comments

homanp (Collaborator) commented Feb 7, 2025

Would be nice to have an eval pipeline to continuously measure performance/cost gains over a set of LLMs and documents.
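For concreteness, a minimal sketch of such a loop could look like the example below. The `parse_document(model, document)` entry point, the model names, and the `score` callable are all hypothetical placeholders; per-document token cost could be accumulated the same way if the parser reports provider usage.

```python
import time

# Illustrative list of hosted models to compare; not tied to this repo's config.
MODELS = ["gpt-4o-mini", "claude-3-5-sonnet"]

def run_evals(dataset, parse_document, score):
    """dataset: list of (document, expected) pairs; score: callable returning 0..1."""
    results = []
    for model in MODELS:
        total_score, total_latency = 0.0, 0.0
        for document, expected in dataset:
            start = time.perf_counter()
            output = parse_document(model=model, document=document)
            total_latency += time.perf_counter() - start
            total_score += score(output, expected)
        n = len(dataset)
        results.append({
            "model": model,
            "avg_score": total_score / n,
            "avg_latency_s": total_latency / n,
        })
    return results
```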

g-hano (Contributor) commented Feb 7, 2025

Are you planning to include Ollama models in the evaluation? Since their performance metrics would depend heavily on local hardware, we might need a different approach compared to API-based models.

homanp (Collaborator, Author) commented Feb 7, 2025

> Are you planning to include Ollama models in the evaluation? Since their performance metrics would depend heavily on local hardware, we might need a different approach compared to API-based models.

I was thinking of starting with the hosted models since, as you correctly point out, local inference would need its own evals.

The tricky part here is to find a good dataset that is accepted as a benchmark. Any ideas?

g-hano (Contributor) commented Feb 7, 2025

After a quick search, I found Frames and RewardBench

homanp (Collaborator, Author) commented Feb 7, 2025

> After a quick search, I found Frames and RewardBench

I would go with Frames.
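If Frames is the pick, pulling it in via the `datasets` library could look like the sketch below. The Hub repo id `google/frames-benchmark`, the split name, and the field names are assumptions to verify against the actual dataset card before wiring it into the harness.

```python
from datasets import load_dataset

# Assumed Hub location of the FRAMES benchmark; verify repo id and split.
frames = load_dataset("google/frames-benchmark", split="test")
print(len(frames), frames.column_names)  # inspect the schema first

# Hypothetical adapter into the (document, expected) pairs the eval loop expects;
# the actual field names depend on the dataset schema.
# pairs = [(row["Prompt"], row["Answer"]) for row in frames]
```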
