Support SGLang as Potential Backend for Evaluation #2703

Open · wants to merge 18 commits into base: main
Changes from 1 commit
4 changes: 2 additions & 2 deletions README.md
@@ -241,14 +241,14 @@ vLLM occasionally differs in output from Huggingface. We treat Huggingface as th
### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`
We support SGLang for its efficient offline batch inference. Its **[Fast Backend Runtime](https://docs.sglang.ai/index.html)** serves models efficiently with RadixAttention for prefix caching, jump-forward constrained decoding, an overhead-free CPU scheduler, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (FP8/INT4/AWQ/GPTQ).

To use SGLang as the evaluation backend, please **install it in advance** by following the SGLang installation guide [here](https://docs.sglang.ai/start/install.html#install-sglang).
> [!Tip]
> Because of how [`Flashinfer`](https://docs.flashinfer.ai/), a fast attention kernel library, is installed, we don't include the `SGLang` dependencies in [pyproject.toml](pyproject.toml). Note that `Flashinfer` also has requirements on the `torch` version.
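A minimal install sketch, assuming a Linux machine with a recent CUDA toolkit; the FlashInfer wheel index shown here is an illustrative assumption, so check the linked install docs for the exact command matching your CUDA and `torch` versions:

```bash
# Illustrative sketch; exact versions and the FlashInfer wheel index depend on
# your CUDA / torch combination, so follow the SGLang install docs.
pip install --upgrade pip
pip install "sglang[all]>=0.4.2.post2"
# FlashInfer ships CUDA/torch-specific wheels from its own index (example URL,
# verify against the FlashInfer docs), which is why it is not pinned in pyproject.toml.
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.5/
```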

SGLang's server arguments differ slightly from those of other backends; see [here](https://docs.sglang.ai/backend/server_arguments.html) for more information. An example invocation:
```bash
lm_eval --model sglang \
-    --model_args pretrained={model_name},tp_size={data_parallel_size},dp_size={tensor_parallel_size},dtype=auto,mem-fraction-static=0.9, \
+    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto,mem-fraction-static=0.9, \
    --tasks gsm8k_cot \
    --batch_size auto
```
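Equivalently, a hedged sketch of the same run through the Python API; the backend name `sglang` mirrors the CLI example above, while the model id and parallel sizes are placeholder assumptions:

```python
# Sketch only: the pretrained model id and parallelism settings are placeholders;
# the model_args string is parsed the same way as the CLI --model_args flag.
import lm_eval

results = lm_eval.simple_evaluate(
    model="sglang",
    model_args="pretrained=Qwen/Qwen2.5-1.5B-Instruct,dp_size=1,tp_size=1,dtype=auto",
    tasks=["gsm8k_cot"],
    batch_size="auto",
)
print(results["results"]["gsm8k_cot"])
```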
2 changes: 0 additions & 2 deletions pyproject.toml
@@ -78,7 +78,6 @@ zeno = ["pandas", "zeno-client"]
wandb = ["wandb>=0.16.3", "pandas", "numpy"]
gptqmodel = ["gptqmodel>=1.0.9"]
japanese_leaderboard = ["emoji==2.14.0", "neologdn==0.5.3", "fugashi[unidic-lite]", "rouge_score>=0.1.2"]
-sglang =["sglang>=0.4.2.post2"]
all = [
"lm_eval[anthropic]",
"lm_eval[dev]",
@@ -99,7 +98,6 @@ all = [
"lm_eval[zeno]",
"lm_eval[wandb]",
"lm_eval[japanese_leaderboard]",
"lm_eval[sglang]",
]

[tool.ruff.lint]
54 changes: 53 additions & 1 deletion tests/models/test_sglang.py
@@ -3,8 +3,9 @@
import pytest
import torch

-from lm_eval import tasks
+from lm_eval import evaluate, simple_evaluate, tasks
from lm_eval.api.instance import Instance
from lm_eval.tasks import get_task_dict


task_manager = tasks.TaskManager()
Expand Down Expand Up @@ -60,3 +61,54 @@ def test_logliklihood_rolling(self) -> None:
        res = self.LM.loglikelihood_rolling(self.ROLLING)
        for x in res:
            assert isinstance(x, float)

    # def test_simple_evaluate(self) -> None:
    #     results = simple_evaluate(
    #         model=self.LM,
    #         tasks=["gsm8k"],
    #         # num_fewshot=0,
    #         task_manager=task_manager,
    #         limit=1,
    #     )
    #     print(results)

    # def test_evaluate(self) -> None:
    #     tasks = ["gsm8k"]
    #     task_dict = get_task_dict(tasks, task_manager)
    #     results = evaluate(
    #         lm=self.LM,
    #         task_dict=task_dict,
    #         limit=1,
    #     )
    #     print(results)

    # TODO(jinwei): figure out why the outputs for "gsm8k" differ between simple_evaluate() and evaluate(). There are also some parser errors.
    def test_evaluator(self) -> None:
        simple_results = simple_evaluate(
            model=self.LM,
            tasks=["arc_easy"],
            task_manager=task_manager,
            limit=1,
        )
        assert simple_results is not None, "simple_evaluate returned None"

        task_dict = get_task_dict(["arc_easy"], task_manager)
        evaluate_results = evaluate(
            lm=self.LM,
            task_dict=task_dict,
            limit=1,
        )
        assert evaluate_results is not None, "evaluate returned None"

        assert set(simple_results["results"].keys()) == set(
            evaluate_results["results"].keys()
        ), "Mismatch in task keys between simple_evaluate and evaluate"

        for task in simple_results["results"]:
            assert (
                simple_results["results"][task] == evaluate_results["results"][task]
@Qubitium (Contributor) commented on Feb 21, 2025:

> @Monstertail I see that the tests check that the two APIs return the same results. Nice! But there is a situation where both fail to produce a proper score, for example arc <= 0.10, where a very low (bad) score is returned that is not normal for the model. It would be best to add a fixed score-range check for this fixed 1.5B model, e.g. the returned arc score should be >= 0.5 (based on the actual result, within maybe a 10% margin to allow for different GPUs/kernels).

), f"Mismatch in results for {task}"

print(
"✅ test_evaluator passed: simple_evaluate and evaluate results are identical."
)
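Building on the review comment above, a hedged sketch of a fixed score-range check that could sit alongside `test_evaluator` in the same test class; the `acc,none` metric key and the 0.5 floor are illustrative assumptions rather than values taken from the PR:

```python
    # Sketch only: assumes arc_easy reports accuracy under the "acc,none" key and
    # that the pinned ~1.5B model clears a 0.5 floor; tune both to observed scores.
    def test_score_within_expected_range(self) -> None:
        results = simple_evaluate(
            model=self.LM,
            tasks=["arc_easy"],
            task_manager=task_manager,
            limit=100,  # enough samples for a meaningful accuracy estimate
        )
        acc = results["results"]["arc_easy"]["acc,none"]
        # Allow roughly a 10% margin to absorb GPU/kernel differences.
        assert acc >= 0.5, f"arc_easy accuracy {acc:.3f} is below the expected floor"
```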