- 2025-02-19: We have released our evaluation code for both the rule-based method and the hybrid method.
- 2025-02-07: We have released the UGMathBench dataset on Huggingface.
- 2025-01-23: Our paper is accepted to ICLR 2025.
The original paper employs a rule-based method to evaluate the answers generated by LLMs. However, as highlighted in several recent studies (e.g., OmniMath), model-based methods offer greater flexibility and adaptability in evaluation. Therefore, we also provide scripts implementing a hybrid evaluation approach (MARJ; see UGPhysics). We recommend using this hybrid approach for assessment.
UGMathBench is a diverse and dynamic benchmark, specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions.
Coming soon!
git clone https://huggingface.co/datasets/UGMathBench/ugmathbench
mv ugmathbench/data/* ./data/
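To quickly inspect the data after copying it into `./data/`, a minimal sketch like the following works. The file name `data/algebra.jsonl` and the field names (`problem`, `answer`, `answer_type`) are assumptions for illustration; check the actual files for the real layout and schema.

```python
import json

# Hypothetical file name; list ./data/ to see the actual per-subject files.
path = "data/algebra.jsonl"

# Read one problem per line (JSON Lines); adjust if the files are plain JSON.
with open(path, "r", encoding="utf-8") as f:
    problems = [json.loads(line) for line in f if line.strip()]

# Field names below ("problem", "answer", "answer_type") are assumptions.
first = problems[0]
print(first.get("problem"))
print(first.get("answer"), first.get("answer_type"))
```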
export OPENAI_BASE_URL=your_base_url
export OPENAI_API_KEY=your_key
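As a quick sanity check that the keys are picked up, a sketch like the one below can be used. It assumes the `openai` Python SDK (v1+), which reads `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment automatically; the model name is a placeholder.

```python
from openai import OpenAI

# The OpenAI v1 client reads OPENAI_API_KEY and OPENAI_BASE_URL from the
# environment, so no arguments are needed here.
client = OpenAI()

# Placeholder model name; use the same value you later pass as ${MODEL}.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print(resp.choices[0].message.content)
```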
For closed-source LLMs, please replace `${MODEL}` and `${SUBJECT}` with the LLM and subject you want to test.
python generate_close.py --model ${MODEL} --subject ${SUBJECT} --prompt raw --nproc 16
For open-source LLMs, we use vLLM to accelerate inference. Please add the corresponding prompt to `prompt.json`.
python generate_open.py --model_path ${MODEL} --subject ${SUBJECT} --prompt llama3math --tensor_parallel_size 4
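For reference, the core of vLLM-based batched inference looks roughly like the sketch below. The model path, prompt, and sampling settings are placeholders rather than the actual values or templates used by `generate_open.py`.

```python
from vllm import LLM, SamplingParams

# Placeholder model path; pass the same value you use for --model_path.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)

# Hypothetical prompt; the real templates live in prompt.json.
prompts = [
    "Solve the following problem and give the final answer.\n\nWhat is 2 + 2?",
]

# Greedy decoding as an example; generate_open.py may use different settings.
params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```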
Remember to export your keys as mentioned above. You can either use the rule-based method (used in our original paper):
python eval_rule.py --model_path ${MODEL} --subject ${SUBJECT}
or the hybrid approach (please refer to UGPhysics):
python eval_marj.py --model_path ${MODEL} --subject ${SUBJECT}
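Conceptually, the hybrid (MARJ) approach first tries rule-based matching and falls back to a model-based judge only when the rule fails. The sketch below illustrates that control flow under stated assumptions; it is not the actual `eval_marj.py` implementation, and the judge model and prompt are placeholders.

```python
from openai import OpenAI


def rule_based_equal(prediction: str, reference: str) -> bool:
    # Placeholder rule: exact match after light normalization. The real
    # rule-based checker handles many answer types (fractions, intervals,
    # expressions, etc.).
    return prediction.strip().lower() == reference.strip().lower()


def hybrid_judge(prediction: str, reference: str,
                 judge_model: str = "gpt-4o-mini") -> bool:
    # Step 1: cheap rule-based check.
    if rule_based_equal(prediction, reference):
        return True
    # Step 2: fall back to a model-based judge (placeholder prompt and model).
    client = OpenAI()
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                "Are the following two math answers equivalent? "
                f"Reply YES or NO.\nPrediction: {prediction}\nReference: {reference}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```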
We thank the authors of OlympiadBench and OlympicArena for open-sourcing their code. Our `README.md` is adapted from OmniMath.
If you find our work interesting and meaningful, please consider giving our repo a 🌟 and citing our paper.
@article{xu2025ugmathbench,
title={UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models},
author={Xu, Xin and Zhang, Jiaxin and Chen, Tianhao and Chao, Zitong and Hu, Jishan and Yang, Can},
journal={arXiv preprint arXiv:2501.13766},
year={2025}
}