- 2025-02-19: We have released our evaluation code for both the rule-based method and the hybrid method.
- 2025-02-07: We have released the UGMathBench dataset on Huggingface.
- 2025-01-23: Our paper is accepted to ICLR 2025.
The original paper employs a rule-based method to evaluate the answers generated by LLMs. However, as highlighted in several recent studies (e.g., OmniMath), model-based methods offer greater flexibility and adaptability in evaluation. Therefore, we also provide scripts implementing a hybrid evaluation approach (MARJ; see UGPhysics). We recommend using this hybrid approach for assessment.
UGMathBench is a diverse and dynamic benchmark, specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions.
Coming soon!
git clone https://huggingface.co/datasets/UGMathBench/ugmathbench
mv ugmathbench/data/* ./data/
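To quickly inspect the data after copying it into `./data/`, a minimal sketch like the following works. The file name `data/algebra.jsonl` and the field names (`problem`, `answer`, `answer_type`) are assumptions for illustration; check the actual files for the real layout and schema.

```python
import json

# Hypothetical file name; list ./data/ to see the actual per-subject files.
path = "data/algebra.jsonl"

# Read one problem per line (JSON Lines); adjust if the files are plain JSON.
with open(path, "r", encoding="utf-8") as f:
    problems = [json.loads(line) for line in f if line.strip()]

# Field names below ("problem", "answer", "answer_type") are assumptions.
first = problems[0]
print(first.get("problem"))
print(first.get("answer"), first.get("answer_type"))
```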
export OPENAI_BASE_URL=your_base_url
export OPENAI_API_KEY=your_key
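As a quick sanity check that the keys are picked up, a sketch like the one below can be used. It assumes the `openai` Python SDK (v1+), which reads `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment automatically; the model name is a placeholder.

```python
from openai import OpenAI

# The OpenAI v1 client reads OPENAI_API_KEY and OPENAI_BASE_URL from the
# environment, so no arguments are needed here.
client = OpenAI()

# Placeholder model name; use the same value you later pass as ${MODEL}.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print(resp.choices[0].message.content)
```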
For closed-source LLMs, please replace `${MODEL}` and `${SUBJECT}` with the LLM and subject you want to test.
python generate_close.py --model ${MODEL} --subject ${SUBJECT} --prompt raw --nproc 16
For open-source LLMs, we use vLLM to accelerate inference. Please add the corresponding prompt to `prompt.json`.
python generate_open.py --model_path ${MODEL} --subject ${SUBJECT} --prompt llama3math --tensor_parallel_size 4
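For reference, the core of vLLM-based batched inference looks roughly like the sketch below. The model path, prompt, and sampling settings are placeholders rather than the actual values or templates used by `generate_open.py`.

```python
from vllm import LLM, SamplingParams

# Placeholder model path; pass the same value you use for --model_path.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)

# Hypothetical prompt; the real templates live in prompt.json.
prompts = [
    "Solve the following problem and give the final answer.\n\nWhat is 2 + 2?",
]

# Greedy decoding as an example; generate_open.py may use different settings.
params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```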
Remember to export your keys as mentioned above. You can either use the rule-based method (used in our original paper):
python eval_rule.py --model_path ${MODEL} --subject ${SUBJECT}
or the hybrid approach (please refer to UGPhysics):
python eval_marj.py --model_path ${MODEL} --subject ${SUBJECT}
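Conceptually, the hybrid (MARJ) approach first tries rule-based matching and falls back to a model-based judge only when the rule fails. The sketch below illustrates that control flow under stated assumptions; it is not the actual `eval_marj.py` implementation, and the judge model and prompt are placeholders.

```python
from openai import OpenAI


def rule_based_equal(prediction: str, reference: str) -> bool:
    # Placeholder rule: exact match after light normalization. The real
    # rule-based checker handles many answer types (fractions, intervals,
    # expressions, etc.).
    return prediction.strip().lower() == reference.strip().lower()


def hybrid_judge(prediction: str, reference: str,
                 judge_model: str = "gpt-4o-mini") -> bool:
    # Step 1: cheap rule-based check.
    if rule_based_equal(prediction, reference):
        return True
    # Step 2: fall back to a model-based judge (placeholder prompt and model).
    client = OpenAI()
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                "Are the following two math answers equivalent? "
                f"Reply YES or NO.\nPrediction: {prediction}\nReference: {reference}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```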
We thank the authors of OlympiadBench and OlympicArena for open-sourcing their code. Our `README.md` is adapted from OmniMath.
If you find our work interesting and meaningful, please consider giving our repo a 🌟 and citing our paper.
@article{xu2025ugmathbench,
title={UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models},
author={Xu, Xin and Zhang, Jiaxin and Chen, Tianhao and Chao, Zitong and Hu, Jishan and Yang, Can},
journal={arXiv preprint arXiv:2501.13766},
year={2025}
}