UGMathBench

Official Repo of UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models [ICLR 2025]

🔥News

  • 2025-02-19: We have released our evaluation code for both the rule-based method and the hybrid method.
  • 2025-02-07: We have released the UGMathBench dataset on Hugging Face.
  • 2025-01-23: Our paper has been accepted to ICLR 2025.

Notice

The original paper employs a rule-based method to evaluate the answers generated by LLMs. However, as highlighted in several recent studies (e.g., OmniMath), model-based methods offer greater flexibility and adaptability in evaluation. Therefore, we also provide scripts implementing a hybrid evaluation approach (MARJ; see UGPhysics), and we recommend using this hybrid approach for assessment.

UGMathBench is a diverse and dynamic benchmark, specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions.

🏆 Leaderboard

Coming soon!

🫵🏻 Test your own model

Download Data

git clone https://huggingface.co/datasets/UGMathBench/ugmathbench
mv ugmathbench/data/* ./data/
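
If git-lfs is unavailable, the dataset can also be fetched with the Hugging Face CLI; this is a sketch that assumes the downloaded layout matches the git clone above:

pip install -U "huggingface_hub[cli]"
huggingface-cli download UGMathBench/ugmathbench --repo-type dataset --local-dir ugmathbench
mv ugmathbench/data/* ./data/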

Export Your Key

export OPENAI_BASE_URL=your_base_url
export OPENAI_API_KEY=your_key

Inference

For closed-source LLMs, replace ${MODEL} and ${SUBJECT} with the LLM and the subject you want to test.

python generate_close.py --model ${MODEL} --subject ${SUBJECT} --prompt raw --nproc 16
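
To sweep every subject for one model, a simple shell loop works; the sketch below assumes one data file per subject under data/ and uses a placeholder model name:

MODEL=gpt-4o-mini                        # placeholder; substitute the closed-source model you want to test
for f in data/*; do
  SUBJECT="$(basename "${f%%.*}")"       # assumes subject identifiers match the data file names
  python generate_close.py --model ${MODEL} --subject ${SUBJECT} --prompt raw --nproc 16
done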

For open-source LLMs, we use vLLM to accelerate inference. Please add the corresponding prompt template to prompt.json (a hypothetical example is sketched after the command below).

python generate_open.py --model_path ${MODEL} --subject ${SUBJECT} --prompt llama3math --tensor_parallel_size 4
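
The name passed to --prompt must correspond to an entry in prompt.json. The exact schema is defined by this repo; purely as a hypothetical illustration, an entry for a new model family (here the made-up name "mymodelmath") might map the prompt name to its instruction text:

{
  "mymodelmath": "Please reason step by step, and put your final answer within \\boxed{}."
}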

Evaluate

Remember to export your keys as mentioned above. You can either use the rule-based method (used in our original paper):

python eval_rule.py --model_path ${MODEL} --subject ${SUBJECT} 

or the hybrid approach (please refer to UGPhysics):

python eval_marj.py --model_path ${MODEL} --subject ${SUBJECT} 
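
Putting the steps together, an end-to-end run for a single open-source model and subject might look like the following sketch; the model path and subject name are placeholders, so check data/ for the actual subject identifiers:

export OPENAI_BASE_URL=your_base_url          # required by the MARJ judge
export OPENAI_API_KEY=your_key
MODEL=meta-llama/Meta-Llama-3-8B-Instruct     # placeholder open-source model path
SUBJECT=calculus                              # placeholder subject identifier
python generate_open.py --model_path ${MODEL} --subject ${SUBJECT} --prompt llama3math --tensor_parallel_size 4
python eval_marj.py --model_path ${MODEL} --subject ${SUBJECT}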

💬 Citation

Thanks to OlympiadBench and OlympicArena for their open-source code. Our README.md is adapted from OmniMath.

If you find our work interesting and meaningful, please consider giving our repo a 🌟 and citing our paper.

@article{xu2025ugmathbench,
  title={UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models},
  author={Xu, Xin and Zhang, Jiaxin and Chen, Tianhao and Chao, Zitong and Hu, Jishan and Yang, Can},
  journal={arXiv preprint arXiv:2501.13766},
  year={2025}
}
