Data • Model • Performance • Training • Evaluation • Paper • Citation
Welcome to the official repo for Schema Reinforcement Learning, containing the dataset, training scripts, and evaluation code in our paper.
- [2025/02/27] Our paper is now released on arXiv! Check it here!
- [2025/02/26] SchemaBench is now released!
SchemaBench is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. Below is the statistics of the schemas used in SchemaBench:
We crawled 40K+ real-world schema files from JSON Schema Store and GitHub, and constructed SchemaBench. Below we present our data cleaning and construction pipeline with common cases.
Please download our dataset using the following link: Google Drive or Tsinghua Cloud. Simply copy those data files into the same folders in the repo. The file structure is as follows:
├── /schemabench/
│ └── /data/
│ ├── /custom/ // Custom Formats
│ ├── /schema/ // Complex Schema
│ ├── custom_append.jsonl
│ └── translation_test.jsonl // Escape Translation
├── /train/
│ └── /data/
│ ├── mix_train_no_collected_json.json // SFT - w/o collected json
│ ├── mix_train.json // SFT - w/ collected json
│ ├── train_with_tool_ToS.parquet // SRL - training set
│ └── val_with_tool_ToS.parquet // SRL - validation set
Please make sure you have downloaded all data files and put them into the right directory.
We release the LLaMA-3.2 3B SRL for anyone who wants to use it.
We evaluate the performance of several models on the SchemaBench. The results are shown below:
Model | Schema-only Generation | Schema-constrained Reasoning | ||||||
---|---|---|---|---|---|---|---|---|
Complex | Custom | Escape | Overall | GSM8K | MATH500 | MMLU | ARC-C | |
GPT-4o | 84.47 | 61.56 | 37.14 | 61.06 | 97.80 | 41.40 | 86.16 | 97.01 |
GPT-4o-mini | 68.86 | 46.17 | 16.89 | 43.98 | 86.13 | 31.80 | 49.41 | 77.65 |
Qwen-2.5 7B | 72.42 | 43.60 | 11.11 | 42.38 | 94.54 | 38.60 | 74.43 | 91.21 |
MiniCPM-3 4B | 53.88 | 20.29 | 9.13 | 27.77 | 69.22 | 33.40 | 66.58 | 88.31 |
LLaMA-3.1 8B | 64.26 | 33.07 | 12.02 | 36.45 | 95.91 | 85.60 | 71.83 | 84.98 |
LLaMA-3.1 8B SFT | 74.56 | 46.64 | 60.58 | 60.59 | 89.46 | 63.80 | 66.97 | 84.56 |
- w/o Collected JSON | 70.84 | 42.06 | 60.35 | 57.75 | 78.39 | 46.00 | 58.87 | 75.68 |
LLaMA-3.1 8B SRL | 90.48 | 78.67 | 69.86 | 79.67 | 90.90 | 88.00 | 70.74 | 84.81 |
LLaMA-3.2 3B | 49.84 | 27.31 | 8.37 | 28.51 | 80.97 | 35.40 | 62.38 | 79.27 |
LLaMA-3.2 3B SFT | 71.71 | 45.52 | 52.21 | 56.48 | 82.94 | 44.40 | 61.50 | 78.41 |
- w/o Collected JSON | 72.42 | 42.83 | 54.82 | 56.69 | 78.85 | 36.20 | 59.11 | 75.68 |
LLaMA-3.2 3B SRL | 82.25 | 66.13 | 69.10 | 72.50 | 84.23 | 43.20 | 57.99 | 78.24 |
✨Here is an overview of our training pipeline.
Clone this repository and navigate to the SchemaBench folder.
git clone [email protected]:thunlp/SchemaReinforcementLearning.git
cd SchemaReinforcementLearning
Initialize the environment (python==3.11)
bash scripts/init_env.sh
Download the data files from the Data Release section and put them into the right directory.
bash scripts/train_sft.sh
If you want to use the SFT data without collected JSON, please run the following command:
bash scripts/train_sft_no_collected_json.sh
We use a modified version of PRIME for SRL, which is already included in this repo as a submodule. To train the SRL model, please run the following command:
bash scripts/train_srl.sh
You can find your trained models in the train/results
directory by default.
Before evaluating performance on the SchemaBench, you should initialize the config file for local models' inference. We use CodeLinker for inference, which currently support any OpenAI compatible server for the evaluation. To initialize the config file, first:
cp private_example.toml private.toml
Then fill in the private.toml
with your api key and base url if needed. After that, you can run the following evaluation script:
bash scripts/test_schemabench.sh
If you need to run the evaluation on a subset of the SchemaBench, you can modify the test_category
in test_schemabench.sh
script. Currently you can choose from ['all', 'schema', 'reasoning']
and all single sub-tasks.
If you find our work helpful, please consider citing our paper:
@misc{lu2025learninggeneratestructuredoutput,
title={Learning to Generate Structured Output with Schema Reinforcement Learning},
author={Yaxi Lu and Haolun Li and Xin Cong and Zhong Zhang and Yesai Wu and Yankai Lin and Zhiyuan Liu and Fangming Liu and Maosong Sun},
year={2025},
eprint={2502.18878},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18878},
}