
Adding grpo training #1233

Open · wants to merge 74 commits into main

Conversation

Goekdeniz-Guelmez
Contributor

No description provided.

@mark-lord

mark-lord commented Feb 2, 2025

Absolute HERO! Been trying to figure this out myself the past week but made pretty much no progress whatsoever, other than to make a script that fills up all the RAM on my Mac 🤣

Is there any way to run this yet? I assume not, since at the moment it's still marked as a draft and there isn't a lora_config.yaml like in the DPO example yet (not sure if it's needed)?

@Goekdeniz-Guelmez
Contributor Author

No, not yet. I still have to implement the dataset wrapper and some other stuff; I'll let you know when it's done.

@Guo-astro left a comment

We may need to use expanded_prompts and expanded_answers in both the reward and the loss computations.
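A hedged illustration of this point (variable names here are illustrative and may not match the PR's code): when each prompt is sampled group_size times, both the reward functions and the loss should see the group-expanded lists, so every completion is scored against its own prompt and reference answer.

# Illustrative sketch only; names are hypothetical, not taken from the PR.
group_size = 2
prompts = ["What is 2+2?", "What is 3+3?"]
answers = ["4", "6"]

# Each prompt/answer is repeated group_size times so they align with the
# group_size completions generated per prompt.
expanded_prompts = [p for p in prompts for _ in range(group_size)]
expanded_answers = [a for a in answers for _ in range(group_size)]
completions = ["<answer>4</answer>", "4", "<answer>6</answer>", "five"]

def exact_match_reward(prompts, completions, answer):
    # One float per completion, index-aligned with the expanded lists.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

# Both the reward computation and (conceptually) the loss should consume the
# expanded lists, not the original un-expanded ones.
rewards = exact_match_reward(expanded_prompts, completions, expanded_answers)
print(rewards)  # [1.0, 1.0, 1.0, 0.0]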

@Goekdeniz-Guelmez
Contributor Author

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-0.5B \
    --train \
    --data /Users/gokdenizgulmez/Desktop/test_grpo \
    --iters 5 \
    --batch-size 1 \
    --num-layers 4 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Desktop/test-grpo-full \
    --max-seq-length 128 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2

Output

Loading pretrained model
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 124936.71it/s]
Loading datasets
Training
Trainable parameters: 0.109% (0.541M/494.033M)
Starting GRPO training with 5 reward functions..., iters: 5
[WARNING] Some prompts are longer than 128 tokens. Long prompts will be truncated.
Iter 1: Val loss 0.00000140, Val total_rewards_mean -0.359, Val total_rewards_std 0.010, Val grouped_rewards_mean -0.359, Val grouped_rewards_std 0.010, Val kl 0.000, Val reward_func_0_mean 0.000, Val reward_func_0_std 0.000, Val reward_func_1_mean 0.000, Val reward_func_1_std 0.000, Val reward_func_2_mean 0.000, Val reward_func_2_std 0.000, Val reward_func_3_mean 0.000, Val reward_func_3_std 0.000, Val reward_func_4_mean -1.794, Val reward_func_4_std 0.051, Val took 8.385s

But after that, my 32 GB of RAM gets fully used. I tried to add some memory optimisations, but the memory usage is still too high.
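For what it's worth, a minimal sketch of the kind of memory-bounding steps one might try between iterations, using MLX's standard cache controls (this is not necessarily what the PR does; the 8 GB cap is an arbitrary example value):

import mlx.core as mx

# Cap the Metal buffer cache so freed activation memory is returned sooner.
mx.metal.set_cache_limit(8 * 1024**3)

def end_of_step_cleanup(model, optimizer):
    # Materialize the lazy graph for parameters and optimizer state,
    # then drop cached buffers that are no longer referenced.
    mx.eval(model.parameters(), optimizer.state)
    mx.metal.clear_cache()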

@Goekdeniz-Guelmez
Contributor Author

Iter 1: Val loss -0.00000057, Val total_rewards_mean -0.387, Val total_rewards_std 0.026, Val grouped_rewards_mean -0.387, Val grouped_rewards_std 0.026, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean -1.937, Val r1_count_xml_std 0.128, Val took 8.314s

Still uses too much memory.

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 3, 2025

So I tried using TRL and the same amount of RAM was used, so it's not an error on my side.

@mark-lord

🚀

Would you be able to share the datasets you used for the training? Will give it a go on my machine as soon as I can 🙌

@Goekdeniz-Guelmez
Contributor Author

Will do that tomorrow 🤝

@Guo-astro

🚀

Would you be able to share the datasets you used for the training? Will give it a go on my machine as soon as I can 🙌

I created a quick one only for testing the code

https://huggingface.co/datasets/Goastro/mlx-grpo-dataset

@Goekdeniz-Guelmez
Contributor Author

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-0.5B \
    --train \
    --data /Users/gokdenizgulmez/Desktop/test_grpo \
    --iters 5 \
    --batch-size 1 \
    --num-layers 8 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Desktop/test-grpo-full \
    --max-seq-length 255 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2 \
    --max-completion-length 6

Output:

Loading pretrained model
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 72853.92it/s]
Loading datasets
Training
Trainable parameters: 0.109% (0.541M/494.033M)
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10955.27it/s]
Starting GRPO training with 5 reward functions..., iters: 5
Iter 1: Val loss 0.00000000, Val total_rewards_mean -0.354, Val total_rewards_std 0.012, Val grouped_rewards_mean -0.354, Val grouped_rewards_std 0.012, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean -1.769, Val r1_count_xml_std 0.060, Val took 26.298s
Iter 1: Train loss -0.00001353, Total rewards mean -0.306, Total rewards std 0.001, Grouped rewards mean -0.306, Grouped rewards std 0.001, KL 0.000, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -1.532, r1_count_xml std 0.005, Learning Rate 1.000e-05, It/sec 0.079, Tokens/sec 25.072, Peak mem 7.254 GB
Iter 2: Train loss 0.00055540, Total rewards mean -0.572, Total rewards std 0.001, Grouped rewards mean -0.572, Grouped rewards std 0.001, KL 0.006, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -2.861, r1_count_xml std 0.005, Learning Rate 1.000e-05, It/sec 0.121, Tokens/sec 36.164, Peak mem 7.254 GB
Iter 3: Train loss 0.00070858, Total rewards mean -0.842, Total rewards std 0.003, Grouped rewards mean -0.842, Grouped rewards std 0.003, KL 0.013, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -4.210, r1_count_xml std 0.013, Learning Rate 1.000e-05, It/sec 0.110, Tokens/sec 31.790, Peak mem 7.254 GB
Iter 4: Train loss 0.00070563, Total rewards mean -1.161, Total rewards std 0.005, Grouped rewards mean -1.161, Grouped rewards std 0.005, KL 0.020, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -5.806, r1_count_xml std 0.024, Learning Rate 1.000e-05, It/sec 0.105, Tokens/sec 36.961, Peak mem 7.899 GB
Iter 5: Val loss 0.00057772, Val total_rewards_mean -0.345, Val total_rewards_std 0.005, Val grouped_rewards_mean -0.345, Val grouped_rewards_std 0.005, Val kl 0.006, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean -1.726, Val r1_count_xml_std 0.025, Val took 22.624s
Iter 5: Train loss 0.00059050, Total rewards mean -1.399, Total rewards std 0.006, Grouped rewards mean -1.399, Grouped rewards std 0.006, KL 0.026, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -6.994, r1_count_xml std 0.029, Learning Rate 1.000e-05, It/sec 0.156, Tokens/sec 39.539, Peak mem 7.899 GB
Saved final weights to /Users/gokdenizgulmez/Desktop/test-grpo-full/adapters.safetensors.

@mark-lord

mark-lord commented Feb 4, 2025

🥳🥳🥳

Working on my machine too! Not to mention it's plug-and-play with QLoRA as well, which I don't think TRL even has 😁 And already used it to get an 'aha' moment out of Phi-14b and do some knowledge injection 🚀 [Edit: I did not get it to work properly - see later in conversation] (screenshot attached)

@wangcheng0825

Why is the train loss always 0?

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-3B-Instruct \
    --train \
    --data /Users/stone/Documents/repository/ml/mlx_learn/data \
    --iters 500 \
    --batch-size 1 \
    --num-layers 8 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/stone/Documents/repository/ml/mlx_learn/adapter \
    --max-seq-length 512 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --temperature 0.8 \
    --steps-per-eval 500 \
    --group-size 4 \
    --max-completion-length 512 \
    --use-chat-template \
    --save-every 100

(screenshot of the training log)

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 24, 2025

@wangcheng0825 It's normal to see zero loss during RL fine-tuning of LLMs as long as rewards are improving. Here is Unsloth:

(screenshot of an Unsloth training log showing near-zero loss)
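One way to see why a near-zero loss is expected early on, assuming the standard GRPO objective (sketched here without the PPO-style clipping term; the PR may differ in details):

$$
\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\,\hat{A}_i + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.
$$

At the first steps $\pi_\theta = \pi_{\theta_{\text{old}}} = \pi_{\text{ref}}$, so every ratio is 1 and the KL term is 0, and the group-normalized advantages sum to zero by construction. The reported loss is therefore about 0 even though the gradient, and hence learning, is not.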

@Goekdeniz-Guelmez
Contributor Author

@kiratp that's a great idea! I'll push the update later today.

…s (generation is now faster, with the same RAM usage), fix for the identical generations, separated the reward functions into a separate file.
@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 24, 2025

Huge changes! The generations are different now (it was because I used argmax instead of mx.random.categorical), and the dataset loader now has system message support too.
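A minimal sketch of the change described above, assuming plain MLX calls (the shapes and temperature are illustrative, not the PR's actual generation loop):

import mlx.core as mx

# Fake next-token logits for a batch of 1 over a toy vocabulary.
logits = mx.random.normal((1, 32000))
temperature = 0.8

greedy_token = mx.argmax(logits, axis=-1)                    # always the same token
sampled_token = mx.random.categorical(logits / temperature)  # varies across rollouts

print(greedy_token, sampled_token)

With argmax, every rollout in a group collapses to the same completion, so the group-relative advantages are all zero; categorical sampling is what makes the group diverse enough to learn from.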

Loading pretrained model
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 263977.17it/s]
Loading datasets
Training
Trainable parameters: 0.013% (0.410M/3085.939M)
Fetching 9 files: 100%|█████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 60397.98it/s]
Starting GRPO training with 5 reward functions..., iters: 100

=== Validation Sample Details ===

📝 Generation:
Let's begin by calculating the total number of flowers in the garden. We'll sum up the roses, tulips, and daisies.

<think> Roses: 25, Tulips: 40, Daisies: 35 </think>
<answer> Total flowers = Roses + Tulips + Daisies = 25 + 40 + 35 = 100 </answer>

Next, we need to find the number of flowers that are not roses. This can be determined by subtracting the number of roses from the total number of flowers.

<think> Total flowers: 100, Roses: 25 </think>
<answer> Flowers not roses = Total flowers - Roses = 100 - 25 = 75 </answer>

To find what percentage of the flowers are not roses, we need to use the formula for percentage:

<think> Flowers not roses: 75, Total flowers: 100 </think>
<answer> Percentage of flowers not roses = (Flowers not roses / Total flowers) * 100 = (75 / 100) * 100 = 75% </answer>

Therefore, the percentage of flowers that are not roses is 75%.

==========


✅ Answer:
75

==========


🔍 Extracted Answer:
Percentage of flowers not roses = (Flowers not roses / Total flowers) * 100 = (75 / 100) * 100 = 75%

==============================

Iter 1: Val loss -0.00000003, Val total_rewards_mean 0.688, Val total_rewards_std 0.188, Val grouped_rewards_mean 0.688, Val grouped_rewards_std 0.188, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.500, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean 0.188, Val r1_count_xml_std 0.188, Val took 32.792s
Iter 1: Train loss -0.00000003, Total rewards mean 2.125, Total rewards std 1.250, Grouped rewards mean 2.125, Grouped rewards std 1.250, KL 0.000, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 1.000, r1_int_reward_func mean 0.250, r1_int_reward_func std 0.250, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.375, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.032, Tokens/sec 10.947, Peak mem 13.803 GB
Iter 20: Train loss -0.00197959, Total rewards mean 23.274, Total rewards std 7.669, Grouped rewards mean 23.274, Grouped rewards std 7.669, KL 0.014, r1_accuracy_reward_func mean 6.000, r1_accuracy_reward_func std 6.000, r1_int_reward_func mean 2.250, r1_int_reward_func std 1.250, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 8.750, r1_soft_format_reward_func std 0.750, r1_count_xml mean 6.273, r1_count_xml std 0.885, Learning Rate 1.000e-05, It/sec 0.012, Tokens/sec 8.923, Peak mem 15.801 GB
Iter 40: Train loss -0.00107503, Total rewards mean 45.060, Total rewards std 15.009, Grouped rewards mean 45.060, Grouped rewards std 15.009, KL 0.043, r1_accuracy_reward_func mean 11.000, r1_accuracy_reward_func std 11.000, r1_int_reward_func mean 3.750, r1_int_reward_func std 2.250, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 17.500, r1_soft_format_reward_func std 2.000, r1_count_xml mean 12.809, r1_count_xml std 1.599, Learning Rate 1.000e-05, It/sec 0.020, Tokens/sec 9.655, Peak mem 16.137 GB

=== Validation Sample Details ===

📝 Generation:
To determine what percentage of the flowers in the garden are not roses, we'll follow these steps:

<think> First, we need to calculate the total number of flowers in the garden. This can be done by summing the number of roses, tulips, and daisies. Next, we need to find out how many flowers are not roses, which means subtracting the number of roses from the total number of flowers. Then, to calculate the percentage of flowers that are not roses, we divide the number of flowers that are not roses by the total number of flowers, and multiply by 100. </think>

<answer> First, we sum the total number of flowers: 25 roses + 40 tulips + 35 daisies = 100 flowers. Next, we find out how many flowers are not roses by subtracting the number of roses: 100 total flowers - 25 roses = 75 flowers that are not roses. Finally, to calculate the percentage of flowers that are not roses, we divide the number of flowers that are not roses by the total number of flowers and multiply by 100: (75 / 100) * 100 = 75%. </answer>

==========


✅ Answer:
75

==========


🔍 Extracted Answer:
First, we sum the total number of flowers: 25 roses + 40 tulips + 35 daisies = 100 flowers. Next, we find out how many flowers are not roses by subtracting the number of roses: 100 total flowers - 25 roses = 75 flowers that are not roses. Finally, to calculate the percentage of flowers that are not roses, we divide the number of flowers that are not roses by the total number of flowers and multiply by 100: (75 / 100) * 100 = 75%.

==============================

Iter 50: Val loss 0.00013034, Val total_rewards_mean 0.875, Val total_rewards_std 0.000, Val grouped_rewards_mean 0.875, Val grouped_rewards_std 0.000, Val kl 0.001, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.500, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean 0.375, Val r1_count_xml_std 0.000, Val took 34.302s
Iter 60: Train loss -0.36879355, Total rewards mean 73.282, Total rewards std 25.806, Grouped rewards mean 73.282, Grouped rewards std 25.806, KL 0.115, r1_accuracy_reward_func mean 22.000, r1_accuracy_reward_func std 18.000, r1_int_reward_func mean 7.500, r1_int_reward_func std 4.500, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 25.500, r1_soft_format_reward_func std 4.000, r1_count_xml mean 18.281, r1_count_xml std 2.895, Learning Rate 1.000e-05, It/sec 0.021, Tokens/sec 9.869, Peak mem 16.137 GB
Iter 80: Train loss 0.00018776, Total rewards mean 97.672, Total rewards std 34.840, Grouped rewards mean 97.672, Grouped rewards std 34.840, KL 0.199, r1_accuracy_reward_func mean 29.000, r1_accuracy_reward_func std 23.000, r1_int_reward_func mean 10.000, r1_int_reward_func std 6.500, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 34.750, r1_soft_format_reward_func std 4.750, r1_count_xml mean 23.922, r1_count_xml std 4.203, Learning Rate 1.000e-05, It/sec 0.015, Tokens/sec 9.120, Peak mem 16.137 GB

=== Validation Sample Details ===

📝 Generation:
To solve this problem, we need to follow these steps:

1. Calculate the total number of flowers in the garden.
2. Determine the number of flowers that are not roses.
3. Compute the percentage of flowers that are not roses.

Let's proceed step-by-step:

<think> 
1. Total number of flowers = Number of roses + Number of tulips + Number of daisies
2. Flowers not roses = Number of tulips + Number of daisies
3. Percentage of flowers not roses = (Flowers not roses / Total number of flowers) * 100
</think>

Now let's do the calculations:

<answer>
1. Total number of flowers = 25 roses + 40 tulips + 35 daisies = 100 flowers
2. Flowers not roses = 40 tulips + 35 daisies = 75 flowers
3. Percentage of flowers not roses = (75 / 100) * 100 = 75%

Therefore, 75% of the flowers are not roses.
</answer>

==========


✅ Answer:
75

==========


🔍 Extracted Answer:
1. Total number of flowers = 25 roses + 40 tulips + 35 daisies = 100 flowers
2. Flowers not roses = 40 tulips + 35 daisies = 75 flowers
3. Percentage of flowers not roses = (75 / 100) * 100 = 75%

Therefore, 75% of the flowers are not roses.

==============================

Iter 100: Val loss 0.00016276, Val total_rewards_mean 0.875, Val total_rewards_std 0.000, Val grouped_rewards_mean 0.875, Val grouped_rewards_std 0.000, Val kl 0.002, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.500, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean 0.375, Val r1_count_xml_std 0.000, Val took 32.239s
Iter 100: Train loss -0.00121820, Total rewards mean 125.407, Total rewards std 40.104, Grouped rewards mean 125.407, Grouped rewards std 40.104, KL 0.368, r1_accuracy_reward_func mean 38.000, r1_accuracy_reward_func std 26.000, r1_int_reward_func mean 12.500, r1_int_reward_func std 7.500, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 44.500, r1_soft_format_reward_func std 5.000, r1_count_xml mean 30.407, r1_count_xml std 5.218, Learning Rate 1.000e-05, It/sec 0.023, Tokens/sec 10.279, Peak mem 16.158 GB
Iter 100: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000100_adapters.safetensors.
Saved final weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors.
Testing
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 21959.71it/s]

=== Validation Sample Details ===

📝 Generation:
Let's start by figuring out the total number of pencils Arnel had. We know that after keeping 10 pencils, the rest were shared equally among his five friends, and each friend received 8 pencils. So, the total number of pencils given to friends is 5 friends * 8 pencils/friend = 40 pencils. Since Arnel kept 10 pencils for himself, the total number of pencils he originally had is 10 (kept) + 40 (given to friends) = 50 pencils. If these 50 pencils are in 10 boxes with an equal number of pencils in each, we simply divide the total number of pencils by the number of boxes: 50 pencils / 10 boxes = 5 pencils per box. <think> 50 / 10 = 5 pencils were in each box </think><answer>5</answer>

==========


✅ Answer:
5

==========


🔍 Extracted Answer:
5

==============================

Test loss 0.000, Test ppl 1.000, Rewards: total_rewards_mean: 3.375, total_rewards_std: 0.000, grouped_rewards_mean: 3.375, grouped_rewards_std: 0.000, kl: 0.002, r1_accuracy_reward_func_mean: 2.000, r1_accuracy_reward_func_std: 0.000, r1_int_reward_func_mean: 0.500, r1_int_reward_func_std: 0.000, r1_strict_format_reward_func_mean: 0.000, r1_strict_format_reward_func_std: 0.000, r1_soft_format_reward_func_mean: 0.500, r1_soft_format_reward_func_std: 0.000, r1_count_xml_mean: 0.375, r1_count_xml_std: 0.000

@Goekdeniz-Guelmez
Contributor Author

Args:

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-3B-Instruct \
    --train \
    --data /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test_grpo \
    --iters 100 \
    --batch-size 1 \
    --num-layers 8 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test-grpo-full \
    --max-seq-length 1024 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 50 \
    --test \
    --test-batches 1 \
    --group-size 2 \
    --max-completion-length 512 \
    --use-chat-template

@lin72h

lin72h commented Feb 24, 2025

Huge changes! The generations are different now (it was because I used argmax instead of mx.random.categorical), and the dataset loader now has system message support too.

This looks like a solid improvement! I’ve heard that the mx.random.categorical method produces more diverse results. Adding system message support is a great enhancement—really useful. Thanks for all your effort in improving this feature!

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 25, 2025

Thanks for your kind words @lin72h! This really motivates me to know that my efforts are appreciated and that there's a clear desire within the community for these enhancements. Special thanks to @Guo-astro @deathcoder @kiratp and everyone else here!!!!! Your support means a lot.

@mark-lord

This really motivates me to know that my efforts are appreciated and that there's a clear desire within the community for these enhancements

Just wanted to pop up again and express my support 😁 The efforts are very much appreciated!!! (Been dealing with some personal issues lately, so I haven't been nearly as active in the community as I'd like, but I've been keeping an eye on this repo every day regardless pahahahaha) Thanks for the awesome work @Goekdeniz-Guelmez 😁

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 26, 2025

You only ran 250 iterations with batch size 1, which is likely insufficient for meaningful changes in model behavior, especially for a 3B-parameter model. Can you also show me the logs from the training? If the rewards go up, that means the model is learning. Is the adapter path correct? Also, is the system prompt you used correct? The default system prompt is A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Or try training again but with the pretrained (base) version; if you do so, then use --use-prompt instead of --use-chat-template. I'll try it out too with your settings when I'm home.
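For reference, a hedged sketch of a format check along the lines of what the strict-format reward presumably rewards (the actual regex and reward values in the PR may differ; the 0.5 value and the callback signature are assumptions):

import re

# Hypothetical format reward: the regex mirrors the layout asked for by the
# default system prompt above.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(prompts, completions, answer):
    return [0.5 if FORMAT_RE.search(c) else 0.0 for c in completions]

print(format_reward(
    prompts=["q"],
    completions=["<think> 25+40+35=100 </think><answer> 75 </answer>"],
    answer=["75"],
))  # [0.5]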

@wangcheng0825

You only ran 250 iterations with batch size 1, which is likely insufficient for meaningful changes in model behavior, especially for a 3B-parameter model. Can you also show me the logs from the training? If the rewards go up, that means the model is learning. Is the adapter path correct? Also, is the system prompt you used correct? The default system prompt is A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Or try training again but with the pretrained (base) version; if you do so, then use --use-prompt instead of --use-chat-template. I'll try it out too with your settings when I'm home.

Thanks @Goekdeniz-Guelmez, it was a problem with my system prompt. I tried using the prompt A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. and the answer looks correct, so I deleted my question. Thank you very much for your reply.

==========
<think> John feeds each horse 20 pounds of food twice a day, so each horse consumes 20 * 2 = 40 pounds of food per day. With 25 horses, the total daily food consumption is 25 * 40 = 1000 pounds. Over 60 days, the total food consumption is 1000 * 60 = 60000 pounds. Since John buys half-ton bags of food, each bag contains 1000 pounds of food. Therefore, the number of bags needed is 60000 / 1000 = 60 bags. </think>
<answer> 60 </answer>
==========
Prompt: 146 tokens, 990.388 tokens-per-sec
Generation: 153 tokens, 49.867 tokens-per-sec
Peak memory: 12.345 GB

@SfcFromSx

(screenshot of the code in question)
Maybe this place needs an indentation?

@Goekdeniz-Guelmez
Contributor Author

Thanks!! It should NOT be indented because it should execute regardless of whether the weights were provided or defaulted.

@deathcoder

@Goekdeniz-Guelmez I am testing on the latest commit (first of all, again, amazing improvements). I was running training for the mlx-community/Qwen2.5-7B-Instruct-8bit model and, after 80 steps, just after it saved the adapters, it got stuck: GPU usage dropped to 0 and I couldn't Ctrl-C out of it. In the end I had to kill the process, and I got this message after I did that:

Iter 80: Train loss -0.002, Total rewards mean 6.854, Total rewards std 0.954, Grouped rewards mean 6.854, Grouped rewards std 0.954, KL 0.217, r1_chess_reward_func mean 6.854, r1_chess_reward_func std 0.954, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, Learning Rate 1.000e-05, It/sec 0.016, Tokens/sec 14.979, Peak mem 31.438 GB
Iter 80: Saved adapter weights to adapters/chess_small/adapters.safetensors and adapters/chess_small/0000080_adapters.safetensors.

/Users/admin/devtools/miniconda3/envs/mlx-grpo/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I'm saving adapters every 10 steps, so this wasn't the first time they were being saved... Not really sure what else I can add about this; unfortunately it didn't print a stack trace.

@Goekdeniz-Guelmez
Contributor Author

@deathcoder It probably has something to do with the memory handling and clearing; I'll look into it when I'm home.

@Vi-cs

Vi-cs commented Feb 27, 2025

Hello,

Thanks a lot for all the work; it is a pleasure to be able to play with GRPO locally!!

After a few tests, it seems to work perfectly fine for very short prompts, but I struggle with prompts of around 1000 tokens, even with small models like mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-8bit.

My args:

python -m mlx_lm.lora \
    --model mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-8bit \
    --train \
    --data /dataset/ \
    --iters 100 \
    --batch-size 1 \
    --num-layers -1 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/test4/ \
    --max-seq-length 2000 \
    --max-completion-length 1000 \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2 \
    --use-prompt

With the model Qwen/Qwen2.5-0.5B, I get 5 tokens/sec, so with a group size of 2 generating 500 tokens each = 1000 tokens generated => 200 sec per iteration.
With mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-8bit, iteration 1 is still not completed after 45 minutes.

I have an M4 Max with 128 GB: my GPU usage peaks at 10% sometimes but stays very low. The memory used is close to 40 GB with the 1.5B model.

@Goekdeniz-Guelmez, do you have any idea how to improve the performance?

@deathcoder

@Vi-cs Have you tried reducing --num-layers? You are tuning all layers with -1. Also make sure you are on the latest commit. I never tried with -1, but in my tests with 8 layers I get much higher speeds than that:

  • on 32B-8bit I get 7-8 tok/s
  • on 7B-8bit it's closer to 20 tok/s

@Vi-cs

Vi-cs commented Feb 27, 2025

My commit was a few days old, but I pulled the latest commit just to be sure, and I changed to --num-layers 4.
Still not able to complete the first iteration :/
The GPU usage remains very low.

Edit:
I relaunched the same command and got the first iteration in a few minutes, then nothing for 20 min.

Iter 1: Val loss 0.000, Val total_rewards_mean 0.062, Val total_rewards_std 0.062, Val grouped_rewards_mean 0.062, Val grouped_rewards_std 0.062, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean 0.062, Val r1_count_xml_std 0.062, Val took 7.940s
Iter 1: Train loss 0.000, Total rewards mean 0.125, Total rewards std 0.125, Grouped rewards mean 0.125, Grouped rewards std 0.125, KL 0.000, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.125, r1_count_xml std 0.125, Learning Rate 1.000e-05, It/sec 0.023, Tokens/sec 11.888, Peak mem 40.836 GB
python -m mlx_lm.lora \
    --model mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-8bit \
    --train \
    --data vi-c/test \
    --iters 100 \
    --batch-size 1 \
    --num-layers 4 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/viviencuisinier/Github/mlx-examples/test4/ \
    --max-seq-length 2000 \
    --max-completion-length 1000 \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2 \
    --use-prompt

@deathcoder

deathcoder commented Feb 27, 2025

Not sure what it is on your side that is slowing you down. I just ran the exact same command you sent; the only difference is the dataset:

Iter 1: Val loss 0.000, Val total_rewards_mean 0.050, Val total_rewards_std 0.050, Val grouped_rewards_mean 0.050, Val grouped_rewards_std 0.050, Val kl 0.000, Val r1_chess_reward_func_mean 0.050, Val r1_chess_reward_func_std 0.050, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val took 7.870s
Iter 1: Train loss 0.000, Total rewards mean 0.000, Total rewards std 0.000, Grouped rewards mean 0.000, Grouped rewards std 0.000, KL 0.000, r1_chess_reward_func mean 0.000, r1_chess_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, Learning Rate 1.000e-05, It/sec 0.244, Tokens/sec 56.339, Peak mem 5.457 GB
Iter 2: Train loss 0.000, Total rewards mean 0.000, Total rewards std 0.000, Grouped rewards mean 0.000, Grouped rewards std 0.000, KL 0.000, r1_chess_reward_func mean 0.000, r1_chess_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, Learning Rate 1.000e-05, It/sec 0.251, Tokens/sec 55.909, Peak mem 5.457 GB

And training is still going; I'm now on iteration 14, and I only launched it just before I started writing this message.

Edit: just to confirm you are actually on the latest commit, do you have the validation sample details in your logs?
For me it looks like this at the very start of the process:

Starting GRPO training with 3 reward functions..., iters: 100

=== Validation Sample Details ===

📝 Generation:
 ```<answer>2</answer>``.<|im_end|>
</think>

The analyzing program started by considering UCI notation moves.

Moving from Player 1's captures, it moves Option 2 from Player 2 including the Queen back into check, to return to Player 1's line.

The成熟 male pawn component doubles up, gave Player 2 a pawn supported by a undefended bishop.

But the other bishop when under a sometimes prompted positional control.

But ultimately, Player 2 counter-pumpes and controls the game with deep development of promote capturing a way for Player 1 to accept a successful Minimal One Move```

<answer>2</answer>

==========


✅ Answer:
['c8b7']

==========


🔍 Extracted Answer:
2

==============================

@wxjiao

wxjiao commented Feb 28, 2025

Args:

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-3B-Instruct \
    --train \
    --data /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test_grpo \
    --iters 100 \
    --batch-size 1 \
    --num-layers 8 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test-grpo-full \
    --max-seq-length 1024 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 50 \
    --test \
    --test-batches 1 \
    --group-size 2 \
    --max-completion-length 512 \
    --use-chat-template

@Goekdeniz-Guelmez Thanks for the nice work! I wonder whether the current code supports GRPO training with --batch-size > 1?

@Vi-cs

Vi-cs commented Feb 28, 2025

Thanks @deathcoder.

The dataset is this one: https://huggingface.co/datasets/vi-c/test/.

python -m mlx_lm.lora \
    --model mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-8bit \
    --train \
    --data vi-c/test \
    --iters 100 \
    --batch-size 1 \
    --num-layers 4 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/test4/ \
    --max-seq-length 2000 \
    --max-completion-length 1000 \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2 \
    --use-prompt

Also, I think I am on the latest commit, based on the log:

Starting GRPO training with 5 reward functions..., iters: 100

=== Validation Sample Details ===

📝 Generation:
获得感:.eml中的copied to CV中的相关信息,具体包括Alexandra和Bill的收入情况,专业技能和经验和公司经历等。

符合这个JSON schema的contact对象信息如下:

{
"domicile": "CH",
"birth_country": "CH",
"short_description": "Alexandra",
"email": "[email protected]",
"phone": "555-123-4567",
"birth_date": "1985-03-20",
"zip_code": "80512",
"occupation": "工商而乘",
"职业学位": null,
"长期兼职职位": null,
"短期兼职职位": "行政 Assistant",
"first_name": "Bill",
"last_name": "",
"nationalities": ["CH", "US", "IT"],
"type": "person",
"asset": {
"cash": { "value": "CHF 6.55M", "currency": "CHF", "description": "CHF cash portfolio" },
"division投资": { "value": "CHF 2.35M", "currency": "CHF", "description": "投资组合成果" },
"real estate": { "value": "CHF 2.5M", "currency": "CHF", "description": "房地产购买" },
"其他": { "value": "CHF 2.2M", "currency": "CHF", "description": "其他资产" }
},
"source_of_wealth": "通过分配收益,依据题目中的公司系统开发`

{
  "domicile": "CH",
  "birth_country": "CH",
  "short_description": "Alexandra",
  "email": "[email protected]",
  "phone": "555-123-4567",
  "birth_date": "1985-03-20",
  "zip_code": "80512",
  "occupation": "工商而乘",
  "职业学位": null,
  "长期兼职职位": null,
  "短期兼职职位": "行政 Assistant",
  "first_name": "Bill",
  "last_name": "",
  "nationalities": ["CH", "US", "IT"],
  "type": "person",
  "asset": {
    "cash": { "value": "CHF 6.55M", "currency": "CHF", "description": "CHF cash portfolio" },
    "division投资": { "value": "CHF 2.35M", "currency": "CHF", "description": "投资组合成果" },
    "real estate": { "value": "CHF 2.5M", "currency": "CHF", "description": "房地产购买" },
    "其他": { "value": "CHF 2.2M", "currency": "CHF", "description": "其他资产" }
  },
  "source_of_wealth": "通过分配收益,依据题目中的公司系统开发"
}

==========

✅ Answer:
{
"contacts": [
{
"id": null,
"domicile": "CHE",

@Goekdeniz-Guelmez
Contributor Author

@Vi-cs With --use-prompt you're not applying the chat template, so try using --use-chat-template instead; you can take a look at the LORA.md documentation file. @wxjiao Thanks, and yes, this should also work with batching.
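To illustrate the difference, a hedged sketch using the Hugging Face tokenizer API rather than this PR's internals (--use-chat-template roughly corresponds to wrapping the prompt with the model's chat template, --use-prompt to passing the raw text through):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "A conversation between User and Assistant. ..."},
    {"role": "user", "content": "What is 25 + 40 + 35?"},
]

# Roughly what --use-chat-template implies: model-specific role markers are
# added before generation.
templated = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Roughly what --use-prompt implies: the raw concatenated text, no template.
raw = messages[0]["content"] + "\n" + messages[1]["content"]

print(templated)
print(raw)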

@Vi-cs

Vi-cs commented Feb 28, 2025

Hi @Goekdeniz-Guelmez, I am using an R1 distill model, which is not an instruct model. It doesn't behave correctly with --use-chat-template. I set --use-prompt on purpose and it works fine (I tested this with Unsloth GRPO).

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 28, 2025

@Vi-cs You're using the wrong dataset! Your dataset doesn't match the built-in reward functions. The reward functions look for specific XML tags and formatted answers, but your dataset contains JSON with Chinese text instead. If you really need to, you have to create new reward functions and a prompt that work with the JSON data, and train your model via code. The dataset is not suited for GRPO training, since GRPO needs structured data with clear evaluation criteria to optimize against, which your mixed-language JSON data doesn't provide. Look into the LORA.md documentation to understand the dataset format that should be used; your dataset is more suited for basic SFT training than for GRPO.

Bad Dataset:

(screenshot of Vi-cs's JSON dataset entries)

Good Dataset:

Goastro/mlx-grpo-dataset

(screenshot of the Goastro/mlx-grpo-dataset entries)
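For anyone following along, a hypothetical JSONL entry in the shape the GRPO loader appears to expect (the key names are an assumption based on this thread; check LORA.md and the Goastro/mlx-grpo-dataset card for the authoritative schema):

{"prompt": "A garden has 25 roses, 40 tulips, and 35 daisies. What percentage of the flowers are not roses?", "answer": "75", "system": "A conversation between User and Assistant. ..."}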

@Vi-cs

Vi-cs commented Feb 28, 2025

Totally agree!

The reward functions provided only reward the model if the completion contains a strict XML structure, an int inside the answer tag, and the correct int.
These functions did not match my dataset, so I wrote new functions, roughly of the shape sketched below.
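A hypothetical sketch of such a custom reward function, assuming the callbacks take lists of prompts, completions, and reference answers and return one float per completion (the exact signature expected by the trainer is an assumption here):

import json

# Hypothetical custom reward for JSON-style answers; not taken from the PR.
def json_answer_reward(prompts, completions, answer):
    rewards = []
    for completion, reference in zip(completions, answer):
        try:
            # Reward exact structural equality of the parsed JSON objects.
            rewards.append(1.0 if json.loads(completion) == json.loads(reference) else 0.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards

print(json_answer_reward(
    prompts=["extract the contact"],
    completions=['{"domicile": "CH"}'],
    answer=['{"domicile": "CH"}'],
))  # [1.0]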

The training doesn't work with my reward functions.

Just to make sure the issue is not related to my custom reward functions, I used the ones provided (which don't make sense with my dataset but should not break the training).

Still, the training doesn't work.

@Goekdeniz-Guelmez do you see a reason why the training would not succeed in computing iterations with my dataset?

Any chance you could run my args (with my dataset) to see whether on your side it processes at least a few iterations?

Thanks!
