R1-Zero (GRPO) reproduction notes: pitfalls on small models

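To be clear up front: these are notes on the pitfalls hit while running experiments on 1.5B/3B models, including all sorts of odd behavior.

The notes fall into two main parts: pitfalls and a comparison of results.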

Experimental setup:

LLM RL framework: OpenRLHF

Models: Qwen2.5-1.5B/3B, Base/Instruct

Training data: MATH (7.5k)

The prompt follows the think/answer format used by orz:

"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>"

The following sentence is appended as well, to make it easier to extract the final answer:

"Please put your final answer within \boxed{}."

In OpenRLHF, this change is made in prompts_dataset.py.
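As a rough illustration of how the prompt above could be assembled, here is a minimal sketch. It is not the actual code in OpenRLHF's prompts_dataset.py: `build_prompt`, `SYSTEM_TEMPLATE`, and `BOXED_HINT` are hypothetical names, the tag names are taken from the format_reward pattern, and the "User: ... Assistant:" turn markers are an assumption, since the post does not say how the question is attached.

```python
# Hypothetical helper for illustration; not OpenRLHF's own code.
SYSTEM_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., <think> reasoning process here "
    "</think> <answer> answer here </answer>"
)
BOXED_HINT = "Please put your final answer within \\boxed{}."


def build_prompt(question: str) -> str:
    """Wrap a raw MATH question in the think/answer template."""
    # Assumption: the question is placed in a "User:" turn and the model
    # continues after "Assistant:".
    return f"{SYSTEM_TEMPLATE} User: {question.strip()} {BOXED_HINT} Assistant:"
```

Calling `build_prompt("What is 1 + 1?")` returns a single string that a base model can simply continue.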

Reward functions:

accuracy_reward: uses math_verify for answer extraction and matching
format_reward: borrowed from openr1, with pattern = r"^<think>.*?</think><answer>.*?</answer>$"
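A minimal sketch of the strict format reward, built around the pattern quoted above. The 1.0/0.0 reward values and the `re.DOTALL` flag (so that `.` also matches newlines inside the reasoning) are assumptions, not something the post specifies.

```python
import re

# Pattern copied from the post; re.DOTALL is an assumption so that
# multi-line reasoning inside the tags can still match.
STRICT_PATTERN = re.compile(
    r"^<think>.*?</think><answer>.*?</answer>$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 only if the whole response follows the strict layout."""
    return 1.0 if STRICT_PATTERN.match(response) else 0.0
```

Under this sketch, a single space or newline between `</think>` and `<answer>`, or any text before `<think>`, already gives a reward of 0.0.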

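Pitfalls

The format reward problem

Model: Qwen2.5-1.5B-base

Data: one tenth of the MATH dataset (fairly easy, just to get a first look at what happens)

Base hyperparameters: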

train_batch_size=16,32,64

rollout_size=16

n_generate=6

lr=1e-6

kl_coef=1e-3

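accuracy_reward and response length both keep growing, which looks promising. But that brings up the first problem: where did my format reward go? It is very low, and it even decreases. What is going on?

First, pull some of the responses generated during training. Some of them contain no think or answer tags at all, but others do contain a <think> and <answer> process: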

To find the number of zeros in the expansion of $999,999,999,998^2$, we first need to understand the structure of $999,999,999,998^2$.
Let's denote $n = 999,999,999,998$. Then, we need to find $n^2$.
By squaring $n = 999,999,999,998$, we arrive at the expression $n^2 = (10^{11} - 2)^2$, which expands to $10^{22} - 20 \cdot 10^{11} + 4$.
We need to determine how many zeros appear in this expansion. Notice that the term $20 \cdot 10^{11}$ contributes to the formation of 11 zeros in the resulting number, because each '20' term creates a '0' followed by a '0'. The remaining term '4' contributes a single zero.
Therefore, the total number of zeros is $(\text{number of zeros in } 10^{11}) + 1 = 11 + 1$.
The number of zeros in the expansion of $999,999,999,998^2$ is 12.
To determine how many cubic feet are in one cubic yard, we need to consider the conversion factors between yards and feet.\n\n 1 yard is equal to 3 feet.\n\n To create a cubic yard, we need dimensions that are each 3 feet long, wide, and high.\n\n The volume of one cubic yard is calculated by multiplying the dimensions:\n\[ \text{Volume of one cubic yard} = \text{length} \times \text{width} \times \text{height} \]\n\[ \text{Volume of one cubic yard} = 3 \, \text{feet} \times 3 \, \text{feet} \times 3 \, \text{feet} \]\n\n Therefore, the volume of one cubic yard is \( 27 \times 1 \) cubic feet. Thus, there are 27 cubic feet in one cubic yard.

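So the model can indeed produce a <think> and <answer> process. The trouble is the strictness of the pattern in our reward_func, pattern = r"^<think>.*?</think><answer>.*?</answer>$": the response must start with <think> (^), must end with </answer> ($), and must have nothing between </think> and <answer>. Under these constraints it is very hard to get any positive feedback during exploration or sampling. Next we modify the pattern and relax the format restriction step by step.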
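To see which of these constraints a sampled response violates, a small diagnostic along the following lines can help. `diagnose_format` is a hypothetical helper, not part of OpenRLHF or openr1, and it only roughly mirrors the three conditions of the strict pattern with plain string checks.

```python
from typing import List


def diagnose_format(response: str) -> List[str]:
    """List which strict-format constraints a response violates."""
    problems = []
    if not response.startswith("<think>"):
        problems.append("does not start with <think>")
    if not response.endswith("</answer>"):
        problems.append("does not end with </answer>")
    if "</think><answer>" not in response:
        problems.append("</think> is not immediately followed by <answer>")
    return problems
```

Counting the returned messages over a batch of rollouts gives a quick picture of whether the model is missing the tags entirely or only tripping over whitespace and leading text.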

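(1) Stepwise, also called tag_count. Correctly placing the first x tags earns a reward of 0.25 * x (note: "correctly placed" means the tags appear in the right order and each tag appears exactly once).

(2) Loose. pattern = r"<think>.*?</think><answer>.*?</answer>", which no longer requires the response to start with <think> (^) or end with </answer> ($), and allows more than one round of thinking.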
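The two relaxed variants could be sketched as follows. This is one possible reading of the descriptions above, not the exact code used in the experiments: `tag_count_reward` stops at the first tag that is missing, duplicated, or out of order, and `loose_format_reward` assumes `re.search` with `re.DOTALL` and a 1.0/0.0 reward.

```python
import re

TAGS = ["<think>", "</think>", "<answer>", "</answer>"]

# Loose pattern from the post; no ^ or $ anchors.
LOOSE_PATTERN = re.compile(r"<think>.*?</think><answer>.*?</answer>", re.DOTALL)


def tag_count_reward(response: str) -> float:
    """Give 0.25 for each leading tag that is placed correctly."""
    reward, cursor = 0.0, 0
    for tag in TAGS:
        # Each tag must appear exactly once.
        if response.count(tag) != 1:
            break
        pos = response.find(tag)
        # Each tag must come after the previous correctly placed tag.
        if pos < cursor:
            break
        reward += 0.25
        cursor = pos + len(tag)
    return reward


def loose_format_reward(response: str) -> float:
    """Return 1.0 if the tag sequence appears anywhere in the response."""
    return 1.0 if LOOSE_PATTERN.search(response) else 0.0
```

The stepwise variant gives partial credit to a response that only opens and closes `<think>`, which provides a denser signal early in training. The loose variant still expects `</think>` to be directly followed by `<answer>`, since the pattern itself keeps the two adjacent.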