Apple Open-Sources DiffuCoder: A Masked Diffusion Model for Code Generation

Published: 2025-07-09

This software project accompanies the research paper DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation.


Research Motivation

Building on masked denoising models (MDMs), diffusion large language models (dLLMs) such as LLaDA and Dream have reached performance on par with autoregressive (AR) LLMs of the same size across many benchmarks. Recent commercial-scale dLLMs such as Mercury and Gemini further demonstrate that diffusion-based code generators can rival top AR code models on programming tasks while delivering faster text generation.

However, the generation patterns and post-training strategies of dLLMs remain under-explored. This work investigates the following questions:

  • How does the generation pattern of dLLMs fundamentally differ from that of AR models?
  • What differences arise when modeling different data types, such as code versus math?
  • How should the diversity boundary of dLLMs be characterized, and how should the post-training pipeline be designed?

We adopt the adaptation approach of DiffuLLaMA to train DiffuCoder, and we introduce a new metric, the autoregressiveness (AR-ness) score, to quantify the causal pattern in a dLLM's generation process.
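
To make the metric concrete, here is a toy sketch (an illustrative simplification, not the exact formula from the paper): treat local AR-ness as the fraction of decoding steps that commit the token immediately to the right of the previously committed one.

# Toy sketch of a *local* AR-ness ratio (an illustrative simplification, not
# the paper's exact metric): given the order in which token positions were
# committed during diffusion decoding, count how often a step fills the
# position directly after the previous one.
def local_ar_ness(decode_order):
    """decode_order[i] is the sequence position committed at decoding step i."""
    if len(decode_order) < 2:
        return 1.0
    consecutive = sum(
        1 for prev, cur in zip(decode_order, decode_order[1:]) if cur == prev + 1
    )
    return consecutive / (len(decode_order) - 1)

print(local_ar_ness([0, 1, 2, 3]))  # 1.0: strictly left-to-right, AR-like
print(local_ar_ness([3, 0, 2, 1]))  # 0.0: fully out-of-order

A strictly left-to-right decoder scores 1.0, and the score drops as generation departs from the AR order. With this picture in mind, the main findings are as follows.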

Findings

  • Due to the nature of text, dLLMs still exhibit a left-to-right bias, but unlike AR models they can also break this strict order.

  • After pre-training, we find that code tasks induce less global AR-ness than math tasks.

  • In dLLMs, changing the sampling temperature affects not only which tokens are sampled (as in AR models) but also the generation order itself.

For more interesting findings, please refer to our original paper!

We propose Coupled-GRPO, a post-training method built on Group Relative Policy Optimization (GRPO) that improves DiffuCoder's performance.
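
As background for the optimization step, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods share (the coupled complementary-mask sampling that Coupled-GRPO adds for estimating diffusion token log-probabilities is not shown and would sit on top of this):

import torch

# Minimal GRPO-style sketch: sample a group of completions per prompt, score
# each with a scalar reward (e.g. a unit-test pass rate), then normalize the
# rewards within each group to obtain per-sample advantages.
def group_relative_advantages(rewards, eps=1e-6):
    """rewards: tensor of shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0]])
print(group_relative_advantages(rewards))  # above-average samples get positive advantage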

Quick Start

The first example below queries the coupled-GRPO post-trained checkpoint, apple/DiffuCoder-7B-cpGRPO:

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-cpGRPO"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
""" ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1 # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=256,
    output_history=True,
    return_dict_in_generate=True,
    steps=256//TOKEN_PER_STEP,
    temperature=0.4,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p) :].tolist())
    for p, g in zip(input_ids, output.sequences)
]

print(generations[0].split('<|dlm_pad|>')[0])

Output:

Here is the code to solve this problem: 
```python
def shared_elements(list1, list2):
  return [value for value in list1 if value in list2]
```<|im_end|>
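
Because output_history=True is passed above, the returned object also records intermediate decoding states. The sketch below assumes a history field holding the token sequence at each diffusion step (this follows the Dream-style generation API that DiffuCoder builds on; treat the field name as an assumption and verify it against the model card). Diffing consecutive snapshots reveals the order in which positions were committed, which is one way to observe the temperature effect on generation order noted in the findings:

# Assumes output.history is a list of token-id tensors, one snapshot per
# diffusion step (field name is an assumption; check the model card).
prev = output.history[0]
for step, cur in enumerate(output.history[1:], start=1):
    changed = (cur != prev).nonzero(as_tuple=True)[-1]  # positions filled at this step
    if changed.numel() > 0:
        print(f"step {step}: committed positions {changed.tolist()}")
    prev = cur

The second example runs the same prompt against the instruction-tuned checkpoint, apple/DiffuCoder-7B-Instruct, with a slightly lower sampling temperature; the usage is otherwise identical: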
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-Instruct"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
""" ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1 # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=256,
    output_history=True,
    return_dict_in_generate=True,
    steps=256//TOKEN_PER_STEP,
    temperature=0.3,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p) :].tolist())
    for p, g in zip(input_ids, output.sequences)
]

print(generations[0].split('<|dlm_pad|>')[0])

Output:

Here is the code to solve this problem: 
```python
def shared_elements(list1, list2):
    result = []
    for i in list1:
        if i in list2:
            result.append(i)
    return result
```<|im_end|>

Code

https://github.com/apple/ml-diffucoder

