Apple Open-Sources DiffuCoder: A Masked Diffusion Model for Code Generation

Published: 2025-07-09

This software project accompanies the research paper DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation.


Research Motivation

Building on masked denoising models (MDMs), diffusion large language models (dLLMs) such as LLaDA and Dream have reached performance on par with autoregressive (AR) LLMs of the same size across many benchmarks. Recent commercial-scale dLLMs such as Mercury and Gemini further demonstrate that diffusion-based code generators can rival top AR code models on programming tasks while delivering faster text generation.

However, the generation patterns and post-training strategies of dLLMs remain under-explored. This work investigates the following questions:

  • How does the generation pattern of dLLMs fundamentally differ from that of AR models?
  • What differences arise when modeling different data types, such as code versus math?
  • How should the diversity boundary of dLLMs be characterized, and how should the post-training pipeline be designed?

We adopt the adaptation approach of DiffuLLaMA to train DiffuCoder, and we introduce a new metric, the autoregressiveness (AR-ness) score, to quantify the causal pattern in a dLLM's generation process.
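
To make the metric concrete, here is a toy sketch (an illustrative simplification, not the exact formula from the paper): treat local AR-ness as the fraction of decoding steps that commit the token immediately to the right of the previously committed one.

# Toy sketch of a *local* AR-ness ratio (an illustrative simplification, not
# the paper's exact metric): given the order in which token positions were
# committed during diffusion decoding, count how often a step fills the
# position directly after the previous one.
def local_ar_ness(decode_order):
    """decode_order[i] is the sequence position committed at decoding step i."""
    if len(decode_order) < 2:
        return 1.0
    consecutive = sum(
        1 for prev, cur in zip(decode_order, decode_order[1:]) if cur == prev + 1
    )
    return consecutive / (len(decode_order) - 1)

print(local_ar_ness([0, 1, 2, 3]))  # 1.0: strictly left-to-right, AR-like
print(local_ar_ness([3, 0, 2, 1]))  # 0.0: fully out-of-order

A strictly left-to-right decoder scores 1.0, and the score drops as generation departs from the AR order. With this picture in mind, the main findings are as follows.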

Findings

  • Due to the nature of text, dLLMs still exhibit a left-to-right bias, but unlike AR models they can also break this strict order.

  • After pre-training, we find that code tasks induce less global AR-ness than math tasks.

  • In dLLMs, changing the sampling temperature affects not only which tokens are sampled (as in AR models) but also the generation order itself.

For more interesting findings, please refer to our original paper!

We propose Coupled-GRPO, a post-training method built on Group Relative Policy Optimization (GRPO) that improves DiffuCoder's performance.
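
As background for the optimization step, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods share (the coupled complementary-mask sampling that Coupled-GRPO adds for estimating diffusion token log-probabilities is not shown and would sit on top of this):

import torch

# Minimal GRPO-style sketch: sample a group of completions per prompt, score
# each with a scalar reward (e.g. a unit-test pass rate), then normalize the
# rewards within each group to obtain per-sample advantages.
def group_relative_advantages(rewards, eps=1e-6):
    """rewards: tensor of shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0]])
print(group_relative_advantages(rewards))  # above-average samples get positive advantage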

Quick Start

The first example below queries the coupled-GRPO post-trained checkpoint, apple/DiffuCoder-7B-cpGRPO:

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-cpGRPO"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
""" ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1 # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=256,
    output_history=True,
    return_dict_in_generate=True,
    steps=256//TOKEN_PER_STEP,
    temperature=0.4,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p) :].tolist())
    for p, g in zip(input_ids, output.sequences)
]

print(generations[0].split('<|dlm_pad|>')[0])

Output:

Here is the code to solve this problem: 
```python
def shared_elements(list1, list2):
  return [value for value in list1 if value in list2]
```<|im_end|>
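
Because output_history=True is passed above, the returned object also records intermediate decoding states. The sketch below assumes a history field holding the token sequence at each diffusion step (this follows the Dream-style generation API that DiffuCoder builds on; treat the field name as an assumption and verify it against the model card). Diffing consecutive snapshots reveals the order in which positions were committed, which is one way to observe the temperature effect on generation order noted in the findings:

# Assumes output.history is a list of token-id tensors, one snapshot per
# diffusion step (field name is an assumption; check the model card).
prev = output.history[0]
for step, cur in enumerate(output.history[1:], start=1):
    changed = (cur != prev).nonzero(as_tuple=True)[-1]  # positions filled at this step
    if changed.numel() > 0:
        print(f"step {step}: committed positions {changed.tolist()}")
    prev = cur

The second example runs the same prompt against the instruction-tuned checkpoint, apple/DiffuCoder-7B-Instruct, with a slightly lower sampling temperature; the usage is otherwise identical: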
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "apple/DiffuCoder-7B-Instruct"
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

query = "Write a function to find the shared elements from the given two lists."
prompt = f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{query.strip()}
<|im_end|>
<|im_start|>assistant
""" ## following the template of qwen; you can also use apply_chat_template function

TOKEN_PER_STEP = 1 # diffusion timesteps * TOKEN_PER_STEP = total new tokens

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device="cuda")
attention_mask = inputs.attention_mask.to(device="cuda")

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=256,
    output_history=True,
    return_dict_in_generate=True,
    steps=256//TOKEN_PER_STEP,
    temperature=0.3,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.,
)
generations = [
    tokenizer.decode(g[len(p) :].tolist())
    for p, g in zip(input_ids, output.sequences)
]

print(generations[0].split('<|dlm_pad|>')[0])

Output:

Here is the code to solve this problem: 
```python
def shared_elements(list1, list2):
    result = []
    for i in list1:
        if i in list2:
            result.append(i)
    return result
```<|im_end|>

Code

https://github.com/apple/ml-diffucoder

