Fine-Tuning Qwen and DeepSeek with the Transformers Framework - EW帮帮网

transformers is a Python library for training and fine-tuning large models.

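transformers may need to download model files from Hugging Face, so users in mainland China should configure a mirror to make the downloads reachable. Create an environment variable named HF_ENDPOINT and set its value to https://hf-mirror.com.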
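If you prefer not to change system-wide settings, the variable can also be set from inside the script. This is a minimal sketch, assuming it runs before transformers or huggingface_hub is imported, since the endpoint is read when those libraries are imported:

```python
import os

# Point Hugging Face downloads at the mirror named above.
# This must run before importing transformers or huggingface_hub.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

print(os.environ["HF_ENDPOINT"])
```

Setting it in code only affects the current process; a shell or system environment variable affects every script.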

Install transformers before using it:

pip install transformers datasets

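PyTorch is also required: visit the official PyTorch website, scroll down to the Install PyTorch section, and pick the install command that matches your device.

If you hit a missing-module error (ModuleNotFoundError) while running the code, install the missing package with pip. The fine-tuning scripts later in this article also import peft, and their 4-bit loading with device_map="auto" relies on bitsandbytes and accelerate.

Basic usage

A simple program that chats with a large model looks like this: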

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread

model_name = "Qwen/Qwen2.5-7B-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

print("Model loaded")

# Conversation history; uncomment the system message to set a persona
message = [
    # {"role": "system", "content": "You are an AI assistant"},
]

while True:
    prompt = input(">")
    message.append({"role": "user", "content": prompt})

    text = tokenizer.apply_chat_template(
        message,
        tokenize=False,
        add_generation_prompt=True
    )

    print(text)

    model_input = tokenizer([text], return_tensors="pt").to(model.device)

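    # skip_special_tokens keeps end-of-turn markers out of the printed and stored reply
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)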
    gkw = dict(model_input, streamer=streamer, max_new_tokens=2048, repetition_penalty=1.3)
    Thread(target=model.generate, kwargs=gkw).start()

    all_message = ""
    for i in streamer:
        if i:
            print(i, end='', flush=True)
            all_message += i
    print()
    message.append({"role": "assistant", "content": all_message})

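This script lets you ask questions in a loop and streams each reply as it is generated. model.generate runs in a background thread and feeds the streamer, while the main loop prints the chunks as they arrive and then appends the full reply to the conversation history.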
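The thread-plus-streamer pattern can be illustrated without a model. The sketch below uses only the standard library; fake_generate and stream_reply are hypothetical stand-ins for model.generate and the streamer, written to show the producer/consumer idea rather than how transformers implements it:

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def fake_generate(chunks, out_queue):
    """Stand-in for model.generate: pushes text chunks, then the end marker."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(_END)


def stream_reply(chunks):
    """Run the producer in a background thread and yield chunks as they arrive."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_queue))
    worker.start()
    while True:
        item = out_queue.get()  # blocks until the producer puts something
        if item is _END:
            break
        yield item
    worker.join()


reply = ""
for piece in stream_reply(["Hello", ", ", "world"]):
    print(piece, end="", flush=True)
    reply += piece
print()
```

Because the consumer blocks on the queue, text appears as soon as each chunk is produced instead of after the whole reply is finished.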

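The fine-tuning process

Fine-tuning can be broken down into the following steps:

Preparing the dataset

The dataset can be prepared in CSV or JSON format, for example: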

prompt,answer
What is 1+1?,1+1 equals 2
What is 2+2?,2+2 equals 4

[
    {
        "prompt": "What is 1+1?",
        "answer": "1+1 equals 2"
    },
    {
        "prompt": "What is 2+2?",
        "answer": "2+2 equals 4"
    }
]

Of course, this is only a sample; a real dataset will be larger and more complex.
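The two layouts carry the same records, so one can be converted into the other. The following is a small standard-library sketch; csv_to_records is a hypothetical helper, and the column names prompt and answer are taken from the sample above:

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text with a prompt,answer header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"prompt": row["prompt"], "answer": row["answer"]} for row in reader]


csv_text = "prompt,answer\nWhat is 1+1?,1+1 equals 2\nWhat is 2+2?,2+2 equals 4\n"
records = csv_to_records(csv_text)

# ensure_ascii=False keeps non-ASCII text (for example Chinese) readable in the output
json_text = json.dumps(records, ensure_ascii=False, indent=4)
print(json_text)
```

Writing json_text to dataset.json would give a file in the JSON layout that the training scripts in this article load.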

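Data preprocessing

The goal of preprocessing is to put the data into a form the model can consume. A large language model essentially predicts the next token from all the tokens before it, so the prompt has to be wrapped in the model's chat format first. For example, a processed prompt may look like this (using DeepSeek as the example):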

<｜begin▁of▁sentence｜><｜User｜>1+1等于几？<｜Assistant｜><think>
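The layout of that line can be reproduced with plain string formatting, which makes its structure easier to see. This is only an illustrative sketch: render_prompt is a hypothetical helper and the marker strings are copied from the example line. In real code tokenizer.apply_chat_template should produce this text, because the exact template depends on the model and its version.

```python
# Marker strings copied from the example line; real templates may differ by model version.
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"


def render_prompt(question):
    """Build a single-turn generation prompt in the layout shown in the example."""
    return f"{BOS}{USER}{question}{ASSISTANT}<think>"


print(render_prompt("What is 1+1?"))
```

The model continues the text from the trailing marker, which is why the prompt ends where the assistant's turn begins.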

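Applying LoRA

LoRA trains only a small number of added parameters while the original weights stay frozen, which reduces GPU memory usage.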
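To get a feel for how few parameters that is, the count can be worked out for a single weight matrix. LoRA replaces the update to a d_out x d_in matrix with the product of a d_out x r matrix and an r x d_in matrix. In this sketch the hidden size 4096 is an assumed round number, not a value read from any model config, and r=8 matches the LoraConfig used later in this article:

```python
def full_params(d_out, d_in):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, r):
    """Trainable parameters LoRA adds: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)


d = 4096  # assumed hidden size, for illustration only
full = full_params(d, d)
lora = lora_params(d, d, r=8)
print(full, lora, f"{lora / full:.2%}")
```

With these numbers the adapter trains 65,536 parameters for a matrix of 16,777,216, which is about 0.39%.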

Training

Training can be done with the Trainer class from transformers.

Fine-tuning Qwen

The code is as follows:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
import datasets
from peft import LoraConfig, get_peft_model


dataset = datasets.load_dataset("json", data_files="dataset.json")['train']

# Load the model and tokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit quantized loading; the options go through BitsAndBytesConfig,
# since the bare load_in_4bit keyword argument is deprecated in recent transformers releases
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    use_cache=False,
)

model.enable_input_require_grads()

model = get_peft_model(model,
                       LoraConfig(
                           r=8,
                           lora_alpha=16,
                           lora_dropout=0.05,
                           bias="none",
                           task_type="CAUSAL_LM",
                       )
                       )
model.gradient_checkpointing_enable()

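# Data preprocessing
def process_func(example):
    MAX_LENGTH = 2400
    # Build the conversation as a list of chat messages
    messages = [
        {"role": "user", "content": example['prompt']},
        # Turn literal "\n" sequences in the data into real newlines
        {"role": "assistant", "content": example['answer'].replace('\\n', '\n')}
    ]
    # Render and tokenize the full conversation with the chat template
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)
    # Every token is real (no padding), so the attention mask is all ones
    attention_mask = [1] * len(input_ids)
    # Build labels: tokens of the user turn are set to -100 so the loss ignores them
    instruction_length = len(tokenizer.apply_chat_template(
        [{"role": "user", "content": example['prompt']}],
        tokenize=True
    ))
    labels = [-100] * instruction_length + input_ids[instruction_length:]

    # Truncate sequences longer than MAX_LENGTH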
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]

    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long)
    }

tokenized_dataset = dataset.map(process_func)

# Configure the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=20,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    save_steps=1000,
    save_total_limit=2,
    eval_strategy="no",
    logging_steps=100,
    fp16=True,
    remove_unused_columns=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# Start training
trainer.train()

model.save_pretrained("./fine_tuned_model_qwen")

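A few points to note:

The model is loaded in 4-bit to reduce GPU memory usage.
model.gradient_checkpointing_enable() turns on gradient checkpointing, which trades extra computation for lower GPU memory usage; model.enable_input_require_grads() makes the embedding outputs require gradients, so backpropagation still works through the checkpointed layers while the base weights are frozen.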
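A rough estimate shows why 4-bit loading matters. The sketch below counts memory for the weights only; the parameter count of 7 billion is an assumed round figure for a "7B" model, and quantization metadata, activations, LoRA weights, and optimizer state are all ignored:

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


n = 7_000_000_000  # assumed round parameter count for a "7B" model
fp16 = weight_memory_gib(n, 16)
int4 = weight_memory_gib(n, 4)
print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

Under these assumptions the weights shrink from roughly 13 GiB to a little over 3 GiB, a factor of four.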

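The rest of the code is straightforward.

However, what is saved this way is not a complete model, only the LoRA adapter. To get a standalone model, merge the adapter into the base model and save the merged weights together with the tokenizer: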

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

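# Load the base model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.float16)
# Attach the LoRA adapter saved by the training script
lora_config = PeftConfig.from_pretrained("fine_tuned_model_qwen")
model = PeftModel.from_pretrained(base_model, "fine_tuned_model_qwen", config=lora_config)

# Fold the LoRA weights into the base weights and drop the adapter wrappers
model = model.merge_and_unload()

# Save the merged model and the tokenizer to the same folder
to = "yl_qwen7b"
tokenizer.save_pretrained(to)
model.save_pretrained(to)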

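Fine-tuning DeepSeek (first published here)

The biggest difficulty in fine-tuning DeepSeek is the think section. This is why DeepSeek itself relies heavily on reinforcement learning for training: that way the think section does not have to be supervised, and only the model's actual final output matters.

DeepSeek's chat template, as applied by apply_chat_template, drops the think section by default, which means DeepSeek does not know what it thought in the previous turn. Likewise, even if you put a think section into answer, the preprocessing shown above will cause it to be discarded.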
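The effect on an assistant message can be pictured with a few lines of Python. This is a sketch of the behavior described above, assuming the template keeps only the text after the last </think>; strip_think is a hypothetical helper, and the Jinja template shipped with the tokenizer remains the authoritative definition:

```python
def strip_think(content):
    """Keep only the text after the last </think>, if there is one."""
    if "</think>" in content:
        return content.split("</think>")[-1]
    return content


answer = "<think>1 plus 1 is 2.</think>1+1 equals 2"
print(strip_think(answer))
```

Anything written inside the think tags of a training answer therefore never reaches the token sequence the model is trained on.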

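DeepSeek can nevertheless also be fine-tuned with the conventional supervised approach. Just as the labels of the prompt part were set to -100 during preprocessing, the labels of the think part can be set to -100 too. Here this is done inside the loss computation, which means subclassing Trainer and overriding its loss function.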
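The masking step itself is small. The sketch below shows it on plain lists of token ids, independent of where it is applied. mask_span is a hypothetical helper, and the ids 900 and 901 are made-up placeholders for the tokens of <think> and </think>, not real vocabulary ids:

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_span(labels, input_ids, start_id, end_id):
    """Return a copy of labels with the first start_id..end_id span (inclusive) ignored."""
    masked = list(labels)
    if start_id in input_ids:
        start = input_ids.index(start_id)
        if end_id in input_ids[start:]:
            end = input_ids.index(end_id, start)
            for pos in range(start, end + 1):
                masked[pos] = IGNORE_INDEX
    return masked


# Hypothetical token ids: 900 stands for <think>, 901 for </think>
THINK_START, THINK_END = 900, 901
input_ids = [1, 11, 12, 900, 21, 22, 901, 31, 32, 2]
labels = [-100, -100, -100, 900, 21, 22, 901, 31, 32, 2]
masked = mask_span(labels, input_ids, THINK_START, THINK_END)
print(masked)
```

Only the positions after the closing tag keep their labels, so only the final answer contributes to the loss.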

The complete code is as follows:

from typing import Any

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
import datasets
from peft import LoraConfig, get_peft_model


dataset = datasets.load_dataset("json", data_files="dataset.json")['train']

# Load the model and tokenizer
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit quantized loading; the options go through BitsAndBytesConfig,
# since the bare load_in_4bit keyword argument is deprecated in recent transformers releases
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    use_cache=False,
)

model.enable_input_require_grads()

model = get_peft_model(model,
                       LoraConfig(
                           r=8,
                           lora_alpha=16,
                           lora_dropout=0.05,
                           bias="none",
                           task_type="CAUSAL_LM",
                       )
                       )
model.gradient_checkpointing_enable()

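# Data preprocessing
def process_func(example):
    MAX_LENGTH = 2400
    # Build the conversation as a list of chat messages
    messages = [
        {"role": "user", "content": example['prompt']},
        # Turn literal "\n" sequences in the data into real newlines
        {"role": "assistant", "content": example['answer'].replace('\\n', '\n')}
    ]
    # Render and tokenize the full conversation with the chat template
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)
    # Every token is real (no padding), so the attention mask is all ones
    attention_mask = [1] * len(input_ids)
    # Build labels: tokens of the user turn are set to -100 so the loss ignores them
    instruction_length = len(tokenizer.apply_chat_template(
        [{"role": "user", "content": example['prompt']}],
        tokenize=True
    ))
    labels = [-100] * instruction_length + input_ids[instruction_length:]

    # Truncate sequences longer than MAX_LENGTH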
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]

    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long)
    }

tokenized_dataset = dataset.map(process_func)

# Configure the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    save_steps=1000,
    save_total_limit=2,
    eval_strategy="no",
    logging_steps=100,
    fp16=True,
    remove_unused_columns=False
)


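# Custom Trainer that excludes the think span from the loss
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch: Any = None):
        # First forward pass, without gradients: only used to see what the model predicts
        with torch.no_grad():
            logits = model(**inputs).logits

        # Most likely token at each position (teacher-forced predictions, not free generation)
        generated_ids = logits.argmax(dim=-1)
        for i in range(generated_ids.shape[0]):
            generated_text = tokenizer.decode(generated_ids[i], skip_special_tokens=False)

            # Locate the <think> ... </think> span in the predicted text
            think_start = generated_text.find("<think>")
            think_end = generated_text.find("</think>")

            if think_start != -1 and think_end != -1:
                # Convert character offsets to approximate token offsets by re-encoding the prefix
                think_tokens_start = len(tokenizer.encode(generated_text[:think_start]))
                think_tokens_end = len(tokenizer.encode(generated_text[:think_end + len("</think>")]))

                # Set the labels of the think span to -100 so the loss ignores them
                inputs["labels"][i, think_tokens_start:think_tokens_end] = -100

        # Second forward pass: compute the loss with the adjusted labels
        outputs = model(**inputs)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss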


trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# Start training
trainer.train()

model.save_pretrained("./fine_tuned_model_ds")

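The key part of this code is the custom Trainer class: it locates the think section and sets the corresponding labels to -100, so the loss ignores that part.

Next, as with Qwen, the result has to be merged and saved as a complete model. The code is the same as above, with the model name and adapter folder changed.

Alternatively, DeepSeek can be fine-tuned with the same conventional approach used for Qwen; which one to use depends on how well the fine-tuned model performs.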

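Converting to GGUF

Sometimes we want to deploy the model with ollama or LM Studio, which requires converting it to GGUF.

Note that the GGUF version of DeepSeek does not, by default, place the first <think> in the prompt, whereas transformers does. As a result, when a converted DeepSeek model is deployed with ollama or LM Studio, the UI cannot tell which part of the output is the think section and shows it as ordinary text. This is expected behavior, not a bug in the model.

Here is how to convert a transformers model to GGUF.

Download the llama.cpp source package, or run: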

git clone https://github.com/ggerganov/llama.cpp.git

From the directory that contains the cloned llama.cpp folder, run: pip install -r llama.cpp/requirements.txt

Then run the conversion command:

python llama.cpp/convert_hf_to_gguf.py <model path (a directory)> \
    --outtype f16 \
    --outfile <output GGUF file name>

Here, outtype can also be set to a quantized type such as q8_0.
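If the conversion is part of a larger Python workflow, the same command can be assembled as an argument list. This is a sketch under stated assumptions: build_convert_command is a hypothetical helper, the output file name is made up, and yl_qwen7b is the merged-model folder saved earlier in this article. Only the script path and the two flags come from the command shown above.

```python
import shlex


def build_convert_command(model_dir, outfile, outtype="f16",
                          script="llama.cpp/convert_hf_to_gguf.py"):
    """Return the argument list for the conversion command shown above."""
    return ["python", script, model_dir, "--outtype", outtype, "--outfile", outfile]


# Hypothetical output name; yl_qwen7b is the merged-model folder from the merge step
cmd = build_convert_command("yl_qwen7b", "yl_qwen7b-q8_0.gguf", outtype="q8_0")
print(shlex.join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the model path contains spaces.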
