LLM Notes 11: Fine-tuning the Qwen3 Model to Improve Its Mathematical Reasoning

Published: 2025-07-07

Overview

The fine-tuning library used here is unsloth. It supports only single-GPU fine-tuning, which makes it better suited to individuals and small teams.

The base model is unsloth/Qwen3-8B-unsloth-bnb-4bit, a dynamically quantized 4-bit build, downloaded from ModelScope.
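The download step itself isn't shown in these notes; below is a minimal sketch using ModelScope's snapshot_download (assuming the modelscope package is installed; the returned path, or a copy of it, must match the model_name used later).

from modelscope import snapshot_download

# Downloads to the local ModelScope cache and returns the checkpoint path;
# point from_pretrained at this path (or copy it to ./Qwen3-8B-unsloth-bnb-4bit).
model_dir = snapshot_download("unsloth/Qwen3-8B-unsloth-bnb-4bit")
print(model_dir)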

This fine-tune uses two datasets: OpenMathReasoning-mini and FineTome-100k.

Code demo: calling the model with the unsloth framework

Import libraries

from unsloth import FastLanguageModel
import torch

Load the model

max_seq_length = 8192
dtype = None        # None lets unsloth pick the dtype automatically (bfloat16 on supported GPUs)
load_in_4bit = True

# Load the Qwen3 model and tokenizer from the current directory
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./Qwen3-8B-unsloth-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model

Output:

==((====))==  Unsloth 2025.6.8: Fast Qwen3 patching. Transformers: 4.53.0.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.57 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.13it/s]

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 4096, padding_idx=151654)
    (layers): ModuleList(
      (0-2): 3 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (3): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (4): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (5): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (6): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (7-33): 27 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (34): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (35): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((4096,), eps=1e-06)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)

Chatting with the model

messages = [
    {"role": "user", "content": "你好,好久不见!"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,     # return a string rather than token IDs
    add_generation_prompt=True,     # append the generation prompt, i.e. the assistant header
    enable_thinking=False, # disable thinking mode
)

text

Output. This shows the format of the data before it enters the model: the complete text including the special tokens.

'<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'

Tokenize the text

inputs = tokenizer(text, return_tensors="pt").to("cuda")
inputs
{'input_ids': tensor([[151644,    872,    198, 108386,   3837, 111920, 101571,   6313, 151645,
            198, 151644,  77091,    198, 151667,    271, 151668,    271]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

Run model inference

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
outputs
tensor([[151644,    872,    198, 108386,   3837, 111920, 101571,   6313, 151645,
            198, 151644,  77091,    198, 151667,    271, 151668,    271, 108386,
           6313, 111920, 101571,   6313, 104284, 113097,  56568, 101036,   1773,
         104044, 108178, 104472,  11319, 104139, 104838,  29826,  99172,  33108,
          35946,  93149, 101037,  11319, 144236, 151645]], device='cuda:0')

Decode the IDs back to natural language

response = tokenizer.batch_decode(outputs)
response
['<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n你好!好久不见!我也很想你呢。最近过得怎么样?有什么新鲜事想和我分享吗?😊<|im_end|>']

To enable the thinking process, just set enable_thinking to True:

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,     # return a string rather than token IDs
    add_generation_prompt=True,     # append the generation prompt, i.e. the assistant header
    enable_thinking=True, # enable thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
'<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户发来“你好,好久不见!”,看起来像是在打招呼,但可能有更深层的意图。首先,我需要确认用户是否在测试我的反应,或者是否有其他需求。用户可能希望继续之前的对话,但之前没有记录,所以可能需要进一步询问。另外,用户可能想表达某种情感,比如怀念或者测试我的回应能力。我应该保持友好,同时引导用户说明具体需求,以便提供更有针对性的帮助。需要避免假设太多,保持开放和灵活的态度。接下来,我应该用轻松的语气回应,同时邀请用户分享更多信息,这样可以促进更深入的交流。\n</think>\n\n啊,好久不见!最近过得怎么样呀?😊 有什么新鲜事想和我分享吗?或者需要我帮忙解决什么问题?<|im_end|>'

Template with a system prompt
 

messages = [
    {"role": "system", "content": "你是一名助人为乐的助手,名字叫小胡。"},
    {"role": "user", "content":"你好,请问你叫什么名字?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,     # return a string rather than token IDs
    add_generation_prompt=True,     # append the generation prompt, i.e. the assistant header
    enable_thinking=True, # enable thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
'<|im_start|>system\n你是一名助人为乐的助手,名字叫小胡。<|im_end|>\n<|im_start|>user\n你好,请问你叫什么名字?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户问我的名字,我应该回答小胡。我需要保持友好和乐于助人的态度。可以加点表情符号让对话更生动。比如用✨或者😊这样的符号。然后可以主动询问用户需要什么帮助,这样能促进进一步的交流。保持口语化,不用太正式。比如用"有什么我可以帮你的吗?"这样的问句。这样既亲切又自然。同时要注意不要用复杂的句子结构,保持简单明了。这样用户会觉得容易理解和亲近。最后,确保回答符合角色设定,体现出助人为乐的特点。比如用"随时为你服务!"这样的结尾。这样整个回答就既符合要求,又显得亲切友好。\n</think>\n\n你好呀!我叫小胡,是你的AI助手,有什么我可以帮你的吗?✨😊<|im_end|>'

Function calling with the LLM, using weather lookup as an example

Write the weather-lookup function

import requests
import json

def get_weather(loc):
    """Query the current weather for a single city via the OpenWeatherMap API."""
    url = "http://api.openweathermap.org/data/2.5/weather"
    params = {
        "q": loc,
        "appid": "your_api_key",  # replace with your OpenWeatherMap API key
        "units": "metric",  # use Celsius
        "lang": "zh_cn"
    }
    response = requests.get(url, params=params)
    data = response.json()
    return json.dumps(data)

Define the tool list, described with a JSON Schema

tools = [
    {
        "type": "function",
        "function": {
            'name': 'get_weather',
            'description': '查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。',
            'parameters': {
                'type': 'object',
                'properties': {
                    'loc': {
                        'description': "城市名称,注意,中国的城市需要用对应城市的英文名称代替。",
                        'type': 'string'
                    }
                },
                'required': ['loc']
            }  
        }
    }
]
messages = [
    {"role": "system", "content": "你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。"},
    {"role": "user", "content":"你好,请帮我查询一下无锡的今天天气如何?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
text

As you can see, the tools content is also injected into the context.

'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡的今天天气如何?<|im_end|>\n<|im_start|>assistant\n'

Generate the model's response

inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]

You can see that it successfully produced a function-call message; the tool_call portion reads:

\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call>
'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户让我查询无锡的天气。首先,我需要确认用户的需求是今天的天气情况。然后,根据提供的工具,我需要调用get_weather函数。注意,工具里提到中国的城市要用英文名称,所以无锡对应的英文是Wuxi。接下来,构造参数,参数是loc,值应该是"Wuxi"。然后,确保只调用一个城市,没有其他参数。最后,生成正确的tool_call结构,把函数名和参数放进去。检查一遍有没有错误,比如拼写或者参数名是否正确。确认无误后,就可以返回结果了。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call><|im_end|>'

So what does the model output when a single reply must use a tool several times, or use different tools at once? Here we ask the model to check the weather in two cities simultaneously.

messages = [
    {"role": "system", "content": "你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。"},
    {"role": "user", "content":"你好,请帮我查询一下无锡和南京的今天天气如何?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]

As you can see, the model is smart enough to split this into two function calls.

'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡和南京的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户让我查询无锡和南京今天的天气。首先,我需要确认用户提到的城市是否正确,以及是否需要转换为英文名称。根据工具说明,中国的城市需要用英文名称代替。无锡的英文是Wuxi,南京是Nanjing。然后,我需要调用get_weather函数,但每次只能查询一个城市。所以可能需要分两次调用。不过用户同时提到了两个城市,可能需要分别处理。先检查是否有函数支持多个城市,但根据工具描述,一次只能查询一个城市。因此,我应该分别调用两次get_weather函数,一次 for 无锡,一次 for 南京。确保参数正确,然后返回结果。现在先处理第一个城市无锡。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call>\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Nanjing"}}\n</tool_call><|im_end|>'

But clearly this is only an intermediate reply; what the user needs is the final weather answer as a string, so further processing is required.

Suppose the content below is the tool_call portion that the developer has extracted from the model output, in JSON form; it needs to be appended to messages. (Note that this hand-written example uses a "location" key and Chinese city names, unlike the schema's "loc"; the chat template serializes whatever it is given, as the later output shows.)
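The extraction code itself is not shown in the notes; a minimal sketch (a hypothetical helper, operating on the decoded output string from above) might look like this:

import json
import re

def extract_tool_calls(generated_text):
    # Collect every JSON object wrapped in <tool_call>...</tool_call> tags.
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(generated_text)]

tool_calls = extract_tool_calls(response[0])
# -> [{'name': 'get_weather', 'arguments': {'loc': 'Wuxi'}},
#     {'name': 'get_weather', 'arguments': {'loc': 'Nanjing'}}]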
 

messages.append(
    {
        "role": "assistant",
        "content": "<think>\n我将调用 get_weather 函数来检查天气。\n</think>\n",
        "tool_calls": [
            {"name": "get_weather", 
             "arguments": {
                 "location": "无锡"
             }
            },
             {
                 "name": "get_weather",
                 "arguments": {
                     "location": "南京"
                 }
             }
        ]
    }
)

The output of the function call is likewise appended to messages in JSON form. To be clear, all of this has to be implemented by external code; here we simply pretend the results came back and use them directly.

messages.append(
    {
        "role": "tool",
        "content": json.dumps({
            "location": "无锡",
            "weather": "晴,最高气温26摄氏度"
        })
    }
)

messages.append(
    {
        "role": "tool",
        "content": json.dumps({
            "location": "南京",
            "weather": "多云转小雨,最高气温23摄氏度"
        })
    }
)

Feeding this back into the model then yields the following. (The duplicated tool_response blocks in the recorded output suggest the tool messages were appended twice before this run.)
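The generation call is not repeated in the original notes; presumably it follows the same pattern as before, re-applying the chat template to the full message history (a sketch):

text = tokenizer.apply_chat_template(
    messages,               # system + user + assistant tool_calls + tool responses
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]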
 

'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡和南京的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n我将调用 get_weather 函数来检查天气。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "无锡"}}\n</tool_call>\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "南京"}}\n</tool_call><|im_end|>\n<|im_start|>user\n<tool_response>\n{"location": "\\u65e0\\u9521", "weather": "\\u6674\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2926\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u5357\\u4eac", "weather": "\\u591a\\u4e91\\u8f6c\\u5c0f\\u96e8\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2923\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u65e0\\u9521", "weather": "\\u6674\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2926\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u5357\\u4eac", "weather": "\\u591a\\u4e91\\u8f6c\\u5c0f\\u96e8\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2923\\u6444\\u6c0f\\u5ea6"}\n</tool_response><|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户之前请求查询无锡和南京的天气,我分别调用了两次get_weather函数。现在需要将结果整理成自然的中文回复。首先,确认无锡的天气是晴天,最高温度26摄氏度,南京是多云转小雨,最高温度23摄氏度。需要注意的是,用户可能希望知道两地的天气差异,是否需要带伞,或者穿衣建议。要确保信息清晰,用口语化的表达,避免使用专业术语。同时,检查是否有错别字,确保城市名称正确。最后,以友好的语气结束,询问是否需要进一步帮助。\n</think>\n\n无锡今天是晴天,最高气温26摄氏度,天气晴朗宜人。南京今天多云转小雨,最高气温23摄氏度,建议带伞出门。需要我帮您查询其他城市的天气吗?<|im_end|>'

An important takeaway at this point: these model outputs are exactly the format the training samples must have when training the model.

Preparing the fine-tuning datasets

This fine-tune uses two datasets: OpenMathReasoning-mini and FineTome-100k.

The former covers reasoning through and solving math problems; the latter provides ordinary conversation.

You might wonder: if the goal is fine-tuning for math, what is the ordinary-conversation dataset for?
Answer: 1. It prevents catastrophic forgetting after fine-tuning. 2. It preserves Qwen3's hybrid reasoning ability, i.e. the model keeps both its reasoning and non-reasoning modes.
 

Download the datasets with the datasets library
 

from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")  # "cot" is the Chain-of-Thought split
reasoning_dataset
Dataset({
    features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
    num_rows: 19252
})
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
non_reasoning_dataset
Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

The math dataset has nearly 20k rows while the chat dataset has 100k, so the two are badly unbalanced; next we clean and rebalance the data.

Dataset cleaning

Add a conversations field
 

def generate_conversation(examples):
    problems = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ])
    return {"conversations": conversations,}
reasoning_data = reasoning_dataset.map(generate_conversation, batched=True)  # adds a conversations field
reasoning_data["conversations"][0]
[{'content': 'Given $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.',
  'role': 'user'},
 {'content': "<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>To solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n   \\[\n   \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n   \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n   \\[\n   \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n   \\]\n\n3. Square both sides to eliminate the square root on the left:\n   \\[\n   (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n   \\]\n   Simplifying both sides, we get:\n   \\[\n   x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n   \\]\n\n4. Combine like terms on the right side:\n   \\[\n   x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n   \\]\n   Simplifying further:\n   \\[\n   x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n5. Subtract \\(x^2\\) from both sides:\n   \\[\n   165 = -3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n   \\[\n   168 = 14\\sqrt{x^2 - 52}\n   \\]\n\n7. Divide both sides by 14:\n   \\[\n   12 = \\sqrt{x^2 - 52}\n   \\]\n\n8. Square both sides again to eliminate the square root:\n   \\[\n   12^2 = x^2 - 52\n   \\]\n   Simplifying:\n   \\[\n   144 = x^2 - 52\n   \\]\n\n9. 
Add 52 to both sides to solve for \\(x^2\\):\n   \\[\n   196 = x^2\n   \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n    \\[\n    x = \\sqrt{196} = 14\n    \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n    \\[\n    \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n    \\]\n    The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]",
  'role': 'assistant'}]

Apply the chat template to serialize each conversation
 

reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_data["conversations"],
    tokenize = False
)
reasoning_conversations[0]
"<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>\n\nTo solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n   \\[\n   \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n   \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n   \\[\n   \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n   \\]\n\n3. Square both sides to eliminate the square root on the left:\n   \\[\n   (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n   \\]\n   Simplifying both sides, we get:\n   \\[\n   x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n   \\]\n\n4. Combine like terms on the right side:\n   \\[\n   x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n   \\]\n   Simplifying further:\n   \\[\n   x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n5. Subtract \\(x^2\\) from both sides:\n   \\[\n   165 = -3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n   \\[\n   168 = 14\\sqrt{x^2 - 52}\n   \\]\n\n7. Divide both sides by 14:\n   \\[\n   12 = \\sqrt{x^2 - 52}\n   \\]\n\n8. 
Square both sides again to eliminate the square root:\n   \\[\n   12^2 = x^2 - 52\n   \\]\n   Simplifying:\n   \\[\n   144 = x^2 - 52\n   \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n   \\[\n   196 = x^2\n   \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n    \\[\n    x = \\sqrt{196} = 14\n    \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n    \\[\n    \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n    \\]\n    The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]<|im_end|>\n"

Convert the other dataset to the same format as well
 

from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)
non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize=False,
)

Subsample the data so that non-reasoning samples amount to only 25% of the reasoning-sample count, i.e. int(19252 × 0.25) = 4813 rows.

chat_percentage = 0.75   # note: despite the name, the chat subset below is sized at (1 - chat_percentage) = 25% of the reasoning count
import pandas as pd
non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations) * (1.0 - chat_percentage)),   # int(19252 * 0.25) = 4813 rows
    random_state = 2407,
)

Next, concatenate the two datasets and shuffle them.
 

data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed=3407)

LoRA fine-tuning

LoRA configuration: attach LoRA adapters to the Qwen3 model, i.e. add low-rank adapters to the modules selected for fine-tuning.

model = FastLanguageModel.get_peft_model(
    model,
    r=32,   # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # scaling coefficient
    lora_dropout = 0,
    bias = "none",  # how bias terms are handled
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Configure the fine-tuning trainer and set the training hyperparameters
 

from trl import SFTTrainer, SFTConfig   # SFTTrainer: supervised fine-tuning trainer; SFTConfig: its hyperparameter config
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset= None,
    args = SFTConfig(
        dataset_text_field = "text",  # which dataset field holds the training text
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,   # accumulate gradients over 4 mini-batches before each parameter update
        warmup_steps = 5,       # learning-rate warmup steps
        # num_train_epochs = 1, # lower priority than max_steps, so usually only one of the two is set; typically run a few max_steps first to confirm training works and the loss is decreasing
        max_steps = 30,    # maximum number of parameter updates
        learning_rate = 2e-4,
        logging_steps = 1,    # log once per parameter update
        optim = "adamw_8bit",   # 8-bit quantized optimizer
        weight_decay = 0.01,   # weight-decay rate
        lr_scheduler_type = "linear",   # linear learning-rate decay
        seed = 3407,
        report_to = "wandb",   # use wandb to track the training run
    ),
)

Now start training. It is currently set to stop after 30 parameter updates; the goal is to observe whether the loss decreases normally. Decreasing with some fluctuation is fine, since this is mini-batch gradient descent. If nothing looks wrong, switch to epoch-based training for the real run.
 

trainer_stats = trainer.train()

The training loss is printed during training; if you configured wandb, its web page shows more intuitive charts.

Step Training Loss
1 0.518500
2 0.621800
3 0.605500
4 0.587000
5 0.520100
6 0.498500
7 0.486000
8 0.465000
9 0.420200

The fine-tuned model

After fine-tuning, the model variable holds the adapter-wrapped model; you can print its structure:
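The print itself mirrors the earlier cell (evaluating the variable in a notebook):

model   # shows the PeftModelForCausalLM wrapper around the original Qwen3ForCausalLM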
 

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen3ForCausalLM(
      (model): Qwen3Model(
        (embed_tokens): Embedding(151936, 4096, padding_idx=151654)
        (layers): ModuleList(
          (0-2): 3 x Qwen3DecoderLayer(
            (self_attn): Qwen3Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )

The above shows only a small part of the structure. Let's analyze what q_proj looks like after fine-tuning, component by component (a quick way to inspect the trainable parameters follows this list):

(1) base_layer: Linear4bit(in_features=4096, out_features=4096, bias=False)

  • The original linear layer (the part that is not fine-tuned), stored with 4-bit quantization to reduce memory usage.
  • Parameters: in_features=4096 (input feature dimension), out_features=4096 (output feature dimension), bias=False (no bias term).
  • Role: a submodule of one Transformer layer (here q_proj, the query projection used to compute the query vectors in multi-head attention). It holds the original pretrained weights, which normally stay frozen during fine-tuning.
  • 4-bit quantization compresses the weights to 4 bits, cutting memory and compute so large models fit in resource-constrained environments.

(2) lora_dropout: ModuleDict((default): Identity())

  • The dropout layer LoRA uses for regularization. Identity() means no dropout is applied (equivalent to a dropout probability of 0); the input passes through untouched.
  • LoRA allows dropout per layer, but the configuration here disables it, whether for simplicity or to avoid regularization.

(3) lora_A: ModuleDict((default): Linear(in_features=4096, out_features=32, bias=False))

  • The first low-rank matrix in the LoRA decomposition (the A part).
  • in_features=4096 matches the base_layer input dimension; out_features=32 is the LoRA rank, a hyperparameter controlling the adapter's parameter count (the smaller the rank, the fewer parameters and the cheaper the fine-tune); bias=False.
  • lora_A is a trainable matrix that projects the input into a low-dimensional (rank-32) space. This is LoRA's core idea: approximate the weight update with a low-rank factorization to shrink the number of trainable parameters.

(4) lora_B: ModuleDict((default): Linear(in_features=32, out_features=4096, bias=False))

  • The second low-rank matrix (the B part).
  • in_features=32 matches lora_A's output dimension (the rank); out_features=4096 matches the base_layer output dimension; bias=False.
  • lora_B maps the low-dimensional representation back to the original output dimension; together with lora_A it forms LoRA's low-rank update matrix.

(5) lora_embedding_A: ParameterDict()

  • A parameter dict that would hold LoRA A matrices for embedding layers, where applicable. It is empty here because q_proj is a linear layer inside the attention block, not an embedding layer.

(6) lora_embedding_B: ParameterDict()

  • The counterpart holding embedding-layer B matrices; empty for the same reason.

(7) lora_magnitude_vector: ModuleDict()

  • An optional LoRA component that can store extra vectors (such as scaling factors or magnitude vectors) to adjust the LoRA output. It is empty here, meaning no such mechanism is in use.
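As a quick sanity check (not in the original notes), PEFT-wrapped models expose a helper for counting trainable parameters; the manual loop below is equivalent:

# Report how many parameters LoRA actually trains.
model.print_trainable_parameters()

# Equivalent manual count:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")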

How sample features are processed by the fine-tuned q_proj layer

At inference time the computation proceeds step by step as follows (a runnable sketch follows this list):

  1. Input:
    • The input $x$ is a tensor of shape $[\text{batch\_size}, \text{seq\_length}, 4096]$, typically the hidden state at some layer of the Transformer (for example, the previous layer's output).

  2. Base linear layer:
    • First, $x$ passes through base_layer (the original weight $W$) as a linear map:
      $$y_{\text{base}} = W \cdot x$$
    • Because base_layer is 4-bit quantized, the weights are first dequantized back to floating point (handled automatically by underlying libraries such as bitsandbytes) before the matrix multiplication runs.
    • The output $y_{\text{base}}$ has shape $[\text{batch\_size}, \text{seq\_length}, 4096]$.

  3. LoRA low-rank update:
    • The LoRA update corresponds to $\Delta W = A \cdot B$, computed in two steps:
      • $x$ is projected into the low-rank space by lora_A: $z = x \cdot A$, where $A$ has shape $[4096, 32]$, so $z$ has shape $[\text{batch\_size}, \text{seq\_length}, 32]$.
      • $z$ is projected back to the output dimension by lora_B: $y_{\text{lora}} = z \cdot B = (x \cdot A) \cdot B$, where $B$ has shape $[32, 4096]$, giving $y_{\text{lora}}$ the shape $[\text{batch\_size}, \text{seq\_length}, 4096]$.
    • Mathematically $y_{\text{lora}} = x \cdot (A \cdot B)$, where $A \cdot B$ is equivalent to a full $[4096, 4096]$ matrix; computing in two steps (first $x \cdot A$, then $z \cdot B$) avoids ever forming that matrix and greatly reduces the computation.

  4. Combining the outputs:
    • The final output is the sum of the base output and the LoRA update:
      $$y = y_{\text{base}} + y_{\text{lora}} = W \cdot x + (x \cdot A) \cdot B$$
    • (Strictly speaking, the LoRA branch is scaled by $\alpha / r$; with lora_alpha = 32 and r = 32 as configured above, the factor is 1 and can be omitted.)
    • The output $y$ keeps the shape $[\text{batch\_size}, \text{seq\_length}, 4096]$ and feeds directly into the attention mechanism as the query vectors.

  5. Dropout (no effect):
    • Since lora_dropout is Identity(), no dropout is applied at inference; the input passes straight through.
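To make the arithmetic concrete, here is a minimal PyTorch sketch of the forward pass just described. It is an illustration only, not unsloth's fused implementation; a plain Linear stands in for the quantized Linear4bit base layer, and the sizes follow the q_proj printout above.

import torch
import torch.nn as nn

d_model, rank, alpha = 4096, 32, 32

base = nn.Linear(d_model, d_model, bias=False)   # stands in for the frozen 4-bit base weight W
lora_A = nn.Linear(d_model, rank, bias=False)    # projects input into the rank-32 subspace
lora_B = nn.Linear(rank, d_model, bias=False)    # projects back to the model dimension
nn.init.zeros_(lora_B.weight)                    # B starts at zero, so the initial LoRA update is zero

def lora_forward(x):
    # y = W·x + (alpha/r) · B(A(x)); with alpha = r = 32 the scaling factor is 1
    scaling = alpha / rank
    return base(x) + scaling * lora_B(lora_A(x))

x = torch.randn(2, 10, d_model)   # [batch_size, seq_length, 4096]
y = lora_forward(x)               # same shape: [2, 10, 4096]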

Testing the model

messages = [
    {"role": "user", "content": "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = False, 
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),   # ** unpacks the dict returned by the tokenizer into keyword arguments
    max_new_tokens = 256,
    temperature = 0.7,
    top_p = 0.8,
    top_k = 20,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
    use_cache=False,
)
To solve the equation \((x + 2)^2 = 0\), we start by taking the square root of both sides. 

\[
(x + 2)^2 = 0
\]

Taking the square root of both sides, we get:

\[
x + 2 = 0
\]

Next, we solve for \(x\) by subtracting 2 from both sides:

\[
x = -2
\]

Thus, the solution to the equation is \(x = -2\).<|im_end|>

Large-scale fine-tuning

Setting num_train_epochs to 1 trains over the entire dataset once; afterwards, save the fine-tuned model weights. A sketch of this final run follows.
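This is a sketch under stated assumptions: it reuses the trainer configuration from above with num_train_epochs = 1 in place of max_steps, and the output directory name is a hypothetical choice.

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,   # one full pass over the combined dataset
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb",
    ),
)
trainer_stats = trainer.train()

# Save only the LoRA adapter weights plus the tokenizer.
model.save_pretrained("qwen3-8b-math-lora")       # hypothetical output directory
tokenizer.save_pretrained("qwen3-8b-math-lora")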