LLM Notes 11: Fine-tuning the Qwen3 Model to Improve Its Mathematical Reasoning

Published: 2025-07-07

Overview

The fine-tuning library used here is unsloth. It supports only single-GPU fine-tuning, which makes it better suited to individuals and small teams.

The base model is unsloth/Qwen3-8B-unsloth-bnb-4bit, a dynamically quantized 4-bit build, downloaded from ModelScope.
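The download step itself isn't shown in these notes; below is a minimal sketch using ModelScope's snapshot_download (assuming the modelscope package is installed; the returned path, or a copy of it, must match the model_name used later).

from modelscope import snapshot_download

# Downloads to the local ModelScope cache and returns the checkpoint path;
# point from_pretrained at this path (or copy it to ./Qwen3-8B-unsloth-bnb-4bit).
model_dir = snapshot_download("unsloth/Qwen3-8B-unsloth-bnb-4bit")
print(model_dir)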

This fine-tune uses two datasets: OpenMathReasoning-mini and FineTome-100k.

Code demo: calling the model with the unsloth framework

Import libraries

from unsloth import FastLanguageModel
import torch

Load the model

max_seq_length = 8192
dtype = None        # None lets unsloth pick the dtype automatically (bfloat16 on supported GPUs)
load_in_4bit = True

# Load the Qwen3 model and tokenizer from the current directory
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./Qwen3-8B-unsloth-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model

Output:

==((====))==  Unsloth 2025.6.8: Fast Qwen3 patching. Transformers: 4.53.0.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.57 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.13it/s]

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 4096, padding_idx=151654)
    (layers): ModuleList(
      (0-2): 3 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (3): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (4): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (5): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (6): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (7-33): 27 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (34): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
      (35): Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((4096,), eps=1e-06)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)

Chatting with the model

messages = [
    {"role": "user", "content": "你好,好久不见!"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,     # return a string rather than token IDs
    add_generation_prompt=True,     # append the generation prompt, i.e. the assistant header
    enable_thinking=False, # disable thinking mode
)

text

Output. This shows the format of the data before it enters the model: the complete text including the special tokens.

'<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'

Tokenize the text

inputs = tokenizer(text, return_tensors="pt").to("cuda")
inputs
{'input_ids': tensor([[151644,    872,    198, 108386,   3837, 111920, 101571,   6313, 151645,
            198, 151644,  77091,    198, 151667,    271, 151668,    271]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

Run model inference

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
outputs
tensor([[151644,    872,    198, 108386,   3837, 111920, 101571,   6313, 151645,
            198, 151644,  77091,    198, 151667,    271, 151668,    271, 108386,
           6313, 111920, 101571,   6313, 104284, 113097,  56568, 101036,   1773,
         104044, 108178, 104472,  11319, 104139, 104838,  29826,  99172,  33108,
          35946,  93149, 101037,  11319, 144236, 151645]], device='cuda:0')

Decode the IDs back to natural language

response = tokenizer.batch_decode(outputs)
response
['<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n你好!好久不见!我也很想你呢。最近过得怎么样?有什么新鲜事想和我分享吗?😊<|im_end|>']

To enable the thinking process, just set enable_thinking to True:

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,     # return a string rather than token IDs
    add_generation_prompt=True,     # append the generation prompt, i.e. the assistant header
    enable_thinking=True, # enable thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
'<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户发来“你好,好久不见!”,看起来像是在打招呼,但可能有更深层的意图。首先,我需要确认用户是否在测试我的反应,或者是否有其他需求。用户可能希望继续之前的对话,但之前没有记录,所以可能需要进一步询问。另外,用户可能想表达某种情感,比如怀念或者测试我的回应能力。我应该保持友好,同时引导用户说明具体需求,以便提供更有针对性的帮助。需要避免假设太多,保持开放和灵活的态度。接下来,我应该用轻松的语气回应,同时邀请用户分享更多信息,这样可以促进更深入的交流。\n</think>\n\n啊,好久不见!最近过得怎么样呀?😊 有什么新鲜事想和我分享吗?或者需要我帮忙解决什么问题?<|im_end|>'

Template with a system prompt
 

messages = [
    {"role": "system", "content": "你是一名助人为乐的助手,名字叫小胡。"},
    {"role": "user", "content":"你好,请问你叫什么名字?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,     # return a string rather than token IDs
    add_generation_prompt=True,     # append the generation prompt, i.e. the assistant header
    enable_thinking=True, # enable thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
'<|im_start|>system\n你是一名助人为乐的助手,名字叫小胡。<|im_end|>\n<|im_start|>user\n你好,请问你叫什么名字?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户问我的名字,我应该回答小胡。我需要保持友好和乐于助人的态度。可以加点表情符号让对话更生动。比如用✨或者😊这样的符号。然后可以主动询问用户需要什么帮助,这样能促进进一步的交流。保持口语化,不用太正式。比如用"有什么我可以帮你的吗?"这样的问句。这样既亲切又自然。同时要注意不要用复杂的句子结构,保持简单明了。这样用户会觉得容易理解和亲近。最后,确保回答符合角色设定,体现出助人为乐的特点。比如用"随时为你服务!"这样的结尾。这样整个回答就既符合要求,又显得亲切友好。\n</think>\n\n你好呀!我叫小胡,是你的AI助手,有什么我可以帮你的吗?✨😊<|im_end|>'

Function calling with the LLM, using weather lookup as an example

Write the weather-lookup function

import requests
import json

def get_weather(loc):
    """Query the current weather for a single city via the OpenWeatherMap API."""
    url = "http://api.openweathermap.org/data/2.5/weather"
    params = {
        "q": loc,
        "appid": "your_api_key",  # replace with your OpenWeatherMap API key
        "units": "metric",  # use Celsius
        "lang": "zh_cn"
    }
    response = requests.get(url, params=params)
    data = response.json()
    return json.dumps(data)

Define the tool list, described with a JSON Schema

tools = [
    {
        "type": "function",
        "function": {
            'name': 'get_weather',
            'description': '查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。',
            'parameters': {
                'type': 'object',
                'properties': {
                    'loc': {
                        'description': "城市名称,注意,中国的城市需要用对应城市的英文名称代替。",
                        'type': 'string'
                    }
                },
                'required': ['loc']
            }  
        }
    }
]
messages = [
    {"role": "system", "content": "你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。"},
    {"role": "user", "content":"你好,请帮我查询一下无锡的今天天气如何?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
text

As you can see, the tools content is also injected into the context.

'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡的今天天气如何?<|im_end|>\n<|im_start|>assistant\n'

Generate the model's response

inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]

You can see that it successfully produced a function-call message; the tool_call portion reads:

\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call>
'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户让我查询无锡的天气。首先,我需要确认用户的需求是今天的天气情况。然后,根据提供的工具,我需要调用get_weather函数。注意,工具里提到中国的城市要用英文名称,所以无锡对应的英文是Wuxi。接下来,构造参数,参数是loc,值应该是"Wuxi"。然后,确保只调用一个城市,没有其他参数。最后,生成正确的tool_call结构,把函数名和参数放进去。检查一遍有没有错误,比如拼写或者参数名是否正确。确认无误后,就可以返回结果了。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call><|im_end|>'

So what does the model output when a single reply must use a tool several times, or use different tools at once? Here we ask the model to check the weather in two cities simultaneously.

messages = [
    {"role": "system", "content": "你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。"},
    {"role": "user", "content":"你好,请帮我查询一下无锡和南京的今天天气如何?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]

As you can see, the model is smart enough to split this into two function calls.

'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡和南京的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户让我查询无锡和南京今天的天气。首先,我需要确认用户提到的城市是否正确,以及是否需要转换为英文名称。根据工具说明,中国的城市需要用英文名称代替。无锡的英文是Wuxi,南京是Nanjing。然后,我需要调用get_weather函数,但每次只能查询一个城市。所以可能需要分两次调用。不过用户同时提到了两个城市,可能需要分别处理。先检查是否有函数支持多个城市,但根据工具描述,一次只能查询一个城市。因此,我应该分别调用两次get_weather函数,一次 for 无锡,一次 for 南京。确保参数正确,然后返回结果。现在先处理第一个城市无锡。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call>\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Nanjing"}}\n</tool_call><|im_end|>'

But clearly this is only an intermediate reply; what the user needs is the final weather answer as a string, so further processing is required.

Suppose the content below is the tool_call portion that the developer has extracted from the model output, in JSON form; it needs to be appended to messages. (Note that this hand-written example uses a "location" key and Chinese city names, unlike the schema's "loc"; the chat template serializes whatever it is given, as the later output shows.)
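The extraction code itself is not shown in the notes; a minimal sketch (a hypothetical helper, operating on the decoded output string from above) might look like this:

import json
import re

def extract_tool_calls(generated_text):
    # Collect every JSON object wrapped in <tool_call>...</tool_call> tags.
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(generated_text)]

tool_calls = extract_tool_calls(response[0])
# -> [{'name': 'get_weather', 'arguments': {'loc': 'Wuxi'}},
#     {'name': 'get_weather', 'arguments': {'loc': 'Nanjing'}}]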
 

messages.append(
    {
        "role": "assistant",
        "content": "<think>\n我将调用 get_weather 函数来检查天气。\n</think>\n",
        "tool_calls": [
            {"name": "get_weather", 
             "arguments": {
                 "location": "无锡"
             }
            },
             {
                 "name": "get_weather",
                 "arguments": {
                     "location": "南京"
                 }
             }
        ]
    }
)

The output of the function call is likewise appended to messages in JSON form. To be clear, all of this has to be implemented by external code; here we simply pretend the results came back and use them directly.

messages.append(
    {
        "role": "tool",
        "content": json.dumps({
            "location": "无锡",
            "weather": "晴,最高气温26摄氏度"
        })
    }
)

messages.append(
    {
        "role": "tool",
        "content": json.dumps({
            "location": "南京",
            "weather": "多云转小雨,最高气温23摄氏度"
        })
    }
)

Feeding this back into the model then yields the following. (The duplicated tool_response blocks in the recorded output suggest the tool messages were appended twice before this run.)
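The generation call is not repeated in the original notes; presumably it follows the same pattern as before, re-applying the chat template to the full message history (a sketch):

text = tokenizer.apply_chat_template(
    messages,               # system + user + assistant tool_calls + tool responses
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]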
 

'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡和南京的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n我将调用 get_weather 函数来检查天气。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "无锡"}}\n</tool_call>\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "南京"}}\n</tool_call><|im_end|>\n<|im_start|>user\n<tool_response>\n{"location": "\\u65e0\\u9521", "weather": "\\u6674\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2926\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u5357\\u4eac", "weather": "\\u591a\\u4e91\\u8f6c\\u5c0f\\u96e8\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2923\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u65e0\\u9521", "weather": "\\u6674\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2926\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u5357\\u4eac", "weather": "\\u591a\\u4e91\\u8f6c\\u5c0f\\u96e8\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2923\\u6444\\u6c0f\\u5ea6"}\n</tool_response><|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户之前请求查询无锡和南京的天气,我分别调用了两次get_weather函数。现在需要将结果整理成自然的中文回复。首先,确认无锡的天气是晴天,最高温度26摄氏度,南京是多云转小雨,最高温度23摄氏度。需要注意的是,用户可能希望知道两地的天气差异,是否需要带伞,或者穿衣建议。要确保信息清晰,用口语化的表达,避免使用专业术语。同时,检查是否有错别字,确保城市名称正确。最后,以友好的语气结束,询问是否需要进一步帮助。\n</think>\n\n无锡今天是晴天,最高气温26摄氏度,天气晴朗宜人。南京今天多云转小雨,最高气温23摄氏度,建议带伞出门。需要我帮您查询其他城市的天气吗?<|im_end|>'

An important takeaway at this point: these model outputs are exactly the format the training samples must have when training the model.

Preparing the fine-tuning datasets

This fine-tune uses two datasets: OpenMathReasoning-mini and FineTome-100k.

The former covers reasoning through and solving math problems; the latter provides ordinary conversation.

You might wonder: if the goal is fine-tuning for math, what is the ordinary-conversation dataset for?
Answer: 1. It prevents catastrophic forgetting after fine-tuning. 2. It preserves Qwen3's hybrid reasoning ability, i.e. the model keeps both its reasoning and non-reasoning modes.
 

Download the datasets with the datasets library
 

from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")  # "cot" is the Chain-of-Thought split
reasoning_dataset
Dataset({
    features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
    num_rows: 19252
})
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
non_reasoning_dataset
Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

The math dataset has nearly 20k rows while the chat dataset has 100k, so the two are badly unbalanced; next we clean and rebalance the data.

Dataset cleaning

Add a conversations field
 

def generate_conversation(examples):
    problems = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ])
    return {"conversations": conversations,}
reasoning_data = reasoning_dataset.map(generate_conversation, batched=True)  # adds a conversations field
reasoning_data["conversations"][0]
[{'content': 'Given $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.',
  'role': 'user'},
 {'content': "<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>To solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n   \\[\n   \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n   \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n   \\[\n   \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n   \\]\n\n3. Square both sides to eliminate the square root on the left:\n   \\[\n   (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n   \\]\n   Simplifying both sides, we get:\n   \\[\n   x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n   \\]\n\n4. Combine like terms on the right side:\n   \\[\n   x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n   \\]\n   Simplifying further:\n   \\[\n   x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n5. Subtract \\(x^2\\) from both sides:\n   \\[\n   165 = -3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n   \\[\n   168 = 14\\sqrt{x^2 - 52}\n   \\]\n\n7. Divide both sides by 14:\n   \\[\n   12 = \\sqrt{x^2 - 52}\n   \\]\n\n8. Square both sides again to eliminate the square root:\n   \\[\n   12^2 = x^2 - 52\n   \\]\n   Simplifying:\n   \\[\n   144 = x^2 - 52\n   \\]\n\n9. 
Add 52 to both sides to solve for \\(x^2\\):\n   \\[\n   196 = x^2\n   \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n    \\[\n    x = \\sqrt{196} = 14\n    \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n    \\[\n    \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n    \\]\n    The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]",
  'role': 'assistant'}]

Apply the chat template to serialize each conversation
 

reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_data["conversations"],
    tokenize = False
)
reasoning_conversations[0]
"<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>\n\nTo solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n   \\[\n   \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n   \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n   \\[\n   \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n   \\]\n\n3. Square both sides to eliminate the square root on the left:\n   \\[\n   (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n   \\]\n   Simplifying both sides, we get:\n   \\[\n   x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n   \\]\n\n4. Combine like terms on the right side:\n   \\[\n   x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n   \\]\n   Simplifying further:\n   \\[\n   x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n5. Subtract \\(x^2\\) from both sides:\n   \\[\n   165 = -3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n   \\[\n   168 = 14\\sqrt{x^2 - 52}\n   \\]\n\n7. Divide both sides by 14:\n   \\[\n   12 = \\sqrt{x^2 - 52}\n   \\]\n\n8. 
Square both sides again to eliminate the square root:\n   \\[\n   12^2 = x^2 - 52\n   \\]\n   Simplifying:\n   \\[\n   144 = x^2 - 52\n   \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n   \\[\n   196 = x^2\n   \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n    \\[\n    x = \\sqrt{196} = 14\n    \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n    \\[\n    \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n    \\]\n    The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]<|im_end|>\n"

Convert the other dataset to the same format as well
 

from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)
non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize=False,
)

Subsample the data so that non-reasoning samples amount to only 25% of the reasoning-sample count, i.e. int(19252 × 0.25) = 4813 rows.

chat_percentage = 0.75   # note: despite the name, the chat subset below is sized at (1 - chat_percentage) = 25% of the reasoning count
import pandas as pd
non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations) * (1.0 - chat_percentage)),   # int(19252 * 0.25) = 4813 rows
    random_state = 2407,
)

Next, concatenate the two datasets and shuffle them.
 

data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed=3407)

LoRA fine-tuning

LoRA configuration: attach LoRA adapters to the Qwen3 model, i.e. add low-rank adapters to the modules selected for fine-tuning.

model = FastLanguageModel.get_peft_model(
    model,
    r=32,   # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # scaling coefficient
    lora_dropout = 0,
    bias = "none",  # how bias terms are handled
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Configure the fine-tuning trainer and set the training hyperparameters
 

from trl import SFTTrainer, SFTConfig   # SFTTrainer: supervised fine-tuning trainer; SFTConfig: its hyperparameter config
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset= None,
    args = SFTConfig(
        dataset_text_field = "text",  # which dataset field holds the training text
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,   # accumulate gradients over 4 mini-batches before each parameter update
        warmup_steps = 5,       # learning-rate warmup steps
        # num_train_epochs = 1, # lower priority than max_steps, so usually only one of the two is set; typically run a few max_steps first to confirm training works and the loss is decreasing
        max_steps = 30,    # maximum number of parameter updates
        learning_rate = 2e-4,
        logging_steps = 1,    # log once per parameter update
        optim = "adamw_8bit",   # 8-bit quantized optimizer
        weight_decay = 0.01,   # weight-decay rate
        lr_scheduler_type = "linear",   # linear learning-rate decay
        seed = 3407,
        report_to = "wandb",   # use wandb to track the training run
    ),
)

Now start training. It is currently set to stop after 30 parameter updates; the goal is to observe whether the loss decreases normally. Decreasing with some fluctuation is fine, since this is mini-batch gradient descent. If nothing looks wrong, switch to epoch-based training for the real run.
 

trainer_stats = trainer.train()

The training loss is printed during training; if you configured wandb, its web page shows more intuitive charts.

Step Training Loss
1 0.518500
2 0.621800
3 0.605500
4 0.587000
5 0.520100
6 0.498500
7 0.486000
8 0.465000
9 0.420200

The fine-tuned model

After fine-tuning, the model variable holds the adapter-wrapped model; you can print its structure:
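The print itself mirrors the earlier cell (evaluating the variable in a notebook):

model   # shows the PeftModelForCausalLM wrapper around the original Qwen3ForCausalLM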
 

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen3ForCausalLM(
      (model): Qwen3Model(
        (embed_tokens): Embedding(151936, 4096, padding_idx=151654)
        (layers): ModuleList(
          (0-2): 3 x Qwen3DecoderLayer(
            (self_attn): Qwen3Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )

The above shows only a small part of the structure. Let's analyze what q_proj looks like after fine-tuning, component by component (a quick way to inspect the trainable parameters follows this list):

(1) base_layer: Linear4bit(in_features=4096, out_features=4096, bias=False)

  • The original linear layer (the part that is not fine-tuned), stored with 4-bit quantization to reduce memory usage.
  • Parameters: in_features=4096 (input feature dimension), out_features=4096 (output feature dimension), bias=False (no bias term).
  • Role: a submodule of one Transformer layer (here q_proj, the query projection used to compute the query vectors in multi-head attention). It holds the original pretrained weights, which normally stay frozen during fine-tuning.
  • 4-bit quantization compresses the weights to 4 bits, cutting memory and compute so large models fit in resource-constrained environments.

(2) lora_dropout: ModuleDict((default): Identity())

  • The dropout layer LoRA uses for regularization. Identity() means no dropout is applied (equivalent to a dropout probability of 0); the input passes through untouched.
  • LoRA allows dropout per layer, but the configuration here disables it, whether for simplicity or to avoid regularization.

(3) lora_A: ModuleDict((default): Linear(in_features=4096, out_features=32, bias=False))

  • The first low-rank matrix in the LoRA decomposition (the A part).
  • in_features=4096 matches the base_layer input dimension; out_features=32 is the LoRA rank, a hyperparameter controlling the adapter's parameter count (the smaller the rank, the fewer parameters and the cheaper the fine-tune); bias=False.
  • lora_A is a trainable matrix that projects the input into a low-dimensional (rank-32) space. This is LoRA's core idea: approximate the weight update with a low-rank factorization to shrink the number of trainable parameters.

(4) lora_B: ModuleDict((default): Linear(in_features=32, out_features=4096, bias=False))

  • The second low-rank matrix (the B part).
  • in_features=32 matches lora_A's output dimension (the rank); out_features=4096 matches the base_layer output dimension; bias=False.
  • lora_B maps the low-dimensional representation back to the original output dimension; together with lora_A it forms LoRA's low-rank update matrix.

(5) lora_embedding_A: ParameterDict()

  • A parameter dict that would hold LoRA A matrices for embedding layers, where applicable. It is empty here because q_proj is a linear layer inside the attention block, not an embedding layer.

(6) lora_embedding_B: ParameterDict()

  • The counterpart holding embedding-layer B matrices; empty for the same reason.

(7) lora_magnitude_vector: ModuleDict()

  • An optional LoRA component that can store extra vectors (such as scaling factors or magnitude vectors) to adjust the LoRA output. It is empty here, meaning no such mechanism is in use.
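As a quick sanity check (not in the original notes), PEFT-wrapped models expose a helper for counting trainable parameters; the manual loop below is equivalent:

# Report how many parameters LoRA actually trains.
model.print_trainable_parameters()

# Equivalent manual count:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")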

How sample features are processed by the fine-tuned q_proj layer

At inference time the computation proceeds step by step as follows (a runnable sketch follows this list):

  1. Input:
    • The input $x$ is a tensor of shape $[\text{batch\_size}, \text{seq\_length}, 4096]$, typically the hidden state at some layer of the Transformer (for example, the previous layer's output).

  2. Base linear layer:
    • First, $x$ passes through base_layer (the original weight $W$) as a linear map:
      $$y_{\text{base}} = W \cdot x$$
    • Because base_layer is 4-bit quantized, the weights are first dequantized back to floating point (handled automatically by underlying libraries such as bitsandbytes) before the matrix multiplication runs.
    • The output $y_{\text{base}}$ has shape $[\text{batch\_size}, \text{seq\_length}, 4096]$.

  3. LoRA low-rank update:
    • The LoRA update corresponds to $\Delta W = A \cdot B$, computed in two steps:
      • $x$ is projected into the low-rank space by lora_A: $z = x \cdot A$, where $A$ has shape $[4096, 32]$, so $z$ has shape $[\text{batch\_size}, \text{seq\_length}, 32]$.
      • $z$ is projected back to the output dimension by lora_B: $y_{\text{lora}} = z \cdot B = (x \cdot A) \cdot B$, where $B$ has shape $[32, 4096]$, giving $y_{\text{lora}}$ the shape $[\text{batch\_size}, \text{seq\_length}, 4096]$.
    • Mathematically $y_{\text{lora}} = x \cdot (A \cdot B)$, where $A \cdot B$ is equivalent to a full $[4096, 4096]$ matrix; computing in two steps (first $x \cdot A$, then $z \cdot B$) avoids ever forming that matrix and greatly reduces the computation.

  4. Combining the outputs:
    • The final output is the sum of the base output and the LoRA update:
      $$y = y_{\text{base}} + y_{\text{lora}} = W \cdot x + (x \cdot A) \cdot B$$
    • (Strictly speaking, the LoRA branch is scaled by $\alpha / r$; with lora_alpha = 32 and r = 32 as configured above, the factor is 1 and can be omitted.)
    • The output $y$ keeps the shape $[\text{batch\_size}, \text{seq\_length}, 4096]$ and feeds directly into the attention mechanism as the query vectors.

  5. Dropout (no effect):
    • Since lora_dropout is Identity(), no dropout is applied at inference; the input passes straight through.
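To make the arithmetic concrete, here is a minimal PyTorch sketch of the forward pass just described. It is an illustration only, not unsloth's fused implementation; a plain Linear stands in for the quantized Linear4bit base layer, and the sizes follow the q_proj printout above.

import torch
import torch.nn as nn

d_model, rank, alpha = 4096, 32, 32

base = nn.Linear(d_model, d_model, bias=False)   # stands in for the frozen 4-bit base weight W
lora_A = nn.Linear(d_model, rank, bias=False)    # projects input into the rank-32 subspace
lora_B = nn.Linear(rank, d_model, bias=False)    # projects back to the model dimension
nn.init.zeros_(lora_B.weight)                    # B starts at zero, so the initial LoRA update is zero

def lora_forward(x):
    # y = W·x + (alpha/r) · B(A(x)); with alpha = r = 32 the scaling factor is 1
    scaling = alpha / rank
    return base(x) + scaling * lora_B(lora_A(x))

x = torch.randn(2, 10, d_model)   # [batch_size, seq_length, 4096]
y = lora_forward(x)               # same shape: [2, 10, 4096]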

Testing the model

messages = [
    {"role": "user", "content": "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = False, 
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),   # ** unpacks the dict returned by the tokenizer into keyword arguments
    max_new_tokens = 256,
    temperature = 0.7,
    top_p = 0.8,
    top_k = 20,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
    use_cache=False,
)
To solve the equation \((x + 2)^2 = 0\), we start by taking the square root of both sides. 

\[
(x + 2)^2 = 0
\]

Taking the square root of both sides, we get:

\[
x + 2 = 0
\]

Next, we solve for \(x\) by subtracting 2 from both sides:

\[
x = -2
\]

Thus, the solution to the equation is \(x = -2\).<|im_end|>

Large-scale fine-tuning

Setting num_train_epochs to 1 trains over the entire dataset once; afterwards, save the fine-tuned model weights. A sketch of this final run follows.
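This is a sketch under stated assumptions: it reuses the trainer configuration from above with num_train_epochs = 1 in place of max_steps, and the output directory name is a hypothetical choice.

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,   # one full pass over the combined dataset
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb",
    ),
)
trainer_stats = trainer.train()

# Save only the LoRA adapter weights plus the tokenizer.
model.save_pretrained("qwen3-8b-math-lora")       # hypothetical output directory
tokenizer.save_pretrained("qwen3-8b-math-lora")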