Overview
The fine-tuning library used here is unsloth. It supports only single-GPU fine-tuning, which makes it best suited to individuals or small teams.
The base model is unsloth/Qwen3-8B-unsloth-bnb-4bit, a 4-bit dynamically quantized build, downloaded from the ModelScope community.
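For reference, a minimal download sketch using the modelscope library; the model ID matches the one used below, but the cache directory is an assumption, so adjust paths to taste:
from modelscope import snapshot_download
# Download the 4-bit quantized Qwen3-8B from ModelScope and print the local path
model_dir = snapshot_download("unsloth/Qwen3-8B-unsloth-bnb-4bit", cache_dir="./")
print(model_dir)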
Fine-tuning uses two datasets: OpenMathReasoning-mini and FineTome-100k.
Code Demo: Calling the Model with the unsloth Framework
Import the libraries
from unsloth import FastLanguageModel
import torch
Load the model
max_seq_length = 8192  # maximum context length for this run
dtype = None  # None lets unsloth auto-detect the dtype (bfloat16 on supported GPUs)
load_in_4bit = True  # load the weights in 4-bit to save VRAM
# Load the Qwen3 model and tokenizer from the current directory
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./Qwen3-8B-unsloth-bnb-4bit",
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=load_in_4bit,
)
model
Output:
==((====))== Unsloth 2025.6.8: Fast Qwen3 patching. Transformers: 4.53.0.
\\ /| NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.57 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.13it/s]
Qwen3ForCausalLM(
(model): Qwen3Model(
(embed_tokens): Embedding(151936, 4096, padding_idx=151654)
(layers): ModuleList(
(0-2): 3 x Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(3): Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(4): Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(5): Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(6): Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(7-33): 27 x Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(34): Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
(35): Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
(o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
(q_norm): Qwen3RMSNorm((128,), eps=1e-06)
(k_norm): Qwen3RMSNorm((128,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): Qwen3MLP(
(gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
(down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
(post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
)
)
(norm): Qwen3RMSNorm((4096,), eps=1e-06)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)
Chatting with the model
messages = [
{"role": "user", "content": "你好,好久不见!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # return the formatted string instead of token IDs
    add_generation_prompt=True,  # append the generation prompt (the assistant header)
    enable_thinking=False,  # disable thinking mode
)
text
Output. This shows the data format right before it enters the model: the full text, special tokens included.
'<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
Tokenize
inputs = tokenizer(text, return_tensors="pt").to("cuda")
inputs
{'input_ids': tensor([[151644, 872, 198, 108386, 3837, 111920, 101571, 6313, 151645,
198, 151644, 77091, 198, 151667, 271, 151668, 271]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
Run model inference
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=max_seq_length,
use_cache=True,
)
outputs
tensor([[151644, 872, 198, 108386, 3837, 111920, 101571, 6313, 151645,
198, 151644, 77091, 198, 151667, 271, 151668, 271, 108386,
6313, 111920, 101571, 6313, 104284, 113097, 56568, 101036, 1773,
104044, 108178, 104472, 11319, 104139, 104838, 29826, 99172, 33108,
35946, 93149, 101037, 11319, 144236, 151645]], device='cuda:0')
Decode the IDs back into natural language
response = tokenizer.batch_decode(outputs)
response
['<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n你好!好久不见!我也很想你呢。最近过得怎么样?有什么新鲜事想和我分享吗?😊<|im_end|>']
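The decode above still contains the prompt. If you only want the assistant's reply, one common approach (a sketch, not part of the original code) is to slice off the prompt tokens before decoding and drop the special tokens:
# Keep only the newly generated tokens, then strip special tokens such as <|im_end|>
new_tokens = outputs[:, inputs.input_ids.shape[1]:]
reply = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(reply)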
To enable the thinking process, just set enable_thinking to True:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # return the formatted string instead of token IDs
    add_generation_prompt=True,  # append the generation prompt (the assistant header)
    enable_thinking=True,  # enable thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=max_seq_length,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
'<|im_start|>user\n你好,好久不见!<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户发来“你好,好久不见!”,看起来像是在打招呼,但可能有更深层的意图。首先,我需要确认用户是否在测试我的反应,或者是否有其他需求。用户可能希望继续之前的对话,但之前没有记录,所以可能需要进一步询问。另外,用户可能想表达某种情感,比如怀念或者测试我的回应能力。我应该保持友好,同时引导用户说明具体需求,以便提供更有针对性的帮助。需要避免假设太多,保持开放和灵活的态度。接下来,我应该用轻松的语气回应,同时邀请用户分享更多信息,这样可以促进更深入的交流。\n</think>\n\n啊,好久不见!最近过得怎么样呀?😊 有什么新鲜事想和我分享吗?或者需要我帮忙解决什么问题?<|im_end|>'
A template that carries a system prompt
messages = [
{"role": "system", "content": "你是一名助人为乐的助手,名字叫小胡。"},
{"role": "user", "content":"你好,请问你叫什么名字?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # return the formatted string instead of token IDs
    add_generation_prompt=True,  # append the generation prompt (the assistant header)
    enable_thinking=True,  # enable thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=max_seq_length,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
'<|im_start|>system\n你是一名助人为乐的助手,名字叫小胡。<|im_end|>\n<|im_start|>user\n你好,请问你叫什么名字?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户问我的名字,我应该回答小胡。我需要保持友好和乐于助人的态度。可以加点表情符号让对话更生动。比如用✨或者😊这样的符号。然后可以主动询问用户需要什么帮助,这样能促进进一步的交流。保持口语化,不用太正式。比如用"有什么我可以帮你的吗?"这样的问句。这样既亲切又自然。同时要注意不要用复杂的句子结构,保持简单明了。这样用户会觉得容易理解和亲近。最后,确保回答符合角色设定,体现出助人为乐的特点。比如用"随时为你服务!"这样的结尾。这样整个回答就既符合要求,又显得亲切友好。\n</think>\n\n你好呀!我叫小胡,是你的AI助手,有什么我可以帮你的吗?✨😊<|im_end|>'
Function calling with an LLM, using weather lookup as an example
Write the weather-lookup function
import requests
import json

def get_weather(loc):
    # Query OpenWeatherMap for the current weather in the given city
    url = "http://api.openweathermap.org/data/2.5/weather"
    params = {
        "q": loc,
        "appid": "your_api_key",  # replace with your OpenWeatherMap API key
        "units": "metric",  # use Celsius
        "lang": "zh_cn"
    }
    response = requests.get(url, params=params)
    data = response.json()
    return json.dumps(data)
Define the tool list, described as a JSON Schema
tools = [
{
"type": "function",
"function": {
'name': 'get_weather',
'description': '查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。',
'parameters': {
'type': 'object',
'properties': {
'loc': {
'description': "城市名称,注意,中国的城市需要用对应城市的英文名称代替。",
'type': 'string'
}
},
'required': ['loc']
}
}
}
]
messages = [
{"role": "system", "content": "你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。"},
{"role": "user", "content":"你好,请帮我查询一下无锡的今天天气如何?"}
]
text = tokenizer.apply_chat_template(
messages,
tools = tools,
tokenize = False,
add_generation_prompt = True,
enable_thinking = True,
)
text
As you can see, the contents of tools are injected into the context as well.
'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡的今天天气如何?<|im_end|>\n<|im_start|>assistant\n'
Generate the model's response
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=max_seq_length,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
You can see it successfully produced a function call message:
\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call>
'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户让我查询无锡的天气。首先,我需要确认用户的需求是今天的天气情况。然后,根据提供的工具,我需要调用get_weather函数。注意,工具里提到中国的城市要用英文名称,所以无锡对应的英文是Wuxi。接下来,构造参数,参数是loc,值应该是"Wuxi"。然后,确保只调用一个城市,没有其他参数。最后,生成正确的tool_call结构,把函数名和参数放进去。检查一遍有没有错误,比如拼写或者参数名是否正确。确认无误后,就可以返回结果了。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call><|im_end|>'
So what does the model output when a single turn calls for several tool invocations, or for different tools at once? Here we ask it to check the weather in two cities at the same time.
messages = [
{"role": "system", "content": "你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。"},
{"role": "user", "content":"你好,请帮我查询一下无锡和南京的今天天气如何?"}
]
text = tokenizer.apply_chat_template(
messages,
tools = tools,
tokenize = False,
add_generation_prompt = True,
enable_thinking = True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=max_seq_length,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
response[0]
As you can see, the model is smart enough to split the job into two function calls:
'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡和南京的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户让我查询无锡和南京今天的天气。首先,我需要确认用户提到的城市是否正确,以及是否需要转换为英文名称。根据工具说明,中国的城市需要用英文名称代替。无锡的英文是Wuxi,南京是Nanjing。然后,我需要调用get_weather函数,但每次只能查询一个城市。所以可能需要分两次调用。不过用户同时提到了两个城市,可能需要分别处理。先检查是否有函数支持多个城市,但根据工具描述,一次只能查询一个城市。因此,我应该分别调用两次get_weather函数,一次 for 无锡,一次 for 南京。确保参数正确,然后返回结果。现在先处理第一个城市无锡。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Wuxi"}}\n</tool_call>\n<tool_call>\n{"name": "get_weather", "arguments": {"loc": "Nanjing"}}\n</tool_call><|im_end|>'
But clearly this is only an intermediate reply; what the user ultimately wants is the final weather report as a string, so further processing is needed.
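How those tool calls get extracted is left to external code; a minimal regex-based sketch (the helper name and pattern are my own, not from the original) could look like this:
import re
import json

def extract_tool_calls(generated_text):
    # Pull every <tool_call>...</tool_call> JSON object out of the model output
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, generated_text, re.DOTALL)]

# extract_tool_calls(response[0]) would give, e.g.:
# [{'name': 'get_weather', 'arguments': {'loc': 'Wuxi'}},
#  {'name': 'get_weather', 'arguments': {'loc': 'Nanjing'}}]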
Suppose the following is the tool_call content extracted from the model output by the developer as JSON; it now needs to be appended to messages.
messages.append(
{
"role": "assistant",
"content": "<think>\n我将调用 get_weather 函数来检查天气。\n</think>\n",
"tool_calls": [
{"name": "get_weather",
"arguments": {
"location": "无锡"
}
},
{
"name": "get_weather",
"arguments": {
"location": "南京"
}
}
]
}
)
The results of calling the functions are likewise appended to messages as JSON. To be clear, all of this glue has to be implemented by external code; here we simply hard-code the results for demonstration.
messages.append(
{
"role": "tool",
"content": json.dumps({
"location": "无锡",
"weather": "晴,最高气温26摄氏度"
})
}
)
messages.append(
{
"role": "tool",
"content": json.dumps({
"location": "南京",
"weather": "多云转小雨,最高气温23摄氏度"
})
}
)
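With the tool results appended, the prompt is rebuilt and generation runs again, exactly as before; for completeness, a sketch of that step:
# Re-apply the chat template over the extended message history and generate the final answer
text = tokenizer.apply_chat_template(
    messages,
    tools = tools,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = True,
)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)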
The resulting output is:
'<|im_start|>system\n你是一名助人为乐天气查询助手,当用户询问天气时,请调用get_weather函数进行天气查询。\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "get_weather", "description": "查询即时天气函数,根据输入的城市名称,查询对应城市的实时天气 ,一次只能查询一个城市。", "parameters": {"type": "object", "properties": {"loc": {"description": "城市名称,注意,中国的城市需要用对应城市的英文名称代替。", "type": "string"}}, "required": ["loc"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n你好,请帮我查询一下无锡和南京的今天天气如何?<|im_end|>\n<|im_start|>assistant\n<think>\n我将调用 get_weather 函数来检查天气。\n</think>\n\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "无锡"}}\n</tool_call>\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "南京"}}\n</tool_call><|im_end|>\n<|im_start|>user\n<tool_response>\n{"location": "\\u65e0\\u9521", "weather": "\\u6674\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2926\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u5357\\u4eac", "weather": "\\u591a\\u4e91\\u8f6c\\u5c0f\\u96e8\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2923\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u65e0\\u9521", "weather": "\\u6674\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2926\\u6444\\u6c0f\\u5ea6"}\n</tool_response>\n<tool_response>\n{"location": "\\u5357\\u4eac", "weather": "\\u591a\\u4e91\\u8f6c\\u5c0f\\u96e8\\uff0c\\u6700\\u9ad8\\u6c14\\u6e2923\\u6444\\u6c0f\\u5ea6"}\n</tool_response><|im_end|>\n<|im_start|>assistant\n<think>\n好的,用户之前请求查询无锡和南京的天气,我分别调用了两次get_weather函数。现在需要将结果整理成自然的中文回复。首先,确认无锡的天气是晴天,最高温度26摄氏度,南京是多云转小雨,最高温度23摄氏度。需要注意的是,用户可能希望知道两地的天气差异,是否需要带伞,或者穿衣建议。要确保信息清晰,用口语化的表达,避免使用专业术语。同时,检查是否有错别字,确保城市名称正确。最后,以友好的语气结束,询问是否需要进一步帮助。\n</think>\n\n无锡今天是晴天,最高气温26摄氏度,天气晴朗宜人。南京今天多云转小雨,最高气温23摄氏度,建议带伞出门。需要我帮您查询其他城市的天气吗?<|im_end|>'
Time for an important takeaway: these model outputs are exactly the format that training samples must follow when training the model.
Preparing the Fine-Tuning Datasets
Fine-tuning uses two datasets: OpenMathReasoning-mini and FineTome-100k.
The former handles mathematical reasoning and problem solving; the latter covers ordinary conversation.
You might ask: if the goal is fine-tuning for math, what is the ordinary-conversation dataset for?
Answer: 1. It prevents catastrophic forgetting after fine-tuning. 2. It preserves Qwen3's hybrid reasoning, i.e. the model keeps both its reasoning and non-reasoning abilities.
Download the datasets with the datasets library
from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot") # "cot" is the chain-of-thought split
reasoning_dataset
Dataset({
features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
num_rows: 19252
})
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
non_reasoning_dataset
Dataset({
features: ['conversations', 'source', 'score'],
num_rows: 100000
})
As you can see, the math dataset has nearly 20,000 rows while the chat dataset has 100,000, so the two are badly unbalanced; next we clean and rebalance the data.
Dataset cleaning
Add a conversations field
def generate_conversation(examples):
    problems = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ])
    return {"conversations": conversations,}
reasoning_data = reasoning_dataset.map(generate_conversation, batched=True)  # adds a conversations field
reasoning_data["conversations"][0]
[{'content': 'Given $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.',
'role': 'user'},
{'content': "<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>To solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n \\[\n \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n \\[\n \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n \\]\n\n3. Square both sides to eliminate the square root on the left:\n \\[\n (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n \\]\n Simplifying both sides, we get:\n \\[\n x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n \\]\n\n4. Combine like terms on the right side:\n \\[\n x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n \\]\n Simplifying further:\n \\[\n x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n \\]\n\n5. Subtract \\(x^2\\) from both sides:\n \\[\n 165 = -3 + 14\\sqrt{x^2 - 52}\n \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n \\[\n 168 = 14\\sqrt{x^2 - 52}\n \\]\n\n7. Divide both sides by 14:\n \\[\n 12 = \\sqrt{x^2 - 52}\n \\]\n\n8. Square both sides again to eliminate the square root:\n \\[\n 12^2 = x^2 - 52\n \\]\n Simplifying:\n \\[\n 144 = x^2 - 52\n \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n \\[\n 196 = x^2\n \\]\n\n10. 
Take the positive square root (since \\(x\\) is positive):\n \\[\n x = \\sqrt{196} = 14\n \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n \\[\n \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n \\]\n The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]",
'role': 'assistant'}]
Pass the conversations through the tokenizer's chat template
reasoning_conversations = tokenizer.apply_chat_template(
reasoning_data["conversations"],
tokenize = False
)
reasoning_conversations[0]
"<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>\n\nTo solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n \\[\n \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n \\[\n \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n \\]\n\n3. Square both sides to eliminate the square root on the left:\n \\[\n (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n \\]\n Simplifying both sides, we get:\n \\[\n x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n \\]\n\n4. Combine like terms on the right side:\n \\[\n x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n \\]\n Simplifying further:\n \\[\n x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n \\]\n\n5. Subtract \\(x^2\\) from both sides:\n \\[\n 165 = -3 + 14\\sqrt{x^2 - 52}\n \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n \\[\n 168 = 14\\sqrt{x^2 - 52}\n \\]\n\n7. Divide both sides by 14:\n \\[\n 12 = \\sqrt{x^2 - 52}\n \\]\n\n8. 
Square both sides again to eliminate the square root:\n \\[\n 12^2 = x^2 - 52\n \\]\n Simplifying:\n \\[\n 144 = x^2 - 52\n \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n \\[\n 196 = x^2\n \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n \\[\n x = \\sqrt{196} = 14\n \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n \\[\n \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n \\]\n The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]<|im_end|>\n"
Apply the same format conversion to the other dataset
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)
non_reasoning_conversations = tokenizer.apply_chat_template(
dataset["conversations"],
tokenize=False,
)
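FineTome-100k ships in ShareGPT style, where each turn looks like {"from": "human", "value": ...}; as far as I know, standardize_sharegpt rewrites those entries into the {"role": ..., "content": ...} form that apply_chat_template expects.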
Subsample the data so that non-reasoning samples amount to only 25% of the total reasoning-sample count
chat_percentage = 0.75
import pandas as pd
non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
int(len(reasoning_conversations) * (1.0 - chat_percentage)),
random_state = 2407,
)
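With 19,252 reasoning conversations, this draws int(19252 * (1.0 - 0.75)) = 4,813 chat samples, i.e. the non-reasoning data ends up at 25% of the reasoning count, or roughly 20% of the combined set.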
Next, concatenate the two datasets and shuffle them
data = pd.concat([
pd.Series(reasoning_conversations),
pd.Series(non_reasoning_subset)
])
data.name = "text"
from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed=3407)
LoRA Fine-Tuning
LoRA configuration: attach LoRA adapters to the Qwen3 model, i.e. add low-rank adapter matrices to the modules chosen for fine-tuning.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # scaling factor
    lora_dropout = 0,
    bias = "none",  # how bias terms are handled
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
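A quick way to confirm how small the adapters are relative to the base model (a sketch; print_trainable_parameters is the standard peft helper on the wrapped model):
# Report trainable vs. total parameter counts after attaching the LoRA adapters
model.print_trainable_parameters()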
Configure the fine-tuning trainer and set the training hyperparameters
from trl import SFTTrainer, SFTConfig  # SFTTrainer: supervised fine-tuning trainer; SFTConfig: its hyperparameter configuration
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",  # the field that holds the training text
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,  # accumulate gradients over 4 mini-batches before each parameter update
        warmup_steps = 5,  # learning-rate warmup steps
        # num_train_epochs = 1,  # lower priority than max_steps, so usually only one of the two is set; start with a small max_steps to verify training runs and the loss is falling
        max_steps = 30,  # maximum number of parameter updates
        learning_rate = 2e-4,
        logging_steps = 1,  # log after every parameter update
        optim = "adamw_8bit",  # 8-bit quantized optimizer
        weight_decay = 0.01,  # weight-decay rate
        lr_scheduler_type = "linear",  # linear learning-rate decay
        seed = 3407,
        report_to = "wandb",  # use the wandb library to log the training run
    ),
)
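Note that with per_device_train_batch_size = 2 and gradient_accumulation_steps = 4, the effective batch size per parameter update is 2 × 4 = 8 samples, so max_steps = 30 touches roughly 240 training samples.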
Now start training. The current setting stops after 30 parameter updates; the goal is to observe whether the loss decreases normally. Decreasing with some fluctuation is fine, since this is mini-batch gradient descent. If nothing looks abnormal, switch to epoch-based training for the real run.
trainer_stats = trainer.train()
The training loss is printed during the run; if you configured wandb, its web page gives a more intuitive chart.
| Step | Training Loss |
|---|---|
| 1 | 0.518500 |
| 2 | 0.621800 |
| 3 | 0.605500 |
| 4 | 0.587000 |
| 5 | 0.520100 |
| 6 | 0.498500 |
| 7 | 0.486000 |
| 8 | 0.465000 |
| 9 | 0.420200 |
The fine-tuned model
After fine-tuning, model is replaced by the adapted model; printing its structure gives:
PeftModelForCausalLM(
(base_model): LoraModel(
(model): Qwen3ForCausalLM(
(model): Qwen3Model(
(embed_tokens): Embedding(151936, 4096, padding_idx=151654)
(layers): ModuleList(
(0-2): 3 x Qwen3DecoderLayer(
(self_attn): Qwen3Attention(
(q_proj): lora.Linear4bit(
(base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
The above shows a small part of the structure; let's analyze the fine-tuned q_proj layer piece by piece:
(1) base_layer: Linear4bit(in_features=4096, out_features=4096, bias=False)
- This is the original linear layer (the part that is not fine-tuned), stored with 4-bit quantization to reduce memory usage.
- Parameters:
  - in_features=4096: the input feature dimension is 4096.
  - out_features=4096: the output feature dimension is 4096.
  - bias=False: the layer has no bias term.
- Role: this is a submodule of one Transformer layer (here q_proj, the query projection that computes the query vectors in multi-head attention). It holds the original pretrained weights, which normally stay frozen (not updated) during fine-tuning.
- 4-bit quantization: compressing the weights to 4 bits cuts memory and compute, making it practical to run large models in resource-constrained environments.
(2) lora_dropout: ModuleDict((default): Identity())
- The dropout layer LoRA uses for regularization.
- Identity() means no dropout is applied (equivalent to a dropout probability of 0); Identity is a pass-through that returns its input unchanged.
- LoRA allows a per-layer dropout setting, but the configuration here uses none, presumably for simplicity or to avoid regularization.
(3) lora_A: ModuleDict((default): Linear(in_features=4096, out_features=32, bias=False))
- The first low-rank matrix in LoRA (the A part of the low-rank decomposition).
- Parameters:
  - in_features=4096: matches the input dimension of base_layer.
  - out_features=32: the output dimension is 32, the LoRA rank. This hyperparameter controls the adapter's parameter count; a smaller rank means fewer parameters and cheaper fine-tuning.
  - bias=False: no bias term.
- Role: lora_A is a trainable matrix that maps the input into a low-dimensional space (rank 32). This is LoRA's core idea: approximate the weight update with low-rank matrices to shrink the number of trainable parameters.
(4) lora_B: ModuleDict((default): Linear(in_features=32, out_features=4096, bias=False))
- The second low-rank matrix in LoRA (the B part of the decomposition).
- Parameters:
  - in_features=32: matches lora_A's output dimension (the rank).
  - out_features=4096: matches the output dimension of base_layer.
  - bias=False: no bias term.
- Role: lora_B maps the low-dimensional representation back to the original output dimension; together with lora_A it forms LoRA's low-rank update matrix.
(5) lora_embedding_A: ParameterDict()
- A parameter dict that would hold LoRA A matrices for embedding layers, where applicable.
- It is empty here because q_proj is a linear layer inside the attention mechanism, not an embedding layer.
(6) lora_embedding_B: ParameterDict()
- The counterpart of lora_embedding_A for the B matrices; empty for the same reason.
(7) lora_magnitude_vector: ModuleDict()
- An optional LoRA component that stores extra vectors (such as scaling factors or magnitude vectors) used to adjust the LoRA output.
- It is empty here, so no such extra adjustment mechanism is in use.
How are sample features processed by the fine-tuned q_proj layer?
At inference time the computation proceeds as follows:
- Input:
  - The input \(x\) is a tensor of shape \([\text{batch\_size}, \text{seq\_length}, 4096]\), typically the hidden states of one layer of the Transformer (e.g. the previous layer's output).
- Base linear layer:
  - \(x\) first passes through base_layer (the original weights \(W\)): \(y_{\text{base}} = W \cdot x\).
  - Because base_layer is 4-bit quantized, the weights are first dequantized back to floating point (handled automatically by a low-level library such as bitsandbytes) before the matrix multiplication runs.
  - \(y_{\text{base}}\) has shape \([\text{batch\_size}, \text{seq\_length}, 4096]\).
- LoRA low-rank update:
  - The LoRA update corresponds to \(\Delta W = A \cdot B\), computed in two steps.
  - First, \(x\) is mapped into the low-rank space through lora_A: \(z = x \cdot A\), where \(A\) has shape \([4096, 32]\), so \(z\) has shape \([\text{batch\_size}, \text{seq\_length}, 32]\).
  - Then \(z\) is mapped back to the original output dimension through lora_B: \(y_{\text{lora}} = z \cdot B = (x \cdot A) \cdot B\), where \(B\) has shape \([32, 4096]\), so \(y_{\text{lora}}\) has shape \([\text{batch\_size}, \text{seq\_length}, 4096]\).
  - Mathematically \(y_{\text{lora}} = x \cdot (A \cdot B)\), where \(A \cdot B\) is equivalent to a single \([4096, 4096]\) matrix; computing \(x \cdot A\) first and then \(z \cdot B\) avoids ever forming that full matrix and saves compute.
- Merging the outputs:
  - The final output is the sum of the base output and the LoRA update: \(y = y_{\text{base}} + y_{\text{lora}} = W \cdot x + (x \cdot A) \cdot B\).
  - \(y\) still has shape \([\text{batch\_size}, \text{seq\_length}, 4096]\) and feeds directly into the attention mechanism as the query vectors.
- Dropout (no effect):
  - Since lora_dropout is Identity(), no dropout is applied at inference; the input passes straight through.
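A quick back-of-the-envelope check of the parameter savings: fully fine-tuning this q_proj would update \(4096 \times 4096 \approx 16.8\)M weights, whereas LoRA trains only \(4096 \times 32 + 32 \times 4096 = 262{,}144\) parameters, about 1.6% as many. And since lora_alpha = 32 equals the rank r = 32, the conventional LoRA scaling factor \(\alpha / r\) is 1, so the update is added unscaled.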
Testing the model
messages = [
{"role": "user", "content": "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
messages,
tokenize = False,
add_generation_prompt = True,
enable_thinking = False,
)
from transformers import TextStreamer
_ = model.generate(
**tokenizer(text, return_tensors = "pt").to("cuda"),  # ** unpacks the dict returned by the tokenizer into keyword arguments
max_new_tokens = 256,
temperature = 0.7,
top_p = 0.8,
top_k = 20,
streamer = TextStreamer(tokenizer, skip_prompt = True),
use_cache=False,
)
To solve the equation \((x + 2)^2 = 0\), we start by taking the square root of both sides.
\[
(x + 2)^2 = 0
\]
Taking the square root of both sides, we get:
\[
x + 2 = 0
\]
Next, we solve for \(x\) by subtracting 2 from both sides:
\[
x = -2
\]
Thus, the solution to the equation is \(x = -2\).<|im_end|>
Full-Scale Fine-Tuning
Set the epoch count to 1 to train over the entire dataset once, then save the fine-tuned model weights; a sketch follows.
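A minimal sketch of the epoch-based run and of saving the result, assuming the same trainer setup as above; the output directory names are placeholders, and save_pretrained_merged is unsloth's helper for exporting merged weights (check your unsloth version for the exact API):
# Re-create the trainer with epoch-based training instead of max_steps
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,  # one full pass over the data
        learning_rate = 2e-4,
        logging_steps = 50,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb",
    ),
)
trainer.train()

# Save only the LoRA adapter (small; must be loaded on top of the base model)
model.save_pretrained("qwen3-8b-lora")
tokenizer.save_pretrained("qwen3-8b-lora")

# Optionally merge the adapter into the base weights for standalone deployment
model.save_pretrained_merged("qwen3-8b-merged", tokenizer, save_method = "merged_16bit")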