LangChain Framework: Parsers Explained

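In LangChain, a parser converts a language model's raw output into structured data, which makes downstream processing and analysis simpler. This article covers the parser concepts, the most commonly used classes and methods, and worked examples to help you get started.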

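What Is a Parser?

In LangChain, a parser is the component that processes language model output. It can:

- Convert unstructured text into structured data (dictionaries, lists, instances of specific classes, and so on)
- Extract key information and format it
- Validate that the output has the expected format
- Handle parsing errors

A language model usually returns free-form natural-language text, whereas application code typically needs data in a specific format, such as JSON or an instance of a particular class. Parsers bridge that gap.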

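Common Parser Classes in LangChain

LangChain ships several parser classes for different scenarios. The most commonly used ones are described below.

1. `BaseOutputParser`

`BaseOutputParser` is the base class for output parsers and defines their basic interface. The other parser classes in this article inherit from it and implement its abstract method.

Class prototype (simplified):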

class BaseOutputParser(ABC):
    """Base class for output parsers."""

    @abstractmethod
    def parse(self, text: str) -> Any:
        """Parse the output of an LLM call.
        
        Args:
            text: Text to parse.
            
        Returns:
            Parsed output.
        """

    def get_format_instructions(self) -> str:
        """Instructions on how the LLM output should be formatted."""
        return ""
        
    async def aparse(self, text: str) -> Any:
        """Async parse the output of an LLM call.
        
        Args:
            text: Text to parse.
            
        Returns:
            Parsed output.
        """
        return self.parse(text)

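Core methods:

parse(text: str) -> Any
- Parameter: text (str), the text to parse, usually the language model's output
- Purpose: parses the input text into structured data of the target type
- Returns: the parsed structured data (the type depends on the concrete implementation)
- Note: abstract method; every subclass must implement it
get_format_instructions() -> str
- Parameters: none
- Purpose: returns format instructions that tell the language model how to shape its output so that it can be parsed
- Returns: the format instruction string
- Note: subclasses can override it to describe their own format
aparse(text: str) -> Any
- Parameter: text (str), the text to parse
- Purpose: asynchronous version of parse
- Returns: the parsed structured data
- Note: in the prototype above, the default implementation simply calls parse; subclasses can override it to provide a truly asynchronous implementation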
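To make the three-method interface concrete, here is a minimal stand-alone sketch in plain Python. It does not import LangChain; `SketchOutputParser` and `YesNoParser` are hypothetical names that only mirror the shape of the prototype above.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class SketchOutputParser(ABC):
    """Stand-in for the interface described above (not LangChain's class)."""

    @abstractmethod
    def parse(self, text: str) -> Any:
        """Turn raw model text into structured data."""

    def get_format_instructions(self) -> str:
        """Text to append to a prompt; empty by default."""
        return ""

    async def aparse(self, text: str) -> Any:
        """Async wrapper that simply delegates to parse()."""
        return self.parse(text)


class YesNoParser(SketchOutputParser):
    """Hypothetical subclass: maps a yes/no answer to a bool."""

    def parse(self, text: str) -> bool:
        cleaned = text.strip().lower().rstrip(".!")
        if cleaned in ("yes", "no"):
            return cleaned == "yes"
        raise ValueError(f"Expected 'yes' or 'no', got: {text!r}")

    def get_format_instructions(self) -> str:
        return "Answer with a single word: yes or no."


parser = YesNoParser()
print(parser.parse(" Yes. "))            # True
print(asyncio.run(parser.aparse("no")))  # False
```

The subclass only has to supply `parse`; the format instructions and the async wrapper are optional refinements.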

2. `SimpleJsonOutputParser`

`SimpleJsonOutputParser` parses a JSON string produced by the language model into a Python dictionary, list, or other JSON-compatible value.

Class prototype (simplified):

class SimpleJsonOutputParser(BaseOutputParser[Any]):
    """Parse the output of an LLM call to a JSON object."""

    def parse(self, text: str) -> Any:
        """Parse the output of an LLM call.
        
        Args:
            text: Text to parse.
            
        Returns:
            Parsed JSON object.
        """
        try:
            return json.loads(text)
        except json.JSONDecodeError as e:
            raise OutputParserException(f"Could not parse output: {text} as JSON. Error: {e}") from e

    def get_format_instructions(self) -> str:
        """Return a string describing the format of the output."""
        return "Return a JSON object."

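Methods:

parse(text: str) -> Any
- Parameter: text (str), a string containing JSON
- Purpose: parses the JSON string into a Python object (usually a dictionary or a list)
- Returns: the parsed Python object
- Raises: OutputParserException if the input is not valid JSON
get_format_instructions() -> str
- Parameters: none
- Purpose: returns an instruction that asks the language model to output JSON
- Returns: the string "Return a JSON object."

Example: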

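# Import paths vary between LangChain releases; these are the langchain_core locations.
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import SimpleJsonOutputParser

# Create the parser instance
parser = SimpleJsonOutputParser()

# Test 1: valid JSON output
llm_output1 = '{"name": "Alice", "age": 30, "hobbies": ["reading", "hiking"]}'
try:
    result1 = parser.parse(llm_output1)
    print("Test 1 result:")
    print(f"Type: {type(result1)}")
    print(f"Content: {result1}")
    print(f"Field access: Name: {result1['name']}, Age: {result1['age']}")
except OutputParserException as e:
    print(f"Test 1 error: {e}")

# Test 2: invalid JSON output (single-quoted keys and values are not valid JSON)
llm_output2 = "{'name': 'Bob', 'age': 25}"
try:
    result2 = parser.parse(llm_output2)
    print("\nTest 2 result:", result2)
except OutputParserException as e:
    print(f"\nTest 2 error: {e}")

Output (the error text follows the simplified prototype above; the exact wording depends on the installed version):

Test 1 result:
Type: <class 'dict'>
Content: {'name': 'Alice', 'age': 30, 'hobbies': ['reading', 'hiking']}
Field access: Name: Alice, Age: 30

Test 2 error: Could not parse output: {'name': 'Bob', 'age': 25} as JSON. Error: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Analysis:

- For a valid JSON string, SimpleJsonOutputParser returns a Python dictionary
- The fields of the result can then be accessed by key
- For invalid JSON, the parser raises OutputParserException with a descriptive message
- Some LangChain releases try to repair truncated JSON (for example, a missing closing brace) rather than reject it, so whether a given malformed string raises can depend on the installed version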
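Models often wrap JSON in a markdown code fence, which plain `json.loads` rejects. The sketch below uses only the standard library to show one way of stripping such a fence before parsing; `loads_fenced_json` is a hypothetical helper, not a LangChain function.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # built at runtime so this example does not nest code fences

# Matches an optional "json" label and captures whatever sits between two fences.
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def loads_fenced_json(text: str) -> Any:
    """Strip a markdown fence if one is present, then parse the payload as JSON."""
    match = _FENCED.search(text)
    payload = match.group(1) if match else text.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError as exc:
        # Keep the raw text in the message so the caller can log it or retry.
        raise ValueError(f"Not valid JSON: {text!r}") from exc


wrapped = f'{FENCE}json\n{{"name": "Alice", "age": 30}}\n{FENCE}'
print(loads_fenced_json(wrapped))      # {'name': 'Alice', 'age': 30}
print(loads_fenced_json("[1, 2, 3]"))  # [1, 2, 3]
```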

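3. `ResponseSchema` and `StructuredOutputParser`

For more complex structured output, use `StructuredOutputParser` together with `ResponseSchema`, which defines the output structure explicitly.

`ResponseSchema` declares each field's name, description, and type; `StructuredOutputParser` parses the output against those declarations.

`ResponseSchema` class prototype (simplified):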

class ResponseSchema(BaseModel):
    """Schema for a response from a structured output parser."""

    name: str
    """The name of the field."""
    description: str
    """The description of the field."""
    type: Optional[str] = None
    """The type of the field."""

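Parameters:

- name (str): the field name, which identifies the field in the output
- description (str): the field description, explaining what the field means and what it should contain
- type (Optional[str]): the field type; optional, names the data type of the field

`StructuredOutputParser` class prototype (simplified):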

class StructuredOutputParser(BaseOutputParser[Dict[str, Any]]):
    """Parse the output of an LLM call into a structured format."""

    response_schemas: List[ResponseSchema]
    """The schemas for the response."""
    json_parser: SimpleJsonOutputParser = Field(default_factory=SimpleJsonOutputParser)
    """The parser to use for parsing JSON."""

    @classmethod
    def from_response_schemas(cls, response_schemas: List[ResponseSchema]) -> "StructuredOutputParser":
        """Create a StructuredOutputParser from a list of ResponseSchemas."""
        return cls(response_schemas=response_schemas)

    def get_format_instructions(self) -> str:
        """Return a string describing the format of the output."""
        # Build the detailed format instructions
        ...

    def parse(self, text: str) -> Dict[str, Any]:
        """Parse the output of an LLM call.
        
        Args:
            text: Text to parse.
            
        Returns:
            Parsed output as a dictionary.
        """
        # Parse the text and return a dictionary
        ...

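Core methods:

from_response_schemas(response_schemas: List[ResponseSchema]) -> "StructuredOutputParser"
- Parameter: response_schemas, a list of ResponseSchema objects that defines the output structure
- Purpose: creates a StructuredOutputParser instance from the list of response schemas
- Returns: a StructuredOutputParser instance
get_format_instructions() -> str
- Parameters: none
- Purpose: generates detailed format instructions that guide the language model to produce the declared structure
- Returns: the format instruction string, containing a JSON-like description of the expected fields
parse(self, text: str) -> Dict[str, Any]
- Parameter: text (str), the language model's output text
- Purpose: parses the text and extracts the data that matches the declared structure
- Returns: a dictionary with the parsed data
- Raises: OutputParserException if parsing fails

Example: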

from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain_core.exceptions import OutputParserException

# 1. Define the response schemas
response_schemas = [
    ResponseSchema(
        name="answer",
        description="The answer to the question, in clear and concise language",
        type="string"
    ),
    ResponseSchema(
        name="source",
        description="The source or basis of the answer: a book, an article, or general knowledge",
        type="string"
    ),
    ResponseSchema(
        name="confidence",
        description="Confidence in the answer, an integer from 0 to 100",
        type="integer"
    )
]

# 2. Create the structured output parser
parser = StructuredOutputParser.from_response_schemas(response_schemas)

# 3. Get and print the format instructions
format_instructions = parser.get_format_instructions()
print("Format instructions:")
print(format_instructions)

# 4. Parse a valid output
valid_output = '''```json
{
    "answer": "LangChain is a framework for building applications powered by language models.",
    "source": "Public documentation and the official website",
    "confidence": 95
}
```'''

try:
    result_valid = parser.parse(valid_output)
    print("\nValid output parse result:")
    print(f"Type: {type(result_valid)}")
    print(f"Content: {result_valid}")
    print(f"Field access: Answer: {result_valid['answer']}, Confidence: {result_valid['confidence']}%")
except OutputParserException as e:
    print(f"Valid output parse error: {e}")

# 5. Parse an invalid output (the required `source` field is missing)
invalid_output = '''```json
{
    "answer": "LangChain is a Python framework.",
    "confidence": 90
}
```'''

try:
    result_invalid = parser.parse(invalid_output)
    print("\nInvalid output parse result:", result_invalid)
except OutputParserException as e:
    print(f"\nInvalid output parse error: {e}")

Output:

Format instructions:
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"answer": string  // The answer to the question, in clear and concise language
	"source": string  // The source or basis of the answer: a book, an article, or general knowledge
	"confidence": integer  // Confidence in the answer, an integer from 0 to 100
}
```

Valid output parse result:
Type: <class 'dict'>
Content: {'answer': 'LangChain is a framework for building applications powered by language models.', 'source': 'Public documentation and the official website', 'confidence': 95}
Field access: Answer: LangChain is a framework for building applications powered by language models., Confidence: 95%

Invalid output parse error: Got invalid return object. Expected key `source` to be present, but got {'answer': 'LangChain is a Python framework.', 'confidence': 90}
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE

Analysis:

- ResponseSchema declares the name, description, and type of each output field
- get_format_instructions() generates detailed format instructions that list every field with its type and description
- StructuredOutputParser parses output that follows the declared format, even when it is wrapped in markdown code-fence markers
- For malformed or incomplete output, it raises an exception with a detailed message; in this example the error concerns the missing `source` key
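The error shown above is about a missing key; nothing in the example demonstrates that value types are enforced. If types matter, a post-parse check may be worthwhile. The following is a minimal sketch in plain Python; `TYPE_MAP` and `check_types` are hypothetical names, and the type labels are simply the ones used in the schemas above.

```python
from typing import Any, Dict

# Hypothetical mapping from schema type labels to Python types.
TYPE_MAP = {"string": str, "integer": int, "float": (int, float)}


def check_types(data: Dict[str, Any], expected: Dict[str, str]) -> Dict[str, Any]:
    """Raise ValueError if a key is missing or its value has the wrong type."""
    for key, type_label in expected.items():
        if key not in data:
            raise ValueError(f"Missing key: {key}")
        value = data[key]
        # bool is a subclass of int, so booleans are rejected explicitly
        # (none of the labels used here is a boolean type).
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_label]):
            raise ValueError(f"Key {key!r} should be {type_label}, got {value!r}")
    return data


expected = {"answer": "string", "source": "string", "confidence": "integer"}
good = {"answer": "A framework.", "source": "docs", "confidence": 95}
print(check_types(good, expected)["confidence"])  # 95

try:
    check_types({**good, "confidence": "high"}, expected)
except ValueError as exc:
    print(exc)  # Key 'confidence' should be integer, got 'high'
```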

4. `CommaSeparatedListOutputParser`

`CommaSeparatedListOutputParser` parses a comma-separated string into a Python list, which is useful whenever the answer is naturally a list of items.

Class prototype (simplified):

class CommaSeparatedListOutputParser(BaseOutputParser[List[str]]):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str) -> List[str]:
        """Parse the output of an LLM call.
        
        Args:
            text: Text to parse.
            
        Returns:
            List of strings.
        """
        return [item.strip() for item in text.split(",")]

    def get_format_instructions(self) -> str:
        """Return a string describing the format of the output."""
        return "Your response should be a list of comma separated values, eg: `foo, bar, baz`"

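Methods:

parse(text: str) -> List[str]
- Parameter: text (str), a comma-separated string
- Purpose: splits the string on commas into a list of strings and strips leading and trailing whitespace from each item
- Returns: a list of strings
- Note: with the prototype above, empty items in the input (for example, from consecutive commas) become empty strings in the list
get_format_instructions() -> str
- Parameters: none
- Purpose: returns an instruction that asks the language model to output a comma-separated list
- Returns: the format instruction string

Example: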

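from langchain.output_parsers import CommaSeparatedListOutputParser

# Create the parser
parser = CommaSeparatedListOutputParser()

# Get and print the format instructions
print("Format instructions:", parser.get_format_instructions())

# Try different inputs
test_cases = [
    "apple, banana, orange, grape, watermelon",
    "Beijing,Shanghai,Guangzhou,Shenzhen",
    "programming, math , physics, chemistry ",
    "single element"  # no comma at all
]

for i, text in enumerate(test_cases, 1):
    result = parser.parse(text)
    print(f"\nTest case {i}:")
    print(f"Input: {text}")
    print(f"Result type: {type(result)}")
    print(f"Result: {result}")
    print(f"List length: {len(result)}")

Output (following the simplified prototype above; details such as whitespace handling and the instruction text may differ by version):

Format instructions: Your response should be a list of comma separated values, eg: `foo, bar, baz`

Test case 1:
Input: apple, banana, orange, grape, watermelon
Result type: <class 'list'>
Result: ['apple', 'banana', 'orange', 'grape', 'watermelon']
List length: 5

Test case 2:
Input: Beijing,Shanghai,Guangzhou,Shenzhen
Result type: <class 'list'>
Result: ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen']
List length: 4

Test case 3:
Input: programming, math , physics, chemistry 
Result type: <class 'list'>
Result: ['programming', 'math', 'physics', 'chemistry']
List length: 4

Test case 4:
Input: single element
Result type: <class 'list'>
Result: ['single element']
List length: 1

Analysis:

- The parser handles comma-separated lists with or without spaces after the commas
- Whitespace around each item is stripped, which keeps the result clean
- Input without any comma becomes a list with a single element
- The result is a standard Python list, ready for iteration, indexing, and similar operations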
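One edge case worth checking is consecutive or trailing commas. The sketch below reimplements the split-and-strip logic of the simplified prototype in plain Python (it does not call LangChain) and adds a hypothetical variant, `split_commas_nonempty`, that drops the empty items.

```python
from typing import List


def split_commas(text: str) -> List[str]:
    """Same logic as the simplified parse() above: split on commas, strip whitespace."""
    return [item.strip() for item in text.split(",")]


def split_commas_nonempty(text: str) -> List[str]:
    """Variant that drops empty items left by doubled or trailing commas."""
    return [item for item in split_commas(text) if item]


print(split_commas("a,, b ,"))           # ['a', '', 'b', '']
print(split_commas_nonempty("a,, b ,"))  # ['a', 'b']
```

Whether empty items should be kept or dropped depends on the application, so it can help to make that choice explicit in a thin wrapper like this one.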

5. `DatetimeOutputParser`

`DatetimeOutputParser` parses date-time output, converting a string into a Python datetime.datetime object that can be used directly in date and time calculations.

Class prototype (simplified):

class DatetimeOutputParser(BaseOutputParser[datetime.datetime]):
    """Parse the output of an LLM call to a datetime."""

    format: str = "%Y-%m-%dT%H:%M:%S"
    """The format to use for parsing datetime."""

    def parse(self, text: str) -> datetime.datetime:
        """Parse the output of an LLM call.
        
        Args:
            text: Text to parse.
            
        Returns:
            Parsed datetime.
        """
        try:
            return datetime.datetime.strptime(text.strip(), self.format)
        except ValueError as e:
            raise OutputParserException(
                f"Could not parse datetime: {text} with format: {self.format}. Error: {e}"
            ) from e

    def get_format_instructions(self) -> str:
        """Return a string describing the format of the output."""
        return f"Write a datetime string in the following format: {self.format}. For example: {datetime.datetime.now().strftime(self.format)}"

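Parameters and methods:

format (str): class attribute holding the date-time format string; "%Y-%m-%dT%H:%M:%S" in the prototype above. The default shipped with the library can differ between releases (some use "%Y-%m-%dT%H:%M:%S.%fZ"), so check your installed version or set the format explicitly.
Format codes:
- %Y: four-digit year
- %m: two-digit month
- %d: two-digit day of the month
- %H: hour on a 24-hour clock
- %M: minute
- %S: second

parse(text: str) -> datetime.datetime
- Parameter: text (str), a string containing a date and time
- Purpose: parses the string into a datetime.datetime object using the configured format
- Returns: a datetime.datetime object
- Raises: OutputParserException if the input does not match the configured format
get_format_instructions() -> str
- Parameters: none
- Purpose: returns an instruction that asks the language model for a date-time in the configured format, with a description of the format and an example
- Returns: the format instruction string

Example: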

from langchain.output_parsers import DatetimeOutputParser
from langchain_core.exceptions import OutputParserException
import datetime

# Create parsers with two different formats.
# The format is passed explicitly because the library default may differ by version.
parser1 = DatetimeOutputParser(format="%Y-%m-%dT%H:%M:%S")
parser2 = DatetimeOutputParser(format="%Y-%m-%d %H:%M:%S")

# Print the format instructions
print("Parser 1 format instructions:", parser1.get_format_instructions())
print("Parser 2 format instructions:", parser2.get_format_instructions())

# Test parser 1
print("\nTesting parser 1:")
valid_input1 = "2023-11-05T14:30:00"
invalid_input1 = "2023/11/05 14:30"

try:
    result_valid1 = parser1.parse(valid_input1)
    print(f"Valid input result: {result_valid1}")
    print(f"Type: {type(result_valid1)}")
    print(f"Year: {result_valid1.year}, Month: {result_valid1.month}, Day: {result_valid1.day}")
except OutputParserException as e:
    print(f"Valid input parse error: {e}")

try:
    result_invalid1 = parser1.parse(invalid_input1)
    print(f"Invalid input result: {result_invalid1}")
except OutputParserException as e:
    print(f"Invalid input parse error: {e}")

# Test parser 2
print("\nTesting parser 2:")
valid_input2 = "2023-11-05 14:30:00"

try:
    result_valid2 = parser2.parse(valid_input2)
    print(f"Valid input result: {result_valid2}")
    print(f"Hour: {result_valid2.hour}, Minute: {result_valid2.minute}")
except OutputParserException as e:
    print(f"Parse error: {e}")

Output (instruction and error text follow the simplified prototype above; the exact wording depends on the installed version):

Parser 1 format instructions: Write a datetime string in the following format: %Y-%m-%dT%H:%M:%S. For example: 2023-11-05T10:15:30
Parser 2 format instructions: Write a datetime string in the following format: %Y-%m-%d %H:%M:%S. For example: 2023-11-05 10:15:30

Testing parser 1:
Valid input result: 2023-11-05 14:30:00
Type: <class 'datetime.datetime'>
Year: 2023, Month: 11, Day: 5
Invalid input parse error: Could not parse datetime: 2023/11/05 14:30 with format: %Y-%m-%dT%H:%M:%S. Error: time data '2023/11/05 14:30' does not match format '%Y-%m-%dT%H:%M:%S'

Testing parser 2:
Valid input result: 2023-11-05 14:30:00
Hour: 14, Minute: 30

Analysis:

- The result is a datetime.datetime object, so the year, month, day, hour, minute, and second are available as attributes
- The format parameter selects the date-time format, which lets the parser adapt to different scenarios
- The parser is strict about the format: input that does not match raises an exception with a detailed message
- In the prototype above, get_format_instructions() builds its example from the current time, which helps the language model understand the required format
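Once a string has been parsed into a `datetime.datetime`, ordinary date arithmetic applies. The short sketch below uses only the standard library, with the same format string as the second parser above, to show a round trip and a duration calculation.

```python
import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"

# Parse, shift by a fixed duration, and format back to text.
start = datetime.datetime.strptime("2023-11-05 14:30:00", FORMAT)
deadline = start + datetime.timedelta(days=3, hours=2)

print(deadline.strftime(FORMAT))                  # 2023-11-08 16:30:00
print((deadline - start).total_seconds() / 3600)  # 74.0
print(start.weekday())                            # 6 (Monday is 0, so this is a Sunday)
```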

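Using a Parser in Practice

In a real application, the complete workflow usually has the following steps:

1. Define the output format (with ResponseSchema or similar)
2. Create the matching parser
3. Get the format instructions and include them in the prompt
4. Call the language model to obtain its output
5. Parse the output with the parser
6. Process the parsed result

The complete example below combines a parser with a prompt template and a language model: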

from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import PromptTemplate
import os

# Note: to call a real model you need to set your own API key
# os.environ["OPENAI_API_KEY"] = "your_api_key"

# 1. Define the response schemas
response_schemas = [
    ResponseSchema(name="movie_title", description="Movie title", type="string"),
    ResponseSchema(name="director", description="Director's name", type="string"),
    ResponseSchema(name="year", description="Release year", type="integer"),
    ResponseSchema(name="genre", description="Genres, comma-separated if more than one", type="string"),
    ResponseSchema(name="rating", description="Rating out of 10, to one decimal place", type="float")
]

# 2. Create the structured output parser
parser = StructuredOutputParser.from_response_schemas(response_schemas)

# 3. Get the format instructions
format_instructions = parser.get_format_instructions()
print("Format instructions preview:")
print(format_instructions[:500] + "...")  # show at most the first 500 characters

# 4. Create the prompt template (both template variables are declared)
prompt = PromptTemplate(
    input_variables=["movie_description", "format_instructions"],
    template="Analyze the following movie description and extract the relevant information.\n{format_instructions}\nMovie description: {movie_description}"
)

# 5. Format the prompt
formatted_prompt = prompt.format_prompt(
    movie_description="Released in 1994 and directed by Robert Zemeckis, this classic film tells the extraordinary life story of a simple man named Forrest Gump. "
                      "It blends drama, romance, and history, was widely acclaimed worldwide, and has an IMDb rating of 8.8.",
    format_instructions=format_instructions
)

# 6. Call the language model (a simulated output is used here)
# In a real application, call an actual language model instead
# from langchain.llms import OpenAI
# llm = OpenAI(temperature=0)
# llm_output = llm(formatted_prompt.to_string())

llm_output = '''```json
{
    "movie_title": "Forrest Gump",
    "director": "Robert Zemeckis",
    "year": 1994,
    "genre": "drama, romance, history",
    "rating": 8.8
}
```'''

print("\nLanguage model output:")
print(llm_output)

# 7. Parse the output
result = parser.parse(llm_output)

# 8. Process the parsed result
print("\nParsed result:")
print(f"Movie title: {result['movie_title']}")
print(f"Director: {result['director']}")
print(f"Release year: {result['year']}")
print(f"Genre: {result['genre']}")
print(f"Rating: {result['rating']}/10")
print(f"Rating type: {type(result['rating'])}")

Output:

Format instructions preview:
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"movie_title": string  // Movie title
	"director": string  // Director's name
	"year": integer  // Release year
	"genre": string  // Genres, comma-separated if more than one
	"rating": float  // Rating out of 10, to one decimal place
}
```...

Language model output:
```json
{
    "movie_title": "Forrest Gump",
    "director": "Robert Zemeckis",
    "year": 1994,
    "genre": "drama, romance, history",
    "rating": 8.8
}
```

Parsed result:
Movie title: Forrest Gump
Director: Robert Zemeckis
Release year: 1994
Genre: drama, romance, history
Rating: 8.8/10
Rating type: <class 'float'>


## Error Handling

Parsing often fails in practice, for example when the model's output does not follow the expected format. LangChain provides `OutputParserException` as a common exception type for these failures.

**`OutputParserException` class prototype (simplified)**:
```python
class OutputParserException(ValueError):
    """Exception that output parsers should raise when they fail to parse output."""

    def __init__(self, message: str, llm_output: Optional[str] = None) -> None:
        super().__init__(message)
        self.llm_output = llm_output
```

Parameters:

- message (str): the error message
- llm_output (Optional[str]): the raw language model output that caused the error
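Carrying the raw output on the exception lets the caller log it or feed it back to the model. The stand-alone sketch below mirrors the shape of the prototype in plain Python; `SketchParserError` and `parse_int` are hypothetical names and are not part of LangChain.

```python
from typing import Optional


class SketchParserError(ValueError):
    """Stand-in with the same shape as the prototype above (not LangChain's class)."""

    def __init__(self, message: str, llm_output: Optional[str] = None) -> None:
        super().__init__(message)
        self.llm_output = llm_output


def parse_int(text: str) -> int:
    """Hypothetical parser: expects the model to answer with a bare integer."""
    try:
        return int(text.strip())
    except ValueError as exc:
        # Attach the offending text so the handler does not have to track it.
        raise SketchParserError("Expected an integer", llm_output=text) from exc


try:
    parse_int("about forty-two")
except SketchParserError as err:
    print(err)             # Expected an integer
    print(err.llm_output)  # about forty-two
```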

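Error-handling example:

# Import paths vary between LangChain releases; these are the langchain_core locations.
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import SimpleJsonOutputParser

parser = SimpleJsonOutputParser()

# Try different kinds of problematic input
test_cases = [
    "{'name': 'Alice', 'age': 30}",     # single quotes are not valid JSON
    'Not a JSON string',                # not JSON at all
    '{"name": "Bob", "age": "thirty"}'  # wrong value type, but still valid JSON
]

for i, invalid_output in enumerate(test_cases, 1):
    try:
        result = parser.parse(invalid_output)
        print(f"Test case {i} parsed successfully: {result}")
    except OutputParserException as e:
        print(f"Test case {i} parse error:")
        print(f"  Error message: {e}")
        print(f"  Error type: {type(e)}")
        print(f"  Raw output: {invalid_output}\n")

Output (the error text follows the simplified prototype shown earlier; the exact wording depends on the installed version):

Test case 1 parse error:
  Error message: Could not parse output: {'name': 'Alice', 'age': 30} as JSON. Error: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
  Error type: <class 'langchain_core.exceptions.OutputParserException'>
  Raw output: {'name': 'Alice', 'age': 30}

Test case 2 parse error:
  Error message: Could not parse output: Not a JSON string as JSON. Error: Expecting value: line 1 column 1 (char 0)
  Error type: <class 'langchain_core.exceptions.OutputParserException'>
  Raw output: Not a JSON string

Test case 3 parsed successfully: {'name': 'Bob', 'age': 'thirty'}

Analysis:

- The parser wraps the underlying parsing error in an OutputParserException with a clear message
- The third test case has a logical error (age should be a number but is a string), yet it is still valid JSON, so no exception is raised
- In real applications, always wrap parsing in a try-except block so that a malformed model output does not crash the program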
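Catching the exception is often only the first step; a common follow-up is to ask the model again. The sketch below shows one possible retry loop using only the standard library. `parse_with_retry` and `fake_model` are hypothetical, and the stub stands in for a real model call.

```python
import json
from typing import Any, Callable, List


def parse_with_retry(ask_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> Any:
    """Call the model until its output parses as JSON or the attempts run out."""
    last_error: Exception = ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        raw = ask_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Feed the failure back so the next attempt can correct itself.
            prompt = f"{prompt}\n\nYour previous reply was not valid JSON ({exc}). Reply with JSON only."
    raise ValueError(f"No parseable output after {max_attempts} attempts") from last_error


# Stub standing in for a real model call: fails once, then returns valid JSON.
replies: List[str] = ["Sure! Here you go", '{"name": "Alice", "age": 30}']
calls: List[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return replies[len(calls) - 1]


print(parse_with_retry(fake_model, "Describe Alice as JSON."))  # {'name': 'Alice', 'age': 30}
print(len(calls))                                               # 2
```

Capping the number of attempts keeps a persistently misbehaving model from looping forever.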

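Custom Parsers

When the built-in parsers do not meet your needs, you can create your own by subclassing BaseOutputParser.

Steps for writing a custom parser:

1. Subclass BaseOutputParser
2. Specify the generic type parameter (the type of the parsed data)
3. Implement the parse method
4. (Optional) Override get_format_instructions
5. (Optional) Override aparse to provide asynchronous support

Custom parser example: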

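# Import path varies between LangChain releases; this is the langchain_core location.
from langchain_core.output_parsers import BaseOutputParser
from typing import List, Tuple
import re

class KeyValuePairParser(BaseOutputParser[List[Tuple[str, str]]]):
    """
    Custom parser that turns "key: value" text into a list of (key, value) tuples.
    """
    
    def parse(self, text: str) -> List[Tuple[str, str]]:
        """
        Parse the text and extract key-value pairs.
        
        Args:
            text: Text with one key-value pair per line, formatted as "key: value".
            
        Returns:
            A list of (key, value) tuples.
        """
        # Split the text into lines
        lines = text.strip().split('\n')
        result = []
        
        for line in lines:
            # Match the "key: value" format; lines that do not match are skipped
            match = re.match(r'^\s*([^:]+?)\s*:\s*(.*?)\s*$', line)
            if match:
                key = match.group(1)
                value = match.group(2)
                result.append((key, value))
        
        return result
    
    def get_format_instructions(self) -> str:
        """Return the format instructions."""
        return (
            "Output one key-value pair per line, with a colon between key and value, for example:\n"
            "name: Zhang San\n"
            "age: 30\n"
            "occupation: engineer"
        )

# Use the custom parser
parser = KeyValuePairParser()

# Print the format instructions
print("Format instructions:")
print(parser.get_format_instructions())

# Test parsing
text = """
name: Li Si
age: 28
occupation: designer
hobby: painting, travel
"""

result = parser.parse(text)

print("\nParsed result:")
print(f"Type: {type(result)}")
print(f"Content: {result}")

# Process the parsed result
print("\nProcessed result:")
for key, value in result:
    print(f"{key} -> {value}")

Output:

Format instructions:
Output one key-value pair per line, with a colon between key and value, for example:
name: Zhang San
age: 30
occupation: engineer

Parsed result:
Type: <class 'list'>
Content: [('name', 'Li Si'), ('age', '28'), ('occupation', 'designer'), ('hobby', 'painting, travel')]

Processed result:
name -> Li Si
age -> 28
occupation -> designer
hobby -> painting, travel

Analysis:

- A custom parser can handle output formats that the built-in parsers do not cover
- KeyValuePairParser parses the "key: value" text into a list of tuples; every value stays a string, including the age
- Implementing get_format_instructions gives the language model a clear description of the format, which helps it produce output the parser can read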
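The `parse` method above silently skips lines that do not match the pattern. When silent data loss is a concern, a stricter variant can raise instead. The following plain-Python sketch reuses the same regular expression; `parse_pairs_strict` is a hypothetical function, not part of the example class or of LangChain.

```python
import re
from typing import List, Tuple

# Same pattern as in the parser above: key, colon, value, with optional whitespace.
_PAIR = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")


def parse_pairs_strict(text: str) -> List[Tuple[str, str]]:
    """Like the parse() above, but reject non-blank lines that do not match."""
    pairs = []
    for line_no, line in enumerate(text.strip().split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are harmless, skip them
        match = _PAIR.match(line)
        if match is None:
            raise ValueError(f"Line {line_no} is not 'key: value': {line!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


pairs = parse_pairs_strict("name: Li Si\nage: 28\nurl: https://example.com")
print(dict(pairs))
# {'name': 'Li Si', 'age': '28', 'url': 'https://example.com'}
```

Only the first colon on a line separates key from value, so values that contain colons, such as URLs, survive intact.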

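Summary

Parsers are an important LangChain component: they bridge unstructured text output and structured data. This article covered the commonly used parser classes and how to use them:

- `BaseOutputParser`: the base class for parsers; defines the basic interface
- `SimpleJsonOutputParser`: parses JSON output into a Python dictionary
- `StructuredOutputParser`: works with `ResponseSchema` to produce more complex structured output
- `CommaSeparatedListOutputParser`: parses a comma-separated list into a Python list
- `DatetimeOutputParser`: parses a date-time string into a `datetime.datetime` object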
