本文是《LangChain实战课》系列的第十六篇,将手把手教你如何使用SQLDatabaseToolkit和Agent构建一个智能的数据分析助手。这个助手能够理解自然语言问题,自动编写和执行SQL查询,并对结果进行智能分析和解释,让数据查询变得像对话一样简单。
前言
在当今数据驱动的世界中,能够快速从数据库中提取和分析信息是至关重要的技能。然而,编写复杂的SQL查询对于非技术用户来说往往是一个挑战。通过结合LangChain的SQLDatabaseToolkit和Agent技术,我们可以创建一个智能的数据分析助手,让用户用自然语言提问,自动生成和执行SQL查询,并获得有意义的分析结果。
为什么需要SQL数据分析Agent?
传统数据查询的痛点
技术门槛高:需要熟练掌握SQL语法
效率低下:手动编写查询耗时且容易出错
理解困难:非技术人员难以理解查询结果
灵活性差:固定的查询难以应对多变的分析需求
SQL Agent的优势
自然语言交互:用户可以用日常语言提问
自动查询生成:AI自动编写优化的SQL查询
智能结果解释:不仅返回数据,还提供分析和洞察
安全可控:可以限制查询范围和操作权限
环境准备和安装
在开始之前,我们需要安装必要的依赖包:
# 安装LangChain核心库
pip install langchain
# 安装SQL相关依赖
pip install sqlalchemy
# 安装数据库驱动(根据使用的数据库选择)
pip install pymysql # MySQL
pip install psycopg2-binary # PostgreSQL
pip install sqlite3 # SQLite(通常已内置)
# 安装其他工具包
pip install pandas
pip install numpy
pip install python-dotenv
# 安装可选的可视化库
pip install matplotlib
pip install seaborn
创建示例数据库
首先,让我们创建一个丰富的示例数据库来演示SQL Agent的功能:
import sqlite3
import pandas as pd
from datetime import datetime, timedelta
import random
def create_sample_database():
"""创建包含多个表的示例数据库"""
conn = sqlite3.connect('company_data.db')
cursor = conn.cursor()
# 创建员工表
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
employee_id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
phone TEXT,
hire_date DATE NOT NULL,
job_title TEXT NOT NULL,
department TEXT NOT NULL,
salary REAL NOT NULL,
manager_id INTEGER,
FOREIGN KEY (manager_id) REFERENCES employees (employee_id)
)
''')
# 创建部门表
cursor.execute('''
CREATE TABLE IF NOT EXISTS departments (
department_id INTEGER PRIMARY KEY,
department_name TEXT UNIQUE NOT NULL,
location TEXT NOT NULL,
budget REAL NOT NULL
)
''')
# 创建销售表
cursor.execute('''
CREATE TABLE IF NOT EXISTS sales (
sale_id INTEGER PRIMARY KEY,
employee_id INTEGER NOT NULL,
sale_date DATE NOT NULL,
product_category TEXT NOT NULL,
amount REAL NOT NULL,
region TEXT NOT NULL,
FOREIGN KEY (employee_id) REFERENCES employees (employee_id)
)
''')
# 创建项目表
cursor.execute('''
CREATE TABLE IF NOT EXISTS projects (
project_id INTEGER PRIMARY KEY,
project_name TEXT NOT NULL,
start_date DATE NOT NULL,
end_date DATE,
budget REAL NOT NULL,
status TEXT NOT NULL CHECK(status IN ('Active', 'Completed', 'On Hold'))
)
''')
# 创建员工项目关联表
cursor.execute('''
CREATE TABLE IF NOT EXISTS employee_projects (
employee_id INTEGER NOT NULL,
project_id INTEGER NOT NULL,
role TEXT NOT NULL,
hours_worked REAL DEFAULT 0,
PRIMARY KEY (employee_id, project_id),
FOREIGN KEY (employee_id) REFERENCES employees (employee_id),
FOREIGN KEY (project_id) REFERENCES projects (project_id)
)
''')
# 插入部门数据
departments = [
(1, 'Engineering', 'San Francisco', 1500000),
(2, 'Sales', 'New York', 1200000),
(3, 'Marketing', 'Chicago', 800000),
(4, 'HR', 'Austin', 500000),
(5, 'Finance', 'Boston', 900000)
]
cursor.executemany('INSERT INTO departments VALUES (?, ?, ?, ?)', departments)
# 插入员工数据
employees = [
(1, 'John', 'Doe', 'john.doe@company.com', '555-0101', '2020-01-15', 'Senior Engineer', 'Engineering', 120000, None),
(2, 'Jane', 'Smith', 'jane.smith@company.com', '555-0102', '2019-03-10', 'Engineering Manager', 'Engineering', 150000, 1),
(3, 'Bob', 'Johnson', 'bob.johnson@company.com', '555-0103', '2021-06-20', 'Software Engineer', 'Engineering', 95000, 2),
(4, 'Alice', 'Brown', 'alice.brown@company.com', '555-0104', '2018-11-05', 'Sales Director', 'Sales', 130000, None),
(5, 'Charlie', 'Wilson', 'charlie.wilson@company.com', '555-0105', '2022-02-14', 'Sales Representative', 'Sales', 75000, 4),
(6, 'Diana', 'Lee', 'diana.lee@company.com', '555-0106', '2020-07-30', 'Marketing Manager', 'Marketing', 110000, None),
(7, 'Mike', 'Chen', 'mike.chen@company.com', '555-0107', '2021-09-12', 'Marketing Specialist', 'Marketing', 85000, 6),
(8, 'Sarah', 'Davis', 'sarah.davis@company.com', '555-0108', '2019-12-01', 'HR Manager', 'HR', 100000, None),
(9, 'Tom', 'Anderson', 'tom.anderson@company.com', '555-0109', '2022-04-25', 'HR Coordinator', 'HR', 60000, 8),
(10, 'Lisa', 'Garcia', 'lisa.garcia@company.com', '555-0110', '2020-08-18', 'Finance Manager', 'Finance', 125000, None)
]
cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', employees)
# 插入项目数据
projects = [
(1, 'Website Redesign', '2023-01-10', '2023-06-15', 50000, 'Completed'),
(2, 'Mobile App Development', '2023-03-01', None, 75000, 'Active'),
(3, 'Market Research', '2023-02-15', '2023-04-30', 30000, 'Completed'),
(4, 'ERP Implementation', '2023-05-01', None, 100000, 'Active'),
(5, 'Product Launch', '2023-06-01', None, 60000, 'On Hold')
]
cursor.executemany('INSERT INTO projects VALUES (?, ?, ?, ?, ?, ?)', projects)
# 插入员工项目关联数据
employee_projects = [
(1, 1, 'Lead Developer', 120),
(2, 1, 'Project Manager', 40),
(3, 1, 'Developer', 80),
(1, 2,<