Python Day 21: The re Module and Regular Expressions, a Simple Novel Scraper, and Worked Exercises - EW帮帮网

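I. Regular Expression Basics

A regular expression (abbreviated regex, regexp, or re) is a pattern string built from ordinary and special characters. It is used to search, extract, replace, and validate text quickly.

1. Basic matching rules

Literal characters: xyz matches the string "xyz". Characters with no special meaning match themselves.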

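2. Single-character matching rules

Expression	Description
`[xyz]`	Matches any one of x, y, or z
`[0-9]`	Matches any one digit
`[a-z]`	Matches any one lowercase letter
`[A-Za-z]`	Matches any one letter (upper or lower case)
`[A-Za-z0-9]`	Matches any one digit or letter
`[A-Za-z0-9_]`	Matches any one word character (letter, digit, or underscore)
`[-0-9]`	Matches any one digit or a hyphen
`[\u4e00-\u9fa5]`	Matches one Chinese character (the common CJK ideograph range)
`[^xyz]`	Matches any character other than x, y, or z (negation)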
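
The character classes in the table can be tried directly with `re.findall`. The following is a minimal sketch; the sample string is made up for illustration.

```python
import re

sample = "a1-B2_汉c"  # made-up sample text

# [0-9] picks out single digits.
digits = re.findall(r"[0-9]", sample)
# [-0-9] also accepts the hyphen (placed first, so it is taken literally).
digits_or_hyphen = re.findall(r"[-0-9]", sample)
# [\u4e00-\u9fa5] covers the common CJK ideograph range.
chinese = re.findall(r"[\u4e00-\u9fa5]", sample)
# [^A-Za-z0-9_] negates the class: anything that is not an ASCII word character.
others = re.findall(r"[^A-Za-z0-9_]", sample)

print(digits, digits_or_hyphen, chinese, others)
```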

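3. Metacharacter matching rules

Metacharacters are predefined special sequences that shorten common patterns:

Metacharacter	Description	Equivalent expression
`\d`	Matches any one digit	`[0-9]` (with `re.ASCII`; otherwise any Unicode decimal digit)
`\D`	Matches any one non-digit character	`[^0-9]` (with `re.ASCII`)
`\w`	Matches any one word character (in Python 3, `str` patterns are Unicode-aware, so Chinese characters match too)	`[0-9a-zA-Z_]` (only with `re.ASCII`)
`\W`	Matches any one non-word character	`[^0-9a-zA-Z_]` (only with `re.ASCII`)
`\s`	Matches any one whitespace character (space, tab, newline, and similar)	-
`\S`	Matches any one non-whitespace character	-
`.`	Matches any one character except a newline (unless `re.DOTALL` is set)	-
`\.`	Matches a literal dot (escaped)	-
`[\d\D]`	Matches any one character, including a newline	-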
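
A short sketch of how these metacharacters behave in Python 3, including the difference `re.ASCII` makes for `\w`. The sample text is invented for illustration.

```python
import re

text = "id_7 汉字\n3.14"  # made-up sample containing a newline

digits = re.findall(r"\d", text)
# By default \w is Unicode-aware, so the Chinese run counts as a word.
words_unicode = re.findall(r"\w+", text)
# With re.ASCII, \w shrinks to [A-Za-z0-9_].
words_ascii = re.findall(r"\w+", text, re.ASCII)
# An escaped dot matches only a literal ".".
decimal = re.findall(r"\d\.\d", text)
# "." skips the newline, while [\d\D] matches every character.
dot_count = len(re.findall(r".", text))
any_count = len(re.findall(r"[\d\D]", text))

print(digits, words_unicode, words_ascii, decimal, dot_count, any_count)
```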

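4. Multi-character matching rules (quantifiers)

Quantifiers set how many times a rule repeats. Here x stands for any single-character or metacharacter rule:

Expression	Description	Equivalent form
`x{m}`	Matches exactly m consecutive x (m is an integer)	-
`x{m,}`	Matches at least m consecutive x	-
`x{m,n}`	Matches at least m and at most n consecutive x	-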
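
The three quantifier forms can be compared on a single string. This sketch wraps each pattern in `\b` (a word boundary, which is not listed in the tables above and is added here only so that each run of digits is tested as a whole).

```python
import re

text = "7 42 123 4567 89012"  # made-up sample

# Exactly three digits.
exactly_three = re.findall(r"\b\d{3}\b", text)
# Three or more digits.
at_least_three = re.findall(r"\b\d{3,}\b", text)
# Between two and four digits.
two_to_four = re.findall(r"\b\d{2,4}\b", text)

print(exactly_three, at_least_three, two_to_four)
```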

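5. Greedy and non-greedy matching

Greedy matching: match as much as possible (the default).
- x*: x appears zero or more times (same as x{0,})
- x+: x appears one or more times (same as x{1,})
- x?: x appears at most once (same as x{0,1})
Non-greedy matching: match as little as possible (append ? to a greedy quantifier).
- x*?: x appears zero or more times (non-greedy x*)
- x+?: x appears one or more times (non-greedy x+)
- x??: x appears at most once (non-greedy x?)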
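
The difference is easiest to see when a pattern could stop at more than one place. A minimal sketch, with a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"  # made-up sample

# Greedy: .* runs to the last </b>, so both tags land in one match.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.findall(r"<b>.*?</b>", html)

# The same contrast on a run of digits.
greedy_digits = re.search(r"\d+", "12345").group()
lazy_digits = re.search(r"\d+?", "12345").group()
# \d?? prefers zero repetitions, so it matches the empty string.
lazy_optional = re.search(r"\d??", "12345").group()

print(greedy, lazy, greedy_digits, lazy_digits, repr(lazy_optional))
```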

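6. Grouping rules

Groups split a pattern into parts so that specific pieces can be extracted:

Capturing group: (regex). Content is retrieved by group index, starting at 1.
Named capturing group: (?P<name>regex). Content is retrieved by the group name (Python syntax).
Non-capturing group: (?:regex). Groups the pattern only; nothing is captured separately.
Backreference: \n (n is a number) refers to the text matched by group n. For example, (\d)\1 matches two identical digits in a row.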
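
A brief sketch of the four grouping forms, using invented sample strings:

```python
import re

# Numbered groups are read by index, starting at 1.
m = re.search(r"(\d{4})-(\d{2})", "released 2024-05")
year, month = m.group(1), m.group(2)

# Named groups are read by name.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-05")
named_year = named.group("year")

# (?:...) groups without capturing, so findall returns only the (\d) group.
after_ab = re.findall(r"(?:ab)+(\d)", "abab1 ab2")

# \1 must repeat whatever group 1 matched: two identical digits in a row.
doubled = re.findall(r"(\d)\1", "112 345 667")

print(year, month, named_year, after_ab, doubled)
```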

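7. Alternation rules

regex1 | regex2: matches whatever regex1 or regex2 matches (logical "or").

8. Anchors (used for validation)

^: placed at the start of a pattern, means "begins with" (for example, ^123 matches strings that begin with "123").
$: placed at the end of a pattern, means "ends with" (for example, 123$ matches strings that end with "123").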
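
Alternation and anchors are often combined when validating input. The sketch below also shows a common pitfall: `|` has the lowest precedence, so anchors should wrap a group when they are meant to apply to every alternative. The sample strings are made up.

```python
import re

animals = re.findall(r"cat|dog", "cat, dog, bird")

starts_with_123 = bool(re.search(r"^123", "123abc"))
not_at_start = bool(re.search(r"^123", "abc123"))
ends_with_123 = bool(re.search(r"123$", "abc123"))

# Parsed as (^cat)|(dog$): "catalog" passes because it starts with "cat".
loose = bool(re.search(r"^cat|dog$", "catalog"))
# Grouping makes both anchors apply to each alternative.
strict = bool(re.search(r"^(?:cat|dog)$", "catalog"))
strict_ok = bool(re.search(r"^(?:cat|dog)$", "dog"))

print(animals, starts_with_123, not_at_start, ends_with_123, loose, strict, strict_ok)
```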

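9. Assertions (lookaround conditions that are checked but not included in the result)

Positive lookahead: regex1(?=regex2) matches regex1 only when it is immediately followed by regex2.
Positive lookbehind: (?<=regex1)regex2 matches regex2 only when it is immediately preceded by regex1.
Negative lookahead: regex1(?!regex2) matches regex1 only when it is not immediately followed by regex2.
Negative lookbehind: (?<!regex1)regex2 matches regex2 only when it is not immediately preceded by regex1.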
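
The four assertions can be compared on one made-up sentence. In the two negative cases the sketch adds `\b` word boundaries (an addition not covered above) so that backtracking cannot return part of a number such as "20" from "200".

```python
import re

text = "price: $100, cost: 200 yuan, tax 30"  # made-up sample

# Positive lookahead: digits followed by " yuan" (the suffix is not returned).
before_yuan = re.findall(r"\d+(?= yuan)", text)
# Positive lookbehind: digits preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", text)
# Negative lookahead: whole numbers not followed by " yuan".
not_yuan = re.findall(r"\b\d+\b(?! yuan)", text)
# Negative lookbehind: whole numbers not preceded by "$".
not_dollar = re.findall(r"(?<!\$)\b\d+\b", text)

print(before_yuan, after_dollar, not_yuan, not_dollar)
```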

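10. Common regex examples

Numbers no greater than 255: every case from 0 to 255 must be covered, for example (?:[01]\d\d|2[0-4]\d|25[0-5]|\d\d|\d).
Mainland China mobile numbers: 1[3-9]\d{9} (11 digits, starting with 1, second digit 3-9, followed by any 9 digits).
Mainland China landline numbers (area code - number): 0\d{2,3}-\d{7,8} (the area code starts with 0 and has 3-4 digits in total; the number has 7-8 digits).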
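
For validation, these patterns are usually applied to the whole string. The sketch below wraps the 0-255 pattern and the mobile-number pattern in two small helpers; `is_byte` and `is_mobile` are hypothetical names chosen for this example.

```python
import re

BYTE = r"(?:[01]\d\d|2[0-4]\d|25[0-5]|\d\d|\d)"
MOBILE = r"1[3-9]\d{9}"

def is_byte(s: str) -> bool:
    """True if s is a number from 0 to 255 (leading zeros such as '099' pass)."""
    return re.fullmatch(BYTE, s) is not None

def is_mobile(s: str) -> bool:
    """True if s is 11 digits, starts with 1, and has 3-9 as its second digit."""
    return re.fullmatch(MOBILE, s) is not None

print(is_byte("255"), is_byte("256"), is_mobile("13345672234"), is_mobile("12345678901"))
```

`re.fullmatch` is used instead of `re.search` so that a valid fragment inside a longer string, such as "25" inside "256", does not count as a pass.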

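II. The Python re Module (regex toolkit)

The re module provides several functions for processing strings with regular expressions. The core ones are listed below.

1. Search and extraction functions

Function	Description
`re.findall(regex, string, flags=0)`	Extracts all matches and returns a list. - No groups: each element is the full match; - One group: each element is that group's content; - Several groups: each element is a tuple of the groups' contents.
`re.finditer(regex, string, flags=0)`	Extracts all matches and returns an iterator of Match objects.
`re.search(regex, string, flags=0)`	Finds the first match and returns a Match object (None if there is no match).
`re.match(regex, string, flags=0)`	Matches only at the start of the string (roughly `^regex`) and returns a Match object (None if the start does not match).
`re.fullmatch(regex, string, flags=0)`	Checks that the whole string matches the pattern (roughly `^regex$`) and returns a Match object (None if the string does not match in full).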
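
The sketch below runs each of these functions against one made-up string to show how the return values differ, including the three shapes `re.findall` can produce.

```python
import re

text = "tel: 010-12345678, fax: 021-87654321"  # made-up sample

no_group = re.findall(r"\d{3}-\d{8}", text)        # full matches
one_group = re.findall(r"(\d{3})-\d{8}", text)     # contents of the single group
two_groups = re.findall(r"(\d{3})-(\d{8})", text)  # tuples, one per match

# finditer yields Match objects, so positions are available too.
starts = [m.start() for m in re.finditer(r"\d{3}-\d{8}", text)]

first = re.search(r"\d+", text).group()  # first match anywhere in the string
at_start = re.match(r"\d+", text)        # None: the string starts with "tel"
whole = re.fullmatch(r"\d{3}-\d{8}", "010-12345678")
not_whole = re.fullmatch(r"\d{3}-\d{8}", text)

print(no_group, one_group, two_groups, starts, first, at_start, whole, not_whole)
```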

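2. Splitting and replacement functions

Function	Description
`re.split(regex, string, maxsplit=0, flags=0)`	Splits the string at each match and returns the resulting list (`maxsplit` is the maximum number of splits; 0 means split everywhere).
`re.sub(regex, repl, string, count=0, flags=0)`	Replaces matches with `repl` and returns the new string (`count` is the number of replacements; 0 means replace all). - `repl` may be a string or a function (the function receives a Match object and returns the replacement text).
`re.subn(regex, repl, string, count=0, flags=0)`	Same as `sub`, but returns the tuple `(new string, number of replacements)`.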
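
A minimal sketch of splitting and replacing, with made-up inputs. `maxsplit` and `count` are passed by keyword, which keeps the calls readable.

```python
import re

parts = re.split(r"[,;\s]+", "a, b;c  d")
first_split_only = re.split(r"[,;\s]+", "a, b;c  d", maxsplit=1)

masked = re.sub(r"\d", "#", "a1b22c")
masked_once = re.sub(r"\d", "#", "a1b22c", count=1)

# repl can be a function: it receives the Match and returns the new text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a1b22c")

# subn also reports how many replacements were made.
result, n = re.subn(r"\d", "#", "a1b22c")

print(parts, first_split_only, masked, masked_once, doubled, result, n)
```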

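3. Match objects (match results)

The Match objects returned by the functions above hold the details of a match. Commonly used methods:

Method	Description
`group(n=0)`	Returns the text matched by the given group (n=0 is the full match; group names also work, as in `group('name')`).
`start(n=0)`	Returns the start index of the given group's match.
`end(n=0)`	Returns the end index of the given group's match (`end - start = match length`).
`span(n=0)`	Returns the `(start index, end index)` tuple for the given group.
`groups()`	Returns a tuple with the text of all groups.
`groupdict()`	Returns a dict of the named groups (keys are group names, values are the matched text).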
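
The methods in the table can be exercised on a single match. This is a small sketch; the sentence and the group names `area` and `number` are invented for the example.

```python
import re

m = re.search(r"(?P<area>0\d{2,3})-(?P<number>\d{7,8})", "call 010-12345678 now")

full = m.group()            # same as m.group(0)
area = m.group(1)           # by index
number = m.group("number")  # by name
span = m.span()             # (start, end) of the full match
number_span = m.span("number")
length = m.end() - m.start()

print(full, area, number, span, number_span, length)
print(m.groups())
print(m.groupdict())
```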

III. Practical Example (scraping novel content)

Fetch the pages with requests, then use re to extract the chapter list and the chapter text:

import requests
import re

# Target URL and request headers
url = "https://www.577ff.cfd/book/58732/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "referer": "https://www.bqgui.cc/"
}

# 1. Fetch the novel's table-of-contents page
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop here if the request failed
content = response.text  # page source

# 2. Extract chapter links and titles (the regex matches the chapter tags)
regex = r'<dd><a\s+href\s*="(.*?)">(.*?)</a></dd>'
caption_list = re.findall(regex, content)  # each element is a (link, title) tuple

# 3. Walk through the chapters, extract the text, and write it to a file
with open("book.txt", "wt", encoding="utf-8") as f:
    for href, title in caption_list:
        # Build the absolute URL of the chapter page
        abs_url = f"https://www.577ff.cfd{href}"
        print(f"Fetching: {title} ({abs_url})")

        # Fetch the chapter page
        resp = requests.get(abs_url, headers=headers)
        resp.raise_for_status()
        chapter_content = resp.text

        # Extract the chapter body; it ends at the site's "请收藏本站" footer text.
        # re.S lets "." cross line breaks in case the body spans several lines.
        text_regex = r'<div\s+id="chaptercontent".*?>(.*?)请收藏本站'
        match = re.search(text_regex, chapter_content, re.S)
        if not match:
            continue  # skip chapters with no matching content

        # Clean the body (strip <br> tags and all whitespace)
        text = match.group(1)
        text = re.sub(r"<br\s*/?>|\s+", "", text)

        # Write to the file
        f.write(title + "\n")
        f.write(text + "\n")
        print("Written successfully")
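
The two patterns can also be tried without any network access. The sketch below runs them against HTML fragments that were made up to resemble the pages described above (they are assumptions, not copies of the real site), and passes `re.S` so that `.` can cross the line break inside the chapter body.

```python
import re

# Hypothetical fragments, invented for this example.
index_html = (
    '<dd><a href ="/book/1/1.html">Chapter 1</a></dd>\n'
    '<dd><a href ="/book/1/2.html">Chapter 2</a></dd>'
)
chapter_html = '<div id="chaptercontent" class="c">第一行<br />\n  第二行<br/>请收藏本站</div>'

chapter_regex = r'<dd><a\s+href\s*="(.*?)">(.*?)</a></dd>'
text_regex = r'<div\s+id="chaptercontent".*?>(.*?)请收藏本站'

chapters = re.findall(chapter_regex, index_html)

# Without re.S the body cannot be matched, because it contains a newline.
without_dotall = re.search(text_regex, chapter_html)
with_dotall = re.search(text_regex, chapter_html, re.S)

# Same cleaning step as above: drop <br> tags and all whitespace.
body = re.sub(r"<br\s*/?>|\s+", "", with_dotall.group(1))

print(chapters)
print(without_dotall, body)
```

Note that the cleaning step removes every whitespace character, which suits Chinese text but would also delete the spaces between words in English text.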

I. Exercises: Regular Expressions with Web Scraping

1. Scrape all chapters of a 笔趣阁 novel

Task: using regular expressions together with the requests module, scrape all chapters of a 笔趣阁 novel (example URL: http://www.b5200.org/8_8187/) and extract each chapter's link and title.

import requests
import re

def get_chapter(url):
    """Return a list of (link, title) tuples from the table-of-contents page."""
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
        "cookie": "ckAC = 1;Hm_lvt_15cfcc3e15bd556d32e5aedcadd5a38b = 1754045545;Hm_lpvt_15cfcc3e15bd556d32e5aedcadd5a38b = 1754045545;HMACCOUNT = FDAFB0724C56B8F8",
        "referer": "https://www.bqgui.cc/"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # stop here if the request failed
    text = response.text
    regex = r'<dd><a\s+href\s*="(.*?)">(.*?)</a></dd>'
    content = re.findall(regex, text)
    return content

# Test code
# if __name__ == '__main__':
#     url = "http://www.b5200.org/8_8187/"
#     get_chapter(url)

2. Scrape the body text of a 笔趣阁 chapter

Task: using regular expressions together with the requests module, scrape the body text of a 笔趣阁 chapter (example URL: https://www.qu05.cc/html/42900/1.html), then extract and clean the content and write it to a file.

import requests
import re

def get_content(url: str):
    # Strip the last path segment to get the table-of-contents URL (removesuffix needs Python 3.9+)
    last = url.split("/")[-1]
    url = url.removesuffix(f"/{last}")
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
        "cookie": "ckAC = 1;Hm_lvt_15cfcc3e15bd556d32e5aedcadd5a38b = 1754045545;Hm_lpvt_15cfcc3e15bd556d32e5aedcadd5a38b = 1754045545;HMACCOUNT = FDAFB0724C56B8F8"
    }
    # get_chapter is the function defined in exercise 1
    chapters = get_chapter(url)
    with open("book.txt", "wt", encoding="utf-8") as f:
        for href, title in chapters:
            abs_url = f"https://www.31e216f6f.cfd{href}"
            print(f"Fetching {title}, URL: {abs_url}")
            r = requests.get(abs_url, headers=headers)
            r.raise_for_status()
            html = r.text

            # re.S lets "." cross line breaks in case the body spans several lines
            regex = r'<div\s+id="chaptercontent".*?>(.*?)请收藏本站'
            match = re.search(regex, html, re.S)

            if match is None:
                print(f"Chapter {title}: content did not match")
                continue
            text = match.group(1)
            # Data cleaning: strip <br> tags and all whitespace
            regex = r"<br\s*/?>|\s+"
            text = re.sub(regex, "", text)
            # Write to the file
            f.write(title)
            f.write("\n")
            f.write(text)
            f.write("\n")
            print("Written successfully")

# Test code
# if __name__ == '__main__':
#     url = "https://www.31e216f6f.cfd/book/66389/1.html"
#     get_content(url)

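II. Exercises: Regex Matching and Text Processing

3. Match all punctuation marks

Task: write a regular expression that matches all punctuation marks, including the Chinese comma, English comma, Chinese full stop, English full stop, colon, Chinese enumeration comma, Chinese semicolon, and Chinese and English exclamation marks.

import re

text = "Hello! 这是一个测试，包含中文逗号、句号，和英文标点符号。 @#&*"
# Anything that is neither a word character nor whitespace; this also catches symbols such as @ # & *
regex = r"[^\w\s]"
matches = re.findall(regex, text)
print(matches)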

4. Replace placeholders in an SQL statement

Task: turn the string insert into tb_user(name, sex, age) values ( #{name} , #{sex} , #{age} ) into insert into tb_user(name, sex, age) values ( %(name)s , %(sex)s , %(age)s ).

import re

string = r"insert into tb_user(name, sex, age) values ( #{name} , #{sex} , #{age} )"
# Escape the braces so they are clearly literal; group 1 captures the placeholder name
reg = r"#\{(.*?)\}"
new_string = re.sub(reg, r"%(\1)s", string)
print(new_string)

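5. Validate an email address

Task: write a regular expression that validates email addresses under these rules:

It contains one @ symbol.
The part before @ is the account (letters, digits, and underscores, 4 to 15 characters long).
The part after @ is the domain (in the form xyz.abc or xyz.abc.mn, each component made of letters and digits).

import re

regex = r"^[a-zA-Z0-9_]{4,15}@[a-zA-Z0-9]+\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)?$"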

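6. Match names with a repeated character

Task: write a regular expression that matches three-character names whose last two characters are the same (for example "曹莉莉" or "王丹丹").

import re

# Group 1 captures the second character; \1 requires the third to be identical
regex = r"[\u4e00-\u9fa5]([\u4e00-\u9fa5])\1"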

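7. Email validation function

Task: write a function check_email(email) that validates an email address under these rules:

The account consists of letters, digits, and underscores.
It contains one @ symbol.
The part after @ has the form xy.ab or xy.ab.zz (each component made of letters, digits, and underscores).

import re

def check_email(email):
    # The 4-15 length limit on the account is carried over from exercise 5.
    # Domain components allow underscores, as the rules above state.
    regex = r"[a-zA-Z0-9_]{4,15}@[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+(\.[a-zA-Z0-9_]+)?"
    # fullmatch checks the whole string, so no ^ or $ is needed
    match = re.fullmatch(regex, email)
    return bool(match)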

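8. Email masking function

Task: write a function email_secure(email) that masks the account part of an email address (for example huokundian@qikux.com → h********@qikux.com). The address must be validated first (check_email can be called for this).

import re

def email_secure(email):
    # check_email is the function defined in exercise 7
    if check_email(email):
        # Keep the first character and replace the rest of the account with 8 asterisks
        regex = r'^(\w)([a-zA-Z0-9_]{3,14})@'
        new_str = re.sub(regex, r'\1********@', email)
        return new_str
    # Invalid addresses return False
    return False

# Test code
# if __name__ == '__main__':
#     emails = [
#         "huokundian@qikux.com",
#         "test_email123@xyz.abc",
#         "my_email_1@abc.xyz.zz",
#         "invalid_email@xyz"
#     ]
#     for email in emails:
#         print(email_secure(email))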

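9. Parse a chapter tag

Task: write a function parse_dd(string) that uses a regular expression to parse strings in the format <dd><a href ="/id/36662/5941.html">第5941章诡异的微笑</a></dd> and extract the hyperlink address and the chapter name.

import re

def parse_dd(string):
    # Returns a list of (link, chapter name) tuples found in the string
    regex = r'<dd><a\s+href\s*="(.*?)">(.*?)</a></dd>'
    return re.findall(regex, string)

# Test code
# text = '<dd><a href ="/id/36662/5941.html">第5941章 诡异的微笑</a></dd>'
# print(parse_dd(text))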

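10. Remove repeated words from a string

Task: write a function unique_str(string) that removes all immediately repeated words and returns the string without them (for example "hello hello world" → "hello world").

import re

def unique_str(string):
    # \b keeps the match on whole words; without it, the "is" inside "this"
    # would be treated as the start of "is is" in "this is is"
    regex = r'\b(\w+)(?: \1\b)+'
    result = re.sub(regex, r'\1', string)
    return result

# Test code
# test_strings = [
#     "hello hello world world",
#     "hello world",
#     "this is is a test test",
#     "no no no more more repeats"
# ]
# for s in test_strings:
#     print(f"Original: {s} -> Deduplicated: {unique_str(s)}")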

11. Extract mobile numbers and their segments

Task: write a function that extracts every valid mobile number from a string, together with its first 3 digits and middle 4 digits (for example asb13345672234fgfg156345245778 → [('13345672234', '133', '4567'), ('15634524577', '156', '3452')]).

import re

def get_phone_number(number):
    # Group 1: first 3 digits; group 2: middle 4 digits; group 3: last 4 digits
    regex = r'(1[3-9]\d)(\d{4})(\d{4})'
    result = []
    for first_three, middle_four, last_four in re.findall(regex, number):
        full_phone = first_three + middle_four + last_four
        result.append((full_phone, first_three, middle_four))
    return result

# Test code
# test_string = "asb13345672234fgfg156345245778dshh"
# print(get_phone_number(test_string))

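12. Extract all mobile numbers

Task: write a function that extracts every valid mobile number from a string (mainland China rule: 11 digits, starting with 1, second digit 3-9, followed by 9 digits).

import re

def is_phone_number(string):
    # Despite its name, this returns the list of all numbers found
    regex = r'1[3-9]\d{9}'
    return re.findall(regex, string)

# Test code
# test_string = "asb13345672234fgfg156345245778dshh"
# print(is_phone_number(test_string))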

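13. Match names with the surname 张

Task: write a regular expression that matches people surnamed 张 whose full name is 2 or 3 characters long; the captured group excludes the surname itself.

import re

# Group 1 is the given name: 1 or 2 Chinese characters after the surname
regex = r'张([\u4e00-\u9fa5]{1,2})'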

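14. Match landline numbers

Task: write a regular expression that matches landline numbers under these rules: the area code starts with 0 (3-4 digits in total), the number has 7-8 digits, and a hyphen separates the two.

import re

regex = r'0\d{2,3}-\d{7,8}'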

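15. Extract named groups from an ID card number

Task: write a regular expression that matches an ID card number, naming the region code (first 6 digits) location, the date of birth (8 digits) birth, and the sex digit (17th digit) sex, then use the re module to extract each part.

import re

# 6 (location) + 8 (birth) + 2 (sequence) + 1 (sex) + 1 (check digit, possibly X) = 18 characters
regex = r'(?P<location>\d{6})(?P<birth>\d{8})\d{2}(?P<sex>\d)[\dX]'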
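
The task also asks for the parts to be extracted with the re module. One possible sketch uses `re.fullmatch` and `groupdict()`; the helper name `parse_id` is hypothetical, and the sample number is fabricated for illustration and is not a real ID.

```python
import re

ID_REGEX = r"(?P<location>\d{6})(?P<birth>\d{8})\d{2}(?P<sex>\d)[\dX]"

def parse_id(id_number: str):
    """Return the named parts as a dict, or None if the format does not match."""
    match = re.fullmatch(ID_REGEX, id_number)
    if match is None:
        return None
    return match.groupdict()

sample = "110101199001011234"  # fabricated 18-digit sample
parts = parse_id(sample)
print(parts)
print(parse_id("12345"))
```

Only the format is checked here; the sketch does not verify that the birth date is a real date or that the check digit is correct.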

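16. Validate a username

Task: write a regular expression that validates a username (6-20 characters, made of letters, digits, and underscores).

import re

# An explicit ASCII class is used because \w would also accept Chinese characters in Python 3;
# the anchors make the pattern cover the whole string
regex = r'^[A-Za-z0-9_]{6,20}$'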

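17. Match IP addresses

Task: write a regular expression that matches IP addresses (four numbers from 0 to 255, separated by dots).

import re

# Non-capturing groups keep re.findall returning whole addresses;
# \b stops a match from starting or ending in the middle of a longer number
regex = r'\b(?:(?:1\d{2}|2[0-4]\d|25[0-5]|[1-9]?\d)\.){3}(?:1\d{2}|2[0-4]\d|25[0-5]|[1-9]?\d)\b'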
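
When the goal is to validate a single string rather than to find addresses inside longer text, the octet pattern can be applied with `re.fullmatch`. A minimal sketch; `is_ipv4` is a hypothetical helper name.

```python
import re

OCTET = r"(?:1\d{2}|2[0-4]\d|25[0-5]|[1-9]?\d)"
IPV4 = rf"(?:{OCTET}\.){{3}}{OCTET}"

def is_ipv4(s: str) -> bool:
    """True if s consists of four dot-separated numbers, each from 0 to 255."""
    return re.fullmatch(IPV4, s) is not None

for candidate in ["192.168.1.255", "0.0.0.0", "256.1.1.1", "1.2.3", "01.2.3.4"]:
    print(candidate, is_ipv4(candidate))
```

With this octet pattern, leading zeros such as "01" are rejected, because `[1-9]?\d` allows a leading digit only from 1 to 9.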

III. Exercises: Class Definitions

18. File-writing classes

Task:

Define an abstract class OutFileDescriptor with the abstract methods write, flush, and close.
Define a class FileWriter that inherits from OutFileDescriptor and writes character (text) files. It implements write, newline, flush, and close, and has the private attributes path, encoding, and append.
Define a class FileOutputStream that inherits from OutFileDescriptor and writes byte (binary) files. It implements write, flush, and close, and has the private attributes path and append.

from abc import ABC, abstractmethod

class OutFileDescriptor(ABC):
    @abstractmethod
    def write(self, data, start=0, length=None): pass

    @abstractmethod
    def flush(self): pass

    @abstractmethod
    def close(self): pass

class FileWriter(OutFileDescriptor):
    """Writes text to a file."""
    def __init__(self, path, *, encoding="UTF-8", append=False):
        self.__path = path
        self.__encoding = encoding
        self.__append = append
        mode = 'at' if append else 'wt'
        # Pass the encoding explicitly instead of relying on the platform default
        self.__file = open(self.__path, mode, encoding=self.__encoding)

    def write(self, data: str, start=0, length=None):
        # Write data[start:] or, if length is given, data[start:start + length]
        if length is None:
            self.__file.write(data[start:])
        else:
            self.__file.write(data[start:start + length])

    def newline(self):
        self.__file.write('\n')

    def flush(self):
        self.__file.flush()

    def close(self):
        self.__file.close()

class FileOutputStream(OutFileDescriptor):
    """Writes bytes to a file."""
    def __init__(self, path, *, append=False):
        self.__path = path
        self.__append = append
        # Binary mode: open() does not accept an encoding argument here
        mode = 'ab' if append else 'wb'
        self.__file = open(self.__path, mode)

    def write(self, data: bytes, start=0, length=None):
        if length is None:
            self.__file.write(data[start:])
        else:
            self.__file.write(data[start:start + length])

    def flush(self):
        self.__file.flush()

    def close(self):
        self.__file.close()

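IV. Exercises: Dates and Times

19. Compute the date of Easter

Task: write a function get_easter(year) that returns the date of Easter for the given year (format yyyy-MM-dd), using these rules:

Spring equinox: [Y×D+C]-L (for the 21st century C=20.646 and D=0.2422; in the code below Y is the last two digits of the year and L is the number of leap years).
Easter is the first Sunday after the full moon (the 15th day of the lunar month) that follows the spring equinox. If the full moon itself falls on a Saturday or Sunday, take the next Sunday after it.

import math
from datetime import datetime, timedelta

def get_easter(year):
    # Compute the date of the spring equinox
    y = year % 100
    C = 20.646
    D = 0.2422
    leap_years = y // 4
    spring_day = math.floor(y * D + C) - leap_years
    # Clamp the result to March 19-21
    if spring_day < 19:
        spring_day = 19
    elif spring_day > 21:
        spring_day = 21
    spring_equinox = datetime(year, 3, spring_day)

    # Find the full-moon day (15th day of the lunar month) on or after the equinox
    diff = year - 1977
    Q = diff // 4
    R = diff % 4
    moon_full_date = None
    current_date = spring_equinox
    for _ in range(30):
        day_of_year = current_date.timetuple().tm_yday
        # Estimated lunar day for current_date
        n = (14 * Q + 10.6 * (R + 1) + day_of_year) % 29.5306
        if 14.5 <= n <= 15.5:
            moon_full_date = current_date
            break
        current_date += timedelta(days=1)
    if moon_full_date is None:
        raise ValueError(f"Cannot compute the full-moon date for {year}")

    # Easter is the first Sunday after the full moon (weekday() returns 6 for Sunday)
    weekday = moon_full_date.weekday()
    days_to_sunday = 7 if weekday == 6 else 6 - weekday
    easter_date = moon_full_date + timedelta(days=days_to_sunday)
    return easter_date.strftime("%Y-%m-%d")

# Test code
# if __name__ == "__main__":
#     test_years = [2020, 2021, 2022, 2023, 2024, 2025, 2092]
#     for year in test_years:
#         print(f"Easter in {year}: {get_easter(year)}")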
