Parser 模块：HTML 解析与数据清洗-EW帮帮网

前言：

在构建 boost 库站内搜索引擎时，Parser 模块是数据处理的核心环节，它的核心功能包括：

筛选出所有 .html 文件，为后续解析提供数据源；

从 .html 文件中提取 title、content ，并构造对应官网的 url 地址；

将清洗后的数据按自定义格式存储为文本文件，便于后续建立索引。

本文主要讲解编写 Parser 模块过程中涉及的部分知识。

1. EnumFile() 方法

EnumFile() 方法的功能是，递归遍历指定目录及其子目录，收集所有后缀为 .html 的普通文件路径，并将结果存储到传入的 std::vector<std::string> 中。

1.1 boost::filesystem 简介

boost::filesystem 是 Boost 库中的一个组件，用于跨平台处理文件系统路径、文件操作和目录管理。它提供了面向对象的接口，简化了文件和目录的创建、删除、遍历等操作。

使用：

包含头文件：#include <boost/filesystem>
链接 Boost 库：-lboost_system -lboost_filesystem

1.2 核心类及其使用

1.2.1 path

boost::filesystem::path 用于表示文件系统路径（如文件或目录的路径），提供了跨平台的路径操作能力。

在 Windows 平台，D:\Develop\笔记\Boost库搜索引擎开发 可以被正确解析。

在 Linux 平台，/home/ZhengTongren/boost-search/data/input 可以被正确解析。

path 可以通过 std::string、const char* 或其它 path 对象进行构造。

const std::string src_path = "data/input/";
boost::filesystem::path root_path(src_path);

相关方法：

filesystem::exists(path) ：检查路径是否存在
is_regular_file(path) ：判断是否为普通文件
path.extension() ：获取拓展名，带 . ，如 .html

1.2.2 recursive_directory_iterator

recursive adj. 递归的

boost::filesystem 中，directory_iterator 和 recursive_directory_iterator 的作用类似于一个 “指针”，用于遍历文件系统中的目录内容。

directory_iterator 提供了一种方式来迭代一个指定目录的内容，但它不会递归进入子目录。

recursive_directory_iterator 则提供了递归遍历目录的功能，它不仅会列出指定目录下的所有条目，还会进入每个子目录继续遍历，直到遍历完整个目录树。

boost::filesystem::recursive_directory_iterator it(root_path); 
// root_path 是一个 filesystem::path 类对象；
// 创建一个递归迭代器，指向 root_path 目录中的第一个条目（文件或子目录）；
// 如果 root_path 不存在或无法访问，则抛出 filesystem_error 异常

boost::filesystem::recursive_directory_iterator end;
// 默认构造的迭代器表示结束

for (; it != end; ++it) { }
// 通过循环逐个访问条目，直到等于 end

1.3 递归遍历、搜集所有 .html 文件路径

#include <iostream>
#include <string>
#include <vector>
#include <boost/filesystem>

const std::string src_path = "data/input/";

bool EnumFile(const std::string& src_path, std::vector<std::string>* files_list)
{
    namespace fs = boost::filesystem;
    // 1. 初始化 path
    fs::path root_path(src_path); 
    // 2. 检查路径是否存在 -- 不存在则 return false
    if (!fs::exists(root_path))
    {
        std::cerr << src_path << " not exits!" << std::endl;
        return false;
    }
    // 3. 递归遍历目录及其子目录文件
    fs::recursive_directory_iterator end;
    for (fs::recursive_directory_iterator it(root_path); it != end; ++it)
    {
        // 3.1 跳过目录文件
        if (!is_regular_file(*it)) continue;
        // 3.2 跳过非 .html 普通文件
        if (it->path().extension() != ".html") continue;
        // 3.3 插入
        files_list->push_back(it->path().c_str());
    }
    return true;
}

2. 文件流操作与 getline

2.1 C++ 文件流类

C++ 中的文件流操作通过 <fstream> 头文件提供的类（如 ifstream、ofstream、fstream）实现，支持对文件的读写操作。

ifstream 用于从文件中读取数据；ofstream 用于将数据写入文件；fstream 支持同时读写文件。

打开模式：

std::ios::in ：以输入模式打开文件
std::ios::out ：以输出模式打开文件（覆盖模式）
std::ios::app ：以追加模式打开文件
std::ios::binary ：以二进制模式打开文件

文件的读写模式必须保持一致，以默认模式写入则以默认模式读取，以二进制模式写入则以二进制模式读取。

对于二进制文件，数据直接按其原始字节序列被读取或写入，没有任何解释或转换，这对于处理图像、音频、视频等非文本数据尤为重要。

// 通用写入函数（二进制）
bool WriteFile(const std::string& path, const std::string& content)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out.is_open()) return false;
    
    out << content;
    return true;
}

// 通用读取函数（二进制）
bool ReadFile(const std::string& path, std::string* out)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in.is_open()) return false;
    
    std::string line;
    while (getline(in, line))
    {
        // ...
    }
    return true;
}

C++ 的文件流类（如 ifstream、ofstream）遵循 RAII 原则：

资源与对象生命周期绑定：文件流对象在构造时打开文件，在析构时自动调用 close() ；
无需手动干预：当流对象离开作用域（即生命周期结束），析构函数会确保文件正确关闭。

因此，常规情况下可以不显式调用 close() 方法。

在长时间运行的程序中，若需要提前关闭不再使用的文件、立即释放资源，可以显式调用 close() 。

2.2 getline 函数

getline 函数用于从输入流中按行读取字符串，它会读取直到遇到换行符 \n 为止，并将此换行符从输入流中提取掉（但 \n 不会包含在返回的字符串中）。

常见用法：

从 std::cin 中读取一行

std::string line;
getline(std::cin, line);

从文件流中逐行读取

std::ifstream in(path, std::ios::in);
if (!in.is_open()) return ;

std::string line;
while (getline(in, line))
{
    // ...
}
in.close();

getline() 支持自定义分隔符

istream& getline (istream& is, string& str);
istream& getline (istream& is, string& str, char delim); // 通过 getline 的第三个参数 delim

char delimiter = ',';
std::string line;
getline(std::cin, line, delimiter);

2.3 使用 getline vs stringstream

getline 逐行读取，适合处理大文件，或需要对每一行进行特定处理的情况。

stringstream 一次性读取整个文件到内存中，适合小文件处理场景；对于较大的文件，这种方法可能导致内存溢出。

Parser 模块：HTML 解析与数据清洗

前言：