Getting Started with BeautifulSoup - EW帮帮网

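BeautifulSoup is a Python library for parsing HTML and XML documents and extracting data from them. It handles the parsing step of a web-scraping workflow.

This post walks from installation through the core usage, with code examples for each feature, to help beginners get started quickly.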

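1. Installing BeautifulSoup

Using BeautifulSoup involves two pieces:

beautifulsoup4: provides the parsing and navigation API.
lxml or html.parser: the underlying parser for HTML documents. html.parser ships with Python's standard library and needs no installation; lxml is a separate package.

Install command: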

pip install beautifulsoup4 lxml
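
Since lxml is optional, a script may want to fall back to the built-in parser when lxml is missing. The following is a minimal sketch of that idea; the helper name make_soup is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)  # hello
```

The two parsers can build slightly different trees for malformed markup, so it is usually safer to pick one and use it consistently.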

2. Importing the Library and Parsing HTML

Once BeautifulSoup is imported, you can parse HTML from a local file or from a string of page content.

Sample HTML:

<html>
    <head>
        <title>示例页面</title>
    </head>
    <body>
        <h1>欢迎来到我的网站</h1>
        <p class="description">这是一个简单的介绍段落。</p>
        <a href="https://example.com">点击这里</a>
    </body>
</html>

Parsing code:

from bs4 import BeautifulSoup

html_content = """
<html>
    <head>
        <title>示例页面</title>
    </head>
    <body>
        <h1>欢迎来到我的网站</h1>
        <p class="description">这是一个简单的介绍段落。</p>
        <a href="https://example.com">点击这里</a>
    </body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')  # use the lxml parser

# Print the formatted HTML
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   示例页面
  </title>
 </head>
 <body>
  <h1>
   欢迎来到我的网站
  </h1>
  <p class="description">
   这是一个简单的介绍段落。
  </p>
  <a href="https://example.com">
   点击这里
  </a>
 </body>
</html>
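
The example above parses a string. For the local-file case, an open file object can be passed to BeautifulSoup directly. The sketch below writes a small temporary file first so that it is self-contained; the file name page.html and its contents are made up for illustration, and html.parser is used so that only bs4 is required.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small page to a temporary file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")

# Pass the open file object straight to BeautifulSoup
with open(path, encoding="utf-8") as f:
    file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.title.string)  # Demo
```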

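3. Common Objects and Basic Operations

3.1 Getting Tag Content

soup.title: returns the <title> tag.
soup.title.string: returns the text inside the tag. It is None when the tag has more than one child node.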

print(soup.title)          # <title>示例页面</title>
print(soup.title.string)   # 示例页面
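
One detail worth seeing in action: .string only works when a tag has a single child. For tags with mixed content, get_text() collects all the text underneath. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

print(soup.p.string)      # None, because <p> has more than one child
print(soup.p.get_text())  # Hello world!
print(soup.b.string)      # world
```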

3.2 Finding a Single Tag: `find()`

find() returns the first tag that matches the given criteria, or None if nothing matches.

h1_tag = soup.find('h1')
print(h1_tag)              # <h1>欢迎来到我的网站</h1>
print(h1_tag.string)       # 欢迎来到我的网站
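
Because find() returns None when there is no match, accessing .string on the result without a check raises AttributeError. A minimal sketch of guarding against that, using made-up markup and a placeholder fallback string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Title</h1>", "html.parser")

h2_tag = soup.find("h2")
print(h2_tag)  # None

# Check for None before touching attributes of the result
heading = h2_tag.string if h2_tag is not None else "(no h2 found)"
print(heading)  # (no h2 found)
```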

3.3 Finding All Tags: `find_all()`

find_all() returns a list (a ResultSet) of every matching tag.

# Find all <p> tags
p_tags = soup.find_all('p')
for p in p_tags:
    print(p.string)

Output:

这是一个简单的介绍段落。
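
find_all() also accepts a list of tag names and a limit argument. The sketch below uses made-up markup to show both:

```python
from bs4 import BeautifulSoup

html = "<h1>A</h1><p>one</p><p>two</p><p>three</p>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, in document order
both = soup.find_all(["h1", "p"])
print([t.name for t in both])  # ['h1', 'p', 'p', 'p']

# limit stops the search after the given number of matches
first_two = soup.find_all("p", limit=2)
print([p.string for p in first_two])  # ['one', 'two']
```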

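3.4 Finding Tags by Attribute

You can find specific tags by their attributes, such as class or id. Because class is a reserved word in Python, the keyword argument is spelled class_.

# Find the <p> tag whose class is description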
description = soup.find('p', class_='description')
print(description.string)  # 这是一个简单的介绍段落。
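
The sample page has no id or data-* attributes, so here is a separate sketch with made-up markup. It looks a tag up by id, then uses the attrs dictionary for an attribute name that is not a valid Python keyword argument:

```python
from bs4 import BeautifulSoup

html = '<div id="main"><p class="note" data-kind="tip">Use a parser.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Look up by id
main = soup.find(id="main")
print(main.name)  # div

# data-kind contains a hyphen, so it has to go through attrs
tip = soup.find("p", attrs={"data-kind": "tip"})
print(tip.string)  # Use a parser.
```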

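4. Getting Tag Attributes

.attrs returns all of a tag's attributes as a dictionary, and indexing the tag by attribute name returns a single value.

# Find the <a> tag and read its attributes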
a_tag = soup.find('a')
print(a_tag.attrs)         # {'href': 'https://example.com'}
print(a_tag['href'])       # https://example.com
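
Indexing with a_tag['href'] raises KeyError when the attribute is absent. When an attribute may be missing, get() is the safer choice, as this small sketch shows:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">link</a>', "html.parser")
a_tag = soup.a

print(a_tag.get("href"))          # https://example.com
print(a_tag.get("title"))         # None
print(a_tag.get("title", "n/a"))  # n/a
print(a_tag.has_attr("href"))     # True
```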

5. Using CSS Selectors: `select()`

BeautifulSoup supports CSS selector syntax through the select() method, which returns a list of all matching tags.

Example code:

# Find all tags whose class is description
description = soup.select('.description')
print(description[0].string)  # 这是一个简单的介绍段落。

# Find <a> tags that have an href attribute
link = soup.select('a[href]')
print(link[0]['href'])        # https://example.com
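
When only the first match is needed, select_one() avoids indexing into a list. It returns None when nothing matches. The sketch below uses made-up markup and also shows a child combinator:

```python
from bs4 import BeautifulSoup

html = """
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="body"><p class="description">Intro</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# First match only
first = soup.select_one("p.description")
print(first.string)  # Intro

# Direct <a> children of the element with id "nav"
hrefs = [a["href"] for a in soup.select("#nav > a")]
print(hrefs)  # ['/home', '/about']

print(soup.select_one("table"))  # None
```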

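6. Traversing the Document Tree

6.1 Child Nodes: `contents` and `children`

contents: returns the child nodes as a list.
children: returns an iterator over the child nodes.

Both include the whitespace text nodes between tags, not only the tags themselves.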

body_tag = soup.body
print(body_tag.contents)  # list of all child nodes

# Iterate over the child nodes
for child in body_tag.children:
    print(child)
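
In indented HTML, the child nodes include the newlines and spaces between tags. One way to keep only the element children is to filter on the Tag type, as in this sketch with made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n  <h1>Title</h1>\n  <p>Text</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Three whitespace strings plus two tags
print(len(soup.body.contents))  # 5

# Keep only the tag children
tags_only = [c for c in soup.body.children if isinstance(c, Tag)]
print([t.name for t in tags_only])  # ['h1', 'p']
```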

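6.2 Parent and Sibling Nodes

parent: returns the parent node.
next_sibling: returns the next sibling node.
previous_sibling: returns the previous sibling node.

p_tag = soup.find('p')

# Parent node
print(p_tag.parent.name)  # body

# Sibling nodes
# In the sample HTML the node right after <p> is the whitespace between tags,
# so next_sibling is a text node; find_next_sibling() skips to the next tag.
print(p_tag.next_sibling)         # prints only whitespace: the text node between <p> and <a>
print(p_tag.find_next_sibling())  # <a href="https://example.com">点击这里</a>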
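
The sketch below, on made-up markup, compares the sibling attributes with the find_next_sibling() and find_previous_sibling() methods, which skip text nodes and return tags only:

```python
from bs4 import BeautifulSoup

html = "<body>\n<h1>Title</h1>\n<p>Text</p>\n<a href='#'>Link</a>\n</body>"
soup = BeautifulSoup(html, "html.parser")
p_tag = soup.find("p")

# The immediate sibling is the newline between </p> and <a>
print(p_tag.next_sibling == "\n")          # True

# The find_* variants return the nearest sibling that is a tag
print(p_tag.find_next_sibling().name)      # a
print(p_tag.find_previous_sibling().name)  # h1
```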

7. Modifying HTML Content

You can modify a tag's content or attributes in place.

# Change the tag's content
soup.title.string = "新的标题"
print(soup.title)  # <title>新的标题</title>

# Change an attribute (a_tag is the <a> tag found in section 4)
a_tag['href'] = 'https://newlink.com'
print(a_tag)  # <a href="https://newlink.com">点击这里</a>
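
Beyond editing existing tags, the tree can also gain or lose nodes. As a minimal sketch on made-up markup, new_tag() creates an element, append() attaches it, and decompose() removes a tag along with its contents:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>old</p><span>ad</span></div>", "html.parser")

# Create a new <a> tag and attach it at the end of the <div>
new_link = soup.new_tag("a", href="https://example.com")
new_link.string = "more"
soup.div.append(new_link)

# Remove the <span> and everything inside it
soup.span.decompose()

print(soup)  # <div><p>old</p><a href="https://example.com">more</a></div>
```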

8. Saving the Parsed HTML

Write the modified HTML to a file.

# Save to a file
with open('output.html', 'w', encoding='utf-8') as file:
    file.write(soup.prettify())
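
prettify() inserts newlines and indentation, which is convenient for reading but changes the whitespace inside the document. If the saved file should stay close to the original markup, str(soup) is an alternative. A small comparison on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

compact = str(soup)       # markup without added whitespace
pretty = soup.prettify()  # one node per line, indented

print(compact)  # <p>Hello <b>world</b></p>
print(pretty)
```

Either string can be passed to file.write() as in the example above.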

Example Code Collection

Complete code example: a short, self-contained collection of the BeautifulSoup basics.

from bs4 import BeautifulSoup

html_content = "<html><body><h1>Test</h1><a href='https://example.com'>Link</a></body></html>"

soup = BeautifulSoup(html_content, 'lxml')

# Read the tag's text
print(soup.h1.string)

# Find the <a> tag and read its attribute
a_tag = soup.find('a')
print(a_tag['href'])

Output:

Test
https://example.com

I hope this post helps you pick up the basics of BeautifulSoup and parse HTML documents in Python with confidence! 🚀
