Tonight a novel site really put me through it: first a 5-second shield, then irregular page-number parameters in the chapter URLs.
Direct request
First, naturally, try a direct request.
The code:
import requests

url = "https://beqege.cc/2/21.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.text)
Let's take a look at the response.
I printed it here; normally a blocked request would come back empty, as None, or with a 403.
Instead, the return value is a wall of text we can't read, topped off with a "Just a moment..." message. A novel site with anti-scraping and a 5-second shield?
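A quick way to confirm the challenge programmatically (a rough sketch; the exact marker text can vary between Cloudflare versions):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
}
resp = requests.get("https://beqege.cc/2/21.html", headers=headers)
# Cloudflare's interstitial usually titles the page "Just a moment..."
blocked = resp.status_code in (403, 503) or "Just a moment" in resp.text
print("challenged by Cloudflare" if blocked else "got real content")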
No way around it: a plain request is a dead end. Against the 5-second shield, swapping headers or cookies won't help; the practical fix is a library that can pass the shield.
Details below.
Passing the 5-second shield
The code:
from curl_cffi import requests as cffi_requests

res = cffi_requests.get("https://www.beqege.cc/2/21.html", impersonate='chrome110', timeout=10, verify=False)
print("============cffi_requests approach", res.status_code, res.cookies, res.text)
A quick note: this library verifies TLS certificates when making requests; since this is just for personal use, simply disabling verification with verify=False is fine!
And there we go, data comes back normally. Now we just need to process it.
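One optional refinement (a small sketch of my own, not from the original post): reuse a single curl_cffi Session so cookies, including any Cloudflare clearance cookie, carry over between requests instead of renegotiating every time.

from curl_cffi import requests as cffi_requests

# One session for the whole crawl; the impersonation target is set once here.
session = cffi_requests.Session(impersonate='chrome110')
res = session.get("https://www.beqege.cc/2/21.html", timeout=10, verify=False)
print(res.status_code)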
Wait, no... case cracked: this book's chapter pages don't just count up one by one; every so often the ID takes a big jump!
First I need to find the pattern behind those jumps.
No wonder my earlier in-order requests kept going wrong!
After some digging, I found the chapter links follow this pattern:
#21-29 #210-299 #2100-2999 #21000-22455
In other words, the page ID is just the book ID 2 with the chapter number appended: chapter 1 is 21.html, chapter 10 is 210.html, chapter 1000 is 21000.html. Adding the ranges up gives 9 + 90 + 900 + 1456 = 2455; throw in the bonus and extra chapters and it matches exactly!
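The pattern is easy to turn into a URL builder (a small sketch, assuming the book-ID-plus-chapter-number scheme also holds for other books):

def chapter_url(book_id: int, chapter: int) -> str:
    # Page ID = book ID with the chapter number appended:
    # chapter 1 -> 21.html, chapter 10 -> 210.html, chapter 1000 -> 21000.html
    return f"https://www.beqege.cc/{book_id}/{book_id}{chapter}.html"

print(chapter_url(2, 1))     # https://www.beqege.cc/2/21.html
print(chapter_url(2, 2455))  # https://www.beqege.cc/2/22455.html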
The code (full novel):
from bs4 import BeautifulSoup
from curl_cffi import requests as cffi_requests
import os

# Chapter page ID ranges:
# 21-29
# 210-299
# 2100-2999
# 21000-22455
data = [
    (21, 29),
    (210, 299),
    (2100, 2999),
    (21000, 22455)
]
for start, end in data:
    for i in range(start, end + 1):
        res = cffi_requests.get(f"https://www.beqege.cc/2/{i}.html", impersonate='chrome110', timeout=10, verify=False)
        soup = BeautifulSoup(res.text, "lxml")
        content = soup.find("div", id="content").text    # chapter body
        head = soup.find("div", class_="bookname").text  # chapter title
        head = head.replace("\n", "")  # strip newlines so the filename is valid
        folder_path = "D:/小说/凡人修仙转"
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)
        with open(f"{folder_path}/{head}.txt", "w", encoding="utf-8") as f:
            f.write(content)
It works properly now!~
The code also strips the newlines out of the chapter title, so the folder and files get created with valid names.
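If a chapter title ever contains characters Windows forbids in filenames, replace("\n", "") alone won't cut it; a slightly more thorough sanitizer (my own addition, not in the original code) could look like this:

import re

def safe_filename(name: str) -> str:
    # Remove characters Windows rejects in filenames, plus stray whitespace.
    return re.sub(r'[\\/:*?"<>|\r\n\t]', "", name).strip()

print(safe_filename('Chapter 1: "Intro"?\n'))  # Chapter 1 Intro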
If you want the crawl to go faster, you can use async or multithreading; below I use multithreading (an async sketch appears at the end of the post).
Multithreaded (full novel)
from bs4 import BeautifulSoup
from curl_cffi import requests as cffi_requests
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Chapter page ID ranges (start and end inclusive)
CHAPTER_RANGES = [
    (21, 29),
    (210, 299),
    # (2100, 2999),
    # (21000, 22455)
]

def download_page(i, retries=3):
    """Download a single page and save it."""
    folder_path = "D:/小说/凡人修仙转"
    url = f"https://www.beqege.cc/2/{i}.html"
    # Random delay (0.5-3 seconds)
    time.sleep(random.uniform(0.5, 3))
    for attempt in range(retries):
        try:
            # Pick a random browser fingerprint
            browsers = ['chrome110', 'chrome107', 'edge101']
            res = cffi_requests.get(
                url,
                impersonate=random.choice(browsers),
                timeout=10,
                verify=False
            )
            # Detect the Cloudflare challenge page
            if "Checking your browser before accessing" in res.text:
                raise Exception("Hit the Cloudflare challenge")
            if res.status_code != 200:
                raise Exception(f"Unexpected HTTP status: {res.status_code}")
            soup = BeautifulSoup(res.text, "lxml")
            content_div = soup.find("div", id="content")
            bookname_div = soup.find("div", class_="bookname")
            if not content_div or not bookname_div:
                raise Exception("Key elements not found")
            title = content_div.get_text(strip=True)
            head = bookname_div.get_text(strip=True).replace("\n", "")
            # Save to file
            os.makedirs(folder_path, exist_ok=True)  # create the directory automatically
            with open(f"{folder_path}/{head}.txt", "w", encoding="utf-8") as f:
                f.write(title)
            print(f"Page {i} downloaded")
            return True
        except Exception as e:
            print(f"Page {i} attempt {attempt + 1} failed: {str(e)}")
            if attempt < retries - 1:
                # Exponential backoff plus random jitter
                wait_time = 2 ** attempt + random.uniform(0, 1)
                time.sleep(wait_time)
    print(f"Page {i} failed after {retries} retries")
    return False

def generate_page_numbers():
    """Yield every page number to crawl."""
    for (start, end) in CHAPTER_RANGES:
        yield from range(start, end + 1)  # end inclusive

def main():
    # Thread pool size (4-8 threads is a reasonable range)
    max_workers = 6
    total_pages = sum(end - start + 1 for (start, end) in CHAPTER_RANGES)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit a download task for every page number
        futures = {
            executor.submit(download_page, page): page
            for page in generate_page_numbers()
        }
        # Progress tracking
        completed = 0
        for future in as_completed(futures):
            completed += 1
            page = futures[future]
            try:
                future.result()
                status = "ok"
            except Exception as e:
                status = f"failed: {str(e)[:30]}"
            print(f"Progress: {completed}/{total_pages} | page {page} {status}")

if __name__ == "__main__":
    # Random start-up delay (1-5 seconds)
    time.sleep(random.uniform(1, 5))
    main()
- Randomly rotates between several browser fingerprints
- Random delays between requests
- Automatic retries of failed requests
- Logs both successes and failures
tr(e)[:30]}"
print(f"进度: {completed}/{total_pages} | 页码 {page} {status}")
if name == “main”:
# 随机初始化延迟(1-5秒)
time.sleep(random.uniform(1, 5))
main()
>1. 增加了多浏览器指纹随即切换
>2. 随机请求延迟
>3. 自动重试失败请求
>4. 显示成功和错误日志
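As mentioned above, async is the other option for speeding things up. Here is a minimal sketch using curl_cffi's AsyncSession; the fetch/bounded helpers and the concurrency cap of 6 are my own choices, not from the original code.

import asyncio
from curl_cffi.requests import AsyncSession

async def fetch(session, i):
    # Same impersonation and shield-passing settings as the threaded version.
    res = await session.get(
        f"https://www.beqege.cc/2/{i}.html",
        impersonate='chrome110',
        timeout=10,
        verify=False,
    )
    return i, res.status_code

async def main():
    sem = asyncio.Semaphore(6)  # cap concurrency so the site isn't hammered

    async def bounded(session, i):
        async with sem:
            return await fetch(session, i)

    async with AsyncSession() as session:
        results = await asyncio.gather(*(bounded(session, i) for i in range(21, 30)))
    for i, status in results:
        print(f"page {i}: HTTP {status}")

asyncio.run(main())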