Logging in to gushiwen.cn with pytesseract and Cookies (Python web scraping)

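The code below is a Python script that automates logging in to gushiwen.cn (古诗文网). It simulates a browser to complete the login flow and then verifies whether the login succeeded.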

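Main features

Simulated browser login
- Uses urllib to send HTTP requests the way a browser would
- Handles the ASP.NET VIEWSTATE mechanism automatically
Automatic captcha recognition
- Downloads the captcha image from the page
- Preprocesses the image with Pillow (grayscale conversion)
- Calls the pytesseract OCR engine to read the captcha text
- Cleans and formats the recognized text
Session persistence
- Manages session cookies automatically through CookieJar
- Keeps the login state with HTTPCookieProcessor
- Builds a single opener so that consecutive requests share the same session
Automatic form submission
- Fills in the login form (email, password, captcha)
- Submits the POST request that completes the login
Login verification
- Visits a test page after logging in
- Checks for the text "退出登录" ("Log out") to confirm the login state
- Prints the cookies held by the current session
Page content retrieval
- Parses the HTML page with lxml
- Extracts key information such as the page title
Exception handling
- Catches exceptions raised while requesting the test page, and skips the login when the captcha result is not 4 characters long
- Prints the error message
Debugging aids
- Prints key steps (captcha recognition result, login state, and so on)
- Prints the list of cookies obtained so far
Flow control
- Adds a short delay after submitting the login form
- Decides whether to continue with the login based on the captcha recognition result
Security
- Uses a standard User-Agent to imitate a browser
- Follows the standard HTTP request flow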

Initializing cookie handling:

cookiejar = CookieJar()
handler = HTTPCookieProcessor(cookiejar)
opener = build_opener(handler)

Create a CookieJar object to store cookies.
Create an HTTPCookieProcessor to handle cookies.
Build an opener that sends requests and handles cookies automatically.
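To see what the opener does with cookies, the following minimal sketch runs a throwaway local server instead of the real site. The `DemoHandler` class, the `/login` and `/profile` paths, and the `session=abc123` cookie are all hypothetical and exist only for this illustration. `ProxyHandler({})` is added so that the local request does not go through a system proxy.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: /login sets a cookie, other paths echo it back."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "session=abc123; Path=/")
            body = b"cookie set"
        else:
            # Echo whatever Cookie header the client sent
            body = (self.headers.get("Cookie") or "no cookie").encode("utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def cookie_round_trip():
    """Return (cookie names stored in the jar, Cookie header seen by the server)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        cookiejar = CookieJar()
        opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(cookiejar))
        opener.open(base + "/login").read()  # the response sets the cookie
        echoed = opener.open(base + "/profile").read().decode("utf-8")
        return [cookie.name for cookie in cookiejar], echoed
    finally:
        server.shutdown()
        server.server_close()


names, echoed = cookie_round_trip()
print(names, echoed)
```

The second request never sets a Cookie header by hand; the opener attaches it because both requests share the same CookieJar. This is the same mechanism the login script relies on.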

Fetching the login page and extracting the required parameters

req = Request(login_url, headers=header)
resp = opener.open(req)
html = etree.HTML(resp.read().decode('utf-8'))

viewstate = html.xpath("//input[@id='__VIEWSTATE']/@value")[0]
viewstategenerator = html.xpath("//input[@id='__VIEWSTATEGENERATOR']/@value")[0]

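Send a GET request to fetch the login page.
Parse the HTML and extract the ASP.NET-specific __VIEWSTATE and __VIEWSTATEGENERATOR parameters.
These parameters are hidden fields that the ASP.NET form requires.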
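Indexing the XPath result with `[0]` raises a bare IndexError if the page layout changes. As an alternative sketch, the helper below collects hidden fields with the standard library's `html.parser` (not lxml, which the script itself uses) and reports which field is missing. The names `HiddenFieldParser` and `extract_hidden_fields` and the sample markup are hypothetical. The helper also assumes each hidden input carries a `name` attribute equal to its `id`, which is how ASP.NET normally renders these fields.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects the name/value pair of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html_text, required=("__VIEWSTATE", "__VIEWSTATEGENERATOR")):
    """Return all hidden fields; raise ValueError if a required one is absent."""
    parser = HiddenFieldParser()
    parser.feed(html_text)
    parser.close()
    missing = [name for name in required if name not in parser.fields]
    if missing:
        raise ValueError("missing hidden fields: " + ", ".join(missing))
    return parser.fields


# Hypothetical sample markup; the values on the real page are much longer.
sample = """
<form method="post" action="./login.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C93BE1AE" />
</form>
"""
fields = extract_hidden_fields(sample)
print(fields["__VIEWSTATE"], fields["__VIEWSTATEGENERATOR"])
```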

Captcha handling

code_url = "https://www.gushiwen.cn" + html.xpath("//img[@id='imgCode']/@src")[0]
req_code = Request(code_url, headers=header)
resp_code = opener.open(req_code)

image_data = resp_code.read()
image = Image.open(io.BytesIO(image_data))

# Image preprocessing
image = image.convert('L')  # convert to grayscale

# Recognize the captcha
custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
textcode = pytesseract.image_to_string(image, config=custom_config)
textcode = textcode.strip().replace(' ', '')[:4]

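Get the captcha image URL and download the image.
Process the captcha image directly in memory.
Convert the image to grayscale, which can improve the recognition rate.
Recognize the captcha with pytesseract, using configuration options tuned for a single line of digits and uppercase letters.
Clean up the recognized text and keep only the first 4 characters.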
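OCR output often contains newlines or stray symbols, which `strip()` and `replace(' ', '')` do not fully remove. The sketch below is a stricter variant of the cleanup step: it keeps only characters from the same whitelist and rejects any result that is not exactly 4 characters long, where the script truncates to 4. The helper name `clean_captcha_text` is hypothetical.

```python
import string

# Mirrors the tessedit_char_whitelist used in the script: digits and uppercase letters
ALLOWED = set(string.digits + string.ascii_uppercase)


def clean_captcha_text(raw, length=4):
    """Return the cleaned text, or None if it is not exactly `length` characters."""
    text = "".join(ch for ch in raw if ch in ALLOWED)
    return text if len(text) == length else None


print(clean_captcha_text("A1 B2\n\x0c"))  # whitespace and form feed are dropped
print(clean_captcha_text("A1B"))          # too short, treated as a failed read
```

Treating an over-long result as a failure is a design choice: extra characters usually mean the OCR misread noise as text, so truncating would likely submit a wrong code.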

Building and submitting the login request

data = {
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    'email': 'your_email@example.com',  # replace with your own account email
    'pwd': 'your_password',             # replace with your own password
    'code': textcode,
    'denglu': '登录'
}

login_req = Request(
    login_url,
    data=urlencode(data).encode('utf-8'),
    headers=header
)
login_resp = opener.open(login_req)

Build the POST data with all required parameters.
Encode the data with urlencode.
Send the login request.
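The sketch below shows what the encoded body looks like and why the request becomes a POST: urllib switches the method to POST whenever `data` is supplied. The function `build_login_request` and the sample values are hypothetical; only the form field names come from the script.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_url, viewstate, viewstategenerator,
                        email, password, code, headers=None):
    """Build (but do not send) the login POST request."""
    data = {
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': viewstategenerator,
        'email': email,
        'pwd': password,
        'code': code,
        'denglu': '登录',  # value of the submit button; must stay in Chinese
    }
    body = urlencode(data).encode('utf-8')  # percent-encodes non-ASCII and reserved characters
    return Request(login_url, data=body, headers=headers or {})


# Placeholder values for illustration only
req = build_login_request(
    "https://www.gushiwen.cn/user/login.aspx",
    "abc123", "C93BE1AE", "user@example.com", "secret", "A1B2",
)
print(req.get_method())
print(req.data.decode("ascii"))
```

Passing the credentials in as arguments keeps them out of the source file; they could be read from environment variables or a prompt instead.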

Full code:

from lxml import etree
from PIL import Image
import pytesseract
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener, Request
from urllib.parse import urlencode
import io
import time

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
}
login_url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
test_url = "https://www.gushiwen.cn/user/collect.aspx"

# 1. Create the cookie handler
cookiejar = CookieJar()
handler = HTTPCookieProcessor(cookiejar)
opener = build_opener(handler)

# 2. Fetch the login page and the required parameters
req = Request(login_url, headers=header)
resp = opener.open(req)
html = etree.HTML(resp.read().decode('utf-8'))

viewstate = html.xpath("//input[@id='__VIEWSTATE']/@value")[0]
viewstategenerator = html.xpath("//input[@id='__VIEWSTATEGENERATOR']/@value")[0]

# 3. Download the captcha and recognize it with pytesseract
code_url = "https://www.gushiwen.cn" + html.xpath("//img[@id='imgCode']/@src")[0]
req_code = Request(code_url, headers=header)
resp_code = opener.open(req_code)

# Read the captcha image straight into memory for processing
image_data = resp_code.read()
image = Image.open(io.BytesIO(image_data))

# Image preprocessing (tune for the actual captcha)
image = image.convert('L')  # convert to grayscale
# image = image.point(lambda x: 0 if x < 128 else 255, '1')  # binarize

# Recognize the captcha
custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
textcode = pytesseract.image_to_string(image, config=custom_config)
textcode = textcode.strip().replace(' ', '')[:4]  # clean the result and keep the first 4 characters
print("Captcha recognized by pytesseract:", textcode)

# 4. Build the login data and submit it
if len(textcode) == 4:
    data = {
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': viewstategenerator,
        'email': 'your_email@example.com',  # replace with your own account email
        'pwd': 'your_password',             # replace with your own password
        'code': textcode,
        'denglu': '登录'  # value of the submit button; keep it in Chinese
    }

    # Submit the login request with urllib
    login_req = Request(
        login_url,
        data=urlencode(data).encode('utf-8'),
        headers=header
    )
    login_resp = opener.open(login_req)
    login_html = login_resp.read().decode('utf-8')

    # Short delay before checking the login state
    time.sleep(2)

    # Check whether the login succeeded
    test_req = Request(test_url, headers=header)
    try:
        test_resp = opener.open(test_req)
        test_html = test_resp.read().decode('utf-8')

        # "退出登录" ("Log out") only appears on the page when logged in
        if "退出登录" in test_html:
            print("Login succeeded!")
            # Print the cookies currently stored
            print("Current cookies:", [cookie.name for cookie in cookiejar])

            # Example: read content from the page after login
            print("Title of the favorites page:")
            doc = etree.HTML(test_html)
            title = doc.xpath('//title/text()')[0]
            print(title)
        else:
            print("Login failed, please check the account or the captcha!")
    except Exception as e:
        print("Error while requesting the test page:", str(e))
else:
    print("Captcha recognition failed")

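Run results:

Most of the time the captcha is not recognized correctly and only occasionally does the login succeed, so a captcha-solving service such as Chaojiying (超级鹰) is the better choice for recognition.

That said, it does succeed sometimes, so simply try a few more times.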
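Since recognition fails more often than it succeeds, retrying by hand can be replaced with a small loop. The following is a minimal sketch, assuming the whole login flow is wrapped in a function that returns True on success. The names `login_with_retries` and `attempt_login` are hypothetical, and the demo uses a fake attempt function instead of the real site.

```python
import time


def login_with_retries(attempt_login, max_attempts=5, delay_seconds=1.0):
    """Call attempt_login() until it returns True.

    Returns the number of the successful attempt, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_login():
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before requesting a fresh captcha
    return None


# Fake attempt function for the demo: fails twice, then succeeds
outcomes = iter([False, False, True])
result = login_with_retries(lambda: next(outcomes), max_attempts=5, delay_seconds=0)
print("succeeded on attempt", result)
```

Each real attempt should fetch the login page again, because a new captcha image and possibly a new __VIEWSTATE value are issued on every load. Keep `max_attempts` small and the delay non-zero to avoid hammering the site.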
