JD.com Product Scraper Walkthrough: Automated Data Collection with Selenium in Practice

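I. Code Overview

This code implements an automated scraper for JD.com product data. Its core features are passwordless login via saved cookies, handling of dynamically loaded pages, multi-page collection, and export to Excel. It is built on the Python ecosystem and relies mainly on the following stack:

Component	Role
Selenium	Browser automation
lxml	HTML parsing
pandas	Data storage and Excel export
Edge WebDriver	Browser driver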

II. Core Modules

1. Cookie Management

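import json
import os
import time

def is_exists_cookies():
    # web, jd_domain and jd_login_url are module-level globals defined elsewhere in the script
    cookie_file = './data/jd_cookies.txt'
    if os.path.exists(cookie_file):
        # Load the saved cookies; the domain must be open before add_cookie() is called
        web.get(jd_domain)
        with open(cookie_file, 'r') as file:
            cookies = json.load(file)
        for cookie in cookies:
            web.add_cookie(cookie)
        web.refresh()  # reload so the injected cookies take effect
    else:
        # First run: log in manually, then save the cookies
        web.get(jd_login_url)
        time.sleep(30)  # time window for the manual login
        os.makedirs(os.path.dirname(cookie_file), exist_ok=True)
        with open(cookie_file, 'w') as f:
            json.dump(web.get_cookies(), f)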

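Highlights:

os.path.exists checks for a local cookie file, so a repeat login is not needed
add_cookie() injects each saved cookie into the browser session
The login credentials are persisted in JSON format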
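
One gap in the cookie reuse described above is that a saved file can outlive the session it represents. The following is a minimal sketch, assuming the file holds the list of dicts returned by Selenium's get_cookies(), where a cookie may carry an optional expiry field in epoch seconds. The helper name load_valid_cookies is hypothetical and not part of the original script.

```python
import json
import time


def load_valid_cookies(cookie_file, now=None):
    """Return the cookies from cookie_file that have not expired yet.

    Assumes the file contains the JSON list produced by web.get_cookies().
    Session cookies (no "expiry" key) are kept as they are.
    """
    now = time.time() if now is None else now
    with open(cookie_file, "r", encoding="utf-8") as f:
        cookies = json.load(f)
    return [c for c in cookies if "expiry" not in c or c["expiry"] > now]
```

If the returned list is empty, the script could fall back to the manual login branch instead of injecting stale cookies.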

2. Controlling Dynamic Page Loading

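import time

def slide(web):
    height = 0
    # Total page height as reported by the browser
    new_height = web.execute_script("return document.body.scrollHeight")
    while height < new_height:
        # Scroll in 400-pixel steps to trigger lazy loading
        for i in range(height, new_height, 400):
            web.execute_script(f'window.scrollTo(0, {i})')
            time.sleep(0.5)
        height = new_height
        # Script elided in the source; presumably the same scrollHeight query as above
        new_height = web.execute_script(...)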

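How it works:

A JavaScript snippet reads the total page height
The page is scrolled in steps of 400 pixels to imitate a human reader
The loop re-checks the height until the bottom is reached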
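
The three steps above can be sketched as a complete loop. This version assumes that the script elided in the listing re-reads document.body.scrollHeight after each pass. The step and pause parameters and the return value are additions for illustration.

```python
import time


def slide(web, step=400, pause=0.5):
    """Scroll down in fixed steps until the page stops growing.

    web is any object with Selenium's execute_script() method.
    Assumption: the page height is re-read with the same script after each pass.
    """
    get_height = "return document.body.scrollHeight"
    height = 0
    new_height = web.execute_script(get_height)
    while height < new_height:
        for y in range(height, new_height, step):
            web.execute_script(f"window.scrollTo(0, {y})")
            time.sleep(pause)
        height = new_height
        # Lazy-loaded items may have extended the page; measure it again
        new_height = web.execute_script(get_height)
    return new_height
```

The loop ends once a full pass no longer increases the measured height, which is why the height must be measured again at the end of every pass.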

3. Parsing Product Data

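from lxml import etree

def get_product(web):
    # Parse the rendered page source with lxml
    et = etree.HTML(web.page_source)
    obj_list = et.xpath('//div[@class="gl-i-wrap"]')

    for item in obj_list:
        # Each [0] raises IndexError when the field is missing from the card
        title = ''.join(item.xpath('./div[@class="p-name"]//text()')).strip()
        price = item.xpath('./div[@class="p-price"]//i/text()')[0]
        shop = item.xpath('./div[@class="p-shop"]//a/text()')[0]
        sales = item.xpath('./div[@class="p-commit"]//text()')[0]
        img = item.xpath('./div[@class="p-img"]//img/@src')[0]
        # Assumed: module-level lists, named as in the storage step
        titless.append(title); prices.append(price); shop_names.append(shop)
        saleses.append(sales); urls.append(img)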

XPath locators:

Product card container: //div[@class="gl-i-wrap"]
Price field: ./div[@class="p-price"]//i/text()
Sales figure: ./div[@class="p-commit"]//text()
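
To see how card-relative paths behave, the following sketch runs the same locator pattern on a small hand-written fragment. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages. ElementTree supports only a subset of XPath (there is no text() step), so text is collected with itertext(). The sample markup and the text_of helper are assumptions for illustration, not JD's real page.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<div id="J_goodsList">
  <div class="gl-i-wrap">
    <div class="p-price"><strong><i>3999.00</i></strong></div>
    <div class="p-name"><a><em>Phone A 12GB+256GB</em></a></div>
    <div class="p-shop"><a>Shop A</a></div>
  </div>
  <div class="gl-i-wrap">
    <div class="p-name"><a><em>Phone B</em></a></div>
  </div>
</div>
"""


def text_of(card, path, default=""):
    """Return the stripped text under the first node matching path."""
    node = card.find(path)
    if node is None:
        return default
    return "".join(node.itertext()).strip()


def parse_cards(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # Paths starting with "./" are evaluated relative to each product card
    for card in root.findall(".//div[@class='gl-i-wrap']"):
        rows.append({
            "title": text_of(card, "./div[@class='p-name']"),
            "price": text_of(card, "./div[@class='p-price']//i", default="N/A"),
            "shop": text_of(card, "./div[@class='p-shop']//a", default="N/A"),
        })
    return rows


rows = parse_cards(SAMPLE)
```

The second card has no price or shop node, so those fields fall back to the default instead of raising an error.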

4. Multi-Page Crawling

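import time
from selenium.webdriver.common.by import By

def get_more(web, page):
    for i in range(page):
        # The 9th <a> inside the pager is taken to be the next-page link
        button = web.find_element(By.XPATH, '//*[@id="J_bottomPage"]//a[9]')
        # Click through JavaScript so overlays cannot intercept the click
        web.execute_script("arguments[0].click();", button)
        time.sleep(5)  # fixed wait for the next page to load
        get_product(web)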

Pagination:

Locate the pager button (the 9th a tag is the next-page link)
Trigger the click through execute_script
Wait a fixed 5 seconds for the page to load
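
A fixed link index and a fixed sleep are both fragile. The sketch below shows one possible shape for a more defensive loop: it looks the button up with find_elements(), which returns an empty list rather than raising when nothing matches, and stops early when no next-page link is found. The next_xpath default is the locator from the listing. The parse callback, the pause parameter, and the returned page count are assumptions for illustration.

```python
import time


def get_more(web, page, parse, next_xpath='//*[@id="J_bottomPage"]//a[9]', pause=5):
    """Click "next page" up to `page` times, calling parse(web) after each click.

    Returns the number of pages actually visited this way.
    """
    visited = 0
    for _ in range(page):
        # "xpath" is the string value of Selenium's By.XPATH constant
        buttons = web.find_elements("xpath", next_xpath)
        if not buttons:
            break  # no next-page link: last page reached or layout changed
        web.execute_script("arguments[0].click();", buttons[0])
        time.sleep(pause)  # an explicit wait would be more robust than a fixed pause
        parse(web)
        visited += 1
    return visited
```

Passing the parser in as a callback keeps the loop testable without a live browser.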

5. Data Storage

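import pandas as pd

data = {
    "标题": titless,     # title
    "价格": prices,      # price
    "店铺": shop_names,  # shop
    "销量": saleses,     # sales
    "图片": urls         # image URL
}
# Writing .xlsx requires an Excel engine such as openpyxl; the file name means "phone sales"
pd.DataFrame(data).to_excel('./data/手机销售.xlsx', index=False)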

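pandas notes:

The dict is converted directly to a DataFrame
index=False drops the default index column
Chinese column names are stored without extra handling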
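
Building a DataFrame from a dict of lists only works when every list has the same length; pandas raises a ValueError otherwise, which can happen here if one field is skipped for a product. A small check before the export might look like this, sketched without pandas; check_columns is a hypothetical helper name.

```python
def check_columns(data):
    """Raise ValueError if the column lists in `data` differ in length.

    Returns the common row count when all columns line up.
    """
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)
```

Calling check_columns(data) just before pd.DataFrame(data) reports which column is short, which is easier to act on than the generic pandas error.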

III. Suggested Improvements

1. Hardening Against Anti-Scraping Measures

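import random

# Suggested additional options; op is the Edge options object created before the driver,
# and USER_AGENTS is a list of user-agent strings you define yourself
op.add_argument("--disable-blink-features=AutomationControlled")
op.add_argument(f"user-agent={random.choice(USER_AGENTS)}")  # random user agent
op.add_argument("--proxy-server=http://127.0.0.1:10809")     # proxy address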

2. Improving the Wait Mechanism

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Replace time.sleep with an explicit wait (up to 10 seconds)
wait = WebDriverWait(web, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "J_goodsList")))
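
Conceptually, an explicit wait polls a condition until it returns a truthy value or the timeout expires. The following standard-library sketch illustrates that idea only. It is not Selenium's implementation, and wait_until with its parameters is a hypothetical name.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

Unlike a fixed time.sleep(5), the call returns as soon as the condition holds, so fast pages are not slowed down and slow pages get the full timeout.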

3. Adding Exception Handling

try:
    price = item.xpath('./div[@class="p-price"]//i/text()')[0]
except IndexError:
    # xpath() returned an empty list: the card has no price node
    price = "暂无报价"  # "no price available"
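
Wrapping every lookup in its own try/except gets repetitive. Since xpath() returns a plain list, a small helper can take the first non-blank entry and fall back to a default. first_text is a hypothetical name, and skipping whitespace-only strings is an added assumption aimed at //text() results, which often begin with blank text nodes.

```python
def first_text(values, default=""):
    """Return the first non-blank string in `values`, stripped, else `default`."""
    for value in values:
        text = str(value).strip()
        if text:
            return text
    return default
```

With this helper the price lookup becomes a single line: price = first_text(item.xpath('./div[@class="p-price"]//i/text()'), default="暂无报价").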

4. Making the Code Configurable

# Add a configuration file, config.py
KEYWORDS = "vivo x100s"  # search keyword
MAX_PAGE = 10            # number of pages to crawl
SAVE_PATH = "./data/"    # output directory
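
One way to consume such settings is to group them in a small dataclass so that derived paths are built in one place. This is a sketch under assumptions: the ScraperConfig class, its field names, and the output file naming scheme are illustrative and do not appear in the original script.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperConfig:
    keywords: str = "vivo x100s"
    max_page: int = 10
    save_path: str = "./data/"

    @property
    def excel_path(self):
        # Hypothetical naming scheme: one workbook per search keyword
        name = self.keywords.replace(" ", "_") + ".xlsx"
        return os.path.join(self.save_path, name)


config = ScraperConfig()
```

Because the dataclass is frozen, the settings cannot be changed by accident in the middle of a crawl; a different run simply constructs a new ScraperConfig.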

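IV. Potential Problems and Fixes

Symptom	Cause	Fix
Product list loads incompletely	Scrolling too fast	Reduce the `slide()` step to 200 pixels
Next-page button cannot be located	Page DOM structure changed	Switch to a CLASS_NAME locator
Data contains empty values	Product fields missing	Catch the exception with try-except
Garbled Chinese text in Excel	Encoding issue, typically when opening a CSV export	Pass `encoding='utf-8-sig'` when exporting with `to_csv`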
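
The garbled-text symptom usually concerns CSV files, because Excel typically needs a byte order mark to recognise a file as UTF-8. A standard-library sketch of a BOM-prefixed export follows. save_csv and its arguments are hypothetical names; the same encoding value can be passed to pandas' to_csv.

```python
import csv


def save_csv(path, fieldnames, rows):
    """Write rows (a list of dicts) as UTF-8 with a BOM so Excel detects the encoding."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The "utf-8-sig" codec writes the three BOM bytes at the start of the file and is otherwise identical to plain UTF-8.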

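V. Directions for Extension

Distributed crawling: combine with Scrapy-Redis for multi-node coordination
Price monitoring: scheduled jobs plus email alerts
Dashboards: plot sales trends with Matplotlib
API service: expose the data through FastAPI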
