1. Introduction
In e-commerce and data analytics, 1688 (Alibaba's domestic wholesale marketplace) is one of China's leading B2B platforms and hosts a huge amount of product data. Companies, research institutions and individual developers often need this data for market analysis, price monitoring or competitor research. However, 1688 product pages typically rely on dynamic loading (AJAX) and anti-scraping mechanisms, so traditional static crawlers struggle to fetch the data directly.
This article shows how to combine a Python crawler with dynamic page parsing to accurately scrape every product in a 1688 shop, including:
- Product name
- Price
- Sales volume
- Stock
- Product URL
- Shop information
We will use Selenium together with BeautifulSoup to work around the dynamic-loading restrictions and to keep the crawler reasonably efficient. The article includes a complete code implementation and anti-scraping countermeasures.
2. Technology Choices
2.1 Why Selenium?
1688 product lists and detail pages are usually rendered through AJAX, so an ordinary HTTP request (e.g. with `requests`) cannot retrieve the complete data. Selenium, by contrast, drives a real browser and waits until the JavaScript rendering has finished before the page is parsed, which keeps the data complete.
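For example, rather than sleeping for a fixed time, Selenium can block until the rendered elements actually exist. A minimal sketch, assuming the `.offer-item` product-card selector used later in this article (the real class names may differ):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://shop.1688.com/shop/xxxxxx.htm")

# Block for up to 15 seconds until at least one product card has been rendered by JavaScript
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".offer-item"))
)
html = driver.page_source  # now contains the AJAX-rendered product markup
```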
2.2 Supporting Tools
- BeautifulSoup: parses the HTML and extracts structured data
- Pandas: saves the data to CSV/Excel
- ChromeDriver: the browser driver used by Selenium
3. Environment Setup
3.1 Installing Dependencies
Install the required packages, e.g. `pip install selenium beautifulsoup4 pandas webdriver-manager`. Selenium also needs a browser driver (such as ChromeDriver); `webdriver-manager` is recommended because it downloads and manages the driver automatically:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Let webdriver-manager fetch a matching ChromeDriver and start the browser (Selenium 4 API)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```
4. Crawler Implementation Steps
4.1 Analyzing the 1688 Page Structure
Example target URL:
https://shop.1688.com/xxxxx/xxxxxx.htm (shop home page)
Product data is usually loaded via AJAX, so you need to work out:
- the API endpoints behind the product list, if any (one way to spot them is shown in the sketch after this list)
- how scrolling triggers the dynamic loading
- the pagination logic
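To check whether the products actually come from a JSON endpoint, one option is Chrome's performance log, which records the network requests the page makes. A rough sketch, assuming Selenium 4 with Chrome; the printed URLs still have to be inspected by hand:

```python
import json

from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask ChromeDriver to expose DevTools network events in the performance log
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://shop.1688.com/shop/xxxxxx.htm")

# Print the URL of every response the page received, to spot candidate AJAX endpoints
for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message["method"] == "Network.responseReceived":
        print(message["params"]["response"]["url"])

driver.quit()
```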
4.2 Simulating Login (Optional)
Some shops only show prices to logged-in users. Selenium can fill in the credentials automatically:
driver.get("https://login.1688.com/")
driver.find_element_by_id("fm-login-id").send_keys("your_username")
driver.find_element_by_id("fm-login-password").send_keys("your_password")
driver.find_element_by_class_name("fm-submit").click()
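In practice, 1688's login flow often adds a slider captcha, so automated credential entry alone may not get you through. A common workaround is to log in once by hand, save the session cookies, and reuse them on later runs; a minimal sketch using Selenium's standard cookie API (file name and waiting time are arbitrary):

```python
import json
import time

# First run: log in manually in the opened browser window, then persist the cookies
driver.get("https://login.1688.com/")
time.sleep(60)  # finish the login (including any captcha) by hand within this window
with open("1688_cookies.json", "w", encoding="utf-8") as f:
    json.dump(driver.get_cookies(), f)

# Later runs: visit the domain first, then restore the saved cookies
driver.get("https://www.1688.com/")
with open("1688_cookies.json", encoding="utf-8") as f:
    for cookie in json.load(f):
        driver.add_cookie(cookie)
driver.get("https://shop.1688.com/shop/xxxxxx.htm")
```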
4.3 Fetching the Product List
Use Selenium to keep scrolling the page so that AJAX loads all products:
```python
import time

from selenium.webdriver.common.by import By  # also used for pagination in section 4.5

def scroll_to_bottom(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # wait for the next batch of products to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

# Open the shop home page and scroll to the bottom to load every product
driver.get("https://shop.1688.com/shop/xxxxxx.htm")
scroll_to_bottom(driver)
```
4.4 Parsing Product Data
Use BeautifulSoup to extract the product fields:
```python
from bs4 import BeautifulSoup
import pandas as pd

def parse_products(driver):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    products = []
    # The CSS classes below describe 1688's product-card markup and may need adjusting
    for item in soup.select(".offer-list-row .offer-item"):
        name = item.select_one(".offer-title").get_text(strip=True)
        price = item.select_one(".price").get_text(strip=True)
        sales = item.select_one(".sale-num").get_text(strip=True)
        link = item.select_one(".offer-title a")["href"]
        products.append({
            "商品名称": name,
            "价格": price,
            "销量": sales,
            "链接": link
        })
    return pd.DataFrame(products)

df = parse_products(driver)
df.to_csv("1688_products.csv", index=False)
```
4.5 Handling Pagination
If the shop paginates its listings, click "next page" in a loop:
```python
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, ".next-btn")
        next_btn.click()
        time.sleep(3)  # wait for the next page to load
        df = pd.concat([df, parse_products(driver)])
    except NoSuchElementException:
        break  # no next page, stop
```
5. Anti-Scraping Optimizations
5.1 Random Delays
Avoid getting blocked by sending requests too frequently:
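A small sketch of randomized pauses (the bounds are arbitrary and should be tuned):

```python
import random
import time

def random_delay(min_seconds=1.0, max_seconds=4.0):
    # Pause for a random interval so the request pattern looks less machine-like
    time.sleep(random.uniform(min_seconds, max_seconds))

# Example: call it between scrolls, page loads and pagination clicks
random_delay()
```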
5.2 Using Proxy IPs
Rotate proxy IPs so a single address does not get banned:
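For a proxy without authentication, Chrome's `--proxy-server` switch is usually enough; the address below is a placeholder. The complete example in section 6 shows an extension-based setup for proxies that require a username and password:

```python
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Route all browser traffic through an HTTP proxy (placeholder address)
chrome_options.add_argument("--proxy-server=http://proxy.example.com:8080")
driver = webdriver.Chrome(options=chrome_options)
```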
5.3 Rotating the User-Agent
Pretend to be different browsers:
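One simple approach is to pick a User-Agent string at random when the driver starts; the strings below are only examples:

```python
import random
from selenium import webdriver

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

chrome_options = webdriver.ChromeOptions()
# Spoof a randomly chosen desktop browser identity
chrome_options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
driver = webdriver.Chrome(options=chrome_options)
```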
6. Complete Code Example
```python
import os
import random
import string
import time
import zipfile

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Proxy configuration (example values from a commercial proxy provider)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"


def scroll_to_bottom(driver):
    """Scroll down repeatedly until no new content is loaded."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(1, 3))  # random delay while the next batch loads
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height


def parse_products(driver):
    """Extract product fields from the currently rendered page."""
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    products = []
    for item in soup.select(".offer-list-row .offer-item"):
        name = item.select_one(".offer-title").get_text(strip=True)
        price = item.select_one(".price").get_text(strip=True)
        sales = item.select_one(".sale-num").get_text(strip=True)
        link = item.select_one(".offer-title a")["href"]
        products.append({
            "商品名称": name,
            "价格": price,
            "销量": sales,
            "链接": link
        })
    return pd.DataFrame(products)


def create_proxy_auth_extension(proxy_host, proxy_port, proxy_username, proxy_password, scheme='http'):
    """Build a Chrome extension that configures an authenticated proxy."""
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version": "22.0.0"
    }
    """

    background_js = string.Template(
        """
        var config = {
            mode: "fixed_servers",
            rules: {
                singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                },
                bypassList: ["localhost"]
            }
        };
        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }
        chrome.webRequest.onAuthRequired.addListener(
            callbackFn,
            {urls: ["<all_urls>"]},
            ['blocking']
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme
    )

    # Write the extension files into a temporary directory
    temp_dir = os.path.join(os.getcwd(), "chrome_proxy_ext")
    if not os.path.exists(temp_dir):
        os.mkdir(temp_dir)

    with open(os.path.join(temp_dir, "manifest.json"), "w") as f:
        f.write(manifest_json)
    with open(os.path.join(temp_dir, "background.js"), "w") as f:
        f.write(background_js)

    # Zip the two files so Chrome can load them as an extension
    proxy_auth_plugin_path = os.path.join(temp_dir, "proxy_auth_plugin.zip")
    with zipfile.ZipFile(proxy_auth_plugin_path, "w") as zp:
        zp.write(os.path.join(temp_dir, "manifest.json"), "manifest.json")
        zp.write(os.path.join(temp_dir, "background.js"), "background.js")
    return proxy_auth_plugin_path


def main():
    # Configure Chrome with the authenticated proxy extension
    chrome_options = webdriver.ChromeOptions()
    proxy_auth_plugin_path = create_proxy_auth_extension(
        proxy_host=proxyHost,
        proxy_port=proxyPort,
        proxy_username=proxyUser,
        proxy_password=proxyPass
    )
    chrome_options.add_extension(proxy_auth_plugin_path)

    # Start the browser with an automatically managed ChromeDriver (Selenium 4 API)
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options
    )

    try:
        driver.get("https://shop.1688.com/shop/xxxxxx.htm")
        scroll_to_bottom(driver)
        df = parse_products(driver)

        # Follow the "next page" button until it no longer exists
        while True:
            try:
                next_btn = driver.find_element(By.CSS_SELECTOR, ".next-btn")
                next_btn.click()
                time.sleep(3)
                df = pd.concat([df, parse_products(driver)])
            except NoSuchElementException:
                break

        df.to_csv("1688_products.csv", index=False)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
```
7. Conclusion
This article showed how to scrape product data from a 1688 shop with Python + Selenium + BeautifulSoup and provided a complete code implementation. The key points are:
- Dynamic page parsing: Selenium drives a real browser so AJAX-loaded data can be captured
- Anti-scraping measures: random delays, proxy IPs and User-Agent rotation
- Data storage: exporting the results to CSV with Pandas