This article walks through building a JD.com product-data collection tool in Python, then using GitHub for version control, collaboration, and automated execution, to form a complete e-commerce data collection solution.
Project Overview
We will build a Python tool that automatically collects JD.com product information and use GitHub to provide:
· Code version control
· Collaborative team development
· Automated task execution
· Data storage and management
Environment Setup and GitHub Initialization
1. Create the GitHub Repository
First, create a new repository on GitHub, then initialize the project locally and connect it to the remote:
```bash
# Initialize the local repository
mkdir jd-data-collector
cd jd-data-collector
git init
# Add README and .gitignore
echo "# JD Data Collector" >> README.md
echo "__pycache__/" >> .gitignore
echo "*.pyc" >> .gitignore
echo "data/" >> .gitignore
echo "venv/" >> .gitignore
# Connect to the remote repository
git remote add origin https://github.com/your-username/jd-data-collector.git
```
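With the repository initialized, the initial commit can be published. A minimal sketch, assuming the default branch is named main:
```bash
# Stage and commit the initial files
git add README.md .gitignore
git commit -m "Initial commit"
# Rename the default branch to main and push it to GitHub
git branch -M main
git push -u origin main
```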
2. Set Up the Python Environment
Create and activate a virtual environment:
```bash
# Create a virtual environment
python -m venv venv
# Activate the virtual environment (Windows)
venv\Scripts\activate
# Activate the virtual environment (macOS/Linux)
source venv/bin/activate
# Install dependencies
pip install requests beautifulsoup4 selenium pandas lxml
```
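The requirements.txt shown later pins exact versions so teammates and the CI workflow install the same packages. One way to produce it from the freshly created environment is:
```bash
# Record the exact installed versions into requirements.txt
pip freeze > requirements.txt
```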
3. Project Structure
Create the following project layout:
```
jd-data-collector/
│
├── src/
│   ├── __init__.py
│   ├── crawler.py        # Core collection logic
│   ├── parser.py         # Page parsers
│   └── utils.py          # Utility functions
│
├── config/
│   └── settings.py       # Configuration
│
├── data/                 # Collected data (added to .gitignore)
│   └── outputs/
│
├── tests/                # Tests
│   └── test_crawler.py
│
├── requirements.txt      # Project dependencies
├── main.py               # Program entry point
└── README.md             # Project documentation
```
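The layout references a config/settings.py that this article does not show. A minimal sketch of what it might hold, so collection parameters are not scattered through the code (all names and values here are illustrative assumptions; main.py below hardcodes the same values, and wiring them through this module is optional):
```python
# config/settings.py -- illustrative defaults; adjust to your needs

# Search keywords to collect (Chinese terms for phone, laptop, tablet)
KEYWORDS = ["手机", "笔记本电脑", "平板电脑"]

# Maximum number of search result pages per keyword
MAX_PAGES = 3

# Base delay in seconds between requests; a random jitter is added on top
REQUEST_DELAY = 1.5

# Directory where collected data is written
OUTPUT_DIR = "data/outputs"
```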
Core Code Implementation
1. Main Collection Module (src/crawler.py)
```python
import requests
from bs4 import BeautifulSoup
import time
import random
from .utils import create_session, rotate_user_agent


class JDCrawler:
    def __init__(self, keywords, max_pages=5, delay=1.5):
        self.keywords = keywords
        self.max_pages = max_pages
        self.delay = delay
        self.session = create_session()

    def fetch_search_results(self, keyword, page):
        """Fetch one page of search results."""
        url = f'https://search.jd.com/Search?keyword={keyword}&page={page}'
        headers = {'User-Agent': rotate_user_agent()}
        try:
            response = self.session.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            print(f"Request failed: {e}")
            return None

    def parse_products(self, html_content):
        """Parse product information from a search results page."""
        if not html_content:
            return []
        soup = BeautifulSoup(html_content, 'html.parser')
        products = []
        items = soup.find_all('div', class_='gl-i-wrap')
        for item in items:
            try:
                product = {
                    'name': item.find('div', class_='p-name').get_text(strip=True),
                    'price': item.find('div', class_='p-price').get_text(strip=True),
                    'shop': item.find('div', class_='p-shop').get_text(strip=True) if item.find('div', class_='p-shop') else '',
                    'comment': item.find('div', class_='p-commit').get_text(strip=True) if item.find('div', class_='p-commit') else ''
                }
                products.append(product)
            except Exception as e:
                print(f"Failed to parse a product item: {e}")
                continue
        return products

    def crawl(self):
        """Run the collection task."""
        all_products = []
        for keyword in self.keywords:
            print(f"Collecting keyword: {keyword}")
            for page in range(1, self.max_pages + 1):
                print(f"Fetching page {page}...")
                html_content = self.fetch_search_results(keyword, page)
                products = self.parse_products(html_content)
                all_products.extend(products)
                # Random delay between requests to reduce the risk of being blocked
                time.sleep(random.uniform(self.delay, self.delay + 1.0))
        return all_products
```
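The project layout also lists tests/test_crawler.py, which the article does not show. A minimal sketch of an offline test for parse_products, using a hand-written HTML fragment instead of a live request (the fragment and expected values are illustrative assumptions; run it with `python -m pytest` from the project root so the src package is importable, assuming pytest is installed):
```python
# tests/test_crawler.py
from src.crawler import JDCrawler

# A hand-written fragment mimicking the markup parse_products expects
SAMPLE_HTML = """
<div class="gl-i-wrap">
  <div class="p-name">Example Phone 128GB</div>
  <div class="p-price">¥1999.00</div>
  <div class="p-shop">Example Store</div>
  <div class="p-commit">10k+ reviews</div>
</div>
"""


def test_parse_products_extracts_fields():
    crawler = JDCrawler(keywords=[], max_pages=1)
    products = crawler.parse_products(SAMPLE_HTML)
    assert len(products) == 1
    assert products[0]['name'] == 'Example Phone 128GB'
    assert products[0]['price'] == '¥1999.00'


def test_parse_products_handles_empty_input():
    crawler = JDCrawler(keywords=[], max_pages=1)
    assert crawler.parse_products(None) == []
```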
2. Utility Functions (src/utils.py)
```python
import random
import requests


def rotate_user_agent():
    """Return a random User-Agent string."""
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
    ]
    return random.choice(user_agents)


def create_session():
    """Create a requests session."""
    session = requests.Session()
    # Proxies, cookies, and other settings can be configured here
    return session
```
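If you need proxies or automatic retries, the session can be configured before it is returned. A hedged sketch using requests' built-in HTTPAdapter/Retry support (the commented-out proxy endpoint is a placeholder assumption, not a working address):
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_session():
    """Create a requests session with retries and an optional proxy."""
    session = requests.Session()
    # Retry transient failures (connection errors and 5xx responses)
    retry = Retry(total=3, backoff_factor=0.5,
                  status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    # Optional proxy -- replace with a real proxy endpoint if you use one
    # session.proxies.update({'http': 'http://127.0.0.1:8888',
    #                         'https': 'http://127.0.0.1:8888'})
    return session
```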
3. Main Entry Point (main.py)
```python
from src.crawler import JDCrawler
import pandas as pd
import json
import os
from datetime import datetime


def main():
    # Collection parameters (Chinese search terms for phone, laptop, tablet)
    keywords = ["手机", "笔记本电脑", "平板电脑"]
    max_pages = 3

    # Create the crawler instance
    crawler = JDCrawler(keywords, max_pages)

    # Run the collection
    products = crawler.crawl()

    # Save the results
    if products:
        # Create the output directory
        os.makedirs("data/outputs", exist_ok=True)

        # Build a timestamped filename
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/outputs/jd_products_{timestamp}"

        # Save as CSV
        df = pd.DataFrame(products)
        df.to_csv(f"{filename}.csv", index=False, encoding='utf-8-sig')

        # Save as JSON
        with open(f"{filename}.json", 'w', encoding='utf-8') as f:
            json.dump(products, f, ensure_ascii=False, indent=2)

        print(f"Done! Collected {len(products)} product records")
        print(f"Data saved to {filename}.csv and {filename}.json")
    else:
        print("No data collected")


if __name__ == "__main__":
    main()
```
GitHub Automation Workflow
1. Automate Runs with GitHub Actions
Create the file .github/workflows/run_crawler.yml:
```yaml
name: Run JD Crawler Daily

on:
  schedule:
    - cron: '0 2 * * *'   # Runs daily at 02:00 UTC (10:00 Beijing time)
  workflow_dispatch:       # Allow manual triggering

permissions:
  contents: write          # Needed so the workflow can push the committed data

jobs:
  run-crawler:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run crawler
        run: |
          python main.py

      - name: Upload data as artifact
        uses: actions/upload-artifact@v4
        with:
          name: jd-data
          path: data/outputs/

      - name: Commit and push if changed
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          # data/ is listed in .gitignore, so force-add the outputs
          git add -f data/outputs/
          git diff --quiet && git diff --staged --quiet || git commit -m "Auto-commit collected data"
          git push
```
2. Configure requirements.txt
```
requests==2.25.1
beautifulsoup4==4.9.3
selenium==3.141.0
pandas==1.3.0
lxml==4.6.3
```
Team Collaboration and Project Management
1. Branching Strategy
· main branch: stable releases, protected
· develop branch: integration branch for ongoing development
· Feature branches: feature/<feature-name>
· Fix branches: fix/<issue-description> (a typical workflow is sketched below)
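A hedged sketch of the day-to-day workflow this strategy implies; the branch name and commit message are examples:
```bash
# Start a feature branch from the latest develop
git checkout develop
git pull origin develop
git checkout -b feature/price-history

# Work, commit, and push the branch
git add .
git commit -m "Add price history collection"
git push -u origin feature/price-history
# Then open a Pull Request targeting develop on GitHub
```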
2. Issues and Project Management
Use GitHub Issues to track features and bugs:
· Label priorities (high, medium, low)
· Assign owners
· Link issues to the project board
· Use milestones to manage releases
3. Pull Request Workflow
1. Create a feature branch from the latest develop branch