Writing a Python Crawler with Proxies Fetched from an API in JSON Format - EW帮帮网

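In Python, fetching proxies from an API (in JSON format) and feeding them to a crawler is an efficient way to build a dynamic IP proxy pool. Based on my experience, the process breaks down into the following steps: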

Step 1: Get a proxy API

Choose an API service that provides free or paid proxies (for example https://proxy.webshare.io/ or another free API), and substitute your own API key and URL.

Step 2: Install the required library

pip install requests

Step 3: Complete code example

import requests
import time

def fetch_proxies(api_url, api_key=None):
    """Fetch the proxy list from the API."""
    headers = {"Authorization": f"Token {api_key}"} if api_key else {}
    try:
        response = requests.get(api_url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on HTTP error status codes
        return response.json()  # parse the JSON response
    except (requests.exceptions.RequestException, ValueError) as e:
        # ValueError covers a non-JSON body on older requests versions
        print(f"Failed to fetch proxies: {e}")
        return []

def test_proxy(proxy, test_url="http://httpbin.org/ip"):
    """Check whether a proxy works."""
    proxies = {
        "http": f"http://{proxy['ip']}:{proxy['port']}",
        "https": f"http://{proxy['ip']}:{proxy['port']}"
    }
    try:
        start = time.time()
        response = requests.get(test_url, proxies=proxies, timeout=10)
        latency = time.time() - start
        if response.status_code == 200:
            print(f"Proxy {proxy['ip']}:{proxy['port']} is valid | latency: {latency:.2f}s | response IP: {response.json()['origin']}")
            return True
    except Exception:
        # Any failure (connection error, timeout, bad JSON) marks the proxy invalid
        pass
    print(f"Proxy {proxy['ip']}:{proxy['port']} is invalid")
    return False

def crawl_with_proxy(target_url, proxy):
    """Fetch the target site through a proxy."""
    proxies = {
        "http": f"http://{proxy['ip']}:{proxy['port']}",
        "https": f"http://{proxy['ip']}:{proxy['port']}"
    }
    try:
        response = requests.get(target_url, proxies=proxies, timeout=15)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Crawl failed: {e}")
        return None

# Configuration
API_URL = "https://proxy.webshare.io/api/proxy/list/"  # replace with your API URL
API_KEY = "your_api_key_here"  # replace with your API key
TARGET_URL = "https://example.com"  # target site

# Main flow
if __name__ == "__main__":
    # 1. Fetch the proxy list
    proxies_data = fetch_proxies(API_URL, API_KEY)

    # Example response format (adjust to your API):
    # [{"ip": "1.2.3.4", "port": 80, ...}, ...]
    if not proxies_data:
        print("No proxies fetched, exiting")
        raise SystemExit(1)

    # 2. Test the proxies and keep the valid ones
    valid_proxies = [proxy for proxy in proxies_data if test_proxy(proxy)]

    if not valid_proxies:
        print("No valid proxies")
        raise SystemExit(1)

    # 3. Crawl with the first valid proxy
    best_proxy = valid_proxies[0]  # simply take the first one
    print(f"\nCrawling with proxy {best_proxy['ip']}:{best_proxy['port']}...")

    # 4. Run the crawler
    content = crawl_with_proxy(TARGET_URL, best_proxy)
    if content:
        print(f"Crawl succeeded! Content length: {len(content)} characters")
        # Add HTML parsing / data extraction logic here
    else:
        print("Crawl failed")
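
The main flow above tests proxies one at a time, and each dead proxy can cost up to the full 10-second timeout. A minimal sketch of running the checks in parallel with a thread pool follows. The function name `filter_valid_proxies` and the `max_workers` default are assumptions made for illustration, not part of the original script; the `checker` argument is any function that takes a proxy dict and returns True or False, such as `test_proxy` above.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_valid_proxies(proxies, checker, max_workers=20):
    """Run checker on every proxy in parallel and keep those that pass.

    The result preserves the input order, because executor.map yields
    results in the order the inputs were submitted.
    """
    if not proxies:
        return []
    # Never start more threads than there are proxies to check
    workers = min(max_workers, len(proxies))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(checker, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

In the script above this could stand in for the list comprehension in step 2, for example `valid_proxies = filter_valid_proxies(proxies_data, test_proxy)`. Threads suit this job because the work is almost entirely waiting on the network.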

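Key notes:

1. Proxy API response format:

The example assumes the API returns a JSON array whose items contain ip and port fields
Adjust the extraction logic to your API's actual response (for example, proxy['ip'] may need to become proxy['address'])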

2. Proxy testing:

Uses httpbin.org/ip to verify that a proxy works
Prints the latency and the IP address reported by httpbin (to verify anonymity)

3. The actual crawler:

Replace TARGET_URL with the target site
Add parsing logic (for example BeautifulSoup) after the call to crawl_with_proxy()

4. Suggested enhancements:

# Pick a proxy at random (avoids getting a single proxy banned)
import random
proxy = random.choice(valid_proxies)

# Add a User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, proxies=proxies, headers=headers, timeout=15)

# Proxy authentication (if a username and password are required)
proxies = {
    "http": f"http://user:pass@{ip}:{port}",
    "https": f"http://user:pass@{ip}:{port}"
}
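
Random selection pays off most when it is combined with a retry: if the chosen proxy fails, move on to another one instead of giving up. The sketch below shows one way to do that. The name `crawl_with_rotation` and its parameters are hypothetical; `fetch` is any function with the signature `fetch(url, proxy)` that returns the page text, or None on failure, which matches `crawl_with_proxy()` from the complete example.

```python
import random


def crawl_with_rotation(target_url, proxies, fetch, max_attempts=3):
    """Try up to max_attempts distinct proxies in random order.

    Returns (content, proxy) for the first proxy that yields content,
    or (None, None) if every attempt fails or the list is empty.
    """
    candidates = list(proxies)  # copy so the caller's list is not reordered
    random.shuffle(candidates)
    for proxy in candidates[:max_attempts]:
        content = fetch(target_url, proxy)
        if content is not None:
            return content, proxy
    return None, None
```

With the functions from the complete example, a call might look like `content, used = crawl_with_rotation(TARGET_URL, valid_proxies, crawl_with_proxy)`. Shuffling once and walking the list guarantees that no proxy is tried twice within a single call.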

Examples of common proxy API response formats:

// Format 1: array of objects
[
  {"ip": "192.168.1.1", "port": 8080, "country": "US"},
  {"ip": "10.0.0.1", "port": 3128, "country": "UK"}
]

// Format 2: nested structure
{
  "data": [
    {"proxy": "1.1.1.1:8888", "protocol": "HTTP"},
    {"proxy": "2.2.2.2:3128", "protocol": "HTTPS"}
  ]
}
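
Since the rest of the script expects each proxy as a dict with ip and port keys, one option is to convert whatever the API returns into that shape right after fetching. Below is a minimal sketch that handles only the two formats shown above. The function name `normalize_proxies` is an assumption, and a real API may use other key names, so adjust the lookups to match its documentation.

```python
def normalize_proxies(payload):
    """Convert either sample format into a list of {"ip": str, "port": int}."""
    # Format 2 wraps the list in a "data" key; format 1 is the list itself
    entries = payload.get("data", []) if isinstance(payload, dict) else payload
    normalized = []
    for entry in entries:
        if "proxy" in entry:
            # "host:port" string: split on the last colon
            host, _, port = entry["proxy"].rpartition(":")
        else:
            host, port = entry.get("ip"), entry.get("port")
        if not host:
            continue  # skip entries without a usable host
        try:
            normalized.append({"ip": str(host), "port": int(port)})
        except (TypeError, ValueError):
            continue  # skip entries with a missing or non-numeric port
    return normalized
```

Malformed entries are dropped silently here; logging them instead may be preferable when debugging a new API.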

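Free proxies are usually unstable. For production, a paid proxy service (such as Luminati or Smartproxy) is recommended, and the crawler should respect the target site's robots.txt rules.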
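
Checking robots.txt can be done with the standard library's `urllib.robotparser`. The sketch below parses rules from a string so it runs without network access; the helper name `build_robots_checker` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that tells whether a URL may be fetched.

    robots_txt is the text content of a robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Example rules: everything is allowed except paths under /private/
rules = "User-agent: *\nDisallow: /private/\n"
is_allowed = build_robots_checker(rules)
```

Against a live site, the parser can load the file itself: call `set_url()` with the site's robots.txt URL, then `read()`, before calling `can_fetch()`.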

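This approach can be extended as needed into a distributed proxy pool system that uses Redis to automate fetching, validating, allocating, and evicting proxies.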
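
To make the allocate-and-evict idea concrete, here is a rough single-process sketch that keeps a score per proxy in a plain dict. It is an in-memory stand-in rather than a Redis implementation, and the class name `ProxyPool`, the scoring scheme, and the default values are all assumptions for illustration.

```python
import random


class ProxyPool:
    """In-memory proxy pool: each proxy has a score and is evicted at zero."""

    def __init__(self, initial_score=3, max_score=5):
        self.initial_score = initial_score
        self.max_score = max_score
        self._scores = {}  # maps "ip:port" to its current score

    def add(self, ip, port):
        """Register a proxy; an already tracked proxy keeps its score."""
        self._scores.setdefault(f"{ip}:{port}", self.initial_score)

    def get(self):
        """Return a random proxy address, or None when the pool is empty."""
        if not self._scores:
            return None
        return random.choice(list(self._scores))

    def report(self, address, success):
        """Raise the score on success, lower it on failure, evict at zero."""
        if address not in self._scores:
            return
        if success:
            self._scores[address] = min(self._scores[address] + 1, self.max_score)
        else:
            self._scores[address] -= 1
            if self._scores[address] <= 0:
                del self._scores[address]

    def __len__(self):
        return len(self._scores)
```

A crawler would call `get()` before each request and `report()` afterwards, so proxies that keep failing drop out on their own. In a distributed setup the dict would be replaced by a shared store, so that several crawler processes see the same scores.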
