python basics 20 (2025.6.24): scrapy case study on simulated login & middleware (simulated login, switching request headers, proxy mode)

Published: 2025-06-30

scrapy case: a novel-reading site

Task: simulate a login to http://www.woaige.net/

Command recap:
1. Create a project: scrapy startproject xxxx
2. Generate a spider: scrapy genspider xxx allowed-domain.com

1. Crawling, approach one:

If we run the spider at this point, the page shows that the user is not logged in. Whichever approach we take, the cookie must be obtained before the URLs in start_urls are requested. By default, however, scrapy creates the initial requests for us automatically.

Look at the scrapy source code to see how the initial start_urls are processed:

# The following is scrapy source code

def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            # The core is just this one line: build a Request object. We can do the same ourselves.
            yield Request(url, dont_filter=True)

We can override start_requests ourselves:

def start_requests(self):
    print("I am the root of all evil")  # marker to show our override runs first
    yield Request(
        url=LoginSpider.start_urls[0],
        callback=self.parse
    )
Approach one: copy the cookie over from the browser directly.

import scrapy


class DengSpider(scrapy.Spider):
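
The spider body is cut off above, but approach one hinges on turning the cookie string copied from the browser's developer tools into the dict form that scrapy's Request accepts through its cookies parameter. Below is a minimal, scrapy-free sketch of that conversion; the helper name and the cookie string are made-up placeholders, not values from the site:

```python
def cookie_str_to_dict(cookie_str: str) -> dict:
    """Split a 'k1=v1; k2=v2' browser cookie string into a dict."""
    cookies = {}
    for pair in cookie_str.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only; cookie values may themselves contain '='.
        key, _, value = pair.partition("=")
        cookies[key.strip()] = value.strip()
    return cookies


# Hypothetical cookie string copied from the browser's Network panel.
raw = "PHPSESSID=abc123; user_id=42; token=xyz=="
print(cookie_str_to_dict(raw))
# {'PHPSESSID': 'abc123', 'user_id': '42', 'token': 'xyz=='}
```

Inside an overridden start_requests, the resulting dict would then be passed along as `Request(url, cookies=cookie_str_to_dict(raw), callback=self.parse)`.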