Python Example: Weather Data Collection with a Scrapy Spider

Python Example

Problem

Weather data collection based on a Scrapy spider (Python)

weather_spider/spiders/weather_spider.py
import scrapy
from weather_spider.items import WeatherItem
import re

class WeatherSpider(scrapy.Spider):
    name = "weather"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["http://www.weather.com.cn/textFC/hb.shtml"]  # 从华北地区开始

    def parse(self, response):
        """解析地区页面,获取省份链接"""
        province_links = response.css("div.conMidtab2 a::attr(href)").getall()
        for link in province_links:
            if link.startswith("http"):
                yield scrapy.Request(link, callback=self.parse_city)
            else:
                yield scrapy.Request("http://www.weather.com.cn" + link, callback=self.parse_city)

    def parse_city(self, response):
        """解析城市页面,获取天气数据"""
        city_name = response.css("div.crumbs a::text").getall()[-1]
        city_code = re.search(r'/(\d+)\.html', response.url).group(1) if re.search(r'/(\d+)\.html', response.url) else ""

        # Today's weather data
        today_weather = response.css("div.today ul li")
        if len(today_weather) >= 4:
            item = WeatherItem()
            item["city_name"] = city_name
            item["city_code"] = city_code
            item["date"] = today_weather[0].css("::text").get()
            item["week"] = today_weather[1].css("::text").get()
            item["weather"] = today_weather[2].css("::text").get()
            item["temp_high"] = today_weather[3].css("span::text").get()
            item["temp_low"] = today_weather[3].css("i::text").get()
            item["wind"] = today_weather[4].css("::text").get() if len(today_weather) > 4 else ""
            yield item

        # Forecast for the coming days
        forecast_items = response.css("div.forecast ul li")
        for li in forecast_items:
            date_info = li.css("h1::text").get()
            if date_info:
                date_parts = date_info.split()
                if len(date_parts) >= 2:
                    date = date_parts[0]
                    week = date_parts[1]
                    weather = li.css("p.wea::text").get()
                    temp = li.css("p.tem span::text").get()
                    temp_low = li.css("p.tem i::text").get()
                    wind = li.css("p.win i::text").get()

                    weather_item = WeatherItem()
                    weather_item["city_name"] = city_name
                    weather_item["city_code"] = city_code
                    weather_item["date"] = date
                    weather_item["week"] = week
                    weather_item["weather"] = weather
                    weather_item["temp_high"] = temp
                    weather_item["temp_low"] = temp_low
                    weather_item["wind"] = wind
                    yield weather_item    

items.py

import scrapy

class WeatherItem(scrapy.Item):
    city_name = scrapy.Field()  # City name
    city_code = scrapy.Field()  # City code
    date = scrapy.Field()       # Date
    week = scrapy.Field()       # Day of the week
    weather = scrapy.Field()    # Weather conditions
    temp_high = scrapy.Field()  # High temperature
    temp_low = scrapy.Field()   # Low temperature
    wind = scrapy.Field()       # Wind

pipelines.py

import json
import pymongo

class WeatherPipeline:
    def __init__(self):
        self.file = open('weather_data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'weather')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['weather_data'].insert_one(dict(item))
        return item    

settings.py

BOT_NAME = 'weather_spider'

SPIDER_MODULES = ['weather_spider.spiders']
NEWSPIDER_MODULE = 'weather_spider.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8

# Configure a delay for requests for the same website (default: 0)
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 4
CONCURRENT_REQUESTS_PER_IP = 0

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 500,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.logstats.LogStats': 500,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'weather_spider.pipelines.WeatherPipeline': 300,
    # To store data in MongoDB, uncomment the line below and configure MONGO_URI
    # 'weather_spider.pipelines.MongoPipeline': 400,
}

# MongoDB configuration
# MONGO_URI = 'mongodb://localhost:27017'
# MONGO_DATABASE = 'weather'

# Logging configuration
LOG_LEVEL = 'INFO'    

Code Explanation

  • items.py

    • Defines the structure of the scraped data, with fields for the city name, date, weather conditions, and so on.
  • weather_spider.py

    • parse method
      • Parses the region index page, extracts the province/city links, and issues follow-up requests.
    • parse_city method
      • Parses a city weather page and extracts the current weather plus the forecast for the coming days (the selectors can be verified interactively, as shown in the sketch after this list).
  • pipelines.py

    • WeatherPipeline
      • Saves the scraped items to a JSON Lines file.
    • MongoPipeline
      • Stores the items in a MongoDB database (requires uncommenting and configuring the related settings).
  • settings.py

    • Configures the crawler: download delay, concurrency, request headers, item pipelines, and other options.
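
The selectors above can be tried interactively with Scrapy's shell; this is handy when the page layout changes. A minimal session, using the spider's start URL and the same CSS selectors:

scrapy shell "http://www.weather.com.cn/textFC/hb.shtml"

# Inside the shell prompt:
>>> links = response.css("div.conMidtab2 a::attr(href)").getall()   # selector used in parse
>>> links[:5]
>>> fetch(response.urljoin(links[0]))                               # load one city page
>>> response.css("div.crumbs a::text").getall()                     # selectors used in parse_city
>>> response.css("div.today ul li ::text").getall()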

How to Run

  • Create the Scrapy project

scrapy startproject weather_spider
cd weather_spider
scrapy genspider weather weather.com.cn
  • Replace the code

    • Copy the code files above into the corresponding locations in the project.
  • Install dependencies

pip install scrapy pymongo  # pymongo is only needed if you store data in MongoDB
  • Run the spider

scrapy crawl weather
  • Data output

    • The scraped data is saved to weather_data.json (one JSON object per line), or to MongoDB depending on the pipeline configuration; a small reading sketch follows this list.
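
To verify the output, the JSON Lines file written by WeatherPipeline can be read back with a few lines of Python (a minimal sketch; the file name weather_data.json comes from the pipeline above):

import json

# WeatherPipeline writes one JSON object per line.
with open('weather_data.json', encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), "records scraped")
if records:
    print(records[0])  # e.g. a dict with city_name, date, weather, ...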

Notes

  • Do not crawl too aggressively, or the site may ban your IP (a 1-second download delay is already configured); Scrapy's AutoThrottle extension can also help, as sketched below.
  • To store data in MongoDB, uncomment the relevant lines in settings.py and configure the connection information correctly.
  • The site's structure may change; if the spider stops working, adjust the selectors to match the latest page layout.
  • This spider is for learning purposes only; the scraped data is for personal research and must not be used commercially.
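
AutoThrottle is a built-in Scrapy extension that adjusts the download delay based on server load. A minimal, optional addition to settings.py might look like this (the values are illustrative, not tuned for weather.com.cn):

# Optional: let Scrapy adapt the crawl rate automatically.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # maximum delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average concurrent requests per remote server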
