Python Crawler Deduplication Strategies: Incremental Crawling and Historical Data Comparison

suger7 · Posted yesterday at 16:53

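1. Introduction
Crawlers routinely run into duplicate data during collection. Re-fetching everything on every run wastes resources and can leave redundant records in storage. Incremental crawling is an efficient alternative: it fetches only new or updated data and skips what has already been collected.
This article covers incremental crawling and historical data comparison strategies for Python crawlers, including:
1. The core idea behind incremental crawling
2. A comparison of deduplication schemes (database, file, memory)
3. Implementations based on timestamps, hashes, and database comparison
4. A complete code example (Scrapy + MySQL incremental crawling)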
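2. The Core Idea Behind Incremental Crawling
The core of incremental crawling is identifying which data is new or has been updated. The usual approaches are:
● Timestamps (Last-Modified / Update-Time)
● Content hashes (MD5/SHA1)
● Database comparison (MySQL/Redis/MongoDB)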
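2.1 Timestamp-Based Incremental Crawling
Suitable for data sources that carry a publication time, such as news sites and blogs:
1. Record the newest timestamp seen in the previous crawl
2. On the next crawl, fetch only items later than that timestamp
Pros: simple and efficient; works well for structured data
Cons: depends on the source's time field, so it does not work for pages without timestamps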
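The two steps above can be sketched in plain Python, independent of any crawling framework. This is only a minimal illustration: the function name `filter_new_items`, the `published_at` field, and the sample records are hypothetical. The sketch advances the watermark to the newest timestamp actually seen rather than to the wall-clock time, which is one possible design choice.

```python
from datetime import datetime
from typing import Iterable, Optional


def filter_new_items(items: Iterable[dict], last_seen: Optional[datetime]):
    """Return (new_items, new_watermark).

    `published_at` is an assumed field name holding a datetime.
    """
    new_items = []
    watermark = last_seen
    for item in items:
        ts = item["published_at"]
        if last_seen is not None and ts <= last_seen:
            continue  # already collected in an earlier run
        new_items.append(item)
        if watermark is None or ts > watermark:
            watermark = ts  # remember the newest timestamp seen so far
    return new_items, watermark


# Hypothetical sample data
items = [
    {"title": "old", "published_at": datetime(2024, 1, 1, 8, 0, 0)},
    {"title": "new", "published_at": datetime(2024, 1, 2, 9, 30, 0)},
]
fresh, watermark = filter_new_items(items, datetime(2024, 1, 1, 12, 0, 0))
print([i["title"] for i in fresh])  # ['new']
print(watermark)  # 2024-01-02 09:30:00
```

On a first run, passing `None` as `last_seen` returns every item, and the returned watermark can then be stored for the next run.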
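2.2 Content-Hash-Based Deduplication
Suitable for pages whose content may change while the URL stays the same, such as e-commerce prices:
1. Compute a hash of the page content (for example MD5)
2. Compare it with the previously stored hash; if it differs, treat the page as updated
Pros: works for dynamic content
Cons: higher computational cost, since every page must still be downloaded and hashed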
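As a rough sketch of the comparison step, the snippet below keeps one hash per URL and reports whether a page is new, updated, or unchanged. The helper names `content_fingerprint` and `classify_page`, the in-memory dict, and the example URL are all assumptions made for illustration; a real crawler would keep the hashes in persistent storage.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """MD5 hex digest of the page text (change detection only, not security)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def classify_page(url: str, text: str, known_hashes: dict) -> str:
    """Return 'new', 'updated' or 'unchanged' and record the latest hash."""
    digest = content_fingerprint(text)
    previous = known_hashes.get(url)
    known_hashes[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"


known = {}  # url -> last seen hash (hypothetical in-memory store)
url = "https://shop.example.com/item/1"  # hypothetical URL
print(classify_page(url, "price: 100", known))  # new
print(classify_page(url, "price: 100", known))  # unchanged
print(classify_page(url, "price: 90", known))   # updated
```

Hashing only the extracted fields of interest (for example the price) instead of the whole HTML body tends to avoid false "updated" results caused by ads or embedded timestamps.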
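2.3 Database-Comparison-Based Incremental Crawling
Suitable for managing data at large scale:
1. Store crawled URLs or key fields in a database (MySQL/Redis)
2. Before each fetch, query the database to check whether the record already exists
Pros: supports distributed deduplication
Cons: requires additional storage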
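The check-then-store pattern above might look like the following sketch. To keep it runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for MySQL or Redis; the table name `crawled_urls` and the helper functions are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (assumed) crawled_urls table if needed."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS crawled_urls (url TEXT PRIMARY KEY)")
    return conn


def is_crawled(conn: sqlite3.Connection, url: str) -> bool:
    """Step 2: query the database before fetching."""
    row = conn.execute(
        "SELECT 1 FROM crawled_urls WHERE url = ?", (url,)
    ).fetchone()
    return row is not None


def mark_crawled(conn: sqlite3.Connection, url: str) -> None:
    """Step 1: record the URL; INSERT OR IGNORE keeps the call idempotent."""
    with conn:  # commits on success
        conn.execute("INSERT OR IGNORE INTO crawled_urls (url) VALUES (?)", (url,))


conn = open_store()
to_fetch = []
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    if is_crawled(conn, url):
        continue  # already in the database, skip
    to_fetch.append(url)
    mark_crawled(conn, url)
print(to_fetch)  # ['https://example.com/a', 'https://example.com/b']
```

With MySQL the same idea is usually expressed with a unique index on the URL column and `INSERT IGNORE`, so that concurrent workers cannot record the same URL twice.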
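3. Comparison of Deduplication Schemes
| Scheme | Suitable scenario | Pros | Cons |
|---|---|---|---|
| In-memory deduplication | Small single-machine crawlers | Fast (set()) | Data is lost on restart |
| File storage | Small to medium crawlers | Simple (CSV/JSON) | Lower performance |
| SQL database | Structured data management | Supports complex queries (MySQL) | Requires database maintenance |
| NoSQL database | High-concurrency distributed crawlers | High performance (Redis/MongoDB) | Higher memory usage |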
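To illustrate the trade-off between the first two rows, here is a minimal sketch of an in-memory set that is persisted to a JSON file so that a restart does not lose the deduplication state. The class name `FileBackedSeenSet` and the file name `seen.json` are hypothetical, and SHA-1 fingerprints are used only to keep stored entries at a fixed length.

```python
import hashlib
import json
import os
import tempfile


class FileBackedSeenSet:
    """In-memory set of URL fingerprints that can be saved to a JSON file."""

    def __init__(self, path: str):
        self.path = path
        self._seen = set()
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._seen = set(json.load(f))

    @staticmethod
    def _fingerprint(url: str) -> str:
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def add(self, url: str) -> bool:
        """Return True if the URL was not seen before, False otherwise."""
        fp = self._fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

    def save(self) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(sorted(self._seen), f)


path = os.path.join(tempfile.mkdtemp(), "seen.json")
first_run = FileBackedSeenSet(path)
print(first_run.add("https://example.com/a"))  # True
first_run.save()

second_run = FileBackedSeenSet(path)  # simulates a restart
print(second_run.add("https://example.com/a"))  # False
print(second_run.add("https://example.com/b"))  # True
```

Lookups stay as fast as a plain `set()`, while the cost of file storage is paid only when loading and saving.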
4. Implementing Incremental Crawling
4.1 Timestamp-Based Incremental Crawling (Example)
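import scrapy
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


class NewsSpider(scrapy.Spider):
    name = "news_spider"
    last_crawl_time = None  # newest publication time seen in previous runs

    def start_requests(self):
        # Load the time saved by the previous run (None on the first run)
        self.last_crawl_time = self.load_last_crawl_time()

        # Proxy settings. Scrapy's built-in HttpProxyMiddleware reads the
        # credentials from the proxy URL itself; a separate 'proxy_user_pass'
        # meta key is only honoured by a custom middleware.
        proxy_host = "www.16yun.cn:5445"
        proxy_auth = "16QMSOML:280651"
        proxy = f"http://{proxy_auth}@{proxy_host}"

        # Attach the proxy to the request
        yield scrapy.Request(
            url="https://news.example.com/latest",
            meta={"proxy": proxy},
        )

    def parse(self, response):
        # Check the status code to confirm the page was fetched successfully.
        # With default settings Scrapy filters non-2xx responses before they
        # reach this callback, so this is only a safety net.
        if response.status != 200:
            self.logger.error(f"Failed to fetch data from {response.url}. Status code: {response.status}")
            self.logger.error("This might be due to network issues or an invalid URL. Please check the URL and try again.")
            return

        newest = self.last_crawl_time
        for article in response.css(".article"):
            raw_time = article.css(".time::text").get()
            if raw_time is None:
                continue  # no timestamp on this entry, cannot compare
            pub_time = datetime.strptime(raw_time.strip(), TIME_FORMAT)
            if self.last_crawl_time and pub_time <= self.last_crawl_time:
                continue  # skip old articles

            if newest is None or pub_time > newest:
                newest = pub_time

            yield {
                "title": article.css("h2::text").get(),
                "time": pub_time,
            }

        # Save the newest publication time seen rather than datetime.now(),
        # so a clock difference with the site cannot cause articles to be skipped
        if newest is not None:
            self.save_last_crawl_time(newest)

    def load_last_crawl_time(self):
        try:
            with open("last_crawl.txt", "r") as f:
                return datetime.strptime(f.read().strip(), TIME_FORMAT)
        except FileNotFoundError:
            return None

    def save_last_crawl_time(self, time):
        with open("last_crawl.txt", "w") as f:
            f.write(time.strftime(TIME_FORMAT))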
4.2 Content-Hash-Based Deduplication (Example)
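import hashlib

import scrapy


class ContentHashSpider(scrapy.Spider):
    name = "hash_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_hashes = set()  # hashes of content already scraped in this run

    def parse(self, response):
        content = response.css("body").get()
        if content is None:
            return  # no <body> element, nothing to hash
        content_hash = hashlib.md5(content.encode("utf-8")).hexdigest()

        if content_hash in self.seen_hashes:
            return  # skip duplicate content

        self.seen_hashes.add(content_hash)
        yield {"url": response.url, "content": content}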
4.3 MySQL-Based Incremental Crawling (Complete Example)
(1) MySQL
