Scrapy on its own is not enough: for pages that require JavaScript rendering or that build the DOM dynamically, plain Scrapy cannot fetch the rendered content.
This is where Splash comes in (Chinese docs, English original docs): it executes the page's JavaScript and returns the final rendered result. Splash ships as a Docker image, so install Docker first, pull the image, and it is ready to use. Splash scripts are written in Lua, so the stack ends up as Python + Docker + Lua, a real patchwork.
Introductory article: https://www.bilibili.com/read/cv12375274
There is also a scrapy-splash project on GitHub that glues the two together.

How to use scrapy-splash

Install scrapy, splash, and scrapy-splash

pip install scrapy
docker pull scrapinghub/splash
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
# The previous command leaves this terminal attached to the running Docker container; open a new terminal to run the remaining commands
pip install scrapy-splash
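
Before wiring Splash into Scrapy, you can sanity-check that the container is up by calling its HTTP API directly. A minimal sketch (the target URL is just a placeholder):

import requests

# Splash's render.html endpoint returns the page HTML after executing
# its JavaScript; 'wait' gives the page's scripts time to finish.
resp = requests.get(
    "http://127.0.0.1:8050/render.html",
    params={"url": "http://example.com", "wait": 0.5},
)
print(resp.text[:200])  # should be the rendered HTML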

Create a new Scrapy project

scrapy startproject tutorial

Edit the project's settings.py; for simplicity, just add the following:

SPLASH_URL = 'http://127.0.0.1:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
FEED_EXPORT_ENCODING = 'UTF-8' 
SPIDER_MIDDLEWARES = {
  'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
  'scrapy_splash.SplashCookiesMiddleware': 723,
  'scrapy_splash.SplashMiddleware': 725,
  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Under the spiders directory, create a spider file, say a spider named df in a file called df_spider.py:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'df'
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # Route the request through Splash; 'wait' gives the page's
            # JavaScript 0.5 seconds to run before the HTML is returned.
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # The page number is hard-coded, so every response overwrites the
        # same file; derive it from the URL instead (e.g.
        # response.url.split("/")[-2]) when crawling more than one page.
        page = '1'
        filename = f'df-{page}.html'
        with open(filename, 'wb') as f:
            # response.body is the HTML after Splash rendered the JavaScript
            f.write(response.body)

Run the spider from the project root; this produces a file named df-1.html:

scrapy crawl df

Open the file and check whether the JavaScript has been rendered. If it has not, increase the Splash wait time in the spider (it is set to 0.5 seconds above).
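For example, pass a longer wait (3 seconds here is an arbitrary value; tune it per site):

  yield SplashRequest(url, self.parse, args={'wait': 3})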
To click an element from a Splash script:

  assert(splash:runjs('document.querySelector(".next a[href]").click()'))
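
That runjs call only works inside a full Lua script sent to Splash's execute endpoint. Below is a minimal sketch of a spider using it; the URL, spider name, and wait values are illustrative assumptions, while the selector comes from the snippet above:

import scrapy
from scrapy_splash import SplashRequest

# Lua script for Splash's 'execute' endpoint: load the page, click the
# "next" link, wait for the new content, then return the rendered HTML.
LUA_CLICK_NEXT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    assert(splash:runjs('document.querySelector(".next a[href]").click()'))
    assert(splash:wait(0.5))
    return splash:html()
end
"""

class ClickSpider(scrapy.Spider):
    name = 'df_click'

    def start_requests(self):
        yield SplashRequest(
            "http://example.com",
            self.parse,
            endpoint='execute',  # run the Lua script instead of the default render.html
            args={'lua_source': LUA_CLICK_NEXT},
        )

    def parse(self, response):
        # response.body now holds the HTML as it looks after the click
        self.logger.info("got %d rendered bytes", len(response.body))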
