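Scrapy by itself is not enough for pages that rely on JavaScript rendering or a dynamically built DOM: it only sees the raw HTML, so that content is out of reach.
In that case you need Splash (Chinese docs, English docs), which executes the page's JavaScript and returns the final rendered result. Splash is distributed as a Docker image, so install Docker first, pull the image, and it is ready to use. Splash is also scripted in Lua. Hmm, Python + Docker + Lua: quite the stitched-together monster.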
Introductory article: https://www.bilibili.com/read/cv12375274
There is a scrapy-splash project on GitHub.
How to use scrapy-splash
Install scrapy, splash, and scrapy-splash
pip install scrapy
docker pull scrapinghub/splash
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
# After the previous command, the terminal stays attached to the running Docker container, so open a new terminal for the remaining commands
pip install scrapy-splash
Create a new Scrapy project
scrapy startproject tutorial
Edit settings.py in that project; to keep it simple, just add the following:
SPLASH_URL = 'http://127.0.0.1:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
FEED_EXPORT_ENCODING = 'UTF-8'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
In the spiders directory, create a new spider file; for example, a spider named df in a file named df_spider.py
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'df'
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        # Route every start URL through Splash; wait 0.5 s so the page's JS can run
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # page is hard-coded, so every response is saved to df-1.html
        page = '1'  # response.url.split("/")[-2]
        filename = f'df-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Run the spider from the project directory; it produces a file named df-1.html.
scrapy crawl df
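Because the parse method hard-codes page = '1', both start URLs end up in the same df-1.html, and the later response overwrites the earlier one. One possible way around this is to derive the file name from the page URL. The sketch below is only an illustration: the helper name filename_for and the naming scheme are made up here, and it assumes response.url holds the original page URL.

```python
import re
from urllib.parse import urlparse


def filename_for(url: str, prefix: str = "df") -> str:
    """Build a file name from a URL (hypothetical helper, not part of Scrapy)."""
    parsed = urlparse(url)
    # Join host and path, then collapse every non-alphanumeric run into "-"
    raw = parsed.netloc + parsed.path
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-")
    # Fall back to "index" if nothing usable is left
    return f"{prefix}-{slug or 'index'}.html"


if __name__ == "__main__":
    for url in ["http://example.com", "http://example.com/foo"]:
        print(filename_for(url))
```

Inside parse, filename = filename_for(response.url) would then replace the two lines that build the name from page.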
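Check whether df-1.html contains the JS-rendered content. If it is still unrendered, increase the Splash wait time in the spider file; the example above uses 0.5 seconds.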
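To try different wait values without rerunning the spider each time, one option is to call Splash's HTTP API directly. The sketch below only builds a render.html URL with the url and wait query parameters, which can then be opened in a browser or fetched with curl. The helper name build_render_url is hypothetical, and the base address assumes the same SPLASH_URL as in settings.py.

```python
from urllib.parse import urlencode

SPLASH_URL = "http://127.0.0.1:8050"  # assumed: same address as in settings.py


def build_render_url(page_url: str, wait: float, splash_url: str = SPLASH_URL) -> str:
    """Return a render.html URL asking Splash to wait `wait` seconds after page load."""
    # urlencode percent-escapes the target URL so it is safe as a query value
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    # Print one URL per candidate wait time to compare the rendered output
    for wait in (0.5, 2, 5):
        print(build_render_url("http://example.com", wait))
```

Once a wait value gives fully rendered HTML here, the same number can go into args={'wait': ...} in the spider.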
To click an element in Splash:
assert(splash:runjs('document.querySelector(".next a[href]").click()'))
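The runjs line above has to live inside a Lua script that Splash executes. A minimal sketch of how that might be wired up from Python follows: the script loads the page, waits, clicks, waits again, and returns the HTML. The names LUA_CLICK_SCRIPT and build_click_args are made up for illustration, the wait values are guesses, and the script assumes args.url is filled in from the request URL.

```python
# Lua script meant for Splash's "execute" endpoint.
# Assumption: args.url and args.wait arrive via the request arguments.
LUA_CLICK_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  assert(splash:runjs('document.querySelector(".next a[href]").click()'))
  assert(splash:wait(args.wait))
  return splash:html()
end
"""


def build_click_args(wait: float = 0.5) -> dict:
    """Build the args dict for a Splash request (hypothetical helper)."""
    # lua_source carries the script; wait is read inside it as args.wait
    return {"lua_source": LUA_CLICK_SCRIPT, "wait": wait}
```

In a spider, these arguments would typically be passed as SplashRequest(url, self.parse, endpoint='execute', args=build_click_args(1.0)), so that parse receives the HTML as it looks after the click.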