Example: Scraping the Five Main Sections of NetEase News
URL: https://news.163.com/

Analysis:

The homepage itself contains no dynamically loaded data, so the URLs of the five target sections can be extracted from it directly. Inside each section's page, however, the news headlines are loaded dynamically, so Selenium has to work together with Scrapy to scrape each headline and its detail-page URL. The data on each detail page is not dynamically loaded, so the article content can be scraped directly. (A quick way to verify whether data is dynamically loaded is sketched after the workflow list below.) The workflow for using Selenium inside Scrapy is:
- Instantiate a browser object in the spider class and keep it as a class attribute
- Implement the browser-automation steps in the downloader middleware
- Override closed(self, reason) in the spider class and close the browser object inside it
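How do you tell whether data is dynamically loaded in the first place? One quick check (a minimal sketch of my own, not part of the original project) is to fetch the raw HTML without executing any JavaScript and search it for a headline that is visible in the rendered page; `keyword` below is a placeholder you would replace with a real headline:

```python
import requests

# Fetch the raw HTML with no JavaScript execution
url = 'https://news.163.com/'
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal UA header; adjust as needed
page_text = requests.get(url=url, headers=headers).text

# Replace with a headline copied from the page as rendered in a browser
keyword = '...'  # hypothetical placeholder
print(keyword in page_text)  # False suggests the content is dynamically loaded
```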
Code:
First, run the following commands in the terminal, one by one, to create a new project and a spider (ProName and spiderName are placeholders; this project uses wangyiPro and wangyi):
- scrapy startproject ProName
- cd ProName
- scrapy genspider spiderName www.xxx.com
Next, write the spider file under the spiders folder:
```python
import scrapy
from wangyiPro.items import WangyiproItem
from selenium import webdriver


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['xxx.com']
    start_urls = ['https://news.163.com/']
    model_urls = []  # URLs of the five section pages, filled in by parse()
    # Instantiate the browser object as a class attribute (a raw string keeps
    # the backslashes in the Windows path from being treated as escapes)
    driver = webdriver.Chrome(executable_path=r'D:\pycharm\Scrapy\chromedriver.exe')

    def parse(self, response):
        # The homepage is not dynamically loaded: pull the section links directly
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        index = [3, 4, 6, 7, 8]  # indexes of the five target sections
        for i in index:
            model_url = li_list[i].xpath('./a/@href').extract_first()
            self.model_urls.append(model_url)
        for url in self.model_urls:
            yield scrapy.Request(url=url, callback=self.parse_model)

    def parse_model(self, response):
        # This response was rendered by Selenium in the downloader middleware
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            # The h3 in this XPath must not be omitted, or the query fails
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            if detail_url:
                item = WangyiproItem()
                item['title'] = title
                # Pass the item on to the callback through the request's meta dict
                yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                     meta={'item': item})

    def parse_detail(self, response):
        # Detail pages are not dynamically loaded: parse the response directly
        content = response.xpath('//*[@id="endText"]/p/text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    def closed(self, reason):
        # Called when the spider finishes crawling: close the browser object
        self.driver.quit()
```
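The spider imports WangyiproItem, but items.py is not shown above. Based on the two fields the spider assigns (title and content), it would be this minimal sketch:

```python
import scrapy


class WangyiproItem(scrapy.Item):
    # The two fields the spider fills in: the headline and the article body
    title = scrapy.Field()
    content = scrapy.Field()
```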
Implement the browser-automation steps in middlewares.py:
```python
from time import sleep
from scrapy.http import HtmlResponse


class WangyiproDownloaderMiddleware(object):
    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        # Only the five section pages need Selenium; everything else passes through
        if request.url in spider.model_urls:
            driver = spider.driver
            driver.get(request.url)
            sleep(2)
            # JS injection: scroll to the bottom of the page so more news items load
            driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            sleep(1)
            page_text = driver.page_source
            # Replace the original response with the Selenium-rendered page
            return HtmlResponse(url=request.url, body=page_text,
                                encoding='utf-8', request=request)
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass
```
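As written, this drives a visible Chrome window. If you would rather run headlessly, Chrome options can be passed when the driver is instantiated in the spider class (an optional tweak of my own, not part of the original code):

```python
from selenium import webdriver

# Optional: run Chrome without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')  # commonly paired with headless on Windows

driver = webdriver.Chrome(executable_path=r'D:\pycharm\Scrapy\chromedriver.exe',
                          options=options)
```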
Finally, remember to enable the item pipeline and the downloader middleware in settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}
```
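The settings enable WangyiproPipeline, but pipelines.py itself is not shown in this write-up. A minimal sketch that writes each story to a text file (the output target is my assumption; the original storage logic is not shown) could look like:

```python
class WangyiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        # Runs once when the spider starts: open the output file
        self.fp = open('./news.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Persist each story as "title: content"
        self.fp.write(item['title'] + ': ' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: close the file
        self.fp.close()
```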