Using Scrapy to Find a String in a Web Site

Last updated Sunday, 12. February 2017 07:53AM

I wanted to find every page on the University of Guam College of Natural and Life Sciences Web Site that contains a specific string. This short Python script, which uses the Scrapy framework, does the trick:

test_spider.py

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class someSpider(CrawlSpider):
  name = 'crawltest'
  allowed_domains = ['cnas-re.uog.edu']
  start_urls = ['http://cnas-re.uog.edu']
  rules = (Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),)

  def parse_item(self, response):
    target = 'bell pepper'
    log = 'test_spider_log.md'
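    # Decode the raw bytes explicitly; str() on bytes gives the b'...' repr under Python 3.
    if target in response.body.decode('utf-8', errors='replace'):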
      with open(log, 'a') as f:
        f.write('**{} was found in <{}>\n'.format(target, response.url))
    return
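
The test in parse_item is a case-sensitive substring match, so a page containing "Bell pepper", or the two words split across a line break, would be missed. Below is a minimal sketch of a more forgiving check. The helper name contains_target and the UTF-8 default are assumptions for illustration, not part of the original script:

```python
import re


def contains_target(body, target, encoding="utf-8"):
    """Return True if target occurs in body, ignoring case and whitespace runs.

    body may be bytes (like response.body) or str. Hypothetical helper.
    """
    if isinstance(body, bytes):
        # Decode leniently so a stray bad byte does not abort the check.
        body = body.decode(encoding, errors="replace")
    # Collapse whitespace runs (including newlines) to single spaces.
    normalized = re.sub(r"\s+", " ", body).lower()
    needle = re.sub(r"\s+", " ", target).lower()
    return needle in normalized


print(contains_target(b"<p>Bell\n   Pepper yields</p>", "bell pepper"))  # True
print(contains_target(b"<p>black pepper</p>", "bell pepper"))  # False
```

Calling contains_target(response.body, target) in place of the plain "in" test would be the only change needed; the rest of test_spider.py stays as it is.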

Executed from the command line using:

scrapy runspider test_spider.py -s DEPTH_LIMIT=2

Output: test_spider_log.md

bell pepper was found in http://cnas-re.uog.edu/soils-of-guam/

bell pepper was found in http://cnas-re.uog.edu/cnas-publications/?auth=&limit=17&tgid=&type=&usr=&yr=

bell pepper was found in http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=115&type=&usr=&yr=

bell pepper was found in http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=66&type=&usr=&yr=
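
Several of the hits above are the same page reached with different query strings. Below is a minimal sketch for collapsing them, assuming each log line ends with the URL in angle brackets, as written by the f.write call in the script. The function name unique_pages is hypothetical:

```python
from collections import OrderedDict
from urllib.parse import urlsplit


def unique_pages(log_lines):
    """Map each scheme://host/path to the number of logged hits for it."""
    counts = OrderedDict()
    for line in log_lines:
        # Each hit line ends with the URL wrapped in angle brackets: <http://...>
        start, end = line.rfind("<"), line.rfind(">")
        if start == -1 or end < start:
            continue  # not a hit line
        parts = urlsplit(line[start + 1:end])
        # Drop the query string and fragment; keep scheme, host and path.
        page = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
        counts[page] = counts.get(page, 0) + 1
    return counts


sample = [
    "**bell pepper was found in <http://cnas-re.uog.edu/soils-of-guam/>",
    "**bell pepper was found in <http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=115&type=&usr=&yr=>",
    "**bell pepper was found in <http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=66&type=&usr=&yr=>",
]
for page, hits in unique_pages(sample).items():
    print(page, hits)
```

To run it on the real log, pass the open file instead of the sample list: unique_pages(open('test_spider_log.md')).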