Last updated Sunday, 12. February 2017 07:53AM
I wanted to find pages on the University of Guam College of Natural and Applied Sciences website that contain a specific string. This short Python script, which uses the Scrapy framework, does the trick:
test_spider.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['cnas-re.uog.edu']
    start_urls = ['http://cnas-re.uog.edu']
    # Follow every link within the allowed domain and check each page.
    rules = (Rule(LinkExtractor(allow=()), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        target = 'bell pepper'
        log = 'test_spider_log.md'
        # Append a line to the log for each page that contains the target.
        if target in str(response.body):
            with open(log, 'a') as f:
                f.write('**{} was found in <{}>\n'.format(target, response.url))
        return
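The check inside parse_item can also be pulled out into a plain function, which makes it easy to try without running a crawl. The following is only a sketch: the function name log_if_found, its parameters, and the example.com URLs are hypothetical and not part of the script above. It also assumes pages are UTF-8 and that a case-insensitive match is wanted, neither of which the original script does.

```python
import os
import tempfile


def log_if_found(body, url, target, log_path):
    """Append a line to log_path if target occurs in body.

    body is the raw page content as bytes (like Scrapy's response.body).
    Returns True when a match was logged. Hypothetical helper for illustration.
    """
    # Decode explicitly instead of calling str() on bytes; bytes that are
    # not valid UTF-8 are replaced rather than raising an error (assumption).
    text = body.decode('utf-8', errors='replace')
    # Compare in lower case so 'Bell Pepper' also matches 'bell pepper'.
    if target.lower() not in text.lower():
        return False
    with open(log_path, 'a') as f:
        f.write('{} was found in <{}>\n'.format(target, url))
    return True


# Small demonstration using a temporary log file and made-up pages.
demo_log = os.path.join(tempfile.mkdtemp(), 'test_spider_log.md')
page = b'<html><body>Growing Bell Pepper on Guam</body></html>'
print(log_if_found(page, 'http://example.com/a', 'bell pepper', demo_log))
print(log_if_found(b'<p>no match</p>', 'http://example.com/b', 'bell pepper', demo_log))
```

Inside the spider, parse_item could then call such a helper with response.body and response.url.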
I ran the spider from the command line, limiting the crawl depth to 2:
scrapy runspider test_spider.py -s DEPTH_LIMIT=2
Output: test_spider_log.md
bell pepper was found in http://cnas-re.uog.edu/soils-of-guam/
bell pepper was found in http://cnas-re.uog.edu/cnas-publications/?auth=&limit=17&tgid=&type=&usr=&yr=
bell pepper was found in http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=115&type=&usr=&yr=
bell pepper was found in http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=66&type=&usr=&yr=
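Three of the four hits above are the same publications page reached with different query strings. As a rough sketch of how the logged URLs could be grouped by page, the snippet below uses urlsplit from the standard library. The function name group_hits_by_page is hypothetical, and dropping the query string is an assumption about what counts as "the same page".

```python
from urllib.parse import urlsplit


def group_hits_by_page(urls):
    """Map each URL without its query string to the full URLs that matched."""
    groups = {}
    for url in urls:
        parts = urlsplit(url)
        # Keep scheme, host and path; drop the query string and fragment.
        page = '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)
        groups.setdefault(page, []).append(url)
    return groups


# The four URLs from the log output above.
hits = [
    'http://cnas-re.uog.edu/soils-of-guam/',
    'http://cnas-re.uog.edu/cnas-publications/?auth=&limit=17&tgid=&type=&usr=&yr=',
    'http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=115&type=&usr=&yr=',
    'http://cnas-re.uog.edu/cnas-publications/?auth=&tgid=66&type=&usr=&yr=',
]

for page, matches in group_hits_by_page(hits).items():
    print('{} ({} hit(s))'.format(page, len(matches)))
# http://cnas-re.uog.edu/soils-of-guam/ (1 hit(s))
# http://cnas-re.uog.edu/cnas-publications/ (3 hit(s))
```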