Using Scrapy to Extract Scientific Names from PestNet Fact Sheets
PestNet serves a couple of hundred excellent pest fact sheets on its site at http://www.pestnet.org/fact_sheets/index.htm. Unfortunately, these are indexed only by vernacular names. I want to hyperlink to sheets using scientific names. So I wrote a scrapy script to crawl the site and scrape the section from each page which contains scientific names.
from scrapy.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor import re class someSpider(CrawlSpider): name = 'crawltest' allowed_domains = ['www.pestnet.org'] start_urls = ['http://www.pestnet.org/fact_sheets/index.htm'] rules = ( Rule(LinkExtractor(allow=()), follow=True, callback='parse_item'), ) # Let's capture scientific names def parse_item(self, response): log = 'scientific_names_log.md' s = str(response.body) searchObj = re.search( r'<a name="Scientific Name"(.*?)<a name=', s, re.M|re.I) if searchObj: result = searchObj.group(1) else: result = "Nothing found!!" with open(log, 'a') as f: f.write('{} {}\n'.format(response.url, result)) return
The script was invoked from the terminal using:
scrapy runspider scrapePP.py -s DEPTH_LIMIT=1
Here are the first few lines from scientific_names_log.md:
http://www.pestnet.org/fact_sheets/mini/index.htm Nothing found!! http://www.pestnet.org/fact_sheets/batiki_blue_grass_eye_spot_207.htm ></a><h1 class="" style="False">Scientific Name</h1><P><EM>Curvularia ischaemi</EM></P> http://www.pestnet.org/fact_sheets/bean_pod_borer_037.htm ></a><h1 class="" style="False">Scientific Name</h1><P><EM>Maruca</EM> <EM>vitrata</EM>; it used to be known as <EM>Maruca testulalis.</EM></P> http://www.pestnet.org/fact_sheets/bean_lace_bug_253.htm ></a><h1 class="" style="False">Scientific Name</h1><P><EM></EM> <EM>Corythucha gossypii</EM></P> http://www.pestnet.org/fact_sheets/bean_phaseolus_rust_217.htm ></a><h1 class="" style="False">Scientific Name</h1><P><EM>Uromyces appendiculatus</EM> \r\nvar. <EM>appendiculatus.</EM> Previously <EM>Uromyces \r\nphaseoli.</EM> </P>
As you can see, there was still some work to be done to clean this up. I used the atom editor to delete the extraneous bits and saved urls and scientific names to pacific_pests_insects.csv. Note that this file contains info only on arthropod pests.