This is the third part of the series Scrapping with Scrapy.
In this post I will cover how to use Selenium with Scrapy, and how to change the templates that get loaded when a new Scrapy project is created. You may want to read part 1 and part 2 of this series first for background.
Let’s start with how to use Selenium with Scrapy.
First download the Selenium server jar and cd to the directory that contains it. Then start the server using
java -jar selenium-server-jarfilename.jar
When you can't fetch the data directly from the page source, and you first need to load the page, click somewhere, scroll down, etc., Selenium comes to the rescue.
Here is the complete code of the scraper.
You need to open the URL with Selenium, so that you can fetch what Scrapy can't see.
Here is the code for such a spider. You need to add a few lines to your spider to get the page loaded through Selenium. Have a look at the spider code here.
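As a rough illustration of the Selenium side, here is a minimal sketch rather than the spider linked above. The function names, the click_css parameter and the server URL http://localhost:4444/wd/hub are my own assumptions, and the Remote call is written for a recent Selenium release (older releases take desired_capabilities instead of options).

```python
def connect_to_selenium_server(server_url="http://localhost:4444/wd/hub"):
    """Open a Firefox session on an already running Selenium server.

    The URL is an assumed default; adjust it to wherever your jar listens.
    """
    # Imported here so render_with_selenium() works with any driver-like object.
    from selenium import webdriver

    return webdriver.Remote(command_executor=server_url,
                            options=webdriver.FirefoxOptions())


def render_with_selenium(driver, url, click_css=None, scroll=True):
    """Load url in the browser and return the HTML as the browser sees it."""
    driver.get(url)
    if click_css is not None:
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
        driver.find_element("css selector", click_css).click()
    if scroll:
        # Scroll to the bottom so lazily loaded content gets requested.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    return driver.page_source
```

Inside a spider's parse method you could call render_with_selenium(self.driver, response.url) and pass the returned HTML to Selector(text=html) from scrapy.selector, then run your XPaths on that selector as usual. A real scraper would normally also wait for the new content to appear before reading page_source, and call driver.quit() when the spider closes.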
To see all the functions and attributes an object provides, you can use
dir(object_name)
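For example, dir() also lists many internal names that start with an underscore, so it can help to filter those out. The sketch below uses a made-up Page class as a stand-in for whatever object you are exploring (a Selenium driver, an element, a Scrapy response); public_names is a hypothetical helper, not part of either library.

```python
def public_names(obj):
    """Return the names dir() reports for obj, minus underscore-prefixed ones."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Page(object):
    """Hypothetical stand-in for the object you want to inspect."""

    def __init__(self):
        self.title = "demo"

    def click(self):
        pass


print(public_names(Page()))  # ['click', 'title']
```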
I will be posting some tips and tricks related to XPaths in later posts.
Now let me tell you how to change the templates that get loaded when you create a new project in Scrapy.
First let's install open-as-administrator, which makes it easy to edit files that require sudo permission on Linux.
sudo add-apt-repository ppa:noobslab/apps
sudo apt-get update
sudo apt-get install open-as-administrator
nautilus -q
Then find where Scrapy is installed. It will be somewhere like this:
/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.2-py2.7.egg/scrapy
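If you are not sure where the package lives on your machine, Python can tell you. This is a small sketch assuming Python 3; package_dir and templates_dir are hypothetical helper names. On Python 2.7 you can get the same result with os.path.dirname(scrapy.__file__) after import scrapy.

```python
import importlib.util
import os


def package_dir(package_name):
    """Return the directory an installed package is loaded from."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ImportError("package %r is not installed" % package_name)
    # spec.origin points at the package's __init__.py file.
    return os.path.dirname(spec.origin)


def templates_dir(package_name="scrapy"):
    """Return the templates/project/module folder inside the package."""
    return os.path.join(package_dir(package_name), "templates", "project", "module")
```

Calling templates_dir() on a machine with Scrapy installed should give you the folder that holds the project template files.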
Here you will find a templates folder. Open it, go to project, then to module inside it. Here you can see all the template files.
Your items.py template is also here, named items.py.tmpl. Right-click it, choose open as administrator, and edit it so that new projects start with the content you want.
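To preview what an edited template will produce, you can render it yourself. The sketch below assumes that Scrapy fills in placeholders such as $project_name and $ProjectName using Python's string.Template. The template text, the two fields and the render_items_template helper are illustrative, not Scrapy's own code.

```python
from string import Template

# Hypothetical customised items.py.tmpl with two fields added by default.
CUSTOM_ITEMS_TEMPLATE = '''\
# -*- coding: utf-8 -*-
import scrapy


class ${ProjectName}Item(scrapy.Item):
    # Fields that every new project should start with (illustrative).
    url = scrapy.Field()
    title = scrapy.Field()
'''


def render_items_template(template_text, project_name):
    """Fill the placeholders the way a new project would (assumed behaviour)."""
    # Turn my_shop into MyShop for the class name.
    camel = "".join(part.capitalize() for part in project_name.split("_"))
    return Template(template_text).substitute(project_name=project_name,
                                              ProjectName=camel)


print(render_items_template(CUSTOM_ITEMS_TEMPLATE, "my_shop"))
```

If that assumption holds, keep in mind that a literal dollar sign inside the template has to be written as $$, otherwise string.Template treats it as the start of a placeholder.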