Setting up the environment for web scraping - Part 1
Installing Webdriver and Selenium
My first project as part of the AiCore fellowship is collecting data scattered online and creating datasets. I use a web scraper, which is an automated bot that crawls through the internet and extracts data.
The most popular Python libraries for this are Beautiful Soup, Scrapy, and Selenium. I am using Selenium for this project.
Selenium is an automated testing tool for web applications. It is not the most efficient tool for collecting data, but it lets us easily interact with DOM elements and extract data from the dynamic pages of the target site.
How to install
The process might differ depending on the environment, but in this post, I am using the environment listed below.
- Miniconda3 (docs.conda.io/en/latest/miniconda.html)
- Visual Studio Code (code.visualstudio.com/download)
Step 1. Prepare web drivers
According to the official Selenium website, the web driver drives a browser natively, as a user would, either locally or on a remote machine using the Selenium server. In simple terms, it imitates user actions as instructed by the Python code. If you want to use Selenium, you need to download the web driver for the browser you want to use.
- Download geckodriver (for Firefox) from the official website (github.com/mozilla/geckodriver/releases).
- Extract the file and remember where you saved it.
- Download ChromeDriver (for Chrome) from the official website (chromedriver.chromium.org/downloads).
- Extract the file and remember where you saved it.
When you download a driver, make sure it matches the version of your web browser. You can check the browser version here:
- Firefox version: Settings > General > Firefox Updates
- Chrome version: Settings > About Chrome
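The version check can also be scripted. The sketch below is only an illustration: it assumes the browser and the driver each print a version string such as "Google Chrome 114.0.5735.90" or "ChromeDriver 114.0.5735.90" when called with --version, and it compares major versions only, which is an assumption about how strict the match needs to be. The helper names major_version and versions_match are hypothetical, not part of Selenium. geckodriver uses its own numbering, so this comparison only makes sense for Chrome.

```python
import re


def major_version(version_output: str) -> int:
    """Return the major version from text like 'ChromeDriver 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(browser_output: str, driver_output: str) -> bool:
    """Compare major versions only (an assumption about how strict the match must be)."""
    return major_version(browser_output) == major_version(driver_output)


# The strings would normally come from running the binaries yourself, e.g.
#   chromedriver --version
# and pasting the output in, or capturing it with subprocess.run(...).
print(versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 114.0.5735.90"))
```

If the function returns False, download the driver release whose major version matches the browser.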
Step 2. Move the drivers to the right location
Now it is time to move the extracted driver files to the right location so our Python code can find them when we run the code.
To find the location, run the command below in the VS Code terminal.
$ echo $PATH
PATH is an environment variable that holds a colon-separated list of directories. If you put an executable in any one of these directories, you do not need to give the full path to the executable or script; you can run it by its name as a command.
My driver files were in the Downloads folder, and I moved them to one of the directories on the list (/usr/bin). Writing to /usr/bin usually requires root privileges, so you may need to prefix the mv commands with sudo.
$ cd /usr/bin
$ mv /home/yoojin/Downloads/geckodriver .
$ mv /home/yoojin/Downloads/chromedriver .
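To confirm that the drivers can now be found by name, you can ask Python to search PATH the same way the shell does. This is a minimal sketch using shutil.which from the standard library; the helper name locate_drivers is hypothetical.

```python
import shutil


def locate_drivers(names=("geckodriver", "chromedriver")):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a driver is reported as not found, check that the file was moved into one of the PATH directories and that it is executable.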
Step 3. Selenium library setup
The next step is installing the Selenium library. As I am using a Miniconda environment, I used the conda command.
conda install selenium
Step 4. Run the sample code
Everything is now ready to run the sample code. Copy the code below for the browser you are using and run it in VS Code.
# Firefox
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.google.co.uk')

# Chrome
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.google.co.uk')
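The sample code leaves the browser window open after the page loads. A slightly fuller sketch, shown here for Firefox, reads the page title and then closes the browser with quit(). The helper names page_title and make_firefox are hypothetical; driver.get, driver.title and driver.quit are the Selenium calls doing the work.

```python
def page_title(make_driver, url):
    """Open url with a freshly created driver, return the page title, then close the browser."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        # quit() ends the session and closes the browser, even if get() raised an error
        driver.quit()


def make_firefox():
    # Imported here so page_title() can be reused without Selenium being installed
    from selenium import webdriver
    return webdriver.Firefox()


# Example call (opens a real Firefox window):
# print(page_title(make_firefox, 'https://www.google.co.uk'))
```

Passing the driver in as a function keeps the same helper usable for Chrome: swap make_firefox for a function that returns webdriver.Chrome().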
- ModuleNotFound Error
Traceback (most recent call last):
  File "/home/yoojin/Documents/Pokemon_scraper/test.py", line 1, in <module>
    from selenium import webdriver
ModuleNotFoundError: No module named 'selenium'
Sometimes, when you run the code, Python reports that it cannot find the library even though you have definitely installed it.
In this case, it is worth checking that you are in the right environment. In Visual Studio Code, you can change the Python interpreter by clicking the interpreter name in the status bar at the bottom of the window, or by using the Python: Select Interpreter command from the Command Palette (Ctrl+Shift+P).
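A quick way to see which environment is actually running your script is to print the interpreter path and check whether it can find Selenium. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can find the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The path should point inside the conda environment where Selenium was installed
print("Interpreter:", sys.executable)
print("selenium available:", is_installed("selenium"))
```

If the interpreter path points somewhere other than your conda environment, select the correct interpreter and run the script again.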
- Display Error
selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status: 1
This error can come up when trying to run the browser in non-headless mode on a box that doesn't have a display. There are 2 ways to resolve this issue.
The first method is running the driver in headless mode. This means the browser has no visible window, so you cannot watch what the web driver is doing.
# Firefox, headless
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

fireFoxOptions = FirefoxOptions()
fireFoxOptions.add_argument('--headless')
driver = webdriver.Firefox(options=fireFoxOptions)
driver.get('https://www.google.co.uk')
# Chrome, headless
from selenium import webdriver
from selenium.webdriver import ChromeOptions

chrome_options = ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.google.co.uk')
However, headless mode is not helpful when you want to watch the web driver work and check the code in real time. In that case, you can set the DISPLAY environment variable in the terminal.
$ export DISPLAY=:0.0
The reason this variable is needed is that you can have multiple X servers running locally, or you may wish to use a remote display. So if the DISPLAY variable is not set, your X11 apps have no idea where you want them to run.
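The two approaches can be combined so the script decides for itself. The sketch below assumes a Linux machine running X11 and treats a missing or empty DISPLAY as "no screen available"; the helper names needs_headless and firefox_options are hypothetical.

```python
import os


def needs_headless(environ=os.environ) -> bool:
    """Return True when DISPLAY is unset or empty, i.e. there is no X server to draw on."""
    return not environ.get("DISPLAY")


def firefox_options():
    # Imported here so needs_headless() works even without Selenium installed
    from selenium.webdriver import FirefoxOptions

    options = FirefoxOptions()
    if needs_headless():
        # No display available, so fall back to headless mode
        options.add_argument('--headless')
    return options


# Example: driver = webdriver.Firefox(options=firefox_options())
```

With this in place, the same script opens a visible window on a desktop and falls back to headless mode on a machine without a display.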
Now everything I need for the project is set up. The next step is building a demo scraper.