Parsing Example 2
This example will walk you through how to parse a website using Selenium. Selenium allows you to open a webpage and automate the process of clicking through a website and extracting information. Selenium is the preferred choice over Beautiful Soup when parsing a website that requires navigating through menus or loading new web pages. Basically, if the website requires you to click elements on the page or enter information in order to get to the information you want, you should use Selenium. Parsing a website requires you to be at least a little familiar with HTML, so brush up on that here if you aren't comfortable reading HTML. Please note that every website is different and there is no universal way to parse websites. This wiki page will guide you through parsing one website, and hopefully this will teach you the tools you need to start parsing other websites.
I always start by clicking through the website and making a mental note of the structure. Where do I want to start the process? How do I get from there to all the cameras I need to get to? Once I'm on a page that has cameras to parse, can I get to another page of cameras from there or do I have to go back to the start point? Come up with a plan as to how you want to navigate the site.
For this example we will be parsing the 511 Alberta Site. This site has a dropdown menu to select different highways.
Each option will have a different set of cameras to choose from. Then you need to click each camera box to pull up a new page that contains different camera angles for the chosen camera. Each camera page looks something like this:
With alternate angles on the left hand side or something like this, with alternate angles on the bottom:
Then you'll need to go back one page to return to the camera options and click the next camera. Once you've gone through each camera you'll need to choose the next option on the dropdown. This process repeats itself until you've navigated through the whole website. Once you have a plan for navigation you are ready to start coding.
The first thing you'll need to do is copy the script header into your file and fill it out. This helps document your code and makes it easy for someone else to take a look and understand what your code is doing. Next you'll want to import the Python modules you will be using. We will be using the following import statements for this script:
from selenium import webdriver
from selenium.webdriver.support.select import Select
import urllib
from Geocoding import Geocoding
from WriteToFile import WriteToFile
import time
Please note that you will need to install the geopy module (pip install geopy) and copy the Geocoding.py script from GitHub.
Now create a function to perform the navigation. We'll call this "Navigate".
def Navigate():
    pass  # placeholder; the navigation code is filled in over the following steps

if __name__ == "__main__":
    Navigate()
The line "if __name__ == '__main__':" makes Navigate() run only when this file is executed directly as a program, and not when it is imported from another file. You should use this whenever you are writing a script, as it makes your code more reusable.
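To see what the guard does, here is a minimal sketch that runs without Selenium. The `navigate` stand-in and its return value are made up for illustration; the real function would drive the browser.

```python
def navigate():
    """Hypothetical stand-in for Navigate(); returns a message instead of opening a browser."""
    return "navigation finished"


if __name__ == "__main__":
    # Runs only when this file is executed directly (python myscript.py),
    # not when another file does `import myscript`.
    print(navigate())
```

Another script can then import this file and call `navigate()` itself without triggering the run at import time.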
Next we will open a web browser and go to the webpage. We do this with the following lines of code.
driver = webdriver.Firefox()
driver.get("http://511.alberta.ca/cameras/")
At this point your code should look like this:
from selenium import webdriver
from selenium.webdriver.support.select import Select
import urllib
from Geocoding import Geocoding
from WriteToFile import WriteToFile
import time

def Navigate():
    driver = webdriver.Firefox()
    driver.get("http://511.alberta.ca/cameras/")
    driver.close()

if __name__ == "__main__":
    Navigate()
Now you'll need to inspect the HTML of the webpage and find the HTML data for the dropdown menu. Right click the dropdown in your web browser and click "inspect". A new window will open up with the dropdown element highlighted. Shown below:
In this case it's made easy for us, as the dropdown has an id that we can use to locate it on the page. We will use the following line of code to find the dropdown menu and wrap it in a Select object, which lets us choose its options:
option = Select(driver.find_element_by_id("highway_dropdown"))
Selenium allows you to find elements on the page through several different methods. In this case we used find_element_by_id. The ways to find elements through Selenium are as follows:
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector
Sometimes more than one element on a page matches what you're looking for. If you want a list of all elements on the page that match your query, just add an 's' to the end of 'element'. For example, if you want a list of all elements with id = "highway_dropdown", you would use 'driver.find_elements_by_id("highway_dropdown")'.
This website is a really helpful resource for understanding how to use these functions to find elements on a page using Selenium.
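One caveat if you are on a newer Selenium release: in the 4.x series the find_element_by_* helpers were deprecated and later removed. The same lookups are written as driver.find_element(By.ID, "highway_dropdown"), with By imported from selenium.webdriver.common.by. The helper below is not part of Selenium. It is a small, hypothetical translator that shows how each legacy name maps to the newer form, and it only builds strings so that it runs without a browser.

```python
def modern_call(legacy_name, value):
    """Return the Selenium 4 style call for a legacy find_element(s)_by_* name."""
    prefix, sep, strategy = legacy_name.partition("_by_")
    if prefix not in ("find_element", "find_elements") or not sep or not strategy:
        raise ValueError("not a legacy finder name: %r" % legacy_name)
    # By.ID, By.TAG_NAME, By.CSS_SELECTOR, ... mirror the old method suffixes.
    return "%s(By.%s, %r)" % (prefix, strategy.upper(), value)


print(modern_call("find_element_by_id", "highway_dropdown"))
print(modern_call("find_elements_by_tag_name", "option"))
```

The rest of this page keeps the older method names, so read each one through this mapping if your Selenium version no longer has them.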
Next we want to know how many options there are on the dropdown so that we can write a while loop that iterates through all of them. I tried to do this with a for loop instead, but every time the page changes you need to find the dropdown element again, so we loop by index. If you look at the HTML above, the options all have a tag name of "option", so we will use this to get a list of all elements with the tag name "option".
numOption = len(driver.find_elements_by_tag_name("option"))
countOption = 0
while countOption < numOption:
    option = Select(driver.find_element_by_id("highway_dropdown"))
    option.select_by_index(countOption)
    time.sleep(1.5)
    countOption += 1
The above code will find the number of options on the dropdown, click the first option, wait 1.5 seconds to allow the page to load, click the second option, wait 1.5 seconds to allow the page to load and so on until all options have been clicked. Now that you've successfully navigated through all the dropdown menu options we will now navigate through each camera option on each page.
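The reason for looking the dropdown up again on every pass can be shown with a small simulation. `FakePage` below is a hypothetical stand-in, not Selenium: its handles become invalid after each "reload", which is roughly how Selenium behaves when it reports a stale element. The index-based loop survives because it fetches fresh handles each time.

```python
class FakePage:
    """Hypothetical stand-in for a page whose elements go stale after each reload."""

    def __init__(self, labels):
        self._labels = list(labels)
        self._generation = 0  # bumped every time the page "reloads"

    def find_options(self):
        # Each handle remembers which page load it came from.
        return [(self._generation, label) for label in self._labels]

    def select(self, handle):
        generation, label = handle
        if generation != self._generation:
            raise RuntimeError("stale element: the page has reloaded")
        self._generation += 1  # selecting an option reloads the page
        return label


def visit_all(page):
    """Index-based loop that looks the options up again on every pass."""
    visited = []
    count = len(page.find_options())
    index = 0
    while index < count:
        options = page.find_options()  # fresh handles after each reload
        visited.append(page.select(options[index]))
        index += 1
    return visited


print(visit_all(FakePage(["Hwy 1", "Hwy 2", "Hwy 3"])))
```

Iterating directly over a list fetched once before the loop would fail on the second item in this simulation, since every handle in that list predates the first reload.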
Inspect a camera option on the page, as shown below.
You may notice the class is "thumbnail thumbnail-camera". You can either find this element by class name or by css selector. Each camera box on this page has that same class and we want to be able to click each of them so we want a list of all elements that match this description.
cameras = driver.find_elements_by_css_selector(".thumbnail.thumbnail-camera")
Now we want to click the first camera box, do what we need to do, go back, click the next camera box and so on until we've clicked through them all.
numCam = len(cameras)
countCam = 0
while countCam < numCam:
    cameras = driver.find_elements_by_css_selector(".thumbnail.thumbnail-camera")
    if cameras[countCam].is_displayed():
        cameras[countCam].click()
        time.sleep(1)
        driver.back()
        time.sleep(1)
    countCam += 1
This counts the number of camera boxes on the page. The camera boxes under other dropdown menu options are also included in this count; they are just hidden. Then we iterate through each camera box and check whether it's displayed. If it is, we click on it, wait 1 second for the page to load, go back, and wait another second. Finally we increment the count and repeat until all camera boxes have been visited.
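Fixed pauses such as time.sleep(1) can be too short on a slow connection and wasteful on a fast one. A possible alternative is to poll until the page is ready. The `wait_until` helper below is a hypothetical, Selenium-free sketch of that idea; Selenium ships a similar tool, WebDriverWait in selenium.webdriver.support.ui, which you may prefer in real scripts.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Example: a fake "page" that becomes ready on the third check.
checks = {"count": 0}

def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(page_ready, timeout=2, poll=0.01)
print("ready after", checks["count"], "checks")
```

In a Selenium script the condition could be a lambda that returns the list of camera boxes once it is non-empty.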
The last step in navigation is to click through each of the camera angles if there is more than one. As shown above, the camera angles can be displayed either on the left hand side or underneath the currently selected picture. Both variations have their own class name/css selector, so we will have to check for both.
countAngles = 0
try:
    thumbnail = driver.find_elements_by_css_selector(".thumbnail")
    if(len(thumbnail) == 0):
        raise Exception('one image')
    for angles in thumbnail:
        countAngles += 1
        angles.click()
        time.sleep(1)
        #Get info
except:
    try:
        thumbnail = driver.find_elements_by_css_selector(".thumbnail.thumbnail-horizontal")
        if(len(thumbnail) == 0):
            raise Exception('one image')
        for angles in thumbnail:
            countAngles += 1
            angles.click()
            time.sleep(1)
            #Get info
    except:
        if (countAngles == 0):
            #Only one image, get info
            pass
The above code will try to find camera angles on the left hand side of the page. If there are no camera angles on the left hand side it will check for camera angles on the bottom of the page. If there aren't any there either then this page only has one camera, so get its information. If there are camera angles on either the left hand side or on the bottom, click through each of them and get the camera information. Now we've gone through each part of the navigation aspect of the code so let's put it all together.
def Navigate():
    driver = webdriver.Firefox()
    driver.get("http://511.alberta.ca/cameras/")
    option = Select(driver.find_element_by_id("highway_dropdown"))
    numOption = len(driver.find_elements_by_tag_name("option"))
    countOption = 0
    while countOption < numOption:
        option = Select(driver.find_element_by_id("highway_dropdown"))
        option.select_by_index(countOption)
        time.sleep(1.5)
        cameras = driver.find_elements_by_css_selector(".thumbnail.thumbnail-camera")
        numCam = len(cameras)
        countCam = 0
        while countCam < numCam:
            cameras = driver.find_elements_by_css_selector(".thumbnail.thumbnail-camera")
            if cameras[countCam].is_displayed():
                cameras[countCam].click()
                time.sleep(1)
                countAngles = 0
                try:
                    thumbnail = driver.find_elements_by_css_selector(".thumbnail")
                    if(len(thumbnail) == 0):
                        raise Exception('one image')
                    for angles in thumbnail:
                        countAngles += 1
                        angles.click()
                        time.sleep(1)
                        #Get info
                except:
                    try:
                        thumbnail = driver.find_elements_by_css_selector(".thumbnail.thumbnail-horizontal")
                        if(len(thumbnail) == 0):
                            raise Exception('one image')
                        for angles in thumbnail:
                            countAngles += 1
                            angles.click()
                            time.sleep(1)
                            #Get info
                    except:
                        if (countAngles == 0):
                            #Only one image, get info
                            pass
                driver.back()
                time.sleep(1)
            countCam += 1
        countOption += 1
    driver.close()
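The nested try/except used for the camera angles raises an exception just to move on to the next selector. The same fallback can be sketched without exceptions, as below. This is an alternative, not what the script above does: `pick_angle_thumbnails` is a hypothetical helper, and `find_all` stands in for a lookup such as driver.find_elements_by_css_selector.

```python
def pick_angle_thumbnails(find_all, selectors):
    """Return the matches for the first selector that finds anything, else []."""
    for selector in selectors:
        matches = find_all(selector)
        if matches:
            return matches
    return []  # no thumbnails at all: the page shows a single image


# Fake lookup standing in for the browser. It matches keys exactly,
# whereas a real CSS engine may match more broadly.
fake_page = {".thumbnail.thumbnail-horizontal": ["angle 1", "angle 2"]}
angles = pick_angle_thumbnails(
    lambda selector: fake_page.get(selector, []),
    [".thumbnail", ".thumbnail.thumbnail-horizontal"],
)
print(angles)
```

An empty result then means "only one image", which replaces the countAngles bookkeeping.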
Now that you've successfully navigated through the website it's time to extract the data and print it to your output file. We're going to write a new function that will be called in each of the places marked by the comment "Get info" in the Navigate function. We will also have to put the following 2 lines near the top of the navigate function outside of any loop:
coords = Geocoding('Google', None)
file = WriteToFile(False, 'list_Alberta_CA.txt')
These initialize the file and geocoding classes for use in the GetInfo function we are about to write. Now create a new function to perform your data extraction.
def GetInfo(driver, coords, file):
Let's take a look at the page and see where we need to get our information from.
Looks like the location is under the class "panel-title" with the "h4" tag. We can locate this element through the xpath.
location = driver.find_element_by_xpath("//div[@class = 'panel-title']/h4").text
city = driver.find_element_by_xpath("//div[@class = 'panel-title']/h4/small").text
location = location.replace(city, "")
city = city.replace("Near", "")
Take a look at the documentation to better understand how XPath works. In short, this stores the entire heading text in 'location' and the small text in 'city', then removes the small text from 'location' and the word "Near" from 'city'. Now you have two separate variables: one with the location (big text) and one with the city minus "Near" (small text).
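The string handling can be tried on its own, without a browser. The heading text below is invented for illustration, and the `.strip()` calls are an addition that the original lines do not have; they remove the spaces left behind by the replacements.

```python
def split_heading(full_text, small_text):
    """Split a heading into (location, city) using the same replace steps, plus strip()."""
    location = full_text.replace(small_text, "").strip()
    city = small_text.replace("Near", "").strip()
    return location, city


# Hypothetical heading: big text followed by the small "Near ..." text.
print(split_heading("Hwy 1 at Example Rd Near Calgary", "Near Calgary"))
```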
Now we need to extract the URL of the image. Again looking at the HTML above we see that the URL of the image is found under the id 'displayImageContainer' and tag name 'img'. We will use the following line to get the URL.
url = urllib.quote(driver.find_element_by_xpath("//div[@id = 'displayImageContainer']/img").get_attribute("src"), safe = ':?,=/&')
We first find the element through the xpath, then get its "src" attribute, which is the actual link. Finally, urllib.quote percent-encodes any characters that are not valid in a URL, and the 'safe' argument lists the characters to leave as they are rather than encode.
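The effect of the 'safe' argument is easy to check in isolation. The sketch below uses Python 3, where the function lives at urllib.parse.quote (urllib.quote is the Python 2 name). The URL is a made-up example.

```python
from urllib.parse import quote  # Python 3 location of urllib.quote

raw = "http://example.com/cam images/1.jpg?id=5&t=a b"

# With the safe list from this page, only the spaces are encoded.
encoded = quote(raw, safe=":?,=/&")
print(encoded)

# With the default safe="/", the URL punctuation is encoded as well.
print(quote(raw))
```

Without the safe list, characters such as ':' and '?' are percent-encoded too, and the result no longer works as a link.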
Next we need to find the latitude and longitude coordinates.
try:
    coords.locateCoords(location, city, "", "CA")
    file.writeInfo("CA", "", coords.city, url, coords.latitude, coords.longitude)
except:
    pass
This section of code will attempt to find the coordinates for the location and city in Canada (CA). Since Canada doesn't have states, we pass an empty string "" where the state field would normally go. If coordinates can be found, the country, city, URL, and coordinates are written to the file. If they cannot be found, nothing happens.
Now all that's left is to put all these pieces together. Your code should look something like this:
from selenium import webdriver
from selenium.webdriver.support.select import Select
import urllib
from Geocoding import Geocoding
from WriteToFile import WriteToFile
import time

def Navigate():
    driver = webdriver.Firefox()
    driver.get("http://511.alberta.ca/cameras/")
    coords = Geocoding('Google', None)
    file = WriteToFile(False, 'list_Alberta_CA.txt')
    option = Select(driver.find_element_by_id("highway_dropdown"))
    numOption = len(driver.find_elements_by_tag_name("option"))
    countOption = 0
    while countOption < numOption:
        option = Select(driver.find_element_by_id("highway_dropdown"))
        option.select_by_index(countOption)
        time.sleep(1.5)
        cameras = driver.find_elements_by_css_selector(".thumbnail.thumbnail-camera")
        numCam = len(cameras)
        countCam = 0
        while countCam < numCam:
            cameras = driver.find_elements_by_css_selector(".thumbnail.thumbnail-camera")
            if cameras[countCam].is_displayed():
                cameras[countCam].click()
                time.sleep(1)
                countAngles = 0
                try:
                    thumbnail = driver.find_elements_by_css_selector(".thumbnail")
                    if(len(thumbnail) == 0):
                        raise Exception('one image')
                    for angles in thumbnail:
                        countAngles += 1
                        angles.click()
                        time.sleep(1)
                        GetInfo(driver, coords, file)
                except:
                    try:
                        thumbnail = driver.find_elements_by_css_selector(".thumbnail.thumbnail-horizontal")
                        if(len(thumbnail) == 0):
                            raise Exception('one image')
                        for angles in thumbnail:
                            countAngles += 1
                            angles.click()
                            time.sleep(1)
                            GetInfo(driver, coords, file)
                    except:
                        if (countAngles == 0):
                            GetInfo(driver, coords, file)
                driver.back()
                time.sleep(1)
            countCam += 1
        countOption += 1
    driver.close()

def GetInfo(driver, coords, file):
    location = driver.find_element_by_xpath("//div[@class = 'panel-title']/h4").text
    city = driver.find_element_by_xpath("//div[@class = 'panel-title']/h4/small").text
    location = location.replace(city, "")
    city = city.replace("Near", "")
    url = urllib.quote(driver.find_element_by_xpath("//div[@id = 'displayImageContainer']/img").get_attribute("src"), safe = ':?,=/&')
    try:
        coords.locateCoords(location, city, "", "CA")
        file.writeInfo("CA", "", coords.city, url, coords.latitude, coords.longitude)
    except:
        pass

if __name__ == "__main__":
    Navigate()
Congrats, you have just learned how to parse a website using Selenium! Always check your output file to ensure everything looks good before you attempt to add your cameras to the database. Once you have verified it works as expected, you are ready to [add your cameras to the database](Adding Cameras to the Database).
©️ 2016 Cam2 Research Group