Finding Websites to Parse
Parsing cameras for the database is not an easy task. There are many things to consider when looking at a website and deciding how to parse it. There are seven major pieces of information to pull from the website for each camera: Latitude, Longitude, City, State (USA only), County, Snapshot URL, and Description. To get this information we will take advantage of several tools.
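As a rough sketch of what one parsed camera might look like in Python, the seven fields can be held in a small dataclass. The class name `CameraRecord`, the field names, and the range checks are assumptions made for illustration and are not the database's actual schema; the coordinates and description in the example are placeholders, not the real location of that camera.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CameraRecord:
    """One parsed camera (hypothetical layout, not the real database schema)."""
    latitude: float
    longitude: float
    city: str
    county: str
    snapshot_url: str
    description: str
    state: Optional[str] = None  # only filled in for cameras in the USA

    def __post_init__(self):
        # Reject coordinates that cannot be real before they go any further.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")


# Placeholder values for illustration only.
example = CameraRecord(
    latitude=40.7128,
    longitude=-74.0060,
    city="New York",
    county="New York",
    snapshot_url="http://207.251.86.238/cctv693.jpg",
    description="Placeholder description",
    state="NY",
)
print(asdict(example))
```

Keeping the fields in one structure makes it easy to spot a site that cannot supply all of them before any parsing code is written.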
The most important tool we use is Python. If you are unfamiliar with Python, take some time to get used to the syntax. One easy way to learn it is to read through the Python documentation, or you can use a tutorial such as Codecademy.
The next tools to familiarize yourself with are BeautifulSoup and Selenium. Both are Python modules that allow you to access and manipulate webpages based on the source HTML. The main difference between them is that BeautifulSoup works only on the page source, without executing scripts or rendering the page, whereas Selenium fully renders the website in a browser. BeautifulSoup is much faster if the website contains only static elements (no JavaScript); otherwise it is best to use Selenium. If you wish to run a full-featured headless browser environment, you may also consider PhantomJS, which works well with BeautifulSoup.
You may also want to review the re and json Python modules, as they can be helpful. Finally, if you are unfamiliar with Bash, HTML, CSS, and JavaScript, it is recommended that you gain at least a basic familiarity with their structure and syntax.
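To give a feel for where the re module helps, here is a minimal sketch that pulls JPEG URLs out of raw page source, including URLs that sit inside a script rather than in an img tag. The page source, the `example.com` URLs, and the `rand` parameter are all made up for illustration; real sites will need their own pattern.

```python
import re

# Made-up page source; real sites will differ.
page_source = """
<script>
  var cams = ["http://example.com/cams/cam12.jpg?rand=0.42",
              "http://example.com/cams/cam7.JPG"];
</script>
<img src="http://example.com/logo.png">
"""

# Absolute URLs whose path ends in .jpg or .jpeg, with an optional query string.
IMAGE_URL = re.compile(
    r"""https?://[^\s"'<>?]+\.jpe?g(?:\?[^\s"'<>]*)?""",
    re.IGNORECASE,
)

image_urls = IMAGE_URL.findall(page_source)
print(image_urls)
```

The pattern skips the PNG logo and keeps both camera URLs, one of them with its query string still attached.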
The first step is finding a website to parse by searching for web cameras using Google. Once you find a website that has a substantial number of cameras on it, check that the webcam stream is published in a way that we can interpret.
The camera must be a stream of images: a collection of JPEG images (or images in another static format) published to a specific URL and updated intermittently. You must be able to isolate the URL on which the image is published. You should also look over the website to see whether it makes any copyright claim on the images and what the site's fair use policy is. To isolate the image data URL, go to the page on the site that shows the newest image from the webcam. The example below is taken from the NYC traffic camera website. When we click on a camera on the map, a popup window appears.
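One way to confirm that a candidate URL really serves a static image, rather than an HTML page or a video stream, is to look at the first bytes of the response. The sketch below assumes you have already downloaded the bytes by some means; the function name `detect_image_format` is hypothetical, and only JPEG, PNG, and GIF signatures are checked.

```python
def detect_image_format(data: bytes):
    """Return 'jpeg', 'png', or 'gif' based on the leading bytes, else None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    return None


# In-memory stand-ins for downloaded responses.
print(detect_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # JPEG-like bytes
print(detect_image_format(b"<html><body>popup</body></html>"))  # an HTML page
```

If the result is None for a URL you expected to be an image, the URL most likely points at a wrapper page and not at the image data itself.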
The URL of this popup is not the actual URL on which the image is published, because there is text on the page (highlighted in red), so it is not the URL that we want. To find the link to the image data we need to view the page source. Try Ctrl+Shift+I or F12 to open the developer tools in Chrome or Firefox. We want to find the image in the page HTML, so we look for the img tag.
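The same search for the img tag can be done in code. The sketch below uses the standard library's html.parser as a stand-in so that it runs without extra installs; with BeautifulSoup the equivalent search would be `soup.find_all("img")`. The popup HTML is a made-up approximation, not the real page source, and the class name `ImgSrcCollector` is hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up popup markup containing some text and one camera image.
popup_html = """
<html><body>
  <p>Camera caption text</p>
  <img id="camera" src="http://207.251.86.238/cctv693.jpg">
</body></html>
"""

collector = ImgSrcCollector()
collector.feed(popup_html)
print(collector.sources)
```

On a real page there may be several img tags (logos, icons), so the collected list usually needs to be filtered down to the camera image.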
Navigating to the src URL (http://207.251.86.238/cctv693.jpg), we find the address of the image data. The number on the end of the URL doesn't appear to be needed to access the image. Make sure to refresh the image and see that it changes, to verify that the URL doesn't change with time.
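The refresh check can also be scripted. This is a minimal sketch that assumes the extra number is appended as a query string (the `rand` parameter below is hypothetical) and that the image can be fetched with a plain HTTP request; all function names are illustrative. It fetches the same URL twice and compares hashes of the two downloads.

```python
import hashlib
import time
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, e.g. a cache-busting number."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def fetch(url):
    """Download the raw bytes published at url."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def image_changed(first, second):
    """True if two downloads differ in content."""
    return hashlib.sha256(first).digest() != hashlib.sha256(second).digest()


def check_snapshot_url(url, wait_seconds=30, fetcher=fetch):
    """Fetch the same URL twice and report whether the image content changed."""
    first = fetcher(url)
    time.sleep(wait_seconds)
    second = fetcher(url)
    return image_changed(first, second)


print(strip_query("http://example.com/cam.jpg?rand=0.123"))
```

A call such as `check_snapshot_url(strip_query(src_url))` returning True suggests the fixed URL keeps serving fresh images. The right `wait_seconds` depends on how often the site updates its cameras, so a False result after a short wait is not conclusive.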
Before moving on, you should also make sure there is a good description of the image and its location. If the site has a map of cameras plotted using the Google Maps API, the location information is likely contained in a JSON file. We will discuss methods for extracting the location information in Part 3.
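As a preview of what reading such a file can look like, here is a small sketch using the json module. The structure of the marker list and the key names `lat`, `lng`, and `title` are assumptions; every site names and nests these fields differently, so inspect the actual file first.

```python
import json

# Made-up marker data; coordinates may arrive as strings or as numbers.
markers_json = """
[
  {"id": 1, "lat": "40.7", "lng": "-74.0", "title": "Placeholder camera A"},
  {"id": 2, "lat": 40.8, "lng": -73.9, "title": "Placeholder camera B"}
]
"""


def parse_markers(text):
    """Turn a JSON list of map markers into latitude/longitude/description dicts."""
    cameras = []
    for marker in json.loads(text):
        cameras.append({
            "latitude": float(marker["lat"]),    # float() accepts both forms
            "longitude": float(marker["lng"]),
            "description": marker["title"],
        })
    return cameras


cameras = parse_markers(markers_json)
for camera in cameras:
    print(camera)
```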
Now that you have checked that the site satisfies all the above conditions, check the database to make sure that the site is not already indexed. To do this, refer to the Navigating the Database page.
Now you are ready to parse the site. To continue this example, go to Parsing Example 1.
©️ 2016 Cam2 Research Group