Architecture




Camoogle++

At a high level, our webcam search engine extracts candidate webcam pages through a seeded web crawl. The crawled links are fed into a processor that determines whether each page contains a valid webcam. The results are then passed to an indexer, so that the contents of the pages can be searched through a front-end interface.

The figure below shows a basic outline of Camoogle++'s main components.

Major Components:

Web crawling and seeding

We use Google's web search API to generate a list of URL seeds to start our crawl. The list of seeds is sent to the URL control server, which distributes the seeds to the web crawler clients.

Unlike most other groups, we did not use the default web crawler, Nutch. Instead, we used an open-source crawler called Heritrix. The reason was simplicity: since we only needed the URL links as the result of the crawl, Heritrix was much easier to run and manage. For our crawl, we ran three or four Heritrix crawler clients in parallel. Our heuristic for a possible webcam page was that it contained the string "cam" in its text and had at least one JPEG image. This heuristic was fairly broad, so the crawled pages were not stored; only the links were sent back to the URL control server for further processing.
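The candidate-page heuristic described above can be sketched as follows. This is a minimal illustration in Python, not the project's code (the original modules were written in Java, and the check ran inside Heritrix). The function name and the regular expressions are assumptions made for the example.

```python
import re

# Assumed pattern: an <img> tag whose src value ends in .jpg or .jpeg.
JPEG_IMG = re.compile(
    r'<img\b[^>]*\bsrc\s*=\s*["\']?[^"\'\s>]+\.jpe?g\b', re.IGNORECASE
)
# Crude tag stripper, used to approximate the page's visible text.
TAG = re.compile(r'<[^>]+>')


def is_candidate_webcam_page(html: str) -> bool:
    """Return True if the page text contains "cam" and the page has a JPEG image."""
    text = TAG.sub(' ', html).lower()
    return 'cam' in text and JPEG_IMG.search(html) is not None
```

Because "cam" is matched as a substring, words such as "became" also pass, which is consistent with the heuristic being deliberately broad.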

Image Processing

Instead of using a machine learning system to classify webcams, we went with an image processing approach. The idea is to analyze the pixel data and aspect ratio of each image, and then classify it as a webcam or not. By downloading the image at different intervals and comparing the two versions, we were able to determine whether the image had changed. We planned a filter that would reject images whose pixel data changed too much, in the hope of filtering out advertisements, but a simpler filter based on aspect ratios (explained below) appears to have solved this problem completely, so the pixel-change filter was never implemented.
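The change test can be illustrated with a short sketch. It assumes each snapshot has already been decoded into a flat sequence of pixel values of equal length; the function names and the `min_fraction` parameter are hypothetical, and the text does not say how the original comparison was thresholded.

```python
def changed_fraction(pixels_a, pixels_b):
    """Fraction of pixel positions whose values differ between two snapshots."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("snapshots must have the same number of pixels")
    if not pixels_a:
        return 0.0
    differing = sum(1 for a, b in zip(pixels_a, pixels_b) if a != b)
    return differing / len(pixels_a)


def has_changed(pixels_a, pixels_b, min_fraction=0.0):
    """True if more than min_fraction of the pixels differ (assumed threshold)."""
    return changed_fraction(pixels_a, pixels_b) > min_fraction
```

With the default threshold of zero, any differing pixel counts as a change; a small positive threshold would tolerate JPEG re-encoding noise.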

A very important heuristic used in image processing is the aspect ratio. There are thousands of advertisements with images that update; however, they are normally banner ads, which have quite different aspect ratios from webcams. By using a ratio filter, we can eliminate most non-webcam images during image processing. The maximum aspect ratio allowed was 2.5, and no advertisements were found to be labeled as webcams.
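A minimal sketch of such a ratio filter follows. The 2.5 limit comes from the text; defining the ratio as the longer side divided by the shorter side (so that tall "skyscraper" ads are rejected as well as wide banners) is an assumption, as are the function names.

```python
MAX_ASPECT_RATIO = 2.5  # limit reported in the text


def aspect_ratio(width: int, height: int) -> float:
    """Longer side divided by shorter side (assumed symmetric definition)."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return max(width, height) / min(width, height)


def passes_ratio_filter(width: int, height: int,
                        max_ratio: float = MAX_ASPECT_RATIO) -> bool:
    """True if the image is close enough to a typical webcam frame shape."""
    return aspect_ratio(width, height) <= max_ratio
```

For example, a 640x480 frame has a ratio of about 1.33 and passes, while a 468x60 banner has a ratio of 7.8 and is rejected.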

The image processors are able to run in parallel on multiple computers. The URL control server sends each processor a list of URLs to process. Each image was compared with updated versions at 2 minutes, 10 minutes, and again at half an hour. By using a pipelined system, we were able to process about 1,000 web pages per hour per machine. The status of each image on a webpage was sent back to the URL control server to form the list of pages marked as webcams. The list of webcams is passed to the Thumbnail generator, which archives a set of thumbnail images every two hours. The list is also passed to the Lucene indexer, where the pages are indexed. The stored information is then used by the location processor to estimate each webcam's location.

Location Processing

During the indexing process an attempt is made to localize each webcam. The indexer parses the HTML of the webcam page around the link to the webcam, the URL of the webcam page, and the URL of the JPEG. We assembled a large body of location data by crawling http://www.fallingrain.com/world, which yielded an index of 700,000 cities across North America and Europe, with location and population information. Using city and country names parsed from the page, as well as country codes from the URLs, it becomes possible to form hypotheses about where the webcam is located.

Each city name found in the HTML or URL associated with the webcam is given a weight. The weight is determined by where the city name appears: an extremely high weight is given if the city name is in the URL or the title of the webpage. Otherwise, a weight is assigned based on the distance from the link to the webcam to the city name. Originally a simple metric counting the characters between the two was used, but it gave poor results. Many pages in our crawl linked to tens of webcams, and the algorithm would become confused.

A revised metric was implemented that tries to better match the distances perceived when viewing the rendered HTML. Initially the distance between the link to the webcam and the city name is assigned as in the original algorithm. The algorithm then looks between the two for any of a subset of HTML tokens that greatly affect how the HTML is visually rendered. The tokens used are {<td>, <tr>, <table>, <div>, <p>, <br>}, as well as their corresponding closing tokens. Very large amounts are added to the distance metric for each instance of any of these tokens, with larger amounts assigned to tokens that have a greater effect on the rendered HTML: for example, an instance of <td> is weighted less heavily than an instance of <table>.
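The revised metric can be sketched as below. The token set comes from the text, but the penalty values are hypothetical: the text states only that they are very large and that <table> costs more than <td>. The function name and the use of character offsets as inputs are also assumptions.

```python
import re

# Hypothetical penalties, ordered by how strongly the tag separates rendered content.
TOKEN_PENALTY = {
    "br": 200, "p": 300, "td": 500, "tr": 1000, "div": 1500, "table": 5000,
}

# Matches opening and closing forms of the layout tokens listed in the text.
TOKEN_RE = re.compile(r'</?\s*(td|tr|table|div|p|br)\b[^>]*>', re.IGNORECASE)


def rendered_distance(html: str, link_pos: int, city_pos: int) -> int:
    """Character distance between link and city name, plus layout-token penalties."""
    start, end = sorted((link_pos, city_pos))
    distance = end - start  # the original character-count metric
    for match in TOKEN_RE.finditer(html[start:end]):
        distance += TOKEN_PENALTY[match.group(1).lower()]
    return distance
```

A city name in the same table cell as the link keeps a small distance, while one separated by a cell or table boundary is pushed far away, which helps on pages that list many webcams.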

After all cities have been assigned weights, a region is assigned to the webcam page. The region is determined by summing the weights of all the cities that reside in a general region and choosing the region with the highest total. This is made somewhat difficult by the many cities that share a name, so the weight of each city is scaled in proportion to the population of that specific city in its region. This allows Paris, France to be weighted more highly than Paris, Texas.

The final city name selected for the webcam is the highest-weighted city that resides in the selected region. In some cases this set is empty and only a region has been determined. Note also that we do not allow the algorithm to localize a webcam only to 'United States'; it must instead localize to an individual state.
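The region and city selection can be sketched as follows. The text does not give the exact scaling formula, so this example assumes each city's weight is multiplied by its share of the total population of all candidate cities with the same name. The `region_hints` parameter (region weights from sources other than city names, such as a country code in the URL) and all names here are hypothetical.

```python
from collections import defaultdict


def choose_region_and_city(candidates, region_hints=None):
    """candidates: list of (city, region, weight, population) tuples.

    Returns (region, city); city is None when no candidate city lies in the
    winning region, and both are None when there is no evidence at all.
    """
    # Total population per city name, used to split weight among namesakes.
    name_population = defaultdict(int)
    for city, _region, _weight, population in candidates:
        name_population[city] += population

    scaled = []
    for city, region, weight, population in candidates:
        total = name_population[city]
        share = population / total if total else 0.0
        scaled.append((city, region, weight * share))

    # Sum the scaled weights per region, starting from any region-only hints.
    region_total = defaultdict(float, region_hints or {})
    for _city, region, weight in scaled:
        region_total[region] += weight
    if not region_total:
        return None, None

    best_region = max(region_total, key=region_total.get)
    in_region = [(w, city) for city, region, w in scaled if region == best_region]
    best_city = max(in_region)[1] if in_region else None
    return best_region, best_city
```

With two equally weighted "Paris" candidates, the one with the much larger population receives almost all of the weight, so the French region wins unless other cities add substantial weight to Texas.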

Thumbnail Generator and Image Lookup System

Thirteen thousand webcams have been identified, so each time the thumbnail generator produces a set of images, thirteen thousand new images are created for that snapshot of history. Since we archive a set of thumbnails every two hours, the number of files grows very quickly. To avoid the problem of having too many files on a UNIX file system, we implemented the Image Lookup System. Every two hours, a custom utility program compresses all the current thumbnail images into one Thumbnail Image (TNI) file and archives the file in a reachable location. A corresponding summary file records the location of each image within the compressed file. By looking up the summary file, we can easily extract the needed thumbnail when the user requests to view it. For simplicity and speed, we also keep an unarchived copy of the latest image from each webcam.
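The archive-plus-summary idea can be illustrated with the sketch below. The actual TNI layout is not described in the text, so this example assumes the simplest possible one: image bytes concatenated into a single file, with a JSON summary mapping each webcam identifier to an (offset, length) pair. The function names and the JSON format are hypothetical.

```python
import io
import json


def pack_thumbnails(thumbnails: dict, archive: io.BufferedIOBase) -> str:
    """Write all thumbnails into one archive stream; return the summary as JSON.

    thumbnails maps a webcam id to the raw bytes of its thumbnail image.
    """
    index = {}
    for cam_id, data in thumbnails.items():
        index[cam_id] = [archive.tell(), len(data)]  # where this image starts, and its size
        archive.write(data)
    return json.dumps(index)


def read_thumbnail(cam_id: str, archive: io.BufferedIOBase, summary: str) -> bytes:
    """Extract one thumbnail using the offset and length stored in the summary."""
    offset, length = json.loads(summary)[cam_id]
    archive.seek(offset)
    return archive.read(length)
```

In use, `archive` would be a file opened in binary mode (one per two-hour snapshot) and the summary would be saved alongside it, so a single seek retrieves any thumbnail without creating thousands of small files.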

Web Front End

Since all the modules we built are coded in Java, it was logical to create a Java-enabled front end. We used Jakarta Tomcat to develop and run the active JSP web pages. Two notable features of the front end are the compiled GIF generator and the map search. The GIF module uses the archived thumbnail images to generate an animated GIF of the webcam's gathered history. The map search allows the user to search by location on a map; it uses the calculated location data to determine where each webcam should be marked. The user simply clicks on a location on the map to find webcams near the selected location.
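The "webcams near the selected location" lookup might be sketched as below. The text does not say how nearness is computed, so this example assumes each webcam has an estimated latitude and longitude and ranks webcams by great-circle (haversine) distance from the clicked point. It is written in Python for brevity, whereas the original front end used JSP, and all names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def webcams_near(click_lat, click_lon, webcams, limit=10):
    """webcams: iterable of (name, lat, lon); returns the closest `limit` entries."""
    ranked = sorted(
        webcams,
        key=lambda cam: haversine_km(click_lat, click_lon, cam[1], cam[2]),
    )
    return ranked[:limit]
```

A linear scan like this is adequate for roughly thirteen thousand webcams; a spatial index would only matter at a much larger scale.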