Web crawling:
We selected the open source web crawler Heritrix because it was simple to adapt to our purpose: only a few lines of code were needed for it to log the pages that we would later inspect as possible webcams. We made a few other changes to its behavior, since Heritrix normally archives every object it encounters, which slows its crawl. While Heritrix was easy to modify and use, it is not a fast web crawler. This limited the number of pages we could crawl, which became the largest detractor from our project.
One suggested improvement would be to use a more powerful crawler such as Nutch, which would likely increase the number of pages we could retrieve in the same amount of time.
Classifying webcams:
The precision of classifying webcam pages is another area we could improve. Some webpage content, such as weather forecast images, matched our classification heuristics: the images had the proper aspect ratio and refreshed often. These false positives initially seemed to detract from our project, but in retrospect they are interesting objects in their own right, as they now give a visual history of Doppler radar or temperature forecasts.
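The two heuristics named above can be sketched roughly as follows, which also shows why a frequently refreshed radar image passes them. This is only an illustration: the target aspect ratios, the tolerance, the minimum number of changes, and the names `ImageObservation` and `looks_like_webcam` are assumptions, not the values or code used in the project.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the report does not state the values actually used.
WEBCAM_ASPECT_RATIOS = (4 / 3, 16 / 9)
ASPECT_TOLERANCE = 0.05
MIN_CHANGES = 2


@dataclass
class ImageObservation:
    width: int
    height: int
    digests: list  # content hash of the image at each fetch, oldest first


def has_webcam_aspect_ratio(width, height):
    """True if width/height is close to one of the assumed camera ratios."""
    if height <= 0:
        return False
    ratio = width / height
    return any(abs(ratio - target) <= ASPECT_TOLERANCE
               for target in WEBCAM_ASPECT_RATIOS)


def count_changes(digests):
    """Number of consecutive fetches whose content differed."""
    return sum(1 for a, b in zip(digests, digests[1:]) if a != b)


def looks_like_webcam(obs):
    """Apply both heuristics: camera-like shape and frequent refresh."""
    return (has_webcam_aspect_ratio(obs.width, obs.height)
            and count_changes(obs.digests) >= MIN_CHANGES)


# A Doppler radar image has a camera-like shape and refreshes often,
# so the heuristics alone cannot tell it apart from a webcam.
radar = ImageObservation(width=640, height=480, digests=["a", "b", "c"])
banner = ImageObservation(width=728, height=90, digests=["a", "b", "c"])
```

Under these assumed thresholds `radar` is accepted and `banner` is rejected, which matches the kind of false positive described above.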
One possible way to improve precision is to use a learning algorithm for classifying webcams in conjunction with our image processing technique. Theoretically, this would improve precision if we keep only the results on which the two techniques agree (their intersection); taking the union would mainly raise recall instead. It is unclear, however, what the algorithm would learn from, image data or the surrounding textual data, but hopefully that is a research problem that has been at least partly solved by our classmates this quarter.
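How the results of two techniques are combined matters, because union and intersection trade precision against recall differently. The sketch below compares both on a small made-up example; the URLs, the labels, and the output of the hypothetical learned classifier are invented for illustration and are not project data.

```python
def precision(predicted, actual):
    """Fraction of predicted webcams that really are webcams."""
    return len(predicted & actual) / len(predicted) if predicted else 0.0


def recall(predicted, actual):
    """Fraction of real webcams that were predicted."""
    return len(predicted & actual) / len(actual) if actual else 0.0


# Invented ground truth and classifier outputs (assumptions for illustration).
actual_webcams = {"cam1", "cam2", "cam3"}
image_positives = {"cam1", "cam2", "radar"}     # image processing heuristics
learned_positives = {"cam2", "cam3", "banner"}  # hypothetical learned classifier

union = image_positives | learned_positives
intersection = image_positives & learned_positives

for name, result in (("union", union), ("intersection", intersection)):
    print(name,
          "precision=%.2f" % precision(result, actual_webcams),
          "recall=%.2f" % recall(result, actual_webcams))
```

In this toy case the union finds every webcam but keeps both false positives, while the intersection keeps only pages that both techniques accept and so drops the false positives along with some real webcams.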
Another concern in classifying images comes from how the URL control server manages positive results. Positive results were run through the image processor only once before being passed on to the indexer module. If a webpage changed only once during the processing period because of some superficial change in the HTML, it would have led to a false positive. Another error was that we tracked potential webcams on a page in a fragile manner: by their order on the page, i.e. 'the third jpeg on the page is a web cam'. While this is simple and quick, it can lead to false positives when the HTML is changed or a jpeg is removed: either the image appears to have changed, or what pointed to a webcam now points to something else.
An improvement would be to loop through the list of webcams periodically and check whether they are still changing. This would improve the precision of classification, but it also leads to a new challenge: many webcams do not refresh at night or can become inactive for several days. A Bayesian system could keep track of this and, in theory, could even generate models that predict when a webcam will be active and include it in the search results only then.
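One simple way such a Bayesian system might look is a Beta-Bernoulli model per hour of the day, updated on every periodic recheck. The sketch below is an assumption-laden illustration, not a design from the project: the names `HourlyActivityModel` and `recheck`, the uniform Beta(1, 1) prior, the hour-of-day granularity, and the 0.5 threshold are all hypothetical, and multi-day outages are not modeled.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class HourlyActivityModel:
    """Beta-Bernoulli estimate, per hour of day, that a webcam image changes."""
    # Pseudo-counts start at 1 each, i.e. a uniform Beta(1, 1) prior (assumed).
    changed: list = field(default_factory=lambda: [1.0] * 24)
    unchanged: list = field(default_factory=lambda: [1.0] * 24)

    def record(self, hour, did_change):
        """Update the counts for one recheck made during the given hour."""
        if did_change:
            self.changed[hour] += 1
        else:
            self.unchanged[hour] += 1

    def probability_active(self, hour):
        """Posterior mean probability that the image changes in this hour."""
        return self.changed[hour] / (self.changed[hour] + self.unchanged[hour])

    def is_probably_active(self, hour, threshold=0.5):
        """Decide whether to show the webcam in results for this hour."""
        return self.probability_active(hour) >= threshold


def recheck(model, hour, previous_digest, image_bytes):
    """Hash the freshly fetched image, record whether it changed, return the hash."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    model.record(hour, digest != previous_digest)
    return digest


# Simulated rechecks over five days: the image changes at noon but not at 2 am.
model = HourlyActivityModel()
for _ in range(5):
    model.record(12, True)
    model.record(2, False)
```

After these simulated rechecks the model favors showing the webcam at noon and hiding it at 2 am, while an hour with no observations stays at the prior value of 0.5. Fetching the image bytes is left to the caller so that the sketch stays independent of any particular crawler.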