Tests conducted
A quantitative analysis of the location algorithm was carried out to determine which parts of the algorithm have the greatest impact on its performance. Five variations were tried: best, code, stop, population1, and population2. Best skips the step of choosing a weighted best region and instead simply takes the city with the highest score. Code ignores any country or state code in the URL when determining the location. Stop omits the stopword list, which was developed experimentally while using the algorithm and which filters out many words that are common in HTML but are also city names. Population1 builds the city list from every city in the index (most of Europe and all of the US) with no minimum population threshold. Population2 builds the same city list, but with a minimum population threshold of 500,000. Both contrast with the minimum population threshold of 5,000 used in the standard Camoogle localization algorithm.
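The five variations can be viewed as the standard algorithm with exactly one setting changed each. The following is a minimal sketch of that idea, not the Camoogle code: the class LocatorConfig and its field names are hypothetical, and only the threshold values come from the description above.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class LocatorConfig:
    """Hypothetical switches; names are illustrative, not from Camoogle."""
    use_region_weighting: bool = True  # turned off by "best"
    use_url_codes: bool = True         # turned off by "code"
    use_stopwords: bool = True         # turned off by "stop"
    min_population: int = 5_000        # standard threshold


STANDARD = LocatorConfig()

# Each variation differs from the standard configuration in one setting.
VARIATIONS = {
    "best": replace(STANDARD, use_region_weighting=False),
    "code": replace(STANDARD, use_url_codes=False),
    "stop": replace(STANDARD, use_stopwords=False),
    "population1": replace(STANDARD, min_population=0),
    "population2": replace(STANDARD, min_population=500_000),
}


def changed_fields(config):
    """Return the names of the settings that differ from STANDARD."""
    return [f.name for f in fields(config)
            if getattr(config, f.name) != getattr(STANDARD, f.name)]
```

Changing one setting at a time is what allows a difference in accuracy to be attributed to a single part of the algorithm.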
Results
The tests were conducted by running each of the algorithms over the same webcam results and comparing the locations they determined against a list of hand-labeled locations. This data set contained 142 different webcams.
The results for the most current version implemented in Camoogle are shown as well for reference; it misclassifies 36 of the 142 webcams, for an accuracy of about 75%. Both the best and the code algorithms lose a great deal of accuracy by throwing away a large amount of information. A state or country code in the URL is often a major clue as to where the webcam is located, and the presence of multiple city names is often a good clue as to which state a given city is actually in.
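The accuracy figure follows directly from the counts given: 142 webcams, 36 misclassified. The sketch below shows the arithmetic and one plausible way to score a run against hand labels. The function name and the dictionary layout (webcam id mapped to location) are assumptions made for illustration.

```python
def accuracy(predicted, labeled):
    """Fraction of labeled webcams whose predicted location matches the label.

    Both arguments map a webcam id to a location; this layout is an
    assumption for illustration, not the format used in the tests.
    """
    if not labeled:
        raise ValueError("no labeled webcams to compare against")
    correct = sum(1 for cam, loc in labeled.items()
                  if predicted.get(cam) == loc)
    return correct / len(labeled)


# Arithmetic for the reference Camoogle run, using the counts in the text.
total, misclassified = 142, 36
reference_accuracy = (total - misclassified) / total
print(f"{reference_accuracy:.1%}")  # 74.6%
```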
The stop run comes closer to the results of the Camoogle run, but without the stopword list it chooses many cities that a user would find rather unlikely, such as Image, Washington. The population1 run is also fairly close, but it starts to fail in the same way as the stop run, except that it chooses city names that are not yet in the stopword list. It would be possible to add all of these names to the stopword list, but their distribution is quite likely heavy-tailed, so labeling them by hand would be ineffective. Instead it might be better to use a machine learning approach involving a dictionary, since most words in the stopword list are common nouns.
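To illustrate how the stopword list and the population threshold interact, here is a small sketch of candidate filtering. It is an assumption about how such a filter might look, not the actual implementation; the function name is hypothetical and the population figures are placeholders rather than real data.

```python
def candidate_cities(tokens, city_population, stopwords, min_population=5_000):
    """Return tokens that name a known city, are not stopwords, and meet
    the population threshold. Order of first appearance is preserved."""
    seen = set()
    result = []
    for token in tokens:
        name = token.lower()
        if name in seen or name in stopwords:
            continue
        population = city_population.get(name)
        if population is not None and population >= min_population:
            seen.add(name)
            result.append(name)
    return result


# Placeholder populations, chosen only to exercise the filter.
populations = {"image": 6_000, "seattle": 600_000}
page_tokens = ["Image", "gallery", "Seattle", "image"]

with_stopwords = candidate_cities(page_tokens, populations, {"image"})
without_stopwords = candidate_cities(page_tokens, populations, set())
high_threshold = candidate_cities(page_tokens, populations, set(), 500_000)
```

With the stopword list, only "seattle" survives; without it, "image" becomes a candidate as well, which is the failure described for the stop run. Raising the threshold removes small cities regardless of the stopword list.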
Population2 actually achieves better accuracy in this comparison, but it is not the ideal algorithm. With a minimum population of 500,000, the algorithm can localize to a city far less often than the standard algorithm and instead merely settles on a region. Although this yields a much higher accuracy, about 85%, the loss of information is considered undesirable. In theory a balance between accuracy and information loss could be found through many iterations of this testing process, but it would be much more interesting to develop a stronger system for creating city/region hypotheses for the various webcams. The current system merely weights cities, then derives region weights from the city weights and chooses the best city that lies within the chosen region. While workable, this is clearly not the best possible system.
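The city-then-region selection described above can be sketched as follows. The text does not say how region weights are derived from city weights, so summing the city weights per region is an assumption here, as are the function names and the example scores.

```python
from collections import defaultdict


def localize(city_scores, city_region):
    """Choose the best region, then the best city inside that region.

    Assumption: a region's weight is the sum of its cities' weights.
    """
    if not city_scores:
        return None
    region_scores = defaultdict(float)
    for city, score in city_scores.items():
        region_scores[city_region[city]] += score
    best_region = max(region_scores, key=region_scores.get)
    in_region = [c for c in city_scores if city_region[c] == best_region]
    best_city = max(in_region, key=city_scores.get)
    return best_city, best_region


def localize_best(city_scores, city_region):
    """The "best" variation: take the top-scoring city, ignoring regions."""
    if not city_scores:
        return None
    best_city = max(city_scores, key=city_scores.get)
    return best_city, city_region[best_city]


# Made-up scores for illustration only.
scores = {"Springfield": 3.0, "Chicago": 2.5, "Portland": 4.0}
regions = {"Springfield": "IL", "Chicago": "IL", "Portland": "OR"}

standard_choice = localize(scores, regions)
best_choice = localize_best(scores, regions)
```

In this example the two Illinois cities together outweigh Portland, so the region-weighted version picks Springfield, IL, while the best variation picks Portland, OR. This shows how several city names on a page can act as evidence for a region.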