Feel free to e-mail me any comments and/or questions.
My Google Tech Talk: Googlewhacks for Fun and Profit
ABSTRACT:
We study the number of internet search results returned for multi-word
queries, based on the number of results returned when each word is searched
for individually. We derive a model that describes search result counts for
multi-word queries using the total number of pages indexed by Google,
applying the Zipf power law to the words-per-page distribution on the
internet and Heaps’ law to unique word counts. Based on data from 351 word
pairs, each with exactly one hit when searched for together, and a Zipf law
coefficient determined in other studies, we approximate the Heaps’ law
coefficient for the indexed World Wide Web (about 8 billion pages) to be
beta = 0.52. Previous studies used under 20,000 pages. We demonstrate
through examples how the model can be used to automatically analyse the
relatedness of word pairs, assigning each a value we call "strength of
associativity". We demonstrate the validity of our method with word triplets
and through two experiments conducted 8 months apart. We then use our model
to compare the index sizes of competing search giants Yahoo and Google.
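
The abstract does not spell out the model itself, so the following is only a
minimal sketch of the standard textbook form of Heaps’ law, V(n) = K * n^beta,
evaluated with the beta = 0.52 reported above. The function names and the
constant K (here `k`, defaulting to 1.0) are hypothetical and chosen purely for
illustration; they are not taken from the talk.

```python
def heaps_vocabulary(n_tokens: float, k: float = 1.0, beta: float = 0.52) -> float:
    """Estimate the number of distinct words via Heaps' law: V(n) = k * n**beta.

    k is an illustrative placeholder constant (not a value from the talk);
    beta defaults to the 0.52 quoted in the abstract.
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return k * n_tokens ** beta


def growth_ratio(scale: float, beta: float = 0.52) -> float:
    """Factor by which the vocabulary grows when the text size is multiplied by `scale`.

    Because V(n) is a power law, this ratio does not depend on k or on n.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return scale ** beta
```

For instance, `growth_ratio(2)` evaluates 2**0.52, so under this form doubling
the amount of text grows the vocabulary by well under a factor of two, which is
the sublinear behaviour Heaps’ law describes.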