Title Query-Sets++: A Scalable Approach for Modeling Web Sites
Authors Bárbara Poblete, Myra Spiliopoulou, Marcelo Mendoza
Publication date 2011
Abstract We explore an effective approach for modeling and classifying Web
sites in the World Wide Web. The aim of this work is to classify Web
sites using features which are independent of size, structure and
vocabulary. We establish Web site similarity based on search
engine query hits, which convey document relevance and utility in
direct relation to users' needs and interests. To achieve this, we
use a generic Web site representation scheme over different feature
spaces, built upon query traffic to the site's documents. For
this task we extend, in a non-trivial way, our prior work using
query-sets for single document representation. We discuss why this
previous methodology is not scalable for a large set of
heterogeneous Web sites. We show that our models achieve very compact
Web site representations. Furthermore, our experiments on site
classification show excellent performance and quality/dimensionality
trade-off. In particular, we sustain a reduction in the feature
space to 5% of the size of the bag-of-words representation, while
achieving 99% precision in our classification experiments on DMOZ.
Pages 129-134
Conference name International Symposium on String Processing and Information Retrieval
Publisher Springer-Verlag (Berlin/Heidelberg, Germany)