Publications

Stats

View publication

Title Indexes for Highly Repetitive Document Collections
Authors Francisco Claude, Antonio Fariña, Miguel Martínez-Prieto, Gonzalo Navarro
Publication date 2011
Abstract We introduce new compressed inverted indexes for highly repetitive
document collections. They are based on run-length, Lempel-Ziv, or
grammar-based compression of the differential inverted lists, instead
of gap-encoding them as is the usual practice. We show that our
compression methods significantly reduce the space achieved by
classical compression, at the price of moderate slowdowns. Moreover,
many of our methods are universal, that is, they do not need to know
the versioning structure of the collection.
\n\n
We also introduce compressed self-indexes in the comparison. We show
that techniques can compress much further, using a small fraction of
the space required by our new inverted indexes, yet they are orders of
magnitude slower.
Downloaded 6 times
Pages 463-468
Conference name ACM International Conference on Information and Knowledge Management
Publisher ACM Press (New York, NY, USA)
PDF View PDF