Title Self-Indexed Grammar-Based Compression
Authors Francisco Claude, Gonzalo Navarro
Publication date 2011
Abstract Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness.
We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1,u], then an SLP-compressed representation of T requires 2 n log_2 n bits. For that same SLP, our self-index takes O(n log n) + n log_2 u bits. It extracts any text substring of length m in time O((m+h) log n), and finds occ occurrences of a pattern string of length m in time O((m (m+h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time.
As byproducts we introduce (i) a representation of SLPs that takes 2 n log_2 n (1+o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.
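To make the notion of an SLP and of substring extraction concrete, here is a minimal sketch in Python. It is not the self-index of the paper: it stores the rules in a plain uncompressed list, and the names `expansion_lengths` and `extract`, the 0-based positions, and the rule encoding (a one-character string for a terminal, a pair of earlier rule indices for a nonterminal) are all assumptions made for illustration. It only shows why extraction can descend the parse tree, visiting on the order of m + h nodes, instead of decompressing the whole text.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a rule is either a terminal (one-character string)
# or a pair (left, right) of indices of earlier rules.
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string each rule expands to, computed bottom-up."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def extract(rules: List[Rule], lengths: List[int],
            root: int, start: int, m: int) -> str:
    """Return the m characters starting at 0-based position `start`
    of the text generated by rule `root`, without full decompression."""
    out: List[str] = []

    def walk(sym: int, lo: int, hi: int) -> None:
        # [lo, hi) is the wanted range inside the expansion of `sym`.
        if lo >= hi:
            return
        rule = rules[sym]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        split = lengths[left]  # where the right child's expansion begins
        if lo < split:
            walk(left, lo, min(hi, split))
        if hi > split:
            walk(right, max(lo, split) - split, hi - split)

    walk(root, max(start, 0), min(start + m, lengths[root]))
    return "".join(out)


# Example grammar: rule 2 -> "ab", rule 3 -> "abab", rule 4 -> "ababa".
rules: List[Rule] = ["a", "b", (0, 1), (2, 2), (3, 0)]
lengths = expansion_lengths(rules)
substring = extract(rules, lengths, root=4, start=1, m=3)
```

In this sketch only the subtrees that overlap the requested range are visited, so the recursion depth is bounded by the height h of the parse tree. The log n factor in the bounds quoted above would come from navigating a compressed rule representation, which this plain list does not model.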
Pages 313-337
Volume 111
Journal name Fundamenta Informaticae
Publisher IOS Press (Amsterdam, The Netherlands)