Abstract
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size gamma of the smallest attractor of a string S[1..n], was introduced. The measure gamma lower bounds all the previous relevant ones (including z), yet S can be represented and indexed within space O(gamma log (n/gamma)), which also upper bounds most measures. While gamma is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and it is not known if S can be represented in o(gamma log n) space. In this paper, we study a smaller measure, delta ≤ gamma, which can be computed in linear time. We show that delta better captures the compressibility of repetitive strings: for every length n and every value delta ≥ 2, we construct a string such that gamma = Omega(delta log (n/delta)). Still, we show a representation of any string S in O(delta log (n/delta)) space that supports direct access to any character S[i] in time O(log (n/delta)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^epsilon n) for any constant epsilon > 0. Further, we prove that no o(delta log n)-space representation exists: for every length n and any value 2 ≤ delta ≤ n^(1-epsilon), we exhibit a string family whose elements can only be encoded in Omega(delta log (n/delta)) space. We complete our characterization of delta by showing that, although gamma, z, and other repetitiveness measures are always O(delta log (n/delta)), in some string families the smallest context-free grammar can be of size g = Omega(delta log^2 n / log log n). No such separation is known for gamma.
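As a companion to the abstract, the following is a minimal sketch (not taken from the paper) of how delta can be evaluated, assuming the standard definition delta(S) = max over k ≥ 1 of d_k(S)/k, where d_k(S) is the number of distinct substrings of S of length k. This naive version is far from the linear-time computation mentioned above, which would instead rely on suffix-tree machinery; it only illustrates the measure.

```python
def delta(S: str) -> float:
    """Naive evaluation of the repetitiveness measure delta(S).

    Assumes delta(S) = max_{k >= 1} d_k(S) / k, with d_k(S) the number of
    distinct length-k substrings of S. Quadratic in the number of substrings
    examined; intended only as an illustrative sketch.
    """
    n = len(S)
    best = 0.0
    for k in range(1, n + 1):
        distinct_k = len({S[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct_k / k)
    return best


if __name__ == "__main__":
    # A highly repetitive string has small delta relative to its length n.
    print(delta("abababababababab"))  # 2.0: only two distinct characters, two distinct 2-grams, ...
    print(delta("abracadabra"))       # 5.0: dominated by the five distinct characters
```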