Publications

Stats

View publication

Title Storage and Retrieval of Individual Genomes
Authors Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, Niko Välimäki
Publication date 2009
Abstract A repetitive sequence collection is one where portions of a \emph{base sequence} of length $n$ are
repeated many times with small variations, forming a collection of total length $N$.
Examples of such collections are version control data and genome sequences of individuals, where
the differences can be expressed by lists of basic edit operations.
Flexible and efficient data analysis on a such typically huge collection is plausible using
suffix trees. However, suffix tree occupies $O(N \log N)$ bits, which very soon inhibits
in-memory analyses. Recent advances in full-text \emph{self-indexing} reduce the space of suffix tree
to $O(N \log \sigma)$
bits, where $\sigma$ is the alphabet size.
In practice, the space reduction is more than $10$-fold, for example on suffix tree of
Human Genome. However, this reduction factor remains constant when more sequences are
added to the collection.
\n\n
We develop a new family of self-indexes suited for the repetitive sequence collection setting.
Their expected space requirement depends only on the length $n$ of the base sequence
and the number $s$ of variations in its repeated copies. That is, the space reduction factor is
no longer constant, but depends on $N/n$.

We believe the structures developed in this work will provide a fundamental basis for storage and
retrieval of individual genomes as they become available due to rapid progress in
the sequencing technologies.
Pages 121-137
Conference name Annual International Conference on Research in Computational Molecular Biology
Publisher Springer-Verlag (Berlin/Heidelberg, Germany)
Reference URL View reference page