
Compressed Full-Text Indexes for Highly Repetitive Collections

Lectio praecursoria
Jouni Sirén
29.6.2012

ALGORITHM

Sadakane: New text indexing functionalities of the compressed suffix

Burrows, Wheeler: A block sorting lossless data compression algorithm

Ferragina, Manzini: Indexing compressed text

Grossi, Vitter: Compressed suffix arrays and suffix trees with

Navarro, Mäkinen: Compressed full-text indexes

Raman, Raman, Rao: Succinct indexable dictionaries with applications

Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of

Sadakane: Compressed suffix trees with full functionality

Manber, Myers: Suffix arrays: A new method for on-line string searches

Are there papers with Sadakane as the first author?










How many papers have Sadakane as the first author?










What are the papers with Sadakane as the first author?
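The three questions above can each be answered by scanning the list once. The following is a minimal sketch, assuming the records are stored as (first author, title) pairs; the function names `exists`, `count` and `locate` are illustrative only, and the titles are kept exactly as they appear on the slide.

```python
# Records as (first author, title); titles are copied from the slide as shown.
PAPERS = [
    ("Sadakane", "New text indexing functionalities of the compressed suffix"),
    ("Burrows", "A block sorting lossless data compression algorithm"),
    ("Ferragina", "Indexing compressed text"),
    ("Grossi", "Compressed suffix arrays and suffix trees with"),
    ("Navarro", "Compressed full-text indexes"),
    ("Raman", "Succinct indexable dictionaries with applications"),
    ("Ferragina", "Compressed representations of"),
    ("Sadakane", "Compressed suffix trees with full functionality"),
    ("Manber", "Suffix arrays: A new method for on-line string searches"),
]

def exists(papers, author):
    """Decision query: may stop at the first matching record."""
    return any(first == author for first, _ in papers)

def count(papers, author):
    """Counting query: scans every record once."""
    return sum(1 for first, _ in papers if first == author)

def locate(papers, author):
    """Reporting query: collects the titles of all matching records."""
    return [title for first, title in papers if first == author]
```

Each query takes time proportional to the number of records, which is acceptable for a short list but not for a large collection.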










DATA STRUCTURE

• What if we have to preserve the original order of the records?

• We may want even faster queries.

• Perhaps there are too many records to fit into memory.

• Then we probably need another data structure.
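One possible data structure for the situation above is a separate sorted index that points back into the unchanged record list. This is only a sketch under that assumption, not necessarily the structure shown on the slide; `build_index` and `find` are hypothetical names.

```python
from bisect import bisect_left, bisect_right

def build_index(first_authors):
    """Sort (author, position) pairs once; the record list itself is untouched."""
    pairs = sorted((author, i) for i, author in enumerate(first_authors))
    keys = [author for author, _ in pairs]
    positions = [i for _, i in pairs]
    return keys, positions

def find(index, author):
    """Binary search for the block of equal keys; returns record positions."""
    keys, positions = index
    lo = bisect_left(keys, author)
    hi = bisect_right(keys, author)
    return positions[lo:hi]

# First authors in the original order of the records.
AUTHORS = ["Sadakane", "Burrows", "Ferragina", "Grossi", "Navarro",
           "Raman", "Ferragina", "Sadakane", "Manber"]
INDEX = build_index(AUTHORS)
```

After the one-time sort, each query needs a logarithmic number of comparisons, and the records keep their original order because only the index is sorted.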










Navarro

Raman

Grossi

Burrows

Ferragina

Manber

Sadakane

FULL-TEXT INDEX

Text: GACGTACTG$

Suffix Array   Sorted suffix
10             $
2              ACGTACTG$
6              ACTG$
3              CGTACTG$
7              CTG$
9              G$
1              GACGTACTG$
4              GTACTG$
5              TACTG$
8              TG$

• While a character takes 1 byte, each pointer requires 4 or 8 bytes.

• Together with the text, the suffix array usually takes 5 or 9 times the size of the text.

• We need something smaller to handle large texts.
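The suffix array of the example text can be built and searched as in the following sketch. It assumes a naive construction by sorting suffixes directly, which is fine for a slide-sized example but far too slow and memory-hungry for large texts; `suffix_array` and `find_range` are illustrative names.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the suffixes in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def find_range(text, sa, pattern):
    """Binary search for the half-open range of suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid] - 1:sa[mid] - 1 + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid] - 1:sa[mid] - 1 + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

TEXT = "GACGTACTG$"
SA = suffix_array(TEXT)
```

With `start, end = find_range(TEXT, SA, "AC")`, the slice `SA[start:end]` lists the positions where the pattern occurs. Each entry of `SA` is a pointer into the text, which is where the 4 or 8 bytes per character come from.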

COMPRESSED INDEX

• Ferragina, Manzini 2000, 2005: FM-index

• Grossi, Vitter 2000, 2005: Compressed Suffix Array

• Use Burrows-Wheeler transform to simulate the suffix array.

• Compresses to 40% to 80% of text size.

• Yet some data should compress better.
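How the Burrows-Wheeler transform can stand in for the suffix array is sketched below with backward search, the counting query used in FM-index style structures. This is a simplified illustration: it builds the transform by sorting suffixes and computes rank by scanning the string, whereas real compressed indexes use compressed rank structures. The names `bwt_from_text` and `count_occurrences` are hypothetical.

```python
def bwt_from_text(text):
    """Burrows-Wheeler transform via sorted suffixes; text must end with '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The character preceding each suffix; text[-1] wraps around to '$'.
    return "".join(text[i - 1] for i in sa)

def count_occurrences(bwt, pattern):
    """Backward search: count occurrences of pattern using only the BWT."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)                 # half-open range of sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + bwt[:lo].count(ch)  # rank(ch, lo), computed naively here
        hi = C[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

BWT = bwt_from_text("GACGTACTG$")
```

The search never touches the suffix array itself: the range `[lo, hi)` is the same range a binary search over the suffix array would find, but it is derived from character counts in the transform.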

HIGHLY REPETITIVE DATA

Individual Genomes

G A C G T A - C T G C A G A T G - T A A T G C
G A C G T A - C T G C A G A T G C T A A T C C
G A C G T A - - - G C A G A T G C T A A T G C
G A C G T A - C T G C A G - T G C T A A T G C
G A C G T A - - - G C A G A T G C T A A T C C
G A C G T A - C T G C T G A T G C T A A T G C
G A C G T A C C T G C A G A T G C T A A T G C
G A C G T A C C T G C A G - T G C T A A T G C
G A C G T A - C T G C T G A T G C T A A T G C
G A C G T A - C T G C A G A T G C T A A T C C

Version History

dhcp-eduroam-hy-138-42:thesis jltsiren$ svn diff -r 662 thesis.tex
Index: thesis.tex
===================================================================
--- thesis.tex (revision 662)
+++ thesis.tex (working copy)
@@ -23,7 +23,7 @@
 \isbnpdf{978-952-10-8052-4}
 \issn{1238-8645}
 \printhouse{Unigrafia}
-\pubpages{108 + 72} % FIXME
+\pubpages{97 + 63}
 \supervisorlist{Veli Mäkinen, University of Helsinki, Finland}
 \preexaminera{Kunihiko Sadakane, National Institute of Informatics, Japan}
 \preexaminerb{Jorma Tarhio, Aalto University, Finland}

Suffix array construction: 378 gigabytes

Finnish language Wikipedia with full version history: 42 gigabytes

Run-length compressed suffix array: 4.4 gigabytes

Do we have 378 gigabytes of memory?
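The 378 gigabytes is consistent with 9 bytes per character of a 42-gigabyte text (42 × 9 = 378). Why a run-length compressed index can be so much smaller is illustrated by the sketch below, which counts runs of equal characters in the Burrows-Wheeler transform of a repetitive text. It is a toy model only: the repeated string stands in for a collection of near-identical documents, and `bwt` and `runs` are hypothetical helper names, not the RLCSA implementation.

```python
from itertools import groupby

def bwt(text):
    """Burrows-Wheeler transform via sorted suffixes; text must end with '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)

def runs(s):
    """Run-length encode s as (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

# Toy repetitive collection: 20 copies of the same short sequence.
REPETITIVE = "GACGTACTG" * 20 + "$"
ENCODED = runs(bwt(REPETITIVE))
```

Suffixes that start at the same offset in different copies sort next to each other and are preceded by the same character, so the transform consists of a few long runs. Comparing `len(ENCODED)` with `len(REPETITIVE)` shows the effect.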

INDEX CONSTRUCTION

[Figure: two routes from the data to the RLCSA: construction through the suffix array, and direct construction.]

INDEXING AUTOMATA

[Figure: an initial automaton with single-character node labels (#, A, C, G, T, $) and the equivalent sorted automaton with multi-character node labels.]

UNIVERSITY OF HELSINKI
FACULTY OF SCIENCE

Indexing Finite Language Representation of Population Genotypes

Jouni Sirén, Niko Välimäki, Veli Mäkinen

ABSTRACT

Compressed full-text indexes [6] based on the Burrows-Wheeler transform (BWT) are widely used in bioinformatics. Their most successful application so far has been mapping short reads to a reference sequence (e.g. Bowtie [3], BWA [4], SOAP2 [5]). These indexes use the BWT to simulate the suffix tree or the suffix array (SA), while using much less space than either of them. A simple generalization allows indexing a set of sequences.

We propose a biologically motivated generalization of the BWT to finite languages. Given a multiple alignment of sequences (e.g. individual genomes), we build a compressed index capable of simulating the suffix array over plausible recombinations of the sequences. Alternatively, we start from a reference sequence and a set of mutations, and build the index over sequences containing any subset of the mutations.

Our approach is based on finite automata. We start with an automaton recognizing the input language. This automaton is transformed into an equivalent automaton, where each state corresponds to a lexicographic range of suffixes of the language. A generalization of the XBW transform for labeled trees [2] is used to index the transformed automaton.
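The finite language obtained from a reference sequence and a set of mutations can be made concrete with the sketch below. It only enumerates the language explicitly and does not build the index; it assumes single-character substitutions at distinct positions, and the function name `language` and the example mutations are hypothetical.

```python
from itertools import combinations

def language(reference, mutations):
    """All sequences obtained by applying any subset of the substitutions.

    mutations maps a 0-based position to the alternative character there.
    Insertions and deletions are not handled in this sketch.
    """
    positions = sorted(mutations)
    result = set()
    for size in range(len(positions) + 1):
        for subset in combinations(positions, size):
            seq = list(reference)
            for p in subset:
                seq[p] = mutations[p]
            result.add("".join(seq))
    return result

# Hypothetical example: two substitutions in the running example sequence.
REFERENCE = "GACGTACTG"
MUTATIONS = {2: "T", 6: "G"}
SEQUENCES = language(REFERENCE, MUTATIONS)
```

With k independent substitutions the language has 2^k sequences, so listing them explicitly quickly becomes infeasible. An automaton represents the same language with one branch per mutation instead.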

FULL-TEXT INDEXES FOR PATTERN MATCHING AND SEQUENCE ANALYSIS

[Figure: suffix tree of GACGTACTG$, shown next to the suffix array, the sorted suffixes and the BWT.]

SA   Sorted suffix   BWT
10   $               G
2    ACGTACTG$       G
6    ACTG$           T
3    CGTACTG$        A
7    CTG$            A
9    G$              T
1    GACGTACTG$      $
4    GTACTG$         C
5    TACTG$          G
8    TG$             C

A MATCH IN MULTIPLE ALIGNMENT

[Figure: a match in a multiple alignment of sequences.]

INITIAL AUTOMATON AND SORTED AUTOMATON

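[Figure: the initial automaton, with one character per node (#, A, C, G, T, $), and the equivalent sorted automaton, whose nodes are labeled #, GA, GT, ACTA, CTA, ACG, CG, AT, TGT, TA, AG, ACC, ACTG, CC, CTG, TG$, G$, $.]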

GENERALIZED COMPRESSED SUFFIX ARRAY

Label   BWT   Edges
$       G     1
ACC     T     1
ACG     G     1
ACTA    G     1
ACTG    T     1
AG      T     1
AT      G     1
CC      A     1
CG      A     1
CTA     A     1
CTG     AC    1
G$      AT    1
GA      #     100
GT      CT    1
TA      CG    100
TG$     C     1
TGT     A     1
#       $     1

Basic operations are about 2 times slower than in regular BWT-based indexes. For reasonable mutation frequencies f, the expected size of the sorted automaton is n(1 + f)^O(log n), where n is the length of the reference sequence. For 1/f = Ω(log n), this becomes O(n). In our experiments, an index built for the human reference genome and the genetic variation found in the Finnish population sample of the 1000 Genomes Project took approximately 2.8 gigabytes.
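The step from the general size bound to O(n) can be spelled out as follows. This is a short derivation sketch, reading the bound as n(1 + f)^{O(log n)} and the condition as 1/f = Ω(log n); the constants c and d are introduced here for illustration and do not come from the poster.

```latex
% Assume the exponent is at most c \log n and that 1/f \ge d \log n
% for constants c, d > 0, that is, f \log n \le 1/d.
\begin{align*}
  n (1 + f)^{c \log n}
    &\le n \, e^{c f \log n} && \text{since } 1 + x \le e^{x} \\
    &\le n \, e^{c / d}      && \text{since } f \log n \le 1/d \\
    &= O(n).
\end{align*}
```

In words: when mutations are at least logarithmically far apart on average, the blow-up factor is bounded by a constant.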

FUTURE DIRECTIONS

• With our current algorithm, the construction of a genome-scale index requires 12 hours and 192 gigabytes of memory. We are currently investigating other algorithms, such as external memory construction and distributed construction in the MapReduce framework [1].

• In principle, our index can be used in any algorithm using a regular BWT-based index. What can be done efficiently in practice?

• We are currently investigating several ways to use the generalized index in read alignment. Are there other applications where our index could be superior to the existing approaches?

REFERENCES

[1] J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

[2] P. Ferragina et al.: Compressing and indexing labeled trees, with applications. Journal of the ACM, 2009.

[3] B. Langmead et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.

[4] H. Li, R. Durbin: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 2009.

[5] R. Li et al.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 2009.

[6] G. Navarro, V. Mäkinen: Compressed full-text indexes. ACM Computing Surveys, 2007.
