Compressed Full-Text Indexes for Highly Repetitive Collections Lectio praecursoria Jouni Sirén 29.6.2012



ALGORITHM


Sadakane: New text indexing functionalities of the compressed suffix

Burrows, Wheeler: A block sorting lossless data compression algorithm

Ferragina, Manzini: Indexing compressed text

Grossi, Vitter: Compressed suffix arrays and suffix trees with

Navarro, Mäkinen: Compressed full-text indexes

Raman, Raman, Rao: Succinct indexable dictionaries with applications

Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of

Sadakane: Compressed suffix trees with full functionality

Manber, Myers: Suffix arrays: A new method for on-line string searches


Are there papers with Sadakane as the first author?


How many papers have Sadakane as the first author?


What are the papers with Sadakane as the first author?

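One way to answer the three questions above is to sort the records once and then binary search them. The sketch below assumes that reading of the ALGORITHM slides, which the transcript does not confirm. The records are copied from the slide with their titles truncated as shown, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

records = [
    "Sadakane: New text indexing functionalities of the compressed suffix",
    "Burrows, Wheeler: A block sorting lossless data compression algorithm",
    "Ferragina, Manzini: Indexing compressed text",
    "Grossi, Vitter: Compressed suffix arrays and suffix trees with",
    "Navarro, Mäkinen: Compressed full-text indexes",
    "Raman, Raman, Rao: Succinct indexable dictionaries with applications",
    "Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of",
    "Sadakane: Compressed suffix trees with full functionality",
    "Manber, Myers: Suffix arrays: A new method for on-line string searches",
]

def first_author(record):
    # The first author is everything before the first comma or colon.
    return record.replace(":", ",").split(",")[0].strip()

# Sorting changes the order of the records; the sort is stable, so
# papers by the same first author keep their relative order.
sorted_records = sorted(records, key=first_author)
keys = [first_author(r) for r in sorted_records]

def papers_by(author):
    """What are the papers with this first author?"""
    lo = bisect_left(keys, author)
    hi = bisect_right(keys, author)
    return sorted_records[lo:hi]

def count_papers(author):
    """How many papers have this first author?"""
    return len(papers_by(author))

def has_paper(author):
    """Are there papers with this first author?"""
    return count_papers(author) > 0

print(has_paper("Sadakane"), count_papers("Sadakane"))
for paper in papers_by("Sadakane"):
    print(paper)
```

Each query costs a logarithmic number of comparisons, but the records no longer appear in their original order.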


DATA STRUCTURE


• What if we have to preserve the original order of the records?

• We may want even faster queries.

• Perhaps there are too many records to fit into memory.

• Then we probably need another data structure.


INDEX


Sadakane: New text indexing functionalities of the compressed suffix

Burrows, Wheeler: A block sorting lossless data compression algorithm

Ferragina, Manzini: Indexing compressed text

Grossi, Vitter: Compressed suffix arrays and suffix trees with

Navarro, Mäkinen: Compressed full-text indexes

Raman, Raman, Rao: Succinct indexable dictionaries with applications

Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of

Sadakane: Compressed suffix trees with full functionality

Manber, Myers: Suffix arrays: A new method for on-line string searches

Index (first authors): Navarro, Raman, Grossi, Burrows, Ferragina, Manber, Sadakane
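An index of this kind can be sketched as a separate structure that maps each first author to the positions of the matching records, so the records themselves keep their original order. This is a minimal illustration, not the structure drawn on the slide. The dictionary layout and the names `index` and `papers_by` are assumptions.

```python
from collections import defaultdict

records = [
    "Sadakane: New text indexing functionalities of the compressed suffix",
    "Burrows, Wheeler: A block sorting lossless data compression algorithm",
    "Ferragina, Manzini: Indexing compressed text",
    "Grossi, Vitter: Compressed suffix arrays and suffix trees with",
    "Navarro, Mäkinen: Compressed full-text indexes",
    "Raman, Raman, Rao: Succinct indexable dictionaries with applications",
    "Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of",
    "Sadakane: Compressed suffix trees with full functionality",
    "Manber, Myers: Suffix arrays: A new method for on-line string searches",
]

def first_author(record):
    # The first author is everything before the first comma or colon.
    return record.replace(":", ",").split(",")[0].strip()

# Map each first author to the positions of their records.
index = defaultdict(list)
for position, record in enumerate(records):
    index[first_author(record)].append(position)

def papers_by(author):
    # Look up positions in the index, then fetch the untouched records.
    return [records[i] for i in index.get(author, [])]

print(sorted(index))
print(papers_by("Sadakane"))
```

The index has one key per distinct first author, seven here, while the nine records stay where they were.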


FULL-TEXT INDEX


Sorted suffixes: $, ACGTACTG$, ACTG$, CGTACTG$, CTG$, G$, GACGTACTG$, GTACTG$, TACTG$, TG$

GACGTACTG$

Suffix Array


10 2 6 3 7 9 1 4 5 8

GACGTACTG$

Suffix Array
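The suffix array of the example text can be reproduced with a few lines of code. The sketch below sorts the suffixes directly, which is simple but far slower than real construction algorithms, and then finds a pattern by binary search. The function names are hypothetical, and positions are 1-based to match the slide.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the suffixes in sorted order."""
    # '$' sorts before the letters A, C, G, T in ASCII, as on the slide.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = sa[k] - 1
        return text[start:start + m]

    # First suffix whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    begin = lo
    # First suffix whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[begin:lo])

text = "GACGTACTG$"
sa = suffix_array(text)
print(sa)                           # [10, 2, 6, 3, 7, 9, 1, 4, 5, 8]
print(occurrences(text, sa, "AC"))  # [2, 6]
```

All occurrences of a pattern form a contiguous range of the suffix array, which is why two binary searches are enough.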


• While a character takes 1 byte, each pointer requires 4 or 8 bytes.

• Suffix array usually requires 5 or 9 times more space than the text.

• We need something smaller to handle large texts.


COMPRESSED INDEX


• Ferragina, Manzini 2000, 2005: FM-index

• Grossi, Vitter 2000, 2005: Compressed Suffix Array

• Use Burrows-Wheeler transform to simulate the suffix array.

• Compresses to 40% to 80% of text size.

• Yet some data should compress better.
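To illustrate how the Burrows-Wheeler transform can stand in for the suffix array, the sketch below counts the occurrences of a pattern by backward search, using only the BWT and character counts. It is a minimal sketch, not the FM-index or the Compressed Suffix Array themselves. Real indexes replace the plain `str.count` calls with compressed rank structures, and the function names are assumptions.

```python
from collections import Counter

def burrows_wheeler(text):
    """BWT of a text that ends with the unique terminator '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array
    # The BWT lists the character preceding each suffix in sorted order;
    # text[-1] is '$', which covers the suffix starting at position 0.
    return "".join(text[i - 1] for i in sa)

def count_occurrences(bwt, pattern):
    """Count occurrences of pattern by backward search over the BWT."""
    counts = Counter(bwt)
    smaller = {}  # smaller[c] = number of characters in the text less than c
    total = 0
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)  # half-open suffix array range [lo, hi)
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        # Rank queries, done naively here by counting in a prefix.
        lo = smaller[c] + bwt[:lo].count(c)
        hi = smaller[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

bwt = burrows_wheeler("GACGTACTG$")
print(bwt)                           # GGTAAT$CGC
print(count_occurrences(bwt, "AC"))  # 2
```

The search never touches the suffix array after construction; each pattern character narrows the range with two rank queries.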


HIGHLY REPETITIVE DATA


Individual Genomes

GACGTA-CTGCAGATG-TAATGC
GACGTA-CTGCAGATGCTAATCC
GACGTA---GCAGATGCTAATGC
GACGTA-CTGCAG-TGCTAATGC
GACGTA---GCAGATGCTAATCC
GACGTA-CTGCTGATGCTAATGC
GACGTACCTGCAGATGCTAATGC
GACGTACCTGCAG-TGCTAATGC
GACGTA-CTGCTGATGCTAATGC
GACGTA-CTGCAGATGCTAATCC


Version History

dhcp-eduroam-hy-138-42:thesis jltsiren$ svn diff -r 662 thesis.tex
Index: thesis.tex
===================================================================
--- thesis.tex (revision 662)
+++ thesis.tex (working copy)
@@ -23,7 +23,7 @@
 \isbnpdf{978-952-10-8052-4}
 \issn{1238-8645}
 \printhouse{Unigrafia}
-\pubpages{108 + 72} % FIXME
+\pubpages{97 + 63}
 \supervisorlist{Veli Mäkinen, University of Helsinki, Finland}
 \preexaminera{Kunihiko Sadakane, National Institute of Informatics, Japan}
 \preexaminerb{Jorma Tarhio, Aalto University, Finland}
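Collections like the aligned genomes and the version history above consist of many near-identical copies. In the Burrows-Wheeler transform of such a collection, equal characters tend to gather into long runs, which is what run-length compression exploits. The sketch below is an idealized illustration: it uses 20 exact copies of one short sequence, and the function names are hypothetical.

```python
def burrows_wheeler(text):
    """BWT of a text that ends with the unique terminator '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)

def run_length_encode(s):
    """Encode a string as a list of (character, run length) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]

single = "GACGTACTG$"
collection = "GACGTACTG" * 20 + "$"  # idealized: 20 identical copies

for text in (single, collection):
    runs = run_length_encode(burrows_wheeler(text))
    print(len(text), "characters,", len(runs), "runs")
```

The collection is about 18 times longer than the single sequence, yet its BWT has only a few more runs, so the run-length encoded form grows much more slowly than the text.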


Suffix array construction: 378 gigabytes

Finnish language Wikipedia with full version history: 42 gigabytes

Run-length compressed suffix array: 4.4 gigabytes

Do we have 378 gigabytes of memory?
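The 378 gigabyte figure is consistent with the earlier space estimate: 1 byte per character plus one pointer per character. The arithmetic below assumes 8-byte pointers, which the slide does not state but which 9 × 42 = 378 suggests. The function name is hypothetical.

```python
def suffix_array_construction_bytes(text_bytes, pointer_bytes):
    # The text itself (1 byte per character) plus one pointer per character.
    return text_bytes * (1 + pointer_bytes)

GB = 10**9
text_size = 42 * GB  # Finnish Wikipedia with full version history

# Assumed 8-byte pointers: 9 bytes per character in total.
print(suffix_array_construction_bytes(text_size, 8) / GB)  # 378.0

# Final index size relative to the text: roughly a tenth.
print(round(4.4 / 42, 3))  # 0.105
```

The final index is small; the obstacle is the intermediate suffix array, which needs almost 90 times more memory than the index it produces.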


INDEX CONSTRUCTION


[Diagram: Data, Construction, RLCSA; two routes are labeled Suffix Array and Direct Construction.]



INDEXING AUTOMATA


University of Helsinki, Faculty of Science

Indexing Finite Language Representation of Population Genotypes

Jouni Sirén, Niko Välimäki, Veli Mäkinen

ABSTRACT

Compressed full-text indexes [6] based on the Burrows-Wheeler transform (BWT) are widely used in bioinformatics. Their most successful application so far has been mapping short reads to a reference sequence (e.g. Bowtie [3], BWA [4], SOAP2 [5]). These indexes use the BWT to simulate the suffix tree or the suffix array (SA), while using much less space than either of them. A simple generalization allows indexing a set of sequences.

We propose a biologically motivated generalization of the BWT to finite languages. Given a multiple alignment of sequences (e.g. individual genomes), we build a compressed index capable of simulating the suffix array over plausible recombinations of the sequences. Alternatively, we start from a reference sequence and a set of mutations, and build the index over sequences containing any subset of the mutations.

Our approach is based on finite automata. We start with an automaton recognizing the input language. This automaton is transformed into an equivalent automaton, where each state corresponds to a lexicographic range of suffixes of the language. A generalization of the XBW transform for labeled trees [2] is used to index the transformed automaton.
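To make the second input form concrete, the sketch below enumerates the finite language defined by a reference sequence and a set of mutations. It only illustrates what the index is built over; it is not the automaton construction from the abstract. It assumes the mutations are single-character substitutions at distinct 0-based positions, and the function name and sample mutations are hypothetical.

```python
from itertools import combinations

def finite_language(reference, substitutions):
    """Return every sequence obtained by applying any subset of substitutions.

    substitutions maps a 0-based position to the alternative character.
    """
    positions = sorted(substitutions)
    language = set()
    for k in range(len(positions) + 1):
        for subset in combinations(positions, k):
            chars = list(reference)
            for pos in subset:
                chars[pos] = substitutions[pos]
            language.add("".join(chars))
    return language

# Hypothetical example: two substitutions give 2**2 = 4 sequences.
print(sorted(finite_language("GACGTACTG", {2: "T", 7: "C"})))
```

The number of sequences doubles with every mutation, so listing them explicitly is hopeless at genome scale. An automaton represents them compactly by branching only at the mutated positions.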

FULL-TEXT INDEXES FOR PATTERN MATCHING AND SEQUENCE ANALYSIS

[Figure: suffix tree, suffix array (SA), sorted suffixes, and BWT for the example text GACGTACTG$; SA = 10 2 6 3 7 9 1 4 5 8.]

A MATCH IN MULTIPLE ALIGNMENT

[Figure: a pattern match spanning several rows of a multiple alignment.]

INITIAL AUTOMATON AND SORTED AUTOMATON

[Figure: the initial automaton and the equivalent sorted automaton.]

GENERALIZED COMPRESSED SUFFIX ARRAY

State  $  ACC  ACG  ACTA  ACTG  AG  AT  CC  CG  CTA  CTG  G$  GA   GT  TA   TG$  TGT  #
BWT    G  T    G    G     T     T   G   A   A   A    AC   AT  #    CT  CG   C    A    $
Edges  1  1    1    1     1     1   1   1   1   1    1    1   100  1   100  1    1    1

Basic operations are about 2 times slower than in regular BWT-based indexes. For reasonable mutation frequencies $f$, the expected size of the sorted automaton is $n(1 + f)^{O(\log n)}$, where $n$ is the length of the reference sequence. For $1/f = \Omega(\log n)$, this becomes $O(n)$. In our experiments, an index built for the human reference genome and the genetic variation found in the Finnish population sample of the 1000 Genomes Project took approximately 2.8 gigabytes.
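The step from the general size bound to the linear bound can be spelled out as follows. The constants $c$ and $c'$ are introduced here for illustration: $c$ is the constant hidden in the exponent, and $c'$ comes from the condition on $1/f$. The first inequality uses $1 + f \le e^{f}$.

```latex
\[
  n(1+f)^{c \log n}
  \;\le\; n\, e^{c f \log n}
  \;\le\; n\, e^{c c'}
  \;=\; O(n)
  \qquad \text{whenever } f \le \frac{c'}{\log n}.
\]
```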

FUTURE DIRECTIONS

• With our current algorithm, the construction of a genome-scale index requires 12 hours and 192 gigabytes of memory. We are currently investigating other algorithms, such as external memory construction and distributed construction in the MapReduce framework [1].

• In principle, our index can be used in any algorithm using a regular BWT-based index. What can be done efficiently in practice?

• We are currently investigating several ways to use the generalized index in read alignment. Are there other applications where our index could be superior to the existing approaches?

REFERENCES

[1] J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

[2] P. Ferragina et al.: Compressing and indexing labeled trees, with applications. Journal of the ACM, 2009.

[3] B. Langmead et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.

[4] H. Li, R. Durbin: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 2009.

[5] R. Li et al.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 2009.

[6] G. Navarro, V. Mäkinen: Compressed full-text indexes. ACM Computing Surveys, 2007.
