Index Building

Upload: morgan-haynes

Post on 01-Jan-2016

Overview

• Database tables
• Building flow (logical)
• Sequential
• Drawbacks
• Parallel processing
• Recovery
• Helpful rules

Database tables

Word Index:
• Z97 - word dictionary
• Z98 - bitmap
• Z980 - cache of bitmap updates
• Z95 - words in document

Database tables

Z97
• translation from word to internal representation (sequence)
• same character set as documents

Database tables

Z98
• “bitmap” of word occurrence in documents
• each bitmap is physically made up of one or more records
• compressed
• one bitmap for every combination of word and index
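
The idea of one occurrence bitmap per (word, index) pair can be sketched as follows. This is only an illustration: the bitmap is held as an uncompressed Python integer, and the word numbers and the index codes `WRD` and `WTI` are hypothetical, not taken from the real Z98 layout.

```python
def set_doc(bitmap, doc_no):
    """Return the bitmap with the bit for doc_no switched on."""
    return bitmap | (1 << doc_no)

def doc_numbers(bitmap):
    """List the document numbers whose bit is set."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]

# One bitmap per (word number, index code); all values here are made up.
z98 = {}
for word_seq, index_code, doc_no in [(17, "WRD", 3), (17, "WRD", 8),
                                     (42, "WRD", 8), (17, "WTI", 3)]:
    key = (word_seq, index_code)
    z98[key] = set_doc(z98.get(key, 0), doc_no)

# Documents containing both words in the same index: AND the two bitmaps.
both = z98[(17, "WRD")] & z98[(42, "WRD")]
```

Combining bitmaps with a bitwise AND is the usual reason for storing occurrences this way; whether the real search code does exactly this is an assumption.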

Database tables

Z980
• cache of bitmap updates
• increases speed of large bitmap updates
• 1/1000

Database tables

Z95
• list of words and their location in a document
• adjacency

Database tables

Heading index:
• Z01 - phrase dictionary
• Z02 - phrase->document mapping

Database tables

Z01:
• filing phrase
• connection to authority database
• hash key (display text)

Building flow - word

Stage 1: Retrieval + Sort
• Read document
• prepare list of words and locations
• for each word find list of indices it belongs to
• sort according to words

Building flow - word

Stage 2: Word Dictionary
• read intermediate file from stage 1
• build up word dictionary (check + load)
• replace word with internal representation
• create 2nd intermediate file

Building flow - word

Stage 3: Sort + Build Z95
• sort intermediate file from stage 2 - by document number
• create Z95 records
• load Z95 sequential file to database

Building flow - word

Stage 4: Merge + Build Z98
• intermediate file from stage 2 already sorted by word number
• split words into a number of files according to range of word numbers
• merge into Z98 records
• load sequential files
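
The four logical stages can be sketched in memory as below. This is a minimal illustration, assuming whitespace tokenisation, an empty dictionary at the start, and plain lists in place of the intermediate files; all function names are hypothetical. Because stage 2 numbers the words in the order of the word-sorted stage 1 output, its output is already in word-number order, which is what stage 4 relies on.

```python
from itertools import groupby
from operator import itemgetter

def stage1_retrieve_and_sort(docs):
    """Read documents, list (word, doc_no, position), sort by word."""
    rows = [(w, d, p) for d, text in docs.items()
            for p, w in enumerate(text.lower().split(), start=1)]
    return sorted(rows)

def stage2_word_dictionary(rows):
    """Build the word dictionary and replace each word by its number."""
    z97, out = {}, []
    for word, doc_no, pos in rows:
        seq = z97.setdefault(word, len(z97) + 1)
        out.append((seq, doc_no, pos))
    return z97, out

def stage3_build_z95(rows):
    """Re-sort by document number and collect (word number, position) lists."""
    by_doc = sorted(rows, key=lambda r: (r[1], r[2]))
    return {d: [(s, p) for s, _, p in grp]
            for d, grp in groupby(by_doc, key=itemgetter(1))}

def stage4_build_z98(rows):
    """Merge rows (already in word-number order) into per-word document lists."""
    return {s: sorted({d for _, d, _ in grp})
            for s, grp in groupby(rows, key=itemgetter(0))}

docs = {2: "building flow", 1: "index building"}
z97, second_file = stage2_word_dictionary(stage1_retrieve_and_sort(docs))
z95 = stage3_build_z95(second_file)
z98 = stage4_build_z98(second_file)
```

Here the Z98 stand-in is a sorted list of document numbers per word rather than a compressed bitmap, purely to keep the sketch short.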

Building flow - heading

Stage 1: Retrieval + Sort
• Read document
• prepare list of phrases
• for each phrase find list of indices it belongs to
• sort according to hash key

Building flow - heading

Stage 2: Phrase Dictionary
• read intermediate file from stage 1
• build up phrase dictionary
• generate unique key - acc sequence
• load Z01 sequential file to database
• build Z02 - non unique

Building flow - heading

Stage 3: Sort + Load Z02
• sort non unique Z02 sequential file
• load Z02 sequential file to database

Sequential - word

• Every stage is handled by a single process

• Only after handling by a previous stage would the next stage proceed

• stage 4 would proceed after all other stages were finished

Sequential - word

Example from version 12.1:

csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log

Sequential - word

• p_manage_01_a: retrieval
• p_manage_01_b: sort (by word)
• p_manage_01_c: build Z97
• p_manage_01_d: build Z95
• p_manage_01_e: merge + build Z98

Drawbacks

• Minimum parallel processing
• Single process per stage
• No recoverability - Z97 could be reused but the whole building process needed to be rerun

• Computer resources not fully utilized

• Long run time

Parallel processing

• Large databases - multiple processors

• Identify stages that are not “workflow” bottlenecks

• Coordinate parallel processes with assignment/progress table

Parallel processing (word)

Stage 1: Retrieval + Sort
• Retrieval is parallel - “io” not “workflow” bottleneck
• Split into cycles, each covering a range of document numbers

Parallel processing (word)

p_manage_01_a.cycles - initial

0001 - - - - 000000001 000010000
0002 - - - - 000010001 000020000
0003 - - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
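
A listing of this shape can be produced with a few lines of code. The sketch below assumes the column meaning that the listing suggests (4-digit cycle number, one status flag per sub-stage, 9-digit first and last document numbers); the function name and parameters are hypothetical.

```python
def initial_cycles(num_docs, cycle_size, num_flags=4):
    """Return cycles-file lines with every sub-stage flag set to '-'."""
    lines = []
    start, cycle = 1, 1
    while start <= num_docs:
        # The last cycle is cut off at the highest document number.
        end = min(start + cycle_size - 1, num_docs)
        flags = " ".join("-" * num_flags)
        lines.append(f"{cycle:04d} {flags} {start:09d} {end:09d}")
        start, cycle = end + 1, cycle + 1
    return lines

lines = initial_cycles(110511, 10000)
```

With 110,511 documents and a cycle size of 10,000 this yields twelve cycles, the last one ending at document 000110511.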

Parallel processing (word)

p_manage_01_a.cycles - 3 processes, 1st retrieval cycle

0001 ? - - - 000000001 000010000
0002 ? - - - 000010001 000020000
0003 ? - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511

Parallel processing (word)

p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle

0001 + + ? - 000000001 000010000
0002 + ? - - 000010001 000020000
0003 + - - - 000020001 000030000
0004 ? - - - 000030001 000040000
0005 ? - - - 000040001 000050000
0006 ? - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
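
How processes might coordinate through such a table can be sketched as follows. The flag meanings are partly inferred: "?" as in process and "-" as not processed are stated in the recovery notes, while "+" as finished and the rule that a sub-stage may start on a cycle only once all earlier sub-stages of that cycle are finished are assumptions read off the listings. The function names are hypothetical, and a real implementation would also need to lock the file between processes.

```python
def claim_next(lines, stage):
    """Mark the first cycle that is ready for `stage` (0-based flag column)
    as in process and return its cycle number, or None if none is ready."""
    for i, line in enumerate(lines):
        fields = line.split()
        flags = fields[1:-2]
        if flags[stage] == "-" and all(f == "+" for f in flags[:stage]):
            flags[stage] = "?"
            lines[i] = " ".join([fields[0], *flags, *fields[-2:]])
            return fields[0]
    return None

def mark_done(lines, cycle, stage):
    """Flip the flag of a finished sub-stage from '?' to '+'."""
    for i, line in enumerate(lines):
        fields = line.split()
        if fields[0] == cycle:
            flags = fields[1:-2]
            flags[stage] = "+"
            lines[i] = " ".join([fields[0], *flags, *fields[-2:]])

table = ["0001 - - - - 000000001 000010000",
         "0002 - - - - 000010001 000020000"]
first = claim_next(table, 0)    # a retrieval process takes cycle 0001
second = claim_next(table, 0)   # another retrieval process takes cycle 0002
blocked = claim_next(table, 1)  # nothing is ready for the second sub-stage yet
mark_done(table, "0001", 0)
ready = claim_next(table, 1)    # now cycle 0001 can move on
```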

Parallel processing (word)

• Whenever possible stages were split into separate sub-stages

• Usually in cases of non-parallel stages

• stages 2 and 3 were not made into parallel processes - retrieval was by far the most costly stage

Parallel processing (word)

Stages 2 and 3 were subdivided into 3 sub-stages:

• build Z97 + load
• sort intermediate file by document number
• build Z95 + load

Parallel processing (word)

p_manage_01_a.cycles - example

0001 + + + + 000000001 000010000
0002 + + + ? 000010001 000020000
0003 + + ? - 000020001 000030000
0004 + + - - 000030001 000040000
0005 + ? - - 000040001 000050000
0006 + - - - 000050001 000060000
0007 ? - - - 000060001 000070000
0008 ? - - - 000070001 000080000
0009 ? - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511

Parallel processing (word)

Stage 4 is split into sub-stages:
• pre-processing of intermediate files from stage 2 - distribution of words
• build Z98 - parallel
• load Z98 sequential file
• input files are compressed and stored in separate directory

Parallel processing (word)

Pre-processing:
• generate histogram - # of lines per 5000 words
• determine range of words - no more than 1G in intermediate files
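
One way to turn such a histogram into word ranges is a simple greedy pass, sketched below. The greedy strategy, the `bytes_per_line` figure and the function name are assumptions for illustration; the source states only the 5000-word bucket and the 1G limit. The open upper bound 999999999 for the last range follows the p_manage_01_e.cycles listing.

```python
def word_ranges(histogram, bytes_per_line, limit=2**30,
                bucket=5000, last=999_999_999):
    """histogram[i] = number of intermediate-file lines for word numbers
    i*bucket+1 .. (i+1)*bucket. Returns (first_word, last_word) ranges
    whose estimated size stays within `limit` where possible."""
    ranges, start, size = [], 1, 0
    for i, lines in enumerate(histogram):
        chunk = lines * bytes_per_line
        if size and size + chunk > limit:
            end = i * bucket            # close the range before this bucket
            ranges.append((start, end))
            start, size = end + 1, 0
        size += chunk
    ranges.append((start, last))        # final range is left open-ended
    return ranges

# Tiny made-up numbers: a limit of 10 "bytes" and one byte per line.
example = word_ranges([6, 5, 4, 3], bytes_per_line=1, limit=10)
```

A single bucket that exceeds the limit on its own still becomes a range by itself in this sketch, since a bucket is never split.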

Parallel processing (word)

p_manage_01_e.cycles

0001 - - 000000001 000600000
0002 - - 000600001 000900000
0003 - - 000900001 999999999

Parallel processing (word)

Build Z98:
• intermediate files - split into discrete range of words
• parallel merging and building of Z98

Parallel processing (word)

p_manage_01_e.cycles - example

0001 + ? 000000001 000600000
0002 ? - 000600001 000900000
0003 ? - 000900001 999999999

Parallel processing (heading)

Stage 1: Retrieval + Sort
• same handling as word index stage 1
• “io” bottleneck
• Split into cycles, each covering a range of document numbers

Parallel processing (heading)

p_manage_02.cycles

0001 - - - - 000000001 000005000
0002 - - - - 000005001 000010000
0003 - - - - 000010001 000015000
0004 - - - - 000015001 000020000
0005 - - - - 000020001 000025000
0006 - - - - 000025001 000030000
0007 - - - - 000030001 000035000
0008 - - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435

Parallel processing (heading)

Stages 2 and 3 were subdivided into 3 sub-stages:

• build Z01 + load + build Z02
• sort non unique Z02 sequential file
• load Z02

Parallel processing (heading)

p_manage_02.cycles - example

0001 + + + ? 000000001 000005000
0002 + + ? - 000005001 000010000
0003 + + - - 000010001 000015000
0004 + ? - - 000015001 000020000
0005 + - - - 000020001 000025000
0006 ? - - - 000025001 000030000
0007 ? - - - 000030001 000035000
0008 ? - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435

Parallel processing (heading)

Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98)

Recovery

Word index:
• stages 1-3 and stage 4 are separate
• stage 4 runs only after all processing is done in stage 3

Recovery

Stage 1-3 - scenarios:
• database tables need to be enlarged
• not enough disk space - intermediate files
• not enough disk space - sort
• general disaster?

Recovery

Stage 1-3:
• identify last successful section
• change “in process” signs (?) to “not processed” sign (-)
• rerun discrete stage scripts:
  – p_manage_01_a
  – p_manage_01_c
  – p_manage_01_d
  – p_manage_01_d1
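
The flag reset step could be scripted rather than done by hand. The sketch below assumes the cycles-file layout seen in the listings (cycle number, status flags, first and last document number, separated by single spaces); the function name is hypothetical and the real procedure may well be a manual edit.

```python
def reset_in_process(lines):
    """Return cycles-file lines with every '?' flag turned back into '-'.
    Finished ('+') flags and the document ranges are left untouched."""
    out = []
    for line in lines:
        fields = line.split()
        flags = ["-" if f == "?" else f for f in fields[1:-2]]
        out.append(" ".join([fields[0], *flags, *fields[-2:]]))
    return out

before = ["0001 + + ? - 000000001 000010000",
          "0004 ? - - - 000030001 000040000"]
after = reset_in_process(before)
```

After the reset, rerunning the stage scripts picks the interrupted cycles up again as if they had never been started.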

Recovery

Stage 4:
• must be rerun in totality
• input files are saved and compressed
• $word_compress_dir
• p_manage_01_e

Helpful rules

Stage 1 outrunning stage 2-3:
• decide on number of stage 1 processes to stop (p_manage_01_a)
• kill shell and program process
• reset associated cycle in p_manage_01_a.cycles

Helpful rules

Log file names:
p_manage_01_a_{process_number}.log
p_manage_01_e_{process_number}.log

others are without process_number:

p_manage_01_c.log
p_manage_01_d.log
p_manage_01_d1.log
p_manage_01_e1.log
p_manage_01_e2.log

Helpful rules

cycle size:

# docs < 2M - 50k
# docs < 4M - 100k
otherwise - 200k
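
As a small sketch, the rule can be written as a function, together with the resulting number of cycles. The function names are hypothetical, and the strict "less than" reading of the thresholds is an assumption; a database of roughly 2M documents sits right on the boundary.

```python
def cycle_size(num_docs):
    """Suggested number of documents per cycle for a given database size."""
    if num_docs < 2_000_000:
        return 50_000
    if num_docs < 4_000_000:
        return 100_000
    return 200_000

def num_cycles(num_docs, size):
    """Number of cycles needed to cover all documents (rounded up)."""
    return -(-num_docs // size)
```

For example, 2,000,000 documents at a cycle size of 50,000 give 40 cycles.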

Helpful rules

Disk space calculation:

d = no. documents
c = no. cycles
p = no. processors
s = size of retrieval file

Helpful rules

Sort space ($TMPDIR):

sort = p*s + 20%

stage 1 sort (parallel) + stage 2,3 sorting (single file)

Helpful rules

Scratch space:

scratch = p*1.5*s + c*s*1/3

output from stage 1 (in process and not yet processed) +

output from stage 3

Helpful rules

Example: UBU

d=2M
cycle size=50k
p=4, c=40, s= ~0.5G

sort = 4*0.5*1.2 = 2.4G
scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G
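
The two space formulas are easy to evaluate in code. The sketch below uses the symbols defined earlier (p processors, c cycles, s retrieval file size in GB); the function names are hypothetical. Evaluating the stated formulas with the UBU values gives 2.4G of sort space and about 9.67G of scratch space.

```python
def sort_space(p, s):
    """Sort space in GB: p*s plus 20%."""
    return p * s * 1.2

def scratch_space(p, c, s):
    """Scratch space in GB: stage 1 output plus stage 3 output."""
    return p * 1.5 * s + c * s / 3

ubu_sort = sort_space(p=4, s=0.5)
ubu_scratch = scratch_space(p=4, c=40, s=0.5)
```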