Post on 16-Jan-2016
File Processing - Indexing MVNC 1
Indexing
Jim Skon
File Processing - Indexing MVNC 2
Indexing
Index structures can greatly speed access
Consider a library card catalog
» Allows quick access to books
» Why not just order books by author name?
Actually three indexes:
» Author
» Topic
» Title
File Processing - Indexing MVNC 3
Indexing
Simple Index
» Provides a shortcut, based on a key value, to the desired record.
» Each index is based on the value of a certain key (or keys)
» Can have indexes for any key field
[Figure: an index whose entries point to records in the file]
File Processing - Indexing MVNC 4
Indexing
Multiple Indexes
» May have indexes for more than one field
[Figure: two indexes pointing into the same file]
File Processing - Indexing MVNC 5
Indexing
Example: Record Albums
» Record label
» Record ID
» Title
» Composer(s)
» Artist(s)
Primary key: Record label + Record ID
File Processing - Indexing MVNC 6
Indexing
Consider an index file whose records contain:
» Primary key (Record label + Record ID)
» Byte offset of the record in the main file
Index is sorted in primary key order
File Processing - Indexing MVNC 7
Operations in indexed file
Retrieving a record
» Search the index file (perhaps using binary search)
» Seek in the main file to the byte offset specified in the index
» Read the record from the main file
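The three retrieval steps above can be sketched in Python. This is only an illustration under assumptions: the in-memory `io.BytesIO` stands in for the main file, records are newline-terminated text, and the record contents and every key other than COL3345 are made up for the example.

```python
import bisect
import io

# Hypothetical main file: variable-length records, one per line.
records = [b"LON2312|Romeo and Juliet|Prokofiev\n",
           b"COL3345|Symphony No. 9|Beethoven\n",
           b"ANG3795|Violin Concerto|Beethoven\n"]
data_file = io.BytesIO()
index = []                          # (primary_key, byte_offset) pairs
for rec in records:
    key = rec.split(b"|")[0].decode()
    index.append((key, data_file.tell()))
    data_file.write(rec)
index.sort()                        # index is kept in primary key order

def retrieve(key):
    """Binary-search the index, seek to the offset, read one record."""
    # (key,) sorts just before any (key, offset) pair with the same key.
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        return None                 # key not in the index
    data_file.seek(index[i][1])
    return data_file.readline().decode().rstrip("\n")
```

Note that the main file itself is not sorted; only the index is.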
File Processing - Indexing MVNC 8
Operations in indexed file
Create the empty index and data files
Load the index file into memory
Rewrite the index file after index change
Add records to the file and index
Delete records from data file
Update records in data file
File Processing - Indexing MVNC 9
Operations in indexed file
Create the empty index and data files
» Create new files
» Write header records indicating number of records
File Processing - Indexing MVNC 10
Operations in indexed file
Load the index file into memory
» Simply read the index in sequential order, placing entries into an array of (key, offset) structures
» Since the records are small, could read several records at once
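A minimal sketch of loading the index, assuming a binary layout that the slides do not specify: a 4-byte header holding the entry count, followed by fixed-size entries of a 12-byte key and an 8-byte offset. The names `HEADER`, `ENTRY`, `write_index` and `load_index` are hypothetical.

```python
import io
import struct

HEADER = struct.Struct("<I")      # assumed header: number of index entries
ENTRY = struct.Struct("<12sQ")    # assumed entry: 12-byte key, 8-byte offset

def write_index(f, entries):
    """Write the header, then every (key, offset) entry in order."""
    f.write(HEADER.pack(len(entries)))
    for key, offset in entries:
        f.write(ENTRY.pack(key.encode(), offset))

def load_index(f):
    """Read the whole index into a list of (key, offset) tuples."""
    (count,) = HEADER.unpack(f.read(HEADER.size))
    # Entries are small, so fetch them all with one read, then unpack.
    blob = f.read(count * ENTRY.size)
    return [(k.rstrip(b"\0").decode(), off)
            for k, off in ENTRY.iter_unpack(blob)]
```

Usage: write a sorted list with `write_index`, rewind the file, and `load_index` returns the same list.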
File Processing - Indexing MVNC 11
Operations in indexed file
Rewrite the index file after index change
» Need only be done after index changes
» Simply iterate through array, writing to index file
» Can be done after EVERY change
» Could wait until files are ready to be closed
  – Need to keep track of whether file version is out of date
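One way to keep track of whether the file version is out of date is a "dirty" flag on the in-memory index. The sketch below is an assumption-laden illustration: the class `MemoryIndex`, its methods, and the 12-byte-key entry layout are all hypothetical.

```python
import bisect
import io
import struct

ENTRY = struct.Struct("<12sQ")    # assumed entry: 12-byte key, 8-byte offset

class MemoryIndex:
    def __init__(self):
        self.entries = []         # sorted (key, offset) pairs
        self.dirty = False        # True when the file copy is out of date

    def insert(self, key, offset):
        bisect.insort(self.entries, (key, offset))
        self.dirty = True

    def save(self, f):
        """Rewrite the index file only if the in-memory copy changed."""
        if not self.dirty:
            return False
        f.seek(0)
        f.truncate()
        for key, offset in self.entries:
            f.write(ENTRY.pack(key.encode(), offset))
        self.dirty = False
        return True
```

Calling `save` after every change is the safe choice; calling it only at close is cheaper but loses changes if the program stops early.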
File Processing - Indexing MVNC 12
Operations in indexed file
Add records to the file and index
» Add record to main file
  – Use the next free record
  – Maybe a linked list of “unused” records could be used to keep track of available records.
  – Record order of main file is unimportant
» Add record to index
  – Requires moving down later records to keep the index file sorted
  – Could put at end, sorting occasionally.
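As a rough sketch of the simplest case, the code below appends the record at the end of the main file and inserts the index entry in sorted position. The record format (fields joined by "|") and the function name `add_record` are assumptions for illustration.

```python
import bisect
import io

data_file = io.BytesIO()    # stands in for the main file
index = []                  # sorted (primary_key, byte_offset) pairs

def add_record(key, fields):
    """Append the record to the main file, then insert into the index."""
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    data_file.write(("|".join([key, *fields]) + "\n").encode())
    # insort shifts later entries down so the index stays sorted.
    bisect.insort(index, (key, offset))
    return offset
```

The main file ends up in arrival order while the index stays in key order.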
File Processing - Indexing MVNC 13
Operations in indexed file
Delete records from data file
» Delete in main file
  – Mark record
  – Perhaps link into list of free records
» Delete in index
  – Perhaps move every later record down one
  – Perhaps just mark as deleted
Could still search if the key field is still intact
File Processing - Indexing MVNC 14
Operations in indexed file
Update records in data file
» If change involves key field
  – Will need to move entry in index
  – Can be thought of as a delete followed by an insert
» If change does not change key field
  – Case one - record does not move
      just rewrite record
      index unchanged
  – Case two - record changes position
      Perhaps the record is variable size, and it grows
      Index will have to be changed to reflect new position
      Position of reference in index unchanged
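The two non-key-change cases can be sketched as follows, assuming newline-terminated variable-length records (so the text itself must not contain a newline) and a "*" mark for the abandoned copy. The function names are hypothetical.

```python
import bisect
import io

data_file = io.BytesIO()    # stands in for the main file
index = []                  # sorted (primary_key, byte_offset) pairs

def append(key, text):
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    data_file.write(text.encode() + b"\n")
    bisect.insort(index, (key, offset))
    return offset

def update(key, text):
    """Rewrite a record whose key field is unchanged; return its offset."""
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        raise KeyError(key)
    offset = index[i][1]
    data_file.seek(offset)
    old = data_file.readline()          # old record, including "\n"
    new = text.encode()
    if len(new) < len(old):
        # Case one: the new record fits, rewrite in place (space padded).
        data_file.seek(offset)
        data_file.write(new.ljust(len(old) - 1) + b"\n")
    else:
        # Case two: the record grew, so it moves to the end of the file.
        data_file.seek(offset)
        data_file.write(b"*")           # mark the old copy as deleted
        data_file.seek(0, io.SEEK_END)
        index[i] = (key, data_file.tell())   # same index position, new offset
        data_file.write(new + b"\n")
    return index[i][1]
```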
File Processing - Indexing MVNC 15
Indexes too large to keep in memory
Searching
» Binary searching requires several reads
» Not much better than searching a sorted complete file
Updating
» Index update can require rewriting much of the file
» Orders of magnitude more expensive than in-memory index management
File Processing - Indexing MVNC 16
Indexes too large to keep in memory
In such cases consider
» A hash file system
» A tree-structured index (e.g. B-tree)
However, a file-based index still has benefits
» Allows binary searching on an unordered file
» Allows binary searching on variable-length records
» Indexes are smaller than main files, so somewhat cheaper to manipulate
» Allows file “rearrangement” without moving actual records. (Consider when pinned)
File Processing - Indexing MVNC 17
Indexing with multiple keys
Consider an additional index for access to the album file by composer
Secondary index fields:
» Composer
» Offset into main file
Problem
» Every time a record is moved in the main file, ALL indexes must change
» The indexes pin the records!
File Processing - Indexing MVNC 18
Indexing with multiple keys
Secondary index pinning - solution
» Refer to the primary key rather than the offset of the actual record
» Now the secondary key index doesn’t reference actual records, so records are not pinned.
» Main file can be reorganized without changing the secondary index
File Processing - Indexing MVNC 19
Indexing with multiple keys
Searching by secondary index
» Search secondary index (binary search?)
» If found, use the associated primary key to look up the record in the primary index
» Use the offset in the primary index to look up the actual record
Remember - the secondary key may have multiple matches (e.g. Beethoven)
» A secondary key can be thought of as referring to a subset of records
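The two-level lookup can be sketched as below. Both indexes are plain sorted lists here, and the offsets and all keys other than COL3345 are invented for the example.

```python
import bisect

# Primary index: (primary_key, byte_offset), sorted by primary key.
primary = [("ANG3795", 70), ("COL3345", 35), ("LON2312", 0)]
# Secondary index: (composer, primary_key), sorted by composer.
secondary = [("Beethoven", "ANG3795"),
             ("Beethoven", "COL3345"),
             ("Prokofiev", "LON2312")]

def primary_keys_for(composer):
    """Binary-search to the first match, then collect every match."""
    i = bisect.bisect_left(secondary, (composer,))
    keys = []
    while i < len(secondary) and secondary[i][0] == composer:
        keys.append(secondary[i][1])
        i += 1
    return keys

def offsets_for(composer):
    """Resolve each primary key to a byte offset via the primary index."""
    result = []
    for pk in primary_keys_for(composer):
        i = bisect.bisect_left(primary, (pk,))
        if i < len(primary) and primary[i][0] == pk:
            result.append((pk, primary[i][1]))
    return result
```

A query for "Beethoven" yields two primary keys, which is the "subset of records" the slide refers to.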
File Processing - Indexing MVNC 20
Indexing with multiple keys
Adding new records
» Add record in main file as before
» Add entry in primary index
» Add entry in secondary index file
  – As before, shift data as needed.
  – Index entries with duplicate keys are stored together.
  – Duplicates should be stored in primary key order
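If secondary index entries are (secondary key, primary key) pairs, a sorted insert gives both properties at once: duplicates sit together, and within a duplicate group they fall in primary key order. A small sketch, with a hypothetical `add_secondary` helper and illustrative keys:

```python
import bisect

secondary = []      # sorted (composer, primary_key) pairs

def add_secondary(composer, primary_key):
    # Tuples compare by composer first, then by primary key, so a sorted
    # insert keeps duplicate composers adjacent and in primary key order.
    bisect.insort(secondary, (composer, primary_key))
```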
File Processing - Indexing MVNC 21
Indexing with multiple keys
Deleting records
» Remove entry from all secondary indexes
  – Costly if many secondary indexes
» Or simply leave entry in secondary indexes
  – Search in primary index will fail, indicating record not available
  – Failed searches take longer, but file management is simpler (faster)
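The "leave it in the secondary index" option can be sketched as follows. For brevity a dict stands in for the primary index and the secondary index is scanned linearly; both choices, and the sample keys and offsets, are assumptions for illustration.

```python
# Stand-in for the primary index: primary key -> byte offset.
primary = {"ANG3795": 70, "COL3345": 35}
# Secondary index entries: (composer, primary_key).
secondary = [("Beethoven", "ANG3795"), ("Beethoven", "COL3345")]

def delete(primary_key):
    """Remove the record from the primary index only."""
    primary.pop(primary_key, None)      # secondary index left untouched

def lookup(composer):
    """Stale secondary entries are skipped when the primary lookup fails."""
    hits = []
    for sk, pk in secondary:            # linear scan, for brevity only
        if sk == composer and pk in primary:
            hits.append((pk, primary[pk]))
    return hits
```

The stale entry still costs one failed primary lookup per search until the secondary index is eventually cleaned up.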
File Processing - Indexing MVNC 22
Indexing with multiple keys
Updating records
» The fact that secondary indexes refer to the primary key insulates secondary indexes from most updates
  – Records can move in main file without affecting secondary index
» Change in secondary key
  – If a secondary key value changes, then we must change the key value in the secondary index, requiring secondary index reordering
  – Other secondary indexes unchanged
File Processing - Indexing MVNC 23
Indexing with multiple keys
Updating records
» Change of primary key value
  – All secondary indexes must be updated to refer to the new key value
  – Since the secondary key is unchanged, no reorganization is required in secondary indexes - just rewrite index entries in the same spot
  – Usually one index entry needs updating per secondary index.
  – The main record itself will simplify looking up the associated reference in each secondary index!
File Processing - Indexing MVNC 24
Retrieval using combinations of secondary keys
Consider:
» Find all records with ID COL3345
» Find all records of Beethoven’s work
» Find all records of “Violin Concerto”
Each requires only a single index!
File Processing - Indexing MVNC 25
Retrieval using combinations of secondary keys
Now consider:
» Find all records with composer = “Beethoven” and title = “Symphony No. 9”.
Method one:
» Search composer index for those matching Beethoven. This yields a list of primary keys.
» Next search title index for those matching “Symphony No. 9”. This also yields a list of primary keys.
» Now intersect the two primary key lists. This is a list of primary keys for records which match the query.
File Processing - Indexing MVNC 26
Retrieval using combinations of secondary keys
General Strategies
» and queries: Intersect primary key lists
» or queries: Union primary key lists
Point: Complex queries can be performed accessing only the matching records!
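The intersect and union strategies map directly onto set operations. In this sketch the two secondary indexes are simplified to dicts from key value to a list of primary keys; the primary keys shown are illustrative only.

```python
# Simplified secondary indexes: key value -> list of primary keys.
composer_index = {
    "Beethoven": ["ANG3795", "COL3345", "DG18807"],
    "Prokofiev": ["LON2312"],
}
title_index = {
    "Symphony No. 9": ["COL3345", "DG18807"],
    "Violin Concerto": ["ANG3795"],
}

def and_query(keys_a, keys_b):
    """Primary keys present in both lists."""
    return sorted(set(keys_a) & set(keys_b))

def or_query(keys_a, keys_b):
    """Primary keys present in either list."""
    return sorted(set(keys_a) | set(keys_b))

matches = and_query(composer_index["Beethoven"],
                    title_index["Symphony No. 9"])
```

Only the primary keys in `matches` need to be looked up in the primary index, so no non-matching record is ever read from the main file.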
File Processing - Indexing MVNC 27
Secondary index problems
Consider problems with this secondary index structure:
» We have to rearrange the index file every time a new record is added!
  – If we add a new version of Beethoven’s Symphony No. 9, we would have to add a new element to both the composer and the title indexes
» If there are duplicate secondary keys, the secondary key value is stored in the secondary index once for every record with that secondary key!
  – Beethoven is stored in the secondary index once for every Beethoven record in the main file.
  – Waste of space!
File Processing - Indexing MVNC 28
Inverted lists
Solution one:
» Increase secondary index record size to include a list of all primary keys with matching values.
» Solves the two problems
» Introduces problems:
  – Records must be large enough for the maximum size list
  – Wastes space!
This is an Inverted List
File Processing - Indexing MVNC 29
Inverted lists
Solution Two:
» The Bible Index is a type of Inverted List
  – Works OK since never updated
  – If updates were needed, MANY records would have to be moved
File Processing - Indexing MVNC 30
Inverted lists
Solution Three:
» Secondary index has:
  – A list of secondary keys (all unique)
  – Each entry contains a pointer to a list of primary key references
» Now each key value is stored exactly once
» But how do we maintain the lists of primary key references?
Solution - linked lists!
File Processing - Indexing MVNC 31
Inverted lists
Inverted lists with linked lists of references
Two data structures
» A list of secondary keys, with pointers into a list of references
» A list of references, each with a (next) pointer, which refers to another reference in the list, or null
File Processing - Indexing MVNC 32
Inverted lists
The secondary key list is no bigger than the number of distinct secondary key values
» Can often be stored in RAM
» Lookups - binary search
The reference list can be stored in a file
» Unused entries are maintained as a linked list of free records
» Records are added by being delinked from the free list, and linked into the appropriate secondary key’s list.
» A record can be deleted by removing it from the key’s linked list and linking it into the free list.
File Processing - Indexing MVNC 33
Selective indexes
Consider a “special” index for Christian music
The index(es) would only contain references to albums which are considered Christian.