LIS 7450, Searching Electronic Databases Basic: Database Structure & Database Construction Dialog: Database Construction for Dialog (FYI) Deborah A. Torres

Post on 23-Feb-2016

Page 1: LIS  7450,  Searching Electronic Databases

LIS 7450, Searching Electronic Databases

Basic: Database Structure & Database Construction

Dialog: Database Construction for Dialog (FYI)

Deborah A. Torres

Page 2: LIS  7450,  Searching Electronic Databases

Database Structure

Organization of Data Elements and Records

Page 3: LIS  7450,  Searching Electronic Databases

Database Record

Record – the basic unit of information in a database (file).

Example: a bibliographic record contains descriptive information, e.g. author, title, publisher.

Page 4: LIS  7450,  Searching Electronic Databases

Fields

Field – a distinct part or section of a record (a unit of information within the record).

Example of personnel record fields: employee’s name, special identifier number, address, date of hire, etc.

Page 5: LIS  7450,  Searching Electronic Databases

Field Design Decisions

For each field:

Decide what information is placed within that field and the format for that information (text, numeric)
Should there be subfields within a field?
What to call the fields?
Field codes (abbreviations, numbering)
Order of the fields
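These decisions can be written down as a small field schema. The following is a minimal sketch only: the field codes (NM, ID, AD, DH), the subfield names, and the `FieldSpec` and `problems` helpers are hypothetical, chosen to match the personnel record example rather than taken from any real system.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class FieldSpec:
    code: str              # short field code (abbreviation); hypothetical
    name: str              # what the field is called
    kind: type             # format of the information (text, numeric, ...)
    subfields: tuple = ()  # optional subfield names

# The order of this tuple is the order of the fields in a record.
SCHEMA = (
    FieldSpec("NM", "employee name", dict, subfields=("first", "last")),
    FieldSpec("ID", "identifier number", int),
    FieldSpec("AD", "address", str),
    FieldSpec("DH", "date of hire", date),
)

def problems(record):
    """List every way `record` departs from SCHEMA (empty list = valid)."""
    found = []
    for spec in SCHEMA:
        if spec.code not in record:
            found.append(f"{spec.code}: missing")
            continue
        value = record[spec.code]
        if not isinstance(value, spec.kind):
            found.append(f"{spec.code}: expected {spec.kind.__name__}")
        elif spec.subfields and set(value) != set(spec.subfields):
            found.append(f"{spec.code}: wrong subfields")
    return found

# Made-up sample record.
employee = {
    "NM": {"first": "Jane", "last": "Doe"},
    "ID": 1042,
    "AD": "12 Example Street",
    "DH": date(2015, 9, 1),
}
print(problems(employee))
print(problems({"ID": "1042"}))
```

Storing the identifier as a number rather than text is exactly the kind of format decision listed above; the second call shows what happens when a record ignores it.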

Page 6: LIS  7450,  Searching Electronic Databases

Example: MARC Record (a type of record you should be familiar with)

Record Fields & Codes

The 100 field contains author information.
The 245 field contains main title information.
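As an illustration of records addressed by field code, here is a minimal sketch that stores a record as a mapping from tag to value. It is deliberately simplified: real MARC records also carry indicators and subfields, the sample values are made up, and `FIELD_LABELS` and `describe` are hypothetical helpers.

```python
# A simplified bibliographic record keyed by MARC-style field tags.
record = {
    "245": "An Example Title",  # main title (made-up value)
    "100": "Doe, Jane",         # author (made-up value)
}

# Human-readable labels for the two tags mentioned on the slide.
FIELD_LABELS = {"100": "Author", "245": "Title"}

def describe(rec):
    """Return 'Label (tag): value' lines in tag order."""
    return [
        f"{FIELD_LABELS.get(tag, 'Unknown')} ({tag}): {value}"
        for tag, value in sorted(rec.items())
    ]

for line in describe(record):
    print(line)
```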

Page 7: LIS  7450,  Searching Electronic Databases

Other Design Decisions

Hyphenated words: home-school

Stop words: high-frequency words not useful for searching

Single words and phrases: library, library science, color of money

Alternative spellings of words: color, colour
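Two of these decisions, hyphenated words and alternative spellings, can be made concrete with a small sketch. The `index_terms` function, its `split_hyphens` switch, and the one-entry spelling table are assumptions made for illustration; a real database would document its own rules.

```python
import re

# Hypothetical table mapping a variant spelling to the preferred form.
SPELLING_VARIANTS = {"colour": "color"}

def index_terms(text, split_hyphens=True):
    """Turn text into index terms under the chosen design decisions."""
    # A word is a run of letters, optionally joined by internal hyphens.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    terms = []
    for word in words:
        parts = word.split("-") if split_hyphens else [word]
        terms.extend(SPELLING_VARIANTS.get(part, part) for part in parts)
    return terms

print(index_terms("Home-school colour"))
print(index_terms("Home-school colour", split_hyphens=False))
```

With hyphens split, a search for "school" can match "home-school"; with hyphens kept, only the full form "home-school" is indexed.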

Page 8: LIS  7450,  Searching Electronic Databases

Types of Databases

Bibliographic – references and abstracts of published documents

Full text – the complete text of an article, dictionary entry, code of law, or other such document.

Directory – factual information about organizations, companies, products, people, or materials.

Page 9: LIS  7450,  Searching Electronic Databases

Types of Databases

Numeric – data in a tabular or statistically manipulated form, often with some added text.

Hybrid – a mix of record types. For example, a database may have full-text records for some publications and citations and abstracts for other source documents.

Page 10: LIS  7450,  Searching Electronic Databases

Database Construction

Basic Steps for automatic indexing of text documents

Page 11: LIS  7450,  Searching Electronic Databases

Six Basic Steps

Step 1: Parse text into words
Step 2: Compare to stoplist and eliminate stopwords
Step 3: Stem content words (reduce to root words); skip this step if you decide not to stem
Step 4: Count stemmed word occurrences
Step 5: Create union list of terms
Step 6: Create data structure for specific retrieval techniques (e.g. an inverted file)

Page 12: LIS  7450,  Searching Electronic Databases

Example: Simple Set of Five One-Sentence Documents

D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.

“D” stands for document

Page 13: LIS  7450,  Searching Electronic Databases

Step 1: Parse Text into Words

D1: it is a dog eat dog world
D2: while the world sleeps
D3: let sleeping dogs lie
D4: I will eat my hat
D5: my dog wears a hat

Note: Some databases remove punctuation from words, such as the apostrophe in possessives; others preserve it. What difference would this make?
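A minimal sketch of this parsing step follows, with a switch that shows the effect of the punctuation decision. The `parse` function and its `keep_apostrophes` flag are hypothetical, and lowercasing every word (including "I") is an assumption of the sketch.

```python
import re

DOCS = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}

def parse(text, keep_apostrophes=False):
    """Split text into lowercase words, dropping other punctuation."""
    text = text.lower()
    if not keep_apostrophes:
        text = text.replace("'", "")  # dog's becomes dogs
    # A word is a run of letters, optionally with internal apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)

parsed = {doc_id: parse(text) for doc_id, text in DOCS.items()}
for doc_id, words in parsed.items():
    print(doc_id + ":", " ".join(words))

# The punctuation decision changes which term gets indexed.
print(parse("The dog's hat"))
print(parse("The dog's hat", keep_apostrophes=True))
```

If apostrophes are removed, the possessive "dog's" becomes indistinguishable from the plural "dogs"; if they are preserved, a searcher who types "dogs" will not match "dog's".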

Page 14: LIS  7450,  Searching Electronic Databases

Step 2: Eliminate Stop Words

D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat

Stop words are content-free words – those not useful in determining the content of the document.
Examples: pronouns (I, my), prepositions (of, by, on), articles and demonstratives (a, the, this)
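Stop word elimination is a simple filter against a stoplist. In the sketch below the stoplist is an assumption: it contains just the words needed to reproduce the result for the five sample documents, whereas real stoplists are chosen by the database producer.

```python
# Hypothetical stoplist, chosen to match the sample documents.
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}

# Output of Step 1 (parsed words per document).
parsed = {
    "D1": ["it", "is", "a", "dog", "eat", "dog", "world"],
    "D2": ["while", "the", "world", "sleeps"],
    "D3": ["let", "sleeping", "dogs", "lie"],
    "D4": ["I", "will", "eat", "my", "hat"],
    "D5": ["my", "dog", "wears", "a", "hat"],
}

def remove_stopwords(words):
    """Keep only words that are not on the stoplist (case-insensitive)."""
    return [word for word in words if word.lower() not in STOPLIST]

content_words = {doc_id: remove_stopwords(words) for doc_id, words in parsed.items()}
for doc_id, words in content_words.items():
    print(doc_id + ":", " ".join(words))
```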

Page 15: LIS  7450,  Searching Electronic Databases

Step 3: Stemming (remember, not all databases stem words)

Before stemming:
D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat

After stemming:
D1: dog eat dog world
D2: world sleep
D3: let sleep dog lie
D4: eat hat
D5: dog wear hat
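The stemming result above can be reproduced with a crude suffix stripper. This is a toy sketch, not one of the published stemming algorithms: the suffix list, the minimum stem length of three letters, and the `weak_stem` name are all assumptions, and the rule will mis-stem many real words.

```python
# Inflectional suffixes, longest first so "ing" is tried before "s".
SUFFIXES = ("ing", "'s", "es", "ed", "s")

def weak_stem(word):
    """Strip one inflectional suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Output of Step 2 (content words per document).
content_words = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleeps"],
    "D3": ["let", "sleeping", "dogs", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wears", "hat"],
}

stemmed = {
    doc_id: [weak_stem(word) for word in words]
    for doc_id, words in content_words.items()
}
for doc_id, words in stemmed.items():
    print(doc_id + ":", " ".join(words))
```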

Page 16: LIS  7450,  Searching Electronic Databases

Types of Stemming Decisions

No Stemming: contract, contracts, contracted, contracting, contractor, contraction, contractual, contracture

Weak Stemming (inflections): -s, -es, -ed, -ing, -’s

Strong Stemming (derivations): -tion, -ly, -ally

Reduce words to a root variant; there are different stemming algorithms

Page 17: LIS  7450,  Searching Electronic Databases

A bit more about stemming for searching…

Some databases automatically search for all of the words that come from the same stem/root word unless you indicate that you only want the word you entered.

Example: if you entered computer, the database would also search for computing, computers, computation, etc.
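A minimal sketch of this search behaviour groups the vocabulary by stem and expands a query to every word sharing its stem, unless an exact match is requested. The suffix list, the sample vocabulary, and the `stem` and `expand` functions are hypothetical; real databases use their own stemming rules and their own syntax for turning stemming off.

```python
from collections import defaultdict

# Toy suffix list, longest first; enough for the "computer" example only.
SUFFIXES = ("ation", "ers", "ing", "er", "s")

def stem(word):
    """Strip one suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Made-up vocabulary of indexed words.
VOCABULARY = ["compete", "computation", "computer", "computers", "computing"]

by_stem = defaultdict(list)
for word in VOCABULARY:
    by_stem[stem(word)].append(word)

def expand(query, exact=False):
    """Return the words searched for a query, with or without stemming."""
    if exact:
        return [query] if query in VOCABULARY else []
    return sorted(by_stem.get(stem(query), []))

print(expand("computer"))
print(expand("computer", exact=True))
```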

Page 18: LIS  7450,  Searching Electronic Databases

Step 4: Sort Words, Count Duplicates

Sort into alphabetical order:
D1: dog dog eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear

Count any duplicates:
D1: dog(2) eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear
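Sorting and counting can be sketched with the standard library's `collections.Counter`. The input below is the stemmed output of Step 3; the variable names are illustrative.

```python
from collections import Counter

# Output of Step 3 (stemmed words per document).
stemmed = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

# For each document: (word, count) pairs in alphabetical order.
counts = {
    doc_id: sorted(Counter(words).items())
    for doc_id, words in stemmed.items()
}

for doc_id, pairs in counts.items():
    # Show the count only when a word occurs more than once.
    shown = [word if n == 1 else f"{word}({n})" for word, n in pairs]
    print(doc_id + ":", " ".join(shown))
```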

Page 19: LIS  7450,  Searching Electronic Databases

Step 5: Create Union List of Unique Terms

Unsorted List: dog eat world sleep world dog let lie sleep eat hat dog hat wear

Sorted List: dog dog dog eat eat hat hat let lie sleep sleep wear world world

Sorted, Unique List: dog eat hat let lie sleep wear world
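The three lists in this step can be derived in a few lines. The sketch starts from the unique terms of each document (the output of Step 4 without the counts); the variable names are illustrative.

```python
# Unique terms per document, from Step 4.
doc_terms = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}

# Concatenate the per-document term lists in document order.
unsorted_list = [term for terms in doc_terms.values() for term in terms]

# Sort, then drop duplicates to get the union list.
sorted_list = sorted(unsorted_list)
union_list = sorted(set(unsorted_list))

print("Unsorted:", " ".join(unsorted_list))
print("Sorted:", " ".join(sorted_list))
print("Unique:", " ".join(union_list))
```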

Page 20: LIS  7450,  Searching Electronic Databases

Step 6: Create Inverted Index (inverted file)

Union List (unique terms): dog eat hat let lie sleep wear world

Inverted Index:
dog: D1 D3 D5
eat: D1 D4
hat: D4 D5
let: D3
lie: D3
sleep: D2 D3
wear: D5
world: D1 D2

The inverted index has pointers to the documents in which each word occurs.
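A minimal sketch of building the inverted index from the per-document terms, plus a small lookup to show why the structure is useful for retrieval. The `search_all` helper is hypothetical and implements a simple Boolean AND.

```python
from collections import defaultdict

# Unique terms per document, from Step 4.
doc_terms = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}

# Invert the mapping: term -> list of documents containing it.
inverted = defaultdict(list)
for doc_id in sorted(doc_terms):
    for term in doc_terms[doc_id]:
        inverted[term].append(doc_id)

for term in sorted(inverted):
    print(f"{term}: {' '.join(inverted[term])}")

def search_all(*terms):
    """Documents containing every given term (a Boolean AND)."""
    postings = [set(inverted.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_all("dog", "hat"))
```

Answering the query only requires intersecting two short lists of document pointers; no document text has to be scanned at search time.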

Page 21: LIS  7450,  Searching Electronic Databases

Dialog Database Construction

FYI: For those interested in Dialog

Page 22: LIS  7450,  Searching Electronic Databases

Dialog Database Construction

Step 1: Create a linear file of records received from the Information Provider. Assign sequential accession numbers to the records.

Step 2: Label the fields within the records: AU for Author, TI for Title, etc. If a field is word-indexed, also label the words within each field. Exclude stop words: AN, FOR, THE, AND, FROM, TO, BY, WITH.

Page 23: LIS  7450,  Searching Electronic Databases

Dialog Database Construction

Step 3: Create the Basic Index: all words and phrases from fields containing subject-related terms.

Step 4: Create the Additional Indexes: all terms from all remaining fields.
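The four Dialog steps can be sketched roughly as follows. This is an illustration under stated assumptions, not Dialog's actual implementation: the sample records are made up, treating only TI as a subject-related field is an assumption, the PY field code and the `CODE=VALUE` form for Additional Index entries are assumptions, and the word labelling of Step 2 is reduced to stop word removal.

```python
from collections import defaultdict

# Stop words as listed on the slide.
STOP_WORDS = {"AN", "FOR", "THE", "AND", "FROM", "TO", "BY", "WITH"}

# Assumption: which fields count as subject-related.
SUBJECT_FIELDS = {"TI"}

# Made-up records as received from an information provider,
# already labelled with field codes (Step 2).
raw_records = [
    {"AU": "Doe, Jane", "TI": "Searching for the inverted file", "PY": "1999"},
    {"AU": "Roe, Richard", "TI": "File structures", "PY": "2001"},
]

# Step 1: linear file with sequential accession numbers.
linear_file = {number: rec for number, rec in enumerate(raw_records, start=1)}

basic_index = defaultdict(set)       # Step 3: subject words
additional_index = defaultdict(set)  # Step 4: all remaining fields

for accession, rec in linear_file.items():
    for code, value in rec.items():
        if code in SUBJECT_FIELDS:
            for word in value.upper().split():
                if word not in STOP_WORDS:
                    basic_index[word].add(accession)
        else:
            additional_index[f"{code}={value.upper()}"].add(accession)

print(sorted(basic_index["FILE"]))
print(sorted(additional_index["AU=DOE, JANE"]))
```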