information management


Post on 05-Jan-2016


Page 1: Information Management

1

Information Management

Lecture 3: Cataloging, Indexing, Searching

J. Michael Moshell

University of Central Florida

Original image* by Moshell et al .

Page 2: Information Management

-2 -

Cataloging and Indexing

Why are we discussing this?

www.joe-ks.com

I don't believe in memorizing a bunch of soon-obsolete facts.

I DO believe that many of you will have to solve info-management problems.

You will probably invent ways of doing it.

So you should "steal from the best" – not reinvent the wheelbarrow.

Page 4: Information Management

-4 -

How do we find things?

1) By starting in the neighborhood of similar things.

2) By using the name of the thing, and asking an "expert" or "resource".

When reading a book:

Look in the table of contents, for an ARTICLE.

Look in the index, for a TOPIC.

Page 5: Information Management

-5 -

How do we find things?

1) By starting in the neighborhood of similar things.

2) By using the name of the thing, and asking an "expert" or "resource".

At the library:

Go to the relevant section, browse shelves.

Use the (card) catalog (really an index.)

Page 6: Information Management

-6 -

How do we find things?

1) By starting in the neighborhood of similar things.

2) By using the name of the thing, and asking an "expert" or "resource".

On the Internet:

Follow links from trusted sources (like CNET).

Use the indexes, e.g.

• those provided by search engines

• those provided by vendors (eBay, Amazon...)

• those provided by facilitators (YouTube, Craigslist)

Page 7: Information Management

-7 -

What's an index?

• An index is a system that serves to optimize speed in finding relevant documents in a search.

• An index is a system that, given one or more search terms from either metadata or essence, efficiently reports the location of the essence.

What's fast? What's efficient?

Here comes some math ... (how we all love it!)

Page 8: Information Management

-8 -

Order statistics

A document contains k records (perhaps k = 1000).

If you must examine EACH RECORD to find what you seek,

the search is Order-k, written as O(k).

For ancient records, this is usually the only way.

For instance, the Archivo General de Indias in Seville, Spain

www.learningcurve.gov.uk

Page 9: Information Management

-9 -

Order statistics

On the average, you would look at 500 records (0.5 * k) to find the one you are seeking.

Let's say we seek a ship named Nuestra Senora de Atocha.
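
To make the O(k) claim concrete, here is a minimal Python sketch of a linear search that counts how many records it examines. The record list and the name `linear_search` are invented for illustration; the slides do not show the archive's actual records.

```python
def linear_search(records, target):
    """Scan records one by one; return (position, records examined)."""
    examined = 0
    for position, record in enumerate(records):
        examined += 1
        if record == target:
            return position, examined
    return None, examined


# Hypothetical archive: 1000 unsorted records, with the Atocha in the middle.
records = [f"Ship {i}" for i in range(1000)]
records[499] = "Nuestra Senora de Atocha"

position, examined = linear_search(records, "Nuestra Senora de Atocha")
print(position, examined)  # 499 500
```

A record that is not in the archive at all costs the full k = 1000 examinations before the search can give up.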

Page 10: Information Management

-10 -

Indexing

To prepare an index of all ships' names, captains' names,

owners and dates in the archive, collecting the entries would take O(k) time.

Why? Because every document would be visited. Each index item contains a SEARCH TERM and a DOCUMENT NUMBER.

BUT now (if the index is sorted, which it is) we can

find S = Nuestra Senora de Atocha much faster, by playing

"binary search".

[Figure: a sorted index running from A to Z; at the probe point we ask "S > this?"]

Page 11: Information Management

-11 -

Indexing

If someone prepared an index of all ships, captains' names,

owners and dates in the archive, this would take O(k) time.

Why? Because every document would be visited.

BUT now (if the index is sorted, which it is) we can

find S = Nuestra Senora de Atocha much faster, by playing

"binary search".

[Figure: the sorted index from A to Z; at the probe point the question "S > this?" is answered "no".]

Page 13: Information Management

-13 -

Indexing and binary Search

1 comparison distinguishes 2 records

2 comparisons distinguish 4 records

3 comparisons distinguish 8 records ...

10 comparisons distinguish 1024 records

20 comparisons distinguish over a million records.

Each comparison cuts the search space in half, so the search is O(log k).

[Figure: sorted index from A to Z]
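
The halving idea can be sketched in Python as follows. This is only an illustration under assumed data: the index entries and the name `binary_search` are made up, and the counter tracks probes (midpoints inspected) rather than anything the slides define.

```python
def binary_search(sorted_index, target):
    """Return (position, probes); position is None if target is absent."""
    low, high = 0, len(sorted_index) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        if sorted_index[mid] == target:
            return mid, probes
        if sorted_index[mid] < target:
            low = mid + 1    # "S > this?" yes: keep the upper half
        else:
            high = mid - 1   # no: keep the lower half
    return None, probes


# Hypothetical sorted index of 1024 entries
# (zero-padded so that text order matches numeric order).
index = [f"ship-{i:04d}" for i in range(1024)]
position, probes = binary_search(index, "ship-0700")
print(position, probes)
```

For 1024 entries no lookup needs more than 11 probes, in line with the "about 10 comparisons" estimate above.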

Page 14: Information Management

-14 -

OMG, a Log? Puleeeeez ....

Yep, this is college and you are a DIGITAL Media Major. So here goes.

2^0 = 1

2^1 = 2

2^2 = 2*2 = 4

2^3 = 2*2*2 = 8

...

2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand

2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million

2^30 = 2*2*...*2 (thirty twos) = 1024 * 1024 * 1024 = 1 gig, about a billion

(This is the "log" in O(log k).)

log2(k)    k
1          2
2          4
3          8
10         1024
20         1 meg
30         1 gig

Page 18: Information Management

-18 -

OMG, a Log? Puleeeeez ....

You need to be able to tell me what log2(k) is for any k (power of two) between 1 and 1 meg.

Example: what is log2(256k)?

256 = 2^8, and 1k ~= 2^10.

So 256k = 2^8 * 2^10 = 2^18; that's 2*2*...*2 (18 twos).

log2(256k) = 18
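
One way to check answers like this is to count how many times k can be halved before reaching 1. The sketch below does that in Python; the function name and the constants `KILO`, `MEG` and `GIG` are just labels chosen for this example.

```python
import math

KILO = 2 ** 10   # 1024, about a thousand
MEG = 2 ** 20    # about a million
GIG = 2 ** 30    # about a billion


def log2_of_power_of_two(k):
    """Return log2(k) for an exact power of two, by counting halvings."""
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of two")
    steps = 0
    while k > 1:
        k //= 2
        steps += 1
    return steps


print(log2_of_power_of_two(256 * KILO))  # 18
print(math.log2(MEG), math.log2(GIG))    # 20.0 30.0
```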

Page 19: Information Management

-19 -

OMG, a Log? Puleeeeez ....

I will provide a Logarithm Practice Sheet on the website

to help you study and practice for the midterm exam.

Page 20: Information Management

-20 -

Indexing and binary Search

Linear Search        Binary Search
1000 items           10 steps
1 million items      20 steps
1 billion items      30 steps

Each comparison cuts the search space in half: O(log2 k)

[Figure: sorted index from A to Z]

Page 21: Information Management

-21 -

Sorting N Objects

We will discuss sorting, a bit later

After you recover from Math Anxiety

Slcc.edu

Page 22: Information Management

-22 -

Why not just keep books in order?

Could you do 'binary search' directly on the books ...?

Yes, if they're on the shelf in the order you need. But WHICH order?

- by ship names?

- by captains' names?

- by year of construction?

- by year of sinking or decommissioning?

An index can be sorted on any data field, then searched.

Sorting k objects takes O(k * log k) time,

so sorting a billion objects takes 1 billion * log2(1 billion)

= 1 billion * 30 = 30 billion steps.

Page 23: Information Management

-23 -

Why not just keep books in order?

(This can be done overnight, when computers aren't busy.)

BUT, once the index is sorted, inserting new information takes O(log k) time

(given a suitable structure, such as a balanced tree).

So, you can insert a new fact into our billion-item index in

about 30 steps. Fast!
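
As a rough illustration, Python's standard `bisect` module finds the insertion point in a sorted list by binary search. The ship names are placeholders. Note one caveat of this sketch: a plain Python list must still shift later entries to make room, so only the search part is O(log k) here; tree-based indexes avoid that shifting.

```python
import bisect

# Hypothetical sorted index of ship names.
index = ["Atocha", "Margarita", "Rosario", "Santa Ana"]

# bisect_left locates the insertion point with O(log k) comparisons.
position = bisect.bisect_left(index, "Nuestra Senora")
index.insert(position, "Nuestra Senora")

print(position, index)
```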

Page 25: Information Management

-25 -

What terms shall we index?

- For text, the essence yields keyword search

- The dumbest but easiest kind of search, if essence=digital text.

- This was not true for traditional libraries.

- Nobody had time to catalog every word of every book.

- Professional catalogers had to develop techniques:

- Author

- Title

- Publication Date

- Subject

(METADATA!)

And this last one, Subject, took more work than all the rest together.

Page 26: Information Management

-26 -

What's so hard about subject indexing?

- The problem: restricting the vocabulary.

Let's consider a fictional book:

The Skills of a Nineteenth Century Bartender.

Henry Macintosh, New York, 1889

How might someone seek this book?

Or: what metadata fields might the librarian use?

Occupations: bartender, barkeeper, barman, barkeep

(Are there others we forgot to search for?)

So catalogers established rules involving precedent

to restrict vocabularies and establish standards
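
A restricted vocabulary can be pictured as a lookup table that sends every variant to one preferred term. The sketch below is only an illustration: the choice of "bartender" as the preferred term and the name `normalize_subject` are assumptions, not an actual cataloging rule.

```python
# Hypothetical controlled vocabulary: every variant maps to one preferred term.
PREFERRED_TERM = {
    "bartender": "bartender",
    "barkeeper": "bartender",
    "barman": "bartender",
    "barkeep": "bartender",
}


def normalize_subject(term):
    """Map a search or catalog term to its preferred form, if one is defined."""
    key = term.strip().lower()
    return PREFERRED_TERM.get(key, key)


print(normalize_subject("Barkeep"))   # bartender
print(normalize_subject("publican"))  # publican (not in the vocabulary yet)
```

The second call shows the "others we forgot" problem: a term outside the table passes through unchanged, so the vocabulary needs rules for being extended.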

Page 27: Information Management

-27 -

Cataloging an Item for a Library

The card catalog at Yale University (of course, it's all computerized now)

Page 28: Information Management

-28 -

Cataloging an Item for a Library

Problem #1: What book (or other object) are we talking about?

- Each item has an accession number (that's easy to issue)

- Each title has a catalog number, shared with all instances

(sometimes separate copies are called .c1, .c3 etc.)

Problem #2: What catalog number should I give this item?

- Did someone else catalog it already? If so, use that.

- If not, follow the International Standard Bibliographic Description (ISBD)

Page 29: Information Management

-29 -

International Standard Bibliographic Description (ISBD)

• title,

• statement of responsibility (author or editor),

• edition,

• material specific details (for example, the scale of a map),

• publication and distribution,

• physical description (for example, number of pages),

• series (e.g. this might be part 3 of a trilogy),

• notes,

• standard number (ISBN).
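
In software, these areas might become the fields of a catalog record. The following is a minimal sketch, not an official ISBD schema: the class name, field names and types are assumptions, and the sample values come from the fictional bartender book used earlier in this lecture.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BibliographicRecord:
    """One catalog entry, with one field per ISBD area (names are illustrative)."""
    title: str
    statement_of_responsibility: str
    edition: Optional[str] = None
    material_specific_details: Optional[str] = None
    publication_and_distribution: Optional[str] = None
    physical_description: Optional[str] = None
    series: Optional[str] = None
    notes: list = field(default_factory=list)
    standard_number: Optional[str] = None


record = BibliographicRecord(
    title="The Skills of a Nineteenth Century Bartender",
    statement_of_responsibility="Henry Macintosh",
    publication_and_distribution="New York, 1889",
)
print(record.title, "/", record.statement_of_responsibility)
```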

Page 30: Information Management

-30 -

And then follow a complex set of rules.

Most English-language cataloging follows the

Anglo-American Cataloguing Rules (AACR2).

Germans follow the

Regeln für die alphabetische Katalogisierung.

Etc.

Page 31: Information Management

-31 -

How to organize an index

- Step 1: Deciding what fields to include (the Ontology of the subject space).

- Step 2: Deciding if each metadata field is open or controlled (CV = controlled vocabulary).

Open set: American family names

Closed set: Chinese family names

In software, CV fields are often presented as pulldown menus.

- Step 3: Establishing the controlled vocabulary, and rules for extending it.

- Step 4: Maintaining it (e.g. MIME types, subtypes).

http://www.kksou.com

Page 32: Information Management

-32 -

Concept: "Low-hanging fruit"

- In any new domain, some ideas will come together

that present opportunities not previously possible.

- Some of them will be easy to do.

- Get these first, and you may be rich.

The cataloging of dynamic media such as video can take advantage of techniques for Content Logging.

In this area, closed captions were a low-hanging fruit.

www.recipeforlowhangingfruit.com

Page 33: Information Management

-33 -

Closed Captions for Content Logging

- Originally for deaf viewers ... now also used in bars, etc.

- "Closed" – not all viewers will see the captions

- But they are built into most TV broadcasts.

>> Indicates a new speaker has begun to talk.

www.recipeforlowhangingfruit.com

Page 34: Information Management

-34 -

Closed Captions for TV

But isn't speech recognition still hard?

- Yes, but there are SCRIPTS and TELEPROMPTERS behind most TV programming. Live news feeds are a mix of scripted and unscripted.

- The BBC developed a re-speak technology to maximize clarity.

- Sound effects and music are shown by # or notes.

www.recipeforlowhangingfruit.com

Page 35: Information Management

-35 -

Closed Captions for TV

- Now that CC exists, you can index it to produce metadata.

- Services monitor it in real time for significant stories.

www.recipeforlowhangingfruit.com
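
Indexing caption text could look roughly like the sketch below, which maps each word to the time codes where it appeared. The caption lines, the time code format and the name `build_caption_index` are all invented for illustration; real caption feeds carry more structure than this.

```python
from collections import defaultdict

# Hypothetical caption log: (time code, caption text). ">>" marks a new speaker.
captions = [
    ("00:00:05", ">> Good evening, a storm is approaching Florida."),
    ("00:00:12", ">> Officials urge coastal residents to evacuate."),
    ("00:01:30", ">> In sports, Orlando won again."),
]


def build_caption_index(captions):
    """Map each lowercased word to the time codes where it was spoken."""
    index = defaultdict(list)
    for timecode, text in captions:
        for word in text.lower().split():
            word = word.strip(">.,!?\"'")  # drop speaker marks and punctuation
            if word:
                index[word].append(timecode)
    return index


index = build_caption_index(captions)
print(index["storm"])    # ['00:00:05']
print(index["orlando"])  # ['00:01:30']
```

A monitoring service of the kind mentioned above could then watch this index for chosen keywords and jump straight to the matching time codes.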

Page 37: Information Management

-37 -

Can you think of another TV "LHF"?

Where is another source of already-in-text-form metadata

about TV program contents? (I can think of two).

• Electronic Program Guides, such as TiVo's TV programming schedule

• Broadcasters' websites (e.g. www.cbs.com)

Page 38: Information Management

-38 -

We've discussed third-party logging.

But what about in-house logging (by the materials' own producers)?

Static metadata (exists independently of the essence)

• Production Notes, including original scripts

• Edit Decision List (part of production notes)

• Advanced Authoring Format (AAF)

• News Feed rundowns (cues for local broadcasters)

Media Object Server (MOS) format

Page 39: Information Management

-39 -

We've discussed third-party logging.

But what about in-house logging (by the materials' own producers)?

Dynamic metadata (sampled from or derived from the essence)

A hierarchy of proxy representations:

- time code (ties it all together)

- Proxy video (low res, maybe easier to scan – or harder!)

- Keyframes (still images for pattern recognition)

- Audio transcript

- annotation – added by staff

Page 40: Information Management

-40 -

Speech Analysis

- Phoneme: the smallest unit of sound that distinguishes one word from another. English has about 44.

- Phone: the 'rendering' of a phoneme by an individual speaker. There are infinitely many.

- Recognition of words: difficult under good conditions, nearly impossible under noisy conditions.

However, you don't need to get ALL the words to make the

document searchable. Even getting SOME of the words is better

than none.

www.nuance.com

Page 41: Information Management

-41 -

Indexing things that aren't words

- Built-in metadata (e.g. digital camera data, Adobe metadata)

- Image libraries, cataloged by human beings

(We will study some of the metadata standards used.)

- Automatic pattern recognition

http://www.autonomy.com/content/Solutions/video-surveillance/index.en.html

Assignment: Download ONE of the "Autonomy Virage" documents, read it, and be prepared to give a one-minute summary of its claims.

Page 42: Information Management

-42 -

Recognizing Faces

- FINDING a face in a scene is far easier than RECOGNIZING it.

- Nikon's cameras can now find faces and focus on them.

Face-priority AF in Nikon Coolpix Cameras

But it's a rough rough world out there. The website listed

below provides a list of vendors ... many of which are 'dead

links' as companies come and go.

http://www.face-rec.org/vendors//

Page 43: Information Management

-43 -

And ... where do we go from here?

Go back through these slides. Make a list of the important words.

If you can write a one-sentence explanation of every word on this list, AND answer logarithm questions, you're ready for the midterm ... at least with regard to Searching and Pattern Recognition.

But now let's go talk about SORTING.

Page 44: Information Management

-44 -

Sorting

Why are we discussing this?

It's a good example of DUMB vs. SMART algorithms.

What's an algorithm?

A systematic procedure for solving a problem.

Programs are built on the basis of algorithms. But so are

* carpentry

* medical diagnosis

* electronic repair ... etc.

Page 45: Information Management

-45 -

Sorting and Ignorance

Two thousand name-tags

Printed in NAME order

Needed in COMPANY order

So... they put six temps to work ... for HOURS...

Mnddc.org

Page 46: Information Management

-46 -

Sorting the Hard Way

Spread 'em all on a long table.

Insert each one into the ordered pile.

Problem: The pile gets bigger and bigger,

so the insertion goes more & more slowly.

Page 47: Information Management

-47 -

Sorting the Hard Way

Spread 'em all on a long table.

Insert each one into the ordered pile.

Walk down the row (pass up to n badges), insert one.

Do this n times. You have up to n * n distance to walk.

This technique takes O(n^2) time (that's n squared).

2000 * 2000 = 4 million operations!
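
The "walk down the row, insert one" procedure can be sketched in Python as below. Numbers stand in for badges, and the `walked` counter is an assumption of this sketch: it counts how many pile positions are passed in total.

```python
def insertion_sort(badges):
    """Sort a copy of badges; return (sorted pile, pile positions walked past)."""
    pile = []
    walked = 0
    for badge in badges:
        position = 0
        # Walk down the ordered pile until we find where this badge belongs.
        while position < len(pile) and pile[position] <= badge:
            position += 1
            walked += 1
        pile.insert(position, badge)
    return pile, walked


# Worst case for this walk: badges that arrive already in order,
# so each new badge must pass every badge already in the pile.
n = 2000
pile, walked = insertion_sort(list(range(n)))
print(walked)  # 1999000, roughly n * n / 2
```

The exact count is about half of the 4 million quoted above, but it still grows with n squared: doubling the number of badges roughly quadruples the walking.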

Page 48: Information Management

-48 -

Sorting, a smart way

1. Grab 20 badges, and sort them in a small group.

Create 100 small, sorted batches.

2. Combine the batches 2 by 2, like this:

[Figure: merge tree. Pairs of 20-badge batches combine into 40-badge batches, pairs of those into 80-badge batches, etc.]

Page 49: Information Management

-49 -

Sorting, a smart way

2. Combine the batches 2 by 2, like this:

[Figure: merge tree. Pairs of 20-badge batches combine into 40-badge batches, pairs of those into 80-badge batches, etc.]

Reminds you of binary search? Yes.

Merging twice as many groups only takes one more step (layer).

4 groups: 2 layers (3 operations)

8 groups: 3 layers (7 operations), etc.

Page 50: Information Management

-50 -

Sorting by 'merge-sort'

Merge-Sort requires O(n log2 n) operations to sort n objects.

For 2000 name badges, log2(2000) = log2(1000) + 1.

You recognize log2(1000) ~= log2(1k) = 10,

so log2(2000) ~= 11.

So our total estimate for sorting 2000 name badges is approximately 2000 * 11, or 22,000 steps,

compared to 4 million steps (2000 * 2000) if doing the job the BFI (Brute Force & Ignorance) way!
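
Here is a small Python sketch of merge-sort that counts comparisons, so the estimate can be checked. It differs from the name-badge procedure in one labeled way: it splits all the way down to single items instead of starting from hand-sorted batches of 20. The badge data is made up.

```python
import math


def merge_sort(items):
    """Return (sorted list, number of comparisons made while merging)."""
    if len(items) <= 1:
        return list(items), 0
    middle = len(items) // 2
    left, left_count = merge_sort(items[:middle])
    right, right_count = merge_sort(items[middle:])

    merged = []
    comparisons = left_count + right_count
    i = j = 0
    # Repeatedly take the smaller front item from the two sorted halves.
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


# Hypothetical badges: 2000 numbers in reverse order stand in for company names.
badges = list(range(2000, 0, -1))
ordered, comparisons = merge_sort(badges)
print(comparisons, "comparisons; n * log2(n) is about", round(2000 * math.log2(2000)))
```

The comparison count stays below the 22,000-step estimate, far from the millions needed the hard way.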

Page 52: Information Management

-52 -

Sorting by 'merge-sort'

The moral of this story:

1) Do a little research before you undertake a major project.

An hour's investigation might save WEEKS of work, and it might save your BUSINESS.

2) Ask an expert, if you have one. Become an expert, if you don't have one.

<<Tell the story of the stuck sailboat>>

Usfamily.net

Page 53: Information Management

-53 -

Seattletimes.nwnews.com