Understanding the Content Index. Review: The Search Engine

Understanding the Content Index

Upload: cecily-dalton

Post on 03-Jan-2016


TRANSCRIPT

Page 1: Understanding the Content Index. Review: The Search Engine

Understanding the Content Index


Review: The Search Engine


Review: Content Indices

• Content indices store pages in compressed form, as inverted files.

• A basic content index has terms, identification numbers (IDs), and occurrences.

1 (review) 1
2 (content) 2, 4, 15
3 (indices) 3, 5, 16
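As a minimal sketch of how such an inverted file could be built, the Python below maps each term to the word positions where it occurs. It assumes that "occurrences" means 1-based word positions in the page text; the function name `build_inverted_index` and the sample sentence are illustrative and do not come from the lecture.

```python
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercase term to the 1-based word positions where it occurs.

    Assumption: an "occurrence" is a word position in the page text.
    """
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        # Strip surrounding punctuation so "indices." and "indices" match.
        term = word.strip(".,;:!?\"'()")
        if term:
            index[term].append(position)
    return dict(index)


# Hypothetical sample text, chosen only to echo the slide's terms.
sample = "Review: content indices. Content indices are raw material."
index = build_inverted_index(sample)

for term in ("review", "content", "indices"):
    print(term, index[term])
```

A real index would also assign each term a numeric ID and compress the position lists; both steps are left out here to keep the sketch short.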


Review: Content Indices

• Content indices are the raw material needed to conduct a relevant search.

• A valuable search ranks the relevant material, ensuring you get the best of the available information.

• To get the best possible search results, you have to combine many page measurements.


Types of indexes

A single page can be processed to derive a number of indices:

• Content indices (Semantics)
– Text, data, metadata

• Structural indices (Structures)
– Tags, links, parent-child relationships

• File indices (Supports)
– Availability & relationship to NMP files (pdf, mov, avi, etc.)


Re-focus on the content index (CI)

A simple inverted file allows you to claim that you have “indexed the web.”

In 1998 Alta Vista had the largest web index (repository of page indices). Google surpassed AV in 2000.

In mid-2001, AllTheWeb surpassed Google for about 3 months.

Google retook the lead in late 2001, and has been the leader in “pages indexed” ever since.

So what?

Number of pages indexed is a measure, not a business. You have to have the ability to process those indices in a meaningful way to create value.


Refine the CI

The simple inverted file index is insufficient; it doesn’t capture all the available information in a web page, web site, and internet domain. Consider the example below:

1 (review) 1
2 (content) 2, 4, 15
3 (indices) 3, 5, 16

What about the relationship of this page index to the other pages in the site? What about the structure of the page? Is there information that can refine our understanding of the quality of the content?


The CI in detail, continued

We can refine the quality of the page CI by replacing simple occurrences with vector space measurements.

What makes sense? How should we structure the vector measure?

First, look at page structure.


The CI in detail, continued

<html>
<head>
<title>Understanding the Content Index Scores</title>
<meta name="DESCRIPTION" content="Lecture on the optimal design of content for content indexing">
<meta name="KEYWORDS" content="Content index, CI, search engine, SEO, ICC600, Applied Research in Content Management">
<meta name="AUTHOR" content="Stephen Masiclat, Associate Professor, The Newhouse School, Syracuse University">
</head>
<body> [Text from each slide]</body>
</html>


Replace Occurrence with a Vector

The vector is developed using a heuristic, that is, a set of rules. Different people might use different heuristics to obtain a valuable measure. Those differences are what differentiate one index's value from another's.

One heuristic might be to create a vector to measure occurrence in the TITLE, META DESCRIPTION, and BODY tags.
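One way to sketch this heuristic is with Python's standard `html.parser` module: collect the text of the TITLE, META DESCRIPTION, and BODY zones, then count whole-word matches of a term in each. The class name `ZoneCollector`, the function `occurrence_vector`, and the sample page are hypothetical, and whole-word, case-insensitive matching is an assumption rather than something the lecture specifies.

```python
import re
from html.parser import HTMLParser


class ZoneCollector(HTMLParser):
    """Collect text from the title, meta description, and body of a page."""

    def __init__(self):
        super().__init__()
        self.zones = {"title": [], "description": [], "body": []}
        self._zone = None  # "title" or "body" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.zones["description"].append(attrs.get("content") or "")
        elif tag in ("title", "body"):
            self._zone = tag

    def handle_endtag(self, tag):
        if tag == self._zone:
            self._zone = None

    def handle_data(self, data):
        if self._zone:
            self.zones[self._zone].append(data)


def occurrence_vector(html, term):
    """Return [title, description, body] counts of whole-word matches."""
    parser = ZoneCollector()
    parser.feed(html)
    parser.close()
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    return [
        len(pattern.findall(" ".join(parser.zones[zone]).lower()))
        for zone in ("title", "description", "body")
    ]


# Hypothetical sample page, not the lecture's own HTML.
page = """<html><head>
<title>Understanding the Content Index</title>
<meta name="DESCRIPTION" content="Lecture on content indexing">
</head><body><p>The content index stores content as an inverted file.</p></body></html>"""

print(occurrence_vector(page, "content"))
print(occurrence_vector(page, "index"))
```

Because matching is whole-word, "indexing" in the description does not count as an occurrence of "index". A different heuristic (stemming, for example) would produce a different vector, which is exactly the kind of difference the slide describes.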


CMS matrix

Replace Occurrence with a Vector (2)

1 (review) 1
2 (content) 2, 4, 15
3 (indices) 3, 5, 16

Becomes

1 (review) 2 [1,1,0]
2 (content) 1 [1,1,0], 3 [1,1,4], 4 [1,1,2], 5 [0,1,1]
3 (indices) 1 [1,1,0], 3 [1,1,3], 4 [1,1,1], 5 [1,1,4]


What Does a Vector Do?

You can now calculate a “content score” that separates the pages that satisfy the search parameters from those that do not, and that inherently ranks the matching pages in relation to each other.

Examining the previous index:

1 (review) 2 [1,1,0]
2 (content) 1 [1,1,0], 3 [1,1,4], 4 [1,1,2], 5 [0,1,1]
3 (indices) 1 [1,1,0], 3 [1,1,3], 4 [1,1,1], 5 [1,1,4]

Suppose you search for “content indices”:


Calculate the score for “content indices”:

2 (content) 1 [1,1,0], 3 [1,1,4], 4 [1,1,2], 5 [0,1,1]
3 (indices) 1 [1,1,0], 3 [1,1,3], 4 [1,1,1], 5 [1,1,4]

For each page, a basic content score heuristic sums the components of each search term's vector and then multiplies those sums together.

Page 1 Content Score for S=(1+1+0) X (1+1+0) = 4

Page 3 Content Score for S=(1+1+4) X (1+1+3) = 30

Page 4 Content Score for S=(1+1+2) X (1+1+1) = 12

Page 5 Content Score for S=(0+1+1) X (1+1+4) = 12
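The calculation above can be sketched in Python. The index values are copied from the slide; the names `vector_index` and `content_scores` are illustrative, and dropping pages that lack any of the search terms is an assumption about how the heuristic handles partial matches.

```python
from math import prod

# term -> {page ID: [title, meta description, body] occurrence counts}
vector_index = {
    "review": {2: [1, 1, 0]},
    "content": {1: [1, 1, 0], 3: [1, 1, 4], 4: [1, 1, 2], 5: [0, 1, 1]},
    "indices": {1: [1, 1, 0], 3: [1, 1, 3], 4: [1, 1, 1], 5: [1, 1, 4]},
}


def content_scores(index, query):
    """Score each page as the product of the per-term vector sums.

    Assumption: a page must contain every query term to be scored.
    """
    postings = [index.get(term, {}) for term in query.lower().split()]
    if not postings:
        return {}
    # Keep only pages that appear under every query term.
    pages = set(postings[0]).intersection(*postings[1:])
    return {
        page: prod(sum(posting[page]) for posting in postings)
        for page in sorted(pages)
    }


scores = content_scores(vector_index, "content indices")
# Highest score first; ties keep page-ID order because sorted() is stable.
ranked = sorted(scores.items(), key=lambda item: -item[1])
print(ranked)  # [(3, 30), (4, 12), (5, 12), (1, 4)]
```

Page 2 never appears in the result because it contains neither search term. Pages 4 and 5 tie at 12, which shows why a content score alone may not be enough to order results.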


Caveats & Observations

• Obviously, different search terms will result in different content scores. Therefore, indices must find a balance between accuracy and economy (i.e. speed).

• The CS vector is an evolving construct. The color example illustrates this lesson.

• The relevant calculation heuristics are also evolving. This is the basis of the “arms race” between Search Engines (Google) and marketers (spammers).

• The dimensional limit to the vector is a fertile area for research, especially in the field of SEO.

• Content Scores are factored with other index measures to yield the final SERP for a given search term or string. In other words, this is still only a piece of the puzzle of Excellent Content Management.


Discussion: What are the appropriate vector dimensions, and what are the ramifications for content?

Title, meta, body. . . What else?

What direct measures, what derivatives?

Research Question: How should pictures be factored into a page’s content score?