understanding the content index. review: the search engine

Understanding the Content Index

Review: The Search Engine

Review: Content Indices

• Content indices store pages in compressed form, as inverted files.

• A basic content index has terms, identification numbers (IDs), and occurrences.

1 (review) 1

2 (content) 2, 4, 15

3 (indices) 3, 5, 16
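The structure of a basic content index can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular search engine builds its index. It assumes that "occurrences" are 1-based word positions and that term IDs are assigned in order of first appearance; the function name `build_content_index` and the sample text are hypothetical.

```python
def build_content_index(text):
    """Map each term to an ID and the word positions where it occurs.

    Assumptions (not from the slides): occurrences are 1-based word
    positions, and IDs are handed out in order of first appearance.
    """
    index = {}
    for position, word in enumerate(text.lower().split(), start=1):
        term = word.strip('.,;:!?"()')  # drop surrounding punctuation
        if not term:
            continue
        entry = index.setdefault(term, {"id": len(index) + 1, "occurrences": []})
        entry["occurrences"].append(position)
    return index


sample = "Review: content indices. Content indices store pages"
index = build_content_index(sample)
print(index["review"])   # {'id': 1, 'occurrences': [1]}
print(index["content"])  # {'id': 2, 'occurrences': [2, 4]}
print(index["indices"])  # {'id': 3, 'occurrences': [3, 5]}
```

The lookup goes from term to positions rather than from page to words, which is what makes the file "inverted".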

Review: Content Indices

• Content indices are the raw material needed to conduct a relevant search.

• A valuable search ranks the relevant material, ensuring you get the best of the available information.

• To get the best possible search results, you have to combine many page measurements.

Types of indexes

A single page can be processed to derive a number of indices:

• Content indices (Semantics) – Text, data, metadata

• Structural indices (Structures) – Tags, links, parent-child relationships

• File indices (Supports) – Availability & relationship to NMP files (pdf, mov, avi, etc.)

Re-focus on the content index (CI)

A simple inverted file allows you to claim that you have “indexed the web.”

In 1998 Alta Vista had the largest web index (repository of page indices). Google surpassed AV in 2000.

In mid 2001, AllTheWeb surpassed Google—for about 3 months.

Google took the lead in late 2001, and has been the leader in “pages indexed” ever since.

So what?

Number of pages indexed is a measure, not a business. You have to have the ability to process those indices in a meaningful way to create value.

Refine the CI

The simple inverted file index is insufficient; it doesn’t capture all the available information in a web page, web site, and internet domain. Consider the example below:

1 (review) 1

2 (content) 2, 4, 15

3 (indices) 3, 5, 16

What about the relationship of this page index to the other pages in the site? What about the structure of the page? Is there information that can refine our understanding of the quality of the content?

The CI in detail, continued

We can refine the quality of the page CI by replacing simple occurrences with vector space measurements.

What makes sense? How should we structure the vector measure?

First, look at page structure.

The CI in detail, continued

<html>
<head>
<title>Understanding the Content Index Scores</title>
<meta name="DESCRIPTION" content="Lecture on the optimal design of content for content indexing">
<meta name="KEYWORDS" content="Content index, CI, search engine, SEO, ICC600, Applied Research in Content Management">
<meta name="AUTHOR" content="Stephen Masiclat, Associate Professor, The Newhouse School, Syracuse University">
</head>
<body>[Text from each slide]</body>
</html>

Replace Occurrence with a Vector

The vector is developed using a heuristic: a set of rules. Different people might use different heuristics to obtain a valuable measure. Those differences are what differentiate one index's value from another's.

One heuristic might be to create a vector to measure occurrence in the TITLE, META DESCRIPTION, and BODY tags.
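As a rough sketch of this heuristic, the Python below uses the standard library's `html.parser` to count how often a single-word term appears in the TITLE, META DESCRIPTION, and BODY of a page, and returns the counts as a three-element vector. The class name `ZoneCounter`, the function `occurrence_vector`, the word-matching rule, and the sample page are all assumptions made for illustration; a real indexer would handle tokenizing and malformed markup far more carefully.

```python
import re
from html.parser import HTMLParser


class ZoneCounter(HTMLParser):
    """Collect the text found in the title, meta description, and body."""

    def __init__(self):
        super().__init__()
        self.zones = {"title": "", "description": "", "body": ""}
        self._open = []  # zone tags that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.zones["description"] += " " + (attrs.get("content") or "")
        elif tag in ("title", "body"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        if self._open:  # ignore text outside title/body
            self.zones[self._open[-1]] += " " + data


def occurrence_vector(html, term):
    """Return [title, description, body] counts for a single-word term."""
    parser = ZoneCounter()
    parser.feed(html)
    parser.close()
    term = term.lower()
    return [
        re.findall(r"[a-z0-9]+", parser.zones[zone].lower()).count(term)
        for zone in ("title", "description", "body")
    ]


page = """<html><head>
<title>Understanding the Content Index</title>
<meta name="DESCRIPTION" content="Lecture on content indexing">
</head><body>Content indices store content.</body></html>"""

print(occurrence_vector(page, "content"))  # [1, 1, 2]
print(occurrence_vector(page, "indices"))  # [0, 0, 1]
```

Adding a dimension (for example, headings or link text) would only require another zone in the parser, which is one way different heuristics can diverge.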

CMS matrix

Replace Occurrence with a Vector (2)

1 (review) 1

2 (content) 2, 4, 15

3 (indices) 3, 5, 16

Becomes

1 (review) 2 [1,1,0]

2 (content) 1 [1,1,0], 3 [1,1,4], 4 [1,1,2], 5 [0,1,1]

3 (indices) 1 [1,1,0], 3 [1,1,3], 4 [1,1,1], 5 [1,1,4]

What Does a Vector Do?

You can now calculate a “content score” that narrows the available pages to those that satisfy the search parameters and inherently ranks those pages in relation to each other.

Examining the previous index:

1 (review) 2 [1,1,0]

2 (content) 1 [1,1,0], 3 [1,1,4], 4 [1,1,2], 5 [0,1,1]

3 (indices) 1 [1,1,0], 3 [1,1,3], 4 [1,1,1], 5 [1,1,4]

Suppose you search for “content indices”:

Calculate the score for “content indices”:

2 (content) 1 [1,1,0], 3 [1,1,4], 4 [1,1,2], 5 [0,1,1]

3 (indices) 1 [1,1,0], 3 [1,1,3], 4 [1,1,1], 5 [1,1,4]

A basic content score heuristic sums the components of each term's vector for a page, then multiplies that sum by the summed vector of the partner term on the same page.

Page 1 Content Score for S = (1+1+0) × (1+1+0) = 4
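The same heuristic can be applied to every page in the example index. The Python sketch below stores the index as a dictionary of term to {page: vector}, multiplies the summed vectors of the search terms for each page, and sorts the result. Two choices here are assumptions rather than part of the lecture: pages that lack any search term are left out, and ties are broken by page number. The function name `content_scores` is hypothetical.

```python
# Example index from the slides: term -> {page: [title, description, body]}
index = {
    "review":  {2: [1, 1, 0]},
    "content": {1: [1, 1, 0], 3: [1, 1, 4], 4: [1, 1, 2], 5: [0, 1, 1]},
    "indices": {1: [1, 1, 0], 3: [1, 1, 3], 4: [1, 1, 1], 5: [1, 1, 4]},
}


def content_scores(index, query):
    """Multiply the summed vectors of every query term, page by page."""
    postings = [index.get(term, {}) for term in query.lower().split()]
    if not postings:
        return {}
    # Assumption: only pages that contain every search term are scored.
    pages = set(postings[0]).intersection(*postings[1:])
    scores = {}
    for page in pages:
        score = 1
        for posting in postings:
            score *= sum(posting[page])
        scores[page] = score
    return scores


scores = content_scores(index, "content indices")
# Highest score first; ties broken by page number (an assumption).
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
print(ranked)  # [(3, 30), (4, 12), (5, 12), (1, 4)]
```

Page 1 scores 4, matching the hand calculation, while page 3 ranks first with (1+1+4) × (1+1+3) = 30. Page 2 does not appear because it contains neither search term.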




Caveats & Observations

• Obviously, different search terms will result in different content scores. Therefore, indices must find a balance between accuracy and economy (i.e., speed).

• The CS vector is an evolving construct. The color example illustrates this lesson.

• The relevant calculation heuristics are also evolving. This is the basis of the “arms race” between Search Engines (Google) and marketers (spammers).

• The dimensional limit to the vector is a fertile area for research, especially in the field of SEO.

• Content Scores are factored with other index measures to yield the final SERP for a given search term or string. In other words, this is still only a piece of the puzzle of Excellent Content Management.

Discussion: What are the appropriate vector dimensions, and what are the ramifications for content?

Title, meta, body. . . What else?

What direct measures, what derivatives?

Research Question: How should pictures be factored into a page’s content score?
