
Search Accuracy Analytics

White Paper

Paul Nelson

Version 2.6.3

November 2014

©2014 Search Technologies


Table of Contents

1  Summary
   1.1  The Impact of Poor Accuracy
   1.2  The Solution to Poor Accuracy
   1.3  A Reliable, Step-By-Step Process
   1.4  A User-Focused Approach
   1.5  A Comprehensive Approach
2  Problem Description
3  Gathering and Auditing Log Files
   3.1  Search Logs
   3.2  Click Logs
   3.3  User Information
   3.4  Log Cleanup
   3.5  Auditing the Logs
4  Index and Log File Snapshots
5  The Search Engine Score
   5.1  User-Based Score Model
   5.2  What is Relevant to the User?
        5.2.1  Using Logs to Identify Relevant Documents
        5.2.2  Gradated Relevancy versus Binary Relevancy
        5.2.3  Documents are relevant to users, and not to queries
   5.3  Score Computation
        5.3.1  The Scoring Factor
   5.4  Interpreting the Score
   5.5  User and Query Scoring
   5.6  Engine Quality Predictions / Quality Regression Testing
6  Additional Metrics
   6.1  Query Metrics
   6.2  Relevancy and Clicks
   6.3  Result Metrics
   6.4  Document Metrics
7  Continuous Improvement
   7.1  The Continuous Improvement Cycle
   7.2  Continuous Improvement Requirements
   7.3  Recording Performance from Run to Run
8  Manual Relevancy Judgments
   8.1  The Relevancy Judgment User Interface
   8.2  Advantages
   8.3  Statistics from Manual Judgments
9  Using Logs to Analyze User Intent
   9.1  Query Analysis
        9.1.1  Top Queries
        9.1.2  Randomly Selected Queries
        9.1.3  Overall Internal Search Usage
        9.1.4  Randomly Selected Users
        9.1.5  Help Desk Analysis
   9.2  Tooling for Query Analysis
   9.3  User Interface Click Analysis
        9.3.1  Feature Usage
        9.3.2  Sequences
   9.4  Long Tail Analysis
        9.4.1  Query Database
        9.4.2  Supporting Databases
        9.4.3  Tooling Support
        9.4.4  Long Tail Analysis – Outputs
        9.4.5  More Information
10  A/B Testing
11  Conclusions


    Search Accuracy Analytics 

    Top quality search accuracy is not achieved with technology alone or through a one-time “quick fix”.

    It can only be achieved with a careful, continuous improvement process driven by a varied set of

metrics. And the starting point for such a process is an objective, statistically valid measurement of search accuracy.

    When search results are not satisfactory or relevant enough to users, search development teams

    often analyze the problem by looking at accuracy metrics from a query perspective. They ask

    questions like: “What queries worked? What queries are most frequently executed? What queries

returned zero results?” And so on. In contrast, this paper presents a broader, user-focused approach

to search relevancy. We ask the questions: “Is the user satisfied?” and “Are the results worthy of

further user action?”

    Search Technologies recommends using this white paper as a reference guide. The techniques in this

paper have been used successfully by Search Technologies with a number of customers for first computing search engine accuracy metrics and then using those metrics to iteratively improve search

    engine relevancy using a reliable, measurable process.


1  Summary

    The number one complaint about search engines is that they are “not accurate”.

    Customers complain to us that their engine brings back irrelevant, old, or even

    bizarre off-the-wall documents in response to their queries.

    This problem is often compounded by the secretive nature of search engines and search engine

    companies. Relevancy ranking algorithms are often veiled in secrecy (described variously as

    ‘intellectual property’ or more quaintly as the ‘secret sauce’). And even when algorithms are open to

    the public (for Open Source search engines, for example), the algorithms are often so complex and

    convoluted that they defy simple understanding or analysis.

    1.1  The Impact of Poor Accuracy

What makes this situation all the more frustrating for customers is the impact of poor accuracy on the

    bottom line.

      For Corporate Wide Search  – Wasted employee time. Missed opportunities for new business.

    Re-work, mistakes, and “re-inventing the wheel” due to lack of knowledge sharing. Wasted

    investment in corporate search when minimum user requirements for accuracy are not met.

      For e-commerce Search  – Lower conversion rates. Higher abandonment rates. Missed

    opportunities. Lower sales revenue. Loss of mobile revenue.

      For Publishing  – Unsatisfied customers. Less value to the customer. Lower renewal rates.

    Fewer subscriptions. Loss of subscription revenue.

      For Government   – Unmet mission objectives. Less public engagement. Lower search and

    download activity. More difficult to justify one’s mission. Incomplete intelligence analysis.

    Missed threats from foreign agents. Missed opportunities for mission advancement.

      For Recruiting  – Lower fill rate. Lower margins or spread. Unhappy candidates assigned to

    inappropriate jobs. Loss of candidate and hiring manager goodwill. Loss of revenue.

1.2  The Solution to Poor Accuracy

    When organizations encounter poor relevancy from their search engine, they usually have one of

    two reactions:

A.  Buy a new search engine.

B.  Give up.

    Both of these approaches are unproductive, expensive and wasteful.


    The difficult truth is that all search engines require tuning, and all content requires processing. The

    more tuning and the more processing you do, the better your search results will be.

    Search engines are designed and implemented by software programmers to handle standard use

    cases such as news stories and web pages. They are not delivered out-of-the-box to handle the wide

variety and complexity of content and use cases that are found around the world. They need to be

    tuned, and part of that tuning is to process the content so that it is easily interpretable.

And so there is no “easy fix”, no “silver bullet”, and no substitute for a little bit of elbow grease (aka

    hard work) when it comes to creating a satisfying search result.

1.3  A Reliable, Step-By-Step Process

    This paper gives a step-by-step process to solve the problem of poor search engine relevancy. The

    major steps are:

1.  Gather, audit, and process log files.
    a.  Query logs, click logs, and other indicative user activity.

2.  Create a snapshot of the log files and search engine index for testing.

3.  Compute engine score.

4.  Compute additional search metrics.

5.  Implement and score accuracy improvements (the continuous improvement process).

6.  Perform manual relevancy judgments (when practical).

7.  Use logs to analyze user intent.

8.  Perform A/B testing to validate improvements and calculate Return on Investment (ROI).

    These steps have been successfully implemented by Search Technologies and are known to provide

reliable, methodical, measurable accuracy improvements.

1.4  A User-Focused Approach

    When discussing what documents are relevant to what queries, it is common for different parts of

    the organization to have different goals and opinions as to “what is a good document”. For example: 

  (Functional) Do the results contain the words the user entered?

  (Visual perception) Is it easy to see that the results contain good documents?

  (Knowledge) Does it answer the deeper question implied by the query?

  (Marketing) Do the results enhance the perception of the corporate brand?

  (Sales) Do the results lead to more sales?

  (Editorial) Do the results highlight editor selections?


    With many competing goals for search, it is easy to get lost when trying to figure out what is

    important and what should be fixed (and why).

    Most search accuracy metrics are from a query perspective. They ask questions like: “What queries

    worked? What are the most frequently executed queries? What queries returned zero results? How

    do I improve my queries?” etc.

In contrast, this paper presents a user-focused approach to search accuracy:

1.  All accuracy evaluation is from the user’s point of view.

2.  All that matters is: does the user find results of interest?

In this paper, we are interested in the central question of “Is the user satisfied?” We attempt to

    answer this question by analyzing user activity to see if the search engine is providing results which

    the user has found worthy of further activity.

    The user-centered approach is a powerful approach with some subtle consequences. For example, if

    a query is executed by 10 different users, then that query will be analyzed 10 times – once from each

    user’s point of view (analyzing the activity stream for each user individually).

Further, this approach is more accurate because it provides scores which are automatically normalized

    to the user population. A very small number of highly interactive users will not adversely skew the

    score, unlike with query-based approaches.

    And finally, user-based scoring provides more detailed information for use in system analysis. It

    scores every user and every user’s query (as well as the system as a whole). It identifies the least

    effective queries and the most effective queries, and traces every query back to the user session so

    the query can be viewed in context.

    What about the other perspectives? Aren’t they important too? 

    Of course they are. But first you need to understand how your end-users view your search results.

    Once you have a solid understanding of this, you can then include other perspectives (such as brand

    awareness, editorial selections, etc.) into the equation.

    E-commerce and Increasing Sales

    Finally, the user-based approach is the best approach for increased e-commerce sales. What matters

    most to e-commerce is how many customers (i.e. users) purchased products from the query results.

    With e-commerce, you are less concerned about the query, and more concerned about the

    customer. What is most important is: did the query return something that the customer purchased?

    How many customers that executed the query ended up purchasing something? What queries lead

    to sales? What queries never lead to sales?


    The user-based approach can answer all of these questions. The user-based approach can aggregate

    all  customer activity and leverage that user activity to determine the success (or failure) of every

    query for every user, from each customer’s unique point of view. 

Search Technologies has shown that this approach can dramatically improve conversion rates for

e-commerce sites. In one implementation, we produced a 7.5% increase in conversion rate for

products purchased based on search results for one site, and a 3% increase overall – validated by A/B

    testing.

1.5  A Comprehensive Approach

    Finally, this paper represents a comprehensive approach. The approach includes:

  Log file analysis

  Automatic engine scoring

  Continuous improvement

  Manual scoring

  User intent analysis

  A/B testing

    Note that not all techniques will be appropriate for all situations. Some techniques are appropriate

    for production systems with sufficient log information to use for analysis while others are better for

    evaluating brand-new systems. Some techniques are for on-line analysis and others are for off-line

    analysis.

    Search Technologies recommends using this white paper as a reference guide. We have architects

    and data scientists who can help you determine which methods and processes are best suited to

    your situation. Once you have decided on a plan, we have lead engineers, senior developers, and

    experienced project managers who can ensure that the plan is delivered efficiently, on-time, and

    with the best possible outcome.


2  Problem Description

    What often happens with fuzzy algorithms like search engine relevancy is that some

    improvements make relevancy better (more accurate) for some queries and worse

    (less accurate) for other queries.

    Only by evaluating a statistically valid sample set can one know if the algorithm is better in the

    aggregate.

      For each new release of the search engine, the accuracy of the overall system must be

    measured to see if it has improved (or worsened).

    o  Simple bugs can easily cause dramatic degradation in the overall quality of the

    system.

o  Therefore, without such measurement, it is much too easy for a new bug to slip into production unnoticed.

    o  The problem is exacerbated by the size of the data sets.

  Any algorithm which operates over hundreds of thousands of documents and queries must have a very large test suite for continuous statistical evaluation and improvement.

      The relative benefit of each parameter of the search relevancy algorithm must be measured

    individually. For example:

o  How much does field weighting help?

o  How much will other query adjustments (weighting by document type, exact phrase weighting, etc.) help the relevancy?

o  How much do synonyms help?

o  How much will link counting and popularity counting help?

      A data-directed method for determining what types of queries are performing poorly, and

    what fixes are most likely to improve accuracy needs to be implemented.

o  Without such a method, this information on what queries to fix comes from anecdotal end-user

    requests and evaluations.

    Fortunately, standard, best-practices procedures do exist for continuous improvement of search

    engine algorithms. Further, these procedures have been demonstrated in real-world situations to

produce high-quality, statistically verifiable results. User-centered approaches provide more reliable

    statistics which better correlate to end-user satisfaction.


3  Gathering and Auditing Log Files

Clean, complete log files gathered from a production system are critical for

    metrics analysis. There are two types of log files required for accuracy

    measurements: search logs and click logs.

    3.1  Search Logs

Search logs contain an event for every search executed by every user. Every event should contain the following (a minimal example event is sketched after the notes below):

1.  The date and time (down to the second) when the search was executed.

2.  A “search ID” which uniquely identifies the search.
    a.  This is optional, but highly desired to connect searches with click log events.
    b.  If the user executes the same search twice in a row, each search should have a unique ID.

3.  The user ID of the user who executed the search.

4.  The text exactly as entered by the user.

5.  The query as tokenized and cleaned by the query processor.

6.  Search parameters as submitted to the search engine.

7.  Whether the query was selected from an auto-suggest menu.

8.  The number of documents found by the search engine in response to the search.

9.  The starting result number requested.
    a.  A number > 0 indicates the user was asking for page 2-n of the results.

10. The number of rows requested (e.g. the page size of the results).

11. The URL from where the search was executed (like the Web log “referer” field).

12. A code for the type of search.
    a.  Should indicate if the user clicked on a facet value, an advanced search, or a simple search.

13. A list of filters (advanced search or facets) turned on for the search.

14. The time it took for the search to execute (milliseconds).

Notes:

  Some of these parameters may be parsed from the URL used to submit the search.

  Test searches should be removed (such as searches on the word “test” or “john smith”).

  Searches from test accounts should be removed (see auditing logs, section 3.5 below).


  “Automatic searches” executed in response to browsing or navigation should be removed.

o  Searches executed automatically “behind the scenes” when the user clicks on a link should be removed.

o  Searches triggered by external links directly to the search engine should also be removed.

  Other sorts of administrative searches (for example, to download a list of all documents for index auditing) should be removed.
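A minimal example of a single search-log event, written here as a Python dict, may help make the list above concrete. The field names are illustrative placeholders rather than the schema of any particular search product; any format carrying the same information (JSON lines, CSV, database rows) works equally well.

    search_event = {
        "timestamp": "2014-11-03T14:22:08Z",    # date and time, to the second
        "search_id": "s-000123",                # unique per search execution
        "user_id": "u-4711",                    # who executed the search
        "raw_query": "long sleeved dress shirt",
        "processed_query": "long sleeve dress shirt",
        "search_parameters": {"rows": 10, "start": 0},
        "from_autosuggest": False,
        "num_found": 184,                       # documents found by the engine
        "start_row": 0,                         # > 0 means page 2..n was requested
        "rows_requested": 10,
        "referer": "https://example.com/products",
        "search_type": "simple",                # simple | advanced | facet
        "filters": ["department:menswear"],
        "elapsed_ms": 42,
    }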

3.2  Click Logs

    In addition to searches, accuracy testing needs to capture exactly what the user clicked in response

    to a search.

    Generally, click logs should contain all clicks by all users within the search user interface. This

    includes clicks on documents, tabs, advanced search, help, more information, etc.

But in addition, clicks on search results should contain the following information (a minimal example event is sketched at the end of this section):

1.  The time stamp (date and time to the second) when the search result was clicked.

2.  The document ID (from the index) or URL of the search result selected.

3.  The unique search ID (see previous section, 3.1) of the search which provided the search result.

4.  The ID of the user who clicked on the result.

5.  The position within the search results of the document, where 0 = the first document returned by the search engine.

In order to do this, search results must be wrapped (they cannot be bare URLs), so that when clicked, the event is captured in the search engine logs.
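Again as an illustrative sketch, a single click-log event might look like the following Python dict; the search_id field is what ties the click back to the search event that produced the result.

    click_event = {
        "timestamp": "2014-11-03T14:22:31Z",
        "document_id": "doc-98765",     # or the URL of the clicked result
        "search_id": "s-000123",        # matches the originating search event
        "user_id": "u-4711",
        "position": 2,                  # 0 = first result returned
    }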

3.3  User Information

    [Optional] User information can help determine when subsets of the user population are being

    poorly served by the search engine.

    User information should be available for search engine statistics and cross referenced by user ID.

    Useful user information to gather includes:

  Start date of the user within the organization

  Business unit(s) or product line(s) to which the user belongs or is subscribed

  Manager to whom the user reports (for employees)

  Employer of the user (for public users), if available


      Geographic location of the user

      Job title of the user, if available

Further, e-commerce and informational sites can identify users (especially when they have logged in), along with their population groups and interaction history, using cookies or back-end server data.

    Gathering additional information about users is optional, and can be deferred to a future

    implementation when the basics of search engine accuracy have been improved.

    3.4  Log Cleanup

    Just gathering logs will not be enough. All logs will contain useless information that is not related to

    actual end-user usage of the system. This useless information will need to be cleaned up before the

    logs are useful.

This includes the following (a minimal cleanup filter is sketched at the end of this section):

      Queries by monitoring programs (‘test probes’) 

      Queries by the search team

      Common queries used in demonstrations (‘water’, ‘john smith’, ‘big data’) 

      Log file entries which are not queries or user clicks at all (HEAD requests, status requests,

    etc.)

      Multiple server accesses for a single end-user entered query

    o  Where a click on the [SEARCH] button actually fires off multiple queries behind the

    scenes

      Clicks on page 2, 3, 4 for a query

    o  These should be categorized as such and set aside

    o  In other words, simply clicking on page 2 for a query should not increase the total

    count for that query or the words it contains.

      Clicks on facet filters for a query

    o  These should be categorized and set aside

      Canned queries automatically executed (typically through the APIs) to provide summary

    information for special use cases

    o  In other words, queries where the query expression is fixed and executed behind the

    scenes

      On public sites, there may be spam such as random queries, fake comment text containing

    URLs, attempts to influence the search results or suggestions, or even targeted probes.


    o  These should be detected and removed where possible.

    Log cleanup is a substantial task by itself and requires some investigative work on how the internal

    search system works. It must be treated carefully.
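The sketch below shows one possible shape for such a cleanup filter in Python, assuming the search events have already been parsed into dicts like the example in section 3.1. The test accounts, demonstration queries, and field names are placeholders; every deployment will have its own lists and log schema.

    TEST_ACCOUNTS = {"u-testbot", "u-searchteam"}            # monitoring probes, search team
    DEMO_QUERIES = {"test", "john smith", "water", "big data"}

    def keep_event(event):
        """Return True if a search event represents genuine end-user activity."""
        if event["user_id"] in TEST_ACCOUNTS:
            return False                    # queries by probes or the search team
        if event["raw_query"].strip().lower() in DEMO_QUERIES:
            return False                    # common demonstration queries
        return True

    def categorize(event):
        """Paging and facet clicks are set aside rather than counted as new queries."""
        if event["start_row"] > 0:
            return "paging"
        if event["search_type"] == "facet":
            return "facet"
        return "query"

    # Usage: cleaned = [e for e in raw_search_events if keep_event(e)]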

3.5  Auditing the Logs

    Accuracy metrics will only be as good as the accuracy of the log files gathered. Therefore, log auditing

is required. This is accomplished by creating a test account, logging in, and interacting with the

    system. Each interaction should be manually logged.

    Once complete, the log files should be gathered and the manual accounting of the actions should be

    compared with the automatically generated logs. Discrepancies found should be investigated and

    corrected.

    Log auditing can and should be automated and executed periodically. Manual log auditing will still be

    required whenever a new version of the system is released.

    Log auditing should be performed both before and after log cleanup – to ensure that log cleanup

    does not remove important log information.


4  Index and Log File Snapshots

    Most systems are “living” systems with constant updates, new documents, and user

    activity. But for systems under evaluation, especially those which are being tuned

and refined, constant updates mean changing and non-comparable metrics from run to run.

Therefore, Search Technologies recommends creating a “snapshot” of data and logs at a certain

point in time. All accuracy measurements will be performed on this snapshot.

    This further means:

      The snapshot must include a complete data set and log database.

      Any time-based query parameters will need to be fixed to the time of the snapshot.

o  For example, relevancy boosting based on document age or “freshness”

  Naturally, the snapshot will need to be periodically refreshed with the latest production data.

However, when this occurs:

    o  The same engine and accuracy metrics must be computed on both the old and new

    snapshots.

    o  This way, score changes based merely on differences in the document database and

    recent user activity can be determined and factored into future analysis.


5  The Search Engine Score

    This section describes the metrics and reports which should be computed to

    determine search engine accuracy. Multiple reports are required to look at

    accuracy from a variety of dimensions.

    The most important report is the search engine score. This is an overall judgment of how well the

    search engine is meeting the user need.

5.1  User-Based Score Model

The Search Engine Score in this section is user-based. This means that the model takes an end-user

    perspective (rather than a search engine perspective) on the problem. Results are gathered by user

    (or session) and then analyzed within the user/session for accuracy. These user-based statistics are

    then aggregated to create the score for the engine as a whole.

    This user-based model provides a more accurate description as to how well users are being served by

    the search engine. In particular:

      What percentage of users are satisfied?

      What is the average user score?

      Who are the most and least satisfied users?

    Similarly, when queries are scored, these scores use data derived from the user who executed the

    query. If a query is executed by multiple users, then that query is judged multiple times – from each

    user’s individual perspective. 

    This provides useful, user-based information on queries:

      What are the queries which satisfy the highest percentage of users?

  What are the queries which satisfy the smallest percentage of users?

      What is the overall query score?

      What are the most and least satisfying queries to the users who execute them?

5.2  What is Relevant to the User?

    The search engine score requires that we understand what documents are relevant (of interest) to a

    particular user or session. The ultimate goal of the search engine is to bring back documents which

    the user finds to be interesting as quickly and as prominently as possible.


5.2.1  Using Logs to Identify Relevant Documents

    There are many different ways to identify documents which are “relevant” or “of interest” to a user: 

  Did the user view the document?

  Did the user download or share the document?

  Did the user view the product details page?

  Did the user “add to cart” the product?

  Did the user purchase the product?

  Did the user hover over the product or document to view details or a preview?

    All of these signals indicate that the user found the document (or product) sufficiently worthy of

    some additional investigative action on their part. It is these documents that we consider to be

    “relevant” to the user. 

5.2.2  Gradated Relevancy versus Binary Relevancy

    A question at this point is whether there should be a gradation of relevancy. For example, should

    “add to cart” be ‘more relevant’ than “product details view”? Should “product purchase” be ‘more

    relevant’ than “add to cart”? Should documents which are viewed longer be ‘more relevant’ than

    documents viewed for a short amount of time?

The answer to all of these questions is ‘no’.

The problem with gradations of relevancy is scale. Is a product purchase 2x more relevant than an add to cart?

Or only 1.5x? Attempting to answer these sorts of questions will skew the results in hard-to-predict ways.

And so, Search Technologies recommends simply choosing some threshold of user activity above which documents

are considered relevant (value = 1.0); all other documents are, therefore, non-relevant (value = 0.0). This

    avoids all complexity and uncertainty with choosing relevancy levels, star-systems, etc. which

    confuse and skew the statistics.

5.2.3  Documents are relevant to users, and not to queries

Note that, in the above discussion, documents are considered to be relevant to the user and not to

the query. This gets to the core of the user-based engine score: we are creating a model for the

user, consisting of the queries entered by the user and the documents which the user has indicated are interesting (in

some way).

    It does not matter, to the engine score model, which query originally retrieved the document. The

    document is relevant for all  queries entered by the user.


    It is the assumption of this model that users will execute multiple queries to find the document or

    product which they ultimately want. They may start with “shirt”, then move to “long sleeved shirt”,

    then to “long sleeved dress shirt” and so on. Or their first query may be misspelled. Or it may be

    ambiguous. Another way of saying this is that a relevant document found by query #3 is also relevant

    if it was returned by query #1.

    In all of these examples, it is desirable for the search engine to “short circuit” the process and bring

    back relevant documents earlier in the query sequence. If this can be achieved, then the user gets to

    their document or product much faster and is therefore more satisfied.

5.3  Score Computation

Computing the user scores requires a statistical analysis of the search logs, click logs, and search
engine responses gathered as described in the previous sections.

The algorithm is as follows (a code sketch appears after the notes below):

1.  Gather the list of all unique queries across all users → ALL_QUERIES.
    a.  Send every query to the search engine and gather the results → RESULTS_SET[q].

2.  Loop through all users, u = 0 to (number of users - 1):
    a.  Accumulate a (de-duplicated) list of all documents clicked by the user → RELEVANT_SET.
    b.  Loop through all queries Q[i] executed by the user, i = 0 to (number of queries - 1):
        i.   Look up the query and results in the search engine response RESULTS_SET[Q[i]].
        ii.  Set queryScore[i] = 0.
        iii. Loop through each document in RESULTS_SET[Q[i]], k = 0 to (number of results - 1):
             (1)  If the document is in RELEVANT_SET:
                  (a)  queryScore[i] = queryScore[i] + power(FACTOR, k)
                       (see below for a discussion of FACTOR)
    c.  userScore[u] = average of all queryScore[i] values for the user.

3.  Compute the total engine score = average of all userScore[u] values for all users.

Notes:

  A random sampling of users can be used to generate the results, if computing the score across all users requires too much time or resource.

  All search team members should be removed from engine score computations, to ensure an unbiased score.
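The following is a minimal sketch of this computation in Python, assuming the logs have already been cleaned (section 3.4) and loaded into simple lists of dicts. The field names (user_id, query, document_id) and the run_query helper are illustrative placeholders rather than any particular product’s API.

    from collections import defaultdict

    FACTOR = 0.9           # scoring factor, between 0.0 and 1.0 (see section 5.3.1)
    MAX_RESULTS = 50       # how deep into the result list to score

    def engine_score(search_events, click_events, run_query):
        """Compute the user-based engine score from cleaned logs.

        search_events: list of dicts with 'user_id' and 'query'
        click_events:  list of dicts with 'user_id' and 'document_id'
        run_query:     function(query) -> ordered list of document IDs
        """
        # 1. Gather all unique queries and fetch each result list once.
        all_queries = {e["query"] for e in search_events}
        results_set = {q: run_query(q)[:MAX_RESULTS] for q in all_queries}

        # 2. Build each user's relevant set (de-duplicated clicked documents)
        #    and the list of queries that user executed.
        relevant_set = defaultdict(set)
        for c in click_events:
            relevant_set[c["user_id"]].add(c["document_id"])
        user_queries = defaultdict(list)
        for e in search_events:
            user_queries[e["user_id"]].append(e["query"])

        # 3. Score every query from the executing user's point of view,
        #    average per user, then average across users.
        user_scores = []
        for user, queries in user_queries.items():
            query_scores = []
            for q in queries:
                score = sum(FACTOR ** k
                            for k, doc in enumerate(results_set[q])
                            if doc in relevant_set[user])
                query_scores.append(score)
            user_scores.append(sum(query_scores) / len(query_scores))
        return sum(user_scores) / len(user_scores)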


5.3.1  The Scoring Factor

    The FACTOR is chosen based on the typical user behavior or how much sensitivity the score should

    have to matching documents lower in the search results. Factors must be between 0.0 and 1.0.

A high factor (for example, 0.99) is good for identifying relatively small changes to the score based on small improvements to the search engine.

    A low factor (for example, 0.80) is good for producing a score which is a better measure of user

    performance, for systems that produce (say) 10 documents per page.

5.4  Interpreting the Score

    The resulting score will be a number which has the following characteristics:

      Score = 1.0

    o  The first document of every query was relevant

      Score < 1.0

    o  Relevant documents were found in positions lower in the results list

      Score > 1.0

o  Multiple relevant results were returned for the query

    Depending on the data, scores can be as low as 0.05 (this typically means that there are other

    external factors affecting the score such as filters not being applied, etc.). A score of 0.25 is generally

    thought to be “very good”. 

    Note that the score does have an upper limit, which can be computed based on the factor:

[Figure: maximum achievable score versus the number of results analyzed, plotted for factor values K = 0.5, 0.667, 0.75, and 0.9]


    Maximum Score (all results relevant) = 1/ (1-FACTOR)

    And so if the factor is “0.8” the upper limit of the score (all results are relevant) is 5.0. This assumes

    analyzing a sufficiently large number of results. When only analyzing the top 10 results (for example),

    the maximum score for FACTOR=0.8 will be 4.463.
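These limits are just the closed form of the geometric series implied by the scoring algorithm. In LaTeX notation:

    \mathrm{MaxScore}(N) = \sum_{k=0}^{N-1} \mathrm{FACTOR}^{k}
                         = \frac{1 - \mathrm{FACTOR}^{N}}{1 - \mathrm{FACTOR}}
                         \;\longrightarrow\; \frac{1}{1 - \mathrm{FACTOR}} \quad (N \to \infty)

For FACTOR = 0.8 and N = 10 this gives (1 - 0.8^10) / 0.2 ≈ 4.463, and the limit as N grows is 1 / 0.2 = 5.0, matching the figures above.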

    5.5  User and Query Scoring

The scoring algorithm from section 5.3 can be accumulated on a per-user or per-query basis. This

    provides a useful comparison across users and queries to compute the following metrics:

1.  Top 100 lowest scoring (i.e. least satisfied) users with more than 5 queries.

2.  Top 100 highest scoring (i.e. most satisfied) users with more than 5 queries.

3.  Top 100 lowest scoring queries executed by more than 5 users.

4.  Top 100 highest scoring queries executed by more than 5 users.

5.  Most underperforming of the top 10% most frequently executed queries.

6.  Engine score per unit, for all users from each unit.

7.  Engine score per location, for all users from each location.

8.  Engine score per job title, for all users with the same job title.

When evaluating search engine accuracy, this information is critical for determining what is working, what needs work, and how the search engine is serving various sub-sets of the user community.

5.6  Engine Quality Predictions / Quality Regression Testing

    Finally, the algorithm above provides a way to perform off-line testing of new search improvements

    before those improvements are put into production.

To make this happen, the RESULTS_SET (step 1.a of the algorithm from section 5.3) is recomputed

    using the new search engine. With a new RESULTS_SET, the new score can be recomputed for the

    new engine and compared to the baseline score. If the overall score improves, then the search

    engine accuracy will be better when the engine is moved to production.

    This process is also recommended for regression testing of all new search engines before they are

    fielded to production to ensure that bugs introduced into the system do not adversely affect search

    engine accuracy.


6  Additional Metrics

    The engine score is the most complex and the most helpful score to compute to

    determine engine accuracy. However, additional metrics provide useful insights

    to help determine where work needs to be applied.

    6.1  Query Metrics

      Most frequent queries  – Gives an idea of “hot topics” in the community and what’s most on

    the mind of the community.

      Most frequent typed queries – Information needs that are not satisfied by the queries on the

    auto-suggest menu

      Most frequent query terms  – Identifies individual words which may be worth improving with

    synonyms.

      100 random queries, categorized   – When manually grouped into use cases, gives an excellent

    idea of the scope of the overall user search experience. 

      Most frequent spelling corrections – Verifying the spellchecker algorithm and identifying

    possible conflicts 

    6.2  Relevancy and Clicks

      Percent of queries which result in click on a search result   – Should increase as the search

    engine is improved. 

  Histogram of clicks on search results, by position – Shows how often users click on the first,

second, third document, etc. An abnormally large drop off between positions may indicate a

user interface problem. (A minimal sketch of this computation appears after this list.)

      Histogram of number of clicks per query   – Ideally, the search engine should return many

    good results, so higher number of clicks per query is better. 

      Bounce Rate  – Rate at which users execute one query and then leave the search site. High

    bounce rates indicate very unsatisfied users. 
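A minimal sketch of the click-position histogram, assuming click events shaped like the example in section 3.2 (the position field is 0-based):

    from collections import Counter

    def click_position_histogram(click_events, max_position=20):
        """Count clicks on search results by result position (0 = first result)."""
        histogram = Counter(min(c["position"], max_position) for c in click_events)
        # Return counts for positions 0..max_position, in order.
        return [histogram.get(p, 0) for p in range(max_position + 1)]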

    6.3  Result Metrics

  Total number and percentage of queries with zero results – Large numbers here generally

indicate that more attention should be spent on spell checking, synonyms, or phonetic

searching for people names.

      Most frequent queries with zero results  – Identifies those words which may require spell

    checking or synonyms


      Histogram of results per query  – Gives an idea of how generic or specific user queries are. If

    the median is large then providing alternative queries (i.e. query suggestions) may be

    appropriate.

6.4  Document Metrics

  Top 1,000 least successful documents – Documents most frequently returned by the search

engine which are never clicked.

      Top 100 hidden documents  – Documents never returned by search.

    o  Should be further categorized by time (i.e. Top hidden documents introduced this

    week, this month, this quarter, this half-year, this year, etc.)

o  Indicates documents which have language mismatch problems with the queries.

    Consider adding additional search terms to these documents.

  Documents with best and worst language coverage – These are the documents with the most

(or fewest) words in common with user queries (see the sketch after this list).

    o  This involves processing all documents against a dictionary made up of all query

    words. The goal is to identify documents which do not connect with users because

    there is little language in common.
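A minimal sketch of this coverage computation, assuming documents and query words are already available in memory; the tokenization here is deliberately naive:

    def document_language_coverage(documents, query_words):
        """Score each document by how many distinct query words it contains.

        documents:   dict mapping document ID -> full text of the document
        query_words: set of all distinct words appearing in user queries
        Returns (document_id, coverage_count) pairs sorted best-first.
        """
        coverage = []
        for doc_id, text in documents.items():
            doc_words = set(text.lower().split())
            coverage.append((doc_id, len(doc_words & query_words)))
        return sorted(coverage, key=lambda pair: pair[1], reverse=True)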


7  Continuous Improvement

    A key goal for accuracy measurements is to produce metrics which can participate in a continuous

    improvement process.

    7.1  The Continuous Improvement Cycle

    The continuous improvement cycle is shown below:

    The steps in this process are:

1.  Make changes to the search engine to improve accuracy.

2.  Gather the search engine response for all unique queries specified in the log files (or log file sample set).

3.  Compute user scores as described above in section 5.3.

4.  Produce analysis reports including all metrics described above in section 6.

5.  Evaluate results to determine if the search engine did get better.

If the search engine did not get better, then the most recent changes will be reverted.
Otherwise, a manual investigation of the analysis reports and individual use cases (perhaps
from manual scoring, see section 8 below) will determine what further changes are
required.


7.2  Continuous Improvement Requirements

    Implementing a continuous improvement process for on-going accuracy improvements has the

    following requirements:

  A stable snapshot of document data and log files

o  See section 4 above.

  A QA system with all data indexed with the “search engine under test”

o  The production system cannot typically be used because:

  1) It does not represent a stable snapshot, and

  2) Accuracy testing will require executing 100’s of thousands of queries – which is a load that is typically too large for production systems.

  Automated tools to produce engine scores and metrics, as described above in sections 5 and 6.

Note that the scoring algorithms specified above in section 5.3 can be run off-line on the QA server. This is a requirement for a continuous improvement cycle, since only off-line analysis will be

sufficiently agile to allow for running the dozens of iterative tests necessary to optimize relevancy.

7.3  Recording Performance from Run to Run

    It is expected that a continuous improvement cycle will ultimately be implemented. This cycle should

    be executed multiple times as the system is tuned up and improved. As this happens, it is important

    to record the improvement from release to release.

    The following is an example of such a recording from a prior customer engagement:

Release    Whole score    % queries with a match    % queries with zero results
rev3       0.226567992    28.84%                    33.62%
rev83      0.237346148    29.97%                    31.71%
rev103     0.241202611    30.95%                    28.01%
latest     0.251230958    32.14%                    25.83%

    As you can see, the “whole score” steadily improved while the percentage of queries with at least

    one relevant document increased and the percentage of queries which returned zero results steadily

    decreased.


    It is this sort of analysis which engenders confidence that the process is working and is steadily

    improving the accuracy of the search engine, step by step.

    The customer for whom we ran this analysis next performed an A/B test on production with

    live users to verify that the search engine performance would, in fact, lead to improved user

    behavior.

    This was an e-commerce site, and the results of the A/B test was a 3% improvement in

    conversion rate, which equated to a (roughly) $4 million improvement in total sales that year.


8  Manual Relevancy Judgments

    In addition to a fully automated statistical analysis, Search Technologies

    recommends implementing a manual judgment process.

    8.1  The Relevancy Judgment User Interface

    To improve productivity of the judgers, Search Technologies recommends a user interface to

    manually judge relevant documents for queries. This is a simple user interface backed by simple files

    of relevancy judgments.

To prepare for the judgment process:

  Select 200 random queries from the cleansed query logs.

  Annotate queries as needed to clarify the intent of the query for whoever is performing the relevancy judgments.

  Execute each query on the search engine and save the results.

    The relevancy judgments user interface will do the following:

      Shows the search results

    o  Should use the same presentation as the production user interface

      Provides buttons to judge “relevant”, “non-relevant”, “relevant but old”, and “unknown” for

    every query

    The same 200 queries are used from run to run, and so the database of what documents are relevant

    for what queries can be maintained and grown over time.

To increase judger consistency, Search Technologies recommends creating a “relevancy judger’s

    handbook” – a document which identifies how to determine if a document is relevant or not to a

    particular query. The handbook will help judgers decide the relevancy between documents which are

    principally about the subject, versus documents which contain only an aside-reference to the subject

and similar judgment decisions. Search Technologies has written such handbooks in the past and can

help with writing the handbook appropriate to your data and user population.

    8.2  Advantages

    Manual relevancy judgments have a number of advantages over strictly log-based metrics:

  It forces the QA team to analyze queries one-by-one

o  This helps the team recognize patterns across queries in terms of why good documents are missed and bad documents are retrieved.


      It provides a very clean view of relevancy

    o  The manual judgments are uncluttered by external factors which affect log data

    (machine crashes, network failure, received a phone call while searching, user

    entered the wrong site, etc.)

  Over time, it helps identify recall as well as precision

    The last point is perhaps the most important. Log based relevancy can only identify as relevant

    documents which are shown to users by the search engine. Naturally this is the most important

    aspect of relevancy (“Are the documents that I see relevant?” ).

    But it does ignore the second aspect of relevancy, “Did the search engine retrieve all possible relevant

    documents for me?”  This second factor can only be approached with a relevancy database which is

    expanded over time.

8.3  Statistics from Manual Judgments

The following statistics can be computed from manual relevancy judgments (a minimal computation sketch follows the list):

      Percent relevant documents in the top 10  – This is perhaps the most useful score, since it

    provides an easy-to-understand number on what to expect in the search results.

      Percent queries with at least one relevant document retrieved in the top 10

      Percent total relevant retrieved in the top 100  – This is “recall at 100”, which identifies how

    well this configuration of the search can respond to deeper research requirements

      Percent queries with no relevant documents retrieved
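A minimal sketch of these computations in Python, assuming the judgments are stored as a mapping from query to the set of document IDs judged relevant, and the saved results as a mapping from query to the ordered list of returned document IDs (both structures are illustrative):

    def judgment_stats(judgments, results_set):
        """Summary statistics from manual relevancy judgments."""
        queries = list(judgments)
        top10_precision = []      # per-query: fraction of top-10 results judged relevant
        with_hit_in_top10 = 0     # queries with >= 1 relevant result in the top 10
        recall_at_100 = []        # per-query: fraction of known relevant docs in the top 100
        no_relevant = 0           # queries with no relevant documents retrieved at all

        for q in queries:
            relevant = judgments[q]
            results = results_set.get(q, [])
            hits10 = sum(1 for d in results[:10] if d in relevant)
            top10_precision.append(hits10 / 10)
            if hits10 > 0:
                with_hit_in_top10 += 1
            if relevant:
                recall_at_100.append(
                    sum(1 for d in results[:100] if d in relevant) / len(relevant))
            if not any(d in relevant for d in results):
                no_relevant += 1

        n = len(queries)
        return {
            "pct_relevant_in_top_10": 100.0 * sum(top10_precision) / n,
            "pct_queries_with_relevant_in_top_10": 100.0 * with_hit_in_top10 / n,
            "pct_recall_at_100": 100.0 * sum(recall_at_100) / max(len(recall_at_100), 1),
            "pct_queries_with_no_relevant_retrieved": 100.0 * no_relevant / n,
        }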


9  Using Logs to Analyze User Intent

    A second use of search and click logs is to analyze user intent by looking at how

    users interact with the internal search user interface.

    9.1  Query Analysis

    The first step for determining user intent is to look at searches executed by users.

9.1.1  Top Queries

    A simple but useful analysis is to look at the top 200 most frequent queries executed by users.

    The top 10-20% of queries, especially, are indicative of what’s “top of mind” for the user community. 

    All queries should be categorized by intent by someone who is well versed in the activities of the

    content domain.

      Is the user looking for a particular document or web site which they already know exists?

      Is the user looking to answer a question?

      Is the user looking for a set of documents for research?

      Is the user looking for a set of documents for self-education?

Naturally, it may be difficult to determine intent simply from the query. See section 9.1.4 for how this

    process can be refined and extended.

9.1.2  Randomly Selected Queries

    The top most frequent queries do not, in fact, give a good idea of “what the user population is

    searching for”. This is because the top queries are often the easiest, most obvious, and least variable

    queries to enter. These are often most frequent simply because they are simple and consistently

    entered.

    In particular, personal names and document titles vary widely. Rarely will a single individual or

    document title be in the top 200 most frequent queries, and yet these are often the most common

    queries.

    Therefore, a randomly selected set of 200 queries should be analyzed for intent and categorized.

    9.1.3  Overall Internal Search Usage

A histogram of the number of queries executed by the same user should be computed. This includes:

      Histogram (mean, median, min, max) of the number of queries executed in a session


      Histogram (mean, median, min, max) of the number of queries executed over 3 months

      Total queries executed per month, trending

      Total unique users, per month, trending

This will provide a good look at how often the user population turns to search as part of their daily work, and how often they find it to be a valuable tool, overall.

9.1.4  Randomly Selected Users

    Finally, a set of 25-50 randomly selected users should be analyzed for use cases.

    This should be done in three groups:

1.  Users who execute a single query in a session.

    The goal here is to try and determine why these users only executed a single query. Did they find what they were looking for? Were the results too poor to think about continuing?

2.  Users who execute three or more queries in a session.

    These are, presumably, users with a more substantial need who are willing to keep searching to answer their question.

    How did these users reformulate or re-cast their query? Could these reformulations have been performed automatically by the search engine? What were these users looking for?

3.  Users who execute ten or more queries over a month’s period.

    The goal here is to understand those ‘power’ users who frequently return to internal search. Do they execute the same query over and over? What are their primary use cases?

9.1.5  Help Desk Analysis

    If your organization has a “search help line”, this is a valuable source of information for determining

    user intent and usage models.

    To the extent possible, search help desk activity should be captured and analyzed. This can include

    chat logs, e-mails, voice transcriptions, etc. as available.

    9.2  Tooling for Query Analysis

Query analysis, of necessity, requires 1) a keen understanding of the user and their goals and needs,
2) a deep understanding of the content and what it can provide to the user, and 3) an understanding
of how the user’s intent is expressed through queries.

    While this is primarily a manual process, additional tooling can help to categorize and cluster queries.

    This includes:


•  Token statistics
   o  Identifying the frequency of all query tokens
      ▪  Across all queries
      ▪  Across all documents
•  Dictionary lookups
   o  Domain-specific dictionaries (ontologies, synonym lists, etc.)
   o  General, natural-language dictionaries
   o  Stop-word lists
   o  Other use-case-specific dictionaries, such as dictionaries for:
      ▪  First and last names from the census
      ▪  Ticker symbols
      ▪  Wikipedia entries (with types)
•  Regular expression matching
•  BNF pattern matching
   o  Patterns of tokens by token category types
•  Cross-query comparisons
   o  For example, identifying all word pairs which exist as single tokens in other queries (and vice versa)
•  User query-set analysis
   o  Comparing queries within a single user’s activity set
   o  This is used for analyzing query chains, to determine whether one type of query often leads to another type of query
•  Statistical analysis
   o  Extracting statistics from the queries is essential for determining query types and for estimating the impact that can be achieved by working on a class of queries

Generally, these statistics are computed using a collection of ad-hoc tools, including UNIX utilities
and custom software. If the query database is very large, then Big Data and/or MapReduce techniques
may be required.
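
As a small illustration of this kind of tooling, the sketch below computes token frequencies across all queries and tags each token against a stand-in domain dictionary and a crude ticker-symbol pattern. The dictionary contents and the regular expression are illustrative assumptions, not recommended rules.

```python
# A minimal sketch of query-token tooling: token frequencies across all queries,
# plus dictionary and regular-expression categorization of each token.
# The dictionary contents and the ticker-symbol pattern are illustrative assumptions.
import re
from collections import Counter

TICKER_RE = re.compile(r"^[A-Z]{1,5}$")          # crude ticker-symbol pattern (assumption)
DOMAIN_DICTIONARY = {"antivirus", "microphone"}  # stand-in for a real domain dictionary

def tokenize(query):
    # Lowercase alphanumeric tokens; real tooling would use the engine's own tokenizer.
    return re.findall(r"[a-z0-9']+", query.lower())

def token_report(queries):
    freq = Counter(tok for q in queries for tok in tokenize(q))
    report = []
    for tok, count in freq.most_common():
        categories = []
        if tok in DOMAIN_DICTIONARY:
            categories.append("domain-dictionary")
        if TICKER_RE.match(tok.upper()):
            categories.append("possible-ticker")
        report.append((tok, count, categories))
    return report
```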


9.3  User Interface Click Analysis

Analyzing clicks within the user interface is another key component of understanding how employees
are interacting with search.

9.3.1  Feature Usage

    The first and most important analysis is to determine what user interface features are being used and

    how frequently. This includes:

•  Tabs
•  Facets
•  Side-bar results
•  “Recommended” results (e.g. best bets or elevated results)
•  Sorting
•  Paging
•  Advanced search

    A thorough understanding of the usage of these features will help determine what is working and

    what is not, and what should be kept and what should be abandoned.
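
Feature usage can usually be tallied directly from the click log. The sketch below assumes a hypothetical JSON-lines click log in which every event carries an "event_type" field (for example "search", "facet_click", "tab_click", "sort_click", "page_next", or "doc_click"); these event names are assumptions, not a standard schema.

```python
# A minimal sketch: tally user-interface feature usage from a click log.
# Assumes a hypothetical JSON-lines click log where each event has an "event_type"
# field such as "search", "facet_click", "tab_click", "sort_click", or "doc_click".
import json
from collections import Counter

def feature_usage(click_log_path):
    counts = Counter()
    total_searches = 0
    with open(click_log_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            event_type = event.get("event_type", "unknown")
            counts[event_type] += 1
            if event_type == "search":
                total_searches += 1
    # Express each feature as events per 100 searches, which is easier to compare
    # across months than raw counts.
    per_100_searches = {k: (100.0 * v / total_searches if total_searches else 0.0)
                        for k, v in counts.items()}
    return counts, per_100_searches
```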

9.3.2  Sequences

    Second, click sequences can help determine if users are leveraging user interface features to their

    best advantage.

    The goal here is to determine if a user interface feature leads the user to information which helps

    solve their problem.

•  How often does a facet click lead to a document click?
•  How often does a side-bar click lead to a new search?
   o  And how often does this lead to a document click?
•  How often does a tab click lead to a document click?
•  How often does a sort click lead to a document click?

    In this way, we can determine if user interface features are leading to actual improvements in the

    end-user’s ability to find information. 
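
One way to measure such sequences is to group the click log by session, order each session by time, and count how often one event type is later followed by another. The sketch below assumes the same hypothetical JSON-lines click log, with "session_id", "timestamp", and "event_type" fields.

```python
# A minimal sketch: how often does one UI event lead to another within a session?
# Assumes a hypothetical JSON-lines click log with "session_id", "timestamp",
# and "event_type" fields (timestamps assumed to be sortable ISO-8601 strings).
import json
from collections import defaultdict

def followed_by_rate(click_log_path, first="facet_click", then="doc_click"):
    sessions = defaultdict(list)
    with open(click_log_path, encoding="utf-8") as f:
        for line in f:
            e = json.loads(line)
            sessions[e["session_id"]].append((e["timestamp"], e["event_type"]))

    first_events, followed = 0, 0
    for events in sessions.values():
        events.sort()                       # order the session's events by timestamp
        types = [t for _, t in events]
        for i, t in enumerate(types):
            if t == first:
                first_events += 1
                if then in types[i + 1:]:   # a later event of the target type in the same session
                    followed += 1
    return followed / first_events if first_events else 0.0
```

For example, followed_by_rate(path, "facet_click", "doc_click") estimates how often a facet click eventually leads to a document click.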


9.4  Long Tail Analysis

A “long tail analysis” involves the manual analysis of a very large number of queries (5,000 to 10,000).
The goal of this analysis is to categorize long-tail queries so they can be properly binned and
dealt with using a variety of techniques.

Long tail analysis shifts the process from analyzing “top queries” (of any sort) to a large-scale manual
analysis of a very large volume of queries.

9.4.1  Query Database

The essence of long-tail analysis is a large and evolving query database (a minimal schema sketch follows the list). This database includes:

•  Status flags (new, analyzed, unknown, deferred, tentatively analyzed, problem, solved, good results, bad results, etc.)
•  The query the user entered
•  The date and time the query was entered into the database
•  The assigned categories and sub-categories for the query
•  Other descriptive text / explanation
•  The user who analyzed the query
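
A minimal sketch of such a query database is shown below, using SQLite purely for illustration; the table and column names follow the list above but are otherwise assumptions.

```python
# A minimal sketch of the long-tail query database described above, using SQLite.
# Table and column names are illustrative; adapt them to your own tooling.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS query_analysis (
    id           INTEGER PRIMARY KEY,
    query        TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'new',  -- new, analyzed, deferred, problem, solved, ...
    category     TEXT,
    sub_category TEXT,
    explanation  TEXT,
    analyst      TEXT,
    entered_at   TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_query_status ON query_analysis(status);
"""

def open_query_db(path="query_analysis.db"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```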

    9.4.2  Supporting Databases

    These are typically created on an as-needed basis, but may include:

•  Query logs indexed by query
   o  So that sets of similar queries can be searched, grouped, and processed together
   o  For example, all queries that start with “I need” or “account information”, etc.
   o  This also allows retrieval of other query metadata:
      ▪  The list of users who have executed the query
      ▪  The list of times (for trend lines) when the query was executed
      ▪  (If possible) geographic locations for the query
•  User information indexed by user
   o  So that information on an individual user (i.e. user events) can be quickly brought up and analyzed when analyzing use cases
•  Click logs indexed by user and time
   o  So that a user’s traffic can be analyzed to help determine intent


9.4.3  Tooling Support

    To handle manual reviews of such a large set of databases, a search engine over the log data is

    recommended. Further tooling will:

•  Link all queries to the query database
   o  To determine the status and disposition of the query
•  Allow for import and export to the query database
   o  For off-line analysis of sets of queries

    9.4.4  Long Tail Analysis –  Outputs

    The outputs of the long-tail analysis include:

•  Categories & Sub-Categories
   o  These represent use cases and query patterns to be handled as a set
   o  Can include categories / use cases for documents as well (i.e. major document types and their use cases)
•  General language understanding
   o  Query patterns
   o  Nouns / verbs
   o  Sophistication of language / education level of the user
   o  Jargon / lay language
   o  Common question / answer patterns, for specific information or for open-ended research
•  Synonyms
   o  For example, antivirus → security software, mic → microphone
•  Identify recommended re-direction patterns
   o  Account information → account search / custom responses
   o  Support & help queries → support search
   o  Generic company information → public website (.com) search
   o  People queries → people database
•  Identify best bet (recommended result) patterns
•  Identify user interface search presentation issues
   o  Incorrect fields being displayed in results, hiding important information, etc.
•  Identify seasonal patterns (where appropriate)


Along with each use case, the analysis should produce an idea of the scope of the use case and what
sorts of fixes are required to improve user satisfaction.
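
As an illustration of how the re-direction patterns above might be operationalized, the sketch below maps recognized query classes to a target search or response using a small rule table. The specific patterns and target names are hypothetical examples, not a recommended production rule set.

```python
# A minimal sketch: route recognized query classes to a more appropriate target,
# as suggested by the re-direction patterns above. The rules are illustrative only.
import re

REDIRECT_RULES = [
    (re.compile(r"\baccount (info|information|details)\b", re.I), "account_search"),
    (re.compile(r"\b(help|support|how do i)\b", re.I),            "support_search"),
    (re.compile(r"\babout (the )?company\b", re.I),               "corporate_website"),
    (re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$"),                    "people_database"),  # crude person-name pattern
]

def route_query(query):
    for pattern, target in REDIRECT_RULES:
        if pattern.search(query):
            return target
    return "default_search"
```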

9.4.5  More Information

    For high-value content sources, long-tail analysis will provide the most thorough and complete

    picture of how users are using the system.

    Search Technologies has implemented long-tail analysis at one high-value e-Commerce site, where

    we have refined the process.


10  A/B Testing

Where possible, A/B testing is recommended to validate the engine score and other improvements,
and to determine the exact relationship between engine-score improvements and other website
metrics (e.g. abandonment rates and conversion rates).

There can be no single “step by step” process for A/B testing, since it will involve your production
system and production data. Therefore, it must be handled carefully, with production considerations
in mind.

    The following are some broad guidelines for A/B testing.

•  Both systems, A and B, need to be up simultaneously
   o  The only way for A/B testing to be accurate is if incoming requests are randomly assigned to A or B for a period of time.
   o  This will ensure that the users for A and B are drawn from the exact same user population.

•  A/B does not need to be 50/50
   o  Typically, A/B testing is 95/5:
      ▪  95% on the current production system
      ▪  5% on the new (under test) system
   o  This way, if the “B” system is severely deficient, then only a small percentage of production users will be affected.

•  Routing between A & B can occur at any of a variety of different levels:
   o  User interface
      ▪  Two user interfaces (A / B) to serve different user populations
   o  Results mixing / query processing
      ▪  Two different results mixing or query processing models which query over the same underlying indexes
   o  Multiple indexes
      ▪  Different indexes, indexed in different ways

Ideally, of course, a system will be designed and implemented with A/B testing as a primary goal
from the very beginning. If it is, then “turning on” a B system for testing becomes a standard part of
system administration and testing procedures.
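
The routing itself can be as simple as a deterministic hash of a stable user or request identifier, so that each user consistently lands on the same side for the duration of the test. The sketch below assumes a 95/5 split and a per-experiment salt; both are parameters you would choose for each test.

```python
# A minimal sketch: deterministically assign users to the A (production) or
# B (under test) system. Hashing a stable user id keeps each user on one side
# for the whole experiment; the 5% share for B and the salt are per-test choices.
import hashlib

def assign_variant(user_id: str, b_share: float = 0.05, salt: str = "ab-test-1") -> str:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "B" if bucket < b_share else "A"
```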


11  Conclusions

This paper is the result of some 22 years of search engine accuracy testing, evaluation, and practical
experience. The journey started in 1992 with the first TREC conference, which several Search
Technologies employees attended. Many of the philosophies and strategies described in this paper
trace their roots back to TREC.

The journey continued throughout the 1990s, as we experimented with varying relevancy ranking
technologies and were the first (to my knowledge) to use machine learning to optimize relevancy
ranking formulae based on manual judgments.

    As we enter the age of the Cloud and Big Data, we now have vastly more data available and the

    machine resources required to process it. This has opened up new and exciting possibilities for

    improving search – not just for a select few – but for everyone.

At Search Technologies, we continue to work every day to make this vision of high-quality,
optimized, tuned, targeted, engaging, and powerful search a reality for everyone. This paper is just
another step in the journey toward that ultimate goal.

For further information or an informal discussion, CONTACT US.

Text Retrieval Conference (TREC): http://en.wikipedia.org/wiki/Text_Retrieval_Conference
Contact Search Technologies: http://www.searchtechnologies.com/contacts