search accuracy analytics white paper

Upload: lylienkiet

Post on 06-Jul-2018






  • 8/17/2019 Search Accuracy Analytics White Paper



    Search Accuracy Analytics

    Paul Nelson

    Fall  14

    White Paper

    Version 2.6.3

    November 2014

    ©2014 Search Technologies



      Search Accuracy Analytics – White Paper


    Table of Contents

    1  Summary
        The Impact of Poor Accuracy
        The Solution to Poor Accuracy
        A Reliable, Step-By-Step Process
        A User-Focused Approach
        A Comprehensive Approach
    2  Problem Description
    3  Gathering and Auditing Log Files
        Search Logs
        Click Logs
        User Information
        Log Cleanup
        Auditing the Logs
    Index and Log File Snapshots
    The Search Engine Score
        User-Based Score Model
        What is Relevant to the User?
            Using Logs to Identify Relevant Documents
            Gradated Relevancy versus Binary Relevancy
            Documents are relevant to users, and not to queries
        Score Computation
            The Scoring Factor
        Interpreting the Score
        User and Query Scoring
        Engine Quality Predictions / Quality Regression Testing
    Additional Metrics
        Query Metrics
        Relevancy and Clicks
        Result Metrics
        Document Metrics


    Continuous Improvement
        The Continuous Improvement Cycle
        Continuous Improvement Requirements
        Recording Performance from Run to Run
    Manual Relevancy Judgments
        The Relevancy Judgment User Interface
        Advantages
        Statistics from Manual Judgments
    Using Logs to Analyze User Intent
        Query Analysis
            Top Queries
            Randomly Selected Queries
            Overall Internal Search Usage
            Randomly Selected Users
            Help Desk Analysis
        Tooling for Query Analysis
        User Interface Click Analysis
            Feature Usage
            Sequences
        Long Tail Analysis
            Query Database
            Supporting Databases
            Tooling Support
            Long Tail Analysis – Outputs
            More Information
    A/B Testing
    Conclusions


    Search Accuracy Analytics 

    Top quality search accuracy is not achieved with technology alone or through a one-time “quick fix”.

    It can only be achieved with a careful, continuous improvement process driven by a varied set of

    metrics. And the starting point for such a process is an objective, statistically valid measurement of search accuracy.

    When search results are not satisfactory or relevant enough to users, search development teams

    often analyze the problem by looking at accuracy metrics from a query perspective. They ask

    questions like: “What queries worked? What queries are most frequently executed? What queries

    returned zero results?” And so on. In contrast, this paper presents a broader, user-focused approach to search relevancy. We ask the questions “Is the user satisfied?” and “Are the results worthy of further user action?”

    Search Technologies recommends using this white paper as a reference guide. The techniques in this

    paper have been used successfully by Search Technologies with a number of customers for first computing search engine accuracy metrics and then using those metrics to iteratively improve search

    engine relevancy using a reliable, measurable process.



    1  Summary

    The number one complaint about search engines is that they are “not accurate”.

    Customers complain to us that their engine brings back irrelevant, old, or even

    bizarre off-the-wall documents in response to their queries.

    This problem is often compounded by the secretive nature of search engines and search engine

    companies. Relevancy ranking algorithms are often veiled in secrecy (described variously as

    ‘intellectual property’ or more quaintly as the ‘secret sauce’). And even when algorithms are open to

    the public (for Open Source search engines, for example), the algorithms are often so complex and

    convoluted that they defy simple understanding or analysis.

    1.1  The Impact of Poor Accuracy

    What makes this situation all the more frustrating for customers is the impact of poor accuracy on the bottom line.

    •  For Corporate-Wide Search – Wasted employee time. Missed opportunities for new business. Re-work, mistakes, and “re-inventing the wheel” due to lack of knowledge sharing. Wasted investment in corporate search when minimum user requirements for accuracy are not met.

    •  For e-commerce Search – Lower conversion rates. Higher abandonment rates. Missed opportunities. Lower sales revenue. Loss of mobile revenue.

    •  For Publishing – Unsatisfied customers. Less value to the customer. Lower renewal rates. Fewer subscriptions. Loss of subscription revenue.

    •  For Government – Unmet mission objectives. Less public engagement. Lower search and download activity. More difficult to justify one’s mission. Incomplete intelligence analysis. Missed threats from foreign agents. Missed opportunities for mission advancement.

    •  For Recruiting – Lower fill rate. Lower margins or spread. Unhappy candidates assigned to inappropriate jobs. Loss of candidate and hiring manager goodwill. Loss of revenue.


    1.2  The Solution to Poor Accuracy

    When organizations encounter poor relevancy from their search engine, they usually have one of two reactions:

    1.  Buy a new search engine.

    2.  Give up.

    Both of these approaches are unproductive, expensive and wasteful.


    The difficult truth is that all search engines require tuning, and all content requires processing. The

    more tuning and the more processing you do, the better your search results will be.

    Search engines are designed and implemented by software programmers to handle standard use cases such as news stories and web pages. They are not delivered out-of-the-box to handle the wide variety and complexity of content and use cases that are found around the world. They need to be tuned, and part of that tuning is to process the content so that it is easily interpretable.

    And so there is no “easy fix”, no “silver bullet” and no substitute for a little bit of elbow grease (aka hard work) when it comes to creating a satisfying search result.


    1.3  A Reliable, Step-By-Step Process

    This paper gives a step-by-step process to solve the problem of poor search engine relevancy. The major steps are:

    1.  Gather, audit, and process log files.

        o  Query logs, click logs, and other indicative user activity.

    2.  Create a snapshot of the log files and search engine index for testing.

    3.  Compute engine score.

    4.  Compute additional search metrics.

    5.  Implement and score accuracy improvements (the continuous improvement process).

    6.  Perform manual relevancy judgments (when practical).

    7.  Use logs to analyze user intent.

    8.  Perform A/B testing to validate improvements and calculate Return on Investment (ROI).

    These steps have been successfully implemented by Search Technologies and are known to provide reliable, methodical, measurable accuracy improvements.


    1.4  A User-Focused Approach

    When discussing what documents are relevant to what queries, it is common for different parts of the organization to have different goals and opinions as to “what is a good document”. For example:

    •  (Functional) Do the results contain the words the user entered?

    •  (Visual perception) Is it easy to see that the results contain good documents?

    •  (Knowledge) Does it answer the deeper question implied by the query?

    •  (Marketing) Do the results enhance the perception of the corporate brand?

    •  (Sales) Do the results lead to more sales?

    •  (Editorial) Do the results highlight editor selections?


    With many competing goals for search, it is easy to get lost when trying to figure out what is

    important and what should be fixed (and why).

    Most search accuracy metrics are from a query perspective. They ask questions like: “What queries

    worked? What are the most frequently executed queries? What queries returned zero results? How

    do I improve my queries?” etc.

    In contrast, this paper presents a user-focused approach to search accuracy:

    •  All accuracy evaluation is from the user’s point of view.

    •  All that matters is: does the user find results of interest?

    In this paper, we are interested in the central question: “Is the user satisfied?” We attempt to answer this question by analyzing user activity to see whether the search engine is providing results which the user has found worthy of further activity.

    The user-centered approach is a powerful approach with some subtle consequences. For example, if

    a query is executed by 10 different users, then that query will be analyzed 10 times – once from each

    user’s point of view (analyzing the activity stream for each user individually).

    Further, this approach is more accurate because it provides scores which are automatically normalized to the user population. A very small number of highly interactive users will not adversely skew the score, unlike with query-based approaches.
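    The normalization effect can be illustrated with a minimal sketch. It assumes a simplified log in which each search is reduced to a (user ID, satisfied) pair. The function names and the sample data are hypothetical, and the averaging shown here is only an illustration of the idea, not the engine score formula itself.

```python
from collections import defaultdict

def pooled_score(events):
    """Fraction of all searches that were satisfied (query-event view)."""
    return sum(1 for _, ok in events if ok) / len(events)

def user_normalized_score(events):
    """Average of per-user satisfaction rates (user view)."""
    per_user = defaultdict(list)
    for user_id, ok in events:
        per_user[user_id].append(ok)
    rates = [sum(flags) / len(flags) for flags in per_user.values()]
    return sum(rates) / len(rates)

# Hypothetical sample: one heavy user with 8 satisfied searches and
# four light users with one unsatisfied search each.
events = [("heavy", True)] * 8 + [(f"u{i}", False) for i in range(4)]

print(pooled_score(events))           # 8 of 12 searches were satisfied
print(user_normalized_score(events))  # only 1 of 5 users was satisfied
```

    In this sample the pooled figure is dominated by the single heavy user, while the user-normalized figure reflects that four of the five users found nothing of interest.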

    And finally, user-based scoring provides more detailed information for use in system analysis. It

    scores every user and every user’s query (as well as the system as a whole). It identifies the least

    effective queries and the most effective queries, and traces every query back to the user session so

    the query can be viewed in context.

    What about the other perspectives? Aren’t they important too? 

    Of course they are. But first you need to understand how your end-users view your search results.

    Once you have a solid understanding of this, you can then include other perspectives (such as brand

    awareness, editorial selections, etc.) into the equation.

    E-commerce and Increasing Sales

    Finally, the user-based approach is the best approach for increasing e-commerce sales. What matters most to e-commerce is how many customers (i.e. users) purchased products from the query results.

    With e-commerce, you are less concerned about the query, and more concerned about the customer. What is most important is: Did the query return something that the customer purchased? How many customers who executed the query ended up purchasing something? What queries lead to sales? What queries never lead to sales?


    The user-based approach can answer all of these questions. The user-based approach can aggregate

    all  customer activity and leverage that user activity to determine the success (or failure) of every

    query for every user, from each customer’s unique point of view. 
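    As a minimal sketch of how such questions might be answered from logs, the following computes, for each query, the share of distinct customers who executed it and then purchased. It assumes that purchases have already been attributed to the query that surfaced the product; the function name, tuple layout, and sample data are hypothetical.

```python
from collections import defaultdict

def purchase_rate_by_query(searches, purchases):
    """Share of distinct customers per query who went on to purchase.

    searches:  iterable of (user_id, query) pairs, one per search
    purchases: iterable of (user_id, query) pairs, one per purchase
               attributed to that query (attribution is assumed to
               have been done upstream)
    """
    searchers = defaultdict(set)
    for user_id, query in searches:
        searchers[query].add(user_id)
    buyers = defaultdict(set)
    for user_id, query in purchases:
        buyers[query].add(user_id)
    # Each customer counts once per query, however often they searched.
    return {q: len(buyers[q] & users) / len(users)
            for q, users in searchers.items()}

searches = [("a", "red shoes"), ("b", "red shoes"),
            ("a", "red shoes"), ("c", "umbrella")]
purchases = [("a", "red shoes")]

rates = purchase_rate_by_query(searches, purchases)
never_sell = sorted(q for q, rate in rates.items() if rate == 0)
print(rates)
print(never_sell)
```

    Because customers are counted as sets, a customer who repeats a query many times does not inflate that query's numbers.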

    Search Technologies has shown that this approach can dramatically improve conversion rates for e-commerce sites. In one implementation, we produced a 7.5% increase in conversion rate for products purchased based on search results for one site (a 3% increase overall), validated by A/B testing.



    1.5  A Comprehensive Approach

    Finally, this paper represents a comprehensive approach. The approach includes:

    •  Log file analysis

    •  Automatic engine scoring

    •  Continuous improvement

    •  Manual scoring

    •  User intent analysis

    •  A/B testing

    Note that not all techniques will be appropriate for all situations. Some techniques are appropriate for production systems with sufficient log information to use for analysis, while others are better for evaluating brand-new systems. Some techniques are for on-line analysis and others are for off-line analysis.


    Search Technologies recommends using this white paper as a reference guide. We have architects

    and data scientists who can help you determine which methods and processes are best suited to

    your situation. Once you have decided on a plan, we have lead engineers, senior developers, and

    experienced project managers who can ensure that the plan is delivered efficiently, on-time, and

    with the best possible outcome.


    2  Problem Description

    What often happens with fuzzy algorithms like search engine relevancy is that some

    improvements make relevancy better (more accurate) for some queries and worse

    (less accurate) for other queries.

    Only by evaluating a statistically valid sample set can one know whether the algorithm is better overall.

    •  For each new release of the search engine, the accuracy of the overall system must be measured to see if it has improved (or worsened).

        o  Simple bugs can easily cause dramatic degradation in the overall quality of the system.

        o  Therefore, it is much too easy for a new bug to slip into production unnoticed.

        o  The problem is exacerbated by the size of the data sets. Any algorithm which operates over hundreds of thousands of documents and queries must have a very large test suite for continuous statistical evaluation and improvement.

    •  The relative benefit of each parameter of the search relevancy algorithm must be measured individually. For example:

        o  How much does field weighting help?

        o  How much will other query adjustments (weighting by document type, exact phrase weighting, etc.) help the relevancy?

        o  How much do synonyms help?

        o  How much will link counting and popularity counting help?

    •  A data-directed method for determining what types of queries are performing poorly, and what fixes are most likely to improve accuracy, needs to be implemented.

        o  Too often, this information on what queries to fix comes instead from anecdotal end-user requests and evaluations.
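    One way to measure the benefit of each parameter individually is to score the engine with everything enabled and then with each parameter switched off in turn. The sketch below assumes a caller-supplied `score_fn` that runs the test suite for a given set of enabled parameters and returns an accuracy score. The `toy_score` function and its numbers are invented stand-ins so that the example runs; they are not measurements.

```python
def parameter_benefits(score_fn, parameters):
    """Return, for each parameter, how much the score drops when that
    parameter alone is switched off (a simple ablation)."""
    everything = frozenset(parameters)
    baseline = score_fn(everything)
    return {p: baseline - score_fn(everything - {p}) for p in parameters}

def toy_score(enabled):
    # Hypothetical stand-in for a real evaluation run over a test suite.
    gains = {"field_weighting": 0.10, "synonyms": 0.04, "popularity": -0.01}
    return 0.50 + sum(gains[p] for p in enabled)

benefits = parameter_benefits(
    toy_score, ["field_weighting", "synonyms", "popularity"])
for name, delta in sorted(benefits.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {delta:+.2f}")
```

    A negative value would indicate a parameter that lowers the score on the test set. Note that this simple form ignores interactions between parameters.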

    Fortunately, standard best-practice procedures do exist for continuous improvement of search engine algorithms. Further, these procedures have been demonstrated in real-world situations to produce high quality, statistically verifiable results. User-centered approaches provide more reliable statistics which better correlate to end-user satisfaction.


    3  Gathering and Auditing Log Files

    Clean, complete log files gathered from a production system are critical for metrics analysis. There are two types of log files required for accuracy measurements: search logs and click logs.

    3.1  Search Logs

    Search logs contain an event for every search executed by every user. Every event should contain:

    •  The date and time (down to the second) when the search was executed.

    •  A “search ID” which uniquely identifies the search.

        o  This is optional, but highly desired to connect searches with click log events.

        o  If the user executes the same search twice in a row, each search should have a unique ID.

    •  The user ID of the user who executed the search.

    •  The text exactly as entered by the user.

    •  The query as tokenized and cleaned by the query processor.

    •  Search parameters as submitted to the search engine.

    •  Whether the query was selected from an auto-suggest menu.

    •  The number of documents found by the search engine in response to the search.

    •  The starting result number requested.

        o  A number > 0 indicates the user was asking for page 2-n of the results.

    •  The number of rows requested (e.g. the page size of the results).

    •  The URL from where the search was executed (like the Web log “referer” field).

    •  A code for the type of search.

        o  Should indicate if the user clicked on a facet value, an advanced search, or a simple search.

    •  A list of filters (advanced search or facets) turned on for the search.

    •  The time it took for the search to execute (milliseconds).

    Some of these parameters may be parsed from the URL used to submit the search.
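    The fields listed above could be captured in a record such as the following. This is a minimal sketch: the class name, field names, defaults, and search type codes are hypothetical and would need to be mapped to whatever the search application actually logs.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class SearchEvent:
    timestamp: datetime              # when the search ran, to the second
    user_id: str                     # user who executed the search
    raw_query: str                   # text exactly as entered by the user
    processed_query: str             # query as tokenized and cleaned
    result_count: int                # documents found by the engine
    start_row: int = 0               # > 0 means page 2-n was requested
    rows_requested: int = 10         # page size of the results
    search_id: Optional[str] = None  # optional; links to click log events
    search_type: str = "simple"      # assumed codes: simple/advanced/facet
    from_autosuggest: bool = False   # chosen from an auto-suggest menu?
    filters: list = field(default_factory=list)        # active filters
    engine_params: dict = field(default_factory=dict)  # as sent to engine
    referer_url: Optional[str] = None                  # page searched from
    elapsed_ms: Optional[int] = None                   # execution time

    @property
    def is_pagination(self) -> bool:
        """True when the event is a request for a later results page."""
        return self.start_row > 0

event = SearchEvent(
    timestamp=datetime(2014, 11, 3, 9, 15, 42),
    user_id="u123",
    raw_query="Pump  manual",
    processed_query="pump manual",
    result_count=57,
    start_row=10,
    search_id="s-0001",
)
print(event.is_pagination)
```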

    •  Test searches should be removed (such as searches on the word “test” or “john smith”).

    •  Searches from test accounts should be removed (see auditing logs, section 3.5 below).


    •  “Automatic searches” executed in response to browsing or navigation should be removed.

        o  Searches executed automatically “behind the scenes” when the user clicks on a link should be removed.

        o  External links directly to the search engine

    •  Other sorts of administrative searches (for example, to download a list of all documents for index auditing) should be removed.


    3.2  Click Logs

    In addition to searches, accuracy testing needs to capture exactly what the user clicked in response to a search.

    Generally, click logs should contain all clicks by all users within the search user interface. This includes clicks on documents, tabs, advanced search, help, more information, etc.

    In addition, clicks on search results should contain the following information:

    •  The time stamp (date and time to the second) when the search result was clicked.

    •  The document ID (from the index) or URL of the search result selected.

    •  The unique search ID (see previous section, 3.1) of the search which provided the search result.

    •  The ID of the user who clicked on the result.

    •  The position within the search results of the document, where 0 = the first document returned by the search engine.

    In order to do this, search results must be wrapped (they cannot be bare URLs), so that when clicked, the event is captured in the search engine logs.
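    One common way to wrap a result is to route the click through a logging endpoint that records the event and then redirects to the document. The sketch below assumes such an endpoint exists; its URL and the parameter names (`url`, `sid`, `uid`, `pos`) are hypothetical. The time stamp is not carried in the link, on the assumption that the server adds it when the click arrives.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

CLICK_ENDPOINT = "https://search.example.com/click"  # hypothetical

def wrap_result_url(doc_url, search_id, user_id, position):
    """Build the link shown in the result list instead of the bare URL."""
    params = {"url": doc_url, "sid": search_id,
              "uid": user_id, "pos": position}
    return CLICK_ENDPOINT + "?" + urlencode(params)

def parse_click(wrapped_url):
    """Recover the click log fields from a wrapped link (server side)."""
    qs = parse_qs(urlsplit(wrapped_url).query)
    return {
        "doc_url": qs["url"][0],
        "search_id": qs["sid"][0],
        "user_id": qs["uid"][0],
        "position": int(qs["pos"][0]),  # 0 = first document returned
    }

wrapped = wrap_result_url(
    "https://docs.example.com/report.pdf?v=2", "s-0001", "u123", 0)
click = parse_click(wrapped)
print(wrapped)
print(click)
```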


    3.3  User Information

    [Optional] User information can help determine when subsets of the user population are being poorly served by the search engine.

    User information should be available for search engine statistics and cross-referenced by user ID. Useful user information to gather includes:

    •  Start date of the user within the organization

    •  Business unit(s) or product line(s) to which the user belongs or is subscribed

    •  Manager to whom the user reports (for employees)

    •  Employer of the user (for public users) if available


    •  Geographic location of the user

    •  Job title of the user, if available

    Further, e-commerce and informational sites can use cookies or back-end server data to identify users (especially when they have logged in), along with their population groups and interaction history.

    Gathering additional information about users is optional, and can be deferred to a future

    implementation when the basics of search engine accuracy have been improved.

    3.4  Log Cleanup

    Just gathering logs will not be enough. All logs will contain useless information that is not related to

    actual end-user usage of the system. This useless information will need to be cleaned up before the

    logs are useful.

    This includes:

    •  Queries by monitoring programs (‘test probes’)

    •  Queries by the search team

    •  Common queries used in demonstrations (‘water’, ‘john smith’, ‘big data’)

    •  Log file entries which are not queries or user clicks at all (HEAD requests, status requests, etc.)

    •  Multiple server accesses for a single end-user entered query

        o  Where a click on the [SEARCH] button actually fires off multiple queries behind the scenes

    •  Clicks on page 2, 3, 4 for a query

        o  These should be categorized as such and set aside

        o  In other words, simply clicking on page 2 for a query should not increase the total count for that query or the words it contains.

    •  Clicks on facet filters for a query

        o  These should be categorized and set aside

    •  Canned queries automatically executed (typically through the APIs) to provide summary information for special use cases

        o  In other words, queries where the query expression is fixed and executed behind the scenes

    •  On public sites, there may be spam such as random queries, fake comment text containing URLs, attempts to influence the search results or suggestions, or even targeted probes.


    o  These should be detected and removed where possible.

    Log cleanup is a substantial task by itself and requires some investigative work on how the internal

    search system works. It must be treated carefully.
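    A first cleanup pass might look like the following minimal sketch. It assumes each search event is a dictionary with `user_id`, `query`, and optional `start_row` and `search_type` keys. These key names and the test account IDs are hypothetical, and a real cleanup would add rules for probes, canned queries, and spam that depend on how the particular system works.

```python
TEST_USERS = {"test_account", "monitor_probe"}            # hypothetical IDs
DEMO_QUERIES = {"test", "water", "john smith", "big data"}

def clean_search_log(events):
    """Split raw search events into (kept, set_aside, removed)."""
    kept, set_aside, removed = [], [], []
    for event in events:
        query = event["query"].strip().lower()
        if event["user_id"] in TEST_USERS or query in DEMO_QUERIES:
            removed.append(event)       # not real end-user activity
        elif event.get("start_row", 0) > 0 or event.get("search_type") == "facet":
            set_aside.append(event)     # paging and facet clicks: categorized,
                                        # but not counted as new queries
        else:
            kept.append(event)
    return kept, set_aside, removed

raw = [
    {"user_id": "u1", "query": "pump manual"},
    {"user_id": "u1", "query": "pump manual", "start_row": 10},
    {"user_id": "test_account", "query": "valves"},
    {"user_id": "u2", "query": "John Smith"},
]
kept, set_aside, removed = clean_search_log(raw)
print(len(kept), len(set_aside), len(removed))
```

    Keeping the removed and set-aside events in separate lists, rather than discarding them, makes it possible to audit the cleanup itself.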


    3.5  Auditing the Logs

    Accuracy metrics will only be as good as the accuracy of the log files gathered. Therefore, log auditing is required. This is accomplished by creating a test account, logging in with it, and interacting with the system. Each interaction should be manually logged.

    Once complete, the log files should be gathered and the manual accounting of the actions should be compared with the automatically generated logs. Any discrepancies found should be investigated and corrected.
    Log auditing can and should be automated and executed periodically. Manual log auditing will still be

    required whenever a new version of the system is released.

    Log auditing should be performed both before and after log cleanup – to ensure that log cleanup

    does not remove important log information.
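    The comparison between the manual accounting and the generated logs can be automated along the following lines. This is a minimal sketch which assumes that both sides have been reduced to comparable (kind, detail) tuples for the test account; the function name and sample actions are hypothetical.

```python
from collections import Counter

def audit_logs(manual_actions, logged_actions):
    """Compare manually recorded actions with automatically logged ones.

    Returns (missing, unexpected): actions performed but not logged,
    and actions logged but never performed, each with a count.
    """
    manual, logged = Counter(manual_actions), Counter(logged_actions)
    missing = manual - logged       # Counter subtraction keeps positives only
    unexpected = logged - manual
    return dict(missing), dict(unexpected)

manual = [("search", "pump manual"), ("click", "doc-17"),
          ("search", "pump manual")]
logged = [("search", "pump manual"), ("click", "doc-17"),
          ("search", "status")]

missing, unexpected = audit_logs(manual, logged)
print("missing:", missing)
print("unexpected:", unexpected)
```

    Using counts rather than sets matters here: the repeated search in the sample would otherwise look correctly logged even though one of its two events is missing.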


    Index and Log File Snapshots 

    Most systems are “living” systems with constant updates, new documents, and user activity. But for systems under evaluation, especially those which are being tuned and refined, constant updates mean changing and non-comparable metrics from run to run.

    Therefore, Search Technologies recommends creating a “snapshot” of data and logs at a certain point in time. All accuracy measurements will be performed on this snapshot.

    This further means:

    •  The snapshot must include a complete data set and log database.

    •  Any time-based query parameters will need to be fixed to the time of the snapshot.

        o  For example, relevancy boosting based on document age or “freshness”.

    •  Naturally, the snapshot will need to be periodically refreshed with the latest production data. However, when this occurs:

        o  The same engine and accuracy metrics must be computed on both the old and new snapshots.

        o  This way, score changes based merely on differences in the document database and recent user activity can be determined and factored into future analysis.
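    Fixing a time-based parameter to the snapshot can be as simple as passing the snapshot time explicitly instead of reading the system clock. The sketch below uses an exponential decay with a one-year half-life purely as an illustrative freshness formula; the formula, the function name, and the snapshot date are assumptions, not the boosting used by any particular engine.

```python
from datetime import datetime, timedelta

SNAPSHOT_TIME = datetime(2014, 11, 1)   # hypothetical snapshot date

def freshness_boost(doc_date, reference_time, half_life_days=365.0):
    """Boost in (0, 1] that halves for every half_life_days of document
    age, measured against reference_time rather than the wall clock."""
    age_seconds = (reference_time - doc_date).total_seconds()
    age_days = max(age_seconds / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)

doc_date = SNAPSHOT_TIME - timedelta(days=365)

# Evaluation runs always pass the snapshot time, so the boost for a
# given document is identical from run to run.
print(freshness_boost(doc_date, SNAPSHOT_TIME))
```

    Had the function called `datetime.now()` internally, the same document would receive a smaller boost in every later run, and scores would drift without any change to the engine.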


    The Search Engine Score 

    This section describes the metrics and reports which should be computed to

    determine search engine accuracy. Multiple reports are required to look at

    accuracy from a variety of dimensions.

    The most important report is the search engine score. This is an overall judgment of how well the

    search engine is meeting the user need.


    User-Based Score Model

    The Search Engine Score in this section is user-based. This means that the model takes an end-user perspective (rather than a search engine perspective) on the problem. Results are gathered by user (or session) and then analyzed within the user/session for accuracy. These user-based statistics are then aggregated to create the score for the engine as a whole.

    This user-based model provides a more accurate description as to how well users are being served by

    the search engine. In particular:

    •  What percentage of users are satisfied?

    •  What is the average user score?

    •  Who are the most and least satisfied users?

    Similarly, when queries are scored, these scores use data derived from the user who executed the

    query. If a query is executed by multiple users, then that query is judged multiple times – from each

    user’s individual perspective. 

    This provides useful, user-based information on queries:

    •  What are the queries which satisfy the highest percentage of users?

    •  What are the queries which satisfy the smallest percentage of users?

    •  What is the overall query score?

    •  What are the most and least satisfying queries to the users who execute them?


    What is Relevant to the User?

    The search engine score requires that we understand what documents are relevant (of interest) to a

    particular user or session. The ultimate goal of the search engine is to bring back documents which

    the user finds to be interesting as quickly and as prominently as possible.



    Using Logs to Identify Relevant Documents

    There are many different ways to identify documents which are “relevant” or “of interest” to a user:

    •  Did the user view the document?

    •  Did the user download or share the document?

    •  Did the user view the product details page?

    •  Did the user “add to cart” the product?

    •  Did the user purchase the product?

    •  Did the user hover-over to view details of the product or document preview?

    All of these signals indicate that the user found the document (or product) sufficiently worthy of

    some additional investigative action on their part. It is these documents that we consider to be

    “relevant” to the user. 


    Gradated Relevancy versus Binary Relevancy

    A question at this point is whether there should be a gradation of relevancy. For example, should

    “add to cart” be ‘more relevant’ than “product details view”? Should “product purchase” be ‘more

    relevant’ than “add to cart”? Should documents which are viewed longer be ‘more relevant’ than

    documents viewed for a short amount of time?

    The answer to all of these questions is ‘no’.

    The problem with gradations of relevancy is scale. Is product purchase 2x as relevant as add to cart?

    Or only 1.5x? Attempting to answer these sorts of questions will skew the results in hard-to-predict ways.

    And so, Search Technologies recommends simply choosing a threshold: every document which meets it

    is relevant (value = 1.0), and every other document is, therefore, non-relevant (value = 0.0). This

    avoids the complexity and uncertainty of choosing relevancy levels, star systems, etc., which

    confuse and skew the statistics.


    Documents are relevant to users, and not to queries

    Note that, in the above discussion, documents are considered to be relevant to the user and not to

    the query. This gets to the core of the user-based engine score: we are creating a model for the

    user, made up of the queries entered by the user and the documents which the user has indicated are

    interesting (in some way).

    It does not matter, to the engine score model, which query originally retrieved the document. The

    document is relevant for all queries entered by the user.


    It is the assumption of this model that users will execute multiple queries to find the document or

    product which they ultimately want. They may start with “shirt”, then move to “long sleeved shirt”,

    then to “long sleeved dress shirt” and so on. Or their first query may be misspelled. Or it may be

    ambiguous. Another way of saying this is that a relevant document found by query #3 is also relevant

    if it was returned by query #1.

    In all of these examples, it is desirable for the search engine to “short circuit” the process and bring

    back relevant documents earlier in the query sequence. If this can be achieved, then the user gets to

    their document or product much faster and is therefore more satisfied.


    Score Computation

    Computing the user scores requires a statistical analysis of the search logs, click logs, and search

    engine response generated by the previous section.

    The algorithm is as follows:


    1.  Gather the list of all unique queries across all users: ALL_QUERIES

        a.  Send every query to the search engine and gather the results: RESULTS_SET[q]

    2.  Loop through all users, u = 0 to (number of users) - 1

        a.  Accumulate a (de-duplicated) list of all documents clicked by the user: RELEVANT_SET

        b.  Loop through all queries executed by the user, Q[i], i = 0 to (number of queries) - 1

            i.    Look up the query and results in the search engine response: RESULTS_SET[Q[i]]

            ii.   Set queryScore[i] = 0

            iii.  Loop through each document in RESULTS_SET[Q[i]], k = 0 to (number of results) - 1

                  (1)  If the document is in RELEVANT_SET:

                       queryScore[i] = queryScore[i] + power(FACTOR, k)

                       (see below for a discussion of FACTOR)

        c.  userScore[u] = average of all queryScore[i] values for the user

    3.  Compute the total engine score = average of all userScore[u] values for all users


      A random sampling of users can be used to generate the results, if computing the score

    across all users requires too much time or resource.


    All search team members should be removed from engine score computations, to ensure an

    unbiased score.
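
    The algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not
    Search Technologies' implementation: the function name, the dictionary-based inputs, and the toy
    data are assumptions made for the example, and the engine responses are taken as already gathered.

```python
from statistics import mean


def engine_score(user_queries, user_clicks, results_set, factor=0.8):
    """Return (engine_score, user_scores) for the user-based score model.

    user_queries: {user: [query, ...]}     queries executed by each user
    user_clicks:  {user: [doc_id, ...]}    documents the user acted on
    results_set:  {query: [doc_id, ...]}   ranked engine response per query
    """
    user_scores = {}
    for user, queries in user_queries.items():
        relevant_set = set(user_clicks.get(user, ()))  # de-duplicated clicks
        query_scores = []
        for query in queries:
            query_score = 0.0
            for k, doc in enumerate(results_set.get(query, [])):
                if doc in relevant_set:
                    query_score += factor ** k  # position 0 adds 1.0, then decays
            query_scores.append(query_score)
        if query_scores:  # users without queries are skipped (an assumption)
            user_scores[user] = mean(query_scores)
    total = mean(list(user_scores.values())) if user_scores else 0.0
    return total, user_scores


# Toy data invented for illustration only.
user_queries = {"u1": ["shirt", "dress shirt"], "u2": ["shirt"]}
user_clicks = {"u1": ["d3"], "u2": ["d1", "d1"]}
results_set = {"shirt": ["d1", "d2", "d3"], "dress shirt": ["d3", "d4"]}

total, per_user = engine_score(user_queries, user_clicks, results_set, factor=0.5)
print(total, per_user)
```

    With this toy data, user u1 scores 0.25 for "shirt" (the clicked document sits in position 2) and
    1.0 for "dress shirt", so the earlier query is judged against a document found by the later one.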



    The Scoring Factor

    The FACTOR is chosen based on the typical user behavior or how much sensitivity the score should

    have to matching documents lower in the search results. Factors must be between 0.0 and 1.0.

    A high factor (for example, 0.99) is good for identifying relatively small changes to the score based on

    small improvements to the search engine.

    A low factor (for example, 0.80) is good for producing a score which is a better measure of user

    performance, for systems that produce (say) 10 documents per page.


    Interpreting the Score

    The resulting score will be a number which has the following characteristics:

      Score = 1.0

    o  The first document of every query was relevant

      Score < 1.0

    o  Relevant documents were found in positions lower in the results list

      Score > 1.0

    o  Multiple relevant results were returned for the query

    Depending on the data, scores can be as low as 0.05 (this typically means that there are other

    external factors affecting the score such as filters not being applied, etc.). A score of 0.25 is generally

    thought to be “very good”. 

    Note that the score does have an upper limit, which can be computed based on the factor:












    [Figure: chart with series K = 0.5, K = 0.667, K = 0.75, and K = 0.9; horizontal axis from 1 to 49]


    Maximum Score (all results relevant) = 1 / (1 - FACTOR)

    And so if the factor is “0.8” the upper limit of the score (all results are relevant) is 5.0. This assumes

    analyzing a sufficiently large number of results. When only analyzing the top 10 results (for example),

    the maximum score for FACTOR=0.8 will be 4.463.
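
    Both limits follow from the geometric series 1 + FACTOR + FACTOR^2 + ... and can be checked with a
    short sketch. The function name max_score and its parameters are assumptions for this example.

```python
def max_score(factor, num_results=None):
    """Upper limit of a query score when every analyzed result is relevant."""
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must be strictly between 0.0 and 1.0")
    if num_results is None:
        return 1.0 / (1.0 - factor)  # limit of the infinite geometric series
    # Finite geometric series: sum of factor**k for k = 0 .. num_results - 1
    return (1.0 - factor ** num_results) / (1.0 - factor)


print(round(max_score(0.8), 3))      # unlimited number of results
print(round(max_score(0.8, 10), 3))  # top 10 results only
```

    Rounded to three decimals, these calls give 5.0 and 4.463, matching the figures quoted above.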

    5.5  User and Query Scoring

    The scoring algorithm from section 5.1 can be accumulated on a per-user or per-query basis. This

    provides a useful comparison across users and queries to compute the following metrics:


    Top 100 lowest scoring (i.e. least satisfied) users with more than 5 queries.


    Top 100 highest scoring (i.e. most satisfied) users with more than 5 queries.


    Top 100 lowest scoring queries executed by more than 5 users.


    Top 100 highest scoring queries executed by more than 5 users.


    Most underperforming of the top 10% most frequently executed queries.


    Engine score per unit, for all users from each unit.


    Engine score per location, for all users from each location.


    Engine score per job title, for all users with the same job title.

    This information is critical for determining what is working, what needs work and how the search

    engine is serving various sub-sets of the user community when evaluating search engine accuracy.
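
    As a rough sketch of the first four reports, the per-query scores can be grouped by user or by
    query and then ranked. The record layout (user, query, score) and the function name rank are
    assumptions for this example; the thresholds are counted over distinct queries per user and
    distinct users per query, which is one possible reading of "more than 5".

```python
from collections import defaultdict
from statistics import mean


def rank(records, by="user", min_count=5, top_n=100, lowest=True):
    """Rank users (or queries) by their average query score.

    records: iterable of (user, query, query_score) tuples, one per
             query as judged from that user's perspective.
    Only keys with more than min_count distinct counterparts are kept.
    """
    key_i, other_i = (0, 1) if by == "user" else (1, 0)
    scores = defaultdict(list)
    counterparts = defaultdict(set)
    for rec in records:
        scores[rec[key_i]].append(rec[2])
        counterparts[rec[key_i]].add(rec[other_i])
    rows = [(key, mean(vals)) for key, vals in scores.items()
            if len(counterparts[key]) > min_count]
    rows.sort(key=lambda row: row[1], reverse=not lowest)
    return rows[:top_n]


# Toy records invented for illustration only.
records = [
    ("u1", "a", 1.0), ("u1", "b", 0.0),
    ("u2", "a", 0.5), ("u2", "b", 1.0),
    ("u3", "a", 0.0),
]
least_satisfied_users = rank(records, by="user", min_count=1)
lowest_scoring_queries = rank(records, by="query", min_count=2)
```

    Scores per unit, location, or job title would follow the same pattern with the grouping key taken
    from the user information instead of the user identifier.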


    Engine Quality Predictions / Quality Regression Testing

    Finally, the algorithm above provides a way to perform off-line testing of new search improvements

    before those improvements are put into production.

    To make this happen, the RESULTS_SET (step 1.a of the algorithm from section 5.1) is recomputed

    using the new search engine. With a new RESULTS_SET, the new score can be recomputed for the

    new engine and compared to the baseline score. If the overall score improves, then search

    engine accuracy can be expected to improve when the new engine is moved to production.

    This process is also recommended for regression testing of all new search engines before they are

    fielded to production to ensure that bugs introduced into the system do not adversely affect search

    engine accuracy.


    Additional Metrics 

    The engine score is the most complex and the most helpful score to compute to

    determine engine accuracy. However, additional metrics provide useful insights

    to help determine where work needs to be applied.

    6.1  Query Metrics

      Most frequent queries  – Gives an idea of “hot topics” in the community and what’s most on

    the mind of the community.

      Most frequent typed queries – Information needs that are not satisfied by the queries on the

    auto-suggest menu

      Most frequent query terms  – Identifies individual words which may be worth improving with


      100 random queries, categorized   – When manually grouped into use cases, gives an excellent

    idea of the scope of the overall user search experience. 

      Most frequent spelling corrections – Verifying the spellchecker algorithm and identifying

    possible conflicts 

    6.2  Relevancy and Clicks

      Percent of queries which result in click on a search result   – Should increase as the search

    engine is improved. 


      Histogram of clicks on search results, by position  – Shows how often users click on the first,

    second, third document, etc. An abnormally large drop off between positions may indicate a

    user interface problem. 

      Histogram of number of clicks per query   – Ideally, the search engine should return many

    good results, so higher number of clicks per query is better. 

      Bounce Rate  – Rate at which users execute one query and then leave the search site. High

    bounce rates indicate very unsatisfied users. 
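
    The click metrics above could be derived from the merged search and click logs along these lines.
    The event layout (session, query, clicked positions) is an assumption, and the bounce rate is
    approximated here as the share of sessions containing exactly one query.

```python
from collections import Counter


def click_metrics(events):
    """events: list of (session_id, query, clicked_positions) tuples."""
    if not events:
        return {}
    with_click = sum(1 for _, _, clicks in events if clicks)
    position_hist = Counter(pos for _, _, clicks in events for pos in clicks)
    clicks_per_query = Counter(len(clicks) for _, _, clicks in events)
    queries_per_session = Counter(session for session, _, _ in events)
    bounces = sum(1 for n in queries_per_session.values() if n == 1)
    return {
        "pct_queries_with_click": 100.0 * with_click / len(events),
        "position_histogram": dict(position_hist),
        "clicks_per_query_histogram": dict(clicks_per_query),
        "bounce_rate": 100.0 * bounces / len(queries_per_session),
    }


# Toy events invented for illustration only.
events = [
    ("s1", "shirt", [1]),
    ("s1", "blue shirt", [1, 3]),
    ("s2", "sock", []),
    ("s3", "hat", [2]),
]
metrics = click_metrics(events)
```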

    6.3  Result Metrics


      Total number and percentage of queries with zero results  – Large numbers here generally

    indicate that more attention should be spent on spell checking, synonyms, or phonetic

    searching for people’s names.

      Most frequent queries with zero results  – Identifies those words which may require spell

    checking or synonyms


      Histogram of results per query  – Gives an idea of how generic or specific user queries are. If

    the median is large then providing alternative queries (i.e. query suggestions) may be
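
    The zero-result metrics in this section reduce to simple counting over the search log. The sketch
    below assumes a log of (query, result count) pairs, one per executed query; the function name is
    hypothetical.

```python
from collections import Counter


def zero_result_report(query_log, top_n=10):
    """query_log: list of (query, result_count) tuples, one per executed query.

    Returns (number of zero-result queries, their percentage,
    the most frequent zero-result queries with counts).
    """
    zero = Counter(query for query, count in query_log if count == 0)
    total_zero = sum(zero.values())
    pct_zero = 100.0 * total_zero / len(query_log) if query_log else 0.0
    return total_zero, pct_zero, zero.most_common(top_n)


# Toy log invented for illustration only.
log = [("shrit", 0), ("shirt", 12), ("shrit", 0), ("sox", 0)]
total_zero, pct_zero, top_zero = zero_result_report(log)
```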



    Document Metrics

      Top 1,000 least successful documents  – Documents most frequently returned by the search

    engine which are never clicked.

      Top 100 hidden documents  – Documents never returned by search.

    o  Should be further categorized by time (i.e. Top hidden documents introduced this

    week, this month, this quarter, this half-year, this year, etc.)

    o  Indicates documents which have language mismatch problems with the queries.

    Consider adding additional search terms to these documents.

      Documents with best and worst language coverage  – These are the documents with the most and

    the fewest words found in queries.

    o  This involves processing all documents against a dictionary made up of all query

    words. The goal is to identify documents which do not connect with users because

    there is little language in common.
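
    A minimal sketch of the first two document reports follows. It assumes the engine responses are
    held per unique query (so return counts are not weighted by how often each query was executed) and
    that a full list of indexed document identifiers is available; all names are hypothetical.

```python
from collections import Counter


def document_report(results_set, clicked_docs, all_docs, top_n=1000):
    """Return (least successful documents, hidden documents).

    results_set:  {query: [doc_id, ...]}  engine response per unique query
    clicked_docs: iterable of doc ids clicked by any user
    all_docs:     iterable of every doc id in the index
    """
    returned = Counter(doc for docs in results_set.values() for doc in docs)
    clicked = set(clicked_docs)
    # Most frequently returned documents which were never clicked.
    least_successful = [(doc, n) for doc, n in returned.most_common()
                        if doc not in clicked][:top_n]
    # Documents which no query ever returned.
    hidden = sorted(set(all_docs) - set(returned))
    return least_successful, hidden


# Toy data invented for illustration only.
results_set = {"shirt": ["d1", "d2"], "sock": ["d2", "d3"]}
least_successful, hidden = document_report(
    results_set, clicked_docs=["d1"], all_docs=["d1", "d2", "d3", "d4"])
```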


    Continuous Improvement 

    A key goal for accuracy measurements is to produce metrics which can participate in a continuous

    improvement process.

    7.1  The Continuous Improvement Cycle

    The continuous improvement cycle is shown below:

    The steps in this process are:


    1.  Make changes to the search engine to improve accuracy.

    2.  Gather search engine response for all unique queries specified in the log files (or log file

    sample set)

    3.  Compute user scores as described above in section 5.3.

    4.  Produce analysis reports including all metrics described above in section 6.

    5.  Evaluate results to determine if the search engine did get better. 

    If the search engine did not get better, then the most recent changes will be reverted.

    Otherwise, a manual investigation of the analysis reports and individual use cases (perhaps

    from manual scoring, see section 8 below) will determine what further changes are




    Continuous Improvement Requirements

    Implementing a continuous improvement process for on-going accuracy improvements has the

    following requirements:


      A stable snapshot of document data and log files

    o  See section 4 above.

      A QA system with all data indexed with the “search engine under test”

    o  The production system cannot typically be used because:


    1) It does not represent a stable snapshot, and


    2) Accuracy testing will require executing hundreds of thousands of queries –

    which is a load that is typically too large for production systems.

      Automated tools to produce engine scores and metrics as described above in 7.1. 

    Note that the scoring algorithms specified above in section 7.1 can be run off-line on the QA server.

    This is a requirement for a continuous improvement cycle, since only off-line analysis will be

    sufficiently agile to allow for running the dozens of iterative tests necessary to optimize relevancy.


    Recording Performance from Run to Run

    It is expected that a continuous improvement cycle will ultimately be implemented. This cycle should

    be executed multiple times as the system is tuned up and improved. As this happens, it is important

    to record the improvement from release to release.

    The following is an example of such a recording from a prior customer engagement:

    Revision    Whole score    % queries with match    % queries with zero results

    rev3        0.226567992    28.84%                  33.62%

    rev83       0.237346148    29.97%                  31.71%

    rev103      0.241202611    30.95%                  28.01%

    latest      0.251230958    32.14%                  25.83%

    As you can see, the “whole score” steadily improved while the percentage of queries with at least

    one relevant document increased and the percentage of queries which returned zero results steadily

    decreased.



    It is this sort of analysis which engenders confidence that the process is working and is steadily

    improving the accuracy of the search engine, step by step.

    The customer for whom we ran this analysis next performed an A/B test on production with

    live users to verify that the search engine performance would, in fact, lead to improved user


    This was an e-commerce site, and the result of the A/B test was a 3% improvement in

    conversion rate, which equated to a (roughly) $4 million improvement in total sales that year.


    Manual Relevancy Judgments 

    In addition to a fully automated statistical analysis, Search Technologies

    recommends implementing a manual judgment process.

    8.1  The Relevancy Judgment User Interface

    To improve productivity of the judgers, Search Technologies recommends a user interface to

    manually judge relevant documents for queries. This is a simple user interface backed by simple files

    of relevancy judgments.

    To prepare for the judgment process:

      Select 200 random queries from the cleansed query logs

      Queries may need to be annotated to clarify the intent of the query for whoever is

    performing the relevancy judgments.

      Execute each query on the search engine and save the results

    The relevancy judgments user interface will do the following:

      Show the search results

    o  Should use the same presentation as the production user interface

      Provide buttons to judge “relevant”, “non-relevant”, “relevant but old”, and “unknown” for

    every query

    The same 200 queries are used from run to run, and so the database of what documents are relevant

    for what queries can be maintained and grown over time.

    To increase judger consistency, Search Technologies recommends creating a “relevancy judger’s

    handbook” – a document which identifies how to determine if a document is relevant or not to a

    particular query. The handbook will help judgers distinguish documents which are principally

    about the subject from documents which contain only an aside-reference to the subject, and make

    similar judgment decisions. Search Technologies has written such handbooks in the past and can

    help with writing the handbook appropriate to your data and user population.

    8.2  Advantages

    Manual relevancy judgments have a number of advantages over strictly log-based metrics:

      It forces the QA team to analyze queries one-by-one

    o  This helps the team recognize patterns across queries in terms of why good

    documents are missed and bad documents are retrieved.


      It provides a very clean view of relevancy

    o  The manual judgments are uncluttered by external factors which affect log data

    (machine crashes, network failure, received a phone call while searching, user

    entered the wrong site, etc.)


      Over time, it helps identify recall as well as precision

    The last point is perhaps the most important. Log-based relevancy can only identify as relevant

    documents which are shown to users by the search engine. Naturally this is the most important

    aspect of relevancy (“Are the documents that I see relevant?” ).

    But it does ignore the second aspect of relevancy, “Did the search engine retrieve all possible relevant

    documents for me?”  This second factor can only be approached with a relevancy database which is

    expanded over time.


    Statistics from Manual Judgments

    The following statistics can be computed from manual relevancy judgments:

      Percent relevant documents in the top 10  – This is perhaps the most useful score, since it

    provides an easy-to-understand number on what to expect in the search results.

      Percent queries with at least one relevant document retrieved in the top 10

      Percent total relevant retrieved in the top 100  – This is “recall at 100”, which identifies how

    well this configuration of the search can respond to deeper research requirements

      Percent queries with no relevant documents retrieved
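
    These four statistics can be computed from the saved results and the judgment database roughly as
    follows. The dictionary layouts and the function name are assumptions for this sketch, and the
    "total relevant" used for recall is limited to the documents judged relevant so far.

```python
def judgment_stats(results, judgments):
    """results:   {query: ranked list of doc ids returned by the engine}
    judgments: {query: set of doc ids judged relevant for that query}
    """
    shown = hits10 = with_hit10 = found100 = known = none_found = 0
    for query, ranked in results.items():
        relevant = judgments.get(query, set())
        top10 = ranked[:10]
        hits = sum(1 for doc in top10 if doc in relevant)
        shown += len(top10)
        hits10 += hits
        with_hit10 += 1 if hits else 0
        found100 += sum(1 for doc in ranked[:100] if doc in relevant)
        known += len(relevant)
        none_found += 0 if any(doc in relevant for doc in ranked) else 1

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "pct_relevant_in_top10": pct(hits10, shown),
        "pct_queries_with_relevant_in_top10": pct(with_hit10, len(results)),
        "recall_at_100": pct(found100, known),
        "pct_queries_with_no_relevant": pct(none_found, len(results)),
    }


# Toy data invented for illustration only.
results = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
judgments = {"q1": {"a", "c", "y", "z"}, "q2": set()}
stats = judgment_stats(results, judgments)
```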


    Using Logs to Analyze User Intent 

    A second use of search and click logs is to analyze user intent by looking at how

    users interact with the internal search user interface.

    9.1  Query Analysis

    The first step for determining user intent is to look at searches executed by users.


    Top Queries

    A simple but useful analysis is to look at the top 200 most frequent queries executed by users.

    The top 10-20% of queries, especially, are indicative of what’s “top of mind” for the user community. 

    All queries should be categorized by intent by someone who is well versed in the activities of the

    content domain.

      Is the user looking for a particular document or web site which they already know exists?

      Is the user looking to answer a question?

      Is the user looking for a set of documents for research?

      Is the user looking for a set of documents for self-education?

    Naturally, it may be difficult to determine intent simply from the query. See section 9.1.4 for how this

    process can be refined and extended.


    Randomly Selected Queries

    The top most frequent queries do not, in fact, give a good idea of “what the user population is

    searching for”. This is because the top queries are often the easiest, most obvious, and least variable

    queries to enter. These are often most frequent simply because they are simple and consistently


    In particular, personal names and document titles vary widely. Rarely will a single individual or

    document title be in the top 200 most frequent queries, and yet these are often the most common


    Therefore, a randomly selected set of 200 queries should be analyzed for intent and categorized.

    9.1.3  Overall Internal Search Usage

    A histogram of the number of queries executed by the same user should be produced. This includes:

      Histogram (mean, median, min, max) of the number of queries executed in a session


      Histogram (mean, median, min, max) of the number of queries executed over 3 months

      Total queries executed per month, trending

      Total unique users, per month, trending

    This will provide a good look at how often the user population turns to search as part of their daily

    work, and how often they find it to be a valuable tool, overall.
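
    The per-session histogram and its summary statistics could be produced along these lines. The log
    layout (user, session, query) and the function name are assumptions for this sketch; the 3-month
    view would group by user alone over the chosen date range.

```python
from collections import Counter
from statistics import mean, median


def usage_summary(query_log):
    """query_log: list of (user_id, session_id, query) tuples."""
    per_session = Counter((user, session) for user, session, _ in query_log)
    counts = list(per_session.values())
    return {
        # number of queries in a session -> number of such sessions
        "histogram": dict(Counter(counts)),
        "mean": mean(counts),
        "median": median(counts),
        "min": min(counts),
        "max": max(counts),
    }


# Toy log invented for illustration only.
log = [
    ("u1", "s1", "shirt"), ("u1", "s1", "blue shirt"),
    ("u1", "s1", "blue dress shirt"), ("u1", "s2", "socks"),
    ("u2", "s1", "hat"), ("u2", "s1", "red hat"),
]
summary = usage_summary(log)
```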


    Randomly Selected Users

    Finally, a set of 25-50 randomly selected users should be analyzed for use cases.

    This should be done in three groups:


    Users who execute a single query in a session.

    The goal here is to try and determine why these users only executed a single query. Did they

    find what they were looking for? Were the results too poor to think about continuing?


    Users who execute three or more queries in a session.

    These are, presumably, users with a more substantial need who are willing to keep searching

    to answer their question.

    How did these users reformulate or re-cast their query? Could these reformulations have been

    performed automatically by the search engine? What were these users looking for?


    Users who execute ten or more queries over a one-month period.

    The goal here is to understand those ‘power’ users who frequently return to internal search.

    Do they execute the same query over and over? What are their primary use cases?


    Help Desk Analysis

    If your organization has a “search help line”, this is a valuable source of information for determining

    user intent and usage models.

    To the extent possible, search help desk activity should be captured and analyzed. This can include

    chat logs, e-mails, voice transcriptions, etc. as available.

    9.2  Tooling for Query Analysis

    Query analysis, of necessity, requires 1) a keen understanding of the user and their goals and needs,

    2) a deep understanding of the content and what it can provide to the user, and 3) an understanding

    of how the user’s intent is expressed through queries.

    While this is primarily a manual process, additional tooling can help to categorize and cluster queries.

    This includes:


      Token statistics

    o  Identifying the frequency of all query tokens

      Across all queries

      Across all documents

      Dictionary lookups

    o  Including in domain-specific dictionaries (ontologies, synonyms lists, etc.)

    o  General, natural language dictionaries

    o  Stop-word lists

    o  Other, use-case specific dictionaries, such as dictionaries for:

      First and last names from the census

      Ticker symbols


    Wikipedia entries (with types)


    Regular expression matching

      BNF pattern matching

    Patterns of tokens by token category types

      Cross-query comparisons

    o  For example, identifying all word pairs which exist as single tokens in other queries

    (and vice-versa)

      User query-set analysis

    o  Comparing queries within a single user’s activity set 

    o  This is used for analyzing query chains, to determine if one type of query often leads

    to another type of query.

      Statistical analysis

    o  Extracting statistics from the queries is essential for using these statistics to

    determine query types and the impact that can be achieved by working on a class of


    Generally, these sorts of statistics are computed using a series of ad-hoc tools and programs

    including UNIX utilities and custom software programs. If the query database is very large, then Big

    Data and/or Map Reduce algorithms may be required.
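
    As an example of such an ad-hoc tool, the sketch below computes token frequencies and one
    cross-query comparison: word pairs which also occur as a single token in another query. The
    simple lowercase alphanumeric tokenizer and the function names are assumptions for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase a query or document and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def token_frequencies(texts):
    """Frequency of every token across a collection of queries or documents."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def joined_pairs(queries):
    """Adjacent word pairs which exist as one token in some other query."""
    tokenized = [tokenize(query) for query in queries]
    vocabulary = {token for tokens in tokenized for token in tokens}
    pairs = set()
    for tokens in tokenized:
        for first, second in zip(tokens, tokens[1:]):
            if first + second in vocabulary:
                pairs.add((first, second))
    return pairs


# Toy queries invented for illustration only.
queries = ["note book case", "notebook", "book"]
frequencies = token_frequencies(queries)
pairs = joined_pairs(queries)
```

    Here the pair ("note", "book") is flagged because "notebook" appears as a query token elsewhere,
    which suggests a compound-word rule or synonym worth reviewing.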



    User Interface Click Analysis

    Analyzing clicks within the user interface is another key component to understanding how employees

    are interacting with search.


    Feature Usage

    The first and most important analysis is to determine what user interface features are being used and

    how frequently. This includes:



      Side-bar results

      “Recommended” results (e.g. best bets or elevated results) 




      Advanced search

    A thorough understanding of the usage of these features will help determine what is working and

    what is not, and what should be kept and what should be abandoned.



    Second, click sequences can help determine if users are leveraging user interface features to their

    best advantage.

    The goal here is to determine if a user interface feature leads the user to information which helps

    solve their problem.

      How often does a facet click lead to a document click?

      How often does a side-bar click lead to a new search?

    o  And how often does this lead to a document click?

      How often does a tab click lead to a document click?

      How often does a sort click lead to a document click?

    In this way, we can determine if user interface features are leading to actual improvements in the

    end-user’s ability to find information. 
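
    One way to answer these questions is to walk each session's click sequence and count how often a
    feature click is followed by a document click. The event names and the rule that a new search ends
    the follow-through window are assumptions made for this sketch.

```python
from collections import Counter

FEATURES = ("facet", "sidebar", "tab", "sort")


def feature_follow_through(sessions, features=FEATURES):
    """sessions: {session_id: [event_type, ...]} in time order.

    Returns {feature: fraction of its clicks followed by a 'document'
    click before the next 'search' event}.
    """
    used = Counter()
    led_to_document = Counter()
    for events in sessions.values():
        for i, event in enumerate(events):
            if event not in features:
                continue
            used[event] += 1
            for later in events[i + 1:]:
                if later == "document":
                    led_to_document[event] += 1
                    break
                if later == "search":  # a new search ends the window
                    break
    return {feature: led_to_document[feature] / used[feature] for feature in used}


# Toy sessions invented for illustration only.
sessions = {
    "s1": ["search", "facet", "document"],
    "s2": ["search", "facet", "search", "document"],
    "s3": ["search", "sort", "tab", "document"],
}
rates = feature_follow_through(sessions)
```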



    Long Tail Analysis

    A “long tail analysis” involves the manual analysis of a very large number of queries (5,000 to 10,000

    queries). The goal of this analysis is to categorize long-tail queries so they can be properly binned and

    dealt with using a variety of techniques.

    Long tail analysis shifts the process from analyzing “top queries” (of any sort), to a large-scale manual

    analysis of a very large volume of queries.


    Query Database

    The essence of long-tail analysis is a large and evolving query database. This database includes:

      Status flags (new, analyzed, unknown, deferred, tentatively analyzed, problem, solved, good

    results, bad results, etc.)

      The query the user entered

      The date and time the query was entered into the database

      The assigned categories and sub-categories for the query

      Other description text / explanation

      User who analyzed the query
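
    A record in such a database might look like the following sketch. The class and field names are
    assumptions chosen to mirror the list above; a production system would more likely use a database
    table with the same columns.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class QueryRecord:
    """One entry in the long-tail query database (hypothetical layout)."""
    query: str                       # the query the user entered
    entered_at: datetime             # when the query entered the database
    status: str = "new"              # e.g. new, analyzed, deferred, solved
    categories: List[str] = field(default_factory=list)  # category, sub-category
    notes: str = ""                  # other description text / explanation
    analyzed_by: str = ""            # user who analyzed the query


record = QueryRecord("i need account information", datetime(2014, 11, 1))
record.status = "analyzed"
record.categories.extend(["account", "account information"])
```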

    9.4.2  Supporting Databases

    These are typically created on an as-needed basis, but may include:


      Query logs indexed by query

    o  So that sets of similar queries can be searched and grouped and processed together

    o  For example, all queries that start with “I need” or “account information”, etc. 

    o  This also allows retrieval of other query metadata:

      List of users who have executed the query


    List of times (for trend lines) on when the query was executed


    (if possible) geographic locations for the query

      User information indexed by user

    o  So that information on an individual user (i.e. user events) can be quickly brought up

    and analyzed when analyzing use cases.

      Click logs indexed by user and time

    o  So that a user’s traffic can be analyzed to help determine intent



    Tooling Support

    To handle manual reviews of such a large set of databases, a search engine over the log data is

    recommended. Further tooling will:

      Link all queries to the query database

    o  To determine the status and disposition of the query

      Allow for import and export to the query database

    o  For off-line analysis of sets of queries

    9.4.4  Long Tail Analysis –  Outputs

    The outputs of the long-tail analysis include:

      Categories & Sub-Categories

    These represent use cases and query patterns to be handled as a set

    o  Can include categories / use cases for documents as well

      I.e. major document types and use cases

      General language understanding

    Query patterns

    Nouns / verbs

    Sophistication of language / education level of user

    o  Jargon / lay-language


      Common question / answer patterns

    o  For specific information or open-ended research


    o  (antivirus → security software, mic → microphone)

      Identify recommended re-direction patterns

    o  Account information → Account search / custom responses

    o  Support & help queries → Support search

    o  Generic company information → Website .com search

    o  People queries → People database

      Identify best bet (recommended result) patterns

      Identify user interface search presentation issues

    o  Incorrect fields being displayed in results, hiding important information, etc.

      Identify seasonal patterns (where appropriate)


    And along with each use case, an idea of the scope of the use case and what sorts of fixes are

    required to improve user satisfaction.


    More Information

    For high-value content sources, long-tail analysis will provide the most thorough and complete

    picture of how users are using the system.

    Search Technologies has implemented long-tail analysis at one high-value e-Commerce site, where

    we have refined the process.



    A/B Testing 

    Where possible, A/B testing is recommended to validate the engine score and other

    improvements and to determine the exact relationship between engine score

    improvements and other web site metrics (i.e. abandonment rates and conversion rates).

    There can be no “step by step” process for doing A/B testing, since it will involve your production

    system and production data. Therefore, it must be handled carefully and with production


    The following are some broad guidelines for A/B testing.

      Both systems, A & B, need to be up simultaneously

    o  The only way for A/B testing to be accurate is if incoming requests are randomly

    assigned to A or B for a period of time.

    This will ensure that the users for A or B are drawn from the exact same user

    population.

      A/B does not need to be 50/50

    o  Typically, A/B testing is 95/5.


    95% on the current production system


    5% on the new (under test) system.

    This way, if the “B” system is severely deficient, then only a small percentage of

    production users will be affected.

      Routing between A & B can occur at any of a variety of different levels:

    o  User interface


    Two user interfaces (A / B) to serve different user populations

    Results mixing / query processing


    Two different results mixing models or query processing models which query

    over the same underlying indexes

    o  Multiple indexes

      Different indexes indexed in different ways

    Ideally, of course, a system will be designed and implemented with A/B testing as a primary goal

    from the very beginning. If it is, then “turning on” a B system for testing becomes a standard part of

    system administration and testing procedures.
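
    The 95/5 split described above can be sketched as a small routing function. Note that this
    example swaps per-request random assignment for hash-based assignment by user, so that the same
    user always sees the same system; that choice, the salt value, and the function name are
    assumptions of the sketch rather than part of the guidelines.

```python
import hashlib


def assign_variant(user_id, b_percent=5, salt="search-ab-test"):
    """Return "A" or "B" for a user, sending about b_percent% of users to B.

    The salted hash spreads users evenly over 100 buckets and keeps each
    user's assignment stable for the duration of the test.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "B" if bucket < b_percent else "A"


# Hypothetical user identifiers, used only to check the split.
users = [f"user{i}" for i in range(10000)]
share_b = sum(1 for user in users if assign_variant(user) == "B") / len(users)
```

    Changing the salt between tests reshuffles the buckets, so the same users are not always the ones
    exposed to the system under test.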




    This paper is the result of some 22 years of search engine accuracy testing, evaluation, and practical

    experience. The journey started in 1992 with the first TReC Conference, which several Search

    Technologies employees attended. Many of the philosophies and strategies described in this paper

    trace their roots back to TReC.

    The journey continued throughout the 1990’s, as we experimented with varying relevancy ranking

    technologies, and were the first (to my knowledge) to use machine learning to optimize relevancy

    ranking formulae based on manual judgments.

    As we enter the age of the Cloud and Big Data, we now have vastly more data available and the

    machine resources required to process it. This has opened up new and exciting possibilities for

    improving search – not just for a select few – but for everyone.

    At Search Technologies we continue to work every day to make this vision, a vision of high-quality,

    optimized, tuned, targeted, engaging, and powerful search, a reality for everyone. This paper is just

    another step in the journey to that ultimate goal.

    For further information or an informal discussion CONTACT US.