clustering search log data
DESCRIPTION
Presented at the Harvard ABCD-WWW/CMS session, Nov. 15, 2012. A previous version of this talk was presented at Enterprise Search Europe, May 2012.
TRANSCRIPT
Copyright © President & Fellows of Harvard College.
Sophy Bishop & Ravi Mynampaty
Clustering Search Query Log Data to Improve Search
Agenda
Background
Five W’s of Clustering
• What, why, who, how, when
Is it really repeatable?
Questions
About Information Management Services (IMS)
- Standards - Best Practices - User Needs - Service Models
Analytics
Metadata Mgmt.
Taxonomy Dev.
Search
Lifecycle Mgmt.
Inspired by…
Chapters 8 & 9
About this talk…
Case study on how we are improving search and
browse by performing clustering exercises on our
search query data
Not rocket science
High-level overview
You can follow this method, with your own insights and
tweaks
You can kick this off next week at your work
What is clustering?
A process for organizing and analyzing search log
data that:
Is repeatable, low-cost, scalable, simple
Yields actionable results
Supports constant incremental improvement
to search
What’s clustering good for?
Ensure results for high-frequency queries
Improve metadata and taxonomy
Inform and validate decision making in site IA
Inform editorial/curatorial activities
Provide feedback for search suggestions
o Autosuggest, synonym lists, no-hits page
suggestions
But more on this later...
So how do I cluster search queries?
A simple set of steps
Create query report
Cluster queries
Determine # queries to analyze
Analyze clusters
Draw conclusions
and ACT
Step 1: Create a query report
We started with the site with the most traffic
• Upper-bound limit
• One year’s data by quarter
• Cut off tail at frequency < 10
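A minimal sketch of Step 1, assuming the search log has already been reduced to (quarter, query) pairs. The function name build_query_report, the row layout, and the sample rows are hypothetical; only the quarterly grouping and the frequency < 10 cutoff come from the slide.

```python
from collections import Counter

def build_query_report(log_rows, min_frequency=10):
    """Count queries per quarter and cut off the tail below min_frequency."""
    counts = {}
    for quarter, query in log_rows:
        counts.setdefault(quarter, Counter())[query] += 1
    report = {}
    for quarter, counter in counts.items():
        # most_common() returns (query, frequency) pairs, highest frequency first
        report[quarter] = [(q, n) for q, n in counter.most_common()
                           if n >= min_frequency]
    return report

# Hypothetical sample rows, not real log data
rows = ([("Q1", "innovation")] * 12
        + [("Q1", "rare query")] * 3
        + [("Q2", "leadership")] * 10)
report = build_query_report(rows)
```

Here "rare query" falls below the cutoff and is dropped from the Q1 report, while the other two queries are kept.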
HBS Working Knowledge FY12 Use Snapshot
Overall Traffic
Page Views: 6,439,485
Visits: 3,635,746
Unique visitors: 2,734,620
On-site searches: 174,425
Views per Visit: 1.77
Local Search visit rate: 5%
Organic Search visit rate: 46%
Step 2: Cluster the queries
Step 2 (cont’d): Three levels of clustering
Level | Method | Example
Narrow | Simple normalization | Eliminate grammatical, spelling, typo, and punctuation differences
Mid-level | Group by subject | management, finance, decision making
Broad | Group by facet | topic, name, date, content type
Step 2 (cont’d): Levels Tasks Enabled
Tasks (columns): Improve your base for query analysis | Ensure representation of major clusters on your site | Improve Metadata/Index/Taxonomy | Improve Search Suggestions
Narrow (simple): X X X
Mid-level (group by subject): X X X
Broad (group by facet): X X
Step 2 (cont’d): Narrow Clustering Example
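One possible sketch of narrow clustering, assuming "simple normalization" means lowercasing, stripping punctuation, and collapsing whitespace. The function names and the sample counts are hypothetical. Spelling and grammatical variants (for example "score card" vs. "scorecard") are not handled here and would still need human review or a synonym list.

```python
import re
from collections import Counter

def normalize(query):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = query.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def narrow_cluster(query_counts):
    """Merge the frequencies of queries that normalize to the same string."""
    merged = Counter()
    for query, count in query_counts.items():
        merged[normalize(query)] += count
    return merged

# Hypothetical counts for illustration
sample = {
    "Balanced Scorecard": 5,
    "balanced  scorecard": 3,
    "balanced scorecard?": 2,
    "Ethics": 4,
}
clusters = narrow_cluster(sample)
```

The three "balanced scorecard" variants collapse into one entry whose frequency is the sum of the three.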
Step 2 (cont’d): Mid-level Example
Cluster: brand
branding 245
brand 160
brand management 73
consumer branding 57
global brand 32
service brands 24
brand image retail bank 17
employer branding 16
brand management professional services 16
global branding 13
b2b branding 13
importance of branding 12
brand 2002 12
brand equity 11
brand image 11
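A rough sketch of mid-level clustering, assuming a subject can be approximated by a word stem that appears in the query. This substring match is a stand-in for the human judgment used in the talk, not the method itself; SUBJECT_STEMS and the function name are hypothetical. The first five counts are taken from the brand cluster above, and the "innovation" count is made up.

```python
from collections import defaultdict

# Hypothetical mapping of subject -> stem to look for in each query
SUBJECT_STEMS = {"brand": "brand"}

def mid_level_cluster(query_counts, subject_stems):
    """Assign each query to the first subject whose stem it contains."""
    clusters = defaultdict(dict)
    for query, count in query_counts.items():
        for subject, stem in subject_stems.items():
            if stem in query.lower():
                clusters[subject][query] = count
                break
        else:
            # No stem matched: leave the query for manual review
            clusters["unclustered"][query] = count
    return dict(clusters)

query_counts = {
    "branding": 245,
    "brand": 160,
    "brand management": 73,
    "consumer branding": 57,
    "global brand": 32,
    "innovation": 99,  # made-up count for illustration
}
clusters = mid_level_cluster(query_counts, SUBJECT_STEMS)
brand_total = sum(clusters["brand"].values())
```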
[Figure: bar chart of query frequencies, labeled "customer"; values range from 333 down to 10]
Step 2 (cont’d): Broad Clustering Example
Step 2 (cont’d): List of facets we used
Facet | Example
content type | case studies, cases, working papers, articles, newspaper
date | 2011, world in 2030
demographic characteristics | women, Gen Y, gender, baby boomers
event | economic crisis
format | podcast, video
geographic area | india, japan, mount everest
industry | global wine industry
job type/role | independent director, entrepreneur, ceo, phd economist
organization name | ikea, zara, toyota
person name | michael porter, kanter, sebenius
product name / brand name | ipad
product/commodity | coffee, wine, cement
topic | this covers the majority of keywords
work | faculty work, ex: publication name, title of a case
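A small sketch of broad clustering, assuming facets can be assigned from hand-built term lists plus a year pattern. The FACET_TERMS table, the YEAR pattern, and assign_facet are hypothetical helpers; the example terms are taken from the facet list above. Falling back to "topic" mirrors the note that topic covers the majority of keywords.

```python
import re

# Hypothetical lookup lists seeded with examples from the facet list
FACET_TERMS = {
    "organization name": {"ikea", "zara", "toyota"},
    "person name": {"michael porter", "kanter", "sebenius"},
    "format": {"podcast", "video"},
}
# Assumption: a four-digit year starting with 19 or 20 signals the date facet
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def assign_facet(query):
    """Return a facet label for a query, defaulting to 'topic'."""
    q = query.lower().strip()
    for facet, terms in FACET_TERMS.items():
        if q in terms:
            return facet
    if YEAR.search(q):
        return "date"
    return "topic"

facets = {q: assign_facet(q) for q in ["IKEA", "world in 2030", "podcast", "negotiation"]}
```

In practice the term lists would be built up by hand while reviewing the query report.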
Step 3: Choose #clusters to analyze
Number of Clusters Analyzed | Analyze Top Hits | Improve Metadata/Taxonomy/Index | Supply Search Suggestions
50 | X | |
150 | X | X |
300+ | X | X | X
Small # Clusters can cover a lot of your data
Number of top clusters % Total Queries
Top 20 clusters 14
Top 30 clusters 18
Top 50 clusters 26
Top 100 clusters 37
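The percentages in a table like this can be computed as follows. This is a sketch under assumptions: cluster_totals maps each cluster to its summed query frequency, total_queries counts every query in the report (clustered or not), and the sample numbers are made up.

```python
def coverage_of_top_clusters(cluster_totals, top_n, total_queries):
    """Percent of all queries accounted for by the top_n largest clusters."""
    ranked = sorted(cluster_totals.values(), reverse=True)
    return 100.0 * sum(ranked[:top_n]) / total_queries

# Made-up totals for illustration
cluster_totals = {"a": 50, "b": 30, "c": 20}
top_two = coverage_of_top_clusters(cluster_totals, top_n=2, total_queries=200)
```

With these sample numbers the top two clusters hold 80 of 200 queries, or 40 percent.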
Now you have your clusters…
What do you do with them?
TAKE ACTION!
Analyze Top (“Short Head”) Clusters
Clustering has created a condensed and reliable
list of your top search queries
Are they what you thought they would be?
Does the information on your site accurately
represent the top searches?
Are you fulfilling user needs?
Use your clusters: Improve Site Navigation
Examine the short-head of clusters, basically:
For each cluster, add up the frequencies
of queries
Reorder clusters by cumulative frequency
descending
Ensure top clusters are accounted for in your
navigation
Use cluster topics as browse/navigation
headers/footers for your website
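The first two steps above (add up the frequencies, reorder descending) can be sketched as below. The function name and the nested-dict layout are assumptions; the two "brand" counts come from the mid-level example, and the second cluster is entirely made up.

```python
def rank_clusters(clusters):
    """Sum query frequencies per cluster and sort clusters by total, descending."""
    totals = {name: sum(queries.values()) for name, queries in clusters.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

clusters = {
    "brand": {"branding": 245, "brand": 160},
    "hypothetical topic": {"q1": 500, "q2": 10},  # made-up cluster
}
ranked = rank_clusters(clusters)
```

The head of the ranked list is the short head to check against the site navigation.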
WK Top Clusters
Cluster Frequency
innovation 867
balanced scorecard 794
leadership 570
cases 545
social media 508
negotiation 470
knowledge management 457
ethics 448
apple 430
corporate social responsibility 398
Use your clusters: Improve Taxonomy
• Missing categories in browse taxonomy
• "Balanced Scorecard"
• “Ethics”
• “Social media”
• Second-level topics in the WK context
Mid-level clustering:
Informs editorial /curatorial activities
“Featured Topics”
o What topics to highlight this week/month/year
o News items to focus on
o What research guides to create
o How to formulate queries for the topics
Use your clusters: Improve Synonym Handling
Clustered list provides synonyms for taxonomy
Requires human judgment and
standards/guidelines for synonyms – in our
case, synonyms are exact
Map to one "like term" in the search engine
Example:
Balanced Scorecard, BSC, Balanced score card
kaplan and norton -> Balanced Scorecard
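A minimal sketch of mapping variants to one like term, assuming exact-match synonyms as described above. The SYNONYMS table and to_like_term are hypothetical and stand in for whatever synonym facility the search engine provides; the entries are the ones from the example.

```python
# Exact-match synonym table; keys are lowercase variants, values the like term
SYNONYMS = {
    "bsc": "balanced scorecard",
    "balanced score card": "balanced scorecard",
    "kaplan and norton": "balanced scorecard",
}

def to_like_term(query):
    """Map a query to its like term, or return it lowercased if none is defined."""
    key = " ".join(query.lower().split())
    return SYNONYMS.get(key, key)

mapped = [to_like_term(q) for q in ["BSC", "Kaplan and  Norton", "Ethics"]]
```

Because matching is exact, a query such as "bsc examples" would not be mapped; that is consistent with the exact-synonym guideline but is a choice to revisit per site.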
Use your clusters: Improve no-hits page
Time Commitment
• 2 hours to 2 weeks
• Variables include:
• What kind of information you want to gather
• How broad or narrow you want your clusters
• How many queries you analyze
• In our case ~2 person-weeks
• We had Sophy Bishop
• Intern, MSLIS student
Results vs. Time Invested
Time Invested | Analyze top clusters | Update Taxonomy | Create New Metadata | Determine New Search Suggestions
2 Hours | X | X | |
6 Hours | X | X | X |
One Week | X | X | X | X
Next Steps: Autosuggest
Your top clusters probably make up a large
percentage of what people are looking for
o Use them to establish/supplement
auto-suggest!
Example: suggestions for “innovation”
o innovation and leadership
o disruptive innovation
o innovation management
o open innovation
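One way to derive suggestions like these from a cluster, assuming you simply offer the most frequent queries that contain the typed term. The suggest function is hypothetical, and all counts below are made up for illustration; only the query strings come from the example.

```python
def suggest(typed, query_counts, limit=4):
    """Return the most frequent queries that contain, but are not equal to, the typed term."""
    needle = typed.lower().strip()
    matches = [(q, n) for q, n in query_counts.items()
               if needle in q.lower() and q.lower() != needle]
    matches.sort(key=lambda item: item[1], reverse=True)
    return [q for q, _ in matches[:limit]]

# Made-up frequencies for illustration
query_counts = {
    "innovation": 400,
    "innovation and leadership": 120,
    "disruptive innovation": 90,
    "innovation management": 60,
    "open innovation": 40,
    "ethics": 300,
}
suggestions = suggest("innovation", query_counts)
```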
Next Steps: New Access Structures
Needed an obvious way to search podcasts
o Put in best bets for now
A lot of people searching for article titles
o Considering simple interface/approach for select
field-specific search, e.g. “title”
Consider adding other facets to browse
taxonomy where we have entities tagged
o “company name”, “job type/class”, etc.
Next Steps
SEO Input
o Advise authors to use top cluster terms in Titles,
Abstracts, Keywords
o Report on clusters in our monthly analytics reports
to faculty (“Top search topics/subjects in May 2012
were…” ; “Searchers found your works with
following queries”)
Repeat process on other sites/content
Summary
Establish a plan/process, but be willing to tweak
as you go
Keep it very simple.
Play with your data – the more we played, the better
we understood what benefits could be realized by
levels of clustering and effort
Tuning process/results
o Build staging/working prototypes
o Repeat process on other sites
TAKE ACTION!