Copyright 2010 Sematext Int'l. All rights reserved.
ProjectHub
Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends
Otis Gospodnetić ◦◦ [email protected] ◦◦ @otisg
Sematext Int'l ◦◦ www.sematext.com ◦◦ @sematext
What I Will Cover
• Who I am
• What, Why, Where
• Architecture
• Info Gathering & Indexing
• Search & Extra Search Dog Food
• Performance & Analytics
• Ops & Stats
About Otis Gospodnetić
• Lucene/Solr/Nutch/Mahout/... committer
• Lucene in Action 1 & 2 co-author
• Lucene Consulting since 2005
• Sematext International since 2007
About Sematext
Search (Lucene, Solr, Elastic Search...)
Web Crawling (Nutch)
Machine Learning (Mahout)
Big Data (Hadoop, HBase, Voldemort...)
What
• Search everything about a Software Project
• Lucene & Hadoop
  – All sub-projects
  – All content
    • Mailing list archives
    • JIRA issues
    • Web site & Wiki pages
    • Source code (local syntax highlighting), trunk
    • Javadoc, trunk
Why
• We need it
• Other Hadoop, Lucene, Solr... users need it
• Our own playground
• Live product demos
• Yummy dog food
Where
• search-lucene.com
• search-hadoop.com
• Other suggestions / needs?
• In your Enterprise?
Architecture
Tool Matrix
Data Source          Fetch                      Parse
JIRA                 URLConnection (feed)       Digester (feed), DOM (item)
Mailing lists (ML)   FileInputStream (fs),      Digester (feed), MIME4J (mbox)
                     URLConnection (feed),
                     Droids (works, unused)
Web site             Droids                     Tika via Droids
Wiki                 Droids                     Tika via Droids
Source code          svn co                     QDox
Javadoc              svn co                     QDox
Information Gathering
• Multiple independent JVM processes (cron)
• Different polling frequencies
• Different data sources / formats:
  – RSS (JIRA, Mailing Lists)
  – Mbox (Mailing Lists)
  – HTTP/HTML (Web site, Wiki)
  – Subversion (source code, Javadoc)
• Nutch is a beast. Droids is light & simple.
• ML thread detection is tricky
• Finding deleted docs (Wiki, Web, Javadoc...)
Thread Detection
• Email clients are kaput
• SMTP headers are unreliable
• Heuristics are needed
  – Try headers
  – Fall back to subjects (get subject skeleton, calculate hash)
  – Factor in time (4 weeks)
  – Use index for thread info retrieval
Q: Are there any libraries for this?
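The heuristics above can be sketched roughly as follows. This is a minimal illustration, not the ProjectHub code: the reply-prefix list, the MD5 hash, and the `ThreadIndex` class are assumptions, and the two dictionaries merely stand in for the search index the slide mentions for thread info retrieval.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Reply/forward prefixes to strip; this particular set is an assumption.
_PREFIX = re.compile(r"^\s*(re|fwd?|aw)\s*:\s*", re.IGNORECASE)
WINDOW = timedelta(weeks=4)  # the "4 weeks" time factor from the slide


def subject_skeleton(subject):
    """Strip repeated reply prefixes, collapse whitespace, lowercase."""
    s = subject
    while True:
        stripped = _PREFIX.sub("", s, count=1)
        if stripped == s:
            break
        s = stripped
    return re.sub(r"\s+", " ", s).strip().lower()


def skeleton_hash(subject):
    return hashlib.md5(subject_skeleton(subject).encode("utf-8")).hexdigest()


class ThreadIndex:
    """Hypothetical in-memory stand-in for thread lookups against the index."""

    def __init__(self):
        self.by_message_id = {}  # Message-ID -> thread id
        self.by_skeleton = {}    # skeleton hash -> (thread id, date last seen)

    def assign(self, message_id, in_reply_to, subject, date):
        key = skeleton_hash(subject)
        # 1. Try headers.
        thread = self.by_message_id.get(in_reply_to) if in_reply_to else None
        # 2. Fall back to the subject skeleton, limited to the time window.
        if thread is None:
            hit = self.by_skeleton.get(key)
            if hit and abs(date - hit[1]) <= WINDOW:
                thread = hit[0]
        # 3. Otherwise this message starts a new thread.
        if thread is None:
            thread = message_id
        self.by_message_id[message_id] = thread
        self.by_skeleton[key] = (thread, date)
        return thread
```

A message with a broken or missing In-Reply-To header still joins its thread as long as its subject skeleton matches one seen within the last four weeks; an identical subject posted months later starts a new thread.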
Indexing
• Use StreamingUpdateSolrServer
• AutoCommit use-case
• Solr index abuse: track seen/unseen
• &qsrc=indexer
• &warmUp=true
• Separate processes – easier reindexing (esp. with frequent project infra changes)
• Treating quoted portions of ML messages
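One way to read "track seen/unseen" is that every crawl run stamps the documents it re-indexes, and anything from that source not stamped by the latest run is treated as deleted at the origin. The sketch below illustrates that reading only; `TinyIndex`, the `source` field, and the `last_seen_run` field are hypothetical names, and the dictionary stands in for a Solr core.

```python
class TinyIndex:
    """Hypothetical in-memory stand-in for a Solr core: doc id -> fields."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, fields, run_id):
        # Re-adding a doc overwrites the old copy and refreshes its run stamp,
        # much like a Solr add with an existing uniqueKey.
        self.docs[doc_id] = dict(fields, last_seen_run=run_id)

    def delete_unseen(self, source, run_id):
        """Delete docs of one source that the given run did not touch."""
        stale = [doc_id for doc_id, f in self.docs.items()
                 if f["source"] == source and f["last_seen_run"] < run_id]
        for doc_id in stale:
            del self.docs[doc_id]
        return sorted(stale)


index = TinyIndex()
# Run 1 sees two wiki pages and one JIRA issue.
index.add("wiki/A", {"source": "wiki"}, run_id=1)
index.add("wiki/B", {"source": "wiki"}, run_id=1)
index.add("jira/1", {"source": "jira"}, run_id=1)
# Run 2 of the wiki crawler no longer finds page B.
index.add("wiki/A", {"source": "wiki"}, run_id=2)
removed = index.delete_unseen("wiki", run_id=2)
```

Against a real Solr core the final step would presumably be a delete-by-query on the run-stamp field, scoped to one data source so that crawlers on different polling schedules do not delete each other's documents.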
Search
• Facets (multi-select)
  – Project
  – Data source/type
  – Author (based on names only)
• Boosting more recent documents vs. pure relevance vs. newest/oldest first
  – Docs with an empty updateDate field (e.g. Javadoc) are given the equivalent of 0.5 year of age:
recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
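To see what the function query computes, the sketch below re-implements Solr's `map(x,min,max,target)` and `recip(x,m,a,b) = a/(m*x+b)` in plain Python. One year is roughly 3.16e10 ms, so 6.32e11 and 3.16e12 are about 20 and 100 years, and 1.58e10 is about half a year. Two points are assumptions here: that a document with an empty updateDate ends up with an age of several decades (a date near the epoch), and that the trailing `^4` is a boost weight on the function rather than an exponent, which is why it is left out of the calculation.

```python
MS_PER_YEAR = 3.16e10  # approximate milliseconds in one year


def solr_map(x, lo, hi, target):
    """Solr map(): values inside [lo, hi] become target, others pass through."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Solr recip(): a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    # Ages between ~20 and ~100 years are treated as "no date" and
    # replaced by ~0.5 year before the reciprocal is applied.
    mapped = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    return solr_recip(mapped, 3.16e-11, 4, 1)


brand_new = recency_boost(0)                    # 4.0
one_year_old = recency_boost(MS_PER_YEAR)       # about 2.0
undated = recency_boost(40 * MS_PER_YEAR)       # same as a half-year-old doc
half_year_old = recency_boost(0.5 * MS_PER_YEAR)
```

The boost halves after roughly one year, and undated documents land between fresh and one-year-old documents instead of being pushed to the bottom.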
Search cont'd
• Query Spellchecker
• Sematext components:
  – ReSearcher & Relaxer
  – AutoComplete
  – Key Phrase Extractor (2 approaches)
• Threaded vs. flat view
• In-document search term highlighting
• Short URLs
Search cont'd
Dog food #1: Auto-Complete
• Source: nightly refreshed subject and titles
• Approach: go directly to selection
• sematext.com/products/autocomplete/
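A minimal sketch of the idea, assuming "go directly to selection" means that picking a suggestion opens the matching document rather than running a search. It is a plain prefix lookup over a sorted list, not the Sematext AutoComplete product; the class name, titles, and URLs are made up for illustration.

```python
import bisect


class AutoComplete:
    """Prefix suggestions over (title, url) pairs, rebuilt on each refresh."""

    def __init__(self, entries):
        # Sort by lowercased title so all matches for a prefix are contiguous.
        self._items = sorted((title.lower(), title, url) for title, url in entries)
        self._keys = [key for key, _, _ in self._items]

    def suggest(self, prefix, limit=5):
        p = prefix.lower()
        i = bisect.bisect_left(self._keys, p)  # first key >= prefix
        out = []
        while i < len(self._keys) and len(out) < limit and self._keys[i].startswith(p):
            out.append(self._items[i][1:])  # (title, url)
            i += 1
        return out


# Hypothetical nightly snapshot of subjects and titles.
ac = AutoComplete([
    ("Solr faceting", "/m/1"),
    ("Solr replication", "/m/2"),
    ("Hadoop namenode", "/m/3"),
])
suggestions = ac.suggest("solr")
```

Each suggestion carries its target URL, so the front end can navigate straight to the document when the user selects it.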
Dog food #2: ReSearcher & Relaxer
• Avoid “sorry, no/poor matches”
• Multiple algos trigger re-searching
• Different forms of relaxing
• sematext.com/products/dym-researcher/
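The general idea of relaxing a query can be illustrated with a small fallback chain: require all terms, then retry with one term dropped, then accept any term. This is only a sketch of the concept; it is not how the Sematext ReSearcher or Relaxer work internally, and `make_search` is a toy matcher over an in-memory list of titles.

```python
def make_search(docs):
    """Toy search over titles; returns a hypothetical search callback."""
    def search(terms, require_all):
        match = all if require_all else any
        return [d for d in docs if match(t in d.lower().split() for t in terms)]
    return search


def relaxed_search(terms, search):
    """Return (hits, how) where how names the relaxation that produced hits."""
    hits = search(terms, True)
    if hits:
        return hits, "all terms"
    # Relax step 1: drop one term at a time, still requiring the rest.
    for i, dropped in enumerate(terms):
        rest = terms[:i] + terms[i + 1:]
        if rest:
            hits = search(rest, True)
            if hits:
                return hits, "dropped " + dropped
    # Relax step 2: match any single term.
    hits = search(terms, False)
    if hits:
        return hits, "any term"
    return [], "no matches"


search = make_search(["Solr faceting performance", "Hadoop namenode failover"])
strict = relaxed_search(["solr", "faceting"], search)
relaxed = relaxed_search(["solr", "faceting", "slow"], search)
nothing = relaxed_search(["lucene"], search)
```

Returning the name of the relaxation alongside the hits lets the UI tell the user which query was actually run.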
Dog food #3: Key Phrases
• Help narrow search results, like facets
• 2 types:
  – Stored in index
  – Calculated from top N hits
• sematext.com/products/key-phrase-extractor/
Basic Search Analytics
• Top queries, top terms...
• Daily, weekly, monthly
• MRR: http://en.wikipedia.org/wiki/Mean_reciprocal_rank
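Mean reciprocal rank averages 1/rank of the first relevant result over a set of queries. The sketch below assumes the first clicked result is used as the relevance signal and that queries with no click contribute 0; both are assumptions about how the metric would be fed from query logs.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over 1-based ranks; None means no relevant result for that query."""
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank)
    return total / len(first_relevant_ranks)


# Four queries: clicks at rank 1, rank 2, no click, rank 4.
mrr = mean_reciprocal_rank([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

Tracked daily, weekly, and monthly like the other reports, a rising value means users are finding what they want nearer the top of the result list.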
Very Basic Search Analytics
Real Search Analytics
Performance & Monitoring: RPM
Availability: Site24x7.com
Operations
• Small EC2 instance: 1.7 GB RAM
• EBS for data - got burnt once
• Local disk for index
• Solr 1.4.1 multi-core
• Performance monitoring via RPM
• Availability & performance via site24x7.com
Statistics
• search-hadoop.com:
  – 110K+ documents
  – ~700 MB optimized
• search-lucene.com:
  – 170K+ documents
  – ~900 MB optimized
Future
• Field collapsing (threads)
• Bot detection (load) DONE
• Solr duplicate detection (release notes)
• Relevance tuning (MRR)
• Open sourcing?
World-wide!
Search & Data Analytics
Machine Learning & NLP
Big Data
WE ARE HIRING
Questions
?
Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg