projecthub

Copyright 2010 Sematext Int'l. All rights reserved.

ProjectHub

Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends

Otis Gospodnetić ◦◦ [email protected] ◦◦ @otisg

Sematext Int'l ◦◦ www.sematext.com ◦◦ @sematext



What I Will Cover

• Who I am
• What, Why, Where
• Architecture
• Info Gathering & Indexing
• Search & Extra Search Dog Food
• Performance & Analytics
• Ops & Stats



About Otis Gospodnetić

• Lucene/Solr/Nutch/Mahout/... committer

• Lucene in Action 1 & 2 co-author

• Lucene Consulting since 2005

• Sematext International since 2007



About Sematext

Search (Lucene, Solr, Elastic Search...)

Web Crawling (Nutch)

Machine Learning (Mahout)

Big Data (Hadoop, HBase, Voldemort...)


What

• Search everything about a Software Project
• Lucene & Hadoop
  – All sub-projects
  – All content
    • Mailing list archives
    • JIRA issues
    • Web site & Wiki pages
    • Source code (local syntax highlighting), trunk
    • Javadoc, trunk



Why

• We need it
• Other Hadoop, Lucene, Solr... users need it
• Our own playground
• Live product demos
• Yummy dog food



Where

• search-lucene.com
• search-hadoop.com
• Other suggestions / needs?
• In your Enterprise?



Architecture



Tool Matrix

Data Source    Fetch                                                                 Parse
JIRA           URLConnection (feed)                                                  Digester (feed), DOM (item)
ML             FileInputStream (fs), URLConnection (feed), Droid (works, unused)     Digester (feed), MIME4J (mbox)
Web site       Droids                                                                Tika via Droids
Wiki           Droids                                                                Tika via Droids
Source code    svn co                                                                QDox
Javadoc        svn co                                                                QDox


Information Gathering

• Multiple independent JVM processes (cron)

• Different polling frequencies

• Different data sources / formats:
  – RSS (JIRA, Mailing Lists)
  – Mbox (Mailing Lists)
  – HTTP/HTML (Web site, Wiki)
  – Subversion (source code, Javadoc)

• Nutch is a beast. Droids is light & simple.

• ML thread detection is tricky

• Finding deleted docs (Wiki, Web, Javadoc...)


Thread Detection

• Email clients are kaput

• SMTP headers are unreliable

• Heuristics are needed
  – Try headers
  – Fall back to subjects (get subject skeleton, calculate hash)
  – Factor in time (4 weeks)
  – Use index for thread info retrieval

Q: Are there any libraries for this?
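The heuristics above could be sketched roughly as follows. This is a minimal illustration, not the ProjectHub implementation: the prefix list, the MD5 hash, the `ThreadIndex` class, and the in-memory dictionaries (standing in for lookups against the search index) are all assumptions made for the example.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Assumed reply/forward prefixes and bracketed list tags; real archives may need more.
_LEADING_NOISE = re.compile(r"^(?:\s*(?:(?:re|fwd?|aw)\s*:|\[[^\]]*\]))+\s*", re.IGNORECASE)
WINDOW = timedelta(weeks=4)  # the "factor in time" rule from the slide


def subject_skeleton(subject):
    """Strip reply prefixes and list tags, collapse whitespace, lowercase."""
    stripped = _LEADING_NOISE.sub("", subject)
    return " ".join(stripped.split()).lower()


def subject_hash(subject):
    """Hash of the subject skeleton, usable as a lookup key."""
    return hashlib.md5(subject_skeleton(subject).encode("utf-8")).hexdigest()


class ThreadIndex:
    """Hypothetical in-memory stand-in for thread lookups against the index."""

    def __init__(self):
        self.by_message_id = {}  # message id -> thread id
        self.by_subject = {}     # subject hash -> (thread id, date of latest message)

    def assign(self, message_id, parent_ids, subject, date):
        thread_id = None
        # 1. Try headers (In-Reply-To / References ids passed in as parent_ids).
        for parent in parent_ids:
            if parent in self.by_message_id:
                thread_id = self.by_message_id[parent]
                break
        # 2. Fall back to the subject skeleton, but only within the time window.
        key = subject_hash(subject)
        if thread_id is None:
            hit = self.by_subject.get(key)
            if hit is not None and abs(date - hit[1]) <= WINDOW:
                thread_id = hit[0]
        # 3. Otherwise this message starts a new thread.
        if thread_id is None:
            thread_id = message_id
        self.by_message_id[message_id] = thread_id
        self.by_subject[key] = (thread_id, date)
        return thread_id
```

A reply with broken headers but a matching subject lands in the existing thread if it arrives within four weeks; the same subject reused months later starts a new thread.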


Indexing

• Use StreamingUpdateSolrServer

• AutoCommit use-case

• Solr index abuse: track seen/unseen

• &qsrc=indexer

• &warmUp=true

• Separate processes – easier reindexing (esp. with frequent project infra changes)

• Treating quoted portions of ML messages
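The slide does not say how quoted portions are treated. One possible approach, sketched here under the assumption that quoted lines start with ">" and are introduced by an "On ... wrote:" attribution line, is to separate new text from quoted text so the two can be indexed or weighted differently. The `split_quoted` helper is hypothetical.

```python
import re

# Assumed attribution pattern; mail clients vary widely, so treat this as illustrative.
_ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def split_quoted(body):
    """Return (new_text, quoted_text) for a plain-text mail body."""
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        if line.lstrip().startswith(">") or _ATTRIBUTION.match(line):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return "\n".join(new_lines).strip(), "\n".join(quoted_lines).strip()
```

The new text could then go into the main searchable field, with the quoted text dropped or stored in a lower-weighted field so that heavily quoted replies do not outrank the original message.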


Search

• Facets (multi-select)
  – Project
  – Data source/type
  – Author (based on names only)

• Boosting more recent documents vs. pure relevance vs. newest/oldest first

give equivalent of 0.5 year to docs w/ empty updateDate field (e.g. javadocs)

recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
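To see what the function above does, it can be mirrored in a few lines of Python. The helpers below follow Solr's documented function semantics (`ms(a,b) = a - b`, `map(x,min,max,target)`, `recip(x,m,a,b) = a / (m*x + b)`). Two things are assumptions in this sketch: that an empty `updateDate` evaluates as 0 (the epoch), which is why such documents look 20 to 100 years old and get mapped to half a year, and that the trailing `^4` is a boost weight applied by the query parser rather than part of the function, so it is left out here.

```python
def ms(now_ms, date_ms):
    """Difference between two timestamps in milliseconds."""
    return now_ms - date_ms


def map_range(x, lo, hi, target):
    """Solr map(): replace x with target when lo <= x <= hi, else keep x."""
    return target if lo <= x <= hi else x


def recip(x, m, a, b):
    """Solr recip(): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(now_ms, update_date_ms):
    # Assumption: a missing date behaves like 0 (the epoch).
    age = ms(now_ms, update_date_ms or 0)
    # Ages between roughly 20 and 100 years are treated as 0.5 year (1.58e10 ms).
    age = map_range(age, 6.32e11, 3.16e12, 1.58e10)
    # 3.16e-11 is about 1 / (ms per year), so the value halves after one year.
    return recip(age, 3.16e-11, 4, 1)


NOW_2010 = 1.28e12  # roughly late 2010, in ms since the epoch
fresh = recency_boost(NOW_2010, NOW_2010)
one_year_old = recency_boost(NOW_2010, NOW_2010 - 3.16e10)
undated = recency_boost(NOW_2010, None)
```

Under these assumptions a brand-new document scores 4, a one-year-old document about 2, and an undated document (such as Javadoc) sits in between.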


Search cont'd

• Query Spellchecker

• Sematext components:
  – ReSearcher & Relaxer
  – AutoComplete
  – Key Phrase Extractor (2 approaches)

• Threaded vs. flat view

• In-document search term highlighting

• Short URLs


Search cont'd


Dog food #1: Auto-Complete

• Source: nightly refreshed subject and titles

• Approach: go directly to selection

• sematext.com/products/autocomplete/


Dog food #2: ReSearcher & Relaxer

• Avoid “sorry, no/poor matches”

• Multiple algos trigger re-searching

• Different forms of relaxing

• sematext.com/products/dym-researcher/


Dog food #3: Key Phrases

Help narrow search results, like facets

• 2 types:
  – Stored in index vs. calculated from top N hits

sematext.com/products/key-phrase-extractor/


Basic Search Analytics

• Top queries, top terms...

• Daily, weekly, monthly

• MRR: http://en.wikipedia.org/wiki/Mean_reciprocal_rank
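For reference, MRR is the average over queries of 1/rank of the first relevant result. A minimal sketch, assuming the input is the 1-based rank of the first relevant (for example, clicked) result per query, with `None` when nothing relevant was returned; how relevance is determined in the actual analytics is not stated in the slides.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over queries; queries with no relevant hit contribute 0."""
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank if rank else 0.0 for rank in first_relevant_ranks)
    return total / len(first_relevant_ranks)


# Four queries: hits at rank 1, 2, none, and 4.
example_mrr = mean_reciprocal_rank([1, 2, None, 4])
```

A value near 1 means the first relevant result is usually at the top; tracking it daily, weekly, and monthly shows whether relevance tuning is helping.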


Very Basic Search Analytics


Real Search Analytics


Performance & Monitoring: RPM


Availability: Site24x7.com


Operations

• Small EC2 instance: 1.7 GB RAM

• EBS for data - got burnt once

• Local disk for index

• Solr 1.4.1 multi-core

• Performance monitoring via RPM

• Availability & performance via site24x7.com


Statistics

• search-hadoop.com:
  – 110K+ documents
  – ~700 MB optimized

• search-lucene.com:
  – 170K+ documents
  – ~900 MB optimized


Future

• Field collapsing (threads)

• Bot detection (load) DONE

• Solr duplicate detection (release notes)

• Relevance tuning (MRR)

• Open sourcing?


World-wide!

Search & Data Analytics

Machine Learning & NLP

Big Data

[email protected]

WE ARE HIRING


Questions

?


Contact

• sematext.com

• blog.sematext.com

• @sematext

• @otisg

• [email protected]

