Upload: sematext-group-inc

Post on 27-Jan-2015

Page 1: ProjectHub

Copyright 2010 Sematext Int'l. All rights reserved.

ProjectHub

Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends

Otis Gospodnetić ◦◦ [email protected] ◦◦ @otisg

Sematext Int'l ◦◦ www.sematext.com ◦◦ @sematext

Page 2: ProjectHub

What I Will Cover

• Who I am
• What, Why, Where
• Architecture
• Info Gathering & Indexing
• Search & Extra Search Dog Food
• Performance & Analytics
• Ops & Stats

Page 3: ProjectHub

About Otis Gospodnetić

• Lucene/Solr/Nutch/Mahout/... committer

• Lucene in Action 1 & 2 co-author

• Lucene Consulting since 2005

• Sematext International since 2007

Page 4: ProjectHub

About Sematext

Search (Lucene, Solr, Elastic Search...)

Web Crawling (Nutch)

Machine Learning (Mahout)

Big Data (Hadoop, HBase, Voldemort...)

Page 5: ProjectHub

What

• Search everything about a Software Project
• Lucene & Hadoop
  – All sub-projects
  – All content
    • Mailing list archives
    • JIRA issues
    • Web site & Wiki pages
    • Source code (local syntax highlighting), trunk
    • Javadoc, trunk

Page 6: ProjectHub

Page 7: ProjectHub

Why

• We need it
• Other Hadoop, Lucene, Solr... users need it
• Our own playground
• Live product demos
• Yummy dog food

Page 8: ProjectHub

Where

• search-lucene.com
• search-hadoop.com
• Other suggestions / needs?
• In your Enterprise?

Page 9: ProjectHub

Architecture

Page 10: ProjectHub

Tool Matrix

Data Source   Fetch                                                               Parse
JIRA          URLConnection (feed)                                                Digester (feed), DOM (item)
ML            FileInputStream (fs), URLConnection (feed), Droid (works, unused)   Digester (feed), MIME4J (mbox)
Web site      Droids                                                              Tika via Droids
Wiki          Droids                                                              Tika via Droids
Source code   svn co                                                              QDox
Javadoc       svn co                                                              QDox

Page 11: ProjectHub

Information Gathering

• Multiple independent JVM processes (cron)

• Different polling frequencies

• Different data sources / formats:
  – RSS (JIRA, Mailing Lists)
  – Mbox (Mailing Lists)
  – HTTP/HTML (Web site, Wiki)
  – Subversion (source code, Javadoc)

• Nutch is a beast. Droids is light & simple.

• ML thread detection is tricky

• Finding deleted docs (Wiki, Web, Javadoc...)

Page 12: ProjectHub

Thread Detection

• Email clients are kaput

• SMTP headers are unreliable

• Heuristics are needed
  – Try headers
  – Fall back to subjects (get subject skeleton, calculate hash)
  – Factor in time (4 weeks)
  – Use index for thread info retrieval

Q: Are there any libraries for this?
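
The heuristics above could be sketched roughly as follows. This is only an illustration, not the ProjectHub implementation: the prefix regex, the SHA-1 hash, the `ThreadIndex` class, and the use of an in-memory dict in place of the search index are all assumptions made for the example.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Assumed window, taken from the "Factor in time (4 weeks)" bullet.
THREAD_WINDOW = timedelta(weeks=4)

# Hypothetical pattern: leading "Re:", "Fwd:", "AW:" prefixes and "[list]" tags.
_PREFIX = re.compile(
    r"^\s*(?:(?:re|fwd?|aw)(?:\[\d+\])?\s*:\s*|\[[^\]]*\]\s*)+", re.IGNORECASE
)


def subject_skeleton(subject):
    """Strip reply prefixes and list tags, collapse whitespace, lowercase."""
    stripped = _PREFIX.sub("", subject)
    return re.sub(r"\s+", " ", stripped).strip().lower()


def subject_hash(subject):
    """Hash of the subject skeleton, usable as a lookup key."""
    return hashlib.sha1(subject_skeleton(subject).encode("utf-8")).hexdigest()


class ThreadIndex:
    """In-memory stand-in for the index used to retrieve thread info."""

    def __init__(self):
        self.by_message_id = {}  # Message-ID -> thread id
        self.by_subject = {}     # subject hash -> (thread id, date last seen)

    def assign(self, message_id, in_reply_to, subject, date):
        # 1. Try headers: follow In-Reply-To if the parent is known.
        thread_id = self.by_message_id.get(in_reply_to) if in_reply_to else None
        key = subject_hash(subject)
        # 2. Fall back to the subject hash, but only inside the time window.
        if thread_id is None:
            hit = self.by_subject.get(key)
            if hit is not None and abs(date - hit[1]) <= THREAD_WINDOW:
                thread_id = hit[0]
        # 3. Otherwise this message starts a new thread.
        if thread_id is None:
            thread_id = message_id
        self.by_message_id[message_id] = thread_id
        self.by_subject[key] = (thread_id, date)
        return thread_id
```

With this sketch, a reply whose headers are missing but whose subject is "RE: [solr-user] faceting question" still lands in the thread started by "Faceting question", provided it arrives within four weeks of the last message seen in that thread.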

Page 13: ProjectHub

Indexing

• Use StreamingUpdateSolrServer

• AutoCommit use-case

• Solr index abuse: track seen/unseen

• &qsrc=indexer

• &warmUp=true

• Separate processes – easier reindexing (esp. with frequent project infra changes)

• Treating quoted portions of ML messages
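
The slides do not say how the seen/unseen tracking works. One possible reading, sketched here purely as an assumption, is a mark-and-sweep pass: every crawl stamps the documents it finds with a run id, and anything from that source still carrying an older stamp is treated as deleted. `FakeIndex`, `crawl_and_sweep`, and the `last_seen_run` field are hypothetical names; with a real Solr core the sweep would be a delete-by-query rather than a Python predicate.

```python
class FakeIndex:
    """Stand-in for a Solr core: maps document id -> stored fields."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, fields):
        # Adding an existing id overwrites it, like a Solr uniqueKey update.
        self.docs[doc_id] = dict(fields)

    def delete_where(self, predicate):
        # Remove every document whose fields satisfy the predicate.
        doomed = [doc_id for doc_id, f in self.docs.items() if predicate(f)]
        for doc_id in doomed:
            del self.docs[doc_id]
        return doomed


def crawl_and_sweep(index, source, fetched, run_id):
    """Index everything fetched in this run, then drop unseen docs of that source."""
    # Mark: stamp each fetched document with the current run id.
    for doc_id, body in fetched.items():
        index.add(doc_id, {"source": source, "body": body, "last_seen_run": run_id})
    # Sweep: documents of this source not seen in this run are gone upstream.
    return index.delete_where(
        lambda f: f["source"] == source and f["last_seen_run"] < run_id
    )
```

Restricting the sweep to one source keeps the separately scheduled crawler processes from deleting each other's documents.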

Page 14: ProjectHub

Search

• Facets (multi-select)
  – Project
  – Data source/type
  – Author (based on names only)

• Boosting more recent documents vs. pure relevance vs. newest/oldest first

  – Docs with an empty updateDate field (e.g. Javadoc) are treated as if they were 0.5 years old:

recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
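
To see what the function query does, here is a small Python model of it. It relies on the documented Solr semantics of `recip(x,m,a,b) = a/(m*x+b)` and `map(x,min,max,target)`. Two things are assumptions: that an empty updateDate behaves like the epoch (so its age falls in the mapped 20 to 100 year range), and that the trailing `^4` is ordinary query boost syntax, which is why it is left out of the arithmetic. The helper names are made up for this sketch.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.16e10 ms


def solr_map(x, lo, hi, target):
    """Model of Solr's map(): values inside [lo, hi] become target."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Model of Solr's recip(): a / (m*x + b)."""
    return a / (m * x + b)


def recency_score(age_ms):
    """Value of the function query for a document of the given age."""
    # Ages between ~20 and ~100 years are rewritten to ~0.5 year (1.58e10 ms).
    mapped = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    # m = 3.16e-11 is roughly 1 / (ms per year), so the score halves at one year.
    return solr_recip(mapped, 3.16e-11, 4, 1)


if __name__ == "__main__":
    for label, age in [("new", 0), ("6 months", 0.5 * MS_PER_YEAR),
                       ("1 year", MS_PER_YEAR), ("no date", 40 * MS_PER_YEAR)]:
        print(label, round(recency_score(age), 3))
```

A brand-new document scores 4, a one-year-old document about 2, and a document without a date scores the same as one that is half a year old.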

Page 15: ProjectHub

Search cont'd

• Query Spellchecker

• Sematext components:
  – ReSearcher & Relaxer
  – AutoComplete
  – Key Phrase Extractor (2 approaches)

• Threaded vs. flat view

• In-document search term highlighting

• Short URLs

Page 16: ProjectHub

Search cont'd

Page 17: ProjectHub

Dog food #1: Auto-Complete

• Source: nightly refreshed subject and titles

• Approach: go directly to selection

• sematext.com/products/autocomplete/

Page 18: ProjectHub

Dog food #2: ReSearcher & Relaxer

• Avoid “sorry, no/poor matches”

• Multiple algos trigger re-searching

• Different forms of relaxing

• sematext.com/products/dym-researcher/

Page 19: ProjectHub

Dog food #3: Key Phrases

Help narrow search results, like facets

• 2 types:
  – Stored in index
  – Calculated from top N hits

sematext.com/products/key-phrase-extractor/

Page 20: ProjectHub

Basic Search Analytics

• Top queries, top terms...

• Daily, weekly, monthly

• MRR: http://en.wikipedia.org/wiki/Mean_reciprocal_rank
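
For reference, Mean Reciprocal Rank averages 1/rank of the first relevant result over all queries. A minimal sketch follows; treating the first clicked hit as "relevant" is an assumption made for this example, not something the slides state, and the function name is hypothetical.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None when no relevant result was returned (contributes 0).
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank)
    return total / len(first_relevant_ranks)


if __name__ == "__main__":
    # Four queries: hits at ranks 1, 2 and 4, and one query with no hit.
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

Values close to 1 mean users usually find what they want at the top of the list, which makes the metric a convenient target for relevance tuning.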

Page 21: ProjectHub

Very Basic Search Analytics

Page 22: ProjectHub

Real Search Analytics

Page 23: ProjectHub

Performance & Monitoring: RPM

Page 24: ProjectHub

Availability: Site24x7.com

Page 25: ProjectHub

Operations

• Small EC2 instance: 1.7 GB RAM

• EBS for data - got burnt once

• Local disk for index

• Solr 1.4.1 multi-core

• Performance monitoring via RPM

• Availability & performance via site24x7.com

Page 26: ProjectHub

Statistics

• search-hadoop.com:
  – 110K+ documents
  – ~700 MB optimized

• search-lucene.com:
  – 170K+ documents
  – ~900 MB optimized

Page 27: ProjectHub

Future

• Field collapsing (threads)

• Bot detection (load) DONE

• Solr duplicate detection (release notes)

• Relevance tuning (MRR)

• Open sourcing?

Page 28: ProjectHub

World-wide!

Search & Data Analytics

Machine Learning & NLP

Big Data

[email protected]

WE ARE HIRING

Page 29: ProjectHub

Questions

?

Page 30: ProjectHub

Contact

• sematext.com

• blog.sematext.com

• @sematext

• @otisg

[email protected]
