Leveraging Solr and Mahout

Confidential © Copyright 2012 Leveraging Solr and Mahout for Next Gen Data Access and Insight Grant Ingersoll Chief Scientist

Upload: grant-ingersoll

Post on 10-May-2015


DESCRIPTION

My talk from last night's Big Data Warehouse meetup in NYC on using Solr and Mahout to build next generation data access tools

TRANSCRIPT

Page 1: Leveraging Solr and Mahout

Confidential © Copyright 2012

Leveraging Solr and Mahout for Next Gen Data Access and Insight

Grant Ingersoll, Chief Scientist

Page 2: Leveraging Solr and Mahout

Confidential and Proprietary © 2012 LucidWorks

Search is Dead, Long Live Search

Content

Users

Access

Content Relationships

• Modern Data Challenges are multi-structured

• Search is a system building block
- Text is only a part of the story

• If the algorithms fit, use them!

• Embrace fuzziness!

• Scoring features are everywhere

Page 3: Leveraging Solr and Mahout


Topics

• Intros

• Search (R)Evolution

• Apache Solr

• Apache Mahout

• Search and Machine Learning

• Scaling

Page 4: Leveraging Solr and Mahout


Grant’s Background

• Co-founder:
- LucidWorks – Chief Scientist
- Apache Mahout

• Long time Lucene/Solr committer

• Author: Taming Text
- www.manning.com/ingersoll

• Background in IR and NLP
- Built CLIR, QA and a variety of other search-based apps

Page 5: Leveraging Solr and Mahout


Search (R)evolution

• Search use leads to search abuse
- Denormalization frees your mind
- Scoring is just a sparse matrix multiply

• Lucene/Solr evolution
- Non-free text usages abound
- Many DB-like features
- NoSQL before NoSQL was cool
- Flexible indexing
- Finite State Transducers FTW!

• Scale

• “This ain’t your father’s relevance anymore”
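
The "scoring is just a sparse matrix multiply" point above can be illustrated with a small sketch. This is not Lucene's scoring code: it assumes raw term counts as document weights and hand-picked query weights, and the names `build_index` and `score` are hypothetical. The inverted index plays the role of a sparse term-document matrix, and scoring multiplies a sparse query vector against it.

```python
from collections import defaultdict


def build_index(docs):
    """Build a sparse term-document matrix as {term: {doc_id: weight}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            # Assumption: the weight is a raw term count, not a tuned formula.
            index[term][doc_id] = index[term].get(doc_id, 0.0) + 1.0
    return index


def score(index, query_weights):
    """Multiply the sparse query vector by the sparse term-document matrix."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Only rows for query terms and columns for documents that contain
        # those terms are touched; everything else is an implicit zero.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first, ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "d1": "solr search server",
    "d2": "mahout machine learning",
    "d3": "search and machine learning with solr",
}
index = build_index(docs)
ranking = score(index, {"solr": 2.0, "learning": 1.0})
print(ranking)
```

Because the query vector can carry any weighted feature (not just text terms), the same multiply covers the "scoring features are everywhere" idea from earlier in the deck.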

Page 6: Leveraging Solr and Mahout


Apache Solr?

• “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
- http://lucene.apache.org/solr

• Did I mention free?
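
As a rough sketch of the HTTP and JSON APIs mentioned in the quote above, the snippet below builds a select URL with faceting and highlighting turned on and reads facet counts out of a JSON response. The host, port, handler path, and the `category` field are assumptions for illustration, and the sample response is hand-written to resemble Solr's default JSON layout rather than captured from a real server.

```python
import json
from urllib.parse import urlencode

# Assumption: host, port, and handler path depend on the installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(text, facet_field, rows=10):
    """Build a select URL asking for JSON, facets, and highlighting."""
    params = [
        ("q", text),
        ("wt", "json"),
        ("rows", rows),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("hl", "true"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def facet_counts(response, facet_field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample shaped like a Solr JSON response (not real output).
sample = json.loads("""
{
  "response": {"numFound": 2, "docs": [{"id": "1"}, {"id": "2"}]},
  "facet_counts": {"facet_fields": {"category": ["search", 2, "ml", 1]}}
}
""")

url = build_query_url("big data", "category")
counts = facet_counts(sample, "category")
print(url)
print(counts)
```

Fetching the URL is left out so the sketch runs without a Solr server; any HTTP client would do for that step.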

Page 7: Leveraging Solr and Mahout


Apache Mahout

• Goal: create library of scalable machine learning algorithms

• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
- Collaborative Filtering
- Classification
- Clustering

• Also:
- Collocations (Statistically Interesting Phrases)
- SVD
- Java math, primitives libraries and more
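
To make the first of the three "C"s concrete, here is a minimal item co-occurrence recommender. It is a plain Python sketch of the general idea, not Mahout's API or its algorithms; the function names and the toy user data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(user_items):
    """Count how often each pair of items appears for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, user, top_n=3):
    """Score unseen items by their co-occurrence with the user's items."""
    seen = user_items[user]
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first, ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Toy preference data, invented for this example.
user_items = {
    "alice": {"solr", "lucene"},
    "bob": {"solr", "lucene", "mahout"},
    "carol": {"mahout", "hadoop"},
    "dave": {"solr", "mahout", "hadoop"},
}
counts = cooccurrence(user_items)
suggestions = recommend(user_items, counts, "alice")
print(suggestions)
```

At scale, the pair counting is the expensive step, which is the kind of work that benefits from a distributed implementation.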

Page 8: Leveraging Solr and Mahout


Search + Machine Learning

• Search-driven applications present multiple opportunities for leveraging machine learning
- Clustering – Enhance Discovery, outlier detection
- Classification – Queries, Documents, Users
- Content Recommendation – Collab. Filtering and personalization
- NLP – phrases, named entities, co-reference, much more

• Many of these can also power faceted navigation

• Aside: Search can also often be used effectively to implement many machine learning algorithms
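
One way to read the aside above is a nearest-neighbor classifier built on search: index labeled documents, run the unlabeled text as a query, and let the top hits vote. The sketch below assumes a toy in-memory index ranked by shared-term count in place of a real search engine; all names and training data are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(labeled_docs):
    """Map each term to the ids of the training documents containing it."""
    index = defaultdict(set)
    for doc_id, (text, _label) in enumerate(labeled_docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def search(index, text, k):
    """Return ids of the k documents sharing the most terms with text."""
    hits = Counter()
    for term in set(text.lower().split()):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def classify(index, labeled_docs, text, k=3):
    """Label text by majority vote among its top-k search hits."""
    hits = search(index, text, k)
    votes = Counter(labeled_docs[doc_id][1] for doc_id in hits)
    return votes.most_common(1)[0][0] if votes else None


# Toy training set, invented for this example.
training = [
    ("distributed search with solr shards", "search"),
    ("faceted search and hit highlighting", "search"),
    ("clustering documents with k-means", "ml"),
    ("training a classification model", "ml"),
    ("collaborative filtering for recommendation", "ml"),
]
index = build_index(training)
label = classify(index, training, "faceted search on solr")
print(label)
```

Swapping the toy `search` function for a query against a real index keeps the voting logic unchanged, which is the appeal of the approach.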

Page 9: Leveraging Solr and Mahout


How and When

[Architecture diagram; labels recovered from the slide]

• Shards: 1, 2, 3, N

• Search View
- Documents
- Users
- Logs

• Document Store

• Analytic Services
- View into numeric/historic data

• Personalization & Machine Learning Services
- Classification
- Recommendation

• Classification Models

• In memory, Replicated, Multi-tenant

• Discovery & Enrichment
- Clustering, classification, NLP, topic identification, search log analysis, user behavior

• Content Acquisition
- ETL, batch or near real-time

• Access APIs

• Data
- LucidWorks Search connectors
- Push

Page 10: Leveraging Solr and Mahout


Scaling

• Search
- Solr Cloud = Large scale, distributed search and faceting
» http://wiki.apache.org/solr/SolrCloud

• Machine Learning
- Mahout is built on Hadoop for most things
- SGD is sequential and really fast

• Sometimes all you can do is make an educated guess
- Storm, Kafka, etc. can help by allowing you to make estimates in near real time
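
The "SGD is sequential and really fast" bullet above refers to stochastic gradient descent, which updates a model one example at a time. The sketch below trains a tiny logistic regression that way. It is not Mahout's implementation; the learning rate, epoch count, and toy data are assumptions chosen so the example converges quickly.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that the example belongs to class 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def sgd_train(examples, n_features, learning_rate=0.5, epochs=200):
    """Fit logistic regression with one weight update per example."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the log-likelihood for a single example.
            error = label - predict(weights, bias, features)
            bias += learning_rate * error
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
    return weights, bias


# Toy, linearly separable data, invented for this example.
examples = [
    ([1.0, 0.0], 1),
    ([0.9, 0.2], 1),
    ([0.1, 0.9], 0),
    ([0.0, 1.0], 0),
]
weights, bias = sgd_train(examples, n_features=2)
print(predict(weights, bias, [1.0, 0.0]), predict(weights, bias, [0.0, 1.0]))
```

Each update only needs the current example and the current weights, so the model can be trained in a single pass over a stream without holding the data set in memory.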

Page 11: Leveraging Solr and Mahout


Wrap

• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data
- http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based applications

Page 12: Leveraging Solr and Mahout


Resources

• LucidWorks
- http://www.lucidworks.com
- http://www.lucidworks.com/products/lucidworks-big-data
- @LucidImagineer

• Me
- [email protected]
- @gsingers

• Taming Text
- http://www.manning.com/ingersoll
- http://www.tamingtext.com
- @tamingtext