Keynote: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
DESCRIPTION
Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond batch processing. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery and analytics over a wide variety of content using tools like Solr, Hadoop, Mahout and others. The talk covers the architecture and capabilities of the system, along with how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.

TRANSCRIPT
1 |
Search Discover Analyze
Grant Ingersoll, Chief Scientist, Lucid Imagination
Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
2 |
• ________ data growth in the next ___ days/months/years
  – Many estimate 80-90% of data is "unstructured" (multi-structured?)
• The Age of "Data Paranoia"
  – What if I don't collect it all?
  – What if I miss something or lose something?
  – What if I can't store it long enough?
  – How do I secure it?
  – Can I afford to do any of this? Can I afford not to?
  – What if I can't make sense of it?
We All Know the Pain
3 |
Big Data Premise and Promise
Premise:
• Large Scale Data Collection/Storage ✔
• Prevents Data Loss ✔
• Long Term Storage ✔
• Affordable ✔
Promise:
• New Science Delivering New Insights ?
4 |
• User Needs:
  – Real-time, ad hoc access to content
  – Aggressive Prioritization based on Importance
  – Serendipity
• Batch processing isn't enough
• Search is built for multi-structured data
• Deeper analysis yields:
  – Business insight into users
  – Better Search and Discovery for users
Why Search, Discovery and Analytics (SDA)?
Search
Discovery Analytics
5 |
• Fast, efficient, scalable search
  – Bulk and Near Real Time Indexing
• Large scale, cost-effective storage
• Large scale processing power
  – Large scale and distributed for whole-data consumption and analysis
  – Sampling tools
  – Distributed in-memory where appropriate
• NLP and machine learning tools that scale to enhance discovery and analysis
What do you need for SDA?
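The large-scale processing requirement above is what Hadoop's MapReduce model addresses. As a rough illustration of the map/shuffle/reduce pattern (a single-process toy, not the actual Hadoop API), here is a term-count sketch in Python:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every token in one document."""
    for token in text.lower().split():
        yield token, 1

def reduce_phase(term, counts):
    """Reduce: sum all counts emitted for one term."""
    return term, sum(counts)

def mapreduce(docs):
    """Tiny single-process stand-in for a Hadoop MR job:
    run map over each doc, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, count in map_phase(doc_id, text):
            grouped[term].append(count)
    return dict(reduce_phase(t, c) for t, c in grouped.items())

docs = {1: "big data search", 2: "search and discovery"}
counts = mapreduce(docs)  # counts["search"] == 2
```

In a real cluster the grouping step is the distributed shuffle; the logic per phase is the same.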
6 |
• Dark Data
  – Petabytes (and beyond) of content in storage with little insight into what's in it
  – Forensics, intelligence gathering, risk analysis, etc.
• Financial
  – Enable a total customer view to better understand risks and opportunities
• Medical
  – Extend research capabilities through deeper analysis of scientific data, publications and field usage
• Social Media Monitoring
  – Understand and analyze social networks and their trends all the time, no matter the scale
• Commerce
  – Drive more sales through metric-driven search and discovery without the guesswork
Example Use Cases
7 |
An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
Announcing LucidWorks Big Data Beta
8 |
Architecture
9 |
• Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
• Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
• RESTful API supporting JSON input/output formats for easy integration
• Full Stack - minimizes the impact of provisioning Hadoop, LucidWorks and other components
• Hosted in the cloud and supported by Lucid Imagination
Key Features of Beta
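Because the platform is JSON-in/JSON-out over REST, client integration amounts to building JSON requests. A minimal sketch of assembling a bulk-indexing request body; note that the endpoint path and field names here are invented for illustration and are not the actual LucidWorks Big Data API:

```python
import json

# Hypothetical endpoint path -- an assumption for illustration,
# not the real LucidWorks Big Data API.
ENDPOINT = "/sda/v1/collections/demo/documents"

def build_index_request(docs):
    """Return (path, headers, body) for a bulk-indexing POST."""
    headers = {"Content-Type": "application/json"}
    return ENDPOINT, headers, json.dumps(docs)

path, headers, body = build_index_request(
    [{"id": "1", "title": "Scaling search with SolrCloud"},
     {"id": "2", "title": "Mahout clustering at scale"}]
)
```

Any HTTP client can then POST `body` to `path`; responses come back as JSON as well.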
10 |
APIs
• Search and Indexing
  – Full power of LucidWorks (Solr)
  – Bulk and Near Real Time Indexing
  – Sharded via SolrCloud
• Workflows
  – Predefined workflows ease common data tasks such as bulk indexing
• Administration
  – Access to key system information
  – User management
• Analytics
  – Common search analytics for better understanding of relevancy based on log analysis
  – Historical views
• Machine Learning
  – Clustering
  – Statistically Interesting Phrases
  – Future enhancements planned
• Proxy APIs
  – LucidWorks
  – WebHDFS
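"Sharded via SolrCloud" means each document is routed to one shard by hashing its unique key, so indexing scales horizontally while any node can route queries. SolrCloud itself assigns hash ranges of a murmur hash to shards; the modulo version below is a deliberately simplified stand-in for the same idea:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard by hashing its unique key.
    (SolrCloud maps hash *ranges* to shards; modulo is a
    simplified stand-in for the same routing idea.)"""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# A given id always lands on the same shard, so updates and
# deletes by id reach the right node deterministically.
assert shard_for("doc-42") == shard_for("doc-42")
```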
11 |
Under the Hood
• Lucene/Solr 4.0-dev
• Sharded with SolrCloud
  – 1 second (default) soft commits for NRT updates
  – 1 minute (default) hard commits (no searcher reopen)
  – Transaction logs for recovery
  – Solr takes care of leader election, etc., so no more master/worker
• See Mark Miller's talk on SolrCloud
• RESTful services built on Restlet 2.1
• Service discovery, load balancing and failover enabled via ZooKeeper + Netflix Curator
• Authentication and authorization over SSL (optional)
• Proxies for the LucidWorks and WebHDFS APIs
• Workflow engine coordinates data flow
LucidWorks 2.1 SDA Engine
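The split between frequent soft commits and infrequent hard commits is what makes near-real-time search affordable: a soft commit reopens the searcher so new documents become visible cheaply, while a hard commit (with no searcher reopen) only flushes to durable storage. A toy model of those semantics, not Solr's implementation:

```python
class TinyIndex:
    """Toy model of Solr NRT commit semantics (illustration only)."""
    def __init__(self):
        self.pending = []   # indexed, but not yet searchable
        self.visible = []   # searchable after a soft commit
        self.durable = []   # flushed to disk by a hard commit

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        # cheap: reopens the searcher (default: every 1 second)
        self.visible.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # expensive: flush to disk; with no searcher reopen,
        # pending docs stay invisible (default: every 1 minute)
        self.durable = self.visible + self.pending

idx = TinyIndex()
idx.add("doc1")
idx.soft_commit()    # doc1 is now searchable
idx.add("doc2")
idx.hard_commit()    # doc2 is durable but still not searchable
```

The transaction log covers the gap: anything indexed but not yet hard-committed can be replayed on recovery.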
12 |
Under the Hood
• Apache Hadoop
  – MapReduce (MR) jobs for ETL and bulk indexing into the SolrCloud-sharded system
  – Leverage Pig and custom MR jobs for log processing and metric calculation
  – WebHDFS
• Apache Mahout
  – K-Means Clustering
  – Statistically Interesting Phrases
  – More to come
• Apache HBase
  – Key-value and time-series storage of all calculated metrics
• Apache Pig
  – ETL
  – Log analysis -> HBase
• Apache ZooKeeper
  – Netflix Curator for service discovery and a higher-level ZK client
• Apache Kafka
  – Pub-sub for collecting logs from LucidWorks into HDFS
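Mahout's K-Means runs the classic assign/update loop as MapReduce jobs over vectors in HDFS. The single-process sketch below shows the same Lloyd's-algorithm loop on 2-D points; it is a toy stand-in for the technique, not Mahout's API:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Minimal Lloyd's k-means on 2-D points. Mahout distributes
    the same two steps: assign each point to its nearest centroid
    (map), then recompute each centroid as the cluster mean (reduce)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared distance
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                          + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # update step: move centroid to the cluster mean
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents = kmeans(pts, k=2)  # two centroids, one near each clump
```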
13 |
• Our approach is from search and discovery outwards to analytics
  – Analytics in the beta are focused on analysis of search logs
• Analytics Themes
  – Relevance
  – Data quality
  – Discovery
  – Integration with other packages (R?)
• Machine Learning
  – Classification
  – NLP
• More analytics on the index itself?
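The beta's log-centric analytics boil down to aggregating search logs into relevance metrics. A toy sketch of one such metric, click-through rate per query; the log record format here is invented for illustration, and in the platform described above this kind of aggregation runs as a Pig/MapReduce job writing results to HBase:

```python
from collections import defaultdict

# Hypothetical log records: (query, was a result clicked?)
LOG = [
    ("solr sharding", True),
    ("solr sharding", False),
    ("mahout kmeans", True),
    ("solr sharding", True),
]

def click_through_rate(log):
    """Aggregate per-query CTR = clicks / searches, a common
    log-derived signal for judging relevancy."""
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for query, clicked in log:
        searches[query] += 1
        if clicked:
            clicks[query] += 1
    return {q: clicks[q] / searches[q] for q in searches}

ctr = click_through_rate(LOG)
# ctr["solr sharding"] == 2/3; ctr["mahout kmeans"] == 1.0
```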
The Road Ahead
14 |
• http://bit.ly/lucidworks-big-data
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers
Contacts