Keynote: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
DESCRIPTION
Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond batch processing. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery and analytics over a wide variety of content using tools like Solr, Hadoop, Mahout and others. The talk covers the architecture and capabilities of the system, along with how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.

TRANSCRIPT
1 |
Search Discover Analyze
Grant Ingersoll, Chief Scientist, Lucid Imagination
Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
2 |
• ________ data growth in the next ___ days/months/years
  – Many estimate 80-90% of data is "unstructured" (multi-structured?)
• The Age of "Data Paranoia"
  – What if I don't collect it all?
  – What if I miss something or lose something?
  – What if I can't store it long enough?
  – How do I secure it?
  – Can I afford to do any of this? Can I afford not to?
  – What if I can't make sense of it?
We All Know the Pain
3 |
Big Data Premise and Promise
Premise:
• Large Scale Data Collection/Storage ✔
• Prevents Data Loss ✔
• Long Term Storage ✔
• Affordable ✔
Promise:
• New Science Delivering New Insights ?
4 |
• User Needs:
  – Real-time, ad hoc access to content
  – Aggressive Prioritization based on Importance
  – Serendipity
• Batch processing isn't enough
• Search is built for multi-structured data
• Deeper analysis yields:
  – Business insight into users
  – Better Search and Discovery for users
Why Search, Discovery and Analytics (SDA)?
Search
Discovery Analytics
5 |
• Fast, efficient, scalable search
  – Bulk and Near Real Time Indexing
• Large scale, cost-effective storage
• Large scale processing power
  – Large scale and distributed for whole-data consumption and analysis
  – Sampling tools
  – Distributed in-memory where appropriate
• NLP and machine learning tools that scale to enhance discovery and analysis
What do you need for SDA?
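The large-scale processing requirement above is what Hadoop's MapReduce model addresses. As a rough illustration of the map/shuffle/reduce pattern (a single-process toy, not the actual Hadoop API), here is a term-count sketch in Python:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every token in one document."""
    for token in text.lower().split():
        yield token, 1

def reduce_phase(term, counts):
    """Reduce: sum all counts emitted for one term."""
    return term, sum(counts)

def mapreduce(docs):
    """Tiny single-process stand-in for a Hadoop MR job:
    run map over each doc, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, count in map_phase(doc_id, text):
            grouped[term].append(count)
    return dict(reduce_phase(t, c) for t, c in grouped.items())

docs = {1: "big data search", 2: "search and discovery"}
counts = mapreduce(docs)  # counts["search"] == 2
```

In a real cluster the grouping step is the distributed shuffle; the logic per phase is the same.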
6 |
• Dark Data
  – Petabytes (and beyond) of content in storage with little insight into what's in it
  – Forensics, intelligence gathering, risk analysis, etc.
• Financial
  – Enable a total customer view to better understand risks and opportunities
• Medical
  – Extend research capabilities through deeper analysis of scientific data, publications and field usage
• Social Media Monitoring
  – Understand and analyze social networks and their trends all the time, no matter the scale
• Commerce
  – Drive more sales through metric-driven search and discovery without the guesswork
Example Use Cases
7 |
An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
Announcing LucidWorks Big Data Beta
8 |
Architecture
9 |
• Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
• Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
• RESTful API supporting JSON input/output formats for easy integration
• Full Stack - minimizes the impact of provisioning Hadoop, LucidWorks and other components
• Hosted in the cloud and supported by Lucid Imagination
Key Features of Beta
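Because the platform is JSON-in/JSON-out over REST, client integration amounts to building JSON requests. A minimal sketch of assembling a bulk-indexing request body; note that the endpoint path and field names here are invented for illustration and are not the actual LucidWorks Big Data API:

```python
import json

# Hypothetical endpoint path -- an assumption for illustration,
# not the real LucidWorks Big Data API.
ENDPOINT = "/sda/v1/collections/demo/documents"

def build_index_request(docs):
    """Return (path, headers, body) for a bulk-indexing POST."""
    headers = {"Content-Type": "application/json"}
    return ENDPOINT, headers, json.dumps(docs)

path, headers, body = build_index_request(
    [{"id": "1", "title": "Scaling search with SolrCloud"},
     {"id": "2", "title": "Mahout clustering at scale"}]
)
```

Any HTTP client can then POST `body` to `path`; responses come back as JSON as well.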
10 |
APIs
• Search and Indexing
  – Full power of LucidWorks (Solr)
  – Bulk and Near Real Time Indexing
  – Sharded via SolrCloud
• Workflows
  – Predefined workflows ease common data tasks such as bulk indexing
• Administration
  – Access to key system information
  – User management
• Analytics
  – Common search analytics for better understanding of relevancy based on log analysis
  – Historical views
• Machine Learning
  – Clustering
  – Statistically Interesting Phrases
  – Future enhancements planned
• Proxy APIs
  – LucidWorks
  – WebHDFS
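"Sharded via SolrCloud" means each document is routed to one shard by hashing its unique key, so indexing scales horizontally while any node can route queries. SolrCloud itself assigns hash ranges of a murmur hash to shards; the modulo version below is a deliberately simplified stand-in for the same idea:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard by hashing its unique key.
    (SolrCloud maps hash *ranges* to shards; modulo is a
    simplified stand-in for the same routing idea.)"""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# A given id always lands on the same shard, so updates and
# deletes by id reach the right node deterministically.
assert shard_for("doc-42") == shard_for("doc-42")
```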
11 |
Under the Hood
• Lucene/Solr 4.0-dev
• Sharded with SolrCloud
  – 1 second (default) soft commits for NRT updates
  – 1 minute (default) hard commits (no searcher reopen)
  – Transaction logs for recovery
  – Solr takes care of leader election, etc., so no more master/worker
• See Mark Miller's talk on SolrCloud
• RESTful services built on Restlet 2.1
• Service discovery, load balancing and failover enabled via ZooKeeper + Netflix Curator
• Authentication and authorization over SSL (optional)
• Proxies for the LucidWorks and WebHDFS APIs
• Workflow engine coordinates data flow
LucidWorks 2.1 SDA Engine
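The split between frequent soft commits and infrequent hard commits is what makes near-real-time search affordable: a soft commit reopens the searcher so new documents become visible cheaply, while a hard commit (with no searcher reopen) only flushes to durable storage. A toy model of those semantics, not Solr's implementation:

```python
class TinyIndex:
    """Toy model of Solr NRT commit semantics (illustration only)."""
    def __init__(self):
        self.pending = []   # indexed, but not yet searchable
        self.visible = []   # searchable after a soft commit
        self.durable = []   # flushed to disk by a hard commit

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        # cheap: reopens the searcher (default: every 1 second)
        self.visible.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # expensive: flush to disk; with no searcher reopen,
        # pending docs stay invisible (default: every 1 minute)
        self.durable = self.visible + self.pending

idx = TinyIndex()
idx.add("doc1")
idx.soft_commit()    # doc1 is now searchable
idx.add("doc2")
idx.hard_commit()    # doc2 is durable but still not searchable
```

The transaction log covers the gap: anything indexed but not yet hard-committed can be replayed on recovery.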
12 |
Under the Hood
• Apache Hadoop
  – MapReduce (MR) jobs for ETL and bulk indexing into the SolrCloud-sharded system
  – Leverage Pig and custom MR jobs for log processing and metric calculation
  – WebHDFS
• Apache Mahout
  – K-Means Clustering
  – Statistically Interesting Phrases
  – More to come
• Apache HBase
  – Key-value and time-series storage of all calculated metrics
• Apache Pig
  – ETL
  – Log analysis -> HBase
• Apache ZooKeeper
  – Netflix Curator for service discovery and a higher-level ZK client
• Apache Kafka
  – Pub-sub for collecting logs from LucidWorks into HDFS
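Mahout's K-Means runs the classic assign/update loop as MapReduce jobs over vectors in HDFS. The single-process sketch below shows the same Lloyd's-algorithm loop on 2-D points; it is a toy stand-in for the technique, not Mahout's API:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Minimal Lloyd's k-means on 2-D points. Mahout distributes
    the same two steps: assign each point to its nearest centroid
    (map), then recompute each centroid as the cluster mean (reduce)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared distance
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                          + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # update step: move centroid to the cluster mean
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents = kmeans(pts, k=2)  # two centroids, one near each clump
```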
13 |
• Our approach is from search and discovery outwards to analytics
  – Analytics in the beta are focused on analysis of search logs
• Analytics Themes
  – Relevance
  – Data quality
  – Discovery
  – Integration with other packages (R?)
• Machine Learning
  – Classification
  – NLP
• More analytics on the index itself?
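The beta's log-centric analytics boil down to aggregating search logs into relevance metrics. A toy sketch of one such metric, click-through rate per query; the log record format here is invented for illustration, and in the platform described above this kind of aggregation runs as a Pig/MapReduce job writing results to HBase:

```python
from collections import defaultdict

# Hypothetical log records: (query, was a result clicked?)
LOG = [
    ("solr sharding", True),
    ("solr sharding", False),
    ("mahout kmeans", True),
    ("solr sharding", True),
]

def click_through_rate(log):
    """Aggregate per-query CTR = clicks / searches, a common
    log-derived signal for judging relevancy."""
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for query, clicked in log:
        searches[query] += 1
        if clicked:
            clicks[query] += 1
    return {q: clicks[q] / searches[q] for q in searches}

ctr = click_through_rate(LOG)
# ctr["solr sharding"] == 2/3; ctr["mahout kmeans"] == 1.0
```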
The Road Ahead
14 |
• http://bit.ly/lucidworks-big-data
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers
Contacts