Keynote: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop

Post on 24-May-2015






Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination - See conference video - Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond batch processing. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large scale search, discovery and analytics over a wide variety of content utilizing tools like Solr, Hadoop, Mahout and others. The talk will cover the architecture and capabilities of the system, along with how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.


1. Search. Discover. Analyze. Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop. Grant Ingersoll, Chief Scientist, Lucid Imagination | 1

2. We All Know the Pain
- ________ data growth in the next ___ days/months/years
- Many estimate 80-90% of data is unstructured (multi-structured?)
- The Age of Data Paranoia: What if I don't collect it all? What if I miss something or lose something? What if I can't store it long enough? How do I secure it? Can I afford to do any of this? Can I afford not to? What if I can't make sense of it? | 2

3. Big Data Premise and Promise
- Premise: Large Scale Data Collection/Storage; Promise: Prevents Data Loss
- Premise: Long Term Storage; Promise: Affordable
- Premise: New Science; Promise: Delivering New Insights? | 3

4. Why Search, Discovery and Analytics (SDA)?
- User needs: real-time, ad hoc access to content (Search); aggressive prioritization based on importance; serendipity
- Batch processing isn't enough
- Search is built for multi-structured data
- Deeper analysis (Discovery, Analytics) yields: business insight into users; better Search and Discovery for users | 4

5. What do you need for SDA?
- Fast, efficient, scalable search: bulk and Near Real Time indexing
- Large scale, cost-effective storage
- Large scale processing power: large scale and distributed for whole-data consumption and analysis; sampling tools; distributed in-memory where appropriate
- NLP and machine learning tools that scale, to enhance discovery and analysis | 5

6. Example Use Cases
- Dark Data: petabytes (and beyond) of content in storage with little insight into what's in it; forensics, intelligence gathering, risk analysis, etc.
- Financial: enable a total customer view to better understand risks and opportunities
- Medical: extend research capabilities through deeper analysis of scientific data, publications and field usage
- Social Media Monitoring: understand and analyze social networks and their trends all the time, no matter the scale
- Commerce: drive more sales through metric-driven search and discovery without the guesswork | 6

7.
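As an illustration of the "fast, ad hoc search" requirement above, the sketch below builds a query URL for Solr's standard /select handler, with faceting as a simple discovery feature (counts per field value returned alongside the hits). The base URL, collection name, and the `source` facet field are assumptions for illustration, not details from the talk:

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, rows=10, facet_field=None):
    """Build a URL for Solr's standard /select handler.

    base_url (host, port, collection) and facet_field are deployment-
    specific assumptions. Faceting illustrates the discovery side of
    SDA: value counts for a field alongside the search results.
    """
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical local SolrCloud node and collection:
url = solr_select_url("http://localhost:8983/solr/collection1",
                      "big data", rows=5, facet_field="source")
print(url)
```

Fetching that URL with any HTTP client returns JSON (`wt=json`), which matches the REST/JSON integration style the platform advertises.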
Announcing LucidWorks Big Data Beta: an application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users | 7

8. Architecture | 8

9. Key Features of Beta
- Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
- Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
- RESTful API supporting JSON input/output formats for easy integration
- Full Stack: minimizes the impact of provisioning Hadoop, LucidWorks and other components
- Hosted in the cloud and supported by Lucid Imagination | 9

10. APIs
- Search and Indexing: full power of LucidWorks (Solr); bulk and Near Real Time indexing; sharded via SolrCloud
- Analytics: common search analytics for better understanding of relevancy based on log analysis; historical views
- Workflows: predefined workflows ease common data tasks such as bulk indexing
- Machine Learning: clustering; statistically interesting phrases; future enhancements planned
- Administration: access to key system information; user management
- Proxy APIs: LucidWorks; WebHDFS | 10

11. Under the Hood
- LucidWorks 2.1: Lucene/Solr 4.0-dev; sharded with SolrCloud; 1 second (default) soft commits for NRT updates; 1 minute (default) hard commits (no searcher reopen); transaction logs for recovery; SolrCloud takes care of leader election, etc., so no more master/worker; see Mark Miller's talk on SolrCloud
- SDA Engine: RESTful services built on Restlet 2.1; service discovery, load balancing and failover enabled via ZooKeeper + Netflix Curator; authentication and authorization over SSL (optional); proxies for the LucidWorks and Solr WebHDFS APIs; workflow engine coordinates data flow | 11

12.
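The commit settings described on the "Under the Hood" slide (1-second soft commits for NRT visibility, 1-minute hard commits without reopening the searcher, transaction logs for recovery) correspond to the update handler configuration in Solr 4's solrconfig.xml. A sketch with the slide's stated defaults; the exact LucidWorks configuration may differ:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 60s: flushes index data to stable storage
       but does not reopen the searcher (openSearcher=false) -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every 1s: makes newly indexed documents
       searchable (Near Real Time) without a full flush -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
  <!-- Transaction log: enables recovery and realtime get -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
```

Soft commits make documents visible cheaply and often; the infrequent hard commit with `openSearcher=false` bounds the transaction log without paying the cost of reopening searchers.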
Under the Hood
- Apache Hadoop: Map-Reduce (MR) jobs for ETL and bulk indexing into the SolrCloud sharded system; leverage Pig and custom MR jobs for ETL; WebHDFS
- Apache HBase: key-value and time series storage of all calculated metrics
- Apache Pig: log processing and metric calculation; log analysis -> HBase
- Apache ZooKeeper: Netflix Curator for service discovery and as a higher-level ZK client
- Apache Mahout: K-Means clustering; statistically interesting phrases; more to come
- Apache Kafka: pub-sub for collecting logs from LucidWorks into HDFS | 12

13. The Road Ahead
- Our approach is from search and discovery outwards to analytics; analytics in beta are focused around analysis of search logs
- Analytics themes: relevance; data quality; discovery; integration with other packages (R?)
- Machine Learning: classification; NLP
- More analytics on the index itself? | 13

14. Contacts
- http://
- http://
- [email protected]
- @gsingers | 14
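The "statistically interesting phrases" feature mentioned above is the kind of collocation mining Mahout performs, typically scored with Dunning's log-likelihood ratio over a 2x2 contingency table of word co-occurrence counts. A minimal pure-Python sketch of that scoring for intuition only; it is not Mahout's API, and the toy counts are invented:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.

    k11 = times the two words occur together, k12/k21 = times each
    occurs without the other, k22 = times neither occurs. High
    scores flag word pairs that co-occur far more than chance, i.e.
    candidate "interesting phrases".
    """
    def entropy(*counts):
        total = sum(counts)
        return total * math.log(total) - sum(
            c * math.log(c) for c in counts if c > 0)
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Statistically independent counts score ~0; a strongly associated
# pair (e.g. 100 joint occurrences vs. 1 each alone) scores high.
independent = llr_2x2(10, 10, 10, 10)    # ~0.0
associated = llr_2x2(100, 1, 1, 10000)   # large positive score
print(independent, associated)
```

Ranking word pairs by this score surfaces phrases worth promoting in search and discovery, which is how such scores feed back into relevancy.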