dataweek keynote: large scale search, discovery and analysis in action
DESCRIPTION
Session Description: Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond simple batch processing jobs. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. In this talk, we'll discuss real-world use cases across several industries as well as how to effectively leverage open source tools like Hadoop, Solr, Mahout and others to better enable user access to big data.
TRANSCRIPT
![Page 1: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/1.jpg)
Confidential © Copyright 2012
Large Scale Search, Discovery and Analysis in Action
Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012
![Page 2: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/2.jpg)
Confidential and Proprietary © 2012 LucidWorks
User Interactions With Big Data
• DFS: accessed through the command line by the system administrator
• Key-value store: accessed through a query language by the engineer
• Index: accessed through keyword search by the end user
![Page 3: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/3.jpg)
Search, Discovery and Analytics
Is Search Enough?
• Keyword search is a commodity
• Holistic view of the data and the user interactions with that data
• Search, Discovery and Analytics are the key to unlocking this view of users and data
Example search query: endeavour shuttle bay area
![Page 4: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/4.jpg)
Why Search, Discovery and Analytics?
• User Needs
- real-time, ad hoc access to content
- aggressive prioritization based on importance
- serendipity
- feedback/learning from past
• Business Needs
- deeper insight into users
- leverage existing internal knowledge
- cost effective
Diagram: Search, Discovery, Analytics
![Page 5: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/5.jpg)
Topics
• Background and needs
• Architecture
• Search, Discovery and Analytics in action
• Road map
• Wrap up
![Page 6: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/6.jpg)
Search
• Performance
• Real time
• Relevance and importance
• Presenting results
• Experiment management
![Page 7: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/7.jpg)
Discovery
• Content clustering
• Discovering near duplicate documents
• Finding ‘dark data’
• Making recommendations
• Uncovering trends
• Recognizing topics
• More like this
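One way to picture near-duplicate discovery is word shingling combined with Jaccard similarity. The sketch below is illustrative only and is not taken from the talk: the function names, the shingle size and the threshold are all assumptions, and a production system would likely use a scalable variant such as MinHash rather than comparing every pair.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return pairs of document ids whose shingle sets overlap enough.

    docs is a dict of id -> text. The threshold is a hypothetical default.
    """
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id], k) for doc_id in ids}
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if jaccard(sets[left], sets[right]) >= threshold:
                pairs.append((left, right))
    return pairs


docs = {
    "a": "the shuttle endeavour flew over the bay area today",
    "b": "the shuttle endeavour flew over the bay area yesterday",
    "c": "fraudulent calls cost telecom operators money every year",
}
print(near_duplicates(docs))
```

The all-pairs loop is quadratic in the number of documents, which is why this is only a sketch of the idea rather than something to run over a large corpus.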
![Page 8: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/8.jpg)
Analytics
• Term frequency
• Facets
• Click analysis
• Relevancy metrics
• Zero results queries
• Hot spots
• Statistically interesting phrases
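Two of these analytics, term frequency and zero results queries, can be sketched over a query log in a few lines. The log layout used here, a list of (query, number of results) pairs, is an assumption made for illustration and not a format described in the talk.

```python
from collections import Counter


def query_log_stats(log):
    """Compute term frequencies and zero-result query counts.

    log is an iterable of (query, num_results) pairs (hypothetical format).
    """
    term_freq = Counter()
    zero_results = Counter()
    for query, num_results in log:
        normalized = query.lower()
        term_freq.update(normalized.split())
        if num_results == 0:
            zero_results[normalized] += 1
    return term_freq, zero_results


log = [
    ("endeavour shuttle", 12),
    ("shuttle bay area", 0),
    ("Shuttle Bay Area", 0),
    ("hadoop", 5),
]
term_freq, zero_results = query_log_stats(log)
print(term_freq.most_common(3))
print(zero_results.most_common(1))
```

Queries that repeatedly return nothing are a common starting point for relevance tuning, since they point at missing content or missing synonyms.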
![Page 9: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/9.jpg)
Some Use Cases
• Video streaming
- classification
- recommendations
• Financial, transportation, telecommunications
- fraud detection
• Social media
- trend monitoring
• Information technology
- logs monitoring
• Healthcare
- identifying patients for clinical studies
![Page 10: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/10.jpg)
In Focus: Personalized Medicine
Diagram: Patient DNA; Alignment and other analysis; Genetic Variations; Search and Faceting; Standard Therapies; Alternative Therapies
![Page 11: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/11.jpg)
In Focus: Log Processing in Telecommunications
• Each year, large sums of money are lost due to fraudulent calls and poor service
• Logs are usually semi-structured and contain vital information about errors and fraud
• Deeper batch analytics can provide insight into patterns across vast amounts of data
• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
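As a rough illustration of working with semi-structured logs, the sketch below parses key=value records and counts errors per caller. The log format, the field names (`caller`, `status`, `reason`) and the sample lines are entirely hypothetical; real telecom logs will differ.

```python
import re
from collections import Counter

# Matches key=value pairs; values may be bare tokens or double-quoted strings.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_line(line):
    """Turn one key=value log line into a dict (hypothetical format)."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def error_counts(lines):
    """Count ERROR records per caller; records without a caller go to 'unknown'."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record.get("status") == "ERROR":
            counts[record.get("caller", "unknown")] += 1
    return counts


lines = [
    'ts=2012-09-25T10:00:00 caller=4155550100 status=OK duration=42',
    'ts=2012-09-25T10:00:05 caller=4155550101 status=ERROR reason="network timeout"',
    'ts=2012-09-25T10:00:09 caller=4155550101 status=ERROR reason="dropped call"',
]
print(error_counts(lines))
```

Once records are parsed into fields like these, they can be indexed for ad hoc search as well as fed to batch jobs.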
![Page 12: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/12.jpg)
What Does a Search, Discovery and Analytics Platform Need?
• Fast, efficient, scalable search
- bulk and near real time indexing
- handle billions of records with sub-second search and faceting
• Large scale, cost effective storage and processing capabilities
- need whole data consumption and analysis
- experimentation/sampling tools
• NLP and machine learning tools that scale to enhance discovery and analysis
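To make "search and faceting" concrete, here is a minimal sketch that builds a Solr select URL with faceting enabled. The parameters `q`, `rows`, `wt`, `facet`, `facet.field` and `facet.mincount` are standard Solr request parameters; the host, the collection name `logs` and the field names are assumptions for illustration, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr /select URL that requests facet counts for the given fields."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),
    ]
    # Solr accepts facet.field repeated once per field.
    params += [("facet.field", field) for field in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, collection and field names.
url = facet_query_url("http://localhost:8983/solr/logs", "status:ERROR", ["caller", "region"])
print(url)
```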
![Page 13: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/13.jpg)
Building a Search, Discovery and Analytics Platform
Diagram: Inputs; API; Management; Search, Discovery, Analytics; Processing & Storage; Provisioning, Monitoring & Configuration; Bulk & Real Time
![Page 14: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/14.jpg)
LucidWorks Big Data
Diagram: Inputs; API; Provisioning, Monitoring & Configuration; Management; Search, Discovery, Analytics; Processing & Storage
![Page 15: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/15.jpg)
LucidWorks Big Data
Diagram: Inputs; Processing & Storage; API; Provisioning, Monitoring & Configuration; Management; Search, Discovery, Analytics
![Page 16: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/16.jpg)
LucidWorks Big Data
Diagram: Inputs; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; API; Provisioning, Monitoring & Configuration; Management
![Page 17: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/17.jpg)
LucidWorks Big Data
Diagram: Inputs; Mgmt; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; API; Provisioning, Monitoring & Configuration
![Page 18: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/18.jpg)
LucidWorks Big Data
Diagram: Inputs; Mgmt; Search, Discovery, Analytics; Processing & Storage; Provisioning, Monitoring & Configuration; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; API
![Page 19: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/19.jpg)
LucidWorks Big Data
Diagram: Inputs; API; Mgmt; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; Big Data; LucidWorks; WebHDFS; Admin; Service Mgmt; Data Mgmt; Provisioning, Monitoring & Configuration
![Page 20: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/20.jpg)
Components – LucidWorks Search
| Component | Benefit |
| --- | --- |
| LucidWorks Search (2.1.1): connector framework, security, user click framework, business process integration, administration | Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery. |
![Page 21: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/21.jpg)
Components - Hadoop
| Component | Benefit |
| --- | --- |
| Apache Hadoop (1.0.3) | Distributed computing and processing for ETL and analytics jobs. |
| Apache HBase (0.92) | Key-value store allowing fast access to the data. |
| Apache Oozie (modified 3.2) | Workflow orchestration. |
![Page 22: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/22.jpg)
Components - Analysis/ML/NLP
| Component | Benefit |
| --- | --- |
| Apache Mahout (trunk): k-means clustering, statistically interesting phrases, similar documents, classification | Distributed machine learning processing framework. |
| Apache UIMA (2.4.0) | Text processing and annotations. |
| Apache OpenNLP (1.5.2): named entity extraction | Machine learning toolkit for natural language processing. |
| Behemoth (modified trunk) | Simplifies M/R data extraction; abstracts annotation frameworks. |
| Apache Pig (0.9.2): ETL, log analysis | Helps with writing analytics M/R programs. |
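For readers unfamiliar with k-means clustering, the following is a minimal single-machine sketch of the algorithm in plain Python. It is not Mahout's distributed implementation and does not use Mahout's API; seeding the centroids from the first k points is an assumption chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Cluster points (tuples of numbers) into k groups with Lloyd's algorithm."""
    # Assumption: the first k points seed the centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

In a distributed setting the assignment step maps naturally onto mappers and the update step onto reducers, which is the general shape a MapReduce-based implementation takes.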
![Page 23: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/23.jpg)
Components - Middleware
| Component | Benefit |
| --- | --- |
| Apache ZooKeeper (3.4.3), with Netflix Curator | Service discovery. |
| Apache Kafka (0.7) | Logs consumption and event-based real-time document processing framework. |
![Page 24: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/24.jpg)
Components - SDA Engine
• RESTful services (Restlet 2.1)
• ZooKeeper + Netflix Curator
• Authentication and authorization
• Proxies for LucidWorks and WebHDFS API
• Workflow engine
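For context on the WebHDFS API being proxied, Hadoop's WebHDFS exposes file system operations as REST calls of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such a URL against Hadoop's own endpoint; how the SDA engine's proxy exposes these calls is not described in the slides, and the host name and path are hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    query = urlencode([("op", op)] + sorted(params.items()))
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode host; 50070 was the default NameNode HTTP port in Hadoop 1.x.
listing_url = webhdfs_url("namenode.example.com", 50070, "/data/logs", "LISTSTATUS")
print(listing_url)
```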
![Page 25: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/25.jpg)
Road Map
• Analytics themes
- relevance
- data quality
- discovery
- integration with other packages (R)
• Machine learning
- NLP
- recommendations
• Experiment management
![Page 26: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/26.jpg)
Conclusions
• Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users
• LucidWorks has combined many of these capabilities into LucidWorks Big Data
![Page 27: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/27.jpg)
LucidWorks Big Data
• Unified development platform for Big Data applications
• Integrated open source stack: Lucene/Solr, Hadoop, Mahout, NLP
• Single, uniform REST API
• Pre-tuned by open source industry experts
• Out of the box provisioning - hosted or on premise
![Page 28: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION](https://reader036.vdocument.in/reader036/viewer/2022081515/54c672e94a795913618b46be/html5/thumbnails/28.jpg)
www.lucidworks.com/bigdata
@iprovalov
Search | Discover | Analyze