Keynote: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop


Upload: lucenerevolution

Post on 24-May-2015


DESCRIPTION

Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond the needs of batch processing approaches. In many cases, one needs both ad hoc, real-time access to the content as well as the ability to discover interesting information based on a variety of features such as recommendations, summaries and other interesting insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system as well as deliver much needed insight into both the users and the content for the business. In this talk, we'll discuss a platform that enables large scale search, discovery and analytics over a wide variety of content utilizing tools like Solr, Hadoop, Mahout and others. The talk will discuss the architecture and capabilities of the system along with how the capabilities of Solr 4 help drive real time access for content discovery and analytics.

TRANSCRIPT

Page 1: KEYNOTE: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop

Search   Discover   Analyze

Grant Ingersoll, Chief Scientist, Lucid Imagination

Page 2

We All Know the Pain

• ________ data growth in the next ___ days/months/years
  – Many estimate 80-90% of data is “unstructured” (multi-structured?)
• The Age of “Data Paranoia”
  – What if I don’t collect it all?
  – What if I miss something or lose something?
  – What if I can’t store it long enough?
  – How do I secure it?
  – Can I afford to do any of this? Can I afford not to?
  – What if I can’t make sense of it?

Page 3

Big Data Premise and Promise

Premise                                  Promise
Large Scale Data Collection/Storage      ✔
Prevents Data Loss                       ✔
Long Term Storage                        ✔
Affordable                               ✔
New Science Delivering New Insights      ?

Page 4

Why Search, Discovery and Analytics (SDA)?

• User Needs:
  – Real-time, ad hoc access to content
  – Aggressive Prioritization based on Importance
  – Serendipity
• Batch processing isn’t enough
• Search is built for multi-structured
• Deeper analysis yields:
  – Business insight into users
  – Better Search and Discovery for users

[Diagram labels: Search, Discovery, Analytics]

Page 5

What do you need for SDA?

• Fast, efficient, scalable search
  – Bulk and Near Real Time Indexing
• Large scale, cost effective storage
• Large scale processing power
  – Large scale and distributed for whole data consumption and analysis
  – Sampling tools
  – Distributed In Memory where appropriate
• NLP and machine learning tools that scale to enhance discovery and analysis

Page 6

Example Use Cases

• Dark Data
  – Petabytes (and beyond) of content in storage with little insight into what’s in it
  – Forensics, Intelligence Gathering, Risk analysis, etc.
• Financial
  – Enable total customer view to better understand risks and opportunities
• Medical
  – Extend research capabilities through deeper analysis of scientific data, publications and field usage
• Social Media Monitoring
  – Understand and analyze social networks and their trends all the time, no matter the scale
• Commerce
  – Drive more sales through metric driven search and discovery without the guesswork

Page 7

Announcing LucidWorks Big Data Beta

An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users

Page 8

Architecture

Page 9

Key Features of Beta

• Combines the real time, ad hoc data accessibility of LucidWorks with compute and storage capabilities of Hadoop
• Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
• RESTful API supporting JSON input/output formats for easy integration
• Full Stack - Minimizes the impact of provisioning Hadoop, LucidWorks and other components
• Hosted in cloud and supported by Lucid Imagination

Page 10

APIs

• Search and Indexing
  – Full power of LucidWorks (Solr)
  – Bulk and Near Real Time Indexing
  – Sharded via SolrCloud
• Workflows
  – Predefined workflows ease common data tasks such as bulk indexing
• Administration
  – Access to key system information
  – User management
• Analytics
  – Common search analytics for better understanding of relevancy based on log analysis
  – Historical views
• Machine Learning
  – Clustering
  – Statistically Interesting Phrases
  – Future enhancements planned
• Proxy APIs
  – LucidWorks
  – WebHDFS
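The slide lists "Statistically Interesting Phrases" without saying how they are scored. One common approach, and the one Mahout's collocation tooling is built around, is Dunning's log-likelihood ratio (LLR) over bigram counts. The sketch below is a minimal illustration under that assumption, not the platform's implementation; the function names and the whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: bigram (a, b) seen together
    k12: a followed by something other than b
    k21: b preceded by something other than a
    k22: bigrams containing neither a first nor b second
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Score every adjacent token pair; higher means more 'interesting'."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    heads = Counter(a for a, _ in bigrams.elements())
    tails = Counter(b for _, b in bigrams.elements())
    total = sum(bigrams.values())
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = heads[a] - k11
        k21 = tails[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


if __name__ == "__main__":
    text = "new york is big and new york is old and the city is new"
    ranked = sorted(score_bigrams(text.split()).items(), key=lambda kv: -kv[1])
    for pair, score in ranked[:3]:
        print(pair, round(score, 3))
```

A table whose counts are independent scores zero, while pairs that co-occur far more often than chance score high, which is what makes the measure useful for surfacing phrases.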

Page 11

Under the Hood

LucidWorks 2.1

• Lucene/Solr 4.0-dev
• Sharded with SolrCloud
  – 1 second (default) soft commits for NRT updates
  – 1 minute (default) hard commits (no searcher reopen)
  – Transaction logs for recovery
  – Solr takes care of leader election, etc. so no more master/worker
• See Mark Miller’s talk on SolrCloud

SDA Engine

• RESTful services built on Restlet 2.1
• Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
• Authentication and authorization over SSL (optional)
• Proxies for LucidWorks and WebHDFS API
• Workflow engine coordinates data flow
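The commit intervals listed for SolrCloud on this slide map onto standard Solr 4 `solrconfig.xml` settings (`autoSoftCommit`, `autoCommit` with `openSearcher` set to false, and `updateLog`). The sketch below renders such a fragment from Python purely as an illustration; the helper name is hypothetical, and the actual configuration shipped with the beta is not shown in the talk.

```python
import xml.etree.ElementTree as ET


def build_update_handler(soft_commit_ms=1000, hard_commit_ms=60000):
    """Build an <updateHandler> element mirroring the slide's defaults."""
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: flush to stable storage every minute, without
    # opening a new searcher (so it does not affect visibility).
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(hard, "openSearcher").text = "false"

    # Soft commit: make new documents searchable roughly every second.
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_commit_ms)

    # Transaction log, used for recovery.
    log = ET.SubElement(handler, "updateLog")
    ET.SubElement(log, "str", {"name": "dir"}).text = "${solr.ulog.dir:}"
    return handler


if __name__ == "__main__":
    print(ET.tostring(build_update_handler(), encoding="unicode"))
```

Separating the two intervals is the design point: the frequent soft commit controls how quickly documents become visible, while the less frequent hard commit controls durability.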

Page 12

Under the Hood

• Apache Hadoop
  – Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
  – Leverage Pig and custom MR jobs for log processing and metric calculation
  – WebHDFS
• Apache Mahout
  – K-Means Clustering
  – Statistically Interesting Phrases
  – More to come
• Apache HBase
  – Key-value and time series of all calculated metrics
• Apache Pig
  – ETL
  – Log analysis -> HBase
• Apache ZooKeeper
  – Netflix Curator for service discovery and higher level ZK client
• Apache Kafka
  – Pub-sub for collecting logs from LucidWorks into HDFS
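For readers unfamiliar with the K-Means clustering mentioned under Apache Mahout, the following is a minimal single-machine sketch of the algorithm (Lloyd's iteration). It is not Mahout's distributed implementation; the function names, the squared Euclidean distance, and the seeded random initialization are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        assignments = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

In a Hadoop setting the assignment step is the natural map phase and the centroid update the reduce phase, which is why the algorithm scales well on that stack.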

Page 13

The Road Ahead

• Our approach is from search and discovery outwards to analytics
  – Analytics in beta are focused around analysis of search logs
• Analytics Themes
  – Relevance
  – Data quality
  – Discovery
  – Integration with other packages (R?)
• Machine Learning
  – Classification
  – NLP
• More analytics on the index itself?
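As an illustration of the kind of relevance analytics that can be derived from search logs, the sketch below computes per-query click-through rate and mean reciprocal rank. The log record shape (`query`, `clicked_rank`) and the function name are hypothetical; the talk does not describe the platform's log format or which metrics it calculates.

```python
from collections import defaultdict


def summarize(log):
    """Aggregate per-query relevance metrics from search log entries.

    Each entry is a dict with a "query" string and an optional
    "clicked_rank" (1-based position of the clicked result, or None
    when the user clicked nothing).
    """
    stats = defaultdict(lambda: {"searches": 0, "clicks": 0, "rr_sum": 0.0})
    for entry in log:
        s = stats[entry["query"]]
        s["searches"] += 1
        rank = entry.get("clicked_rank")
        if rank:
            s["clicks"] += 1
            s["rr_sum"] += 1.0 / rank
    return {
        query: {
            "searches": s["searches"],
            "ctr": s["clicks"] / s["searches"],
            "mrr": s["rr_sum"] / s["searches"],
        }
        for query, s in stats.items()
    }


if __name__ == "__main__":
    sample = [
        {"query": "solr", "clicked_rank": 1},
        {"query": "solr", "clicked_rank": 2},
        {"query": "solr", "clicked_rank": None},
        {"query": "hadoop", "clicked_rank": None},
    ]
    for query, metrics in summarize(sample).items():
        print(query, metrics)
```

Queries with many searches but a low click-through rate or a low reciprocal rank are candidates for relevance tuning, which ties the analytics back to search and discovery.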

Page 14

Contacts

• http://bit.ly/lucidworks-big-data
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers