Transcript
Page 1: KEYNOTE: Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop

Search | Discover | Analyze

Grant Ingersoll, Chief Scientist, Lucid Imagination

Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop

Page 2: We All Know the Pain

• ________ data growth in the next ___ days/months/years
  – Many estimate 80-90% of data is “unstructured” (multi-structured?)

• The Age of “Data Paranoia”
  – What if I don’t collect it all?
  – What if I miss something or lose something?
  – What if I can’t store it long enough?
  – How do I secure it?
  – Can I afford to do any of this? Can I afford not to?
  – What if I can’t make sense of it?

Page 3: Big Data Premise and Promise

Premise                               Promise
Large Scale Data Collection/Storage   ✔
Prevents Data Loss                    ✔
Long Term Storage                     ✔
Affordable                            ✔
New Science Delivering New Insights   ?

Page 4: Why Search, Discovery and Analytics (SDA)?

• User Needs:
  – Real-time, ad hoc access to content
  – Aggressive Prioritization based on Importance
  – Serendipity

• Batch processing isn’t enough

• Search is built for multi-structured

• Deeper analysis yields:
  – Business insight into users
  – Better Search and Discovery for users

Search | Discovery | Analytics

Page 5: What do you need for SDA?

• Fast, efficient, scalable search
  – Bulk and Near Real Time Indexing

• Large scale, cost effective storage

• Large scale processing power
  – Large scale and distributed for whole data consumption and analysis
  – Sampling tools
  – Distributed In Memory where appropriate

• NLP and machine learning tools that scale to enhance discovery and analysis

Page 6: Example Use Cases

• Dark Data
  – Petabytes (and beyond) of content in storage with little insight into what’s in it
  – Forensics, Intelligence Gathering, Risk analysis, etc.

• Financial
  – Enable total customer view to better understand risks and opportunities

• Medical
  – Extend research capabilities through deeper analysis of scientific data, publications and field usage

• Social Media Monitoring
  – Understand and analyze social networks and their trends all the time, no matter the scale

• Commerce
  – Drive more sales through metric driven search and discovery without the guesswork

Page 7: Announcing LucidWorks Big Data Beta

An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users

Page 8: Architecture

Page 9: Key Features of Beta

• Combines the real time, ad hoc data accessibility of LucidWorks with compute and storage capabilities of Hadoop

• Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users

• RESTful API supporting JSON input/output formats for easy integration

• Full Stack - Minimizes the impact of provisioning Hadoop, LucidWorks and other components

• Hosted in cloud and supported by Lucid Imagination

Page 10: APIs

• Search and Indexing
  – Full power of LucidWorks (Solr)
  – Bulk and Near Real Time Indexing
  – Sharded via SolrCloud

• Workflows
  – Predefined workflows ease common data tasks such as bulk indexing

• Administration
  – Access to key system information
  – User management

• Analytics
  – Common search analytics for better understanding of relevancy based on log analysis
  – Historical views

• Machine Learning
  – Clustering
  – Statistically Interesting Phrases
  – Future enhancements planned

• Proxy APIs
  – LucidWorks
  – WebHDFS
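The WebHDFS proxy listed above fronts Hadoop's standard WebHDFS REST interface, whose URLs follow the pattern `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs; it does not contact a cluster. The host name, the port 50070, and the file paths are placeholders chosen for illustration, and the slides do not say how the proxy itself exposes these URLs, so the example assumes a plain WebHDFS endpoint.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL: http://host:port/webhdfs/v1/<path>?op=<OP>&...

    host, port and path are supplied by the caller; nothing here is
    specific to LucidWorks Big Data (that mapping is an assumption).
    """
    if not path.startswith("/"):
        raise ValueError("path must be an absolute HDFS path")
    # WebHDFS operation names are conventionally written in upper case.
    query = urlencode({"op": op.upper(), **params})
    # quote() percent-encodes spaces and similar characters but keeps '/'.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and path, for illustration only.
print(webhdfs_url("namenode.example.com", 50070, "/logs/search", "liststatus"))
# prints http://namenode.example.com:50070/webhdfs/v1/logs/search?op=LISTSTATUS
```

The resulting URL can be passed to any HTTP client; `LISTSTATUS` and `OPEN` are read operations issued with GET.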

Page 11: Under the Hood

• Lucene/Solr 4.0-dev

• Sharded with SolrCloud
  – 1 second (default) soft commits for NRT updates
  – 1 minute (default) hard commits (no searcher reopen)
  – Transaction logs for recovery
  – Solr takes care of leader election, etc. so no more master/worker

• See Mark Miller’s talk on SolrCloud

• RESTful services built on Restlet 2.1

• Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator

• Authentication and authorization over SSL (optional)

• Proxies for LucidWorks and WebHDFS API

• Workflow engine coordinates data flow

LucidWorks 2.1 | SDA Engine
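The commit behaviour described on this slide (1 second soft commits, 1 minute hard commits that do not reopen the searcher, plus a transaction log) corresponds to standard `solrconfig.xml` settings in Solr 4. The sketch below generates that fragment from Python so the values are easy to see and vary. The element names are Solr's own; that the product configures them exactly this way is an assumption, and `commit_settings` is a hypothetical helper, not part of any Solr or LucidWorks API.

```python
import xml.etree.ElementTree as ET


def commit_settings(soft_ms=1000, hard_ms=60000):
    """Return an <updateHandler> element with the commit timings from the slide."""
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: flush to stable storage every hard_ms milliseconds,
    # without opening a new searcher.
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_ms)
    ET.SubElement(hard, "openSearcher").text = "false"

    # Soft commit: make new documents searchable every soft_ms milliseconds
    # (this is what provides near-real-time visibility).
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_ms)

    # Transaction log, used for recovery.
    log = ET.SubElement(handler, "updateLog")
    ET.SubElement(log, "str", {"name": "dir"}).text = "${solr.ulog.dir:}"
    return handler


print(ET.tostring(commit_settings(), encoding="unicode"))
```

Keeping the hard commit interval longer than the soft commit interval is what lets updates become visible quickly while durable flushes stay infrequent.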

Page 12: Under the Hood

• Apache Hadoop
  – Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
  – Leverage Pig and custom MR jobs for log processing and metric calculation
  – WebHDFS

• Apache Mahout
  – K-Means Clustering
  – Statistically Interesting Phrases
  – More to come

• Apache HBase
  – Key-value and time series of all calculated metrics

• Apache Pig
  – ETL
  – Log analysis -> HBase

• Apache ZooKeeper
  – Netflix Curator for service discovery and higher level ZK client

• Apache Kafka
  – Pub-sub for collecting logs from LucidWorks into HDFS
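For readers unfamiliar with the K-Means clustering listed under Apache Mahout, the following is a minimal single-machine sketch of the algorithm. It is not Mahout's API and not a MapReduce job (Mahout distributes the same assign-and-update loop across a cluster); the function names, the sample points and the fixed iteration count are all illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Two well-separated groups of 2-D points (made-up data).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

In a search setting the points would typically be document vectors (for example term weights) rather than 2-D coordinates.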

Page 13: The Road Ahead

• Our approach is from search and discovery outwards to analytics
  – Analytics in beta are focused around analysis of search logs

• Analytics Themes
  – Relevance
  – Data quality
  – Discovery
  – Integration with other packages (R?)

• Machine Learning
  – Classification
  – NLP

• More analytics on the index itself?

Page 14: Contacts

• http://bit.ly/lucidworks-big-data

• http://www.lucidimagination.com

• grant@lucidimagination.com

• @gsingers

