ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat


Post on 27-Jun-2020


ESS event: Big Data in Official Statistics

Antonino Virgillito, Istat

About me

• Head of Unit “Web and BI Technologies”, IT Directorate of Istat
• Project manager and technical coordinator of
  – Web technologies, mobile applications, BI & ETL
  – Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
• PhD in Computer Engineering
  – Field of research: large-scale distributed systems
• Trainer on Hadoop-MapReduce

Abstract

• The hype surrounding Big Data technologies hides the complexity of their adoption in national statistical institutes (NSIs)
  – Strong IT know-how is required for configuration and management
  – The technologies should be accessible through common statistical software
• Reasoned overview of the most popular Big Data technologies, with a focus on their usage in NSIs

Outline

• Motivations for Big Data technologies
• Overview of Big Data tools
• Adopting Big Data technologies in NSIs

MOTIVATIONS FOR BIG DATA TECHNOLOGIES

What does Big mean?

Background

Background

• Big Size: Tera-, Peta- … and growing
• Big Processing: a complex statistical method can become intractable even with data sets of “reasonable” size
• Big Quality: Big Data is often loosely structured and highly noisy

Big Size: Tools and Techniques

• Real Big Data begins where your usual tools fail…

Distributed file systems
• Clusters of commodity hardware that can scale to indefinite size simply by adding new nodes at runtime
• Overcome physical limitations
• Should be managed by a middleware platform (Hadoop HDFS)

Big Processing: Tools and Techniques

MapReduce
• Programming paradigm that enables programs to be executed in parallel on a cluster
• Not tied to a programming language; interfaces exist for all common languages and tools

Big Quality: Tools and Techniques

• Pre-processing for cleaning and organizing data
• Big Data are often unstructured, but the converse is not true: unstructured data are not necessarily big

Technical Challenges

• Handling Big Data necessarily requires relying on complex distributed technologies
• If you want to get something from real big data you have to deal with this complexity

Perspectives

Colleen, the Statistical Analyst: “Ok, but I want to use my tools and methods. I don’t want to touch this distributed stuff.”

Moss, the IT Guy: “I can set up the infrastructure and the data and help you with the tools.”

“Deal.”

Moss: “I don’t want to write programs for every analysis she makes.”

BIG DATA TOOLS OVERVIEW

Big Data IT Tools Proliferation

Our focus: IT Tools for Statistical Analysis of Big Data

• What are the basic tools?
• What is the best tool for the job?
• How do these tools integrate with common elements in an IT architecture?

Big Data IT Tools: the Common Denominator

Distributed Storage and Processing: Hadoop

• Distributed storage platform
• De facto standard for Big Data processing
• Open source project supported and/or adopted by most major vendors
• Virtually unlimited scalability
  – storage, memory, processing power

Hadoop Principle

Figure: one big data set (“I’m one big data set”) stored in HDFS on a Hadoop cluster.

• Hadoop is basically a middleware platform that manages a cluster of machines
• Its core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes
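The block idea can be made concrete with a minimal Python sketch. This is not HDFS code: the tiny block size, the node names and the round-robin placement are illustrative assumptions (real HDFS uses much larger blocks, replicates each block, and applies its own placement policy).

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's content into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: List[bytes], nodes: List[str]) -> Dict[str, List[int]]:
    """Assign block indexes to nodes round-robin (a toy placement policy)."""
    placement: Dict[str, List[int]] = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


data = b"I'm one big data set"  # 20 bytes standing in for a huge file
blocks = split_into_blocks(data, block_size=8)

# Two hypothetical nodes share the blocks...
print(place_blocks(blocks, ["node1", "node2"]))
# ...and adding a node only changes where blocks live, not the file content.
print(place_blocks(blocks, ["node1", "node2", "node3"]))
```

Joining the blocks in index order gives back the original content, which is why the file still looks like a single data set to its users.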

The MapReduce Paradigm

• Parallel processing paradigm
• The programmer is unaware of parallelism
• Programs are structured into a two-phase execution
  – Map: data elements are classified into categories
  – Reduce: an algorithm is applied to all the elements of the same category

Figure: three categories of elements, labelled x 4, x 5 and x 3.
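The two phases can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the `map_reduce` helper and the shape names are hypothetical, and counting is just one possible reduce algorithm.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Tuple[Hashable, int]],
    reducer: Callable[[List[int]], int],
) -> Dict[Hashable, int]:
    # Map: classify each element into a category (key) with a value.
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        groups[key].append(value)  # grouping values by key
    # Reduce: apply the same algorithm to all values of one category.
    return {key: reducer(values) for key, values in groups.items()}


shapes = ["circle", "square", "circle", "triangle", "square", "circle"]
counts = map_reduce(shapes, mapper=lambda s: (s, 1), reducer=sum)
print(counts)
```

Because each category is reduced independently of the others, a framework is free to run the work for different categories on different machines without the programmer writing any parallel code.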

MapReduce and Hadoop

Figure: the Hadoop stack, with MapReduce drawn above HDFS.

MapReduce is logically placed on top of HDFS

MapReduce and Hadoop

Figure: a Hadoop cluster in which every node runs both HDFS and MR.

• MR works on (big) files loaded on HDFS
• Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
• Output is written to HDFS

Scalability principle: perform the computation where the data is
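A rough Python simulation of this principle follows. The node names and block contents are made up, and the nodes are visited one after another here, whereas a real cluster would run them in parallel. The point is only that each node reads nothing but its own blocks and ships back a small partial result.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical cluster state: each node already stores some blocks of the file.
blocks_on_node: Dict[str, List[str]] = {
    "node1": ["a b a", "c a"],
    "node2": ["b b c"],
    "node3": ["a c c c"],
}


def local_job(blocks: List[str]) -> Counter:
    """Runs on one node: counts words only in the blocks stored there."""
    partial: Counter = Counter()
    for block in blocks:
        partial.update(block.split())
    return partial


# Each node computes a partial result next to its data...
partials = [local_job(blocks) for blocks in blocks_on_node.values()]
# ...and only these small partial results are merged.
total = sum(partials, Counter())
print(dict(total))
```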

MapReduce Applications

• Naturally targeted at counts and aggregations
  – 1-line aggregation algorithm
• Collecting & combining
  – It all began there… inverted index computation in Google
• Machine learning, cross-correlation
• Graph analysis
  – “People you may know”
  – Geographical data: in Google Maps, finding the nearest feature to a given address or location
• Pre-processing of unstructured data
• Can also handle binary files
  – NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
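As an illustration of the “collecting & combining” case, here is a small Python sketch of an inverted index built with a map step and a reduce step. The documents and function names are invented for the example, and everything runs in one process; it is not a description of Google's actual implementation.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Emit one (word, document id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_phase(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Collapse all document ids seen for a word into a sorted unique list."""
    return word, sorted(set(doc_ids))


def inverted_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_phase(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())


docs = {"d1": "big data tools", "d2": "big questions", "d3": "data quality"}
index = inverted_index(docs)
print(index["big"], index["data"])
```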

Data Analysis with Hadoop

Moss, the IT Guy: “I finally loaded those elephant-size data sets into Hadoop!”

Colleen, the Statistical Analyst: “Cool! Now how can I analyze them?”

Moss: “It’s simple! Write a MapReduce program in Java!”

Colleen: “No.”

Moss: “Ok, I’ll do that for you.”

“No.”

• MapReduce programs can be written in various programming languages
• Several tools are also available that translate high-level analysis languages into MapReduce programs
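One way to avoid Java is Hadoop Streaming, a utility shipped with Hadoop that lets any program reading lines from standard input and writing tab-separated key/value lines to standard output act as mapper or reducer. The sketch below follows that line-oriented convention in Python but takes plain iterables so it can be tried locally; the wiring to standard input and the job submission are deliberately left out, and the function names are this example's own.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of each word; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


lines = ["big data", "big questions"]
mapped = sorted(mapper(lines))  # stands in for the sort between the phases
result = list(reducer(mapped))
print(result)
```

The reducer relies on equal keys arriving next to each other, which is why the local test sorts the mapper output first.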

Tools for Data Analysis with Hadoop

Figure labels: High-level languages for data manipulation; Pig; Hive; Statistical Software; Hadoop (MapReduce, HDFS).

Using Hadoop from Statistical Software

• R
  – Packages rhdfs, rmr
  – Issue HDFS commands and write MapReduce jobs
• SAS
  – SAS In-Memory Statistics
  – SAS/ACCESS
    • Makes data stored in Hadoop appear as native SAS datasets
    • Uses Hive interface
• SPSS
  – Transparent integration with Hadoop data

Apache Pig

• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
  – Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
• Allows data manipulation scripts to be written in a high-level language called Pig Latin
  – Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations
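To show the kind of work meant by “joins and aggregations”, here is a plain Python sketch of a join followed by a grouped average. It is not Pig Latin and does not show Pig syntax; the shop and price records, and both helper functions, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical inputs: (shop_id, region) and (shop_id, price) records.
shops: List[Tuple[str, str]] = [("s1", "North"), ("s2", "South"), ("s3", "North")]
prices: List[Tuple[str, float]] = [("s1", 2.0), ("s2", 3.0), ("s1", 4.0), ("s3", 6.0)]


def join_on_key(left, right):
    """Join: pair every right-hand value with the left-hand value of its key."""
    lookup = dict(left)
    return [(key, lookup[key], value) for key, value in right if key in lookup]


def average_by_group(rows) -> Dict[str, float]:
    """Aggregation: group joined rows by region and average the prices."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for _shop, region, price in rows:
        groups[region].append(price)
    return {region: sum(vals) / len(vals) for region, vals in groups.items()}


joined = join_on_key(shops, prices)
averages = average_by_group(joined)
print(averages)
```

A high-level tool expresses each of these two steps as a single statement and leaves the translation into MapReduce jobs to the interpreter.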

Pig Example

Real example of a Pig script used at Twitter (script image not reproduced in this transcript)

The Java equivalent… (code image not reproduced in this transcript)

Hadoop-MapReduce Limitations

• Not usable in transactional applications
• Not suited to real-time analysis
  – MapReduce jobs run in batch mode
• HDFS is an append-only file system
  – Can insert and delete, but cannot update
• MapReduce jobs run in batch mode
  – You cannot expect low response latency
  – Not suited for interactive, real-time operations and/or random-access reads/writes

 

NoSQL databases

• Distributed storage platforms that allow for lower-latency processing
• NoSQL: Not Only SQL
• Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
  – No joins, no transactions, no indexes

NoSQL Databases

• Popular choices: HBase and Cassandra
• Use a column-oriented model
  – Data organized in families of key: value pairs
  – Variable schema where each row is slightly different
  – Optimized for sparse data
• Can be accessed from R

Figure: HBase is based on Hadoop (MapReduce and HDFS); Cassandra is a fully distributed platform, not based on Hadoop; R sits on top.
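The column-oriented model can be pictured with nested dictionaries. The Python sketch below is a toy in-memory model under stated assumptions, not the HBase or Cassandra API: the `put` and `get` helpers, the row keys and the family names are all invented for illustration.

```python
from typing import Dict, Optional

# row key -> column family -> column name -> value
Row = Dict[str, Dict[str, str]]
table: Dict[str, Row] = {}


def put(row_key: str, family: str, column: str, value: str) -> None:
    """Store one cell; rows and columns are created on first write."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get(row_key: str, family: str, column: str) -> Optional[str]:
    """Read one cell; absent cells take no storage and return None."""
    return table.get(row_key, {}).get(family, {}).get(column)


put("resp-001", "contact", "email", "a@example.org")
put("resp-001", "answers", "q1", "yes")
put("resp-002", "answers", "q7", "42")  # different columns, same table

print(get("resp-001", "contact", "email"))
print(get("resp-002", "contact", "email"))  # missing cell
```

Each row carries only the columns it actually has, which is what makes the schema variable and the storage efficient for sparse data.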

Big Data Tools in the IT Architecture

• Hadoop is not a DB/DW replacement: it sits beside traditional data technologies in a modern IT architecture
• The outcome of Big Data processing can be stored in a traditional DB/DW
• Modern (visual) analytics tools can integrate both kinds of data sources

Augmented IT Architecture

Figure labels:
• Data source: multi-structured big data
• Hadoop: initial processing and cleaning; keeps multi-structured historical data online and accessible
• Stores: NoSQL DB, DW
• Analysis tools: statistical software, Visual Analytics, BI
• Analysis results

ADOPTING BIG DATA TECHNOLOGIES

Hadoop Deployment Options

• In-house
  – Maximum control of configuration and costs
  – High complexity
• Cloud
  – Pay-per-use billing model
  – Cuts hardware and software costs and eliminates management burden
  – Privacy issues!
• Appliance
  – Easy
  – Costly

IT Skills for Big Data Tools

• System manager: sets up and manages the physical infrastructure
• Data Integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs
• Data Engineer: designs the IT architecture for collecting and processing; designs and develops MR jobs or Pig scripts
• Data scientist: derives new insights by applying statistical analysis methods to different, heterogeneous, possibly big, data sources; has strong IT foundations and can develop her algorithms using both statistical tools and Hadoop
• Data analyst: uses statistical tools and VA

Skills shown in the figure: SQL; R, SAS, SPSS; BI and Visual Analytics; Excel; Linux; MapReduce; ETL; Pig; Java

Suggestions for ESS

• Training on data science for statisticians and on Big Data engineering for IT staff
• Implementation of standard methods and tools in a Hadoop-compliant version
• Eurostat establishing repositories of Big Data and allowing NSIs to access them
• Setup of a “statistical cloud”, a Hadoop cluster shared by NSIs
• Possible agreements with providers of IT solutions (Google, etc.)

Conclusions

• Big Data tools make sense when you really have serious size issues to deal with
  – Not much use for a 2-node Hadoop cluster
• No value in jumping on the Big Data bandwagon for its own sake
  – High costs
  – You can still be a “data scientist”...
• However, Big Data engineering provides new opportunities
  – Collect more data
  – “Ask bigger questions”


Questions