1.apache hive

Upload: almase

Post on 04-Jun-2018


  • 8/14/2019 1.Apache Hive

    1/23

    Execution Environments for Distributed Computing

    Apache Hive

    EEDC 34330

    Master in Computer Architecture, Networks and Systems - CANS

    Homework number: 3

    Group number: EEDC-1

    Group members:

    - Hugo Prez
    - Sergio Mendoza
    - Fenoy


    Outline

    Introduction

    Hive Database

    - Data Model
    - Query Language

    Hive Architecture

    Conclusions


    Introduction

    Origins at Facebook...

    Facebook generates 500,000,000 logs per day

    Facebook shares a billion pieces of content daily

    Facebook stores a vast amount of data


    Introduction

    What's the problem?

    - 250 million photos per day
    - 2.7 billion likes and comments per day
    - 2 billion total registered users
    - 100 billion friendships
    - ...

    TOO MUCH DATA!!


    Introduction

    What is Apache Hive?

    Hive is a data warehouse infrastructure

    And what is a data warehouse (DW)?

    A DW is a database used for reporting and analysis


    Introduction

    How does Apache Hive work?

    Hive is built on top of Hadoop

    Hive stores data in HDFS

    Hive compiles SQL queries into MapReduce jobs and runs the jobs on the cluster

    http://en.wikipedia.org/wiki/MapReduce
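
    To make the compilation idea concrete, here is a minimal sketch in plain Python of how an aggregation such as `SELECT school, COUNT(1) FROM profiles GROUP BY school` could be split into a map phase, a shuffle and a reduce phase. It is only an illustration: the function names and the sample rows are assumptions, and it does not use Hive or Hadoop code.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["school"], 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values of each key, which yields COUNT(1) per group."""
    return {key: sum(values) for key, values in groups.items()}

def run_group_by_count(rows):
    """Chain the three phases for a GROUP BY ... COUNT(1) style query."""
    return reduce_phase(shuffle(map_phase(rows)))

# Hypothetical sample rows standing in for a `profiles` table.
profiles = [
    {"userid": 1, "school": "UPC"},
    {"userid": 2, "school": "MIT"},
    {"userid": 3, "school": "UPC"},
]

counts = run_group_by_count(profiles)
```

    On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the sketch only shows how the work is divided.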

    Introduction

    How does Apache Hive work?

    HiveQL query


    Introduction

    How does a simple web app work?

    MySQL query




    HiveQL

    Hive defines a simple SQL-like query language, called HiveQL (QL)

    - Supports DDL and DML.

    - Users can embed custom map-reduce scripts.

    - Supports user-defined functions (UDF), aggregate functions (UDAF) and table functions (UDTF).


    HiveQL Extract

    -- Outer step: a custom reduce script receives the aggregated rows
    REDUCE subq2.school, subq2.meme, subq2.cnt
      USING 'top10.py' AS (school, meme, cnt)
    FROM (
      SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (
        -- Inner step: a custom map script processes each joined status update
        MAP b.school, a.status
          USING 'meme-extractor.py' AS (school, meme)
        FROM status_updates a JOIN profiles b ON (a.userid = b.userid)
      ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt DESC
    ) subq2;
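
    The slides do not show the scripts named in the query. As a rough idea of what a reduce-side script could look like, here is a hypothetical sketch in Python. Hive passes rows to such scripts as tab-separated lines on standard input and reads tab-separated lines back from standard output. The function name `top_n_per_school` and its logic are assumptions, not the contents of the original `top10.py`, and the sketch assumes that all rows of one school arrive contiguously at the same script instance.

```python
import heapq
from itertools import groupby

def top_n_per_school(lines, n=10):
    """Yield the n rows with the highest count for each school.

    Each input line is expected to be 'school<TAB>meme<TAB>cnt'.
    Rows of the same school are assumed to be adjacent in the input.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for _, group in groupby(rows, key=lambda row: row[0]):
        # Keep only the n rows with the largest numeric count in this group.
        best = heapq.nlargest(n, group, key=lambda row: int(row[2]))
        for school, meme, cnt in best:
            yield f"{school}\t{meme}\t{cnt}"

# Hypothetical input, shaped like the rows the subquery would produce.
sample = [
    "UPC\tcat\t5\n",
    "UPC\tdog\t9\n",
    "UPC\tfox\t1\n",
    "MIT\towl\t2\n",
]
result = list(top_n_per_school(sample, n=2))
```

    Used as a real streaming script, the same function would be fed `sys.stdin` and each yielded line would be printed to standard output.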




    Architecture

    External Interfaces - provide both user interfaces, such as the command line (CLI) and a web UI, and application programming interfaces (APIs), such as JDBC and ODBC

    Thrift Server exposes a very simple client API to execute HiveQL statements

    Metastore is the system catalog. All other components of Hive interact with the metastore.


    Architecture

    Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution

    Compiler translates statements into a plan which consists of a DAG of map-reduce jobs

    The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order
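
    Topological order means that a job is submitted only after every job it depends on. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) is shown below; the job names and the dependency map are hypothetical, and this is not the Hive driver's code.

```python
from graphlib import TopologicalSorter

def submission_order(dependencies):
    """Return job names so that every job comes after the jobs it depends on.

    `dependencies` maps each job to the set of jobs whose output it reads.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical plan: two independent jobs feed a group-by, which feeds a sort.
plan = {
    "join_job": set(),
    "profile_scan_job": set(),
    "group_by_job": {"join_job", "profile_scan_job"},
    "sort_job": {"group_by_job"},
}

order = submission_order(plan)
```

    The relative order of the two independent jobs is not fixed, which is why they could also run at the same time. A plan that contains a cycle is not a DAG, and `TopologicalSorter` raises `graphlib.CycleError` for it.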


    Metastore

    The metastore is the system catalog which contains metadata about the tables stored in Hive.

    - Database - is a namespace for tables.
    - Table - metadata for a table contains the list of columns and their types, owner, storage and SerDe information.
    - Partition - each partition can have its own columns, SerDe and storage information.
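
    The three kinds of metastore objects can be pictured as nested records. The sketch below models them with Python dataclasses; all class and field names are illustrative assumptions rather than Hive's actual schema, and the rule that a partition without its own columns falls back to the table's columns is also an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Column:
    name: str
    type: str

@dataclass
class StorageInfo:
    location: str  # where the data files live, e.g. an HDFS directory
    serde: str     # name of the serializer/deserializer used for the rows

@dataclass
class Partition:
    values: Dict[str, str]                 # partition key -> value
    storage: StorageInfo                   # a partition has its own storage and SerDe
    columns: Optional[List[Column]] = None # None: use the table's columns (assumption)

@dataclass
class Table:
    name: str
    owner: str
    columns: List[Column]
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)

    def columns_for(self, partition: Partition) -> List[Column]:
        """Columns that apply to a partition, falling back to the table's."""
        return partition.columns if partition.columns is not None else self.columns

@dataclass
class Database:
    name: str                                        # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)

# Hypothetical catalog entry for the status_updates table used earlier.
cols = [Column("userid", "int"), Column("status", "string")]
table = Table("status_updates", "hive", cols,
              StorageInfo("/warehouse/status_updates", "ExampleSerDe"))
part = Partition({"ds": "2011-04-01"},
                 StorageInfo("/warehouse/status_updates/ds=2011-04-01", "ExampleSerDe"))
table.partitions.append(part)
db = Database("default", {table.name: table})
```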


    Query Compiler

    Parser transforms a query string to a parse tree representation.

    Semantic Analyzer transforms the parse tree to a block-based internal query representation.

    Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators.

    Optimizer performs multiple passes over the logical plan and rewrites it in several ways.

    Physical Plan Generator converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs.




    Links:

    http://i.stanford.edu/~ragho/hive-icde2010.pdf

    http://www.vldb.org/pvldb/2/vldb09-938.pdf

    http://hive.apache.org/

    https://cwiki.apache.org/Hive/languagemanual-transform.html

    http://biggdata.blogspot.com/2011/04/refreshing-trendingtopics-website-data.html

    http://code.google.com/p/hive-mrc/wiki/AboutHiveCore