shadha hive presentation and quiz - wmich.edu · qa design for online transaction processing ......

11/18/15


by Shatha Muhi

CS6030

- Big Data: collections of large datasets characterized by huge volume, high velocity, and a wide variety of data.

- The Apache Hadoop framework emerged to solve the challenges of managing and processing big data.

- Hadoop was not designed to migrate data from traditional relational databases into HDFS.

- This is where Hive comes in.


- Hive was initially developed at Facebook. The Apache Software Foundation then took it up and developed it further as open source in 2008 under the name Apache Hive.


- Hive is petabyte-scale data warehouse software for managing and querying large unstructured datasets, residing in distributed storage in a Hadoop cluster, as if they were structured. It provides:
  - Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
  - Tools to enable easy data extract/transform/load (ETL).
  - A mechanism to impose structure on a variety of data formats.
  - Query execution via MapReduce.
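
As a rough illustration of what imposing structure on a data format means, the Python sketch below applies a schema to raw tab-delimited text at read time. It is a toy model of the idea, not Hive's implementation; the schema, the column names (`user_id`, `page`, `duration_sec`), and the helper `apply_schema` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schema: (column name, converter) pairs, loosely mirroring
# column declarations such as `user_id INT, page STRING, duration_sec DOUBLE`.
Schema = List[Tuple[str, Callable[[str], object]]]

PAGE_VIEW_SCHEMA: Schema = [
    ("user_id", int),
    ("page", str),
    ("duration_sec", float),
]


def apply_schema(line: str, schema: Schema, delimiter: str = "\t") -> Dict[str, object]:
    """Interpret one raw text line as a typed row at read time."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    return {name: convert(value) for (name, convert), value in zip(schema, fields)}


# The stored data stays plain text; structure is applied only when it is read.
raw_lines = ["42\t/home\t3.5", "7\t/cart\t12.0"]
rows = [apply_schema(line, PAGE_VIEW_SCHEMA) for line in raw_lines]
```

Each entry of `rows` is a typed record (integer `user_id`, float `duration_sec`), while the underlying file is left untouched.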


- True or false: Hadoop is a good infrastructure based on a relational database, while Hive is just a user interface.
  - False.
  - True.


- Hive is not:
  - A relational database.
  - A design for Online Transaction Processing (OLTP).
  - A language for real-time queries and row-level updates.


- True or false: The Internet of Things (IoT) needs a real-time system, and Hive is the best choice because it is not a batch system.
  - False.
  - True.


- True or false: In the Hive architecture, the component that connects the HiveQL process engine and MapReduce is the Hadoop Distributed File System (HDFS), which uses the flavor of MapReduce.
  - True.
  - False.


1. Execute query.
2. Get plan.
3. Get metadata.
4. Send metadata.
5. Send plan.
6. Execute plan.
7. Execute job.
7.1. Metadata ops.
8. Fetch results.
9. Send results.
10. Send results.

- Uses an SQL-like query language called HiveQL (HQL), unlike Pig.

- Does the hard work of mapping data operations to the low-level MapReduce Java API, which is difficult even for experienced Java programmers.

- Pluggable: the underlying execution engine can be changed from MapReduce to Tez or Spark.

- Fault-tolerant, unlike several newer engines such as Impala.

- Feature-rich: as the oldest engine, it offers more features than newer engines. For example, it supports nested data types (structs, arrays, maps, etc.).

- One way to enhance newer versions of Hive is to use an execution engine other than MapReduce, such as:
  - Tez.
  - Pig.
  - Impala.
  - Spark.


- Performance is Hive's main drawback because it uses MapReduce as the execution engine.

- MapReduce is not a good choice for running ad hoc and interactive queries because it reads and writes to disk extensively and has a high startup cost.

- For instance, a multi-join query could take minutes, not because of the data size but because of the number of reads and writes to disk.

- Pluggable engines and vectorized query execution are the two main enhancements that reduce this performance drawback.


- The following features make Hive very popular and a good choice for batch systems:
  - Oldest system.
  - Pluggable.
  - Feature-rich.
  - Vectorized query execution.
  - Shared metastore.


- Primitive types:
  - Integers: TINYINT, SMALLINT, INT, BIGINT.
  - Boolean: BOOLEAN.
  - Floating-point numbers: FLOAT, DOUBLE.
  - String: STRING.

- Complex types:
  - Structs: {a INT; b INT}.
  - Maps: M['group'].
  - Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
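
The access patterns for the complex types have close analogues in Python, which may help readers who know Python better than HiveQL. The sketch below is only an analogy: the `Point` class and the value stored under `'group'` are made up for illustration, and Python containers are not Hive types.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Point:
    """Stand-in for a struct with two integer fields, {a INT; b INT}."""
    a: int
    b: int


# Stand-in for a map column value; the value 3 is an arbitrary example.
M: Dict[str, int] = {"group": 3}

# Stand-in for an array column value.
A: List[str] = ["a", "b", "c"]

p = Point(a=1, b=2)
struct_field = p.a       # struct fields are read with dot notation
map_value = M["group"]   # map values are looked up by key
array_item = A[1]        # indexing is zero-based, so index 1 is the second item
```

As on the slide, index 1 of the array selects `'b'`, because indexing starts at zero.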


- The Hive engine's rich features include complex data types such as:
  - Struct.
  - Integer.
  - Map.
  - Array.


- Tables:
  - Analogous to tables in relational databases.
  - Each table has a corresponding directory in HDFS.

- Partitions:
  - Analogous to indexes on partition columns.
  - Nested sub-directories in HDFS for each combination of partition column values.
  - Allow users to retrieve rows efficiently.
  - For instance, range-partition tables by date.
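
A minimal Python sketch of how partition column values could map to nested sub-directories, and why a filter on a partition column can skip whole directories. The table name `page_views`, the partition columns `dt` and `country`, and both helper functions are hypothetical; `/user/hive/warehouse` is assumed as the warehouse root.

```python
import posixpath
from typing import Dict, List

# Assumed table directory; the table name is made up for this example.
TABLE_DIR = "/user/hive/warehouse/page_views"


def partition_path(table_dir: str, partition_values: Dict[str, str]) -> str:
    """Build one nested sub-directory per partition column, in column order."""
    parts = [f"{column}={value}" for column, value in partition_values.items()]
    return posixpath.join(table_dir, *parts)


def prune(paths: List[str], column: str, value: str) -> List[str]:
    """Keep only the directories whose partition value matches the filter."""
    wanted = f"{column}={value}"
    return [path for path in paths if wanted in path.split("/")]


paths = [
    partition_path(TABLE_DIR, {"dt": day, "country": country})
    for day in ("2015-11-17", "2015-11-18")
    for country in ("US", "CA")
]

# A query filtered on dt = '2015-11-18' only needs to read two of the four directories.
selected = prune(paths, "dt", "2015-11-18")
```

Only the directories under `dt=2015-11-18` remain in `selected`, so the rows for other dates are never read.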


- Buckets:
  - Split data based on the hash of a column, mainly for parallelism.
  - Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table.
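
Bucket assignment can be sketched in Python as a hash taken modulo the number of buckets. This assumes an integer column whose hash is simply its own value; the hash Hive actually applies depends on the column type and version, and the names `bucket_for` and `bucketize` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_for(user_id: int, num_buckets: int) -> int:
    """Pick a bucket from the column value's hash.

    Assumption: the hash of an integer is the integer itself. The mask keeps
    the hash non-negative so the modulo always lands in [0, num_buckets).
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


def bucketize(user_ids: Iterable[int], num_buckets: int) -> Dict[int, List[int]]:
    """Group column values by the bucket they would be written to."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for user_id in user_ids:
        buckets[bucket_for(user_id, num_buckets)].append(user_id)
    return dict(buckets)


buckets = bucketize([1, 2, 3, 4, 5, 6, 7, 8], num_buckets=4)
```

With four buckets, the eight sample values spread evenly, two per bucket, which is what lets separate tasks work on the buckets in parallel.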


- True or false: Partitioning in Hive divides rows and efficiently retrieves columns.
  - True.
  - False.


- The use of SQL syntax and Hadoop features has made Hive very popular and easy to program.

- Despite its latency problem, being feature-rich has made it more widely used than the newer engines.

- Its many enhancements and its flexibility make it, to date, the best choice for processing and querying data in a batch system.


