shadha hive presentation and quiz - wmich.edu · qa design for online transaction processing ......
TRANSCRIPT
11/18/15
1
by Shatha Muhi
CS6030
1
q Big Data: collections of large datasets (huge volume, high velocity, and variety of data).
q The Apache Hadoop framework emerged to solve big data management and processing challenges.
q Hadoop was not designed to migrate data from traditional relational databases to its HDFS.
q This is where Hive comes in.
2
q The Apache Software Foundation took Hive and developed it further as open source in 2008 under the name Apache Hive.
3
q Petabyte-scale data warehouse software for managing and querying large unstructured datasets, residing in distributed storage in a Hadoop cluster, as if they were structured. It provides:
q Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
q Tools to enable easy data extract/transform/load (ETL).
q A mechanism to impose structure on a variety of data formats.
q Query execution via MapReduce.
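The idea of imposing structure on raw files can be illustrated with a minimal Python sketch. It is not Hive code: the file contents, the column names, and the helper `read_with_schema` are hypothetical and only show a schema being applied when the data is read rather than when it is stored.

```python
import csv
import io

# Hypothetical raw file contents, as they might sit in distributed storage.
RAW_LINES = "1,alice,3.5\n2,bob,4.0\n"

# Hypothetical schema: column names paired with Python converters.
SCHEMA = [("id", int), ("name", str), ("score", float)]


def read_with_schema(raw_text, schema, delimiter=","):
    """Yield one dict per line, applying the schema only at read time."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    for fields in reader:
        yield {name: convert(value)
               for (name, convert), value in zip(schema, fields)}


rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

The raw text is never rewritten; a different schema or delimiter could be applied to the same bytes on the next read.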
4
q Hadoop is a good infrastructure based on a relational database, while Hive is just a user interface.
q False. q True.
5
q Hive is not:
q A relational database.
q A design for Online Transaction Processing (OLTP).
q A language for real-time queries and row-level updates.
6
q The Internet of Things (IoT) needs a real-time system, and Hive is the best choice because it isn’t a batch system.
q False q True
7
8
q The conjunction between the HiveQL process engine and MapReduce in the Hive architecture is the Hadoop Distributed File System (HDFS), which uses the flavor of MapReduce.
q True. q False.
9
10
1. Execute query.
2. Get plan.
3. Get metadata.
4. Send metadata.
5. Send plan.
6. Execute plan.
7. Execute job.
7.1. Metadata ops.
8. Fetch results.
9. Send results.
10. Send results.
q Uses an SQL-like query language called HiveQL (HQL), unlike Pig.
q Does the dirty (hard) work of mapping data operations to the low-level MapReduce Java API, which is hard even for experienced Java programmers.
q Pluggable: the underlying execution engine can be changed from MapReduce to Tez or Spark.
q Fault tolerant, unlike some other engines, including newer ones such as Impala.
q Feature-rich because it is the oldest engine, while newer engines have fewer features. For example, it supports nested data types (structs, arrays, maps, etc.).
11
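To make the "mapping data operations to MapReduce" point concrete, here is a small Python sketch of how a query such as `SELECT user, COUNT(*) FROM visits GROUP BY user` could be expressed as map, shuffle, and reduce steps. The table `visits`, the sample rows, and the function names are hypothetical, and this is not the plan Hive actually generates; it only shows the kind of work a user would otherwise write by hand.

```python
from collections import defaultdict

# Hypothetical input rows: (user, page) pairs.
ROWS = [("ann", "home"), ("bob", "home"), ("ann", "cart"), ("ann", "home")]


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for user, _page in rows:
        yield user, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(ROWS)))
```

With a SQL-like language, the single GROUP BY statement replaces all three hand-written phases.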
q One of the methods to enhance new versions of Hive is to use an execution engine other than MapReduce, such as:
q Tez. q Pig. q Impala. q Spark.
12
q Performance is the main drawback, because Hive uses MapReduce as the execution engine.
q MapReduce is not a good choice for running ad hoc and interactive queries because it reads and writes to disk extensively and has a high startup cost.
q For instance, a multi-join query could take minutes, not because of the data size but because of the number of reads and writes to disk.
q Pluggable engines and vectorized query execution are two main enhancements to reduce the effects of the performance drawback.
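The general idea behind vectorized query execution, processing a batch of rows per step instead of one row per step, can be sketched in Python as follows. This is an illustration under stated assumptions, not Hive's implementation: the batch size, the sample column, and both function names are hypothetical, and plain Python will not show the speedup a real engine gets from tight loops over column batches.

```python
# Hypothetical batch size, chosen only for illustration.
BATCH_SIZE = 4


def sum_row_at_a_time(prices, threshold):
    """Evaluate the filter and the sum one row per step."""
    total = 0
    for price in prices:
        if price > threshold:
            total += price
    return total


def sum_vectorized(prices, threshold, batch_size=BATCH_SIZE):
    """Evaluate the same query one column batch per step."""
    total = 0
    for start in range(0, len(prices), batch_size):
        batch = prices[start:start + batch_size]
        selected = [p for p in batch if p > threshold]  # filter the whole batch
        total += sum(selected)                          # aggregate the whole batch
    return total


# Hypothetical column of values.
PRICES = [5, 12, 7, 30, 18, 2, 9, 21, 40]
```

Both functions return the same result; only the unit of work per step differs.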
13
q The following are the features that make Hive very popular and a good choice in batch systems:
q Oldest system. q Pluggable. q Feature-rich. q Vectorized query execution. q Shared metastore.
14
q Primitive types:
q Integers: TINYINT, SMALLINT, INT, BIGINT.
q Boolean: BOOLEAN.
q Floating point numbers: FLOAT, DOUBLE.
q String: STRING.
q Complex types:
q Structs: {a INT; b INT}.
q Maps: M['group'].
q Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
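The three complex types have close Python analogues, which may help in reading the notation above. This is only an analogy: the class name `S` and the sample values are hypothetical, and Python types do not enforce Hive's typing rules.

```python
from typing import NamedTuple


class S(NamedTuple):
    """Analogue of a struct {a INT; b INT}: fixed, named, typed fields."""
    a: int
    b: int


s = S(a=1, b=2)

# Analogue of a map: values looked up by key, as in M['group'].
M = {"group": "admins"}

# Analogue of an array: ordered elements with zero-based indexing.
A = ["a", "b", "c"]

struct_field = s.a
map_value = M["group"]
array_element = A[1]  # zero-based, so this is the second element, 'b'
```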
15
q The Hive engine has rich features, including complex data types such as:
q Struct. q Integer. q Map. q Array.
16
q Tables:
q Analogous to tables in relational DBs.
q Each table has a corresponding directory in HDFS.
q Partitions:
q Analogous to indexes on partition columns.
q Nested sub-directories in HDFS for each combination of partition column values.
q Allow users to efficiently retrieve rows.
q For instance, range-partition tables by date.
17
q Buckets:
q Split data based on the hash of a column, mainly for parallelism.
q Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table.
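The layout described above (a directory per table, nested sub-directories per partition value, and hash-based buckets) can be sketched in Python. The warehouse root, the table name `page_views`, the partition columns, and both helper functions are assumptions for illustration, and `zlib.crc32` is a stand-in for whatever hash function the engine actually uses.

```python
import zlib

# Assumed warehouse root; the real location depends on the installation.
WAREHOUSE_DIR = "/user/hive/warehouse"


def partition_path(table, partition_values):
    """Build the nested sub-directory for one combination of partition values."""
    parts = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"{WAREHOUSE_DIR}/{table}/{parts}"


def bucket_for(value, num_buckets):
    """Assign a row to a bucket from a hash of the bucketing column (stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table partitioned by date and country, bucketed by user id.
path = partition_path("page_views", {"ds": "2015-11-18", "country": "US"})
bucket = bucket_for(12345, 8)
```

A query that filters on `ds` only needs to read the matching sub-directory, and rows with the same bucketing value always land in the same bucket, which is what allows the buckets to be processed in parallel.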
18
q Partitioning in Hive divides rows and efficiently retrieves columns:
q True. q False.
19
q The use of SQL syntax and Hadoop features has made Hive very popular and easy to program.
q Despite its latency problem, being feature-rich has made it more widely used than the newer engines.
q Its many enhancements and its flexibility make it, so far, the best choice for processing and querying data in a batch system.
20
22