Impala: First Steps of an Oracle Expert

Post on 04-Oct-2020

Page 1: Big Data Impala

2015 © Trivadis

BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

Why do I need Impala? First steps of an Oracle expert

Author: Jan Ott, Trivadis AG

Page 2: Our company

© Trivadis - The Company, 21.11.15

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on [...] and [...] technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

OPERATION

Trivadis Services takes over the interacting operation of your IT systems.

Page 3: With over 600 specialists and IT experts in your region

Locations: Copenhagen, Munich, Lausanne, Bern, Zurich, Brugg, Geneva, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Vienna

- 14 Trivadis branches and more than 600 employees
- 200 Service Level Agreements
- Over 4,000 training participants
- Research and development budget: CHF 5.0 million
- Financially self-supporting and sustainably profitable
- Experience from more than 1,900 projects per year at over 800 customers

Page 4: Jan Ott - Who am I?

- 25+ years in IT
- 25+ years using Oracle
- 15+ years at Trivadis AG
- BI - DWH
- Tuning
- Speaker / Trainer
- http://janottblog.wordpress.com/

Page 5: Agenda

1. Introduction
2. First Steps in the Impala World
3. Project - Hadoop as a file store for the DWH
4. Project - Hadoop for archiving of an Oracle DB
5. Project - Twitter
6. Summary

Page 6: Introduction

- A few words about Big Data
  - Big Data
  - Hadoop
  - Impala - Why?
- Impala - my first steps
  - Get some data into Hadoop
  - Tables in Impala
  - Use SQL
  - Miscellaneous
- Project 1 - Data in Hadoop for a DWH
- Project 2 - Hadoop for archiving
- Project 3 - Twitter

Page 7: Big Data: Introduction

Turning Data into Insights

- Big Data - the V's: 3, 4 or 5
  - Volume - scale of data
  - Velocity - analysis of streaming data
  - Variety - different forms of data
  - Veracity - uncertainty of data (IBM)
  - Value - business value (Microsoft)
- Hadoop and its Zoo
  - HDFS - MapReduce
  - Impala, HBase, Hive, ...
  - Zookeeper
- NoSQL databases
- Architecture
  - LAMBDA

Page 8: What is Hadoop

- A file system - HDFS
- Based on papers from Google
- Apache Open Source Project
- Goal
  - Fast
  - Handles huge amounts of data
  - Handles unstructured to fully structured data
  - Horizontally scalable
  - Reliable

Page 9: What is Impala

- A SQL query engine on top of HDFS - Hive
- Not an Apache Open Source Project (as of 2015)
- Open Source - Cloudera, Oracle, Amazon
- Hive & Impala
  - Impala uses the metadata store of Hive
- Goal
  - Easy to use - SQL
  - Fast
  - Handles huge amounts of data
  - Handles unstructured to fully structured data
  - Horizontally scalable
  - Reliable

Page 11: First Steps

- Keep it simple
- Get some data into Hadoop
- Get some data into Impala
- Java - keep it to a minimum
- Get an environment that is already set up
  - Oracle VM - Big Data Lite
- Pick one way to get the data into Impala
  - Impala shell interface
- See SQL on an HDFS system

Page 12: Prerequisite - Environment

- Oracle Big Data Lite
  - VM
  - Version 4.1
  - http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
- Contains
  - Oracle Database 12c (12.1.0.2)
  - Cloudera's Distribution including Apache Hadoop (CDH 5.1.2)
  - Hadoop 2.3.0
  - Hive 0.12.0
  - Impala 2.1
  - Oracle Big Data Connectors 4.0
  - Oracle SQL Developer 4.0.3
- Oracle VirtualBox

Page 13: Information about the VM

- Login
  - oracle/welcome1
- Start here
  - file:///home/oracle/GettingStarted/StartHere.html
- Start
  - Oracle DB
  - Hive
  - Impala
  - HDFS
- You're done preparing
- Oracle has a Movie example

Page 14: The Steps - simple - focus

[Figure: diagram of the steps. Recoverable labels: Impala Table, SQL Query]

Page 15: Impala

- Sources can be
  - A delimited text file in the host file system, which will be copied into the Impala store (HDFS)
  - A delimited text file in HDFS
- Limits
  - Read only or add
  - No Update
  - No Delete
  - No Commit/Rollback
  - No Indexes
  - Always a full table/file scan, or a partition scan
- Uses Hive Metastore
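
The difference between a full scan and a partition scan can be illustrated with a toy model. This is only a sketch in Python, not how Impala is implemented: each partition stands for one HDFS directory of data files, and the rows and the choice of `dept_id` as partition key are assumptions made for illustration.

```python
from collections import defaultdict

# Toy model only: each partition stands for one HDFS directory of data files.
# Rows are (emp_id, salary, dept_id); dept_id is the assumed partition key.
ROWS = [(1, 3000, 1), (2, 5000, 1), (3, 3500, 2), (4, 4000, 2), (5, 7000, 3)]

partitions = defaultdict(list)
for row in ROWS:
    partitions[row[2]].append(row)


def scan(predicate, dept_id=None):
    """Return the matching rows and the number of rows that had to be read."""
    if dept_id is None:
        # Full scan: without indexes, every row of every partition is read.
        candidates = [r for part in partitions.values() for r in part]
    else:
        # Partition scan: only the rows of one partition are read.
        candidates = partitions.get(dept_id, [])
    return [r for r in candidates if predicate(r)], len(candidates)


full_hits, full_read = scan(lambda r: r[2] == 2 and r[1] > 3600)
part_hits, part_read = scan(lambda r: r[1] > 3600, dept_id=2)
print(full_hits, full_read)
print(part_hits, part_read)
```

Both calls find the same row, but the second one reads only the rows of one partition. That is the only lever available when there are no indexes.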

Page 16: HDFS - Command Shell

- cat
- chmod
- cp
- ls
- put
- ...
- https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
- Example

    $ hadoop fs -ls
    Found 6 items
    drwx------   - oracle supergroup          0 2014-03-19 11:11 .Trash
    drwx------   - oracle supergroup          0 2014-01-24 20:15 .staging
    drwxr-x---   - oracle supergroup          0 2014-01-13 00:15 moviedemo
    drwx------   - oracle supergroup          0 2014-01-24 16:32 moviework
    drwxr-xr-x   - oracle supergroup          0 2013-12-27 16:36 olhcache
    drwxr-xr-x   - oracle supergroup          0 2013-12-27 16:36 temp_out_session
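
The listing format is regular enough to be read by a script. The following is a minimal Python sketch, assuming the eight whitespace-separated columns seen in the example (permissions, replication, owner, group, size, date, time, path). `LsEntry` and `parse_ls_line` are hypothetical names chosen for illustration and are not part of Hadoop.

```python
from typing import NamedTuple, Optional


class LsEntry(NamedTuple):
    permissions: str
    owner: str
    group: str
    size: int
    modified: str
    path: str


def parse_ls_line(line: str) -> Optional[LsEntry]:
    """Parse one entry line of `hadoop fs -ls` output; return None for other lines."""
    parts = line.split(None, 7)  # the path is the 8th column and may contain spaces
    if len(parts) != 8:
        return None  # e.g. the "Found 6 items" header
    perms, _replication, owner, group, size, date, time, path = parts
    if not size.isdigit():
        return None
    return LsEntry(perms, owner, group, int(size), f"{date} {time}", path)


sample = """Found 6 items
drwx------   - oracle supergroup          0 2014-03-19 11:11 .Trash
drwxr-x---   - oracle supergroup          0 2014-01-13 00:15 moviedemo"""

entries = [e for e in map(parse_ls_line, sample.splitlines()) if e is not None]
directories = [e.path for e in entries if e.permissions.startswith("d")]
print(directories)
```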

Page 17: Step 1 - The Data

- /home/oracle/Desktop/impala_test/t10.txt
  - Comma delimited
  - Flat file
  - Format the date so it fits Impala's date format
    - YYYY-MM-DD HH24:MI:SS.XXXX

    1,Hans,Meier,3000,1968-02-02 00:00:00,2000-01-01 00:00:00,1
    2,Stefan,Müller,5000,1970-10-15 00:00:00,2001-07-01 00:00:00,1
    3,Susanne,Kieser,3500,1972-03-14 00:00:00,2005-05-01 00:00:00,2
    4,Paul,Steiner,4000,1960-07-28 00:00:00,2000-01-01 00:00:00,2
    5,Monika,Hausmann,7000,1975-03-29 00:00:00,2000-01-01 00:00:00,3
    6,Manuela,Ziegler,3700,1980-11-05 00:00:00,2010-01-01 00:00:00,4
    7,Anna,Bosshard,4100,1984-11-08 00:00:00,2012-04-01 00:00:00,5
    8,Armin,Studer,4900,1988-12-17 00:00:00,2013-05-22 00:00:00,3
    9,Thomas,Bergmann,6000,1976-07-24 00:00:00,2012-08-15 00:00:00,5
    10,Heiko,Zimmermann,4800,1955-04-21 00:00:00,2012-10-01 00:00:00,4
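
Before loading such a file it can help to check that every row has the expected number of fields and that both date columns match the timestamp layout. The following is a minimal Python sketch under assumptions: the column names are invented for illustration (the slide shows only values), and only the first two rows are included as sample input.

```python
import csv
import io
from datetime import datetime

# Column names are assumed for illustration; the slide shows only the values.
COLUMNS = ["emp_id", "first_name", "last_name", "salary",
           "birth_date", "hire_date", "dept_id"]
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches e.g. 1968-02-02 00:00:00

SAMPLE = """1,Hans,Meier,3000,1968-02-02 00:00:00,2000-01-01 00:00:00,1
2,Stefan,Müller,5000,1970-10-15 00:00:00,2001-07-01 00:00:00,1
"""


def parse_rows(text):
    """Parse comma-delimited rows and validate both timestamp columns."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(record)}")
        row = dict(zip(COLUMNS, record))
        for key in ("emp_id", "salary", "dept_id"):
            row[key] = int(row[key])
        for key in ("birth_date", "hire_date"):
            # strptime raises ValueError if the layout does not match
            row[key] = datetime.strptime(row[key], TIMESTAMP_FORMAT)
        rows.append(row)
    return rows


def format_timestamp(value):
    """Render a datetime in the YYYY-MM-DD HH:MI:SS layout used in the file."""
    return value.strftime(TIMESTAMP_FORMAT)


rows = parse_rows(SAMPLE)
print(rows[0]["first_name"], format_timestamp(rows[0]["hire_date"]))
```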

Page 18: Step 2 - Get the data into HDFS

- File copy - reference it in CREATE TABLE
  - LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/impala_test.db/t_10'
- Create table - copy the file to the right directory
- Load Data
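
As a rough illustration of the first and third option, the statements to type into impala-shell could be assembled as in the Python sketch below. Several things here are assumptions and not taken from the slides: the column names and types, the use of an EXTERNAL table for the LOCATION variant, the table name `impala_test.t_10` (inferred from the LOCATION path) and the staging path `/user/oracle/t10.txt`.

```python
# Column names and types are assumptions; the slides show only the data values.
COLUMNS = [
    ("emp_id", "INT"),
    ("first_name", "STRING"),
    ("last_name", "STRING"),
    ("salary", "INT"),
    ("birth_date", "TIMESTAMP"),
    ("hire_date", "TIMESTAMP"),
    ("dept_id", "INT"),
]
LOCATION = "hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/impala_test.db/t_10"


def create_table_sql(table, columns, location=None):
    """Build a CREATE TABLE statement for a comma-delimited text file."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    keyword = "CREATE EXTERNAL TABLE" if location else "CREATE TABLE"
    sql = (
        f"{keyword} {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE"
    )
    if location:
        sql += f"\nLOCATION '{location}'"
    return sql + ";"


def load_data_sql(table, hdfs_path):
    """Build a LOAD DATA statement; the source file must already be in HDFS."""
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table};"


# File copy variant: the file already sits in LOCATION and the table points at it.
print(create_table_sql("impala_test.t_10", COLUMNS, LOCATION))
# Load Data variant: create the table first, then move a staged HDFS file into it.
print(load_data_sql("impala_test.t_10", "/user/oracle/t10.txt"))
```

The comma delimiter has to be declared explicitly because the sample file is comma delimited; without the ROW FORMAT clause a text table would not split the fields as intended.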

Page 19: Impala - SQL - general

- impala-shell
- SQL
  - No DUAL
  - ANSI SQL-92, sort of
  - No delete/update
  - 1 file per insert
- Different data types
- Data dictionary
  - show tables
  - describe

Page 20: Hive - Metastore

- Show Tables

    hive> SHOW tables LIKE 'dept';
    OK
    dept
    Time taken: 0.03 seconds, Fetched: 1 row(s)
    hive>

- Describe

    hive> DESCRIBE dept;
    OK
    deptno    int       None
    dname     string    None
    loc       string    None
    Time taken: 0.069 seconds, Fetched: 3 row(s)
    hive>

Page 21: Impala - Performance

- Statistics
  - Compute stats
  - Show table stats
  - Show column stats
- Explain Plan
  - Explain

Page 22: Impala - Miscellaneous

- Data types
  - BOOLEAN
  - No VARCHAR2, but VARCHAR
- Oracle Connectors
  - Uses Hive
  - Partition aware
- Parquet
- No indexes, not even Hive indexes
- Schema possible
- UDF - User Defined Functions
  - C++
  - Self-written
  - Impala is written in C++
- ODBC/JDBC
  - Business Objects
  - Cognos
  - Other tools
  - Everyone with SQL knowledge
- BIG
  - 1 GB default file size (Parquet)

Page 24: Project - Hadoop as a file store for the DWH

- Move to Hadoop for delivered files
  - Start collecting
  - Files get copied into HDFS one to one
  - No decision had to be taken
    - Schema - schema-less
    - Table design
    - non- …
- Immutable data store
  - Create and Read
  - No update / No delete
- Add external tables with Oracle SQL Connector
  - Data usable
- Build Hadoop infrastructure - use Impala

Page 26: Project - Hadoop for archiving of an Oracle DB

- Move to Hadoop for archiving
  - Possible to use the data
  - Immutable data store
    - Create and Read
    - No update / No delete
- Add external tables with Oracle SQL Connector
  - Data usable in Oracle too
- Build Hadoop infrastructure - use Impala
- Next
  - Analyze Oracle Big Data Appliance
    - Exadata and Big Data Appliance combined => "use same block structure"

Page 28: Project - Twitter

- 400 to 500 million tweets per day
- 1 tweet contains
  - Around 50 metadata pieces
    - Geo-location
    - Re-tweets
    - Followers
  - That is about 2 A4 pages
- Twitter Sample Stream
  - 1%
  - 4 to 5 million tweets per day
  - 50 tweets per second
- 20 other streams with defined keywords
- HDFS
  - 1 TB every 2 months including replication
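
The sample-stream figures follow from simple arithmetic, sketched here in Python. The storage line rests on two assumptions that are not in the slides: a "2 months" period of 60 days and 1 TB counted as 1000 GB.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(tweets_per_day):
    """Average rate over a day."""
    return tweets_per_day / SECONDS_PER_DAY


full_stream = (400_000_000, 500_000_000)        # tweets per day, low and high
sample = tuple(n // 100 for n in full_stream)   # the 1% sample stream
sample_rate = tuple(per_second(n) for n in sample)

# Assumptions: 2 months = 60 days, 1 TB = 1000 GB (replication included).
gb_per_day = 1000 / 60

print(sample)                                # tweets per day in the sample
print([round(r) for r in sample_rate])       # tweets per second, low and high
print(round(gb_per_day, 1))                  # GB written to HDFS per day
```

The computed rate of roughly 46 to 58 tweets per second brackets the 50 tweets per second quoted on the slide.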

Page 29: The Lambda Architecture - adopted

[Figure: Lambda architecture diagram. Recoverable labels:
- Layers: Batch layer, Speed layer, Serving layer, Consumer layer
- Batch side: All Data (HDFS), Batch (re)compute (MapReduce), Pre-computed Views, Batch views (QFD 1, QFD 2, ... QFD n)
- Speed side: Process Stream, Realtime Increment, Incremented Views, Realtime views (QFD 1, QFD 2, ... QFD n)
- Query & Merge, REST, Client Web App
- Input: Twitter API, Java app, Messaging (Kafka)
- Technologies shown: Hadoop, Storm, Impala, Cassandra
- QFD = Query Focused Data]
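
The "Query & Merge" step of a Lambda architecture can be sketched in a few lines of Python. This is a generic illustration of the pattern, not the project's code: the hashtag counts are made-up query focused data, and resetting the realtime view after a batch run is a simplification that ignores tweets arriving while the batch job is running.

```python
from collections import Counter

# Hypothetical query focused data (QFD): tweet counts per hashtag.
batch_view = Counter({"#impala": 120, "#hadoop": 300})  # recomputed from all data
realtime_view = Counter({"#impala": 4, "#storm": 2})    # increments since the last batch run


def query(hashtag):
    """Serve a count by merging the batch view with the realtime increment."""
    return batch_view[hashtag] + realtime_view[hashtag]


def on_tweet(hashtags):
    """Speed layer: increment the realtime view for each incoming tweet."""
    for tag in hashtags:
        realtime_view[tag] += 1


def on_batch_recompute(new_batch_view):
    """Batch layer: swap in the freshly computed view and reset the increments."""
    batch_view.clear()
    batch_view.update(new_batch_view)
    realtime_view.clear()


print(query("#impala"))
on_tweet(["#impala", "#kafka"])
print(query("#impala"), query("#kafka"))
```

The batch view is always rebuilt from the immutable store, so an error in the speed layer is corrected by the next batch run.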

Page 30: Agenda

1. Introduction
2. First Steps in the Impala World
3. Project 1
4. Project 2
5. Summary

Page 31: Summary

- Big Data <> Hadoop, Impala
- Impala
  - SQL extension for Hadoop
  - Block size - 1 GB
  - Not suited to small files
  - No optimization with indexes
- A new world
- Impala, Hive, Hadoop and its Zoo
- Lots can be done with an RDBMS
- Start to collect now

Page 32: Why Impala

- SQL
  - ANSI SQL-92
- No programming needed
- Speed!
  - Ad hoc
  - Hive - batch
- It is IN MEMORY - limit
  - Not like Oracle - pin an object to memory
  - Loaded during execution

Page 33: Questions?

THANK YOU.

Trivadis AG
Jan Ott
Europa-Strasse 5
CH-8152 Glattbrugg-Zurich

Tel. +41-44-808 7020 (reception)
Fax +41-44-808 7021

[email protected]

Page 34: Trivadis @ DOAG 2015

3rd floor - next to the escalator

We look forward to your visit.

Because with Trivadis you always win :)

Page 35: Sources

- Impala
  - http://impala.io
  - http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_impala_shell.html
- Oracle Connection to Hadoop with Oracle
  - https://blogs.oracle.com/bigdataconnectors/entry/how_to_load_oracle_tables
- Books
  - Big Data - MEAP by Nathan Marz
  - Getting Started with Impala by John Russell
  - Learning Cloudera Impala by Avkash Chauhan
- Pictures
  - Oracle.com
  - Twitter.com
  - Apache.com
  - Cloudera.com