
Post on 27-Jun-2020


Page 1: A Glimpse of the Hadoop Echosystem - GitHub Pagesstg-tud.github.io/ctbd/2017/CTBD_07_echosystem.pdf · YARN •Support for paradigms other than MapReduce (Multi tenancy) •HBaseon

A Glimpse of the Hadoop Ecosystem


Hadoop Ecosystem

• A cluster is shared among several users in an organization
  • Different services
• HDFS and MapReduce provide the lower layers of the infrastructure
• Other systems "plug" on top of these
  • Easier way to program applications
  • MapReduce and HDFS are "low level"


HBase

• Hadoop database for random read/write access
• HBase is an open-source, non-relational, distributed "database"
  • Modeled after Google's Bigtable.
• It runs on top of Hadoop and HDFS, providing Bigtable-like capabilities for Hadoop.
• In terms of Eric Brewer's CAP theorem, HBase is a CP-type system.
  • CAP: consistency, availability, partition tolerance.


When to use HBase

• Real big data: billions of rows × millions of columns
  • The data cannot be stored on a single node.
• Random read/write access
• Thousands of operations on big data
• No need for the extra features of an RDBMS, such as typed columns, secondary indexes, transactions, advanced query languages, etc.


HDFS | HBase
---- | -----
Good for storing large files | Built on top of HDFS. Good for hosting very large tables (billions of rows × millions of columns)
Write once. Appending to files is possible in some recent versions but not commonly used | Read/write many
No random read/write | Random read/write
No individual record lookup; rather, read all data | Fast record lookup (update)


HBase

• Type of NoSQL database
• HBase is really more a "data store" than a "database". It lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
• Strongly consistent reads and writes
• Automatic sharding (i.e., "horizontal partitioning")
  • HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as data grows
• Automatic RegionServer failover
• Hadoop/HDFS integration
• Massively parallelized processing via MapReduce, using HBase as both source and sink.
• Java API for programmatic access, REST for non-Java front-ends.
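The automatic sharding idea can be illustrated with a toy model. The sketch below is not HBase code: the `Region` and `Table` classes, the `max_rows` threshold, and the split-at-median policy are assumptions made for illustration only. Real HBase decides when to split from the size of a region's data rather than from a row count, and it also assigns regions to RegionServers, which is not modeled here.

```python
import bisect


class Region:
    """A contiguous, sorted range of row keys: [start_key, end_key)."""

    def __init__(self, start_key, end_key, rows=None):
        self.start_key = start_key  # inclusive; "" means "from the beginning"
        self.end_key = end_key      # exclusive; None means "to the end"
        self.rows = dict(rows or {})

    def contains(self, key):
        return key >= self.start_key and (self.end_key is None or key < self.end_key)


class Table:
    """Toy table that splits a region once it holds more than max_rows rows."""

    def __init__(self, max_rows=4):
        self.max_rows = max_rows           # hypothetical split threshold
        self.regions = [Region("", None)]  # one region covering all keys

    def _find(self, key):
        # Regions are kept sorted by start_key, so binary search finds the owner.
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, key) - 1

    def put(self, key, value):
        idx = self._find(key)
        region = self.regions[idx]
        region.rows[key] = value
        if len(region.rows) > self.max_rows:
            self._split(idx)

    def get(self, key):
        return self.regions[self._find(key)].rows.get(key)

    def _split(self, idx):
        region = self.regions[idx]
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]  # split at the median row key
        left = Region(region.start_key, mid,
                      {k: v for k, v in region.rows.items() if k < mid})
        right = Region(mid, region.end_key,
                       {k: v for k, v in region.rows.items() if k >= mid})
        self.regions[idx:idx + 1] = [left, right]


table = Table(max_rows=4)
for name in ["alice", "bob", "carol", "dave", "erin", "frank", "grace"]:
    table.put(name, {"addr:city": "unknown"})

# Key ranges of the regions after the automatic splits
region_ranges = [(r.start_key, r.end_key) for r in table.regions]
```

Because every region covers a sorted key range, a lookup only needs to locate the one region that owns the key, which is what makes random reads and writes cheap even as the table grows.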


# Gets all the data for the row
hbase> get '/user/user01/customer', 'jsmith'

# Limit this to only one column family
hbase> get '/user/user01/customer', 'jsmith', {COLUMNS => ['addr']}

# Limit this to a specific column
hbase> get '/user/user01/customer', 'jsmith', {COLUMNS => ['order:numb']}

# Scan all rows of table 't1'
hbase> scan 't1'

# Specify a time range
hbase> scan 't1', {TIMERANGE => [1303668804, 1303668904]}

# Specify a start row, limit the result to 10 rows, and only return selected columns
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
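To make the semantics of the shell commands above more concrete, here is a minimal in-memory sketch. It is not the HBase API: the `get` and `scan` functions, the dictionary layout, and the sample rows are hypothetical, and cell timestamps (and therefore `TIMERANGE`) are not modeled. It only shows how a column filter can name either a whole family or a single `family:qualifier` column, and how a scan walks row keys in sorted order.

```python
def _select(cells, columns):
    """Keep cells matching a family ('addr') or a full column ('order:numb')."""
    if columns is None:
        return dict(cells)
    selected = {}
    for name, value in cells.items():
        family = name.split(":", 1)[0]
        if name in columns or family in columns:
            selected[name] = value
    return selected


def get(table, row_key, columns=None):
    """Return the cells of a single row, optionally restricted to some columns."""
    return _select(table.get(row_key, {}), columns)


def scan(table, columns=None, limit=None, startrow=None):
    """Yield (row_key, cells) in sorted row-key order, like a range scan."""
    emitted = 0
    for row_key in sorted(table):
        if startrow is not None and row_key < startrow:
            continue
        if limit is not None and emitted >= limit:
            break
        cells = _select(table[row_key], columns)
        if cells:  # in this toy model, rows without matching cells are skipped
            emitted += 1
            yield row_key, cells


# Hypothetical customer table: row key -> {"family:qualifier": value}
customers = {
    "jsmith": {"addr:city": "Denver", "addr:state": "CO", "order:numb": "1042"},
    "akumar": {"addr:city": "Austin", "order:numb": "7"},
    "zlee": {"addr:city": "Boston"},
}

whole_row = get(customers, "jsmith")
addr_only = get(customers, "jsmith", columns=["addr"])
one_column = get(customers, "jsmith", columns=["order:numb"])
from_b = list(scan(customers, columns=["order"], limit=10, startrow="b"))
```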


Hive

"The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive."


Hive

• An SQL-like interface to Hadoop.
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• Query execution via MapReduce
  • The Hive interpreter transparently converts queries to MapReduce.
  • But other backends are also supported, e.g., Spark
• Open source, developed by Facebook
• Also used by Netflix, CNET, Digg, eHarmony, etc.


SELECT customerId, max(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING count(*) > 3;
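Since Hive can execute queries via MapReduce, it may help to see how the query above could be phrased as map, shuffle, and reduce steps. This is only a conceptual sketch in plain Python: the sample rows and the function names `map_phase`, `shuffle`, and `reduce_phase` are made up, and the plans Hive actually generates are more elaborate.

```python
from collections import defaultdict

# Hypothetical rows of hive_purchases: (customerId, total_cost)
purchases = [
    ("c1", 10.0), ("c1", 25.0), ("c1", 5.0), ("c1", 40.0),
    ("c2", 99.0), ("c2", 1.0),
    ("c3", 7.0), ("c3", 8.0), ("c3", 9.0), ("c3", 3.0), ("c3", 2.0),
]


def map_phase(rows):
    """Map: emit (key, value) pairs keyed by the GROUP BY column."""
    for customer_id, total_cost in rows:
        yield customer_id, total_cost


def shuffle(pairs):
    """Shuffle: bring together all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: apply HAVING count(*) > 3, then compute max(total_cost)."""
    for customer_id, costs in sorted(groups.items()):
        if len(costs) > 3:
            yield customer_id, max(costs)


result = list(reduce_phase(shuffle(map_phase(purchases))))
```

Customer `c2` has only two purchases, so the `HAVING` filter drops it; the other two customers each yield one row with their largest `total_cost`.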


• Word count in Hive
  • Just a curiosity, probably not the typical kind of query

https://en.wikipedia.org/wiki/Apache_Hive


DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
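For comparison, the same word count can be sketched in a few lines of plain Python. The `docs` list stands in for the rows of the `docs` table and its contents are invented; the `word_counts` function name is likewise just illustrative. One deliberate difference, an assumption of this sketch rather than Hive behavior: empty strings produced by consecutive whitespace are dropped.

```python
import re
from collections import Counter

# Hypothetical contents of the docs table: one string per row.
docs = ["the quick brown fox", "the lazy dog", "the fox"]


def word_counts(lines):
    """Split each line on whitespace, count the words, and sort by word."""
    # Counterpart of explode(split(line, ...)): one element per word.
    words = [w for line in lines for w in re.split(r"\s", line) if w]
    # Counterpart of GROUP BY word with count(1), then ORDER BY word.
    return sorted(Counter(words).items())


counts = word_counts(docs)
```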


YARN

• Yet Another Resource Negotiator
• YARN Application Resource Negotiator (recursive acronym)
• Remedies the scalability shortcomings of "classic" MapReduce
• A general-purpose framework; MapReduce is one application.


MapReduce Limitations

• Scalability
  • Maximum cluster size: 4000 nodes
  • Maximum concurrent tasks: 40000
  • Coarse synchronization in the JobTracker
• Single point of failure
  • Failure kills all queued and running jobs
  • Jobs need to be resubmitted by users
  • Restart is tricky due to complex state


For a (short) introduction: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html


• YARN splits up the major functions of the JobTracker:
• The ResourceManager has two components: Scheduler and ApplicationsManager.
  • Scheduler: performs no monitoring or tracking of status for the application.
    • No guarantees about restarting failed tasks, whether due to application failure or hardware failures.
    • Performs its scheduling function based on the resource requirements of the applications
    • Abstract notion of a resource Container (memory, CPU, disk, network, etc.)
  • The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application-specific ApplicationMaster
    • Provides the service for restarting the ApplicationMaster container on failure.
• ApplicationMaster (one per application)
  • Negotiates appropriate resource containers from the Scheduler
  • Tracks their status and monitors progress.
  • Runs as a normal container.
  • Framework-specific library
  • Works with the NodeManager(s) to execute and monitor the tasks.
• NodeManager (NM)
  • A new per-node slave, responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting to the ResourceManager.
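The division of labor described above can be sketched as a small simulation. Everything here is a simplification under stated assumptions: the classes are toy stand-ins rather than the YARN API, the memory and vcore figures are invented, the first-fit placement policy is hypothetical (real YARN schedulers such as the CapacityScheduler and FairScheduler are far more involved), and heartbeats, monitoring, and failure handling are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Abstract bundle of resources granted on one node."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-node agent: launches containers and tracks remaining local resources."""
    name: str
    memory_mb: int
    vcores: int
    running: list = field(default_factory=list)

    def can_fit(self, memory_mb, vcores):
        return memory_mb <= self.memory_mb and vcores <= self.vcores

    def launch(self, memory_mb, vcores):
        self.memory_mb -= memory_mb
        self.vcores -= vcores
        container = Container(self.name, memory_mb, vcores)
        self.running.append(container)
        return container


class ResourceManager:
    """Combines the Scheduler role (allocate) and the ApplicationsManager role (submit)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        # Scheduler: pure resource matching (first fit here), no task monitoring.
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                return node.launch(memory_mb, vcores)
        return None  # the request cannot be satisfied right now

    def submit(self, app_name, tasks):
        # ApplicationsManager: negotiate the first container, which runs the AM.
        am_container = self.allocate(memory_mb=512, vcores=1)  # assumed AM size
        if am_container is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        return ApplicationMaster(app_name, tasks, self, am_container)


class ApplicationMaster:
    """One per application: requests containers and tracks its own tasks."""

    def __init__(self, name, tasks, rm, container):
        self.name, self.tasks, self.rm = name, tasks, rm
        self.container = container  # the AM itself runs in a normal container
        self.assigned = {}          # task name -> Container
        self.pending = []           # tasks still waiting for resources

    def run(self):
        for task, (memory_mb, vcores) in self.tasks.items():
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                self.pending.append(task)
            else:
                self.assigned[task] = container


nodes = [NodeManager("node1", 2048, 2), NodeManager("node2", 4096, 4)]
rm = ResourceManager(nodes)
am = rm.submit("wordcount", {"map-0": (1024, 1), "map-1": (1024, 1), "reduce-0": (2048, 2)})
am.run()
placement = {task: c.node for task, c in am.assigned.items()}
```

Note that the ResourceManager in this sketch never looks at task state: tracking which tasks are assigned or pending lives entirely in the ApplicationMaster, which is the point of splitting up the JobTracker.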


YARN

• Fault tolerance and availability
  • ResourceManager
    • No single point of failure: state saved in ZooKeeper
    • ApplicationMasters are restarted automatically
  • Optional failover via application-specific checkpoint
  • MapReduce applications pick up where they left off via state saved in HDFS
• Scalability
  • 6000-10000 nodes
  • 100000+ concurrent tasks
  • 10000+ jobs


YARN

• Support for paradigms other than MapReduce (multi-tenancy)
  • HBase on YARN (HOYA), machine learning: Spark, graph processing: Giraph, real-time processing: Storm
• Enabled by allowing the use of a paradigm-specific ApplicationMaster
• Run all on the same Hadoop cluster!


Sources

• Hadoop 2.0 and YARN, Subash D'Souza
• https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
• https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
• http://hbase.apache.org/book.html#arch.overview
