An Introduction to Big Data: Big Data for Beginners, Overview of Big Data, Big Data Tutorial

Post on 07-Jan-2017



Data & Analytics


Big Data
Introduction to Big Data, Hadoop and Spark

Definitions

By definition: Big data refers to the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategy.

Hadoop and Spark are programming and processing technologies for big data.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.

Current Scenario

Enterprise applications
- Operational
- Decision Support

Traditionally, enterprise applications can be broadly categorized into operational and decision support systems.

Lately, a new set of applications such as customer analytics has been gaining momentum (e.g., YouTube channels for different categories of users).

Current Scenario - Architecture (Typical Enterprise Application)

[Diagram: three Client (Browser) boxes and two App Server boxes]


Current Scenario - Architecture: Recent trends
- Standardization and consolidation of hardware (servers, storage, network etc.) to cut costs
- Storage is physically separated from servers and connected with high-speed fiber optics

Current Scenario - Architecture

[Diagram: three Database Server boxes, two Network Switch boxes, and a Storage Cluster]

* Typical database architecture in an enterprise

Oracle Architecture

[Diagram: Database Servers, Network Switch (interconnect), Network Switch, Storage]

Current Scenario - Architecture: Databases
- Databases are clustered (Oracle RAC)
  - High availability
  - Fault tolerance
  - Load balancing
  - Scalable (not linear)
- Common network storage
  - File abstraction: a file can be of any size
  - Fault tolerance (using RAID)

Current Scenario - Architecture
- Almost all of these applications follow a similar n-tier architecture
  - Core applications (operational)
  - EAI (Enterprise Application Integration)
  - CRM
  - ERP
  - DW/BI tools such as Informatica, Cognos, Business Objects etc.
- However, there are exceptions: legacy (mainframe-based) applications that use a closed architecture

Current Scenario - Architecture

[Diagram: Application Servers, Database Servers, Storage Servers]

* Bird's-eye view after standardization and consolidation using cloud architecture

Current Scenario - Challenges
- Almost all operational systems use relational databases (RDBMSs such as Oracle).
- RDBMSs were originally designed for operational and transactional workloads.
- Not linearly scalable
- Transactions
- Data integrity
- Expensive
- Predefined schema
- Data processing does not happen where the data is stored (the storage layer)
  - Some processing happens at the database server level (SQL)
  - Some processing happens at the application server level (Java/.NET)
  - Some processing happens at the client/browser level (JavaScript)

Evolution of Databases
- Relational databases (Oracle, Informix, Sybase, MySQL etc.)
- NoSQL databases (Cassandra, HBase, MongoDB etc.)
- In-memory databases (GemFire, Coherence etc.)
- Search-based databases (Elasticsearch, Solr etc.)
- Batch processing frameworks (MapReduce, Spark etc.)

* Modern applications need to be polyglot (different modules need different categories of databases)

Big Data ecosystem - Advantages
- Distributed storage
  - Fault tolerance (RAID is replaced by replication)
- Distributed computing/processing
  - Data locality (code goes to data)
  - Scalability (almost linear)
- Low-cost hardware (commodity)
- Low licensing costs
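The two storage-related advantages above (replication instead of RAID, and sending code to the data) can be illustrated with a small single-process sketch. It is not how HDFS actually places blocks: real HDFS placement is rack-aware, and the round-robin policy, the node names, and the helper functions `place_blocks`, `readable_blocks` and `schedule_task` are all assumptions made for illustration. Only the replication factor of 3 mirrors the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes starting at a rotating offset are always distinct.
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items()
            if any(node not in failed for node in holders)]


def schedule_task(block, placement, failed_nodes=()):
    """Data locality: run the task on a live node that already stores the block."""
    for node in placement[block]:
        if node not in failed_nodes:
            return node
    raise RuntimeError("no live replica for block %d" % block)


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_blocks(num_blocks=8, nodes=nodes, replication=3)

# With three replicas per block, losing any two nodes leaves every block readable.
survivors = readable_blocks(placement, failed_nodes=["node2", "node3"])
print(len(survivors) == len(placement))  # True

# Block 0 lives on node1, node2, node3; with node1 down the task goes to node2.
print(schedule_task(0, placement, failed_nodes=["node1"]))  # node2
```

The point of the sketch is the trade-off: replication costs extra raw storage on commodity disks, but a failed node needs no RAID rebuild, and the scheduler can still find a machine that holds the data locally.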

Hadoop ecosystem
- Evolution of the Hadoop ecosystem
- Use cases that can be addressed using the Hadoop ecosystem
- Hadoop ecosystem tools/landscape

Evolution of the Hadoop ecosystem
- GFS to HDFS
- Google MapReduce to Hadoop MapReduce
- Bigtable to HBase

Use cases that can be addressed using the Hadoop ecosystem
- ETL
- Real-time reporting
- Batch reporting
- Operational but not transactional

Hadoop ecosystem tools/landscape
- Operational and real-time data integration
  - HBase
- ETL
  - MapReduce, Hive/Pig, Sqoop etc.
- Reporting
  - Hive (batch)
  - Impala/Presto (real time)
- Analytics
  - API: MapReduce
  - Other frameworks
- Miscellaneous/complementary tools
  - ZooKeeper (coordination service for masters)
  - Oozie (workflow/scheduler)
  - Chef/Puppet (automation for administrators)
  - Vendor-specific management tools (Cloudera Manager, Hortonworks Ambari etc.)

EDW (Current Architecture)

[Diagram: Source(s) (OLTP, closed mainframes, XML, external apps); Data Integration (ETL/Real Time); Data Warehouse; Visualization/Reporting (Reporting, Decision Support)]

Use Case: EDW (Current Architecture)
- The enterprise data warehouse is built for enterprise reporting to a selected audience in executive management, so the user base that views the reports is typically in the tens or hundreds
- Data integration
  - ODS (Operational Data Store)
    - Sources: disparate
    - Real time: tools/custom (GoldenGate, SharePlex etc.)
    - Batch: tools/custom
    - Uses: compliance, data lineage, reports etc.
  - Enterprise data warehouse
    - Sources: ODS or other sources
    - ETL: tools/custom (Informatica, Ab Initio, Talend)
- Reporting/visualization
  - ODS (compliance-related reporting)
  - Enterprise data warehouse
  - Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)

EDW (Big Data ecosystem)

[Diagram: Source(s) (OLTP, closed mainframes, XML, external apps); Real Time/Batch (No ETL); Hadoop Cluster (EDW/ODS); Reporting Database; Visualization/Reporting (Reporting, Decision Support)]

Hadoop ecosystem

[Diagram: Hadoop core components (Distributed File System (HDFS), MapReduce); Hadoop components (Hive, Pig, Flume, Sqoop, Oozie, Mahout); Non-MapReduce (Impala, Presto, HBase, Spark)]

Hadoop ecosystem: components and their roles (E = extract, T = transform, L = load)
- Hadoop core components: Distributed File System (HDFS), MapReduce
- Custom MapReduce: E, T and L
- Hive: T and L; batch reporting
- Impala (non-MapReduce): interactive/ad hoc reporting
- Sqoop: E and L
- Oozie: workflows
- HBase: real-time data integration or reporting

Disadvantages of MapReduce-based solutions
- Designed for batch; not meant for interactive and ad hoc reporting
- I/O bound, and processing of micro-batches can be an issue
- Too many tools/technologies (MapReduce, Hive, Pig, Sqoop, Flume etc.) are needed to build applications
- Not suitable for enterprise hardware, where storage is typically network mounted
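To make the batch, I/O-bound nature of the model concrete, here is a minimal single-process sketch of the three MapReduce phases applied to a word count. It uses plain Python rather than the Hadoop API, and the function names `map_phase`, `shuffle` and `reduce_phase` and the sample input are illustrative assumptions. In a real Hadoop job the map output is written to local disk, moved across the network during the shuffle, and the reduce output is written back to HDFS, which is where the I/O cost comes from.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "spark and hadoop process big data"]  # sample input

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts["big"], counts["data"])  # 3 2
```

Anything more complex than a single pass, such as a join followed by an aggregation, becomes a chain of such jobs, each one reading its input from storage and writing its output back before the next can start.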

Apache Spark
- Spark can work with any file system, including HDFS
- Processing is done in memory, so I/O is minimized
- Suitable for ad hoc or interactive querying and reporting
- Streaming jobs can run much faster than with MapReduce
- Applications can be developed in Scala, Python, Java etc.
- Choose one programming language and perform:
  - Data integration from an RDBMS using JDBC (no need for Sqoop)
  - Data streaming using Spark Streaming
  - Queries using DataFrames and SQL embedded in the programming language
- Because processing is done in memory, Spark works well with enterprise hardware that uses a network file system
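The "one language for everything" point can be sketched in PySpark: read a table over JDBC, cache it in memory, and query it with SQL from the same program. This is a rough sketch under stated assumptions, not a reference implementation. The connection URL, the credentials, the `orders` table and its `customer_id` and `amount` columns are all hypothetical, and running `main()` requires a PySpark installation plus the matching JDBC driver on the classpath.

```python
def jdbc_options(url, table, user, password):
    """Options for Spark's built-in JDBC data source (all values here are placeholders)."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


# SQL embedded in the program; the table and column names are hypothetical.
REPORT_SQL = """
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
"""


def build_report(spark, options):
    """Load a table over JDBC (no Sqoop step), keep it in memory, and query it with SQL."""
    orders = spark.read.format("jdbc").options(**options).load()
    orders.cache()  # later queries reuse the in-memory copy
    orders.createOrReplaceTempView("orders")
    return spark.sql(REPORT_SQL)


def main():
    # Imported here so the helpers above can be read and tested without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("edw-sketch").getOrCreate()
    options = jdbc_options(
        url="jdbc:postgresql://db.example.com:5432/sales",  # hypothetical database
        table="orders",
        user="report_user",
        password="change-me",
    )
    build_report(spark, options).show()
    spark.stop()
```

Calling `main()` on a machine with Spark available would print the per-customer totals. The same session could also read files from HDFS or consume a stream, which is the consolidation the slide contrasts with the separate MapReduce, Hive, Sqoop and Flume tools.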

EDW (Big Data ecosystem - Spark)

[Diagram: Source(s) (OLTP, closed mainframes, XML, external apps); Real Time/Batch (No ETL); Hadoop Cluster (EDW/ODS); Visualization/Reporting (Reporting, Decision Support)]

Role of Apache
Each of these is a separate project incubated under Apache:
- HDFS and MapReduce/YARN
- Hive
- Pig
- Sqoop
- HBase
- etc.

Installation (plain vanilla)
- In plain vanilla mode, depending on the architecture, each tool/technology needs to be manually downloaded, installed and configured.
- Typically, people use Puppet or Chef to set up clusters using plain vanilla tools
- Advantages
  - You can set up your cluster with the latest versions directly from Apache
- Disadvantages
  - Installation is tedious and error prone
  - You need to integrate with monitoring tools yourself

Hadoop Distributions
- Different vendors pre-package the Apache suite of big data tools into their distributions to facilitate:
  - Easier installation/upgrade using wizards
  - Better monitoring
  - Easier maintenance
  - and many more
- Leading distributions include, but are not limited to:
  - Cloudera
  - Hortonworks
  - MapR
  - AWS EMR
  - IBM BigInsights
  - and many more

Hadoop Distributions

[Diagram: Apache Foundation projects (HDFS/YARN/MR, Hive, Pig, Sqoop, Impala, Tez, Flume, Spark, Ganglia, HBase, ZooKeeper) and the distributions that package them (Cloudera, Hortonworks, MapR, AWS)]

Be a Big Data Expert with SpringPeople
- Administrator: Apache Hadoop + Hadoop Administration
- Developer: Apache Hadoop + Apache Spark with Scala
- Data Scientist: Apache Hadoop + Analytics with R / Machine Learning

Become a Big Data Expert in 10 days.

World class training by Certified Subject Matter Experts

Big Data Bundled Training
Become a Big Data Expert in 8 days.

Suggested Audience & Other Details
- Suggested audience: Developers & Architects
- Duration: 8 days
- Prerequisites: familiarity with Linux/Unix and Hadoop, and some database experience

Become a Hadoop Guru
Become an overall Hadoop Expert in 5 days.

Suggested Audience & Other Details
- Suggested audience: Developers & Architects
- Duration: 5 days
- Prerequisites: familiarity with Linux/Unix and some database experience

How To Become A Big Data Analyst?
Join the bundled training program of Big Data

Suggested Audience & Other Details
- Suggested audience: Developers & Architects
- Duration: 10 days
- Prerequisites: hands-on experience in Java and some database experience

Get Certified & #BeTheExpert

For further info/assistance contact:tra
