an introduction of big data; big data for beginners; overview of big data; big data tutorial

Big Data: Introduction to Big Data, Hadoop and Spark

DEFINITIONS

Big data refers to the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Hadoop and Spark are programming and processing technologies for Big Data.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.

Current Scenario

Enterprise applications

Traditionally, enterprise applications can be broadly categorized into operational and decision support systems.

Lately, a new set of applications such as customer analytics is gaining momentum (e.g. YouTube channels for different categories of users).

[Diagram: Operational, Decision Support, Customer Analytics]

Current Scenario – Architecture (Typical Enterprise Application)

[Diagram: Client (Browser) ×3, App Server ×2, Database]

Current Scenario – Architecture

Recent trends
• Standardization and consolidation of hardware (servers, storage, network etc.) to cut down costs
• Storage is physically separated from servers and connected with high-speed fiber optics

[Diagram: Database Server ×3, Network Switch ×2, Storage Cluster. Typical database architecture in an enterprise]

Oracle Architecture

[Diagram: Storage, Network Switch, Network Switch (interconnect), Database Servers]

• Databases
  – Databases are clustered (Oracle RAC)
  – High availability
  – Fault tolerance
  – Load balancing
  – Scalable (not linear)
• Common network storage
  – File abstraction: a file can be of any size
  – Fault tolerance (using RAID)

Current Scenario – Architecture

• Almost all these applications follow a similar n-tier architecture
  – Core applications (operational)
  – EAI (Enterprise Application Integration)
  – CRM
  – ERP
  – DW/BI tools like Informatica, Cognos, Business Objects etc.
• However, there are exceptions: legacy (mainframe-based) applications which use a closed architecture

[Diagram: Application Servers, Database Servers, Storage Servers. Bird's eye view after standardization and consolidation using cloud architecture]

Current Scenario – Challenges

• Almost all operational systems use relational databases (RDBMS like Oracle)
• RDBMS were originally designed for operational and transactional workloads
  – Not linearly scalable
  – Transactions
  – Data integrity
• Expensive
• Predefined schema
• Data processing does not happen where the data is stored (storage layer)
  – Some processing happens at database server level (SQL)
  – Some processing happens at application server level (Java/.NET)
  – Some processing happens at client/browser level (JavaScript)

Evolution of Databases
• Relational databases (Oracle, Informix, Sybase, MySQL etc.)
• NoSQL databases (Cassandra, HBase, MongoDB etc.)
• In-memory databases (GemFire, Coherence etc.)
• Search-based databases (Elasticsearch, Solr etc.)
• Batch processing frameworks (Map Reduce, Spark etc.)

* Modern applications need to be polyglot (different modules need different categories of databases)

Big Data eco system – Advantages
• Distributed storage
  – Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
  – Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
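The two ideas above, replication instead of RAID and moving code to the data, can be illustrated with a toy model. This is a minimal sketch in plain Python, not HDFS or YARN code: the function names (`place_blocks`, `readable_blocks`, `pick_node`), the round-robin placement and the node names are all assumptions made for illustration.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a healthy node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


def pick_node(block, placement, nodes, busy):
    """Prefer an idle node that already stores the block (code goes to data).

    Returns (node, is_data_local). Falls back to any idle node, which would
    then have to read the block over the network.
    """
    for node in placement[block]:
        if node not in busy:
            return node, True
    for node in nodes:
        if node not in busy:
            return node, False
    return None, False


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]          # hypothetical node names
    placement = place_blocks(4, nodes, replication=3)
    print(readable_blocks(placement, failed={"n1", "n2"}))
    print(pick_node(0, placement, nodes, busy={"n1"}))
```

With a replication factor of 3, every block in this toy cluster survives the loss of any two nodes, and a task is only sent to a node without the data when all holders of the block are busy.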

Hadoop eco system
• Evolution of Hadoop eco system
• Use cases that can be addressed using Hadoop eco system
• Hadoop eco system tools/landscape

Evolution of Hadoop eco system
• GFS to HDFS
• Google Map Reduce to Hadoop Map Reduce
• Big Table to HBase
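For readers new to the Map Reduce programming model mentioned above, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce) using word count. It is plain Python rather than the Hadoop API, and the function names are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data tools"]))
```

In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network; the sketch only shows the data flow.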

Use cases that can be addressed using Hadoop eco system
• ETL
• Real time reporting
• Batch reporting
• Operational but not transactional

Hadoop eco system tools/landscape
• Operational and real time data integration
  – HBase
• ETL
  – Map Reduce, Hive/Pig, Sqoop etc.
• Reporting
  – Hive (batch)
  – Impala/Presto (real time)
• Analytics API
  – Map Reduce
  – Other frameworks
• Miscellaneous/complementary tools
  – ZooKeeper (co-ordination service for masters)
  – Oozie (workflow/scheduler)
  – Chef/Puppet (automation for administrators)
  – Vendor specific management tools (Cloudera Manager, Hortonworks Ambari etc.)

EDW (Current Architecture)

[Diagram: Source(s) (OLTP, closed mainframes, XML, external apps); Data Integration (ETL/Real Time); ODS and EDW (Data Warehouse); Visualization/Reporting (Reporting, Decision Support)]

Use Case – EDW (Current Architecture)

• The Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management, hence the user base viewing the reports is typically in the tens or hundreds
• Data Integration
  – ODS (Operational Data Store)
    Sources: disparate
    Real time: tools/custom (GoldenGate, SharePlex etc.)
    Batch: tools/custom
    Uses: compliance, data lineage, reports etc.
  – Enterprise Data Warehouse
    Sources: ODS or other sources
    ETL: tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
  – ODS (compliance related reporting)
  – Enterprise Data Warehouse
  – Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)
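The batch ETL step from an operational source into a warehouse table can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not any of the tools named above; the table names (`orders`, `dw_customer_sales`) and the sample rows are hypothetical, and a single in-memory database stands in for both the source and the warehouse.

```python
import sqlite3


def run_etl(conn):
    """Extract order rows, aggregate per customer, load the warehouse table."""
    # Extract: read raw operational rows
    rows = conn.execute("SELECT customer, amount FROM orders").fetchall()

    # Transform: total amount per customer
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount

    # Load: full refresh of the reporting table
    conn.execute("DELETE FROM dw_customer_sales")
    conn.executemany(
        "INSERT INTO dw_customer_sales (customer, total_amount) VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()
    return len(totals)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)"
    )
    conn.execute(
        "CREATE TABLE dw_customer_sales (customer TEXT PRIMARY KEY, total_amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("acme", 100), ("acme", 50), ("globex", 70)],
    )
    run_etl(conn)
    return conn.execute(
        "SELECT customer, total_amount FROM dw_customer_sales ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

A reporting tool would then query only the aggregated table, which is why the warehouse can serve a small audience without touching the operational system.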

EDW (Big Data eco system)

[Diagram: Source(s) (OLTP, closed mainframes, XML, external apps); Real Time/Batch (No ETL); Hadoop Cluster (EDW/ODS) with three nodes; ETL; Reporting Database; Reporting, Decision Support]

Hadoop eco system

[Diagram, built up over three slides]
• Hadoop Core Components
  – Distributed File System (HDFS)
  – Map Reduce
• Hadoop Components
  – Hive: T and L, batch reporting
  – Pig
  – Flume
  – Sqoop: E and L
  – Oozie: workflows
  – Mahout
  – Custom Map Reduce: E, T and L
• Non Map Reduce
  – Impala: interactive/ad hoc reporting
  – Presto
  – HBase: real time data integration or reporting
  – Spark

Disadvantages of Map Reduce based solutions
• Designed for batch; not meant for interactive and ad hoc reporting
• I/O bound; processing of micro batches can be an issue
• Too many tools/technologies (Map Reduce, Hive, Pig, Sqoop, Flume etc.) are needed to build applications
• Not suitable for enterprise hardware where storage is typically network mounted

Apache Spark
• Spark can work with any file system including HDFS
• Processing is done in memory, hence I/O is minimized
• Suitable for ad hoc or interactive querying or reporting
• Streaming jobs can be done much faster than with Map Reduce
• Applications can be developed using Scala, Python, Java etc.
• Choose one programming language and perform
  – Data integration from RDBMS using JDBC (no need for Sqoop)
  – Stream data using Spark Streaming
  – Leverage data frames and SQL embedded in the programming language
• As processing is done in memory, Spark works well with enterprise hardware with a network file system
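Why in-memory processing reduces I/O can be shown with a toy two-stage job. The sketch below is plain Python and does not use the Spark API: one variant writes the intermediate result to disk between stages, as a chain of Map Reduce jobs would, and the other passes it along in memory. The stage functions and the I/O counter are assumptions made for illustration.

```python
import json
import os
import tempfile


def stage_clean(records):
    """Stage 1: normalise records and drop empty ones."""
    return [r.strip().lower() for r in records if r.strip()]


def stage_count(records):
    """Stage 2: count occurrences of each record."""
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
    return counts


def run_via_disk(records, workdir):
    """Materialise the intermediate result on disk between the two stages."""
    path = os.path.join(workdir, "stage1.json")
    with open(path, "w") as handle:
        json.dump(stage_clean(records), handle)   # disk write
    with open(path) as handle:
        intermediate = json.load(handle)           # disk read
    return stage_count(intermediate), 2            # two disk operations


def run_in_memory(records):
    """Hand the intermediate result to the next stage without touching disk."""
    return stage_count(stage_clean(records)), 0


if __name__ == "__main__":
    data = [" A ", "b", "a", ""]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_via_disk(data, workdir))
    print(run_in_memory(data))
```

Both variants return the same counts; only the number of disk operations differs, and that gap grows with every additional stage in the pipeline.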

EDW (Big Data eco system – Spark)

[Diagram: Source(s) (OLTP, closed mainframes, XML, external apps); Real Time/Batch (No ETL); Hadoop Cluster (EDW/ODS) with three nodes; ETL; Reporting, Decision Support]

Role of Apache
• Each of these is a separate project incubated under Apache
  – HDFS and MapReduce/YARN
  – Hive
  – Pig
  – Sqoop
  – HBase
  – etc.

Installation (plain vanilla)
• In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed and configured
• Typically people use Puppet or Chef to set up clusters using plain vanilla tools
• Advantages
  – You can set up your cluster with the latest versions directly from Apache
• Disadvantages
  – Installation is tedious and error prone
  – Need to integrate with monitoring tools

Hadoop Distributions
• Different vendors pre-package the Apache suite of big data tools into their distributions to facilitate
  – Easier installation/upgrade using wizards
  – Better monitoring
  – Easier maintenance
  – and many more
• Leading distributions include, but are not limited to
  – Cloudera
  – Hortonworks
  – MapR
  – AWS EMR
  – IBM BigInsights
  – and many more

Hadoop Distributions

[Diagram: Apache Foundation projects (HDFS/YARN/MR, Hive, Pig, Sqoop, Impala, Tez, Flume, Spark, Ganglia, HBase, ZooKeeper) packaged into distributions by Cloudera, Hortonworks, MapR and AWS]


For further info/assistance contact: +91 80 6567 9700, www.springpeople.com
