Introduction to Big Data, NoSQL & Hadoop
BIG DATA
WHAT IS BIG DATA?
Big Data refers to TECHNOLOGY and INITIATIVES that involve data that is too DIVERSE, FAST-CHANGING or MASSIVE for conventional technologies, skills and infrastructure to address efficiently.
BIG DATA CHARACTERISTICS
VOLUME: High data capacity (terabytes or petabytes)
VELOCITY: Batch, real-time, streams
VARIETY: Various kinds (structured, unstructured, semi-structured)
VERACITY: Quality, consistency, reliability
BIG DATA TYPES
Type | Characteristics | Examples | Technology
STRUCTURED data | Entities with a predefined format/schema | RDBMS records | RDBMS, NoSQL
SEMI-STRUCTURED data | Less rigid structure; may have a schema | XML files, JSON files | NoSQL, MapReduce
UNSTRUCTURED data | NO structure | Email content, images, videos, PDF files | MapReduce
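To make the three data types concrete, here is a minimal sketch of how each might look when loaded in Python. The sample records (names, fields, the email text) are made up for illustration and do not come from the slides.

```python
import csv
import io
import json

# Structured: every row follows the same predefined schema (id, name, age).
structured = io.StringIO("id,name,age\n1,Ann,34\n2,Bob,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing JSON; fields may differ between records.
semi_structured = [
    json.loads('{"id": 1, "name": "Ann", "tags": ["vip"]}'),
    json.loads('{"id": 2, "name": "Bob", "address": {"city": "Lima"}}'),
]

# Unstructured: free text with no schema; any structure has to be extracted.
unstructured = "Hi team, Ann called about her order and asked for a refund."

print(sorted(rows[0].keys()))                           # same columns in every row
print([sorted(doc.keys()) for doc in semi_structured])  # keys vary per record
print("refund" in unstructured.lower())                 # only text search is possible
```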
BIG DATA CHALLENGES IN STORAGE & ANALYSIS
1. SLOW PROCESSING, UNSCALABLE
   SSD (800 MB/s, 2 ms seek)
   SATA (300 MB/s)
   IDE drive (75 MB/s, 10 ms seek)
2. UNRELIABLE MACHINES
   Risky
3. RELIABILITY
   Scalability
   Data recovery
   Partial failure
4. BACKUP
5. PARALLEL PROCESSING
6. HIGH COST
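A quick back-of-the-envelope calculation shows why a single disk is the bottleneck and why parallel processing matters. The sketch below uses the 75 MB/s IDE figure from the slide; the 1 TB data size and the 100-drive cluster are assumptions chosen only for illustration, and seek time is ignored.

```python
TB = 10**12  # bytes in one terabyte (decimal)
MB = 10**6   # bytes in one megabyte (decimal)


def read_time_hours(data_bytes, throughput_mb_per_s, drives=1):
    """Hours to scan data_bytes split evenly across drives that are read in parallel."""
    seconds = data_bytes / (throughput_mb_per_s * MB * drives)
    return seconds / 3600


single = read_time_hours(TB, 75)                # one 75 MB/s drive
parallel = read_time_hours(TB, 75, drives=100)  # hypothetical 100 drives in parallel

print(f"1 drive:    {single:.2f} hours")          # 1 drive:    3.70 hours
print(f"100 drives: {parallel * 60:.2f} minutes")  # 100 drives: 2.22 minutes
```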
HADOOP
WHAT IS HADOOP?
A free, Java-based framework that allows the DISTRIBUTED PROCESSING of LARGE DATA SETS across CLUSTERS OF COMPUTERS using SIMPLE PROGRAMMING MODELS.
HADOOP ORIGIN
2002-2004: GOOGLE PUBLISHES THE GFS & MAP REDUCE PAPERS
2004: DOUG CUTTING ADDS GFS & MAP REDUCE TO NUTCH
YAHOO! HIRES DOUG CUTTING AND BUILDS A TEAM TO DEVELOP HADOOP
2007: THE NY TIMES CONVERTS 4 TB OF ARCHIVES (CLUSTER OF 100 EC2 INSTANCES)
WEB-SCALE DEVELOPMENT AT YAHOO!, FACEBOOK, TWITTER
2009: YAHOO! DOES THE FASTEST SORT OF A TB IN 62 SEC AND SORTS A PB IN 16.25 HOURS (3658 NODES)
APACHE HADOOP IS NOW OPEN SOURCE
HADOOP ARCHITECTURE
Hadoop is designed and built on top of two independent parts:
HADOOP = HDFS (storage file system) + MAP REDUCE (processing)
HDFS – Hadoop Distributed File System
Distributed across “NODES”
NAME NODE
Master of the system
Stores metadata: transaction log, list of files, list of blocks, data nodes
Maintains and manages blocks on data nodes
DATA NODE
Slaves; deployed on each machine
Provide the actual storage
Responsible for serving read/write requests
[Figure: HDFS model]
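The split between the name node (metadata) and the data nodes (actual storage) can be sketched with a toy in-memory model. This is not Hadoop code: the class names, the tiny block size, the replication factor of 2 and the round-robin placement are all assumptions made for illustration, and the real HDFS placement policy and client protocol are more involved.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose so the example stays readable
REPLICATION = 2   # copies kept of each block (assumed value for this sketch)


class DataNode:
    """Slave: provides the actual storage for blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Master: holds metadata only, never the file contents."""

    def __init__(self):
        self.files = {}      # file name -> ordered list of block ids
        self.locations = {}  # block id -> names of the data nodes holding it


def write_file(name_node, data_nodes, filename, data):
    """Split data into blocks and store each block on REPLICATION data nodes."""
    name_node.files[filename] = []
    for index, offset in enumerate(range(0, len(data), BLOCK_SIZE)):
        block_id = f"{filename}#{index}"
        # Round-robin placement: a simplification, not the real HDFS policy.
        targets = [data_nodes[(index + i) % len(data_nodes)] for i in range(REPLICATION)]
        for node in targets:
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        name_node.files[filename].append(block_id)
        name_node.locations[block_id] = [node.name for node in targets]


def read_file(name_node, data_nodes, filename, failed=()):
    """Reassemble a file from its blocks, skipping data nodes listed in failed."""
    by_name = {node.name: node for node in data_nodes}
    content = b""
    for block_id in name_node.files[filename]:
        alive = [n for n in name_node.locations[block_id] if n not in failed]
        if not alive:
            raise IOError(f"all replicas of {block_id} are unavailable")
        content += by_name[alive[0]].blocks[block_id]
    return content


nodes = [DataNode(f"dn{i}") for i in range(3)]
name_node = NameNode()
write_file(name_node, nodes, "log.txt", b"hello hadoop")

print(name_node.files["log.txt"])                              # metadata: 3 block ids
print(read_file(name_node, nodes, "log.txt", failed={"dn0"}))  # still readable
```

Because every block has a second copy on another data node, the file can still be read in full when one node is marked as failed.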
MAP REDUCE COMPONENTS
JOB TRACKER
Master; manages jobs & resources in the cluster
TASK TRACKER
Slaves; deployed on each machine
Run the map & reduce tasks as the job tracker requires
[Figure: MapReduce model]
MAP REDUCE ALGORITHM
o Parallel algorithm
o 3 basic steps
   Map step: split data into key & value
   Shuffle step: sorted by key
   Reduce step: gather
o Logical functions: MAPPER & REDUCER
MAP REDUCE FUNCTIONS
o Hadoop handles distributing MAP & REDUCE tasks across the cluster.
o MAP & REDUCE functions are written, packaged as .jar files and submitted to the Hadoop cluster.
o Typically batch oriented.
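The map, shuffle and reduce steps can be illustrated with a word count that runs in a single Python process. This is only a sketch of the programming model: the function names are chosen for this example, and on a real cluster Hadoop would run many mappers and reducers in parallel on different machines.

```python
from collections import defaultdict


def mapper(line):
    """Map step: split a line into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group the values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce step: gather the values of one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run the three steps one after another on a list of input lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle(mapped)]


lines = ["big data big cluster", "data node"]
print(map_reduce(lines))
# [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

Only `mapper` and `reducer` carry application logic; the shuffle in between is generic, which is why the framework can take care of it.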
[Figure: Hadoop ecosystem model]
HADOOP FEATURES SUMMARY
STORE ANYTHING
Unstructured data, semi-structured data
STORAGE CAPACITY
Scales linearly; cost is not exponential
DATA LOCALITY & PROCESS IN YOUR WAY
FAILURE & FAULT TOLERANCE
Detects failure & heals itself (data is replicated, failed tasks are re-run, no need to maintain backup data)
COST EFFECTIVE
PRIMARILY USED FOR BATCH PROCESSING, NOT REAL-TIME
WHO IS USING HADOOP & FOR WHAT
SEARCH
LOG PROCESSING
RECOMMENDATION SYSTEMS
DATA WAREHOUSE
VIDEO & IMAGE ANALYSIS
AND MANY MORE …
NOSQL
WHAT IS NOSQL?
NOSQL = Not Only SQL
SCHEMA FREE
NOSQL CATEGORIES
KEY VALUE STORE: DYNAMO, AZURE, REDIS, MEMCACHED
BIG TABLE (GOOGLE) / COLUMN STORE: HBASE, CASSANDRA
Similar to RDBMS but handles semi-structured data
GRAPH DB: NEO4J
DOCUMENT STORE: MONGODB, REDIS, COUCHDB
Similar to a key-value store, but the DB knows what the value is
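The difference between a key-value store and a document store can be sketched with plain Python dictionaries. Nothing here uses a real database: `kv_store`, `doc_store` and `find` are hypothetical names for this illustration, and the sample users are made up.

```python
import json

# Key-value store: the value is an opaque blob; the only access path is the key.
kv_store = {"user:1": json.dumps({"name": "Ann", "city": "Oslo"}).encode()}

# Document store: the database understands the fields inside each value,
# so it can filter on them.
doc_store = {
    "user:1": {"name": "Ann", "city": "Oslo"},
    "user:2": {"name": "Bob", "city": "Lima"},
}


def find(store, **criteria):
    """Return the documents whose fields match all criteria (a toy field query)."""
    return [
        doc for doc in store.values()
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    ]


blob = kv_store["user:1"]
print(type(blob).__name__)           # bytes
print(find(doc_store, city="Lima"))  # [{'name': 'Bob', 'city': 'Lima'}]
```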
MONGO DB – DATA MODELING CONCEPT
Data in MongoDB has A FLEXIBLE SCHEMA.
Data is stored in the form of DOCUMENTS (JSON-like key-value pairs).
A COLLECTION is a group of RELATED DOCUMENTS.
No JOIN; instead, there are 2 types of DOCUMENT STRUCTURE:
Reference
Embedded
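The two document structures can be shown with plain Python dictionaries standing in for documents. The field names (`addresses`, `user_id`) and the sample data are assumptions for this sketch, not taken from the slides.

```python
# Embedded: related data lives inside the parent document; one read returns everything.
embedded_user = {
    "_id": 1,
    "name": "Ann",
    "addresses": [{"city": "Oslo"}, {"city": "Lima"}],
}

# Reference: documents point at each other by id; the application resolves the link.
users = {1: {"_id": 1, "name": "Ann"}}
addresses = [
    {"_id": 10, "user_id": 1, "city": "Oslo"},
    {"_id": 11, "user_id": 1, "city": "Lima"},
]


def addresses_of(user_id):
    """Resolve the reference with a second lookup, since there is no JOIN."""
    return [address["city"] for address in addresses if address["user_id"] == user_id]


print([a["city"] for a in embedded_user["addresses"]])  # ['Oslo', 'Lima']
print(addresses_of(1))                                  # ['Oslo', 'Lima']
```

Embedding favours data that is read together; references avoid duplication when the same data is shared by many documents.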
* Always consider the usage of data (queries or updates) when designing data models
MODEL RELATIONSHIP BETWEEN DOCUMENTS
One-to-one
One-to-many
MODEL TREE STRUCTURES
Parent reference
Child reference
Array of ancestors
Materialized paths
Nested sets
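The five tree-modeling patterns can be compared on one small tree. The category tree (Books, Programming, Databases), the field names and the two helper functions are assumptions for this sketch; plain dictionaries stand in for documents.

```python
# Tree used in every variant:  Books -> Programming -> Databases

parent_refs = [  # Parent reference: each node stores its parent
    {"_id": "Books", "parent": None},
    {"_id": "Programming", "parent": "Books"},
    {"_id": "Databases", "parent": "Programming"},
]

child_refs = [  # Child reference: each node stores its children
    {"_id": "Books", "children": ["Programming"]},
    {"_id": "Programming", "children": ["Databases"]},
    {"_id": "Databases", "children": []},
]

ancestor_arrays = [  # Array of ancestors: each node stores the full chain above it
    {"_id": "Books", "ancestors": []},
    {"_id": "Programming", "ancestors": ["Books"]},
    {"_id": "Databases", "ancestors": ["Books", "Programming"]},
]

materialized_paths = [  # Materialized paths: the chain is stored as one string
    {"_id": "Books", "path": None},
    {"_id": "Programming", "path": ",Books,"},
    {"_id": "Databases", "path": ",Books,Programming,"},
]

nested_sets = [  # Nested sets: a node contains every node numbered inside its range
    {"_id": "Books", "left": 1, "right": 6},
    {"_id": "Programming", "left": 2, "right": 5},
    {"_id": "Databases", "left": 3, "right": 4},
]


def ancestors_via_parent(node_id):
    """Walk parent references upward: one lookup per tree level."""
    by_id = {node["_id"]: node for node in parent_refs}
    chain = []
    parent = by_id[node_id]["parent"]
    while parent is not None:
        chain.insert(0, parent)
        parent = by_id[parent]["parent"]
    return chain


def descendants_via_path(node_id):
    """Materialized paths: descendants are found with a single substring match."""
    return [
        node["_id"] for node in materialized_paths
        if node["path"] and f",{node_id}," in node["path"]
    ]


print(ancestors_via_parent("Databases"))  # ['Books', 'Programming']
print(descendants_via_path("Books"))      # ['Programming', 'Databases']
```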
MONGO DB – CRUD OPERATIONS
COMPARING: SQL VS MONGO STATEMENTS
QUERY STATEMENT
CREATE / INSERT / UPDATE / DELETE
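The comparison can be sketched as a small lookup table pairing each SQL statement with a roughly equivalent mongo shell statement. The `users` table and its fields are hypothetical. Only the SQL side is executed here (against an in-memory SQLite database, to check that it is valid); the mongo shell statements are shown as text and would need a running MongoDB server.

```python
import sqlite3

# (operation, SQL statement, roughly equivalent mongo shell statement)
STATEMENTS = [
    ("create", "CREATE TABLE users (name TEXT, age INTEGER)",
     'db.createCollection("users")'),
    ("insert", "INSERT INTO users (name, age) VALUES ('Ann', 34)",
     'db.users.insertOne({name: "Ann", age: 34})'),
    ("query", "SELECT * FROM users WHERE age > 30",
     "db.users.find({age: {$gt: 30}})"),
    ("update", "UPDATE users SET age = 35 WHERE name = 'Ann'",
     'db.users.updateMany({name: "Ann"}, {$set: {age: 35}})'),
    ("delete", "DELETE FROM users WHERE name = 'Ann'",
     'db.users.deleteMany({name: "Ann"})'),
]

conn = sqlite3.connect(":memory:")
results = {}
for operation, sql, mongo in STATEMENTS:
    # Run the SQL side only; non-SELECT statements return an empty row list.
    results[operation] = conn.execute(sql).fetchall()
    print(f"{operation:<7} SQL:   {sql}")
    print(f"{'':<7} Mongo: {mongo}")

print(results["query"])  # [('Ann', 34)]
```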
THE END