Overview of Big data zoo

Data Analysis as a Service Iou Fag(halv)dag, 2014 Gurvinder Singh, Uninett

Upload: gurvinder-singh

Post on 27-Jan-2015



DESCRIPTION

Explains different open source big data tools and where they fit

TRANSCRIPT

Page 1: Overview of Big data zoo

Data Analysis as a Service

Iou Fag(halv)dag, 2014

Gurvinder Singh, Uninett

Page 2: Overview of Big data zoo

Data is the King

Page 3: Overview of Big data zoo

Big-Data is ...... ?

Page 5: Overview of Big data zoo

What the hype is ..

Cheap commodity hardware with amazing computing and storage capacity

... but this time software has also been catching up with hardware

Page 6: Overview of Big data zoo

Hype Ingredient list is ..

Cheap commodity hardware

Good network capacity

Software based on the principle of "Divide and Conquer"

..thus scale out horizontally
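The "Divide and Conquer" idea can be sketched in a few lines: split the input into partitions, let each worker handle one partition, then combine the partial results. This is only a toy illustration. The function names `partition` and `scale_out_sum` are hypothetical, and threads on one machine stand in for the separate commodity nodes a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def scale_out_sum(items, n_workers):
    """Divide: one partition per worker. Conquer: sum each. Combine: add up."""
    parts = partition(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)


numbers = list(range(1, 101))
total = scale_out_sum(numbers, n_workers=4)
print(total)
```

Adding capacity in this model means raising `n_workers` (more nodes) rather than making a single worker faster, which is what scaling out horizontally refers to.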

Page 7: Overview of Big data zoo

Storage

Page 8: Overview of Big data zoo

Unstructured Storage

Store data reliably, cheaply and scalably

Hadoop Distributed File System (HDFS)

Divide data into smaller chunks

Heterogeneous storage medium support

Similar DFS e.g. Lustre, IBM GPFS, Ceph, MooseFS
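A rough sketch of "divide data into smaller chunks": cut a file into fixed-size blocks and store each block on several nodes so that losing one node does not lose data. The block size, node names, replication factor and the round-robin placement below are assumptions chosen for illustration. HDFS uses its own, more elaborate placement policy, and this is not its API.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into consecutive blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(n_blocks, nodes, replication):
    """Toy round-robin placement: block i goes to `replication` consecutive nodes."""
    placement = {}
    for i in range(n_blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical node names
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```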

Page 9: Overview of Big data zoo

Structured Storage

Store structured data reliably, scalably and indexed

NoSQL databases to store structured data

HBase and Accumulo store their underlying data in HDFS

Many more in big data zoo: Cassandra, VoltDB, NuoDB...

BlinkDB offers tradeoff between accuracy & response time

Full-text search offered by Elasticsearch, Solr
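The accuracy versus response time tradeoff can be illustrated with plain sampling: answering an aggregate query from a small random sample is faster but comes with a larger error bar. The sketch below is an assumption-laden toy (uniform random sampling over an in-memory list, a hypothetical helper `approximate_mean`). It is not BlinkDB's interface or its actual sampling strategy.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, rng):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, standard_error); a smaller sample is cheaper to
    scan but yields a larger standard error.
    """
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, std_error


rng = random.Random(42)                 # fixed seed for repeatability
values = list(range(1_000_000))         # stand-in for a large table column
exact = statistics.fmean(values)

small_est, small_err = approximate_mean(values, 100, rng)
large_est, large_err = approximate_mean(values, 10_000, rng)

print(f"exact mean          : {exact:,.1f}")
print(f"100-row sample      : {small_est:,.1f} +/- {small_err:,.1f}")
print(f"10,000-row sample   : {large_est:,.1f} +/- {large_err:,.1f}")
```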

Page 10: Overview of Big data zoo

Processing

MapReduce methodology to process data in a distributed fashion

Data locality with Hadoop MapReduce and HDFS

Spark supports MapReduce and utilizes the cluster's RAM

Supports machine learning algorithms

Supports Python, Scala, Java

Supports R, a framework for data scientists

Hive, Shark, Pig to process structured data in a distributed way
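The MapReduce methodology can be sketched as three steps: map each record to key/value pairs, shuffle the pairs so equal keys end up together, and reduce each group to a result. The word count below runs in a single process and uses hypothetical helper names (`map_phase`, `shuffle`, `reduce_phase`). In Hadoop or Spark the map and reduce steps would run in parallel on the nodes holding the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


lines = ["big data zoo", "data is the king", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```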

Page 11: Overview of Big data zoo

Some performance numbers to guide ..

L1 cache reference 0.5 ns

L2 cache reference 7 ns

RAM reference 100 ns (Queen)

Flash IO card reference 75,000 ns (Princess)

RTT within same datacenter 500,000 ns

Disk reference 10,000,000 ns
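To make these numbers easier to compare, the small sketch below expresses each latency as a multiple of a RAM reference. The figures are the ones from the slide; the helper name `relative_to` is hypothetical.

```python
# Latencies from the slide, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "RAM reference": 100,
    "Flash IO card reference": 75_000,
    "RTT within same datacenter": 500_000,
    "Disk reference": 10_000_000,
}


def relative_to(baseline, table=LATENCY_NS):
    """Return each latency divided by the latency of `baseline`."""
    base = table[baseline]
    return {name: ns / base for name, ns in table.items()}


ratios = relative_to("RAM reference")
for name, ratio in ratios.items():
    print(f"{name:<28} {ratio:>12,.3f}x RAM")
```

On these figures a disk reference costs 100,000 RAM references and a flash reference 750, which is the motivation for keeping working data in the cluster's RAM where possible.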

Page 12: Overview of Big data zoo

THE END

By Gurvinder Singh