Overview of the big data zoo
DESCRIPTION
Explains different open source big data tools and where they fit
TRANSCRIPT
Data Analysis as a Service
Iou Fag(halv)dag, 2014
Gurvinder Singh, Uninett
Data is the King
Big-Data is ...... ?
Big-Data is relative
What the hype is ..
Cheap commodity hardware with amazing computing and storage capacity
... but this time software is also catching up with hardware
Hype ingredient list is ..
Cheap commodity hardware
Good network capacity
Software based on the principle of "Divide and Conquer"
..thus scale out horizontally
Storage
Unstructured Storage
Store data reliably, cheaply and scalably
Hadoop Distributed File System (HDFS)
Divide data into smaller chunks
Heterogeneous storage media support
Similar distributed file systems, e.g. Lustre, IBM GPFS, Ceph, MooseFS
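
The "divide data into smaller chunks" idea can be illustrated with a toy sketch. This is not HDFS code: the function names, the tiny block size, the node names and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign every block to `replicas` distinct nodes, round-robin (hypothetical policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# A 10-byte "file" with a 4-byte block size, stored on three hypothetical nodes.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"], replicas=2)
```

Because every block lives on more than one node, losing a single node does not lose data, and different blocks of the same file can be read in parallel.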
Structured Storage
Store structured data reliably, scalably and indexed
NoSQL databases to store structured data
HBase and Accumulo store their underlying data in HDFS
Many more in the big data zoo: Cassandra, VoltDB, NuoDB...
BlinkDB offers tradeoff between accuracy & response time
Full-text search offered by Elasticsearch, Solr
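
Elasticsearch and Solr both build on Apache Lucene, whose core data structure is an inverted index: a map from each term to the documents containing it. The sketch below shows only that idea, in plain Python. The function names and sample documents are hypothetical, and real engines add tokenization, ranking and distributed shards on top.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents containing ALL query terms (simple AND query)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "HDFS stores data in blocks",
    2: "HBase stores data in HDFS",
    3: "Spark keeps data in RAM",
}
index = build_index(docs)
hits = search(index, "hdfs data")
```

A query only touches the entries for its own terms, so lookup cost depends on the query rather than on scanning every document.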
Processing
MapReduce methodology to process data in a distributed fashion
Data locality with Hadoop MapReduce and HDFS
Spark supports MapReduce and utilizes the system's and cluster's RAM
Supports machine learning algorithms
Supports Python, Scala, Java
Supports R, framework for data scientists
Hive, Shark, Pig to process structured data in a distributed way
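
The MapReduce methodology can be sketched on a single machine as a word count. This is an illustration of the map, shuffle and reduce phases only, not Hadoop or Spark code. All function names are hypothetical, and in a real cluster each phase would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data zoo", "big data tools"])
```

Each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, which is what lets a framework spread the work across many machines.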
Some performance numbers to guide..
L1 cache reference 0.5 ns
L2 cache reference 7 ns
RAM reference 100 ns (Queen)
Flash IO card reference 75,000 ns (Princess)
RTT within same datacenter 500,000 ns
Disk reference 10,000,000 ns
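
To put the latency figures above in proportion, a short calculation expresses each one relative to a chosen baseline. The numbers are copied from the slide; the dictionary and function names are made up for this sketch.

```python
from typing import Dict

# Latencies in nanoseconds, as listed on the slide.
LATENCY_NS: Dict[str, float] = {
    "L1 cache": 0.5,
    "L2 cache": 7,
    "RAM": 100,
    "Flash IO card": 75_000,
    "RTT within same datacenter": 500_000,
    "Disk": 10_000_000,
}


def relative_to(baseline: str) -> Dict[str, float]:
    """Express every latency as a multiple of the baseline's latency."""
    base = LATENCY_NS[baseline]
    return {name: ns / base for name, ns in LATENCY_NS.items()}


ratios = relative_to("RAM")
for name, factor in ratios.items():
    print(f"{name}: {factor:g}x RAM")
```

With these figures a disk reference costs 100,000 RAM references and a flash reference 750, which is one way to read the slide's ranking of RAM as Queen and flash as Princess.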
THE END
By Gurvinder Singh