Introduction to Big Data Technologies & Applications
Big Data Myths
● People talk about Big Data all the time: the 3Vs
○ Volume
○ Variety
○ Velocity
● Business Value in Data
○ Customer Insights
○ Product Insights
VOLUME
● Data is BIG
● Hard drive storage capacity has increased massively compared to access speed
VARIETY
● Different kinds of data
○ Structured
○ Semi-structured
○ Unstructured
● Structured
● Semi-structured
○ Self-describing information (JSON, XML, logs)
● Unstructured
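To make the three categories concrete, here is a minimal Python sketch. The sample records and field names (`order_id`, `user`, `amount`, `tags`) are invented for illustration and do not come from any real dataset.

```python
import csv
import io
import json

# Structured: fixed columns known in advance (hypothetical order table).
structured = "order_id,user,amount\n1001,alice,25.50\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: each record carries its own field names, and
# records may differ in shape (hypothetical click event).
semi_structured = '{"user": "alice", "event": "click", "tags": ["promo", "mobile"]}'
event = json.loads(semi_structured)

# Unstructured: free text with no schema; any structure must be inferred.
unstructured = "Great phone, but the battery drains too fast."
words = unstructured.lower().replace(",", "").replace(".", "").split()

print(rows[0]["amount"])   # field located by a predefined column name
print(event["tags"])       # field located by a name stored in the record
print(len(words))          # only generic text operations apply
```

The practical difference is where the schema lives: outside the data (structured), inside each record (semi-structured), or nowhere (unstructured).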
VELOCITY
● Characteristics
○ How fast does data become available for processing?
○ How fast is the processing?
● Data accumulates at very high rates
○ Click streams
○ Supermarket transactions
○ Social media interactions
Big Data Technologies
● Technologies
○ Collecting
○ Storage
○ Computation
○ Stream Processing
○ Data Mining
Big Data Collecting
Scribe
● Scribe is a server for aggregating log data that's streamed in real time from clients.
● It was designed and developed by Facebook.
● It is no longer actively developed.
Apache Kafka
● Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system: producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
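The commit log idea can be sketched in a few lines of plain Python. This is an in-memory toy, not the Kafka client API: the class `ToyLog`, its methods, and the consumer names are hypothetical, and replication and networking are left out entirely.

```python
from collections import defaultdict
from zlib import crc32


class ToyLog:
    """In-memory stand-in for a partitioned commit log (not the Kafka API)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list; a list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer tracks its own read position per partition.
        self.offsets = defaultdict(int)

    def produce(self, key, value):
        # Messages with the same key always land in the same partition.
        p = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def consume(self, consumer, partition):
        # Return unread messages in append order and advance the offset.
        start = self.offsets[(consumer, partition)]
        batch = self.partitions[partition][start:]
        self.offsets[(consumer, partition)] = start + len(batch)
        return batch


log = ToyLog(num_partitions=3)
p, first_offset = log.produce("user-1", "view:/home")
log.produce("user-1", "click:/buy")
print(log.consume("dashboard", p))    # both messages, in write order
print(log.consume("dashboard", p))    # nothing new, so an empty list
print(log.consume("recommender", p))  # a second consumer reads independently
```

Because messages are kept in the log rather than removed on delivery, each consumer only needs to remember an offset, which is what lets several consumers read the same data at their own pace.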
Big Data Storage
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
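Running on commodity hardware means machines are expected to fail, and HDFS copes by storing each file as blocks that are replicated on several nodes. The sketch below illustrates that idea only: the function `place_blocks`, the node names, and the sizes are hypothetical, and round-robin placement is a simplification rather than HDFS's actual placement policy.

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Split a file into blocks and pick `replication` distinct nodes for each.

    Round-robin placement is a simplification for illustration; it is
    not HDFS's actual placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


# Hypothetical cluster: a 300 MB file, 128 MB blocks, 4 nodes, 3 replicas.
layout = place_blocks(300, 128, ["node-a", "node-b", "node-c", "node-d"], 3)
for block, replicas in layout.items():
    print(block, replicas)

# Any single node can fail and every block still has a live replica.
failed = "node-b"
assert all(any(n != failed for n in reps) for reps in layout.values())
```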
NoSQL Datastores
● NoSQL: next-generation databases that mostly address some of these points: being non-relational, distributed, open-source and horizontally scalable.
● Types:
○ Key-Value Store
○ Document Store
○ Column Store
○ Graph Database
○ Content Delivery Network
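The first four store types differ mainly in how a record is shaped and addressed. As a rough illustration, the sketch below expresses one invented product record in each model using plain Python structures. All names and values are hypothetical, and no real database API is involved.

```python
# One hypothetical product record expressed in four NoSQL data models.

# Key-value store: an opaque value looked up by a single key.
key_value = {"product:42": b'{"name": "Phone X", "price": 199}'}

# Document store: the value is a nested, queryable document.
document = {"_id": 42, "name": "Phone X", "price": 199,
            "specs": {"ram_gb": 4, "colors": ["black", "gold"]}}

# Column store: each row holds named columns grouped into families.
column = {"42": {"info": {"name": "Phone X", "price": 199},
                 "stats": {"views": 1200}}}

# Graph database: nodes plus typed edges between them.
nodes = {"p42": {"type": "product"}, "s7": {"type": "shop"}}
edges = [("s7", "SELLS", "p42")]


def sellers_of(product_id):
    """Follow SELLS edges backwards to find shops selling a product."""
    return [src for src, rel, dst in edges if rel == "SELLS" and dst == product_id]


print(document["specs"]["colors"])     # document stores expose nested fields
print(column["42"]["stats"]["views"])  # column stores address family and column
print(sellers_of("p42"))               # graph stores answer relationship queries
```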
HBase
A distributed, scalable, versioned, non-relational datastore on top of HDFS, modeled after Google's Bigtable.
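"Versioned" here means a cell can hold several timestamped values at once. The following is a toy model of that idea in plain Python, not the HBase client API: the class `ToyVersionedTable`, its `max_versions` parameter, and the sample row are all hypothetical.

```python
from collections import defaultdict


class ToyVersionedTable:
    """Toy model of a versioned cell store (not the HBase client API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, "family:qualifier") -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, ts):
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, ts=None):
        # Return the newest value, or the newest one at or before `ts`.
        for version_ts, value in self.cells.get((row, column), []):
            if ts is None or version_ts <= ts:
                return value
        return None


table = ToyVersionedTable()
table.put("shop-7", "info:name", "Old Name", ts=1)
table.put("shop-7", "info:name", "New Name", ts=2)
print(table.get("shop-7", "info:name"))        # latest version
print(table.get("shop-7", "info:name", ts=1))  # value as of timestamp 1
```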
Big Data Computation
Hadoop MapReduce
● Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
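The programming model behind the framework can be shown with the classic word count, written here as a single-process Python sketch. It mimics the map, shuffle, and reduce phases only; it does not use the Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all values by key so each key reaches one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Since each map call sees one record and each reduce call sees one key, a framework is free to run them on different machines and to rerun any piece that fails.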
Apache Spark
● Fast and general engine for large-scale data processing.
● Suitable for iterative algorithms
Big Data Stream Processing
Apache Samza
● Apache Samza is a distributed stream processing framework.
● It uses Kafka to guarantee that messages are processed in the order they were written to a partition.
● Whenever a machine in the cluster fails, Samza works with Hadoop YARN to transparently migrate your tasks to another machine.
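In-order processing and task migration can be pictured with a small Python toy. This is not the Samza API: the class `ToyCountTask` and the checkpoint dictionary are assumptions made for illustration, showing one plausible way a restarted task could resume where the failed one stopped.

```python
class ToyCountTask:
    """Toy stream task that counts events per key (not the Samza API)."""

    def __init__(self, checkpoint=None):
        # Resume from a checkpoint if one exists, else start from offset 0.
        checkpoint = checkpoint or {"offset": 0, "counts": {}}
        self.offset = checkpoint["offset"]
        self.counts = dict(checkpoint["counts"])

    def run(self, partition, max_messages=None):
        # Process messages strictly in the order they sit in the partition.
        end = len(partition)
        if max_messages is not None:
            end = min(end, self.offset + max_messages)
        for key in partition[self.offset:end]:
            self.counts[key] = self.counts.get(key, 0) + 1
        self.offset = end

    def checkpoint(self):
        # Snapshot of position and state that another machine could load.
        return {"offset": self.offset, "counts": dict(self.counts)}


partition = ["shop-1", "shop-2", "shop-1", "shop-3", "shop-1"]

task = ToyCountTask()
task.run(partition, max_messages=3)  # processes the first three messages
saved = task.checkpoint()
del task                             # simulate the machine failing

replacement = ToyCountTask(saved)    # a new machine takes over the task
replacement.run(partition)           # continues with the remaining messages
print(replacement.counts)
```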
Big Data Mining
Apache Mahout
Provides open-source implementations of distributed and scalable machine learning algorithms, focused primarily on these areas:
● Collaborative Filtering
● Classification
● Clustering
● Dimension Reduction
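As a rough idea of what collaborative filtering does, here is a single-machine Python sketch of item-to-item similarity. It does not use Mahout and is not distributed; the function names and the interaction data are hypothetical, and cosine similarity over binary interactions is just one common choice.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def similar_items(baskets):
    """Item-to-item cosine similarity over binary user/item interactions."""
    item_users = defaultdict(set)
    for user, items in baskets.items():
        for item in items:
            item_users[item].add(user)

    scores = defaultdict(dict)
    for a, b in combinations(sorted(item_users), 2):
        common = len(item_users[a] & item_users[b])
        if common:
            sim = common / sqrt(len(item_users[a]) * len(item_users[b]))
            scores[a][b] = scores[b][a] = sim
    return scores


def recommend(user, baskets, scores):
    # Score unseen items by their similarity to items the user already has.
    seen = baskets[user]
    totals = defaultdict(float)
    for item in seen:
        for other, sim in scores.get(item, {}).items():
            if other not in seen:
                totals[other] += sim
    return sorted(totals, key=lambda i: (-totals[i], i))


# Hypothetical interaction data: user -> set of product ids.
baskets = {
    "u1": {"phone", "case"},
    "u2": {"phone", "case", "charger"},
    "u3": {"phone", "charger"},
    "u4": {"laptop"},
}
scores = similar_items(baskets)
print(recommend("u1", baskets, scores))
```

The same similarity table serves two purposes: looking up one item's neighbours gives similar-item suggestions, while summing over a user's history gives personalized ones.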
Big Data Applications
● Shop Dashboard
● Similar Product Recommendation
● Personalized Product Recommendation
● CPC Ads Display
Overview
~ 10,000 active shops
~ 40 Million pageviews/month
~ 8,000 Add to Cart/day
~ 1,000 VIP shops
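To relate these figures to velocity, a back-of-the-envelope conversion to average rates helps. The sketch below uses the approximate numbers above and assumes a 30-day month; real traffic is bursty, so peaks will be well above these averages.

```python
PAGEVIEWS_PER_MONTH = 40_000_000   # from the overview figures (approximate)
ADD_TO_CART_PER_DAY = 8_000        # from the overview figures (approximate)
DAYS_PER_MONTH = 30                # assumption: a 30-day month
SECONDS_PER_DAY = 24 * 60 * 60

pageviews_per_day = PAGEVIEWS_PER_MONTH / DAYS_PER_MONTH
pageviews_per_second = pageviews_per_day / SECONDS_PER_DAY
carts_per_minute = ADD_TO_CART_PER_DAY / (24 * 60)

print(f"{pageviews_per_second:.1f} pageviews/second on average")
print(f"{carts_per_minute:.1f} add-to-cart events/minute on average")
```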
References
1. https://github.com/facebookarchive/scribe/wiki/Scribe-Overview
2. http://hadoop.apache.org/
3. http://nosql-database.org/
4. http://samza.apache.org/