introduction to big data technologies & applications

Big Data Technologies & Applications

Nguyen D. Cao

December 28, 2015

Agenda

● Big Data Myths

● Big Data Technologies

● Big Data Applications @123Mua

Big Data Myths

● People talk about Big Data all the time: the 3Vs
  ○ Volume
  ○ Variety
  ○ Velocity

● Business Value in Data
  ○ Customer Insights
  ○ Product Insights

Big Data Myths

VOLUME

● Data is BIG

● Hard-drive storage capacity has increased massively compared to access speed
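As a rough illustration of why the gap between capacity and access speed matters, the sketch below estimates how long one sequential pass over a full drive takes. The capacity and throughput figures are illustrative assumptions, not measurements from this talk.

```python
def full_scan_hours(capacity_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to read a whole drive once at a sustained sequential rate."""
    seconds = capacity_gb * 1000 / throughput_mb_s  # 1 GB taken as 1000 MB
    return seconds / 3600


# Hypothetical drives (assumed numbers, for illustration only).
small_old_drive = full_scan_hours(capacity_gb=1, throughput_mb_s=5)
large_new_drive = full_scan_hours(capacity_gb=4000, throughput_mb_s=150)

print(f"1 GB at 5 MB/s:   {small_old_drive:.2f} h")
print(f"4 TB at 150 MB/s: {large_new_drive:.2f} h")
```

In this example capacity grows by a factor of 4000 while throughput grows by a factor of 30, so scanning a single drive takes far longer. Spreading data over many drives that are read in parallel is the usual response.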

Big Data Myths

VARIETY

● Different kinds of data
  ○ Structured
  ○ Semi-structured
  ○ Unstructured

● Structured

● Semi-structured
  ○ Self-described information (JSON, XML, logs)

● Unstructured
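The three kinds of data can be contrasted with one small example each. The records below are made up for illustration; only the Python standard library is used.

```python
import csv
import io
import json
import re

# Structured: fixed columns known in advance (a hypothetical orders table).
structured = io.StringIO("order_id,shop_id,amount\n1001,42,19.90\n")
row = next(csv.DictReader(structured))

# Semi-structured: each record carries its own field names (self-described).
semi_structured = '{"event": "add_to_cart", "user": {"id": 7}, "tags": ["sale"]}'
event = json.loads(semi_structured)

# Unstructured: free text with no schema; structure has to be extracted.
unstructured = "Great phone, arrived in 2 days, but the box was damaged."
numbers = re.findall(r"\d+", unstructured)

print(row["amount"], event["user"]["id"], numbers)
```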

Big Data Myths

VELOCITY

● Characteristics
  ○ How fast is data available for processing?
  ○ How fast is the processing?

● Data accumulates at very high rates
  ○ Click streams
  ○ Supermarket transactions
  ○ Social media interactions

Big Data Technologies

● Technologies
  ○ Collecting
  ○ Storage
  ○ Computation
  ○ Stream Processing
  ○ Data Mining

Big Data Collecting

Scribe

● Scribe is a server for aggregating log data that's streamed in real time from clients.

● It was designed and developed by Facebook.

● It is no longer actively developed.

Big Data Collecting

Apache Kafka

● Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system: producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
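The commit-log idea can be sketched without a running cluster. The toy class below is not the Kafka client API; `ToyLog`, `append`, and `read` are hypothetical names, and replication is left out. It only mirrors the concepts on the slide: producers append messages to partitions, each message gets an offset, and consumers read from an offset they keep track of.

```python
import zlib
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a partitioned, append-only log (no replication)."""

    def __init__(self, num_partitions=2):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition id -> list of messages

    def append(self, key, value):
        """Producer side: route by key, append, return (partition, offset)."""
        partition = zlib.crc32(key.encode()) % self.num_partitions
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Consumer side: return every message at or after `offset`."""
        return self.partitions[partition][offset:]


log = ToyLog()
p, first = log.append("user-7", "view:item-1")
_, second = log.append("user-7", "cart:item-1")  # same key, same partition

print(log.read(p, first))   # both messages, in the order they were written
print(log.read(p, second))  # a consumer resuming from a later offset
```

Because the log keeps messages after they are read, several consumers can read the same partition independently, each from its own offset.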

Apache Kafka (II)

Big Data Collecting

Big Data Storage

Hadoop Distributed File System (HDFS)

● The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
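HDFS stores a file as fixed-size blocks and keeps several replicas of each block on different machines. The sketch below imitates that idea only; the round-robin placement is a simplification (real HDFS placement is rack-aware), and the function and node names are hypothetical. The defaults of 128 MB blocks and 3 replicas match common HDFS defaults.

```python
import math


def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes, round-robin (a simplification)."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machine names
layout = place_blocks(file_size_mb=300, nodes=nodes)
for block, replicas in layout.items():
    print(block, replicas)
```

Since every block lives on several machines, losing one commodity node does not lose data.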

Big Data Storage

NoSQL Datastores

● NoSQL: next-generation databases that mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable.

● Types:
  ○ Key-Value Store
  ○ Document Store
  ○ Column Store
  ○ Graph Database
  ○ Content Delivery Network
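To make the first four datastore types concrete, the sketch below expresses one hypothetical product in each data model using plain Python structures. It shows shapes of data only, not any particular database's API.

```python
# One hypothetical product, expressed in four NoSQL data models.

# Key-value store: an opaque value looked up by a single key.
key_value = {"product:1": '{"name": "Phone", "price": 199}'}

# Document store: the value is a nested document the database can query into.
document = {"_id": 1, "name": "Phone", "price": 199, "tags": ["mobile", "sale"]}

# Column store: row key -> column family -> column -> value.
column = {"product:1": {"info": {"name": "Phone"}, "stats": {"views": 120}}}

# Graph database: nodes plus typed edges between them.
nodes = {1: "Phone", 2: "Case"}
edges = [(1, "BOUGHT_WITH", 2)]

# Following an edge answers "what is bought together with product 1?".
related = [nodes[dst] for src, rel, dst in edges
           if src == 1 and rel == "BOUGHT_WITH"]
print(related)
```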


Big Data Storage

HBase

● A distributed, scalable, versioned, non-relational datastore built on top of HDFS and modeled after Google's Bigtable.
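The "versioned" part of that description can be pictured as a map from row key to column to a list of timestamped values. The class below is a toy model under that assumption, not the HBase client API; `ToyTable`, `put`, and `get` are hypothetical names, and a counter stands in for cell timestamps.

```python
import itertools
from collections import defaultdict


class ToyTable:
    """Row key -> "family:qualifier" -> list of (version, value), newest first."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))
        self._clock = itertools.count(1)  # stand-in for cell timestamps

    def put(self, row_key, column, value):
        """Writing never overwrites: it adds a newer version of the cell."""
        self.rows[row_key][column].insert(0, (next(self._clock), value))

    def get(self, row_key, column, versions=1):
        """Return up to `versions` most recent values of one cell."""
        return [value for _, value in self.rows[row_key][column][:versions]]


table = ToyTable()
table.put("shop#42", "info:name", "Old Shop Name")
table.put("shop#42", "info:name", "New Shop Name")

print(table.get("shop#42", "info:name"))              # latest version only
print(table.get("shop#42", "info:name", versions=2))  # latest two versions
```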

Big Data Computation

Hadoop MapReduce

● Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

● Hadoop = HDFS + MapReduce
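The programming model behind MapReduce can be imitated in a single process with the classic word-count example. This is a sketch of the map, shuffle, and reduce steps only, not the Hadoop API; in Hadoop the map and reduce calls run in parallel on different nodes and failed tasks are retried. The function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```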


Apache Spark

● Fast and general engine for large-scale data processing.

● Suitable for iterative algorithms


Big Data Stream Processing

Apache Samza

● Apache Samza is a distributed stream processing framework.

● It uses Kafka to guarantee that messages are processed in the order they were written to a partition.

● Whenever a machine in the cluster fails, Samza works with Hadoop YARN to transparently migrate your tasks to another machine.
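The combination of ordered partitions and task migration can be sketched as a task that processes messages in offset order and resumes from the last recorded offset. This is not the Samza API; `run_task` and its parameters are hypothetical, and the sketch assumes that the task state and the checkpointed offset survive the failure.

```python
def run_task(partition, state, checkpoint, fail_after=None):
    """Process messages in offset order, starting at the checkpointed offset.
    Returns the new checkpoint. `fail_after` simulates a machine failure."""
    for offset in range(checkpoint, len(partition)):
        if fail_after is not None and offset == fail_after:
            return offset  # crash: this message was not processed yet
        key = partition[offset]
        state[key] = state.get(key, 0) + 1  # stateful per-key count
    return len(partition)


partition = ["item-1", "item-2", "item-1", "item-3", "item-1"]
state = {}

# The first worker stops before processing offset 3.
checkpoint = run_task(partition, state, checkpoint=0, fail_after=3)
# A replacement worker resumes from the checkpoint with the saved state.
checkpoint = run_task(partition, state, checkpoint=checkpoint)

print(checkpoint, state)
```

Each message is counted exactly once here because the checkpoint and the state are updated together; a real deployment has to handle the case where they are not.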

Big Data Mining

Apache Mahout

Provides open-source implementations of distributed, scalable machine learning algorithms, focused primarily on these areas:

● Collaborative Filtering

● Classification

● Clustering

● Dimensionality Reduction
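Collaborative filtering, the first area listed, can be illustrated with a single-machine, item-based sketch: two products are similar when largely the same users interacted with both. This is not Mahout's API, which runs such computations in a distributed fashion; the interaction data and the `similar_items` function are made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical interaction data: user -> set of products they added to cart.
carts = {
    "u1": {"phone", "case", "charger"},
    "u2": {"phone", "case"},
    "u3": {"laptop", "mouse"},
    "u4": {"phone", "charger"},
}


def similar_items(target, carts, top_n=2):
    """Rank other items by cosine similarity of their user sets to `target`."""
    users_of = defaultdict(set)
    for user, items in carts.items():
        for item in items:
            users_of[item].add(user)
    target_users = users_of.get(target, set())
    scores = {}
    for item, users in users_of.items():
        if item == target:
            continue
        overlap = len(users & target_users)
        if overlap:
            scores[item] = overlap / math.sqrt(len(users) * len(target_users))
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


print(similar_items("phone", carts))
```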

Big Data Applications @123Mua

● Shop Dashboard

● Similar Product Recommendation

● Personalized Product Recommendation

● CPC Ads Display

Overview

~ 10,000 active shops

~ 40 Million pageviews/month

~ 8,000 add-to-cart actions/day

~ 1,000 VIP shops

Product Performance
