
Page 1: Big data analytics with hadoop volume 2

BIG DATA ANALYTICS WITH HADOOP

BY:

SWAMIL SINGH

VIPLAV MANDAL

GUIDED BY: DR. S. SRIVATAVA

Page 2: Big data analytics with hadoop volume 2

AGENDA

• Design of website clickstream data, with an example

• How to load data into the sandbox

• Loading data using Flume, and the process

• About Flume

• Flume's working process

• The process to refine data

• MapReduce

• HCatalog and how HCatalog works

• Hive

• How Hive works and its process

• Queries

Page 3: Big data analytics with hadoop volume 2

DESIGN OF WEBSITE CLICKSTREAM DATA

• Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.

• These website log files contain data elements such as a date and time stamp, the visitor's IP address, the destination URLs of the pages visited, and a user ID that uniquely identifies the website visitor (a parsing sketch follows this list).

• One of the original uses of Hadoop at Yahoo was to store and process its massive volume of clickstream data.
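As a minimal illustration of these data elements, the sketch below parses one log record in Java; the tab-delimited field order (timestamp, IP address, URL, SWID) is an assumption made for this example, since real log layouts vary.

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Minimal sketch: parse one clickstream log record into its data elements.
// The tab-delimited layout (timestamp, IP, URL, SWID) is an assumed example,
// not a fixed standard; real website log formats differ.
public class ClickstreamRecord {
    final LocalDateTime timestamp;
    final String ipAddress;
    final String url;
    final String swid;   // user ID that uniquely identifies the visitor

    ClickstreamRecord(LocalDateTime timestamp, String ipAddress, String url, String swid) {
        this.timestamp = timestamp;
        this.ipAddress = ipAddress;
        this.url = url;
        this.swid = swid;
    }

    static ClickstreamRecord parse(String line) {
        String[] fields = line.split("\t");
        return new ClickstreamRecord(
            LocalDateTime.parse(fields[0], DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")),
            fields[1], fields[2], fields[3]);
    }

    public static void main(String[] args) {
        ClickstreamRecord r = ClickstreamRecord.parse(
            "2017-04-14 09:30:00\t203.0.113.7\thttp://example.com/products\t{SWID-1234}");
        System.out.println(r.url + " visited by " + r.swid);
    }
}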

Page 4: Big data analytics with hadoop volume 2

EXAMPLE OF WEBSITE CLICKSTREAM DATA

Page 5: Big data analytics with hadoop volume 2

HOW TO LOAD DATA INTO SANDBOX

• The sandbox is a fully contained Data Platform environment.

• The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing (a loading sketch follows this list).

• You can access and analyze sandbox data with many Business Intelligence (BI) applications. 

• By combining web logs with more traditional customer data, we can better understand our customers and learn how to optimize future promotions and advertising.
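One way to land such a log file in the sandbox's HDFS is the Hadoop FileSystem API; the sketch below assumes a configured core-site.xml is on the classpath, and both paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local web-log file into HDFS on the sandbox.
// The local and HDFS paths below are hypothetical placeholders.
public class LoadIntoSandbox {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml on the classpath
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/omniture_weblog.tsv");
        Path target = new Path("/user/sandbox/weblogs/omniture_weblog.tsv");

        fs.mkdirs(target.getParent());            // ensure the HDFS directory exists
        fs.copyFromLocalFile(local, target);      // the actual upload
        System.out.println("Loaded " + target + ", size=" + fs.getFileStatus(target).getLen());
    }
}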

Page 6: Big data analytics with hadoop volume 2

ABOUT FLUME

• Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend.

• The project is highly reliable and designed to avoid data loss. Flume also supports dynamic reconfiguration without requiring a restart, which reduces downtime for its agents.

• Flume components interact in the following way (a configuration sketch follows this list):

• A flow in Flume starts from the Client.

• The Client transmits the Event to a Source operating within the Agent.

• The Source receiving this Event then delivers it to one or more Channels.

• One or more Sinks operating within the same Agent drain these Channels.

• Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
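To make the Client-Source-Channel-Sink flow concrete, here is a minimal sketch using Flume's embedded-agent Java API; the agent name, collector host, port, and channel capacity are hypothetical, and an embedded agent is only one of several ways to run Flume.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

// Minimal sketch of the flow described above: a client puts Events into an
// Agent, whose Channel buffers them until a Sink drains them.
// Host, port, and capacity values are hypothetical placeholders.
public class FlumeFlowSketch {
    public static void main(String[] args) throws Exception {
        Map<String, String> props = new HashMap<>();
        props.put("channel.type", "memory");        // Channel buffers events in memory
        props.put("channel.capacity", "10000");     // decouples ingestion rate from drain rate
        props.put("sinks", "sink1");
        props.put("sink1.type", "avro");            // Sink drains the channel to a remote collector
        props.put("sink1.hostname", "collector.example.com");
        props.put("sink1.port", "4141");
        props.put("processor.type", "default");     // single-sink processor

        EmbeddedAgent agent = new EmbeddedAgent("sketch-agent");
        agent.configure(props);
        agent.start();

        // The "Client" side of the flow: hand an Event to the agent's Source.
        agent.put(EventBuilder.withBody("one clickstream record", StandardCharsets.UTF_8));

        agent.stop();
    }
}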

Page 7: Big data analytics with hadoop volume 2

LOAD DATA USING FLUME: THE PROCESS

Page 8: Big data analytics with hadoop volume 2

SEQUENCE DIAGRAM OF FLUME

Page 9: Big data analytics with hadoop volume 2

• Enterprises use Flume's powerful streaming capabilities to land data from high-throughput streams in HDFS. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive.

• In one specific example, Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run.

• This high-volume log file data can stream through Flume into a tool for same-day analysis with Apache Storm, or months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.

Page 10: Big data analytics with hadoop volume 2

THE PROCESS TO REFINE DATA

• Omniture logs – website log files containing information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID).

• Users – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.

• Products – CMS data that maps product categories to website URLs (a sketch that registers and joins these datasets follows this list).
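As one way to refine these datasets, the sketch below registers the users data as a Hive table and joins it to the web logs on SWID over JDBC (Hive and HCatalog are introduced on later slides); the connection URL, table names, and column names are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: register the users dataset as a Hive table and build a refined
// view joining web logs to CRM data on SWID. The URL, credentials, table
// names, and column names are assumed placeholders.
public class RefineDataSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://sandbox.example.com:10000/default", "hive", "");
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS users (swid STRING, birth_dt STRING, gender STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            stmt.execute("CREATE VIEW IF NOT EXISTS webloganalytics AS "
                    + "SELECT o.ts, o.ip, o.url, u.birth_dt, u.gender "
                    + "FROM omniturelogs o JOIN users u ON o.swid = u.swid");
        }
        conn.close();
    }
}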

Page 11: Big data analytics with hadoop volume 2

MAPREDUCE

Page 12: Big data analytics with hadoop volume 2

ABOUT MAPREDUCE

• A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing.

• The Map function divides the input into ranges, as determined by the InputFormat, and creates a map task for each range in the input.

• The output of each map task is partitioned into groups of key-value pairs, one group for each reducer.

• The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve (a word-count sketch follows this list).
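The canonical illustration of this split, map, shuffle, and reduce pattern is word count; below is a minimal sketch using the standard Hadoop MapReduce API, with input and output paths taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count sketch: the mapper emits (word, 1) pairs, the framework
// groups pairs by key, and the reducer sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // one key-value pair per word occurrence
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // combine partial results
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}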

Page 13: Big data analytics with hadoop volume 2

HCATALOG

• Apache HCatalog is a table management layer that exposes Hive metadata to other Hadoop applications.

• HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.

• HCatalog displays data from RCFile format, text files, or sequence files in a tabular view.

Page 14: Big data analytics with hadoop volume 2

HOW HCATALOG WORKS

• HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written.

• By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

• HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL.

• HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command line interface for issuing data definition and metadata exploration commands (a MapReduce read sketch follows this list).
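As a sketch of the MapReduce read interface, the map-only job below reads a Hive-managed table through HCatInputFormat without caring how the underlying files are stored; the database name, table name, column position, and output path are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

// Sketch: read a Hive-managed table from MapReduce via HCatalog, without
// knowing where or in what format the data is stored. The database name,
// table name, column position, and output path are hypothetical.
public class HCatReadSketch {

    public static class UrlMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord record, Context context)
                throws IOException, InterruptedException {
            // Column 2 is assumed to hold the visited URL in this sketch.
            context.write(new Text(record.get(2).toString()), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcatalog read sketch");
        job.setJarByClass(HCatReadSketch.class);
        HCatInputFormat.setInput(job, "default", "omniturelogs"); // db, table
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(UrlMapper.class);
        job.setNumReduceTasks(0);                 // map-only: just project one column
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hcat-sketch-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}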

Page 15: Big data analytics with hadoop volume 2

HIVE

• Hive is a component of the Data Platform (DP) and provides a SQL-like interface to data stored in it.

• Hive provides a database query interface to Apache Hadoop.

• Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop-based data warehouse.

• Pig fits in through its data flow strengths, taking on the tasks of bringing data into Apache Hadoop and working with it to get it into shape for querying (a JDBC query sketch follows this list).
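Applications commonly reach this SQL-like interface through the HiveServer2 JDBC driver; the sketch below runs a single aggregate query, with the host, credentials, and table name as assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: query Hive over JDBC. The connection URL, credentials,
// and the webloganalytics table are hypothetical placeholders.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://sandbox.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS hits FROM webloganalytics "
                     + "GROUP BY url ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}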

Page 16: Big data analytics with hadoop volume 2

HOW HIVE WORKS

• The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units.

• Databases comprise tables, which are made up of partitions.

• Data can be accessed via a simple query language, and Hive supports overwriting or appending data.

• Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT.

• In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays (a table-definition sketch follows this list).
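The sketch below ties these pieces together by creating a partitioned table that mixes primitive and complex types, reusing the JDBC connection pattern from the earlier sketch; every name in it is a hypothetical placeholder.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: a partitioned Hive table mixing primitive types (STRING, TIMESTAMP,
// BIGINT) with complex types (ARRAY, MAP, STRUCT). All names are hypothetical.
public class HiveDdlSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://sandbox.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS visits ("
                + "  swid STRING,"
                + "  visit_ts TIMESTAMP,"
                + "  page_views BIGINT,"
                + "  urls ARRAY<STRING>,"                       // complex: list of pages
                + "  attrs MAP<STRING, STRING>,"                // complex: free-form attributes
                + "  geo STRUCT<city:STRING, country:STRING>"   // complex: nested record
                + ") PARTITIONED BY (visit_date STRING)");      // partitions subdivide the table
        }
    }
}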

Page 17: Big data analytics with hadoop volume 2

WORKING PROCESS OF HIVE

Page 18: Big data analytics with hadoop volume 2

• Any queries?

• THANK YOU