DEEP DIVE ON BIG DATA & HADOOP
Ahsan Kabir, Data Platform MVP
techforum user group meetup
PASS Local Chapter
Data is a set of values of qualitative or quantitative variables: any facts, numbers, or text
that can be processed by a computer. Data are generated by humans and machines.
Operational or transactional data
Non-operational data
Metadata
Information
The patterns, associations, or relationships among all this data can provide information.
Knowledge
Information can be converted into knowledge about historical patterns and future trends.
DATA
Performance management
Identify trends
Monitoring
Decision Making
Cash flow trend
Fine-tune operations
Sales pipeline analysis
Future projections
Business forecasting
Convert data into information
DATA AND US
Structured Data
Data stored, accessed, and processed in a fixed format is termed 'structured' data.
This includes data contained in relational databases and spreadsheets.
Semi-structured Data
Data that does not conform to the formal structure of data models associated with
relational databases or other forms of data tables, but still contains tags or markers
to separate elements: XML, JSON.
Unstructured Data
Unstructured data is data that does not follow a specified format or pre-defined data
model. In an organization, 70%–80% of all data is unstructured, and it is growing 10–50x
faster than structured data.
DATA OF DIFFERENT COLOR
WHAT IS THE SOURCE OF 80% UNSTRUCTURED DATA?
Emails
Social Network Data
Images
Videos, audio
Text Files
Social media data
Mobile data
Website content
Radar or sonar data
Satellite images
Scientific data
An RDBMS can store only fixed data volumes.
An RDBMS needs more CPUs or memory to scale up vertically.
Relational databases simply cannot categorize a variety of data.
An RDBMS lacks high velocity because it is designed for steady data retention rather
than rapid growth.
Variability refers to data sourced from varied places, which requires testing of its
quality.
Value refers to putting all of the sourced data to productive, value-oriented use.
HOW TO MANAGE 80% UNSTRUCTURED DATA
WHAT IS BIG DATA
Big data is a huge amount of data that cannot be stored and processed with a traditional
approach within a limited time frame.
Big data is a collection of data sets so large, complex, and rapidly growing that it becomes
difficult to search, process, share, store, transfer, visualize, query, and update them, or to
preserve information privacy, using on-hand database management tools or traditional data
processing applications.
The 5 Vs of Big Data
A COMPANY’S DATA NEEDS EXCEED ITS INFRASTRUCTURE
BUT WE WANT
Goals / Requirements:
Advanced analytics
Abstract and facilitate the storage and processing of large data sets
Manage rapidly growing data sets
Structured and unstructured data
Simple programming models
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault-tolerance
Move computation rather than data
ANALYTICS
“BIG DATA” VENDORS
Apache Hadoop
Cloudera
Hortonworks (Microsoft uses the Hortonworks Hadoop distribution)
Amazon’s Elastic MapReduce
EMC’s Greenplum
IBM’s InfoSphere
Oracle Big Data
Google BigQuery
SOLUTION THROUGH HADOOP
Hadoop can be used to collect, tidy up, and store boatloads of structured and unstructured data.
Hadoop is an open-source, highly scalable compute and storage platform.
Hadoop works to interpret or parse the results of big data searches through specific algorithms
and methods.
Hadoop is an open-source software framework, under the Apache license, that is maintained by a
global community of users.
It can be combined with a data warehouse and then linked to analytics.
CORE COMPONENTS
Store Data
Hadoop is a clustered system that can store structured and unstructured
data.
HDFS (Hadoop Distributed File System): large files, and lots of files, can be
stored across multiple PCs using multiple nodes.
Process Data
The capacity to process large volumes of data comes from a framework named
MapReduce. It processes data without moving it back and forth, which makes
processing faster: work is distributed to the different sites (PCs or servers)
where the data lives, and only the answers are sent back.
MapReduce
HDFS (Hadoop Distributed File System)
HADOOP CORE COMPONENTS
HDFS
The NameNode stores an index of which data is stored on which node.
It tells the application where the data is stored so that the application can fetch the data directly.
Data is split into equal small parts and distributed to the slave PCs.
FAULT TOLERANCE MECHANISM
The enterprise version of Hadoop has a backup master alongside the main master.
Indexes are backed up and copied to different computers.
Three copies of each file are kept on different nodes. The JobTracker detects failures and finds
another TaskTracker to assign the work to.
MAP REDUCE
The JobTracker splits work up into smaller tasks (“Map”) and sends them to the TaskTracker
process on each node.
The TaskTracker reports back to the JobTracker node on job progress, sends data
(“Reduce”), or requests new jobs.
DRILL DOWN HADOOP FRAMEWORK TOOLS
HDFS - distributed file system
MapReduce – a distributed framework for executing work in parallel
YARN is a centralized platform for resource management that assigns CPU and memory
to applications running on the Hadoop cluster. It also enables application frameworks
other than MapReduce to run on Hadoop.
HADOOP ECOSYSTEM
HIVE – Hive allows developers to explore and analyze data using Hive Query Language (HQL),
which has SQL-like commands. It is mostly used for ad hoc querying of the data stored in a
Hadoop cluster.
HBASE – HBase is a column-oriented database management system that runs on top of HDFS and
supports structured data storage. It is scalable and can run into billions of rows.
PIG – a high-level platform for creating MapReduce programs using a language called Pig Latin;
a scripting language for manipulating data, similar in role to SQL for an RDBMS.
SQOOP - enables data exchange between relational databases and Hadoop clusters.
Mahout – scalable machine learning algorithms and libraries on the Hadoop platform.
Oozie – a workflow scheduler to manage Hadoop jobs.
Spark – one of the latest tools in the Hadoop ecosystem, an in-memory engine for
processing Hadoop data. It promises up to 100x faster processing than comparable
technologies on the market today.
HADOOP ECOSYSTEM
The Hadoop software library is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and storage.
HDFS is a distributed file system designed to run and deploy on commodity, low-cost hardware.
NameNode (master): manages the cluster metadata.
DataNodes (workers): store the actual data.
Scalable, distributed storage system
Highly fault-tolerant
High-throughput access
MapReduce moves compute processes to the data on HDFS: processing tasks can occur on the
physical node where the data resides. This significantly reduces network traffic, improving
overall latency and performance.
Utilities diagnose the health of the file system and can rebalance the data across nodes.
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
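To make the NameNode/DataNode description concrete, here is a minimal Java sketch against the HDFS FileSystem API; the NameNode address and the file paths are assumptions for illustration, not part of the original slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; the address here is an assumption
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster: the NameNode records the metadata
        // (block locations), while the DataNodes store the replicated blocks
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // List the directory and show each file's replication factor (three copies by default)
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + " replication=" + status.getReplication());
        }
        fs.close();
    }
}
```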
The core divide-and-conquer distributed processing engine. Earlier versions carried
resource management, but that has now moved to YARN.
It includes JobTracker (master) and TaskTracker (worker) components to run batch
jobs.
1. The MapReduce library in the user program splits the input files into 64–128 MB pieces.
2. Many copies of the program start up on a cluster of machines. One of the copies is the master.
3. The master assigns work: it picks idle workers and assigns each one a map task or a reduce task.
4. Say there are M map tasks and R reduce tasks to assign.
5. A map worker (MW) assigned a map task reads the contents of its input split and buffers
key/value pairs in memory; the buffered pairs are periodically written to its local disk, and
their locations are passed back to the master.
6. After getting this notification, the master tells a reduce worker (RW) about the locations.
The reduce worker uses remote procedure calls (RPC) to read the data from the local disks of
the map workers (MW).
MAP REDUCE
7. When a reduce worker (RW) has read the data, it sorts it by the intermediate keys so that all
occurrences of the same key are grouped together.
8. The output of the Reduce function is appended to a final output file.
9. When all map tasks and reduce tasks have been completed, the master wakes up the user
program: the MapReduce call in the user program returns back to the user code.
10. The output of the MapReduce execution is available in the output files.
MAP REDUCE
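To tie the ten steps above together, here is the classic word-count job written against Hadoop's Java MapReduce API: a minimal sketch that takes its input and output HDFS paths from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in this worker's input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all values for one intermediate key arrive grouped together; sum them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this small example hints at the long Java development cycle criticized below: the class must be compiled, packaged into a jar, and submitted to the cluster before any results appear.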
A criticism of MapReduce is that the default development cycle in Java is very long. Writing the
mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the
results is time consuming.
Pig provides a higher-level abstraction over MapReduce. Pig supports Pig Latin constructs, which
are converted into a Java MapReduce program and then submitted to the Hadoop cluster.
Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop
cluster.
While HiveQL is a declarative language like SQL, Pig Latin is a data flow language. The output of
one Pig Latin construct can be sent as input to another Pig Latin construct, and so on.
PIG AND HIVE
Pig's scripting language, Pig Latin, can process petabytes of data with just a few statements. Pig
originated at Yahoo. Recommended use cases are ETL data pipelines, research on raw data, and
iterative processing.
Key features
Available functions include LOAD, FILTER, GROUP BY, FOREACH, MAX, ORDER, LIMIT, UNION,
CROSS, SPLIT, CUBE, and ROLLUP.
Ability to create user-defined functions (UDFs) in Java and other languages that can be called
within Pig scripts; a sketch of one follows below.
PIG
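The UDF sketch referenced above: a minimal Pig EvalFunc in Java that upper-cases a string field. The class and field names are hypothetical.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A hypothetical UDF that upper-cases the first field of each tuple.
// After packaging into a jar, it could be called from Pig Latin as, e.g.:
//   REGISTER myudfs.jar;
//   B = FOREACH A GENERATE Upper(name);
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // pass nulls through rather than failing the job
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```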
Hive is a framework that sits on top of Hadoop for doing ad-hoc queries on data in Hadoop.
Hive provides a SQL-like abstraction called HQL for data analysts with a SQL background. Hive
originated at Facebook.
Hive provides a SQL-like schema over HDFS data by using a Metastore.
The Metastore, which presents HDFS data as SQL tables, is commonly backed by MySQL; MySQL does
not store the data itself, just the structure of the data stored in HDFS.
Provides the majority of ANSI SQL-like statements for processing, including EXPLAIN and ANALYZE.
Provides partitioning, indexing, and bucketing features to manage performance.
HIVE/ HQL
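HQL can also be submitted from Java through the HiveServer2 JDBC driver. A minimal sketch follows; the host, port, database, and the web_logs table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and database are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Behind the scenes, Hive compiles this HQL into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```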
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop
and structured data stores such as relational databases.
Mostly batch driven; a simple use case is an organization that runs a nightly Sqoop import to
load the day's data from a production DB into HDFS for Hive/HQL analysis.
Provides import and export commands and runs the process in parallel, with data moving in and
out of HDFS.
Parallelizes data transfer for fast performance and optimal system utilization
Copies data quickly from external systems to Hadoop
Makes data analysis more efficient
Mitigates excessive loads to external systems.
SQOOP
Oozie is a workflow scheduling engine specialized in running multi-stage jobs in the Hadoop
ecosystem. It has the ability to monitor and track jobs, recover from errors, and maintain
dependencies between jobs. Workflows are expressed as XML, and no development language is required.
Types of jobs:
Oozie Workflow jobs – jobs that run on demand.
Oozie Coordinator jobs – workflow jobs that run periodically on a regular basis,
triggered by time and data availability.
Oozie Bundles – provide a way to package multiple coordinator and workflow jobs.
OOZIE
Storm provides the ability to process large amounts of real-time data, such as Twitter feeds or
online web log errors that need immediate action. Storm has two kinds of nodes: a master (which
runs the Nimbus daemon) and worker nodes (which run a Supervisor daemon).
Clusters are coordinated by ZooKeeper, and failed nodes can restart. Fast computation and data
transfer are handled by ZeroMQ, a lightweight, high-speed messaging library.
STORM
An in-memory compute engine for machine learning and data science projects. Typical MapReduce
processing involves transferring data between hard disks, but Spark does all data processing and
saving in memory (to a great extent), thus providing roughly a tenfold performance improvement
over disk-based MapReduce.
Spark provides the RDD (Resilient Distributed Dataset) mechanism for in-memory data storage,
with processing primitives that get applied to the whole data set.
SPARK
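A minimal word count in Spark's Java RDD API, illustrating the in-memory RDD mechanism described above; the HDFS paths are assumptions for illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD: a resilient distributed dataset partitioned across the cluster
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        lines.cache(); // keep in memory so later actions avoid re-reading from disk

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///data/word-counts");
        sc.stop();
    }
}
```

Compared with the earlier MapReduce WordCount, the same logic fits in a few chained transformations, and intermediate results stay in memory rather than being written to local disks between stages.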
In addition to the long development cycle noted earlier, those who are not Java programmers
cannot really make good use of Hadoop/HDFS on their own.
PIG AND HIVE
HDFS is a general purpose file system and does not provide fast individual record lookups in files.
HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates)
for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your
data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
HBASE
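A minimal sketch of a write and a fast single-record lookup through the HBase Java client API; the users table, row key, and info column family are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(config);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Fast lookup by row key, the access pattern plain HDFS cannot serve
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```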
ZooKeeper is a centralized service for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. All of these kinds of services are used in
some form or another by distributed applications.
ZOOKEEPER
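A minimal sketch of the ZooKeeper Java client publishing and reading one piece of shared configuration; the connection string and the /app-config znode path are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble and wait until the session is established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a piece of shared configuration as a persistent znode
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Every client in the distributed application reads the same value
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```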
A machine learning and data mining library, designed to be scalable and robust.
Provides a Java interface.
A simple and extensible programming environment and framework for building scalable
algorithms
A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink
Samsara, a vector math experimentation environment with R-like syntax which works at scale
MAHOUT
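As one example of the Java interface mentioned above, here is a minimal sketch using Mahout's classic single-machine "Taste" recommender API (newer Mahout work centers on Samsara, Spark, H2O, and Flink); the ratings.csv file of userID,itemID,preference rows is an assumption for illustration.

```java
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): one userID,itemID,preference triple per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommended items for user 1, scored by estimated preference
        for (RecommendedItem item : recommender.recommend(1L, 3)) {
            System.out.println(item.getItemID() + " score=" + item.getValue());
        }
    }
}
```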
HDInsight is Microsoft's Apache Hadoop implementation, offered as a managed service on Azure
(based on the Hortonworks distribution, as noted above).
HDINSIGHT
THANKS