DEEP DIVE ON BIG DATA & HADOOP
Ahsan Kabir, Data Platform MVP
techforum user group meetup
PASS Local Chapter
Data is a set of values of qualitative or quantitative variables: any facts, numbers, or text
that can be processed by a computer. Data are generated by humans and machines.
Operational or transactional data
Non-operational data
Metadata
Information
The patterns, associations, or relationships among all this data can provide information.
Knowledge
Information can be converted into knowledge about historical patterns and future trends.
DATA
Performance management
Identify trends
Monitoring
Decision Making
Cash flow trend
Fine-tune operations
Sales pipeline analysis
Future projections
Business forecasting
Convert data into information
DATA AND US
Structured Data
Data stored, accessed, and processed in a fixed format is termed 'structured' data.
This includes data contained in relational databases and spreadsheets.
Semi-structured Data
Data that does not conform to the formal structure of data models associated with
relational databases or other forms of data tables, but still contains tags or markers
to separate elements: XML, JSON.
Unstructured Data
Unstructured data is data that does not follow a specified format or pre-defined data
model. In an organization, 70%–80% of all data is unstructured, and it is growing 10–50x
faster than structured data.
DATA OF DIFFERENT COLOR
WHAT IS THE SOURCE OF 80% UNSTRUCTURED DATA?
Emails
Social Network Data
Images
Videos, audio
Text Files
Social media data
Mobile data
Website content
Radar or sonar data
Satellite images
Scientific data
An RDBMS can store only fixed data volumes.
An RDBMS needs more CPUs or memory to scale up vertically.
Relational databases simply cannot categorize a variety of data.
An RDBMS lacks high velocity because it is designed for steady data retention rather
than rapid growth.
Variability refers to data sourced from varied places, which requires testing of its
quality.
Value refers to putting all of the sourced data to productive, value-oriented use.
HOW TO MANAGE 80% UNSTRUCTURED DATA
WHAT IS BIG DATA
Big data is a huge amount of data that cannot be stored and processed with a traditional
approach within a limited time frame.
Big data is a collection of data sets so large, complex, and rapidly growing that it becomes
difficult to search, process, share, store, transfer, visualize, query, and update them, or to
preserve information privacy, using on-hand database management tools or traditional data
processing applications.
The 5 Vs of Big Data
A COMPANY’S DATA NEEDS EXCEED ITS INFRASTRUCTURE
BUT WE WANT
Goals / Requirements:
Advanced analytics
Abstract and facilitate the storage and processing of large data sets
Manage rapidly growing data sets
Structured and unstructured data
Simple programming models
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault-tolerance
Move computation rather than data
ANALYTICS
“BIG DATA” VENDORS
Apache Hadoop
Cloudera
Hortonworks (Microsoft uses the Hortonworks Hadoop distribution)
Amazon’s Elastic MapReduce
EMC’s Greenplum
IBM’s InfoSphere
Oracle Big Data
Google BigQuery
SOLUTION THROUGH HADOOP
Hadoop can be used to collect, tidy up, and store boatloads of structured and unstructured data.
Hadoop is an open-source, highly scalable compute and storage platform.
Hadoop works to interpret or parse the results of big data searches through specific algorithms
and methods.
Hadoop is an open-source software framework, under the Apache license, that is maintained by a
global community of users.
It can be combined with a data warehouse and then linked to analytics.
CORE COMPONENTS
Store Data
Hadoop is a clustered system that can store structured and unstructured
data.
HDFS (Hadoop Distributed File System): large files, and lots of files, can be
stored across multiple PCs using multiple nodes.
Process Data
The capacity to process large volumes of data comes from a framework named
MapReduce. It processes data without moving it back and forth, which makes
processing faster: work is distributed to the different sites (PCs or servers)
where the data lives, and only the answers are sent back.
MapReduce
HDFS (Hadoop Distributed File System)
HADOOP CORE COMPONENTS
HDFS
The NameNode stores an index of which data is stored on which node.
It tells the application where the data is stored so that the application can fetch the data directly.
Data is split into equal small parts and distributed to the slave PCs.
FAULT TOLERANCE MECHANISM
The enterprise version of Hadoop has a backup master alongside the main master.
Indexes are backed up and copied to different computers.
Three copies of each file are kept on different nodes. The JobTracker detects failures and finds
another TaskTracker to assign the work to.
MAP REDUCE
The JobTracker splits work up into smaller tasks (“Map”) and sends them to the TaskTracker
process on each node.
The TaskTracker reports back to the JobTracker node on job progress, sends data
(“Reduce”), or requests new jobs.
DRILL DOWN HADOOP FRAMEWORK TOOLS
HDFS - distributed file system
MapReduce – a distributed framework for executing work in parallel
YARN is a centralized platform for resource management that assigns CPU and memory
to applications running on the Hadoop cluster. It also enables application frameworks
other than MapReduce to run on Hadoop.
HADOOP ECOSYSTEM
HIVE – Hive allows developers to explore and analyze data using Hive Query Language (HQL),
which has SQL-like commands. It is mostly used for ad hoc querying of the data stored in a
Hadoop cluster.
HBASE – HBase is a column-oriented database management system that runs on top of HDFS and
supports structured data storage. It is scalable and can run into billions of rows.
PIG – a high-level platform for creating MapReduce programs using a language called Pig Latin;
a scripting language for manipulating data, similar in role to SQL for an RDBMS.
SQOOP - enables data exchange between relational databases and Hadoop clusters.
Mahout – scalable machine learning algorithms and libraries on the Hadoop platform.
Oozie – a workflow scheduler to manage Hadoop jobs.
Spark – one of the latest tools in the Hadoop ecosystem, an in-memory engine for
processing Hadoop data. It promises up to 100x faster processing than comparable
technologies on the market today.
HADOOP ECOSYSTEM
The Hadoop software library is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and storage.
HDFS is a distributed file system designed to run and deploy on commodity, low-cost hardware.
NameNode (master): manages the cluster metadata.
DataNodes (workers): store the actual data.
Scalable, distributed storage system
Highly fault-tolerant
High-throughput access
MapReduce moves compute processes to the data on HDFS: processing tasks can occur on the
physical node where the data resides. This significantly reduces network traffic, improving
overall latency and performance.
Utilities diagnose the health of the file system and can rebalance the data across nodes.
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
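To make the NameNode/DataNode description concrete, here is a minimal Java sketch against the HDFS FileSystem API; the NameNode address and the file paths are assumptions for illustration, not part of the original slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; the address here is an assumption
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster: the NameNode records the metadata
        // (block locations), while the DataNodes store the replicated blocks
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // List the directory and show each file's replication factor (three copies by default)
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + " replication=" + status.getReplication());
        }
        fs.close();
    }
}
```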
The core divide-and-conquer distributed processing engine. Earlier versions carried
resource management, but that has now moved to YARN.
It includes JobTracker (master) and TaskTracker (worker) components to run batch
jobs.
1. The MapReduce library in the user program splits the input files into 64–128 MB pieces.
2. Many copies of the program start up on a cluster of machines. One of the copies is the master.
3. The master assigns work: it picks idle workers and assigns each one a map task or a reduce task.
4. Say there are M map tasks and R reduce tasks to assign.
5. A map worker (MW) assigned a map task reads the contents of its input split and buffers
key/value pairs in memory; the buffered pairs are periodically written to its local disk, and
their locations are passed back to the master.
6. After getting this notification, the master tells a reduce worker (RW) about the locations.
The reduce worker uses remote procedure calls (RPC) to read the data from the local disks of
the map workers (MW).
MAP REDUCE
7. When a reduce worker (RW) has read the data, it sorts it by the intermediate keys so that all
occurrences of the same key are grouped together.
8. The output of the Reduce function is appended to a final output file.
9. When all map tasks and reduce tasks have been completed, the master wakes up the user
program: the MapReduce call in the user program returns back to the user code.
10. The output of the MapReduce execution is available in the output files.
MAP REDUCE
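To tie the ten steps above together, here is the classic word-count job written against Hadoop's Java MapReduce API: a minimal sketch that takes its input and output HDFS paths from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in this worker's input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all values for one intermediate key arrive grouped together; sum them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this small example hints at the long Java development cycle criticized below: the class must be compiled, packaged into a jar, and submitted to the cluster before any results appear.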
A criticism of MapReduce is that the default development cycle in Java is very long. Writing the
mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the
results is time consuming.
Pig provides a higher-level abstraction over MapReduce. Pig supports Pig Latin constructs, which
are converted into a Java MapReduce program and then submitted to the Hadoop cluster.
Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop
cluster.
While HiveQL is a declarative language like SQL, Pig Latin is a data flow language. The output of
one Pig Latin construct can be sent as input to another Pig Latin construct, and so on.
PIG AND HIVE
Pig's scripting language, Pig Latin, can process petabytes of data with just a few statements. Pig
originated at Yahoo. Recommended use cases are ETL data pipelines, research on raw data, and
iterative processing.
Key features
Available functions include LOAD, FILTER, GROUP BY, FOREACH, MAX, ORDER, LIMIT, UNION,
CROSS, SPLIT, CUBE, and ROLLUP.
Ability to create user-defined functions (UDFs) in Java and other languages that can be called
within Pig scripts; a sketch of one follows below.
PIG
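The UDF sketch referenced above: a minimal Pig EvalFunc in Java that upper-cases a string field. The class and field names are hypothetical.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A hypothetical UDF that upper-cases the first field of each tuple.
// After packaging into a jar, it could be called from Pig Latin as, e.g.:
//   REGISTER myudfs.jar;
//   B = FOREACH A GENERATE Upper(name);
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // pass nulls through rather than failing the job
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```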
Hive is a framework that sits on top of Hadoop for doing ad-hoc queries on data in Hadoop.
Hive provides a SQL-like abstraction called HQL for data analysts with a SQL background. Hive
originated at Facebook.
Hive provides a SQL-like schema over HDFS data by using a Metastore.
The Metastore, which presents HDFS data as SQL tables, is commonly backed by MySQL; MySQL does
not store the data itself, just the structure of the data stored in HDFS.
Provides the majority of ANSI SQL-like statements for processing, including EXPLAIN and ANALYZE.
Provides partitioning, indexing, and bucketing features to manage performance.
HIVE/ HQL
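HQL can also be submitted from Java through the HiveServer2 JDBC driver. A minimal sketch follows; the host, port, database, and the web_logs table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and database are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Behind the scenes, Hive compiles this HQL into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```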
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop
and structured data stores such as relational databases.
Mostly batch driven; a simple use case is an organization that runs a nightly Sqoop import to
load the day's data from a production DB into HDFS for Hive/HQL analysis.
Provides import and export commands and runs the process in parallel, with data moving in and
out of HDFS.
Parallelizes data transfer for fast performance and optimal system utilization
Copies data quickly from external systems to Hadoop
Makes data analysis more efficient
Mitigates excessive loads to external systems.
SQOOP
Oozie is a workflow scheduling engine specialized in running multi-stage jobs in the Hadoop
ecosystem. It has the ability to monitor and track jobs, recover from errors, and maintain
dependencies between jobs. Workflows are expressed as XML, and no development language is required.
Types of jobs:
Oozie Workflow jobs – jobs that run on demand.
Oozie Coordinator jobs – workflow jobs that run periodically on a regular basis,
triggered by time and data availability.
Oozie Bundles – provide a way to package multiple coordinator and workflow jobs.
OOZIE
Storm provides the ability to process large amounts of real-time data, such as Twitter feeds or
online web log errors that need immediate action. Storm has two kinds of nodes: a master (which
runs the Nimbus daemon) and worker nodes (which run a Supervisor daemon).
Clusters are coordinated by ZooKeeper, and failed nodes can restart. Fast computation and data
transfer are handled by ZeroMQ, a lightweight, high-speed messaging library.
STORM
An in-memory compute engine for machine learning and data science projects. Typical MapReduce
processing involves transferring data between hard disks, but Spark does all data processing and
saving in memory (to a great extent), thus providing roughly a tenfold performance improvement
over disk-based MapReduce.
Spark provides the RDD (Resilient Distributed Dataset) mechanism for in-memory data storage,
with processing primitives that get applied to the whole data set.
SPARK
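A minimal word count in Spark's Java RDD API, illustrating the in-memory RDD mechanism described above; the HDFS paths are assumptions for illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD: a resilient distributed dataset partitioned across the cluster
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        lines.cache(); // keep in memory so later actions avoid re-reading from disk

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///data/word-counts");
        sc.stop();
    }
}
```

Compared with the earlier MapReduce WordCount, the same logic fits in a few chained transformations, and intermediate results stay in memory rather than being written to local disks between stages.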
In addition to the long development cycle noted earlier, those who are not Java programmers
cannot really make good use of Hadoop/HDFS on their own.
PIG AND HIVE
HDFS is a general purpose file system and does not provide fast individual record lookups in files.
HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates)
for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your
data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
HBASE
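A minimal sketch of a write and a fast single-record lookup through the HBase Java client API; the users table, row key, and info column family are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(config);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Fast lookup by row key, the access pattern plain HDFS cannot serve
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```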
ZooKeeper is a centralized service for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. All of these kinds of services are used in
some form or another by distributed applications.
ZOOKEEPER
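A minimal sketch of the ZooKeeper Java client publishing and reading one piece of shared configuration; the connection string and the /app-config znode path are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble and wait until the session is established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a piece of shared configuration as a persistent znode
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Every client in the distributed application reads the same value
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```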
A machine learning and data mining library, designed to be scalable and robust.
Provides a Java interface.
A simple and extensible programming environment and framework for building scalable
algorithms
A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink
Samsara, a vector math experimentation environment with R-like syntax which works at scale
MAHOUT
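As one example of the Java interface mentioned above, here is a minimal sketch using Mahout's classic single-machine "Taste" recommender API (newer Mahout work centers on Samsara, Spark, H2O, and Flink); the ratings.csv file of userID,itemID,preference rows is an assumption for illustration.

```java
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): one userID,itemID,preference triple per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommended items for user 1, scored by estimated preference
        for (RecommendedItem item : recommender.recommend(1L, 3)) {
            System.out.println(item.getItemID() + " score=" + item.getValue());
        }
    }
}
```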
HDInsight is Microsoft's Apache Hadoop implementation, offered as a managed service on Azure
(based on the Hortonworks distribution, as noted above).
HDINSIGHT
THANKS