hadoopdb

26
HadoopDB Inneke Ponet

Upload: hieu

Post on 23-Feb-2016

108 views

Category:

Documents


0 download

DESCRIPTION

HadoopDB. Inneke Ponet. Introduction Technologies for data analysis HadoopDB Desired properties Layers of HadoopDB HadoopDB Components. Introduction. More and more data needs to be stored and processed . - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: HadoopDB

HadoopDBInneke Ponet

Page 2: HadoopDB

Introduction Technologies for data analysis HadoopDB Desired properties Layers of HadoopDB HadoopDB Components

Page 3: HadoopDB

More and more data needs to be stored and processed.

People want to do more and more complex calculations on their collected data.

Analytical databases on high-end machines are moving towards cheaper lower-end machines.

The analytical database market is 27% of the database software market and is growing at a rate of 10,3% annually.

Introduction

Page 4: HadoopDB

Parallel databases: good performance, good efficiency.

MapReduce-based systems: superior scalability, good fault tolerance, good flexibility to handle unstructered data.

Technologies for data analysis

Page 5: HadoopDB

Support for standard relational tables and SQL.

Implements techniques for a better performance: Indexing, compression, materialized views, result

caching, I/O sharing.

Data is partitioned (shared-nothing architecture) transparent to the end-user.

Parallel databases

Page 6: HadoopDB

The DBMS of the most analytical databases are deployed on a shared-noting architecture:

A collection of machines that are independent, are possible virtual, have their own local disk and local main memory, are connected by a high-speed network.

Scalability of machines.

Analysis tasks are easy to parallellize.

Shared-nothing architecture

Page 7: HadoopDB

A technology from Google: processes (un)structured data that is distributed on many

nodes in a shared-nothing cluster; works at enormous scale.

Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles.

MapReduce

Page 8: HadoopDB

No detailed query execution plan in advance at runtime:

adjust to node failures and slow nodes (re)assigning tasks to faster nodes.

Checkpoints the output to local disk minimizing of the work in case of a failure.

MapReduce: advantages

Page 9: HadoopDB

Hybrid database: a combination of:

traditional DBMS, MapReduce-technology.

Developed by Yale University students: Azza Abouzeid and Kamil BajDa-Pawlikowski

It is free and open source.

HadoopDB

Page 10: HadoopDB

A. PerformanceB. Fault toleranceC. Heterogeneous environmentD. Flexible query interfaceE. Scalability

Desired properties

Page 11: HadoopDB

Primary characteristic to distinguish.

MapReduce: first modeling and loading data before processing slower performance than parallel databases.

Cost saving: faster software product cheaper than a hardware upgrade or buying additional hardware.

A. Performance

Parallel databases MapReduce

Page 12: HadoopDB

Succesfully commit transactions. Make progress on a workload.

Heterogeneity and scalibility more faultsBUT MapReduce good fault tolerance: reassigning tasks; sub-tasks minimize the effect of faults.

Parallel databases: assumption failures are rare more testing => slower performance.

B. Fault tolerance

Parallel databases MapReduce

Page 13: HadoopDB

Nodes don’t always run on identical hardware, an identical virtual machine. Different performance.

Parallel databases: not tested on more than 100 nodes.

C. Heterogeneous environment

Parallel databases MapReduce

Page 14: HadoopDB

Easy to make queries: SQL and non-SQL interface languages, Use of tools.

Robust mechanisme for writing UDFs.

Parallel databases: SQL, ODBC and UDFs. MapReduce-based systems: it is possible (Hive),

but not always (Hadoop).

D. Flexible query interface

Parallel databases / MapReduce

Page 15: HadoopDB

Traditional DBMS: only scalable to 100 nodes.

MapReduce-based systems: designed to scale to thousands of nodes in a shared-

nothing architecture.

E. Scalability

Reasons Assumption

Failures Failures are rare.

Hetrogeneity Homogeneous array of machines.

Not tested There are no applications with more than a few dozen nodes.

Parallel databases MapReduce

Page 16: HadoopDB

Parallel databases MapReduce

Performance Fault tolerance Heterogeneous environment

Flexible query interface /Scalability

Desired properties

Page 17: HadoopDB

Communication: Hadoop

Database: PostgreSQL

Translation: Hive

Layers of HadoopDB

Page 18: HadoopDB

Communication layer of HadoopDB.

Hadoop framework two layers: Hadoop Distributed File System (HDFS), MapReduce framework.

Cost: free/open source MapReduce.

Hadoop

Page 19: HadoopDB

Relational DBMS.

(Possible) database layer of HadoopDB.

Cost: free/open source.

PostgreSQL

Page 20: HadoopDB

Translation layer. Processing of a SQL query:

Query Abstract Syntax Tree. MetaStore: schema of the table(s). Logical query plan: DAG of relational operators. Optimized plan. Physical executable plan: MapReduce job(s). XML plan: DAG serialized. Hive Driver executes a Hadoop job.

Hive

Page 21: HadoopDB

Database Connector: Interface between independent database systems; Extends the InputFormat class (of Hadoop); Connect to any JDBC-compliant database.

Catalog: Meta-information about the databases:

connection parameters, metadata.

XML file in HDFS accessed by: Master node, Worker/Slave nodes.

HadoopDB components

Page 22: HadoopDB

Data loader: Global hasher:

Custom MapReduce job files in HDFS; Repartioning data upon loading.

Local hasher: Copies partition from HDFS to local file system; Partitions the file in smaller sized chunks.

HadoopDB Components (2)

Page 23: HadoopDB

SQL to MapReduce: Parallel database front-end to process SQL queries.

HiveQL ↓ TransformMapReduce jobs: Connect to tables stored in HDFS; Consists of DAGs of relational operators that operate as iterators.

Assumption no collection of tables: Operations on multiple tables Reduce function.

NOT in HadoopDB: a join operation can be pushed to the databse layer.

HadoopDB Components (3)

Page 24: HadoopDB

SQL/SMS planner: Modifies Hive:

Updates the MetaStore Two passes over the physical plan:

1. Determine the partition keys for the Reduce Sink Operators.2. Operators are:

converted in SQL querie(s); pushed into the database layer.

Only filter, select and aggregation operators.

HadoopDB Components (4)

Page 25: HadoopDB

HadoopDB Components (5)

Page 26: HadoopDB

Questions?