hadoopdb

HadoopDBInneke Ponet

Introduction Technologies for data analysis HadoopDB Desired properties Layers of HadoopDB HadoopDB Components

More and more data needs to be stored and processed.

People want to do more and more complex calculations on their collected data.

Analytical databases on high-end machines are moving towards cheaper lower-end machines.

The analytical database market is 27% of the database software market and is growing at a rate of 10,3% annually.

Introduction

Parallel databases: good performance, good efficiency.

MapReduce-based systems: superior scalability, good fault tolerance, good flexibility to handle unstructered data.

Technologies for data analysis

Support for standard relational tables and SQL.

Implements techniques for a better performance: Indexing, compression, materialized views, result

caching, I/O sharing.

Data is partitioned (shared-nothing architecture) transparent to the end-user.

Parallel databases

The DBMS of the most analytical databases are deployed on a shared-noting architecture:

A collection of machines that are independent, are possible virtual, have their own local disk and local main memory, are connected by a high-speed network.

Scalability of machines.

Analysis tasks are easy to parallellize.

Shared-nothing architecture

A technology from Google: processes (un)structured data that is distributed on many

nodes in a shared-nothing cluster; works at enormous scale.

Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles.

MapReduce

No detailed query execution plan in advance at runtime:

adjust to node failures and slow nodes (re)assigning tasks to faster nodes.

Checkpoints the output to local disk minimizing of the work in case of a failure.

MapReduce: advantages

Hybrid database: a combination of:

traditional DBMS, MapReduce-technology.

Developed by Yale University students: Azza Abouzeid and Kamil BajDa-Pawlikowski

It is free and open source.

HadoopDB

A. PerformanceB. Fault toleranceC. Heterogeneous environmentD. Flexible query interfaceE. Scalability

Desired properties

Primary characteristic to distinguish.

MapReduce: first modeling and loading data before processing slower performance than parallel databases.

Cost saving: faster software product cheaper than a hardware upgrade or buying additional hardware.

A. Performance

Parallel databases MapReduce

Succesfully commit transactions. Make progress on a workload.

Heterogeneity and scalibility more faultsBUT MapReduce good fault tolerance: reassigning tasks; sub-tasks minimize the effect of faults.

Parallel databases: assumption failures are rare more testing => slower performance.

B. Fault tolerance


Nodes don’t always run on identical hardware, an identical virtual machine. Different performance.

Parallel databases: not tested on more than 100 nodes.

C. Heterogeneous environment


Easy to make queries: SQL and non-SQL interface languages, Use of tools.

Robust mechanisme for writing UDFs.

Parallel databases: SQL, ODBC and UDFs. MapReduce-based systems: it is possible (Hive),

but not always (Hadoop).

D. Flexible query interface

Parallel databases / MapReduce

Traditional DBMS: only scalable to 100 nodes.

MapReduce-based systems: designed to scale to thousands of nodes in a shared-

nothing architecture.

E. Scalability

Reasons Assumption

Failures Failures are rare.

Hetrogeneity Homogeneous array of machines.

Not tested There are no applications with more than a few dozen nodes.



Performance Fault tolerance Heterogeneous environment

Flexible query interface /Scalability

Desired properties

Communication: Hadoop

Database: PostgreSQL

Translation: Hive

Layers of HadoopDB

Communication layer of HadoopDB.

Hadoop framework two layers: Hadoop Distributed File System (HDFS), MapReduce framework.

Cost: free/open source MapReduce.

Hadoop

Relational DBMS.

(Possible) database layer of HadoopDB.

Cost: free/open source.

PostgreSQL

Translation layer. Processing of a SQL query:

Query Abstract Syntax Tree. MetaStore: schema of the table(s). Logical query plan: DAG of relational operators. Optimized plan. Physical executable plan: MapReduce job(s). XML plan: DAG serialized. Hive Driver executes a Hadoop job.

Hive

Database Connector: Interface between independent database systems; Extends the InputFormat class (of Hadoop); Connect to any JDBC-compliant database.

Catalog: Meta-information about the databases:

connection parameters, metadata.

XML file in HDFS accessed by: Master node, Worker/Slave nodes.

HadoopDB components

Data loader: Global hasher:

Custom MapReduce job files in HDFS; Repartioning data upon loading.

Local hasher: Copies partition from HDFS to local file system; Partitions the file in smaller sized chunks.

HadoopDB Components (2)

SQL to MapReduce: Parallel database front-end to process SQL queries.

HiveQL ↓ TransformMapReduce jobs: Connect to tables stored in HDFS; Consists of DAGs of relational operators that operate as iterators.

Assumption no collection of tables: Operations on multiple tables Reduce function.

NOT in HadoopDB: a join operation can be pushed to the databse layer.


SQL/SMS planner: Modifies Hive:

Updates the MetaStore Two passes over the physical plan:

1. Determine the partition keys for the Reduce Sink Operators.2. Operators are:

converted in SQL querie(s); pushed into the database layer.

Only filter, select and aggregation operators.


Questions?

hadoopdb

Documents

data analysis4support

databasesunstructered

mapreducestructured

exploding data

collected data

data needs

processes unstructured

analytical dbms databases