hadoopdb
Post on 23-Feb-2016
108 Views
Preview:
DESCRIPTION
TRANSCRIPT
HadoopDBInneke Ponet
Introduction Technologies for data analysis HadoopDB Desired properties Layers of HadoopDB HadoopDB Components
More and more data needs to be stored and processed.
People want to do more and more complex calculations on their collected data.
Analytical databases on high-end machines are moving towards cheaper lower-end machines.
The analytical database market is 27% of the database software market and is growing at a rate of 10,3% annually.
Introduction
Parallel databases: good performance, good efficiency.
MapReduce-based systems: superior scalability, good fault tolerance, good flexibility to handle unstructered data.
Technologies for data analysis
Support for standard relational tables and SQL.
Implements techniques for a better performance: Indexing, compression, materialized views, result
caching, I/O sharing.
Data is partitioned (shared-nothing architecture) transparent to the end-user.
Parallel databases
The DBMS of the most analytical databases are deployed on a shared-noting architecture:
A collection of machines that are independent, are possible virtual, have their own local disk and local main memory, are connected by a high-speed network.
Scalability of machines.
Analysis tasks are easy to parallellize.
Shared-nothing architecture
A technology from Google: processes (un)structured data that is distributed on many
nodes in a shared-nothing cluster; works at enormous scale.
Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles.
MapReduce
No detailed query execution plan in advance at runtime:
adjust to node failures and slow nodes (re)assigning tasks to faster nodes.
Checkpoints the output to local disk minimizing of the work in case of a failure.
MapReduce: advantages
Hybrid database: a combination of:
traditional DBMS, MapReduce-technology.
Developed by Yale University students: Azza Abouzeid and Kamil BajDa-Pawlikowski
It is free and open source.
HadoopDB
A. PerformanceB. Fault toleranceC. Heterogeneous environmentD. Flexible query interfaceE. Scalability
Desired properties
Primary characteristic to distinguish.
MapReduce: first modeling and loading data before processing slower performance than parallel databases.
Cost saving: faster software product cheaper than a hardware upgrade or buying additional hardware.
A. Performance
Parallel databases MapReduce
Succesfully commit transactions. Make progress on a workload.
Heterogeneity and scalibility more faultsBUT MapReduce good fault tolerance: reassigning tasks; sub-tasks minimize the effect of faults.
Parallel databases: assumption failures are rare more testing => slower performance.
B. Fault tolerance
Parallel databases MapReduce
Nodes don’t always run on identical hardware, an identical virtual machine. Different performance.
Parallel databases: not tested on more than 100 nodes.
C. Heterogeneous environment
Parallel databases MapReduce
Easy to make queries: SQL and non-SQL interface languages, Use of tools.
Robust mechanisme for writing UDFs.
Parallel databases: SQL, ODBC and UDFs. MapReduce-based systems: it is possible (Hive),
but not always (Hadoop).
D. Flexible query interface
Parallel databases / MapReduce
Traditional DBMS: only scalable to 100 nodes.
MapReduce-based systems: designed to scale to thousands of nodes in a shared-
nothing architecture.
E. Scalability
Reasons Assumption
Failures Failures are rare.
Hetrogeneity Homogeneous array of machines.
Not tested There are no applications with more than a few dozen nodes.
Parallel databases MapReduce
Parallel databases MapReduce
Performance Fault tolerance Heterogeneous environment
Flexible query interface /Scalability
Desired properties
Communication: Hadoop
Database: PostgreSQL
Translation: Hive
Layers of HadoopDB
Communication layer of HadoopDB.
Hadoop framework two layers: Hadoop Distributed File System (HDFS), MapReduce framework.
Cost: free/open source MapReduce.
Hadoop
Relational DBMS.
(Possible) database layer of HadoopDB.
Cost: free/open source.
PostgreSQL
Translation layer. Processing of a SQL query:
Query Abstract Syntax Tree. MetaStore: schema of the table(s). Logical query plan: DAG of relational operators. Optimized plan. Physical executable plan: MapReduce job(s). XML plan: DAG serialized. Hive Driver executes a Hadoop job.
Hive
Database Connector: Interface between independent database systems; Extends the InputFormat class (of Hadoop); Connect to any JDBC-compliant database.
Catalog: Meta-information about the databases:
connection parameters, metadata.
XML file in HDFS accessed by: Master node, Worker/Slave nodes.
HadoopDB components
Data loader: Global hasher:
Custom MapReduce job files in HDFS; Repartioning data upon loading.
Local hasher: Copies partition from HDFS to local file system; Partitions the file in smaller sized chunks.
HadoopDB Components (2)
SQL to MapReduce: Parallel database front-end to process SQL queries.
HiveQL ↓ TransformMapReduce jobs: Connect to tables stored in HDFS; Consists of DAGs of relational operators that operate as iterators.
Assumption no collection of tables: Operations on multiple tables Reduce function.
NOT in HadoopDB: a join operation can be pushed to the databse layer.
HadoopDB Components (3)
SQL/SMS planner: Modifies Hive:
Updates the MetaStore Two passes over the physical plan:
1. Determine the partition keys for the Reduce Sink Operators.2. Operators are:
converted in SQL querie(s); pushed into the database layer.
Only filter, select and aggregation operators.
HadoopDB Components (4)
HadoopDB Components (5)
Questions?
top related