mapreduce debates and schema-free

http://www.coordguru.com

Woohyun Kim

The creator of open source “Coord”

(http://www.coordguru.com)

2010-03-03

MapReduce Debates and Schema-Free: Big Data, MapReduce, RDBMS+MapReduce, Non-Relational DB

The Advent of Big Data

Noah’s Ark Problem

• Did Noah take dinosaurs on the Ark?

• The Ark was a very large ship designed especially for its important purpose

• It was so large and complex that it took Noah 120 years to build

• How to put such a big thing

  • Diet or DNA?

  • Differentiate, Put, and Integrate

• Larger?

• More?

• The “Big Data” problem is just like that

  • Compression or Reduction

    • gzip, Fingerprint, DNA, MD5, …

  • Scale Up

  • Scale Out

Perspectives of Big Data

•SQL

•MapReduce

•Key-Value

•RESTFul

•OLAP

•Text/Data Mining

•Social/Semantic Analysis

•Visualization

•Reporting

•SQL

•MapReduce

•Pig

•Hive, CloudBase

•SAN

•HDFS

•Hbase, Voldemort, MongoDB, Cassandra

•HadoopDB

Store Process

Retrieve Analyze

Struggling to STORE and ANALYZE “Big Data”

How to deal with “Big Data”

A User Credit Model

Case Study: User Credit Analysis

Confidence_negative

Penalty_cnt Admin_delete_cnt

0.5 0.5

Confidence_positive_content

Aha_best_cnt

Confidence_negative_user

Is_honor Dredt_level

0.3 0.4

Is_sponsor

Confidence_positive

0.5 0.5

confidence

-0.5 0.5

popularity_positive

best_answer_cnt

Total_kinup_point

0.7 0.3

Popularity_negative

Report_cnt

popularity

-0.2 0.8

quality

0.3 0.7∑

amount

Open100_write_cnt

Answer_cnt

0.3 0.6

Question_cnt

User Credit

0.5 0.5∑
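
The upper levels of the model above read as nested weighted sums (∑). A minimal sketch of that arithmetic follows, assuming the weights pair with their inputs in the order they appear (for example, confidence = -0.5 × negative + 0.5 × positive) and that all inputs are already normalized to [0, 1]. The pairing, the normalization, and the function names are assumptions for illustration, not something the slide states.

```python
# Weights are taken from the figure text; which weight belongs to which input
# is an assumption based on the order in which they appear.

def weighted_sum(values, weights):
    """Return the sum of weight * value over paired inputs."""
    return sum(w * v for v, w in zip(values, weights))


def user_credit(confidence_negative, confidence_positive,
                popularity_negative, popularity_positive, amount):
    """Combine the intermediate scores into one credit value.

    All arguments are assumed to be normalized to [0, 1] beforehand.
    """
    confidence = weighted_sum([confidence_negative, confidence_positive], [-0.5, 0.5])
    popularity = weighted_sum([popularity_negative, popularity_positive], [-0.2, 0.8])
    quality = weighted_sum([confidence, popularity], [0.3, 0.7])
    return weighted_sum([quality, amount], [0.5, 0.5])


# A user with no negative signals and maximal positive signals.
score = user_credit(0.0, 1.0, 0.0, 1.0, 1.0)
```

Because the negative branches carry negative weights, a user with many penalties or reports is pulled below a user with the same positive signals.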

Preprocessing Blog Data for Analyzing User Credit

Case Study: User Credit Analysis

pt_log1.csv

pt_attachfile1.csv

make_blog_post_info.cpp

pt_buddy.csv

pt_count.csv

pt_power_blog1.csv

pt_comment1.csv

cal_buddy_cnt.cpp

att_visit_count.cpp

att_is_powerblogger.cpp

att_commenting.cpp

att_pt_log.cpp

Post * Attachment

Buddy * Count

Buddy/Count * PowerBlogger

Buddy/Count/PowerBlogger * Comment

Post/Attachment *Buddy/Count/PowerBlogger/Comment Blog Post

Blogger

New Changes surrounding Data Storages

• Volume

  • Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes, and often petabytes, in recent years

• Scale Out

  • Relational databases are hard to scale

    • Partitioning (for scalability)

    • Replication (for availability)

• Speed

  • The seek times of physical storage are not keeping pace with improvements in network speeds

• Integration

  • Today’s data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network

“Relations” get broken

“New Relations”

Hadoop Revolution

Best Practice in Hadoop

• Software Stack in Google/Hadoop

• Cookbook for “Big Data”

• Structured Data Storage for “Big Data”

[Figure: structured data layout with row key, column key, and column families]

Hadoop is changing the Game

• Hadoop, DW, and BI

“Big Data” goes well with Hadoop

• Parallelize Relational Algebra Operations using MapReduce

Case Study: Parallel Join

• A Parallel Join Example using MapReduce

Case Study: Further Study in Parallel Join

Problems

• Need to sort

• Move the partitioned data across the network

• Due to shuffling, must send the whole data

• Skewed by popular keys

• All records for a particular key are sent to the same reducer

• Overhead by tagging
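
The reduce-side join that these problems refer to can be sketched in a few lines. This is a single-process simulation under stated assumptions, not Hadoop code: `shuffle` stands in for the sort-and-shuffle step, the tags "M" and "R" are hypothetical, and the sample movies and ratings are made up.

```python
from collections import defaultdict


def map_phase(movies, ratings):
    """Emit (join key, tagged record) pairs; the tag names the source relation."""
    for movie_id, title in movies:
        yield movie_id, ("M", title)
    for movie_id, stars in ratings:
        yield movie_id, ("R", stars)


def shuffle(pairs):
    """Group values by key, standing in for the network sort-and-shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Join the two tagged sides inside each key group."""
    for key in sorted(groups):
        titles = [v for tag, v in groups[key] if tag == "M"]
        stars = [v for tag, v in groups[key] if tag == "R"]
        for title in titles:
            for s in stars:
                yield key, title, s


movies = [(1, "Alien"), (2, "Brazil")]
ratings = [(1, 5), (1, 4), (2, 3), (3, 1)]
joined = list(reduce_phase(shuffle(map_phase(movies, ratings))))
```

Every record of both inputs passes through `shuffle`, and all records for one key land in one group, which is where the data movement, skew, and tagging overhead listed above come from.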

Alternatives

• Map-side Join

  • Mapper-only job to avoid sort and to reduce data movement across the network

• Semi-Join

  • Shrink data size through semi-join (by preprocessing)

Case Study: Improvements in Parallel Join

Map-Side Join

• Replicate a relatively smaller input source to the cluster

  • Put the replicated dataset into a local hash table

• Join a relatively larger input source with each local hash table

  • Mapper: do Mapper-side Join
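
The map-side join steps above can be sketched as follows. This is a plain-Python illustration, assuming the smaller input fits in memory on every mapper; the function name, the partition layout, and the sample data are hypothetical.

```python
def map_side_join(small_relation, large_partition):
    """Join one partition of the larger input against a replicated hash table."""
    lookup = dict(small_relation)      # local hash table, rebuilt on each mapper
    for movie_id, stars in large_partition:
        if movie_id in lookup:         # inner join: unmatched records are dropped
            yield movie_id, lookup[movie_id], stars


movies = [(1, "Alien"), (2, "Brazil")]            # small input, replicated everywhere
partitions = [[(1, 5), (3, 1)], [(2, 3), (1, 4)]]  # large input, split across mappers
joined = [row for part in partitions for row in map_side_join(movies, part)]
```

No sort or shuffle is involved: each partition is joined where it lives, at the cost of copying the small input to every node.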

Semi-Join

• Extract: unique IDs referenced in a larger input source (A)

  • Mapper: extract Movie IDs from Ratings records

  • Reducer: accumulate all unique Movie IDs

• Filter: the other larger input source (B) with the referenced unique IDs

  • Mapper: filter the referenced Movie IDs from full Movie dataset

• Join: a larger input source (A) with the filtered datasets

  • Mapper: do Mapper-side Join

    • Ratings records & the filtered movie IDs dataset
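
The three semi-join phases (extract, filter, join) can be sketched as three small functions. This is a single-process illustration, assuming each phase would run as its own job; the function names and sample data are hypothetical.

```python
def extract_ids(ratings):
    """Phase 1: collect the unique movie IDs referenced by the ratings."""
    return {movie_id for movie_id, _ in ratings}


def filter_movies(movies, referenced_ids):
    """Phase 2: keep only the movies that some rating refers to."""
    return [(mid, title) for mid, title in movies if mid in referenced_ids]


def join(ratings, filtered_movies):
    """Phase 3: map-side join against the (now smaller) filtered movie set."""
    lookup = dict(filtered_movies)
    return [(mid, lookup[mid], stars) for mid, stars in ratings if mid in lookup]


movies = [(1, "Alien"), (2, "Brazil"), (3, "Casablanca")]
ratings = [(1, 5), (1, 4), (3, 2)]
filtered = filter_movies(movies, extract_ids(ratings))
joined = join(ratings, filtered)
```

The preprocessing pays off when many rows of the second input are never referenced, because only the filtered rows have to be replicated for the final join.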

MapReduce Debates

MapReduce is just A Major Step Backwards!!!

DeWitt and Stonebraker, January 17, 2008

• A giant step backward in the programming paradigm for large-scale data intensive applications

  • Schemas are good

    • Type check in runtime, so no garbage

  • Separation of the schema from the application is good

    • Schema is stored in catalogs, so can be queried (in SQL)

  • High-level access languages are good

    • Present what you want rather than an algorithm for how to get it

  • No schema??!

    • At least one data field by specifying the key as input

    • For Bigtable/HBase, different tuples within the same table can actually have different schemas

    • There is not even support for logical schema changes such as

MapReduce is just A Major Step Backwards!!! (cont’d)

DeWitt and Stonebraker, January 17, 2008

• A sub-optimal implementation, in that it uses brute force instead of indexing

  • Indexing

    • All modern DBMSs use hash or B-tree indexes to accelerate access to data

    • In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search

    • However, MapReduce has no indexes, so it processes data only in brute-force fashion

  • Automatic parallel execution

    • In the 1980s, the DBMS research community explored it in systems such as Gamma, Bubba, and Grace, and even in commercial Teradata

  • Skew

    • The distribution of records with the same key is skewed in the map phase, which causes some reduce tasks to take much longer than others

  • Intermediate data pulling

    • In the reduce phase, two or more reduce tasks attempt to read input files from the same map node simultaneously

• Not novel at all – it represents a specific implementation of well known techniques developed nearly 25 years ago

  • Partitioning for join

    • Application of Hash to Data Base Machine and its Architecture, 1983

  • Joins in parallel on a shared-nothing

    • Multiprocessor Hash-based Join Algorithms, 1985

    • The Case for Shared-Nothing, 1986

  • Aggregates in parallel

    • The Gamma Database Machine Project, 1990

    • Parallel Database Systems: The Future of High Performance Database Systems, 1992

    • Adaptive Parallel Aggregation Algorithms, 1995

  • Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years

  • PostgreSQL supported user-defined functions and user-defined aggregates in the mid 1980s

• Missing most of the features that are routinely included in current DBMS

  • MapReduce provides only a sliver of the functionality found in modern DBMSs

    • Bulk loader – transform input data in files into a desired format and load it into a DBMS

    • Indexing – hash or B-Tree indexes

    • Updates – change the data in the data base

    • Transactions – support parallel update and recovery from failures during update

    • Integrity constraints – help keep garbage out of the data base

    • Referential integrity – again, help keep garbage out of the data base

    • Views – so the schema can change without having to rewrite the application program

• Incompatible with all of the tools DBMS users have come to depend on

  • MapReduce cannot use the tools available in a modern SQL DBMS, and has none of its own

    • Report writers (Crystal Reports)

      • Prepare reports for human visualization

    • Business intelligence tools (Business Objects or Cognos)

      • Enable ad-hoc querying of large data warehouses

    • Data mining tools (Oracle Data Mining or IBM DB2 Intelligent Miner)

      • Allow a user to discover structure in large data sets

    • Replication tools (Golden Gate)

      • Allow a user to replicate data from one DBMS to another

    • Database design tools (Embarcadero)

      • Assist the user in constructing a data base

What the !@# MapReduce?

RDB experts Jump the MR Shark

Greg Jorgensen, January 17, 2008

• Arg1: MapReduce is a step backwards in database access

  • MapReduce is not a database, a data storage, or management system

  • MapReduce is an algorithmic technique for the distributed processing of large amounts of data

• Arg2: MapReduce is a poor implementation

  • MapReduce is one way to generate indexes from a large volume of data, but it’s not a data storage and retrieval system

• Arg3: MapReduce is not novel

  • Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what?

  • The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers

• Arg4: MapReduce is missing features

• Arg5: MapReduce is incompatible with the DBMS tools

• The ability to process a huge volume of data quickly, such as web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness

DBs are hammers; MR is a screwdriver

Mark C. Chu-Carroll

• RDBs don’t parallelize very well

  • How many RDBs do you know that can efficiently split a task among 1,000 cheap computers?

• RDBs don’t handle non-tabular data well

  • RDBs are notorious for doing a poor job on recursive data structures

• MapReduce isn’t intended to replace relational databases

• It’s intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines

Eugene Shekita

• Arg1: Data Models, Schemas, and Query Languages

  • Semi-structured data models and high-level parallel data flow query languages are built on top of MapReduce

    • Pig, Hive, Jaql, Cascading, Cloudbase

  • Hadoop will eventually have a real data model, schema, catalogs, and query language

  • Moreover, Pig, Jaql, and Cascading are some steps forward

    • Support semi-structured data

    • Support high-level parallel data flow languages rather than declarative query languages

    • Greenplum and Aster Data support MapReduce, but look more limited than Pig, Jaql, Cascading

      • The calls to MapReduce functions wrapped in SQL queries will make it difficult to work with semi-structured data and program multi-step dataflows

• Arg3: Novelty

  • Teradata was doing parallel group-by 20 years ago

  • UDAs and UDFs appeared in PostgreSQL in the mid 80s

  • And yet, MapReduce is much more flexible, and fault-tolerant

    • Support semi-structured data types, customizable partitioning

MR is a Step Backwards, but some Steps Forward

Lessons Learned from the Debates

Who Moved My Cheese?

Hybrids of MapReduce and RDBMS

Integrate MapReduce into RDBMS

HadoopDB Greenplum Aster Data

Sybase IQ

Oracle+Hadoop

Vertica+Hadoop

Netezza+MapReduce Teradata+MapReduce

HadoopDB Details

Connection parameters

- database location

- driver class

- credentials

Metadata

- dataset

- replica locations

- data partitioning

HadoopDB Architecture

An Interesting Friendship of RDBMS and MapReduce

 | RDBMS | MapReduce
Data size | Gigabytes | Petabytes
Updates | Read and write (Mutable) | Write once, read many times (Immutable)
Latency | Low | High
Access | Interactive (point query) and batch | Batch (ad-hoc query in brute-force)
Structure | Fixed schema | Semi-structured schema
Language | SQL | Procedural (Java, C++, etc)
Integrity | High | Low
Scaling | Nonlinear | Linear

RDBMS vs. MapReduce

Pig, Hive, CloudBase

SQL or Script

MapReduce

Greenplum, Aster Data, HadoopDB

MapReduce

Scalability, Fault tolerance, Flexibility

Performance, Efficiency

RDBMS + MapReduce

In-Database MapReduce vs. File-only MapReduce

 | In-Database MapReduce | File-Only MapReduce
Target User | Analyst, DBA, Data Miner | Computer Science Engineer
Scale & Performance | High | High
Hardware Costs | Low | Low
Analytical Insights | High | High
Failover & Recovery | High | High
Use: Ad-Hoc Queries | Easy (seamless) | Harder (custom)
Use: UI, Client Tools | BI Tool (GUI), SQL (CLI) | Developer Tool (Java)
Use: Ecosystem | High (JDBC, ODBC) | Lower (custom)
Protect: Data Integrity | High (ACID, schema) | Lower (no transaction guarantees)
Protect: Security | High (roles, privileges) | Lower (custom)
Protect: Backup & DR | High (database backup/DR) | Lower (custom)
Performance: Mixed Workloads | High (workload/QoS mgmt) | Lower (limited concurrency)
Performance: Network Bottleneck | No (optimized partitioning) | Higher (network inefficient)
Operational Cost | Low (1 DBA) | Higher (several engineers)

• In-Database MapReduce

• Greenplum, Aster Data, HadoopDB

• File-only MapReduce

• Pig, Hive, Cloudbase

Why Non-Relational?

Challenges in Traditional RDBMS

• Volume

  • Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes, and often petabytes, in recent years

• Speed

  • The seek times of physical storage are not keeping pace with improvements in network speeds

“New Relations”

Challenges in Traditional RDBMS (cont’d)

• Scale Out

  • Is it possible to achieve a large number of simple read/write operations per second?

  • Traditional RDBMSs have not provided good horizontal scaling for OLTP

    • Partitioning (for scalability)

    • Replication (for availability)

  • Data warehousing RDBMSs provide horizontal scaling of complex joins and queries

    • Most of them are read-only or read-mostly

• Integration

  • Today’s data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network

“Relations” get broken

The New Faces of Data

• Scale out

  • CAP Theorem

    • The CAP theorem states that any distributed data system can achieve only two of these three at any given time: Consistency, Availability, and Partition tolerance

    • Hence when building distributed systems, Just Pick 2/3

  • Design Issues

    • ACID: Atomicity, Consistency, Isolation, Durability

    • BASE: Basically Available, Soft-state, Eventual Consistency

The New Faces of Data (cont’d)

• Sparsity

  • Some data have sparse attributes

    • document-term vector

    • user-item matrix

    • semantic or social relations

  • Some data do not need ‘relational’ property, or complex join queries

    • log-structured data

    • stacking or streamed data

    • e.g. Facebook, Server Density (MySQL -> MongoDB)

• Immutable

  • Do not need update and delete data, only insert it with versions

    • tracking history

  • lock-free

    • atomicity is based on just a key
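
The "insert only, with versions" idea can be sketched as an append-only store. This is a minimal in-memory illustration, not the design of any particular product; the class name `VersionedStore`, its methods, and the counter used as a version source are assumptions.

```python
import itertools


class VersionedStore:
    """Append-only store: a write adds a new version, nothing is overwritten."""

    def __init__(self):
        self._versions = {}                # key -> list of (version, value)
        self._clock = itertools.count(1)   # stand-in for a timestamp source

    def insert(self, key, value):
        """Record a new version of the value under this key."""
        version = next(self._clock)
        self._versions.setdefault(key, []).append((version, value))
        return version

    def latest(self, key):
        """Return the most recently inserted value."""
        return self._versions[key][-1][1]

    def history(self, key):
        """Return every (version, value) pair, oldest first."""
        return list(self._versions[key])


store = VersionedStore()
store.insert("user:1", {"name": "kim"})
store.insert("user:1", {"name": "kim", "city": "Seoul"})
```

Since every write touches exactly one key and only appends, older versions stay readable for history tracking and a reader never sees a half-updated value.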

Schema-Free

Non-Relational Databases

Trends of Emergent Data Stores

On-going classification by Woohyun Kim

[Figure: Google Trend (Jan.)]

Emergent Data Stores in CAP Dimension

CAP Dimension

Key Features of Non-Relational Databases

• Common Features

  • A call level interface (in contrast to a SQL binding)

    • HTTP/REST or easy to program APIs

  • Fast indexes on large amounts of data

    • Lookups by one or more keys (key-value or document)

  • Ability to horizontally scale throughput over many servers

    • Automatic sharding or client-side manual sharding

    • Built-in replication (sync or async)

    • Eventual Consistency

  • Ability to dynamically define attributes or data schema

    • Key-Value, Column, or Document

  • Support for MapReduce
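
Client-side sharding, one of the scaling options listed above, can be sketched with a hash of the key. This is a simplified illustration, assuming a fixed server list known to every client; the function name `shard_for` and the server names are hypothetical.

```python
import hashlib


def shard_for(key, servers):
    """Pick a server by hashing the key; every client computes the same answer."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


servers = ["node1", "node2", "node3"]   # hypothetical server names
placement = {k: shard_for(k, servers) for k in ["user:1", "user:2", "user:3"]}
```

Plain modulo placement moves most keys when the server list changes. Stores that need to add or remove servers smoothly tend to use consistent hashing instead, which this sketch does not attempt.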

Data Models of Non-Relational Databases

• Data Models

  • Tuple

    • A set of attribute-value pairs

    • Attribute names are defined in a schema

    • Values must be scalar (like numbers and strings), not BLOBs

    • The values are referenced by attribute name, not by ordinal position

  • Document

    • A set of attribute-value pairs

    • Attribute names are dynamically defined for each document at runtime

      • Unlike Tuple, there is no global schema for attributes

    • Values may be complex values or nested values

    • Multiple indexes are supported

  • Extensible Record

    • A hybrid between Tuple and Document

    • Families of attributes are defined in a schema

    • New attributes can be defined (within an attribute family) on a per-record basis

  • Object

    • A set of attribute-value pairs

    • Values may be complex values or pointers to other objects
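
The difference between the tuple, document, and extensible record models can be shown with plain dictionaries. This is a sketch for illustration only: the schema, the family names, and the helper functions `make_tuple` and `put` are hypothetical and do not correspond to any store's API.

```python
# Tuple: attribute names come from one schema shared by every row.
SCHEMA = ("id", "name", "age")


def make_tuple(**attrs):
    """Accept a row only if its attributes match the schema exactly."""
    if set(attrs) != set(SCHEMA):
        raise ValueError("attributes must match the schema exactly")
    return attrs


# Document: attributes are chosen per document and values may nest.
doc = {"id": 1, "name": "kim", "tags": ["db", "mr"], "address": {"city": "Seoul"}}

# Extensible record: families are fixed, attributes inside a family are not.
FAMILIES = ("profile", "stats")


def put(record, family, attribute, value):
    """Add any attribute, as long as it goes into a declared family."""
    if family not in FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    record.setdefault(family, {})[attribute] = value
    return record


row = make_tuple(id=1, name="kim", age=30)
rec = put(put({}, "profile", "name", "kim"), "stats", "answer_cnt", 12)
```

The tuple rejects anything outside its schema, the document accepts anything, and the extensible record sits between the two by checking only the family.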

Classes of Non-Relational Databases

• Classification by Data Model

  • Key-value Stores

    • Store values and an index to find them

    • Provide replication, versioning, locking, transactions, sorting, etc.

  • Document Stores

    • Store indexed documents (with multiple indexes)

    • Do not support locking, synchronous replication, or ACID transactions

    • Instead of ACID, support BASE for much higher performance and scalability

    • Provide some simple query mechanisms

  • Extensible Record Stores (= Column-oriented Stores)

    • Store extensible records that can be horizontally and vertically partitioned across nodes

      • Both rows and columns are split over multiple nodes

      • Rows are split across nodes by range partitioning

      • Columns of a table are distributed over multiple nodes by using “column groups”

  • Relational Databases

    • Store, index, and query tuples

    • Some new RDBMSs provide horizontal scaling

A Comparison of Non-Relational Databases

On-going classification by Woohyun Kim

Project | Language | Replication | Partitioning | Persistence | Consistency & Transaction | Client Protocol | Data model | Community
Bigtable | C++ | Sync (GFS) | Range | Memtable/SSTable on GFS | Lock + limited ACID transactions | Custom API | Column | A Google, no
HBase | Java | Sync (HDFS) | Range | Memtable/SSTable on HDFS | Lock + limited ACID transactions | Custom API, Thrift, Rest | Column | A Apache, yes
Hypertable | C++ | Sync (FS) | Range | CellCache/CellStore on any FS | Lock + limited ACID transactions | Thrift, other | Column | A Zvents, Baidu, yes
Cassandra | Java | Async | Hash | On-disk | MVCC + limited ACID transactions | Thrift | Column & Key-Value | B Facebook, no
Coord | C++ | Sync (on client-side) | Hash (on client-side) | Pluggable: in-memory, Lucene | no | Custom API (python, php, java, c++) | Key-Value or Document (json) | A NHN, yes
Dynamo | ? | Yes | Yes | ? | | Custom API | Key-Value | A Amazon, no
Voldemort | Java | Async | Hash | Pluggable: BerkeleyDB, MySQL | MVCC | Java API | Key-Value (blob/text) | A Linkedin, no
Redis | C | Sync | Hash (on client-side) | In-memory with background snapshots | lock | Custom API (Collection) | Key-Value | C some
Tokyo Tyrant | C | Async | Manual sharding | In-memory or on-disk (hash, b-tree, fixed-size/variable-length record tables) | lock + limited ACID transactions | | Key-Value | C
Scalaris | Erlang | Sync | Range | Only in-memory | lock + limited ACID transactions | Erlang, Java, HTTP | Key-Value (blob) | B OnScale, no
Kai | Erlang | ? | Yes | On-disk Dets file | | Memcached | Key-Value (blob) |
Dynomite | Erlang | Yes | Yes | Pluggable: couch, dets | | Custom ascii, Thrift | Key-Value (blob) | D+ Powerset, no
MemcacheDB | C | Yes | No | BerkeleyDB | | Memcached | Key-Value (blob) | B some
Riak | Erlang | Async | Hash | Pluggable: in-memory, ets, dets, osmos tables (no indices on 2nd key fields) | MVCC | Rest (json-based) | Key-Value & Document |
SimpleDB | ? | Async | No automated sharding | S3 | no | Custom API | Document | B Amazon, no
ThruDB | C++ | Yes | No | Pluggable: BerkeleyDB, Custom, MySQL, S3 | | Thrift | Document | C+ Third rail, unsure
CouchDB | Erlang | Async | No automated sharding | On-disk with append-only B-tree | MVCC | HTTP, json, Custom API (map/reduce views) | Document (json) | A Apache, yes
MongoDB | C++ | Async | Sharding new | On-disk with B-tree | Field-level | HTTP, bson, Custom API (Cursor) | Document (bson) | A 10gen, yes
Neo4J | | | | On-disk linked lists | | Custom API (Graph) | Graph |

Document-oriented vs. RDBMS

 | CouchDB | MongoDB | MySQL
Terminology | Document, Field, Database | Document, Key, Collection |
Data Model | Document-Oriented (JSON) | Document-Oriented (BSON) | Relational
Data Types | Text, numeric, boolean, and list | string, int, double, boolean, date, bytearray, object, array, others |
Large Objects (Files) | Yes (attachments) | Yes (GridFS) | no???
Replication | Master-master (with developer supplied conflict resolution) | Master-slave | Master-slave
Object (row) Storage | One large repository | Collection based | Table based
Query Method | Map/reduce of javascript functions to lazily build an index per query | Dynamic; object-based query language | Dynamic; SQL
Secondary Indexes | Yes | Yes | Yes
Atomicity | Single document | Single document | Yes – advanced
Interface | REST | Native drivers | Native drivers
Server-side batch data manipulation | Yes, via javascript (through map/reduce views) | Yes, via javascript | Yes (SQL)
Written in | Erlang | C++ | C
Concurrency Control | MVCC | Update in Place | Update in Place

Thank you.

Appendix: What is Coord?

Architectural Comparison

• dust: a distributed file system based on DHT

• coord spaces: a resource sharable store system based on SBA

• coord mapreduce: a simplified large-scale data processing framework

• warp: a scalable remote/parallel execution system

• graph: a large-scale distributed graph search system

Appendix: Coord Internals

A space-based architecture built on distributed hash tables

• SBA (Space-based Architecture): processes communicate with others through spaces only

• DHT (Distributed Hash Tables): data identified by hash functions are placed on numerically near nodes

• A computing platform to project a single address space on distributed memories, as if users worked in a single computing environment

[Figure: node 1, node 2, node 3, node n; operations: write, take, read]
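
The write, take, and read operations of a space can be illustrated with a toy in-process model. This is not Coord's API or implementation: the class `Space`, its method signatures, and the modulo placement are assumptions made for the sketch, and a real DHT would place keys on numerically near nodes rather than by a simple modulo.

```python
import hashlib


class Space:
    """Toy space spread over several in-process 'nodes' (plain dicts)."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node(self, key):
        # Hash the key to choose the node that owns it (simplified placement).
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def write(self, key, value):
        """Put a value into the space."""
        self._node(key)[key] = value

    def read(self, key):
        """Return the value and leave it in the space."""
        return self._node(key).get(key)

    def take(self, key):
        """Return the value and remove it from the space."""
        return self._node(key).pop(key, None)


space = Space(node_count=4)
space.write("job:1", "pending")
```

A producer calls `write`, any number of observers can `read`, and exactly one consumer gets the item through `take`, so the processes never need to address each other directly.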
