Download - The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

© 2014 MapR Technologies

The Future of Hadoop: Data Agility

Tomer Shiran

VP Product Management, MapR Technologies

Co-Founder and PMC Member, Apache Drill

June 22, 2014


Data is doubling in size every two years

2011: 1.8 zettabytes

2013: 4.4 zettabytes

2020: 44 zettabytes

IDC estimates that in 2020, there will be 44 zettabytes of data in the world

Source: IDC Digital Universe


Unstructured data will account for more than 80% of the data collected by organizations

[Chart: total data stored, structured data vs. unstructured data, 1980 to 2020]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data


Unstructured Data is Ubiquitous

Social Media

Messages

Audio

Sensors

Mobile Data

Email

Clickstream


Hadoop Adoption is Exploding

Job trends from Indeed.com

[Chart: job trends, Jan ‘06 to Jan ‘14]


The MapR Distribution for Hadoop

Best Product

Exponential Growth

3X bookings Q1 ‘13 – Q1 ‘14

80% of accounts expand 3X

90% software licenses

< 1% lifetime churn

> $1B in incremental revenue generated by 1 customer

500+ Customers

Big Data

Riding the Wave with Hadoop

The Big Data Platform of Choice


360° Customer View

5 PB customer data

Largest Biometric Database in the World

1.2B people


The Future of Hadoop: Data Agility


Distance to Data

Business (analysts, developers)

“Plumbing” development

MapReduce

Modeling and transformations

Hive and other SQL-on-Hadoop

Existing approaches require a middleman (IT)

Data


Real-World Data Modeling and Transformations


“We just can’t continue to manage data the ‘old way’ by throwing more DBAs at the problem and waiting for data to be accessible.” – Fortune 100 CIO

“Our data and business needs are constantly changing. Traditional data management processes simply don’t work in this new world.” – Large Web 2.0 Hadoop user

“If source data is not easy to access, self-service BI won’t happen.” – TDWI


Distance to Data


Business (analysts, developers)

Data Agility

“Plumbing” development

MapReduce

Modeling and transformations

Hive and other SQL-on-Hadoop

Existing approaches require a middleman (IT)

Data


Why Improve Distance to Data?

• Enable rapid data exploration and application development

• IT should provide a valuable service without “getting in the way”

• Can’t add DBAs to keep up with the exponential data growth

• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users

Reduce the burden on IT

Improve time to value


APACHE DRILL

• Pioneering Data Agility for Hadoop

• Apache open source project

• Scale-out execution engine for low-latency queries

• Unified SQL-based API for analytics & operational applications

40+ contributors

150+ years of experience building databases and distributed systems


Evolution Towards Self-Service Data Exploration

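Traditional BI w/ RDBMS
• Data Modeling and Transformation: IT-driven
• Data Visualization: IT-driven

Self-Service BI w/ RDBMS
• Data Modeling and Transformation: IT-driven
• Data Visualization: Self-service

SQL-on-Hadoop
• Data Modeling and Transformation: IT-driven
• Data Visualization: Self-service

Self-Service Data Exploration (zero-day analytics)
• Data Modeling and Transformation: Not needed
• Data Visualization: Self-service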


(1) Self-Describing Data is Ubiquitous

Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)

Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }


(2) Drill’s Data Model is Flexible

[Diagram: data sources placed on two flexibility axes, fixed schema to schema-less and flat to complex: CSV, TSV, Parquet, Avro, JSON, BSON, HBase]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }


(3) Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance

• Fixed schema

• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly

• Fixed schema, evolving schema or schema-less

• Leverage schema in centralized repository or self-describing data

SCHEMA ON WRITE

SCHEMA BEFORE READ

SCHEMA ON THE FLY
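To make "schema on the fly" concrete, here is a minimal Python sketch that derives a schema by reading self-describing records rather than consulting a declared one. It is not how Drill is implemented. The two records are the slide's examples rewritten as valid JSON, and `describe` and `discover_schema` are hypothetical helper names introduced only for illustration.

```python
import json

# The slide's two example records, rewritten as valid JSON (quoting added).
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]


def describe(value):
    """Return the shape of a decoded JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        return {key: describe(child) for key, child in value.items()}
    if isinstance(value, list):
        element_types = sorted({type(item).__name__ for item in value})
        return "array<" + "|".join(element_types) + ">"
    return type(value).__name__


def discover_schema(lines):
    """Build a schema as records are read; no schema is declared up front."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, shape in describe(record).items():
            # A field first seen in a later record simply extends the schema.
            schema.setdefault(field, shape)
    return schema


if __name__ == "__main__":
    for field, shape in discover_schema(RECORDS).items():
        print(field, shape)
```

The second record introduces `preschool` and omits `district`, and the discovered schema ends up containing both without any prior declaration.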


Quick Tour

Self-Service Data Exploration with Apache Drill




Zero to Results in 2 Minutes (3 Commands)

Install

$ tar xzf apache-drill.tar.gz

Launch shell (embedded mode)

$ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query

0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;

Results

+------------+------------+
| incidents | category |
+------------+------------+
| 8372 | LARCENY/THEFT |
| 4247 | OTHER OFFENSES |
| 3765 | NON-CRIMINAL |
| 2502 | ASSAULT |
...
35 rows selected (0.847 seconds)
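For readers unfamiliar with the `columns[1]` notation, the following Python sketch shows roughly what the query computes: each CSV row is treated as an array of columns, and rows are counted per value of the second column. The sample rows are made up for illustration and are not taken from the SFPD file. `incidents_by_category` is a hypothetical name.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real SFPD file is not reproduced here.
SAMPLE_CSV = """\
1001,LARCENY/THEFT,Monday
1002,ASSAULT,Monday
1003,LARCENY/THEFT,Tuesday
1004,NON-CRIMINAL,Tuesday
1005,LARCENY/THEFT,Wednesday
"""


def incidents_by_category(text):
    """Count rows per value of column 1, most frequent first.

    Mirrors GROUP BY columns[1] ORDER BY incidents DESC: no header and
    no declared schema, each row is just a list of strings.
    """
    rows = csv.reader(io.StringIO(text))
    counts = Counter(row[1] for row in rows)
    return counts.most_common()


if __name__ == "__main__":
    for category, incidents in incidents_by_category(SAMPLE_CSV):
        print(incidents, category)
```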


Data Source is in the Query

SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2

A storage engine instance
- DFS
- HBase
- Hive Metastore/HCatalog

A workspace
- Sub-directory
- Hive database

A table
- pathnames
- HBase table
- Hive table
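The three-part naming in the FROM clause can be illustrated with a short Python sketch that splits a reference such as dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` into storage engine instance, workspace, and table. The regular expression and the function name are assumptions made for this example, not Drill's actual parser. The sketch handles only the three-part, backtick-quoted form.

```python
import re

# Assumed pattern: <storage engine instance>.<workspace>.`<table>`
TABLE_REF = re.compile(
    r"^(?P<storage>\w+)\.(?P<workspace>\w+)\.`(?P<table>[^`]+)`$"
)


def split_table_reference(reference):
    """Split a three-part table reference into its named components."""
    match = TABLE_REF.match(reference)
    if match is None:
        raise ValueError("not a three-part table reference: " + reference)
    return match.groupdict()


if __name__ == "__main__":
    parts = split_table_reference(
        "dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`"
    )
    print(parts["storage"])    # storage engine instance
    print(parts["workspace"])  # workspace
    print(parts["table"])      # table (here, a path name)
```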


Query Directory Trees

# Query file: How many errors per level in Jan 2014?

SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;

# Query directory sub-tree: How many errors per level?

SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;

# Query some partitions: How many errors per level by month from 2012?

SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
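The partition query relies on directory names acting as implicit columns. The Python sketch below illustrates the idea under one assumption about indexing: `dirs[0]` is the top-level directory, so for `/AppServerLogs/2014/Jan/...` the year is `dirs[1]` and the month is `dirs[2]`. The file contents are made up. `dirs_for` and `errors_per_level_by_month` are hypothetical names, not Drill functions.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical files: path -> error levels of the records stored in it.
FILES = {
    "/AppServerLogs/2011/Dec/part0001.parquet": [1, 3],
    "/AppServerLogs/2012/Jan/part0001.parquet": [2, 2, 5],
    "/AppServerLogs/2014/Jan/part0001.parquet": [2, 4],
}


def dirs_for(path):
    """Directory components of a file path, without the leading '/'."""
    return list(PurePosixPath(path).parent.parts[1:])


def errors_per_level_by_month(files, min_year):
    """Count records per (errorLevel, month) for years >= min_year."""
    counts = Counter()
    for path, error_levels in files.items():
        dirs = dirs_for(path)
        if int(dirs[1]) < min_year:
            # The path alone rules this file out; its records are never read.
            continue
        for level in error_levels:
            counts[(level, dirs[2])] += 1
    return counts


if __name__ == "__main__":
    for key, count in sorted(errors_per_level_by_month(FILES, 2012).items()):
        print(key, count)
```

Because the filter uses only path components, whole files can be skipped before any record is read, which is what makes directory-based partitioning cheap.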


Works with HBase and Embedded Blobs

# Query an HBase table directly (no schemas)

SELECT cf1.month, cf1.year
FROM hbase.table1;

# Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users

SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
)
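Conceptually, converting an embedded blob means decoding the stored bytes into a nested value at query time. The Python sketch below shows that step on made-up rows: the row keys, names, and `children` values are hypothetical, and `children_per_name` is an illustrative helper rather than anything Drill provides.

```python
import json

# Hypothetical HBase-like rows: row key -> raw bytes in cf1:profileBlob.
ROWS = {
    b"user1": b'{"name": "Michael", "children": ["Ann", "Ben"]}',
    b"user2": b'{"name": "Jennifer", "children": []}',
}


def profiles(rows):
    """Decode each stored blob from UTF-8 JSON bytes into a Python dict."""
    for blob in rows.values():
        yield json.loads(blob.decode("utf-8"))


def children_per_name(rows):
    """Map each profile name to its number of children."""
    return {
        profile["name"]: len(profile.get("children", []))
        for profile in profiles(rows)
    }


if __name__ == "__main__":
    print(children_per_name(ROWS))
```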


Combine Data Sources on the Fly

# Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.

SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND logs.errorLevel > 5;

# Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user

SELECT users.name, count(*) AS tweetCount
FROM hive.social.tweets tweets, hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;


Summary

• Enable rapid data exploration and application development while reducing the burden on IT

• Apache Drill beta coming soon
– Email [email protected]

• Get involved
– Download and play: http://incubator.apache.org/drill/
– Ask questions: [email protected]
– Contribute: http://github.com/apache/incubator-drill/



Thank You

@mapr maprtech

[email protected]

Tomer Shiran, VP Product Management

MapR Technologies

maprtech

mapr-technologies
