© 2014 MapR Technologies
The Future of Hadoop: Data Agility
Tomer Shiran
VP Product Management, MapR Technologies
Co-Founder and PMC Member, Apache Drill
June 22, 2014
Data is doubling in size every two years
2011: 1.8 zettabytes
2013: 4.4 zettabytes
2020: 44 zettabytes (estimate)
IDC estimates that in 2020, there will be 44 zettabytes of data in the world
Source: IDC Digital Universe
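The doubling claim can be checked against the IDC figures with a little arithmetic. The sketch below is illustrative only: it assumes steady exponential growth, and the function names are made up for this example.

```python
import math

def project(volume_zb: float, years: float, doubling_years: float = 2.0) -> float:
    """Project a data volume forward, assuming it doubles every `doubling_years`."""
    return volume_zb * 2 ** (years / doubling_years)

def implied_doubling_years(start_zb: float, end_zb: float, years: float) -> float:
    """Doubling period implied by growing from start_zb to end_zb over `years`."""
    return years * math.log(2) / math.log(end_zb / start_zb)

# Figures from the slide: 4.4 ZB in 2013 and an estimated 44 ZB in 2020.
projected_2020 = project(4.4, years=2020 - 2013)
doubling = implied_doubling_years(4.4, 44.0, years=2020 - 2013)

print(f"2020 volume at a 2-year doubling period: {projected_2020:.1f} ZB")
print(f"Doubling period implied by the IDC estimate: {doubling:.2f} years")
```

Growing tenfold over seven years works out to a doubling period of slightly more than two years, which is consistent with the headline claim.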
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: Total Data Stored, 1980 to 2020; series: Structured Data, Unstructured Data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Unstructured Data is Ubiquitous
Social Media
Messages
Audio
Sensors
Mobile Data
Clickstream
Hadoop Adoption is Exploding
[Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]
The MapR Distribution for Hadoop
Best Product
Exponential Growth
3X bookings Q1 ‘13 – Q1 ‘14
80% of accounts expand 3X
90% software licenses
< 1% lifetime churn
> $1B in incremental revenue generated by 1 customer
500+ Customers
Big Data: Riding the Wave with Hadoop
The Big Data Platform of Choice
360° Customer View
5 PB CUSTOMER DATA
1.2B PEOPLE
Largest Biometric Database in the World
The Future of Hadoop: Data Agility
Distance to Data
• MapReduce: Business (analysts, developers) → “Plumbing” development → Data
• Hive and other SQL-on-Hadoop: Business (analysts, developers) → Modeling and transformations → Data
Existing approaches require a middleman (IT)
Real-World Data Modeling and Transformations
“We just can’t continue to manage data the ‘old way’ by throwing more DBAs at the problem and waiting for data to be accessible.” – Fortune 100 CIO
“Our data and business needs are constantly changing. Traditional data management processes simply don’t work in this new world.” – Large Web 2.0 Hadoop user
“If source data is not easy to access, self-service BI won’t happen.” – TDWI
Distance to Data
• MapReduce: Business (analysts, developers) → “Plumbing” development → Data
• Hive and other SQL-on-Hadoop: Business (analysts, developers) → Modeling and transformations → Data
• Data Agility: Business (analysts, developers) → Data
Existing approaches require a middleman (IT)
Why Improve Distance to Data?
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
Reduce the burden on IT
Improve time to value
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems
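Evolution Towards Self-Service Data Exploration
• Traditional BI w/ RDBMS: Data Modeling and Transformation is IT-driven; Data Visualization is IT-driven
• Self-Service BI w/ RDBMS: Data Modeling and Transformation is IT-driven; Data Visualization is self-service
• SQL-on-Hadoop: Data Modeling and Transformation is IT-driven; Data Visualization is self-service
• Self-Service Data Exploration (zero-day analytics): Data Modeling and Transformation is not needed; Data Visualization is self-service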
(1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
(2) Drill’s Data Model is Flexible
[Diagram: data sources placed on two flexibility axes, Fixed schema to Schema-less and Flat to Complex; sources shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
(3) Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
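To make "schema on the fly" concrete, here is a minimal Python sketch that infers field names and types from self-describing records as they are read. It is not how Drill implements schema discovery. The `discover_schema` helper is hypothetical, and the two sample records from the earlier slide are rewritten as valid JSON.

```python
import json

def discover_schema(records):
    """Map each top-level field to the set of Python type names seen for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

# The two sample records from the slides, written as valid JSON lines.
lines = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]
records = [json.loads(line) for line in lines]
schema = discover_schema(records)

for field in sorted(schema):
    print(field, sorted(schema[field]))
```

No schema is declared up front: the second record adds a `preschool` field that the first lacks, and the discovered schema simply grows to include it.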
Quick Tour
Self-Service Data Exploration with Apache Drill
Zero to Results in 2 Minutes (3 Commands)
Install:
$ tar xzf apache-drill.tar.gz
Launch shell (embedded mode):
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local
Query:
0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;
Results:
+------------+----------------+
| incidents  | category       |
+------------+----------------+
| 8372       | LARCENY/THEFT  |
| 4247       | OTHER OFFENSES |
| 3765       | NON-CRIMINAL   |
| 2502       | ASSAULT        |
...
35 rows selected (0.847 seconds)
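For readers who want to see what the query computes, the following Python sketch performs the same count-per-category aggregation over a header-less CSV, addressing columns by position as `columns[1]` does. The sample rows are invented for illustration and do not come from the SFPD file.

```python
import csv
import io
from collections import Counter

def count_by_column(csv_text: str, index: int):
    """Count rows per value of the column at `index`, most frequent first."""
    rows = csv.reader(io.StringIO(csv_text))
    counts = Counter(row[index] for row in rows if len(row) > index)
    return counts.most_common()

# Made-up sample rows; the real SFPD file has more columns.
sample = """\
1001,LARCENY/THEFT,Monday
1002,ASSAULT,Monday
1003,LARCENY/THEFT,Tuesday
1004,NON-CRIMINAL,Tuesday
1005,LARCENY/THEFT,Wednesday
"""
result = count_by_column(sample, 1)

for category, incidents in result:
    print(incidents, category)
```

As in the Drill query, nothing about the file is declared in advance; the column is picked by its position at read time.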
Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
• dfs1: a storage engine instance (DFS, HBase, Hive Metastore/HCatalog)
• logs: a workspace (sub-directory, Hive database)
• `AppServerLogs/2014/Jan/p001.parquet`: a table (pathnames, HBase table, Hive table)
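The three-part naming in the FROM clause can be illustrated with a small parser. This is a sketch under assumptions, not Drill's actual name resolution: the `split_table_ref` helper and its regular expression are hypothetical, and they only handle the backtick-quoted forms shown on these slides.

```python
import re

# storage engine, optional workspace, then a backtick-quoted table name
TABLE_REF = re.compile(
    r"^(?P<storage>\w+)\.(?:(?P<workspace>\w+)\.)?`(?P<table>[^`]+)`$"
)

def split_table_ref(ref: str):
    """Split a reference into (storage engine instance, workspace, table)."""
    match = TABLE_REF.match(ref)
    if match is None:
        raise ValueError(f"unrecognized table reference: {ref!r}")
    return match.group("storage"), match.group("workspace"), match.group("table")

parts = split_table_ref("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(parts)

# The workspace is optional, as in the quick-tour query.
no_workspace = split_table_ref("dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`")
print(no_workspace)
```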
Query Directory Trees
# Query file: How many errors per level in Jan 2014?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;
# Query directory sub-tree: How many errors per level?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;
# Query some partitions: How many errors per level by month from 2012?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
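The idea of treating directory names as queryable values can be sketched in Python. The example below walks a hypothetical `<root>/<year>/<month>/` tree and filters and groups on the directory components. The layout, file names, and `count_by_dirs` helper are assumptions for illustration, and the sketch counts from zero relative to the root, which may not match the `dirs[]` numbering in the queries above.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_by_dirs(root: Path, min_year: int):
    """Count files per (year, month) directory pair under root, from min_year on."""
    counts = Counter()
    for path in root.rglob("*.log"):
        dirs = path.relative_to(root).parts[:-1]  # directory components only
        if len(dirs) >= 2 and dirs[0].isdigit() and int(dirs[0]) >= min_year:
            counts[(dirs[0], dirs[1])] += 1
    return counts

# Build a tiny hypothetical tree: <root>/<year>/<month>/<file>.log
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for year, month, name in [("2011", "Dec", "a"), ("2012", "Jan", "b"),
                              ("2012", "Jan", "c"), ("2014", "Jan", "d")]:
        folder = root / year / month
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{name}.log").write_text("error\n")
    counts = count_by_dirs(root, min_year=2012)

print(dict(counts))
```

The filter on the year directory prunes whole sub-trees, which is the same effect the WHERE clause on a directory column has in the partition query.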
Works with HBase and Embedded Blobs
# Query an HBase table directly (no schemas)
SELECT cf1.month, cf1.year
FROM hbase.table1;
# Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
)
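The embedded-blob pattern amounts to decoding bytes into a nested value and then addressing its fields. Here is a minimal Python sketch of that step, using an in-memory dict as a stand-in for the HBase table. The row keys and profile contents are invented, and counting the elements of `children` is only one possible reading of `count(profile.children)`.

```python
import json

# Hypothetical stand-in for an HBase table:
# row key -> {column family -> {column qualifier -> raw bytes}}
users = {
    b"u1": {"cf1": {"profileBlob": b'{"name": "Michael", "children": ["Ann", "Bo"]}'}},
    b"u2": {"cf1": {"profileBlob": b'{"name": "Jennifer", "children": []}'}},
}

def decode_profiles(table):
    """Decode each cf1.profileBlob cell from UTF-8 JSON bytes into a dict."""
    for row in table.values():
        yield json.loads(row["cf1"]["profileBlob"].decode("utf-8"))

summary = [(p["name"], len(p["children"])) for p in decode_profiles(users)]
print(summary)
```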
Combine Data Sources on the Fly
# Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND logs.errorLevel > 5;
# Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
SELECT users.name, count(*) AS tweetCount
FROM hive.social.tweets tweets, hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;
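The first join above can be mirrored in plain Python to show what it returns. This is an illustrative sketch only: the records are made up, the `join_errors_with_users` helper is hypothetical, and it says nothing about how Drill plans or executes joins.

```python
def join_errors_with_users(logs, users, min_level):
    """Return distinct (name, work email) pairs for users tied to severe log entries."""
    by_id = {user["id"]: user for user in users}
    seen = set()
    for entry in logs:
        user = by_id.get(entry["uid"])
        if user is not None and entry["errorLevel"] > min_level:
            seen.add((user["name"], user["emails"]["work"]))
    return sorted(seen)

# Made-up sample data standing in for the log directory and profiles.json.
logs = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 1, "errorLevel": 9},
    {"uid": 2, "errorLevel": 3},
    {"uid": 3, "errorLevel": 6},  # no matching profile, so it is dropped
]
users = [
    {"id": 1, "name": "Michael", "emails": {"work": "michael@example.com"}},
    {"id": 2, "name": "Jennifer", "emails": {"work": "jennifer@example.com"}},
]
matches = join_errors_with_users(logs, users, min_level=5)
print(matches)
```

The nested `emails.work` field is read straight from the profile record, which corresponds to the `users.emails.work` path in the SQL.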
Summary
• Enable rapid data exploration and application development while reducing the burden on IT
• Apache Drill beta coming soon
  – Email [email protected]
• Get involved
  – Download and play: http://incubator.apache.org/drill/
  – Ask questions: [email protected]
  – Contribute: http://github.com/apache/incubator-drill/
Thank You
@mapr maprtech
Tomer Shiran, VP Product Management
MapRTechnologies
maprtech
mapr-technologies