Building Data Applications with Hadoop
Tom White @tom_e_white London Java Community #ljcjug 29 August 2013
What is Hadoop?
HDFS and MapReduce
A Hadoop Stack
Glossary
• Apache Avro – cross-language data serialization library
• Apache HCatalog – metadata storage system (part of Hive)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Parquet – column-oriented storage format for nested data
• Impala – interactive SQL on Hadoop
Hadoop Pain Points*
* Not exhaustive
Choosing a File Format
• Storage formats: delimited text, JSON, Sequence File, Avro File, RCFile, Parquet, …
• Compression codecs: none, gzip, snappy, lzo, bzip2
• Which format/codec combination is best?
Defining a Data Model
8
Schema on read vs. schema on write
What is the user ID field called?
• uid
• userId
• userid
• user_id
• user_Id
• “Scaling Big Data Mining Infrastructure: The Twitter Experience” by Lin and Ryaboy
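The deck leaves the naming question open; one mitigation worth noting (standard Avro schema resolution, not something these slides show) is field aliases, which let a reader schema accept data written under an older field name:

```json
{ "name": "user_id", "type": "long",
  "aliases": ["uid", "userId", "userid", "user_Id"] }
```

With a reader schema like this, files written when the field was still called `uid` continue to resolve against the new name.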
Defining a File Layout
• Which is best?
• /data/clickstream/20120101
• /data/clickstream/date=20120101
• /data/clickstream/2012/01/01
• /data/clickstream/year=2012/month=01/day=01
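A point the slides don't spell out: the `name=value` layouts above let Hive and HCatalog map directory names straight onto partition columns. Purely as an illustration (the helper below is hypothetical, not part of CDK), such a path can be derived from an event timestamp:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPath {
    // Builds a Hive-style partition path such as
    // /data/clickstream/year=2012/month=01/day=01 from an epoch-millis timestamp.
    static String forTimestamp(String dataset, long timestampMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("'year='yyyy/'month='MM/'day='dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "/data/" + dataset + "/" + fmt.format(new Date(timestampMillis));
    }

    public static void main(String[] args) {
        System.out.println(forTimestamp("clickstream", 1325376000000L));
    }
}
```

Pinning the timezone matters here: two writers in different timezones would otherwise place the same event in different partitions.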
A Pattern: Hadoop is Flexible
… but also low-level and complex
Some Best Practices
• Use Avro Schemas for the data model
• Use Avro Data Files for row-oriented data
• Use Parquet for column-oriented data
• Use a Hive/HCatalog compatible file layout:
• /data/<dataset>/partition-1=<x>/partition-2=<y>
• Use a library like Crunch or Cascading for batch analysis
• Use Impala for interactive ad hoc analysis
The Cloudera Development Kit Codifies Best Practice as APIs, Tools, Docs and Examples
CDK
• A client-side library for writing Hadoop Data Applications
• First release was in April
• 0.6.0 released earlier this month
• Open source, Apache 2 license
• Modular
• Data module (HDFS, Flume, Crunch, HCatalog)
• Morphlines transformation module
• Maven plugin
A typical system (zoom 100:1)
A typical system (zoom 10:1)
A typical system (zoom 5:1)
Example
1. The Events Schema
{
  "name": "StandardEvent",
  "namespace": "com.cloudera.cdk.data.event",
  "type": "record",
  "fields": [
    { "name": "event_initiator", "type": "string" },
    { "name": "event_name", "type": "string" },
    { "name": "user_id", "type": "long" },
    { "name": "session_id", "type": "string" },
    { "name": "ip", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
2. Define the Events Dataset
$ mvn cdk:create-dataset \
    -Dcdk.rootDirectory=hdfs://namenode/data \
    -Dcdk.datasetName=events \
    -Dcdk.avroSchemaFile=standard_event.avsc
(2. Or in Code)
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .rootDirectory(new URI("hdfs://namenode/data")).get();
Schema schema = new Schema.Parser().parse(
    Resources.getResource("standard_event.avsc").openStream());
DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schema(schema).get();
Dataset events = repo.create("events", descriptor);
3. Write Events
public class LoggingServlet extends HttpServlet {

  private Logger logger = Logger.getLogger(LoggingServlet.class); // A

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    String userId = request.getParameter("user_id"); // not on the slide; assumed source of userId
    StandardEvent event = StandardEvent.newBuilder() // B
        .setEventInitiator("server_user")
        .setEventName("web:message")
        .setUserId(Long.parseLong(userId))
        .setSessionId(request.getSession(true).getId())
        .setIp(request.getRemoteAddr())
        .setTimestamp(System.currentTimeMillis())
        .build();
    logger.info(event); // C
  }
}
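The servlet only calls `logger.info(event)`; delivery into HDFS is handled by Flume. The deck doesn't show the wiring, but a log4j configuration along these lines (logger name, hostname, and port are placeholders to adapt) routes those events to a Flume agent via Flume's `Log4jAppender`:

```
# Send the servlet's events to a local Flume agent
log4j.logger.LoggingServlet=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41415
```

The agent then batches the Avro events and writes them into the dataset's directory tree, which is what produces the `FlumeData.*` files shown on the next slide.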
The resulting file layout
/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /year=2013
      /month=08
        /day=27
          /hour=15
            /FlumeData.1375378320979
            /FlumeData.1375378320980
4. Generate Derived Sessions
<build>
  <plugins>
    <plugin>
      <groupId>com.cloudera.cdk</groupId>
      <artifactId>cdk-maven-plugin</artifactId>
      <configuration>
        <toolClass>com.cloudera.cdk.examples.demo.CreateSessions</toolClass>
      </configuration>
    </plugin>
  </plugins>
</build>

$ mvn cdk:create-dataset -Dcdk.datasetName=sessions ...
$ mvn cdk:run-tool
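`CreateSessions` itself is a Crunch pipeline (its source is in the cdk-examples repository). To show just the core idea, here is an in-memory sketch of the sessionization step, assuming (my assumption, not stated in the deck) that a session's duration is the span between its first and last event timestamp, keyed by `session_id`:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Sessionize {
    // Takes (session_id, timestamp-millis) pairs and returns per-session
    // durations, where duration = max(timestamp) - min(timestamp).
    static Map<String, Long> durations(List<String[]> events) {
        Map<String, long[]> bounds = new HashMap<>();  // session_id -> {min, max}
        for (String[] e : events) {
            long ts = Long.parseLong(e[1]);
            long[] b = bounds.get(e[0]);
            if (b == null) {
                bounds.put(e[0], new long[] {ts, ts});
            } else {
                b[0] = Math.min(b[0], ts);
                b[1] = Math.max(b[1], ts);
            }
        }
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, long[]> e : bounds.entrySet()) {
            result.put(e.getKey(), e.getValue()[1] - e.getValue()[0]);
        }
        return result;
    }
}
```

The real job expresses the same group-by-key-and-aggregate shape over the events dataset, but as a distributed Crunch `PCollection` pipeline rather than an in-memory map.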
5. Run ad hoc queries
$ impala-shell -q 'SELECT AVG(duration) FROM sessions'
Query: select AVG(duration) FROM sessions
Query finished, fetching results ...
+---------------+
| avg(duration) |
+---------------+
| 5475.5        |
+---------------+
Returned 1 row(s) in 0.24s
Extensions
• Use Oozie to generate sessions partitions every hour
• Tools to evolve the data model compatibly
• Read datasets using
• Impala JDBC
• CDK dataset API
• Other Hadoop tools (Pig, Hive)
• https://github.com/cloudera/cdk-examples/tree/master/demo
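For the hourly Oozie idea in the first bullet, a coordinator definition shaped like the following would trigger the sessions job once an hour (the app name, workflow path, and dates here are placeholders, not taken from the deck):

```
<coordinator-app name="create-sessions" frequency="${coord:hours(1)}"
                 start="2013-08-27T00:00Z" end="2014-08-27T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/apps/create-sessions</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Each materialized action would run a workflow invoking `CreateSessions`, producing one new sessions partition per hour.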
Questions?
https://github.com/cloudera/cdk