Analyzing Twitter Data with Hadoop
Open Analytics Summit, March 2013
Joey Echeverria | Principal Solutions Architect | @fwiffo
©2012 Cloudera, Inc.
About Joey
• Principal Solutions Architect
• 18 months
• 4+ years
• Local
BUILDING A BIG DATA SOLUTION
Big Data
• Big
  • Larger volume than you’ve handled before
  • No litmus test
  • High value, underutilized
• Data
  • Structured
  • Unstructured
  • Semi-structured
• Hadoop
  • Distributed file system
  • Distributed, batch computation
Data Management Systems
[Diagram: Data Source -> Data Ingestion -> Data Storage -> Data Processing]
Relational Data Management Systems
[Diagram: Data Source -> ETL -> RDBMS -> Reporting]
A Canonical Hadoop Architecture
[Diagram: Data Source -> Flume -> HDFS -> Hive (Impala)]
AN EXAMPLE USE CASE
Analyzing Twitter
• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
  • Tweets
  • Followers
  • Retweets
    • Similar to e-mail forwarding
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
HOW DO WE ANSWER THESE QUESTIONS?
Techniques
• SQL
  • Filtering
  • Aggregation
  • Sorting
• Complex data
  • Deeply nested
  • Variable schema
Architecture
[Diagram: Custom Flume Source -> Flume -> Sink to HDFS -> HDFS -> Hive (JSON SerDe parses data). Oozie adds partitions hourly.]
TWITTER SOURCE
Flume
• Streaming data flow
• Sources
  • Push or pull
• Sinks
• Event based
Pulling Data From Twitter
• Custom source, using twitter4j
• Sources process data as discrete events
Loading Data Into HDFS
• HDFS Sink comes stock with Flume
• Easily separate files by creation time
  • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
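To make the path pattern above concrete, here is a minimal sketch of how the four escapes on the slide (%Y, %m, %d, %H) could map an event timestamp to an hourly directory. It is not Flume's implementation: the class name HdfsBucketPath and the method expand are hypothetical, only these four escapes are handled, and the time zone is passed in explicitly (UTC in the example), whereas a real sink's zone depends on its configuration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, not part of Flume: expands %Y, %m, %d and %H only.
final class HdfsBucketPath {

    static String expand(String pattern, long epochMillis, ZoneId zone) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(zone);
        return pattern
            .replace("%Y", String.format(Locale.ROOT, "%04d", t.getYear()))
            .replace("%m", String.format(Locale.ROOT, "%02d", t.getMonthValue()))
            .replace("%d", String.format(Locale.ROOT, "%02d", t.getDayOfMonth()))
            .replace("%H", String.format(Locale.ROOT, "%02d", t.getHour()));
    }

    public static void main(String[] args) {
        String pattern = "hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/";
        // 1363359600000 ms is 2013-03-15T15:00:00Z.
        System.out.println(expand(pattern, 1363359600000L, ZoneOffset.UTC));
    }
}
```

Every tweet created in the same hour lands in the same directory, which is what later lets each hour be registered as one Hive partition.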
Flume Source
public class TwitterSource extends AbstractSource
    implements EventDrivenSource, Configurable {
  ...
  // The initialization method for the Source. The context contains all
  // the Flume configuration info
  @Override
  public void configure(Context context) {
    ...
  }
  ...
  // Start processing events. Uses the Twitter Streaming API to sample
  // Twitter, and process tweets.
  @Override
  public void start() {
    ...
  }
  ...
  // Stops Source's event processing and shuts down the Twitter stream.
  @Override
  public void stop() {
    ...
  }
}
Twitter API
• Callback mechanism for catching new tweets
/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
    new ConfigurationBuilder().setJSONStoreEnabled(true).build())
    .getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
  // The onStatus method is executed every time a new tweet comes in.
  public void onStatus(Status status) {
    ...
  }
};
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);
JSON Data
• JSON data is processed as an event and written to HDFS
public void onStatus(Status status) {
  // The EventBuilder is used to build an event using the headers and
  // the raw JSON of a tweet
  headers.put("timestamp", String.valueOf(
      status.getCreatedAt().getTime()));
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);
  channel.processEvent(event);
}
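The callback above reduces each tweet to two things: a "timestamp" header holding epoch milliseconds as a string, and a body holding the raw JSON as bytes. The sketch below models only that shape with plain Java, without Flume or twitter4j on the classpath. SimpleEvent and fromTweet are hypothetical names, not Flume types. The one deliberate difference from the slide is the explicit UTF-8 charset: a bare getBytes() uses the platform default, which can vary between hosts.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class TweetEventSketch {

    /** Hypothetical stand-in for a Flume event: string headers plus a byte body. */
    static final class SimpleEvent {
        final Map<String, String> headers;
        final byte[] body;

        SimpleEvent(Map<String, String> headers, byte[] body) {
            this.headers = Collections.unmodifiableMap(new HashMap<>(headers));
            this.body = body.clone();
        }
    }

    /** Wraps one tweet's raw JSON and stamps it with its creation time. */
    static SimpleEvent fromTweet(String rawJson, long createdAtMillis) {
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", String.valueOf(createdAtMillis));
        // Explicit UTF-8 so non-ASCII tweet text encodes the same on every host.
        return new SimpleEvent(headers, rawJson.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        SimpleEvent e = fromTweet("{\"text\":\"#bigdata\"}", 1363359600000L);
        System.out.println(e.headers + " " + e.body.length + " bytes");
    }
}
```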
FLUME DEMO
HIVE
What is Hive?
• Created at Facebook
• HiveQL
  • SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client
Hive Details
• Schema on read
• Scalar types (int, float, double, boolean, string)
• Complex types (struct, map, array)
• Metastore contains table definitions
  • Stored in a relational database
  • Similar to catalog tables in other DBs
Complex Data
SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
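The query works in two stages, which can be easier to follow in procedural form. The sketch below restates the same logic in plain Java over an in-memory list. The inner step keeps the highest retweet count seen for each (screen name, text) pair, since the same retweeted tweet appears many times with a growing counter. The outer step then sums those maxima and counts tweets per user. Row, Total, and top are hypothetical names for illustration. This is not how Hive executes the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RetweetRanking {

    /** Hypothetical flattened row: retweeted author, tweet text, count seen. */
    static final class Row {
        final String screenName;
        final String text;
        final int retweetCount;

        Row(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    /** One output row of the outer query. */
    static final class Total {
        final String screenName;
        final long totalRetweets;
        final long tweetCount;

        Total(String screenName, long totalRetweets, long tweetCount) {
            this.screenName = screenName;
            this.totalRetweets = totalRetweets;
            this.tweetCount = tweetCount;
        }
    }

    static List<Total> top(List<Row> rows, int limit) {
        // Inner query: max(retweet_count) per (screen_name, text).
        Map<List<String>, Integer> maxPerTweet = new HashMap<>();
        for (Row r : rows) {
            maxPerTweet.merge(Arrays.asList(r.screenName, r.text), r.retweetCount, Integer::max);
        }
        // Outer query: sum(retweets) and count(*) per screen_name.
        Map<String, long[]> perUser = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            long[] acc = perUser.computeIfAbsent(e.getKey().get(0), k -> new long[2]);
            acc[0] += e.getValue();
            acc[1] += 1;
        }
        List<Total> totals = new ArrayList<>();
        for (Map.Entry<String, long[]> e : perUser.entrySet()) {
            totals.add(new Total(e.getKey(), e.getValue()[0], e.getValue()[1]));
        }
        // ORDER BY total_retweets DESC, then LIMIT.
        totals.sort(Comparator.comparingLong((Total t) -> t.totalRetweets).reversed());
        return new ArrayList<>(totals.subList(0, Math.min(limit, totals.size())));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("alice", "a", 5), new Row("alice", "a", 9),
            new Row("alice", "b", 3), new Row("bob", "c", 20));
        for (Total t : top(rows, 10)) {
            System.out.println(t.screenName + " " + t.totalRetweets + " " + t.tweetCount);
        }
    }
}
```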
JSON INTERLUDE
What is JSON?
• Complex, semi-structured data
• Based on JavaScript’s data syntax
• Rich, nested data types:
  • number
  • string
  • array
  • object
  • true, false
  • null
What is JSON?
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        { "text": "Crowdsourcing", "indices": [0, 14] },
        { "text": "bigdata", "indices": [129, 137] }
      ],
      "user_mentions": []
    }
  }
}
Hive Serializers and Deserializers
• Instructs Hive on how to interpret data
• JSONSerDe
HIVE DEMO
IT’S A TRAP
Not a Database
                    | RDBMS                  | Hive
Language            | Generally >= SQL-92    | Subset of SQL-92 plus Hive specific extensions
Update Capabilities | INSERT, UPDATE, DELETE | INSERT OVERWRITE; no UPDATE, DELETE
Transactions        | Yes                    | No
Latency             | Sub-second             | Minutes
Indexes             | Yes                    | Yes
Data size           | Terabytes              | Petabytes
IMPALA ASIDE
Cloudera Impala
Real-Time Query for Data Stored in Hadoop.
• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ
• FLEXIBLE: Supports multiple storage engines & file formats
Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• SPEED TO INSIGHT
  • Real-time queries run directly on source data
  • No ETL delays
  • No jumping between data silos
• COST SAVINGS
  • No double storage with EDW/RDBMS
  • Unlock analysis on more data
  • No need to create and maintain complex ETL between systems
  • No need to preplan schemas
• FULL FIDELITY ANALYSIS
  • All data available for interactive queries
  • No loss of fidelity from fixed data schemas
• DISCOVERABILITY
  • Single metadata store from origination through analysis
  • No need to hunt through multiple data silos
Cloudera Impala Details
[Diagram: a SQL App connects over ODBC to a set of identical nodes. Each node runs a Query Planner, Query Coordinator, and Query Exec Engine next to an HDFS DN and HBase. Shared services: State Store, HDFS NN, Hive Metastore, YARN.]
• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP Distributed
• Local Direct Reads
• Low-latency scheduler and cache (low-impact failures)
OOZIE AUTOMATION
Oozie: Everything in its Right Place
Oozie for Partition Management
• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
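As an illustration of what "add a partition" could amount to each hour, the sketch below builds a HiveQL ALTER TABLE ... ADD PARTITION statement that points one hour's partition at the matching Flume directory. The table name tweets and the /user/flume/tweets/ layout come from earlier slides. The partition column name datehour, its yyyyMMddHH encoding, the use of UTC, and the class and method names are all assumptions made for this example. The deck does not show the actual Oozie workflow.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper: generates the statement an hourly job might run.
final class HourlyPartition {

    static String addPartitionStatement(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Assumed partition key: the hour encoded as yyyyMMddHH.
        String dateHour = String.format(Locale.ROOT, "%04d%02d%02d%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour());
        // Directory that the Flume HDFS sink filled for that hour.
        String location = String.format(Locale.ROOT, "/user/flume/tweets/%04d/%02d/%02d/%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour());
        return "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = " + dateHour
            + ") LOCATION '" + location + "'";
    }

    public static void main(String[] args) {
        // 1363359600000 ms is 2013-03-15T15:00:00Z.
        System.out.println(addPartitionStatement(1363359600000L));
    }
}
```

Because the statement only registers a directory that already exists, no data is copied. IF NOT EXISTS keeps a rerun of the same hour harmless.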
OOZIE DEMO
PUTTING IT ALL TOGETHER
Complete Architecture
[Diagram: Custom Flume Source -> Flume -> Sink to HDFS -> HDFS -> Hive (JSON SerDe parses data). Oozie adds partitions hourly.]
MORE DEMOS
What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
  • https://github.com/cloudera/cdh-twitter-example
My personal preference
• Cloudera Manager
  • https://ccp.cloudera.com/display/SUPPORT/Downloads
• Free up to 50 nodes
Shout Out
• Jon Natkins
• @nattybnatkins
• Blog posts
  • http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
  • http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
  • http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/