Download - Analyzing Big Data at Twitter Presentation 1
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
1/75
Web 2.0 Expo, 2010
Kevin Weil @kevinweil
Analyzing Big Data at
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
2/75
Three Challenges Collecting Data
Large-Scale Storage and Analysis
Rapid Learning over Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
3/75
My Background Studied Mathematics and Physics at Harvard,
Physics at Stanford
Tropos Networks (city-wide wireless): GBs of data
Cooliris (web media): Hadoop for analytics, TBs of data
Twitter: Hadoop, Pig, machine learning, visualization,
social graph analysis, PBs of data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
4/75
Three Challenges Collecting Data
Large-Scale Storage and Analysis
Rapid Learning over Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
5/75
Data, Data Everywhere You guys generate a lot of data
Anybody want to guess?
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
6/75
Data, Data Everywhere You guys generate a lot of data
Anybody want to guess?
12 TB/day (4+ PB/yr)
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
7/75
Data, Data Everywhere You guys generate a lot of data
Anybody want to guess?
12 TB/day (4+ PB/yr) 20,000 CDs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
8/75
Data, Data Everywhere You guys generate a lot of data
Anybody want to guess?
12 TB/day (4+ PB/yr) 20,000 CDs 10 million floppy disks
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
9/75
Data, Data Everywhere You guys generate a lot of data
Anybody want to guess?
12 TB/day (4+ PB/yr) 20,000 CDs 10 million floppy disks 450 GB while I give this talk
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
10/75
Syslog? Started with syslog-ng
As our volume grew, it didnt scale
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
11/75
Syslog? Started with syslog-ng
As our volume grew, it didnt scale
Resources
overwhelmed
Lost data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
12/75
Scribe Surprise! FB had same problem, built and open-
sourced Scribe
Log collection framework over Thrift
You scribe log lines, with categories
It does the rest
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
13/75
Scribe Runs locally; reliable in
network outage
FE FE FE
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
14/75
Scribe Runs locally; reliable in
network outage
Nodes only know
downstream writer;
hierarchical, scalable
FE FE FE
Agg Agg
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
15/75
Scribe Runs locally; reliable in
network outage
Nodes only know
downstream writer;
hierarchical, scalable
Pluggable outputs
FE FE FE
Agg Agg
HDFSFile
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
16/75
Scribe at Twitter Solved our problem, opened new vistas
Currently 40 different categories logged from
javascript, Ruby, Scala, Java, etc
We improved logging, monitoring, behavior during
failure conditions, writes to HDFS, etc
Continuing to work with FB to make it better
http://github.com/traviscrawford/scribe
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
17/75
Three Challenges Collecting Data
Large-Scale Storage and Analysis
Rapid Learning over Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
18/75
How Do You Store 12 TB/day?
Single machine?
Whats hard drive write speed?
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
19/75
How Do You Store 12 TB/day?
Single machine?
Whats hard drive write speed?
~80 MB/s
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
20/75
How Do You Store 12 TB/day?
Single machine?
Whats hard drive write speed?
~80 MB/s
42 hours to write 12 TB
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
21/75
How Do You Store 12 TB/day?
Single machine?
Whats hard drive write speed?
~80 MB/s
42 hours to write 12 TB
Uh oh.
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
22/75
Where Do I Put 12TB/day?
Need a cluster of machines
... which adds new layers
of complexity
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
23/75
Hadoop Distributed file system
Automatic replication Fault tolerance Transparently read/write across multiple machines
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
24/75
Hadoop Distributed file system
Automatic replication Fault tolerance Transparently read/write across multiple machines MapReduce-based parallel computation
Key-value based computation interface allowsfor wide applicability
Fault tolerance, again
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
25/75
Hadoop Open source: top-level Apache project
Scalable: Y! has a 4000 node cluster
Powerful: sorted 1TB random integers in 62 seconds
Easy packaging/install: free Cloudera RPMs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
26/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
27/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
28/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
29/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
30/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
31/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
32/75
MapReduce Workflow
Challenge: how many tweets
per user, given tweets table?
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
With 2x machines, runs 2x faster
Inputs
Map
Map
Map
Map
Map
Map
Map
Shuffle/Sort
Reduce
Reduce
Reduce
Outputs
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
33/75
Two Analysis Challenges Compute mutual followings in Twitters interest graph
grep, awk? No way. If data is in MySQL... self join on an n- billion row table? n,000,000,000 x n,000,000,000 = ?
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
34/75
Two Analysis Challenges Compute mutual followings in Twitters interest graph
grep, awk? No way. If data is in MySQL... self join on an n- billion row table? n,000,000,000 x n,000,000,000 = ? I dont know either.
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
35/75
Two Analysis Challenges Large-scale grouping and counting
select count(*) from users? maybe. select count(*) from tweets? uh... Imagine joining these two. And grouping. And sorting.
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
36/75
Back to Hadoop Didnt we have a cluster of machines?
Hadoop makes it easy to distribute the calculation
Purpose-built for parallel calculation
Just a slight mindset adjustment
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
37/75
Back to Hadoop Didnt we have a cluster of machines?
Hadoop makes it easy to distribute the calculation
Purpose-built for parallel calculation Just a slight mindset adjustment
But a fun one!
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
38/75
Analysis at Scale Now were rolling
Count all tweets: 20+ billion, 5 minutes
Parallel network calls to FlockDB to compute
interest graph aggregates
Run PageRank across users and interest graph
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
39/75
But... Analysis typically in Java
Single-input, two-stage data flow is rigid
Projections, filters: custom code Joins lengthy, error-prone
n-stage jobs hard to manage
Data exploration requires compilation
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
40/75
Three Challenges Collecting Data
Large-Scale Storage and Analysis
Rapid Learning over Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
41/75
Pig High level language
Transformations on
sets of records Process data one
step at a time
Easier than SQL?
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
42/75
Why Pig?Because I bet you can read the following script
Change this to your big-idea call-outs...
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
43/75
A Real Pig Script
Just for fun... the same calculation in Java next
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
44/75
No, Seriously.
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
45/75
Pig Makes it Easy 5% of the code
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
46/75
Pig Makes it Easy 5% of the code
5% of the dev time
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
47/75
Pig Makes it Easy 5% of the code
5% of the dev time
Within 20% of the running time
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
48/75
Pig Makes it Easy 5% of the code
5% of the dev time
Within 20% of the running time Readable, reusable
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
49/75
Pig Makes it Easy 5% of the code
5% of the dev time
Within 20% of the running time Readable, reusable
As Pig improves, your calculations run faster
O
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
50/75
One Thing Ive Learned Its easy to answer questions
Its hard to ask the right questions
O Thi I L d
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
51/75
One Thing Ive Learned Its easy to answer questions
Its hard to ask the right questions
Value the system that promotes innovationand iteration
O Thi I L d
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
52/75
One Thing Ive Learned Its easy to answer questions
Its hard to ask the right questions
Value the system that promotes innovationand iteration
More minds contributing = more value from your data
C ti Bi D t
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
53/75
Counting Big Data How many requests per day?
C ti Bi D t
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
54/75
Counting Big Data How many requests per day?
Average latency? 95% latency?
C ti Bi D t
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
55/75
Counting Big Data How many requests per day?
Average latency? 95% latency?
Response code distribution per hour?
C ti Bi D t
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
56/75
Counting Big Data How many requests per day?
Average latency? 95% latency?
Response code distribution per hour?
Twitter searches per day?
C ti Bi D t
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
57/75
Counting Big Data How many requests per day?
Average latency? 95% latency?
Response code distribution per hour?
Twitter searches per day?
Unique users searching, unique queries?
Co nting Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
58/75
Counting Big Data How many requests per day?
Average latency? 95% latency?
Response code distribution per hour?
Twitter searches per day?
Unique users searching, unique queries?
Links tweeted per day? By domain?
Counting Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
59/75
Counting Big Data How many requests per day?
Average latency? 95% latency?
Response code distribution per hour?
Twitter searches per day?
Unique users searching, unique queries?
Links tweeted per day? By domain?
Geographic distribution of all of the above
Correlating Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
60/75
Correlating Big Data
Usage difference for mobile users?
Correlating Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
61/75
Correlating Big Data
Usage difference for mobile users?
... for users on desktop clients?
Correlating Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
62/75
Correlating Big Data
Usage difference for mobile users?
... for users on desktop clients?
... for users of #newtwitter?
Correlating Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
63/75
Correlating Big Data
Usage difference for mobile users?
... for users on desktop clients?
... for users of #newtwitter?
Cohort analyses
Correlating Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
64/75
Correlating Big Data
Usage difference for mobile users?
... for users on desktop clients?
... for users of #newtwitter?
Cohort analyses
What features get users hooked?
Correlating Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
65/75
Correlating Big Data
Usage difference for mobile users?
... for users on desktop clients?
... for users of #newtwitter?
Cohort analyses
What features get users hooked?
What features power Twitter users use often?
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
66/75
Research on Big Data
What can we tell from a users tweets?
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
67/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
68/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
... from the tweets of those they follow?
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
69/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
... from the tweets of those they follow?
What influences retweets? Depth of the retweet tree?
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
70/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
... from the tweets of those they follow?
What influences retweets? Depth of the retweet tree?
Duplicate detection (spam)
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
71/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
... from the tweets of those they follow?
What influences retweets? Depth of the retweet tree?
Duplicate detection (spam)
Language detection (search)
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
72/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
... from the tweets of those they follow?
What influences retweets? Depth of the retweet tree?
Duplicate detection (spam)
Language detection (search)
Machine learning
Research on Big Data
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
73/75
Research on Big Data
What can we tell from a users tweets?
... from the tweets of their followers?
... from the tweets of those they follow?
What influences retweets? Depth of the retweet tree?
Duplicate detection (spam)
Language detection (search)
Machine learning
Natural language processing
Diving Deeper
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
74/75
Diving Deeper HBase and building products from Hadoop
LZO Compression
Protocol Buffers and Hadoop
Our analytics-related open source: hadoop-lzo,
elephant-bird
Moving analytics to realtime
http://github.com/kevinweil/hadoop-lzo
http://github.com/kevinweil/elephant-bird
-
8/3/2019 Analyzing Big Data at Twitter Presentation 1
75/75
Questions?
Follow me at
twitter.com/kevinweil
Change this to your big-idea call-outs...