4/10/2013
Case Study
LinkedIn and its System, Network and Analytics – Data Storage
Sai Srinivas K (B09016), Sai Sagar J (B09014), Rajeshwari R (B09026) and Ashish K Gupta (B09008)
Distributed Database Systems, Spring 2013, IIT Mandi
Instructor: Dr. Arti Kashyap
DDB, Spring 2013 | IIT Mandi
Abstract
This paper is a case study of LinkedIn, a social networking
website for people in professional occupations, covering its
data storage systems and a few of the Search, Network and
Analytics (SNA) aspects of the site. The SNA team at
LinkedIn hosts the open-source projects built by the group
on a dedicated website. Notable among these projects is
Project Voldemort, a distributed, low-latency key-value
storage system similar in purpose to Amazon.com's Dynamo and
Google's BigTable. We survey the research to date and the
backend systems of the website, which reports more than 200
million registered users in more than 200 countries and
territories.
I. Introduction
LinkedIn Corporation is a social networking website for
professionals in various occupations. The company was
founded by Reid Hoffman, who, along with founding team
members from PayPal and Socialnet.com (Allen Blue, Lee
Hower, Eric Ly, David Eves, Ian McNish, Chris Saccheri,
Jean-Luc Vaillant, Konstantin Guericke, Stephen Beitzel and
Yan Pujante), launched it on May 5, 2003, in Santa Monica,
California [1].
LinkedIn's CEO is Jeff Weiner, previously a Yahoo! Inc.
executive, and founder Reid Hoffman, previously CEO of
LinkedIn, is now Chairman of the Board.
1.1 Features
The site supports professional social networking by letting
each user maintain a list of connections holding the
individual contact details of everyone 'connected' to them.
One can invite anyone, whether a site user or not, to become
a connection. However, if the invitee selects "I don't know"
or "Spam", this counts as a report against the inviter, and
if the inviter receives too many such responses, the account
may be restricted or closed. The list of connections can
then be used in a number of ways:
- A network of contacts is built up from direct
  connections, second-degree connections (connections of
  each of one's connections) and third-degree connections
  (connections of the second-degree connections). This is
  similar to the "Mutual Friends" concept on Facebook, and
  lets a user gain an introduction to someone he/she finds
  interesting.
- Users can upload their resumes, or build and design them
  within their profiles, in order to share their work and
  community experiences.
- The network can be used to find jobs, people and business
  opportunities recommended by someone in one's contact
  network.
- Employers can list jobs and search for potential
  candidates.
- Job seekers can review the profiles of hiring managers
  and discover which of their existing contacts can
  introduce them.
- Users can post their own photos and view photos of others
  to aid in identification.
- Users can follow different companies and get
  notifications about new openings and offers.
- Users can save or bookmark jobs that they would like to
  apply for.
The "gated-access approach" (where contact with any
professional requires either an existing relationship or the
intervention of a contact of theirs) is intended to build
trust among the service's users and is one of the special
features of LinkedIn. The "LinkedIn Answers" feature,
similar to "Yahoo! Answers", allows users to ask questions
for the community to answer. This feature is free, and the
main differences from the latter are that questions are
potentially more business-oriented, and that the identity of
the people asking and answering questions is known. LinkedIn
cites a new 'focus on development of new and more engaging
ways to share and discuss professional topics across
LinkedIn', a recent development which may retire the aging
"LinkedIn Answers" feature.
Other LinkedIn features include LinkedIn Polls as a form of
user research and LinkedIn DirectAds as a form of sponsored
advertising. LinkedIn also allows users to endorse each
other's skills. This feature lets users efficiently provide
commentary on other users' profiles, reinforcing the network
build-up; however, there is no way of flagging anything
other than positive content.
1.1.1 Applications
The Applications Platform allows other online services to be
embedded within a member's profile page: for example, an
Amazon Reading List that lets LinkedIn members display books
they are reading, a connection to TripIt (travel
itineraries), and WordPress and TypePad applications that
let members display their latest blog postings within their
LinkedIn profile. Later on, LinkedIn allowed businesses to
list products and services on company profile pages; it also
permitted LinkedIn members to "recommend" products and
services and write reviews.
1.1.2 Groups
LinkedIn also supports the formation of interest groups (a
feature equally popular on many social networking sites and
blogs). The majority are related to employment, although a
very wide range of topics is covered, mainly around
professional and career issues, and the current focus is on
groups for both academic and corporate alumni. Groups
support a limited form of discussion area, moderated by the
group owners and managers. Since groups offer the ability to
reach a wide audience without easily falling foul of
anti-spam solutions, there is a constant stream of spam
postings, and a range of firms now offer a spamming service
for this very purpose.
Groups also keep their members informed through emails
with updates to the group, including most talked about
discussions within one's professional circles. Groups may be
private, accessible to members only, or open to Internet
users in general to read, though users must join in order to
post messages.
1.1.3 Job listings
LinkedIn allows users to research companies with which
they may be interested in working. When typing the name
of a given company in the search box, statistics about the
company are provided. These may include the location of
the company's headquarters and offices, or a list of present
and former employees, the percentage of the most common
titles/positions held within the company, etc. LinkedIn
launched a feature allowing companies to include an "Apply
with LinkedIn" button on job listing pages, a genuinely
useful development. The plug-in allows potential employees
to apply for positions using their LinkedIn profiles as
resumes, and all applications are saved under a "Saved Jobs"
tab.
II. SNA LinkedIn
The Search, Network, and Analytics (SNA) team at LinkedIn
hosts the group's open-source projects on its data blog.
Notable among these projects is Project Voldemort, a
distributed, low-latency key-value storage system similar in
purpose to Amazon's Dynamo and Google's BigTable. The data
team at LinkedIn works on LinkedIn's information retrieval
systems, the social graph system, data-driven features, and
supporting data infrastructure.
2.1 Project Voldemort
Voldemort is a distributed key-value storage system. It has
the following properties:
- Data is automatically replicated over multiple servers
  (data replication).
- Data is automatically partitioned so each server contains
  only a subset of the total data (data partitioning).
- Server failures are handled transparently, oblivious to
  the users (transparent failures).
- Pluggable serialization is supported to allow rich keys
  and values, including lists and tuples with named fields,
  and to integrate with common serialization frameworks
  like Protocol Buffers, Thrift, Avro and Java
  Serialization.
- Data items are versioned to maximize data integrity in
  failure scenarios without compromising availability of
  the system (versioning).
- Each node is independent of other nodes, with no central
  point of failure or coordination (node independence).
- Single-node performance is good: one can expect 10-20k
  operations per second depending on the machines, the
  network, the disk system, and the data replication
  factor.
- Pluggable data placement strategies are supported, e.g.
  distribution across data centers that are geographically
  far apart (data placement).
2.1.1 Comparison with the Relational Database
Voldemort is not a relational database; it does not attempt
to satisfy arbitrary relations while satisfying ACID
properties. Nor is it an object database that attempts to
transparently map object reference graphs. Nor does it
introduce a new abstraction such as document-orientation.
It is basically just a big, distributed, persistent, fault-
tolerant hash table.
For applications that can use an O/R mapper like
ActiveRecord or Hibernate, this provides horizontal
scalability and much higher availability, but at a great
loss of convenience. For large applications under
internet-type scalability pressure, a system likely consists
of a number of functionally partitioned services or APIs,
which may manage storage resources across multiple data
centers using storage systems that may themselves be
horizontally partitioned.
Voldemort offers a number of advantages:
- It combines in-memory caching with the storage system, so
  a separate caching tier is not required (instead the
  storage system itself is just fast).
- Unlike MySQL replication, both reads and writes scale
  horizontally.
- Data partitioning is transparent and allows for cluster
  expansion without rebalancing all data.
- Data replication and placement are decided by a simple
  API, to accommodate a wide range of application-specific
  strategies.
- The storage layer is completely mockable, so development
  and unit testing can be done against a throw-away
  in-memory storage system without needing a real cluster
  (or even a real storage system) for simple testing.
For applications in this space, arbitrary in-database joins
are already impossible since all the data is not available in
any single database. A typical pattern is to introduce a
caching layer which will require hash table semantics
anyway. It is even used for certain high-scalability storage
problems where simple functional partitioning is not
sufficient. It is still a new system under development which
may have rough edges and probably plenty of uncaught
bugs.
2.1.2 Design
Key-Value Storage
Project Voldemort, created by LinkedIn, is just a simple
key-value data store, since its primary concern is enabling
high performance and availability for users. Both keys and
values can be complex compound objects, including lists or
maps, but nonetheless the only supported queries are
effectively the following:
value = store.get(key);
store.put(key, value);
store.delete(key);
This may not be good enough for all storage problems, as it
implies a variety of trade-offs: no complex query filters,
all joins must be done in code, no foreign-key constraints,
no triggers, etc.
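The three operations above can be sketched with a minimal in-memory store. This is a hypothetical illustration of the interface's semantics, not Voldemort's actual client API:

```python
class KeyValueStore:
    """Minimal in-memory sketch of Voldemort's three-operation
    interface: get, put, delete."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns the value for key, or None if absent.
        return self._data.get(key)

    def put(self, key, value):
        # Each key has at most one value; put overwrites.
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is a no-op here.
        self._data.pop(key, None)
```

Note what is absent: no range scans, no secondary indexes, no joins. Everything beyond primary-key lookup must be handled in application code.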
2.1.3 System Architecture
The representation below [2] is the logical view, in which
each layer implements a simple storage interface (put, get,
and delete). Each layer is responsible for performing one
function, such as TCP/IP network communication,
serialization, version reconciliation, or inter-node
routing. For example, the routing layer is responsible for
taking an operation, say a PUT, and delegating it to all N
storage replicas in parallel, while handling any failures.
[3]
There is flexibility in where the intelligent routing of
data to partitions is done; it can sit in any of those
layers. One could also add a compression layer that
compresses byte values at any level below the serialization
level. Routing can be done on the client side or on the
server side, the latter enabling hardware-load-balanced HTTP
clients.
The representation below [4] is the physical architecture,
with frontend, backend, and Voldemort clusters connected
through load balancers (either hardware devices or
round-robin software load balancers) and "partition-aware
routing", the storage system's internal routing. All the
possible tier architectures are denoted in the diagram.
Fewer hops make a configuration efficient from the latency
perspective, and from the throughput perspective there are
fewer potential bottlenecks, but it requires the routing
intelligence to move up the stack.
Beyond this, the flexibility makes high-performance
configurations possible. Disk access is the single biggest
performance cost in storage; the second is network hops.
Disk access can be avoided by partitioning the data set and
caching as much as possible; eliminating network hops
requires architectural flexibility. In the diagram shown,
one can implement 3-hop, 2-hop, or 1-hop remote services
using different configurations. This enables very high
performance when it is possible to route service calls
directly to the appropriate server.
2.1.3.1 Data partitioning and replication [5]
Data needs to be partitioned across a cluster of servers so
that no single server needs to hold the complete data set.
Even when the data can fit on a single disk, disk access for
small values may be slowed down by seek time so
partitioning would invariably improve cache efficiency by
splitting the data into smaller chunks. The servers in the
cluster are not interchangeable, and requests need to be
routed to a server that holds requested data, not just any
available server at random.
Servers also regularly fail, become overloaded, or are
brought down for maintenance. If there are S servers and
each server is assumed to fail independently with
probability p in a given day, then the probability of losing
at least one server in a day is 1 - (1 - p)^S. Therefore we
cannot store each datum on only one server, or the
probability of data loss will grow with cluster size.
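The failure arithmetic above is easy to check numerically; a small sketch (the values of p and S are purely illustrative):

```python
def p_any_server_fails(p, S):
    """Probability that at least one of S independent servers
    fails in a day, each with per-day failure probability p:
    1 - (1 - p)**S."""
    return 1 - (1 - p) ** S

# With p = 0.01, the risk grows quickly with cluster size:
# S = 10  -> ~0.096 (about 1 day in 10)
# S = 100 -> ~0.634 (most days lose a server)
```

This is why replication is not optional at cluster scale: with hundreds of servers, losing at least one on any given day is the common case, not the exception.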
The simplest possible way to accomplish this would be to
cut the data into S partitions (one per server) and store
copies of a given key K on R servers. One way to associate
the R servers with key K would be to take a = K mod S and
store the value on servers a, a+1, ..., a+R-1. For any
failure probability p, one can then pick an appropriate
replication factor R to achieve an acceptably low
probability of data loss.
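This naive placement rule can be sketched directly (illustrative only; it assumes integer keys and contiguous server numbering):

```python
def replica_servers(key, S, R):
    """Naive modulo placement: a = key mod S, with replicas on
    servers a, a+1, ..., a+R-1, wrapping around the cluster."""
    a = key % S
    return [(a + i) % S for i in range(R)]

# e.g. replica_servers(14, S=4, R=3) -> [2, 3, 0]
```

The wrap-around at the end of the server list is what makes the scheme work for keys that hash near S; the weakness discussed next is what happens when S itself changes.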
This system has the nice property that anyone can calculate
the location of a value just by knowing its key, which
allows look-ups in a peer-to-peer fashion without contacting
a central metadata server holding a mapping of all keys to
servers. The downside to this approach occurs when a server
is added to or removed from the cluster: in this case a may
change for most keys, and almost all data will shift between
servers. Even when a does not change, load will not
distribute evenly from a single removed or failed server to
the rest of the cluster.
Consistent hashing is a technique that avoids these
problems, and Voldemort uses it to compute the location of
each key on the cluster. With this technique, when a server
fails, load distributes equally over all remaining servers
in the cluster; likewise, when a new server is added to a
cluster of S servers, only a 1/(S+1) fraction of the values
must be moved to the new machine.
To visualize the consistent hashing method we can see the
possible integer hash values as a ring beginning with 0 and
circling around to 2^31-1. This ring is divided into Q
equally-sized partitions with Q >> S and each of the S
servers is assigned Q/S of these. A key is mapped onto the
ring using an arbitrary hash function, and then we compute
a list of R servers responsible for this key by taking the first
R unique nodes when moving over the partitions in a
clockwise direction. The diagram [6] below pictures a hash
ring for servers A, B, C, D. The arrows indicate keys
mapped onto the hash ring and the resulting list of servers
that will store the value for that key if R=3.
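A toy version of this ring can be sketched as follows. The partition-to-server assignment and the hash function here are illustrative stand-ins, not Voldemort's actual implementation:

```python
import hashlib

def ring_lookup(key, servers, Q, R):
    """Map a key onto a ring of Q equal partitions, then walk
    clockwise from the key's partition and collect the first R
    unique servers, as described in the text."""
    S = len(servers)
    # Assign partitions to servers round-robin so each server
    # owns roughly Q/S partitions (a stand-in for the real
    # cluster-configured assignment).
    owner = [servers[i % S] for i in range(Q)]
    # An arbitrary but deterministic hash onto the ring.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % Q
    replicas = []
    for step in range(Q):
        s = owner[(h + step) % Q]
        if s not in replicas:
            replicas.append(s)
        if len(replicas) == R:
            break
    return replicas
```

Because a key's position depends only on its own hash and the partition table, adding or removing a server changes ownership of only that server's partitions, which is exactly the property the naive modulo scheme lacks.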
Features like load balancing and semantic partitioning are
likewise implemented in related systems such as Kafka and
Sensei DB.
2.1.4 Data Format & Queries
In Voldemort, data is divided into "stores", unlike in a
relational database where it is broken into 2D tables. The
word "table" is not used because the data need not be
tabular (a value can contain lists and mappings, which are
not allowed in a strict relational model). Each key is
unique to a store, and each key can have at most one value.
2.1.4.1 Queries
Voldemort supports hash table semantics, so a single value
can be modified at a time and retrieval is by primary key.
This makes distribution across machines particularly easy
since everything can be split by the primary key.
Lists can be used as values in place of one-to-many
relations, since both accomplish the same thing, so it is
possible to store a reasonable number of values associated
with a single key. In most cases this denormalization is a
huge performance improvement, since only a single set of
disk seeks is needed; but for very large one-to-many
relationships (say, where a key maps to tens of millions of
values, which must be kept on the server and streamed lazily
via a cursor) this approach is not practical. Such rare
cases must be broken up into sub-queries or otherwise
handled at the application level.
The simplicity of the queries can be an advantage: since
each has very predictable performance, it is easy to break
down the performance of a service into the number of storage
operations it performs and quickly estimate the load. In
contrast, SQL queries are often opaque, and execution plans
can be data-dependent, so it can be very difficult to
estimate whether a given query will perform well with
realistic data under load (especially for a new feature
which has neither data nor load).
Also, having a three operation interface makes it possible to
transparently mock out the entire storage layer and unit test
using a mock-storage implementation that is little more
than a HashMap. This makes unit testing outside of a
particular container or environment much more practical.
2.1.5 Consistency & Versioning
When taking multiple simultaneous writes distributed
across multiple servers and perhaps multiple data centres,
consistency of data becomes a difficult problem. The
traditional solution to this problem is distributed
transactions but these are both slow (due to many round
trips) and fragile as they require all servers to be available
to process a transaction. In particular any algorithm which
must talk to more than 50% of the servers to ensure
consistency becomes quite problematic if the application is
running in multiple data centres and hence the latency for
cross-data-centre operations will be extremely high.
An alternate solution is to tolerate the possibility of
inconsistency, and resolve inconsistencies at read time.
Applications usually do a read-modify-update sequence
when modifying data. For example if a user adds an email
address to their account we might load the user object, add
the email, and then write the new values back to the db.
Transactions in databases are one solution to this problem,
but they are not a real option when the transaction must
span multiple page loads (which may or may not complete, and
which can take an arbitrarily long time to complete).
The value for a given key is consistent if, in the absence of
updates, all reads of that key return the same value. In the
read-only world data is created in a consistent way and not
changed. When we add both writes, and replication, we
encounter problems: now we need to update multiple values
on multiple machines and leave things in a consistent state.
In the presence of server failures this is very hard, in the
presence of network partitions it is provably impossible (a
partition is when, e.g., A and B can reach each other and C
and D can reach each other, but A and B can't reach C and
D).
There are several methods for reaching consistency, with
different guarantees and performance tradeoffs: two-phase
commit (2PC), Paxos-style consensus, and read-repair. The
first two approaches prevent permanent inconsistency. The
third approach involves writing all inconsistent versions,
then detecting the conflict at read time and resolving it;
this is the approach used by the SNA team. It involves
little co-ordination and is completely failure-tolerant, but
may require additional application logic to resolve
conflicts. It has the best availability guarantees and the
highest efficiency (only W network roundtrips are required
for N replicas, where W can be configured to be less than
N). 2PC typically requires 2N blocking roundtrips; Paxos
variations vary quite a bit but are comparable to 2PC.
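Read-repair needs a way to decide, at read time, whether one version supersedes another or whether the two are concurrent. Voldemort's versioning uses vector clocks for this; the comparison can be sketched minimally as follows (an illustrative sketch, not the actual implementation):

```python
def compare(vc1, vc2):
    """Compare two vector clocks, represented as dicts mapping
    node id -> update counter. Returns 'before', 'after',
    'equal', or 'concurrent'; 'concurrent' is the conflict case
    that read-repair hands to application logic to resolve."""
    nodes = set(vc1) | set(vc2)
    less = any(vc1.get(n, 0) < vc2.get(n, 0) for n in nodes)
    greater = any(vc1.get(n, 0) > vc2.get(n, 0) for n in nodes)
    if less and greater:
        return "concurrent"   # neither version dominates
    if less:
        return "before"       # vc2 supersedes vc1
    if greater:
        return "after"        # vc1 supersedes vc2
    return "equal"
```

Superseded versions can be discarded silently; only the "concurrent" outcome requires the application-level conflict resolution mentioned above.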
Another mechanism that helps reach consistency is hinted
handoff. Here, if during a write we find that a destination
node is down (failure handling), we store a "hint" of the
updated value on one of the live nodes; when the failed
nodes come back up, the hints are pushed to them, making the
data consistent again.
2.1.6 Routing Parameters
Any persistent system needs to answer the question "where
is my stuff?". This is a very easy question if we have a
centralized database, since the answer is always
"somewhere on the database server". In a partitioned key
system there are multiple machines that may have the data.
When we do a read we need to read from at least 1 server to
get the answer, when we do a write we need to (eventually)
write to all N of the replicas.
There are thus three parameters that matter:
- N: the number of replicas
- R: the number of machines to read from
- W: the number of writes to block for
Note that if R + W > N, then we are guaranteed to "read our
writes". If W = 0, then writes are non-blocking and there is
no guarantee of success whatsoever. Puts and deletes are
neither immediately consistent nor isolated. The semantics
are as follows: if a put/delete operation succeeds without
exception, then it is guaranteed that at least W nodes
carried out the operation; however, if the write fails (say,
because too few nodes succeeded in carrying out the
operation), then the state is unspecified. If at least one
put/delete succeeded, then the value will eventually become
the new value; if none succeeded, then the value is lost. If
clients want to ensure the state after a failed write
operation, they must issue another write.
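The read-your-writes guarantee follows from a counting argument: when R + W > N, any set of W written replicas and any set of R read replicas must overlap in at least one node. A brute-force sketch of this check (illustrative, and only practical for small N):

```python
from itertools import combinations

def quorums_always_intersect(N, R, W):
    """Check that every W-subset and every R-subset of N
    replicas share at least one node, which holds exactly
    when R + W > N."""
    replicas = range(N)
    return all(set(w) & set(r)
               for w in combinations(replicas, W)
               for r in combinations(replicas, R))
```

With N=3, R=2, W=2 every read quorum overlaps every write quorum, so a read always sees at least one up-to-date replica; with N=3, R=1, W=1 a read can land entirely on stale nodes.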
2.1.7 Performance [7]
Getting real applications deployed requires simple, well
understood, predictable performance. Understanding and
tuning the performance of a cluster of machines is an
important criterion too. Note that there are a number of
tunable parameters: the cache size on a node, the number of
nodes read from and written to on each operation, the amount
of data on a server, etc.
Estimating network latency and data/cache ratios
Disk is far and away the slowest and lowest throughput
operation. Disk seeks are 5-10ms and a lookup could
involve multiple disk seeks. When the hot data is primarily
in memory you are benchmarking the software, when it is
primarily on disk you are benchmarking your disk system.
The calculation done when planning a feature is to take the
estimated total data size, divide by the number of nodes,
and multiply by the replication factor; this gives the
amount of data per node. Comparing this to the cache size
per node gives the fraction of the total data that can be
served from memory, which can then be compared to some
estimate of the hotness of the data. For example, if the
requests are completely random, then a high proportion
should be in memory. If instead the requests concern data
about particular members, only some fraction of members are
logged in at once, and one member session produces many
requests, then a much lower fraction may suffice.
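The capacity-planning arithmetic above, as a sketch (all numbers are illustrative):

```python
def memory_served_fraction(total_data_gb, nodes, replication,
                           cache_per_node_gb):
    """Fraction of a node's data that fits in its cache:
    data_per_node = total * replication / nodes, then compare
    to the cache size, capping at 1.0 (fully cacheable)."""
    data_per_node = total_data_gb * replication / nodes
    return min(1.0, cache_per_node_gb / data_per_node)

# e.g. 1 TB total, 10 nodes, replication factor 2, 15 GB of
# cache per node: 200 GB per node, so only 7.5% fits in memory.
```

Whether 7.5% is enough then depends on the hotness estimate described above: it is hopeless for uniform random access, but may suffice if a small logged-in fraction of members generates most requests.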
Network is the second biggest bottleneck after disk. The
maximum throughput one java client can get for roundtrips
through a socket to a service that does absolutely nothing
seems to be about 30-40k req/sec over localhost. Adding
work on the client or server side or adding network latency
can only decrease this.
“Some results of LinkedIn performance tests: the throughput
we see from a single multithreaded client talking to a
single server, where the "hot" data set is in memory, under
artificially heavy load in a performance lab:
Reads: 19,384 req/sec
Writes: 16,559 req/sec
Note that this is a single-node cluster, so the replication
factor is 1. Obviously doubling the replication factor will
halve the client req/sec, since it is doing 2x the
operations, so these numbers represent the maximum
throughput from one client. By increasing the replication
factor, decreasing the cache size, or increasing the data
size on the node, we can make the performance arbitrarily
slow. Note that in this test the server is actually fairly
lightly loaded, since it has only one client, so this does
not measure the maximum throughput of a server, just the
maximum throughput from a single client.” [8]
2.2 Support for batch computed data – Read only stores
One of the most data-intensive storage needs is storing
batch-computed data about members and content in the system.
These jobs often deal with the relationships between
entities (e.g. related users, or related news articles) and
so for N entities can produce up to N^2 relationships. An
example at LinkedIn is member networks, which are in the
12TB range if stored explicitly for all members. Batch
processing of data is generally much more efficient than
random access, which means one can easily produce more
batch-computed data than can be easily accessed by the live
system; Hadoop greatly expands this ability. Therefore a
Voldemort persistence backend was created that supports very
efficient read-only access and takes a lot of the pain out
of building, deploying, and managing large, read-only
batch-computed data sets.
Much of the pain of dealing with batch computing comes from
the "push" process that transfers data from a data warehouse
or Hadoop instance to the live system. In a traditional
database this often means rebuilding the index on the live
system with the new data. Doing millions of SQL insert or
update statements is generally not at all efficient, so
typically in a SQL database the data is deployed as a new
table and then swapped in to replace the current data when
the new table is completely built. This is better than doing
millions of individual updates, but it still means the live
system is building a many-GB index for the new data set
while simultaneously serving live traffic.
This alone can take hours or days, and may destroy the
performance on live queries. Some people have fixed this
by swapping out at the database level (e.g. having an online
and offline db, and then swapping), but this requires effort
and means only half your hardware is being utilized.
Voldemort fixes this process by making it possible to
prebuild the index itself offline (on Hadoop or wherever),
and simply push it out to the live servers and transparently
swap.
A driver program initiates the fetch and swap procedure in
parallel across a whole Voldemort cluster. In their tests it is
reported that this process can reach the I/O limit of either
the Hadoop cluster or the Voldemort cluster. This also
helps in associating the ‘Hot’ data with its
corresponding keys.
Benchmarking anything that involves disk access is
notoriously difficult because of sensitivity to three factors:
1. The ratio of data to memory
2. The performance of the disk subsystem, and
3. The entropy of the request stream
The ratio of data to memory and the entropy of the request
stream determine how many cache misses will be sustained,
so these are critical. A random request stream is more or
less un-cacheable, but fortunately almost no real request
streams are random. They tend to have strong temporal
locality which is what page cache eviction algorithms
exploit. So we can assume a large ratio of memory to disk,
and test against a simulated request stream to get
performance information. Any build process will consist of
three stages: (1) partitioning the data into separate sets for
each destination node, (2) gathering all data for a given
node, and (3) building the lookup structure for that node.
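The three build stages map naturally onto a sketch like the following. The hash-based partitioning and the sorted-list "index" are illustrative stand-ins; the real build runs as Hadoop MapReduce jobs producing Voldemort's on-disk store files:

```python
from collections import defaultdict

def build_node_stores(items, num_nodes):
    """Offline store build, sketched in three stages:
    (1) partition key-value pairs by destination node,
    (2) gather each node's pairs together,
    (3) build a per-node lookup structure (here a sorted list,
        standing in for the read-only on-disk index)."""
    partitions = defaultdict(list)          # stages 1 and 2
    for key, value in items:
        partitions[hash(key) % num_nodes].append((key, value))
    return {node: sorted(pairs)             # stage 3
            for node, pairs in partitions.items()}
```

Because every stage runs offline, the live servers only ever see the final fetch-and-swap, which is what keeps index building off the serving path.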
2.2.1 Build Time [8]
The tested time is the complete build time including
mapping the data out to the appropriate node-chunk,
shuffling the data to the nodes that will do the build, and
finally creating the ‘store’ files. In general, the time was
roughly evenly split between map, shuffle and reduce
phases. The number of map and reduce tasks are a very
important parameter, as experiments on a smaller data set
show that varying the number of tasks could change the
build time by more than 25%, but due to time constraints
LinkedIn used defaults Hadoop produced, for Testing. Here
are the times taken:
100GB: 28mins (400 mappers, 90 reducers)
512GB: 2hrs, 16mins (2313 mappers, 350 reducers)
1TB: 5hrs, 39mins (4608 mappers, 700 reducers)
This neglects the additional benefits of Hadoop for
handling failures, dealing with slower nodes, etc.
In addition, this process is scalable: it can be run on a
number of machines equal to the number of chunks (700 in
our 1TB case) not the number of destination nodes (only
10). Data transfer between the clusters happens at a steady
rate bound by the disk or network. In LinkedIn’s Amazon
instances this is around 40MB/second.
2.2.2 Online Performance [8]
Lookup time for a single Voldemort node compares well to
a single MySQL instance as well. Consider a local test
against the 100GB per-node data from the 1 TB test. Let it
run on an Amazon Extra Large instance with 15GB of
RAM and the 4 ephemeral disks in a RAID 10
configuration. To run the tests 1 million requests from a
real request stream recorded on the production system
against each of storage systems, be simulated. Then the
following performance for 1 million requests against a
single node is resulted:
                           MySQL      Voldemort
Reqs per sec.              727        1291
Median req. time           0.23 ms    0.05 ms
Avg. req. time             13.7 ms    7.7 ms
99th percentile req. time  127.2 ms   100.7 ms
These numbers are both for local requests with no network
involved as the only intention is to benchmark the storage
layer of these systems.
2.3 White Elephant: The Hadoop Tool
LinkedIn’s Hadoop tool for managing and analyzing cluster
usage is "White Elephant". At LinkedIn, Hadoop is used for
product development (e.g., predictive analytics applications
like ‘People You May Know’ and ‘Endorsements’), descriptive
statistics for powering internal dashboards, ad-hoc analysis
by data scientists, and ETL. White Elephant parses Hadoop
logs to provide visual drill-downs and rollups of task
statistics for a Hadoop cluster, including total task time,
slots used, CPU time, and failed job counts.
White Elephant fills several needs:
- Scheduling: with a handful of periodic jobs it is easy to
  reason about when they should run, but that approach
  quickly stops scaling. The ability to schedule jobs at
  periods of low utilization helps maximize cluster
  efficiency.
- Capacity planning: to plan for future hardware needs,
  operations teams need to understand the resource usage
  growth of jobs.
- Billing: Hadoop clusters have finite capacity, so in a
  multi-tenant environment it is important to weigh the
  resources used by a product feature against its business
  value.
2.3.1 Architecture [10]
Here's a diagram outlining the White Elephant architecture:
There are three Hadoop Grids, A, B, and C, for which
White Elephant will compute statistics as follows:
1. Upload Task: a task that periodically runs on the
Job Tracker for each grid and incrementally copies
new log files into a Hadoop grid for analysis.
2. Compute: a sequence of MapReduce jobs
coordinated by a Job Executor parses the uploaded
logs and computes aggregate statistics.
3. Viewer: a viewer app incrementally loads the
aggregate statistics, caches them locally, and
exposes a web interface that can be used to slice
and dice statistics for the Hadoop clusters.
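Step 1, the incremental upload task, can be sketched roughly as below. The file layout, the `.log` suffix, and the `copy_fn` hook are illustrative assumptions, not White Elephant's actual implementation.

```python
import os
import tempfile

def incremental_upload(src_dir, uploaded, copy_fn):
    """Sketch of the Upload Task: copy only log files that have not
    been seen before, so each periodic run is incremental."""
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".log") and name not in uploaded:
            copy_fn(os.path.join(src_dir, name))  # ship to analysis grid
            uploaded.add(name)
    return uploaded

# Demo against a throwaway directory standing in for JobTracker logs.
src = tempfile.mkdtemp()
for n in ("job_1.log", "job_2.log"):
    open(os.path.join(src, n), "w").close()

seen = incremental_upload(src, set(), copy_fn=print)   # copies both files
open(os.path.join(src, "job_3.log"), "w").close()
seen = incremental_upload(src, seen, copy_fn=print)    # copies only job_3
```

Tracking what has already been uploaded is what keeps the periodic task cheap even as the log directory grows.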
2.4 Sensei DB
Sensei DB is a distributed, searchable database that handles
complex semi-structured queries. It can be used to power
consumer search systems with rich structured data. It is an
open-source, distributed, real-time, semi-structured
database that powers the LinkedIn homepage and
LinkedIn Signal.
Some Features of this database include:
Full-text search
Fast real-time updates
Structured and faceted search
BQL: SQL-like query language
Fast key-value lookup
High performance under concurrent heavy update
and query volumes
Hadoop integration
Sensei enables faceted search on the rich structured data
that LinkedIn incorporates into user profiles. The
fundamental idea was to give individuals an easy, natural
way to slice and dice through search results, or simply
content, so a faceted search paradigm would be ideal not
only for retrieval but also for navigation and discovery.
Since a LinkedIn member profile has these rich structural
dimensions alongside rich text data, building such an
interface was a natural step.
A click on a facet value amounts to filtering the search
results by that value. For example, searching for “John”
and then selecting “San Francisco” should return only
people in San Francisco called John, i.e. “John” +
facet_value(“San Francisco”) = “John AND
location:(San Francisco)”. Because facet counts are
computed over the current result set, navigating through
results never leads to a dead end.
What was implemented is essentially a query engine for the
following type of query:
SELECT f1,f2…fn FROM members
WHERE c1 AND c2 AND c3..
MATCH (fulltext query, e.g. “java engineer”)
GROUP BY fx,fy,fz…
ORDER BY fa,fb…
LIMIT offset,count
Delegating this query to a traditional RDBMS over tens to
hundreds of millions of rows with a sub-second query
latency SLA is not feasible. Thus a distributed system like
Sensei, which handles the above query at internet scale, is
necessary. A faceted search snapshot is shown in [11].
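As a toy illustration of the MATCH-plus-facet behaviour (not Sensei's implementation), the “John in San Francisco” example can be mimicked over an in-memory list. All member records and field names below are made up.

```python
from collections import Counter

# Hypothetical member records with one text field and two facets.
members = [
    {"name": "John Doe", "location": "San Francisco", "industry": "Software"},
    {"name": "John Roe", "location": "New York",      "industry": "Finance"},
    {"name": "Jane Poe", "location": "San Francisco", "industry": "Software"},
]

def faceted_search(rows, text, facets):
    """Toy faceted query: a full-text MATCH plus facet filters,
    returning hits and per-facet counts for further navigation."""
    hits = [r for r in rows
            if text.lower() in r["name"].lower()
            and all(r[f] == v for f, v in facets.items())]
    counts = {f: Counter(r[f] for r in hits)
              for f in ("location", "industry")}
    return hits, counts

# "John" + facet_value("San Francisco")
hits, counts = faceted_search(members, "john", {"location": "San Francisco"})
print([h["name"] for h in hits])   # only the Johns in San Francisco
```

Because the facet counts are recomputed over the hits, every facet value offered to the user matches at least one result, which is the “no dead ends” property described above.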
2.5 Avatara: OLAP for Web-scale Analytics Products
The last component of LinkedIn's SNA stack described in
this paper is Avatara, an OLAP system for web-scale
analytics products. LinkedIn has many analytical insight
products, such as "Who's Viewed My Profile?" and "Who's
Viewed This Job?". At their core, these are
multidimensional queries. For example, "Who's Viewed My
Profile?" takes someone's profile views and breaks them
down by industry, geography, company, school, etc., to
show the richness of the people who viewed their profile [12].
Online analytical processing (OLAP) is the traditional
approach to such multi-dimensional analytical problems.
However, LinkedIn had to build a solution that could
answer these queries in milliseconds across 175+ million
members, and so built Avatara. Avatara is LinkedIn's
scalable, low-latency, highly-available OLAP system for
sharded multi-dimensional queries within the time
constraints of a request/response loop.
An interesting insight for LinkedIn's use cases is that
queries span relatively few dimensions – usually tens, at
most a hundred – so the data can be sharded across a
primary dimension. For "Who's Viewed My Profile?", the
cube can be sharded by the member herself, as the product
does not allow analyzing profile views of anyone other than
the member currently logged in. At a high level, Avatara
consists of two components:
1. An offline engine that computes cubes in batch
2. An online engine that serves queries in real time
The offline engine computes cubes with high throughput by
leveraging Hadoop for batch processing. It then writes
cubes to Voldemort DDBS. The online engine queries the
Voldemort store when a member loads a page. Every piece
in this architecture runs on commodity hardware and can be
easily scaled horizontally.
This architecture also illustrates how Hadoop integrates
with LinkedIn’s key-value DDB, Voldemort.
2.5.1 Offline Engine
The offline batch engine processes data through a pipeline
that has three phases:
1. Pre-processing
2. Projections and joins
3. Cubification
Each phase runs one or more Hadoop jobs and produces
output that is the input to the subsequent phase. Hadoop is
used for its built-in high throughput, fault tolerance and
horizontal scalability. The pipeline pre-processes raw data
as needed, projects out dimensions of interest, performs
user-defined joins, and finally transforms the data into
cubes. The result of the batch engine is a set of sharded
small cubes, represented as key-value pairs, where each
key is a shard (for example, member_id for "Who's
Viewed My Profile?") and the value is the cube for that
shard.
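A minimal sketch of the cubification step, assuming a simplified event shape (in the real pipeline these events would already have been pre-processed and joined by the earlier Hadoop phases):

```python
from collections import Counter, defaultdict

# Hypothetical raw profile-view events; field names are illustrative.
views = [
    {"member_id": 7, "viewer_industry": "Banking",  "viewer_geo": "US"},
    {"member_id": 7, "viewer_industry": "Software", "viewer_geo": "IN"},
    {"member_id": 9, "viewer_industry": "Banking",  "viewer_geo": "US"},
]

def cubify(events, shard_key, dims):
    """Toy cubification: emit one small cube (dimension -> value
    counts) per shard, matching the key-value output described above.
    The shard key becomes the Voldemort key; the cube is the value."""
    cubes = defaultdict(lambda: {d: Counter() for d in dims})
    for e in events:
        for d in dims:
            cubes[e[shard_key]][d][e[d]] += 1
    return dict(cubes)

cubes = cubify(views, "member_id", ("viewer_industry", "viewer_geo"))
print(cubes[7])   # member 7's small cube, ready for bulk load
```

Because each cube is self-contained and keyed by the primary dimension, it maps directly onto a key-value store and the shards can be computed independently, which is what makes the batch phase embarrassingly parallel.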
2.5.2 Online Engine
All cubes are bulk loaded into Voldemort. The online query
engine retrieves and processes data from Voldemort,
returning results back to the client. It provides SQL-like
operators, such as select, where, group by, plus some math
operations. The widespread adoption of SQL makes it easy
for application developers to interact with Avatara. With
Avatara, 80% of queries can be satisfied within 10 ms, and
95% of queries can be answered within 25 ms for "Who's
Viewed My Profile?" on a high traffic day.
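A toy version of the online engine's SQL-like operators over one loaded cube might look as follows; the cube shape and the operator set are assumptions for illustration, not Avatara's actual API.

```python
from collections import Counter

# One member's loaded cube: per-dimension counts of their profile
# viewers (an assumed shape, mirroring the sharded cubes above).
cube = {
    "industry": Counter({"Banking": 12, "Software": 9}),
    "geo": Counter({"US": 15, "IN": 6}),
}

def query(cube, group_by, where=None, top=10):
    """SQL-like online query over a single small cube: an optional
    filter (where), group by one dimension, order by count
    descending, and limit (top)."""
    counts = cube[group_by]
    if where:
        counts = Counter({k: v for k, v in counts.items() if where(k, v)})
    return counts.most_common(top)

print(query(cube, "industry"))                     # [('Banking', 12), ('Software', 9)]
print(query(cube, "geo", where=lambda k, v: v >= 10))
```

Since the whole cube for one member is small, such queries run entirely in memory after a single Voldemort fetch, which is consistent with the millisecond latencies reported above.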
2.6 Conclusions
When the scale of data began to overload the LinkedIn
servers, their solution wasn’t to add more nodes but to cut
out some of the matching heuristics that required too much
compute power. Instead of writing algorithms to make
“People You May Know” more accurate, their team worked on
getting LinkedIn’s Hadoop infrastructure in place and built
a distributed database called Voldemort. They then built
Azkaban, an open source scheduler for batch processes
such as Hadoop jobs, and Kafka, another open source tool
referred to as “the big data equivalent of a message broker”.
At a high level, Kafka is responsible for managing the
company’s real-time data and getting those hundreds of
feeds to the apps that subscribe to them with minimal
latency. A 2012 study comparing systems for storing APM
monitoring data reported that Voldemort, Cassandra, and
HBase offered linear scalability in most cases, with
Voldemort having the lowest latency and Cassandra having
the highest throughput.
Why has LinkedIn not shifted away from a NoSQL database
like Voldemort?
“The fundamental problem is endemic to the relational
database mindset, which places the burden of computation
on reads rather than writes. This is completely wrong for
large-scale web applications, where response time is
critical. It’s made much worse by the serial nature of most
applications. Each component of the page blocks on reads
from the data store, as well as the completion of the
operations that come before it. Non-relational data stores
reverse this model completely, because they don’t have the
complex read operations of SQL”, as noted by the
LinkedIn SNA team in the ‘Interview with Ryan King’.
Acknowledgements
The authors of this paper would like to acknowledge
LinkedIn’s Data Team, which has open-sourced its data
stores, such as Voldemort, and SNA tools, such as Sensei
DB, Avatara and Azkaban, thereby providing various means
for research.
References
[1] http://en.wikipedia.org/wiki/LinkedIn
[2] Dynamo: Amazon's Highly Available Key-Value Store
[3] http://data.linkedin.com/ – the data team which
manages the SNA of LinkedIn
[4] http://www.project-voldemort.com/voldemort/design.html
[5] http://en.wikipedia.org/wiki/Voldemort_%28distributed_data_store%29
[6] Time, Clocks, and the Ordering of Events in a
Distributed System – for the versioning details
[7] Eventual Consistency Revisited – a discussion on
Werner Vogels' blog on the developer's interaction with the
storage system and what the tradeoffs mean in practical terms
[8] Brewer's conjecture and the feasibility of consistent,
available, partition-tolerant web services – consistency,
availability and partition tolerance
[9] Berkeley DB performance – a somewhat biased
overview of BDB performance
[10] Google's Bigtable – for comparison, a very different
approach
[11] One Size Fits All: An Idea Whose Time Has Come
and Gone – a very interesting paper by the creator of
Ingres, Postgres and Vertica
[12] One Size Fits All? Part 2, Benchmarking Results –
benchmarks mentioned in the paper
[13] Consistency in Amazon's Dynamo – blog posts on
Dynamo
[14] Paxos Made Simple; Two-phase commit – Wikipedia
description
[15] The Life of a Typeahead Query – the various technical
aspects and challenges of real-time typeahead search in the
context of a social network
[16] Efficient type-ahead search on relational data: a
TASTIER approach – a relational approach to typeahead
searching by means of specialized index structures and
algorithms for joining related tuples in the database
[17] http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/
– “LinkedIn, A Powerhouse”, interviews with the development team
[18] http://www.cloudera.com/hadoop-training-basic – the
principles behind MapReduce and Hadoop
[19] https://groups.google.com/forum/?fromgroups#!forum/project-voldemort