social databases - a brief overview
CONTENTS
• Background
• Case Studies:
• Twitter: Real-time search (Earlybird)
• Facebook: Storage (TAO)
• LinkedIn: Storage (Voldemort)
• Conclusion
• References
BACKGROUND: ONLINE SOCIAL NETWORKS (OSN)
● Huge amounts of data, diverse and changing over time: likes, shares, comments, logins, page views, search queries.
● New approaches are needed to manipulate it.
● Distributed databases, NoSQL.
● How to retrieve the data (search relevance, recommendations, security against abusive behavior, news feed features).
● Goal: scale massively with demand, for unstructured and semi-structured data.
Facebook: 2.7M likes & comments/day
Twitter: 500M tweets/day
LinkedIn: 300M+ users (2 new/s), 200 group conversations/min
STORING AND QUERYING AT TWITTER
● Storage:
o MySQL used as a key-value store.
o FlockDB for the Twitter social graph.
● Desired queries:
o Trending topics
o Breaking news
o Sentiment
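As a rough illustration of "a relational database used as a key-value store", the sketch below keeps every value in one two-column table and only ever reads or writes by primary key. It is not Twitter's schema or code: the table name `kv`, the class `RelationalKV`, and the key format are hypothetical, and SQLite (from the Python standard library) stands in for MySQL so the example runs anywhere.

```python
import sqlite3


class RelationalKV:
    """Key-value access on top of a single relational table (illustrative)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # One table, two columns: the key is the primary key, the value is opaque.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB NOT NULL)"
        )

    def put(self, key, value):
        # INSERT OR REPLACE is SQLite syntax; MySQL would use
        # INSERT ... ON DUPLICATE KEY UPDATE or REPLACE instead.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
            )

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def delete(self, key):
        with self.conn:
            self.conn.execute("DELETE FROM kv WHERE k = ?", (key,))


store = RelationalKV()
store.put("tweet:1", b"hello")
```

No joins or secondary indexes are used; the relational engine is treated purely as a durable lookup table.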
TAO
o Architecture and data model:
Objects: (id) → (otype, (key → value)∗)
Associations: (id1, atype, id2) → (time, (key → value)∗)
o MySQL as the storage layer.
o Main challenges:
Efficiency at scale.
Very fast response time.
High read availability.
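The object and association shapes above can be sketched as two small record types plus an in-memory index. This is only a toy model of the data layout, not TAO's API or implementation: the names `GraphObject`, `Association`, `GraphStore`, and `newest_assocs` are hypothetical, and a Python dict stands in for the cache and MySQL layers.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GraphObject:
    # (id) -> (otype, (key -> value)*)
    id: int
    otype: str
    data: dict = field(default_factory=dict)


@dataclass
class Association:
    # (id1, atype, id2) -> (time, (key -> value)*)
    id1: int
    atype: str
    id2: int
    time: float
    data: dict = field(default_factory=dict)


class GraphStore:
    """Toy in-memory store for typed objects and typed, timestamped edges."""

    def __init__(self):
        self.objects = {}
        # (id1, atype) -> {id2: Association}; at most one edge per triple.
        self.assocs = defaultdict(dict)

    def add_object(self, id, otype, **data):
        self.objects[id] = GraphObject(id, otype, data)

    def add_assoc(self, id1, atype, id2, when=None, **data):
        ts = time.time() if when is None else when
        self.assocs[(id1, atype)][id2] = Association(id1, atype, id2, ts, data)

    def newest_assocs(self, id1, atype, limit=10):
        """Return edges of one type leaving id1, newest first."""
        edges = self.assocs.get((id1, atype), {}).values()
        return sorted(edges, key=lambda a: a.time, reverse=True)[:limit]


g = GraphStore()
g.add_object(1, "user", name="Alice")
g.add_object(2, "user", name="Bob")
g.add_object(3, "post", text="hello")
g.add_assoc(1, "friend", 2, when=1.0)
g.add_assoc(1, "authored", 3, when=2.0)
```

Keeping the timestamp on every association is what makes "most recent first" queries cheap to express, which matters for news-feed style reads.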
Professional Social Network
Data-Driven Features:
● Recommendation system (People You May Know)
● People search (job search, candidates)
● Who viewed your profile?
● Events you may be
STORAGE - VOLDEMORT
Highly available distributed key-value store
10 Voldemort clusters (100+ nodes), 9 of them backed by BDB
Layered design
All layers share a single interface:
- Put/Delete/Get
- Flexible
- Every layer decorates the next one
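The layered design can be pictured as a stack of objects that all expose the same put/get/delete interface, each wrapping the one below it. The sketch below is a generic decorator stack under that assumption, not Voldemort's actual classes: `InMemoryStore`, `LoggingLayer`, and `JsonLayer` are hypothetical names chosen for illustration.

```python
import json


class InMemoryStore:
    """Innermost layer: a plain dict standing in for a storage engine."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class LoggingLayer:
    """Records each operation, then delegates to the next layer."""

    def __init__(self, inner, log):
        self.inner = inner
        self.log = log

    def get(self, key):
        self.log.append(("get", key))
        return self.inner.get(key)

    def put(self, key, value):
        self.log.append(("put", key))
        self.inner.put(key, value)

    def delete(self, key):
        self.log.append(("delete", key))
        self.inner.delete(key)


class JsonLayer:
    """Serializes values on the way down and parses them on the way up."""

    def __init__(self, inner):
        self.inner = inner

    def get(self, key):
        raw = self.inner.get(key)
        return None if raw is None else json.loads(raw)

    def put(self, key, value):
        self.inner.put(key, json.dumps(value))

    def delete(self, key):
        self.inner.delete(key)


log = []
store = JsonLayer(LoggingLayer(InMemoryStore(), log))
store.put("member:1", {"name": "Ada"})
```

Because every layer has the same three methods, layers can be added, removed, or reordered without the caller changing, which is the flexibility the slide refers to.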
STORAGE - VOLDEMORT
Voldemort is:
•Highly available
•Low latency
•Distributed
Like a distributed hash table (DHT).
Storage data engine on nodes:
•Compact index
•Data files
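One common way to get DHT-like behaviour is consistent hashing: keys and nodes are hashed onto the same ring, and a key is stored on the next few distinct nodes clockwise from its position. The slides do not spell out Voldemort's exact partitioning scheme, so the sketch below is a generic illustration under that assumption; `HashRing`, `vnodes`, and `replicas` are hypothetical names and parameters.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes and simple replication."""

    def __init__(self, nodes, vnodes=64, replicas=2):
        nodes = list(nodes)
        # Never ask for more replicas than there are distinct nodes.
        self.replicas = min(replicas, len(set(nodes)))
        # Each physical node appears at `vnodes` positions to even out load.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def nodes_for(self, key):
        """Return the distinct nodes responsible for `key`, primary first."""
        start = bisect.bisect_right(self.points, _hash(key))
        chosen = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == self.replicas:
                    break
        return chosen


ring = HashRing(["node-a", "node-b", "node-c"])
owners = ring.nodes_for("member:42")
```

Any client can compute `nodes_for(key)` locally, so reads and writes go straight to the owning nodes without a central lookup service.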
SUMMARY
System | Problem Solved | Main Advantages
Earlybird | Real-time search | Fast indexing, concurrency management
TAO | Storing the Facebook social graph | Very fast response time, high read availability
Voldemort | Simple data partitioning to meet scalability needs | Highly scalable, seamless replication
CONCLUSION
• The selection of the database systems depends on the
needs of the applications and the primary type of
information of the social network.
• Many OSNs have developed their own solutions to cope with
the ever-growing nature of big data and its challenges.
• In summary, the main features that these data solutions
should have are:
• Storage of huge amounts of data.
• Fast reads and low latency.
• Processing of big data (meaningful results).
• Streaming and indexing are critical.
• Other challenges: Privacy and Expansion.
EXTREMELY DIFFICULT QUESTIONS
1. Why did LinkedIn need to build its own solution, Voldemort?
2. How does TAO resolve the challenges it was built for?
3. How does the real-time search service work at Twitter?
REFERENCES
● Auradkar, A., Botev, C., Das, S., De Maagd, D., Feinberg, A., Ganti, P., … Zhang, J.
(2012). Data Infrastructure at LinkedIn. In Data Engineering (ICDE), 2012 IEEE 28th
International Conference on (pp. 1370–1381).
● N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo,
S. Kulkarni, and H. Li, “Tao: Facebook’s distributed data store for the social graph,” in
USENIX ATC, 2013.
● N. Ruflin, H. Burkhart, and S. Rizzotti, “Social-data storage-systems,” Databases Soc.
Networks - DBSocial ’11, pp. 7–12, 2011.
● A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H.
Liu, “Data warehousing and analytics infrastructure at facebook,” Proceedings of the
2010 ACM SIGMOD International Conference on Management of data. ACM,
Indianapolis, Indiana, USA, pp. 1013–1020, 2010.
● D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a Needle in Haystack:
Facebook’s Photo Storage,” in OSDI, 2010, vol. 2010, pp. 47–60.
● M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, “Earlybird: Real-Time
Search at Twitter,” Proceedings of the 2012 IEEE 28th International Conference on Data
Engineering. IEEE Computer Society, pp. 1360–1369, 2012.
● J. Lin and D. Ryaboy, “Scaling big data mining infrastructure: the twitter experience,”
REFERENCES (II)
● D. Borthakur, J. Gray, J. Sen Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K.
Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache hadoop goes
realtime at Facebook,” Proceedings of the 2011 ACM SIGMOD International Conference on
Management of data. ACM, Athens, Greece, pp. 1071–1080, 2011.
● C. Chen, F. Li, B. C. Ooi, and S. Wu, “TI: an efficient indexing mechanism for real-time search
on tweets,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of
data. ACM, Athens, Greece, pp. 649–660, 2011.
● G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin, “Fast data in the era of big data: Twitter’s real-
time related query suggestion architecture,” Proceedings of the 2013 ACM SIGMOD
International Conference on Management of Data. ACM, New York, New York, USA, pp. 1147–
1158, 2013.
● S. Cohen and B. Kimelfeld, “A Social Network Database that Learns How to Answer Queries ∗,”
2013.
● Eben Hewitt, Cassandra: The Definitive Guide, illustrated ed. , O'Reilly Media, Inc., 2010.
LINKS
● https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
● http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
● http://www.slideshare.net/linkedin/jay-kreps-on-project-voldemort-scaling-simple-storage-at-linkedin
● http://data.linkedin.com/
● http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe