social databases - a brief overview

SOCIAL DATA AND DBS

IVAN SANCHEZ

JULIO SALINAS

MARLENE ROBLES

CONTENTS

• Background

• Case Studies:

• Twitter: Real-time search (Earlybird)

• Facebook: Storage (TAO)

• LinkedIn: Storage (Voldemort)

• Conclusion

• References

BACKGROUND: OSN

● Huge amounts of data, diverse and changing over time: likes, sharing, comments, logins, page views, search queries.

● New approaches are needed to manipulate it.

● Distributed databases, NoSQL.

● How to retrieve the data (search relevance, recommendations, security against abusive behavior, newsfeed features).

● Goal: scale massively with demand, for unstructured and semi-structured data.

Facebook: 2.7M likes & comments/day

Twitter: 500M tweets/day

LinkedIn: 300M+ users (2 new/s), 200 group conversations/min

STORING AND QUERYING AT TWITTER

● Storage:

o MySQL used as a key-value store.

o FlockDB for the Twitter social graph.

● Desired queries:

o Trending topics

o Breaking news

o Sentiment
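The "MySQL used as a key-value store" point can be illustrated with a minimal sketch. It assumes a single two-column table and uses Python's built-in sqlite3 module as a stand-in for MySQL; the class name KeyValueTable, the table name kv, and the sample key are hypothetical and not taken from Twitter's schema.

```python
import sqlite3


class KeyValueTable:
    """Key-value access on top of one relational table.

    sqlite3 stands in for MySQL here; the table layout is an assumption.
    """

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
        )

    def put(self, key, value):
        # INSERT OR REPLACE gives upsert semantics in SQLite.
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def delete(self, key):
        self.conn.execute("DELETE FROM kv WHERE k = ?", (key,))
        self.conn.commit()


store = KeyValueTable(sqlite3.connect(":memory:"))
store.put("tweet:1", "hello world")
```

Only primary-key lookups are used, so the relational engine serves as durable storage while joins and secondary queries are left out.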

REAL TIME SEARCH AT TWITTER: EARLYBIRD
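As a rough illustration of real-time search, the sketch below keeps an in-memory inverted index whose posting lists grow in arrival order and are read newest first, which is the general idea described in the Earlybird paper listed in the references. It is a toy model, not Earlybird's implementation: the class RealTimeIndex and its methods are hypothetical, tokenization is a plain whitespace split, and concurrency management is not modeled.

```python
from collections import defaultdict


class RealTimeIndex:
    """Toy in-memory inverted index; documents are searchable right after add()."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> tweet ids in arrival order
        self.next_id = 0

    def add(self, text):
        """Index one tweet and return its id."""
        tweet_id = self.next_id
        self.next_id += 1
        # set() avoids duplicate postings when a term repeats in one tweet.
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)
        return tweet_id

    def search(self, term, limit=10):
        """Return up to `limit` matching tweet ids, newest first."""
        ids = self.postings.get(term.lower(), [])
        return ids[::-1][:limit]
```

Appending to the end of a posting list keeps indexing cheap, and reversing at query time returns the most recent tweets first.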

TAO AND THE FACEBOOK SOCIAL GRAPH

TAO

o Architecture and data model:

Objects: (id) → (otype, (key ↦ value)∗)

Associations: (id1, atype, id2) → (time, (key ↦ value)∗)

o MySQL as the storage layer.

o Main challenges:

Efficiency at scale.

Very fast response time.

High read availability.
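The object and association mappings above can be sketched as two in-memory dictionaries. This is only a model of the data shapes: the names GraphStore, obj_add, assoc_add, and assoc_range are illustrative and do not claim to match TAO's real API signatures, and the sketch ignores caching and the MySQL storage layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaoObject:
    otype: str                                    # object type, e.g. "user"
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class TaoAssoc:
    time: int                                     # creation time of the edge
    data: Dict[str, str] = field(default_factory=dict)


class GraphStore:
    """Toy store: (id) -> object, (id1, atype, id2) -> association."""

    def __init__(self):
        self.objects: Dict[int, TaoObject] = {}
        self.assocs: Dict[Tuple[int, str, int], TaoAssoc] = {}

    def obj_add(self, oid: int, otype: str, **data: str) -> None:
        self.objects[oid] = TaoObject(otype, dict(data))

    def assoc_add(self, id1: int, atype: str, id2: int, time: int, **data: str) -> None:
        self.assocs[(id1, atype, id2)] = TaoAssoc(time, dict(data))

    def assoc_range(self, id1: int, atype: str, limit: int = 10) -> List[int]:
        """Targets of id1's associations of one type, newest first."""
        hits = [
            (assoc.time, key[2])
            for key, assoc in self.assocs.items()
            if key[0] == id1 and key[1] == atype
        ]
        hits.sort(reverse=True)
        return [id2 for _, id2 in hits[:limit]]
```

Ordering associations by time reflects the typical newsfeed-style read, where the most recent edges are requested first.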

Professional Social Network

Data-Driven Features:

● Recommendation system (people you may know)

● People search (job search, candidates)

● Who viewed your profile?

● Events you may be

STORAGE - VOLDEMORT

Highly available distributed key-value store

10 Voldemort clusters (100+ nodes), 9 of them on BDB

Layered design

All layers share a single interface:

- Put/Delete/Get

- Flexible

- Every layer decorates the next one
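The layered design can be sketched as a stack of stores that all expose the same put/get/delete interface, each wrapping the next one. The three layers below (JSON serialization, logging, in-memory storage) are hypothetical examples chosen for brevity and are not Voldemort's actual layers.

```python
import json


class InMemoryStore:
    """Bottom layer: keeps raw bytes in a dict."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


class LoggingStore:
    """Middle layer: records each operation, then delegates to the next layer."""

    def __init__(self, inner, log):
        self.inner = inner
        self.log = log

    def get(self, key):
        self.log.append(("get", key))
        return self.inner.get(key)

    def put(self, key, value):
        self.log.append(("put", key))
        self.inner.put(key, value)

    def delete(self, key):
        self.log.append(("delete", key))
        self.inner.delete(key)


class JsonStore:
    """Top layer: converts Python values to JSON bytes and back."""

    def __init__(self, inner):
        self.inner = inner

    def get(self, key):
        raw = self.inner.get(key)
        return None if raw is None else json.loads(raw.decode("utf-8"))

    def put(self, key, value):
        self.inner.put(key, json.dumps(value).encode("utf-8"))

    def delete(self, key):
        self.inner.delete(key)


log = []
base = InMemoryStore()
store = JsonStore(LoggingStore(base, log))
store.put("member:42", {"name": "Ada"})
```

Because every layer has the same interface, layers can be reordered, removed, or swapped without changing the code that calls the top of the stack.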

STORAGE - VOLDEMORT

Voldemort provides:

• High availability

• Low latency

• Distribution

Like a distributed hash table (DHT).

Data storage engine on nodes:

• Compact index

• Data files

DISTRIBUTED HASHING ALGORITHM

This slide is from Roshan Sumbaly & Jay Kreps! (thanks Rosh & Jay)
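A distributed hashing scheme of this kind is commonly built as a consistent-hashing ring. The sketch below is a generic ring, assuming MD5 for placement, 8 virtual points per node, and a clockwise walk to pick replicas; the names HashRing and preference_list are hypothetical, and the details are not taken from Voldemort's code.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Generic consistent-hashing ring with virtual points per node."""

    def __init__(self, nodes, vnodes=8):
        # Each node owns several points so load spreads more evenly.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def preference_list(self, key, replicas=2):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self.points, _hash(key))
        chosen = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replicas:
                break
        return chosen


ring = HashRing(["node-a", "node-b", "node-c"])
owners = ring.preference_list("member:42", replicas=2)
```

The first entry in the returned list acts as the primary for the key and the rest as replicas. When a node leaves, only the keys it owned move to other nodes.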

SUMMARY

Earlybird. Problem solved: real-time search. Main advantages: fast indexing, concurrency management.

TAO. Problem solved: storing the Facebook social graph. Main advantages: very fast response time, high read availability.

Voldemort. Problem solved: simple data partitioning to meet scalability needs. Main advantages: highly scalable, seamless replication.

CONCLUSION

• The choice of database system depends on the needs of the application and the primary type of information in the social network.

• Many OSNs have developed their own solutions to cope with the ever-growing nature of big data and its challenges.

• In summary, the main features that the data solutions should have are:

• Storage of huge amounts of data.

• Fast reads and low latency.

• Processing of big data (meaningful results).

• Streaming and indexing are critical.

• Other challenges: privacy and expansion.

EXTREMELY DIFFICULT QUESTIONS

1. Why did LinkedIn need to build its own solution, Voldemort?

2. How does TAO resolve the challenges it was built for?

3. How does the real-time search service work at Twitter?

REFERENCES

● A. Auradkar, C. Botev, S. Das, D. De Maagd, A. Feinberg, P. Ganti, … J. Zhang, “Data Infrastructure at LinkedIn,” in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 1370–1381, 2012.

● N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, and H. Li, “TAO: Facebook’s distributed data store for the social graph,” in USENIX ATC, 2013.

● N. Ruflin, H. Burkhart, and S. Rizzotti, “Social-data storage-systems,” Databases Soc. Networks - DBSocial ’11, pp. 7–12, 2011.

● A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, “Data warehousing and analytics infrastructure at Facebook,” Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, Indianapolis, Indiana, USA, pp. 1013–1020, 2010.

● D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a Needle in Haystack: Facebook’s Photo Storage,” in OSDI, 2010, vol. 2010, pp. 47–60.

● M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, “Earlybird: Real-Time Search at Twitter,” Proceedings of the 2012 IEEE 28th International Conference on Data Engineering. IEEE Computer Society, pp. 1360–1369, 2012.

● J. Lin and D. Ryaboy, “Scaling big data mining infrastructure: the twitter experience,”

REFERENCES (II)

● D. Borthakur, J. Gray, J. Sen Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache Hadoop goes realtime at Facebook,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, Athens, Greece, pp. 1071–1080, 2011.

● C. Chen, F. Li, B. C. Ooi, and S. Wu, “TI: an efficient indexing mechanism for real-time search on tweets,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, Athens, Greece, pp. 649–660, 2011.

● G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin, “Fast data in the era of big data: Twitter’s real-time related query suggestion architecture,” Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, New York, New York, USA, pp. 1147–1158, 2013.

● S. Cohen and B. Kimelfeld, “A Social Network Database that Learns How to Answer Queries,” 2013.

● E. Hewitt, Cassandra: The Definitive Guide, illustrated ed., O'Reilly Media, Inc., 2010.

LINKS

● https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson

● http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf

● http://www.slideshare.net/linkedin/jay-kreps-on-project-voldemort-scaling-simple-storage-at-linkedin

● http://data.linkedin.com/

● http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe

THE END
