social networks at scale

Post on 08-Jan-2017

Social Networks @ Scale

Eoin Hurrell, PhD, Data Lead, Cohort

@eoinhurrell

Cohort as a use-case

Social Networks

[Diagram: people (👤) in a social network]

Social Network Analysis

📚 ☕📜

• Provides many tools to solve problems

• Consider your problem before you consider your tools!

• SNA has a long history in sociology

What do you want to know?

Let's talk Graphs

source: http://www.nltk.org/book_1ed/ch04.html Fig 4.16

Social Networks as Big Data

Options for Getting Networks

• Start a new social network from scratch

• Scrape data from the target social networks ahead of time

Options for Examining Networks

• networkx

• A graph database like Neo4j

• pandas, dask and other standard PyData tools are not focused on networks, or cause issues in a production service

Cohort as a use-case

We want to understand friend-of-a-friend relationships, and the knowledge of the people in them, so any existing data is important to us. Python is excellent because sklearn, networkx and other data science libraries exist. It also allows for Spark and Kafka usage as we scale.

We need to get existing data from social networks and be able to process large amounts of data intelligently.

Over half a billion relationships, 72+ million people
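A minimal sketch of the friend-of-a-friend lookup described above, using only the standard library. The names (`build_adjacency`, `friends_of_friends`) and the toy edge list are assumptions for illustration, not Cohort's code; at the scale quoted here this would sit on a graph store or a library such as networkx rather than an in-memory dict.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (person, person) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def friends_of_friends(adjacency, person):
    """Map each person exactly two hops away to the mutual friends linking them."""
    direct = adjacency.get(person, set())
    result = defaultdict(set)
    for friend in direct:
        for candidate in adjacency.get(friend, set()):
            # Skip the person themselves and anyone who is already a direct friend.
            if candidate != person and candidate not in direct:
                result[candidate].add(friend)
    return dict(result)


# Hypothetical toy data.
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "dan"), ("dan", "cat"), ("cat", "eve")]
adjacency = build_adjacency(edges)
fof = friends_of_friends(adjacency, "ann")
```

Keeping the linking friends alongside each result matters for this use case: it records through whom a second-degree contact can be reached.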

Streaming architecture

[Diagram: users (👤) → Single Source of Truth → 🤖 = Realised Views]

www.kappa-architecture.com
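The idea on this slide can be sketched as follows: an append-only log is the single source of truth, and a realised view is whatever a consumer gets by replaying it. This is an assumed illustration; the event fields (`type`, `who`, `whom`) and the in-memory list standing in for a log such as a Kafka topic are hypothetical.

```python
from collections import defaultdict

# Append-only event log: the single source of truth.
# A plain list stands in for a durable log here (assumption).
event_log = [
    {"type": "follow", "who": "ann", "whom": "bob"},
    {"type": "follow", "who": "cat", "whom": "bob"},
    {"type": "unfollow", "who": "ann", "whom": "bob"},
    {"type": "follow", "who": "ann", "whom": "cat"},
]


def build_followers_view(events):
    """Replay events in order to derive a 'who follows whom' view."""
    followers = defaultdict(set)
    for event in events:
        if event["type"] == "follow":
            followers[event["whom"]].add(event["who"])
        elif event["type"] == "unfollow":
            followers[event["whom"]].discard(event["who"])
    return {user: sorted(names) for user, names in followers.items()}


# The view is disposable: it can be rebuilt at any time from the log.
view = build_followers_view(event_log)
```

Because the view is derived, a changed or additional consumer only needs to replay the same log to produce its own view.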

Streaming architecture

[Diagram: users (👤) → 🤖🤖🤖🤖]

In production

Batch calculation

Community detection
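The talk does not say which community detection algorithm was used. As one hedged illustration, the sketch below implements only the local-moving step of a Louvain-style modularity method in plain Python: each node is repeatedly moved to the neighbouring community that gives the largest modularity gain. The function name and toy graph are assumptions; in practice a library routine (networkx ships several under `networkx.algorithms.community`) would be the usual choice.

```python
from collections import Counter, defaultdict


def local_moving(adjacency, max_rounds=20):
    """Group nodes of an undirected graph by greedy modularity gain."""
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    two_m = sum(degree.values())           # twice the number of edges
    label = {node: node for node in adjacency}
    total = dict(degree)                   # summed degree per community label

    def gain(node, community, links):
        # Proportional to the modularity gain of putting `node` in `community`.
        return links[community] * two_m - total[community] * degree[node]

    for _ in range(max_rounds):
        moved = False
        for node in sorted(adjacency):
            current = label[node]
            total[current] -= degree[node]  # take the node out of its community
            links = Counter(label[nbr] for nbr in adjacency[node])
            best = current
            for candidate in sorted(links):
                if gain(node, candidate, links) > gain(node, best, links):
                    best = candidate
            total[best] += degree[node]
            label[node] = best
            moved = moved or best != current
        if not moved:
            break

    groups = defaultdict(list)
    for node in sorted(label):
        groups[label[node]].append(node)
    return sorted(groups.values())


# Hypothetical toy graph: two triangles joined by the single edge c-d.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = local_moving(adjacency)
```

On this toy graph the two triangles come out as separate communities. A full Louvain implementation would also collapse each community into a single node and repeat, which is omitted here.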

Batch calculation

Popularity models (e.g. PageRank)
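To make the PageRank mention concrete, here is a small power-iteration sketch in plain Python. It is an assumed, simplified version for illustration (fixed iteration count, no convergence check), and the `follows` data is hypothetical; a production batch job would use a graph-processing framework rather than this loop.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of nodes it links to."""
    nodes = sorted(set(out_links) | {t for targets in out_links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in out_links.items():
            for target in targets:
                # Each node splits its rank equally among the nodes it links to.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical "who follows whom" data: three people follow cat, cat follows ann.
follows = {"ann": ["cat"], "bob": ["cat"], "cat": ["ann"], "dan": ["cat"]}
scores = pagerank(follows)
most_popular = max(scores, key=scores.get)
```

Here `most_popular` is `"cat"`, the node with the most incoming follows, and `ann` scores above `bob` and `dan` because a popular node follows her.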

Handling Batch calculation

One Trillion Edges: Graph Processing at Facebook-Scale, VLDB '15, A. Ching et al.

How to handle messages like Twitter

SELECT * FROM posts WHERE user_id IN :friend_list ORDER BY timestamp DESC LIMIT 100;

This does not scale 💀

How to handle messages like Twitter

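Redis

[Diagram: Redis holds a list of messages (✉) per user, keyed like 👤:1. A user posts a new message to the Single Source of Truth, and copies (📨) are delivered to the per-user lists.]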

How to handle messages like Twitter

SELECT * FROM posts WHERE id IN :timeline_ids

This scales 😻
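The difference between the two queries is where the work happens: the first assembles a timeline at read time, the second reads ids that were pushed to each follower when the post was written. Below is a hedged in-memory sketch of that fan-out-on-write pattern. The dicts and deques stand in for Redis lists and the posts table, and every name (`follow`, `publish`, `read_timeline`, `TIMELINE_LENGTH`) is an assumption for illustration, not the talk's code.

```python
from collections import defaultdict, deque

TIMELINE_LENGTH = 100  # mirrors the LIMIT 100 in the first query

followers = defaultdict(set)  # author -> users who follow that author
posts = {}                    # post id -> post row (stand-in for the posts table)
# One bounded list of post ids per user (stand-in for one Redis list per user).
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LENGTH))


def follow(user, author):
    followers[author].add(user)


def publish(post_id, author, text):
    """Write path: store the post once, then push its id to every follower."""
    posts[post_id] = {"id": post_id, "user_id": author, "text": text}
    for user in followers[author]:
        # Newest first; the deque drops the oldest id once it is full.
        timelines[user].appendleft(post_id)


def read_timeline(user):
    """Read path: a primary-key lookup per id, like WHERE id IN :timeline_ids."""
    return [posts[post_id] for post_id in timelines[user]]


follow("bob", "ann")
follow("cat", "ann")
follow("bob", "cat")
publish(1, "ann", "hello")
publish(2, "cat", "hi")
publish(3, "ann", "again")
bob_ids = [post["id"] for post in read_timeline("bob")]
```

With real Redis the push and trim would typically be `LPUSH` followed by `LTRIM` on a per-user key. The trade-off is that a post by an author with many followers costs many writes, in exchange for cheap reads.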

Conclusion

• Networks are dense but useful data

• Scalable data science depends on usage, not just traditional form

• Python is useful and powerful at every level of this stack

Thank You!

🔬

Cohort helps you find what you need through the people you know and trust

cohort.is

Eoin Hurrell, PhD, Data Lead, Cohort

@eoinhurrell
