Social Networks @ Scale
Eoin Hurrell, PhD Data Lead, Cohort
@eoinhurrell
Cohort as a use-case
Social Networks
👤 👤 👤 👤 👤 👤 👤 👤
Social Network Analysis
📚 ☕📜
• Provides many tools to solve problems
• Consider your problem before you consider your tools!
• SNA has a long history in sociology
What do you want to know?
Let's talk Graphs
source: http://www.nltk.org/book_1ed/ch04.html Fig 4.16
Social Networks as Big Data
Options for Getting Networks
• Start a new social network from scratch
• Scrape data ahead of time from the target social networks
Options for Examining Networks
• networkx
• A graph database like Neo4j
• pandas, dask and other standard PyData tools are not focused on networks, or cause issues in production services
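As a rough sketch of the networkx option, the snippet below builds a small undirected friendship graph and asks a few basic questions of it. The names in the edge list are invented for illustration and are not Cohort data; `nx.Graph`, `add_edges_from`, `neighbors`, `degree` and `shortest_path` are the networkx calls used.

```python
import networkx as nx

# Hypothetical friendships; each pair is one undirected edge.
friendships = [
    ("ann", "bob"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("ann", "eve"),
]

graph = nx.Graph()
graph.add_edges_from(friendships)

# Who is directly connected to bob?
bob_friends = sorted(graph.neighbors("bob"))

# Which chain of friendships links ann to dan?
path = nx.shortest_path(graph, "ann", "dan")

# Degree (number of direct connections) is the simplest popularity measure.
degrees = dict(graph.degree())
```

This works well while the graph fits in the memory of one machine, which is the point at which a graph database or a streaming design becomes worth considering.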
Cohort as a use-case
• We want to understand friend-of-a-friend relationships and the knowledge of the people in them, so any existing data is important to us.
• Python is excellent because sklearn, networkx and other data science libraries exist. It also allows for Spark and Kafka usage as we scale.
• We need to get existing data from social networks and be able to process large amounts of data intelligently.
Over half a billion relationships, 72+ million people
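A friend-of-a-friend lookup can be sketched in plain Python with adjacency sets. This is only an illustration of the idea under the assumption that relationships are undirected; the helper names `build_adjacency` and `friends_of_friends` and the sample edges are hypothetical, not Cohort's implementation.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (person, person) pairs into a person -> set-of-friends mapping."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def friends_of_friends(adjacency, person):
    """People exactly two hops away: friends of friends who are not
    already direct friends and are not the person themselves."""
    direct = adjacency.get(person, set())
    second_degree = set()
    for friend in direct:
        second_degree |= adjacency.get(friend, set())
    return second_degree - direct - {person}


# Hypothetical sample data.
edges = [
    ("ann", "bob"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("ann", "eve"),
    ("eve", "cat"),
]
adjacency = build_adjacency(edges)
fof = friends_of_friends(adjacency, "ann")
```

At hundreds of millions of relationships the same set arithmetic still applies, but the adjacency data no longer fits comfortably in one process, which motivates the architecture on the following slides.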
Streaming architecture
👤
👤👤 👤👤
Single Source of Truth
www.kappa-architecture.com
🤖
=
Realised Views
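The relationship between a single source of truth and its realised views can be sketched as an append-only event log that is replayed to derive a view. This is a toy, in-memory illustration: the list `event_log` stands in for a durable log (Kafka is mentioned earlier in the talk), and the event shapes and function names are assumptions made for the example.

```python
from collections import defaultdict

# Append-only log: the single source of truth. Events are never edited.
event_log = []


def append_event(kind, actor, target):
    """Record a fact; views are never written to directly."""
    event_log.append({"kind": kind, "actor": actor, "target": target})


def build_follower_view(events):
    """Replay the whole log to derive a 'who follows whom' view."""
    followers = defaultdict(set)
    for event in events:
        if event["kind"] == "follow":
            followers[event["target"]].add(event["actor"])
        elif event["kind"] == "unfollow":
            followers[event["target"]].discard(event["actor"])
    return followers


append_event("follow", "ann", "bob")
append_event("follow", "cat", "bob")
append_event("unfollow", "ann", "bob")

view = build_follower_view(event_log)
```

Because the view is a pure function of the log, it can be thrown away and rebuilt, or a second view with a different shape can be derived from the same events.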
Streaming architecture
👤
👤👤 👤👤
www.kappa-architecture.com
🤖🤖🤖🤖
In production
Batch calculation
👤 👤 👤 👤 👤 👤 👤 👤 👤
Community detection
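One way to try community detection on a small graph is networkx's `greedy_modularity_communities` (Clauset-Newman-Moore greedy modularity maximisation). The talk does not say which algorithm Cohort uses, so this is only one readily available choice, and the graph below (two triangles joined by a single edge) is made up for the example.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

graph = nx.Graph()
graph.add_edges_from([
    # First tightly knit group.
    ("a", "b"), ("b", "c"), ("a", "c"),
    # Second tightly knit group.
    ("d", "e"), ("e", "f"), ("d", "f"),
    # A single bridge between the two groups.
    ("c", "d"),
])

# Each community comes back as a frozenset of nodes.
communities = [set(c) for c in greedy_modularity_communities(graph)]
```

On this graph the two triangles should come out as separate communities, since merging them across the single bridge would lower modularity.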
Batch calculation
👤 👤 👤 👤 👤 👤 👤 👤 👤
Popularity models (e.g. PageRank)
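To make the popularity idea concrete, here is a minimal PageRank by power iteration in plain Python. It is a teaching sketch, not a production implementation: the damping factor of 0.85 and the fixed iteration count are conventional assumptions, and the `follows` data is invented. For real work networkx ships its own `nx.pagerank`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links maps each node to the list of nodes it points to."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every other node splits its rank equally among its targets.
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical "who follows whom" data.
follows = {"ann": ["cat"], "bob": ["cat"], "cat": ["ann"], "dan": ["cat"]}
ranks = pagerank(follows)
most_popular = max(ranks, key=ranks.get)
```

Every iteration touches every edge once, which is why this kind of calculation is run as a batch job over the whole graph rather than per request.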
Handling Batch calculation
One Trillion Edges: Graph Processing at Facebook-Scale, VLDB '15, A. Ching et al.
How to handle messages like Twitter
SELECT * FROM posts WHERE user_id IN :friend_list ORDER BY timestamp DESC LIMIT 100;
This does not scale 💀
How to handle messages like Twitter
Redis
👤:1
✉✉✉✉✉✉
✉✉✉
✉✉✉✉✉✉✉✉
✉✉✉✉
✉✉✉✉✉✉✉
📨
📨
📨
📨
❤
❤
❤
posts a new
Single Source of Truth
📨
How to handle messages like Twitter
SELECT * FROM posts WHERE id IN :timeline_ids
This scales 😻
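The difference between the two queries is where the work happens. Instead of searching all of a user's friends' posts at read time, each new post is pushed onto every follower's timeline when it is written, so a read only needs the stored ids. The sketch below imitates that with in-memory structures: a `deque` per user stands in for a Redis list (in Redis this would typically be `LPUSH` followed by `LTRIM`), and a dict stands in for the `posts` table. All names here are assumptions for illustration.

```python
from collections import defaultdict, deque

TIMELINE_LENGTH = 100  # mirrors the LIMIT 100 in the read-time query

posts = {}                    # stand-in for the posts table: id -> row
followers = defaultdict(set)  # author -> set of followers
# Stand-in for one Redis list per user; old ids fall off the far end.
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LENGTH))


def follow(follower, author):
    followers[author].add(follower)


def publish(post_id, author, text):
    """Write the post once, then fan its id out to every follower."""
    posts[post_id] = {"id": post_id, "user_id": author, "text": text}
    for follower in followers[author]:
        timelines[follower].appendleft(post_id)


def read_timeline(user):
    """Equivalent of SELECT * FROM posts WHERE id IN :timeline_ids."""
    return [posts[post_id] for post_id in timelines[user]]


follow("bob", "ann")
follow("bob", "cat")
publish(1, "ann", "hello")
publish(2, "cat", "hi there")
publish(3, "dan", "bob does not follow dan")

bob_timeline_ids = [post["id"] for post in read_timeline("bob")]
```

The trade-off is that a write by someone with many followers becomes expensive, while reads, which are far more frequent, become a lookup by primary key.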
Conclusion
• Networks are dense but useful data
• Scalable data science depends on usage, not just traditional form
• Python is useful and powerful at every level of this stack
Thank You!
🔬
Cohort helps you find what you need through the people you know and trust
cohort.is