stig: social graphs & discovery at scale
Social graphs and discovery. At scale.
Jason Lucas, Scalability Architect, Tagged
Stig is...
• A very large-scale non-SQL database...
• But it speaks and can emulate SQL
• A graph-oriented data store...
• But can look like a key value store, relational tables, file system
• A foundation for building general web applications...
• But it particularly excels at social apps
• A distributed system with a shared-nothing architecture...
• But it gives developers an easy-to-manage path to data
• A solution to the complex problem of CAP-limited systems...
Part 1: Stig Project Goals
Part 2: Stig Concepts
Part 3: Lunch Workshop
Part 1: Stig Project Goals
• Facilitate the developer.
• Be like a good waiter
• Easy should be easy, complex should be possible
• Untangle some existing messes
• Scale like crazy. (Without driving ops crazy.)
• Go big
• Go fast
• Go smooth
• Exceed expectations.
• Enable previously unthinkable features
Facilitate the developer
• Decrease the burden.
• Provide a single path to data
• Create a uniform representation available to multiple application languages
• Reduce the need for “defensive programming”
• Enforce consistency.
• Re-introduce atomic transactions
• Control assumptions with assertions
• Promote correctness.
• Provide a more robust data representation
• Support unit testing
Facilitate the developer
• Offer power in simplicity.
• Offer a robust expression language
• Describe effects rather than details of distribution
• Above all else:
• “I want to feel like I'm doing a good job.”
Scale like crazy...
• Use a distributed architecture.
• Shard data over multiple machines
• Use commodity hardware
• Scale as linearly as possible
• Use replicas to speed average access
• Move queries to data.
• Decompose queries by separating areas of concern
• Farm sub-queries to shards which hold the relevant data
• Use comprehensions instead of realizations when possible
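The "move queries to data" idea can be illustrated with a minimal Python sketch. This is not Stig code: the `SHARDS` layout and the `sub_query` and `followers` helpers are hypothetical names invented here. Each shard answers its own sub-query with a lazy generator (a comprehension), and results are only realized when a consumer asks for them.

```python
from itertools import chain, islice

# Hypothetical shards: each maps a user id to the ids that user follows.
SHARDS = [
    {"alice": ["bob", "carol"], "bob": ["carol"]},
    {"carol": ["alice"], "dave": ["alice", "bob", "carol"]},
]

def sub_query(shard, target):
    """Sub-query evaluated against one shard: lazily yield users who follow target."""
    return (user for user, follows in shard.items() if target in follows)

def followers(target):
    """Farm the sub-query out to every shard and chain the lazy results together."""
    return chain.from_iterable(sub_query(shard, target) for shard in SHARDS)

# Only as much work is done as the consumer demands.
first_two = list(islice(followers("carol"), 2))
all_followers = sorted(followers("carol"))
```

Taking only the first two results stops before the second shard is scanned, which is the practical benefit of a comprehension over a realized list.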
Scale like crazy...
• Build for the web.
• Provide durable sessions
• Allow clients to disconnect and reconnect at will
• Continue running in the background
• Increase concurrency.
• Break large objects down into smaller ones
• Escrow deltas around fields which are partitioned or contentious
• Use assertions instead of locks to permit interleaving of operations
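As a rough illustration of escrowed deltas and assertion-guarded updates, here is a single-threaded Python sketch. The `EscrowCounter` class and `apply_if` helper are hypothetical and are not Stig's API; a real system would also have to make the check and the update atomic with respect to other writers.

```python
class EscrowCounter:
    """A contended counter stored as one delta per partition (hypothetical)."""

    def __init__(self, partitions):
        self.deltas = [0] * partitions

    def add(self, partition, amount):
        # Each writer touches only its own partition's delta.
        self.deltas[partition] += amount

    def value(self):
        # Readers fold the deltas together on demand.
        return sum(self.deltas)

def apply_if(counter, assertion, partition, amount):
    """Apply the update only if the assertion would still hold afterwards."""
    if not assertion(counter.value() + amount):
        return False
    counter.add(partition, amount)
    return True

stock = EscrowCounter(partitions=3)
stock.add(0, 5)
stock.add(2, 4)
accepted = apply_if(stock, lambda v: v >= 0, partition=1, amount=-7)
rejected = apply_if(stock, lambda v: v >= 0, partition=1, amount=-3)
```

The second update is refused because it would violate the assertion, yet no lock was ever taken on the counter as a whole.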
Without driving Ops crazy
• Be highly available.
• Replicate storage across multiple machines
• Shift responsibilities between machines transparently
• Bring machines back into service transparently
• Tolerate partitioning.
• Fall back transparently to lower levels of service
• Reconcile database automatically when partitions rejoin
Without driving Ops crazy
• Simplify maintenance.
• Tolerate unreliable hardware
• Make software upgrades easy to manage
• Be flexible with regard to physical topology
• Make system status, performance, and capacity easy to measure and comprehend
• Degrade gracefully under load
• To the greatest degree possible, make the system maintain itself
Exceed expectations
• Enable previously unthinkable features.
• Don’t include histories in your schemas; the database keeps histories
• Design apps with real-time, multi-user communications; database sessions are “chatty”
• Feel free to compute Erdős Numbers or routes to Kevin Bacon
• Test for the existence of interesting data states in constant time, not log time
• Execute queries in time proportionate to the size of the answer, not the size of the database
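An Erdős or Bacon number is a shortest-path length in a collaboration graph. The sketch below shows the usual breadth-first search in Python over a made-up, in-memory graph; it says nothing about how Stig evaluates such walks, and the `hops` function and sample data are assumptions for illustration.

```python
from collections import deque

def hops(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Hypothetical co-authorship graph.
coauthors = {
    "erdos": ["a"],
    "a": ["erdos", "b"],
    "b": ["a", "c"],
    "c": ["b"],
    "z": [],
}
erdos_number_of_c = hops(coauthors, "c", "erdos")
```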
Exceed expectations
• Decrease development cycle time.
• Build working apps on your desktop; the database can be simulated
• Evolve your schema at will; the database doesn’t make a distinction between data and metadata
• Use any language you like; the database looks the same from all clients
Part 2: Stig Concepts
• Representing Graphs
• Deconstructing Commits
• Making Time Flow
• Finding Meaning
• Querying
Representing Graphs...Without Stig
• Graphs in Tables.
• Walks spread outward in waves
• Self-joins proliferate
• Graphs in Key-Value Stores.
• Generally node-centric
• Edges are denormalized conjugate sets
• Non-transactional multi-set is deadly
• Graphs in XML Stores.
• Floating chunk syndrome
• Worst of both worlds
• Graphs in Doc & Graph Stores.
• Typeless, interned at nodes
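The key-value hazard above can be made concrete with a small Python sketch. A plain dict stands in for the store, and `add_friendship` and `is_symmetric` are hypothetical helpers. One edge is denormalized into two adjacency sets, so a failure between the two writes leaves the graph inconsistent.

```python
def add_friendship(store, a, b, crash_between_writes=False):
    """Store one undirected edge as two denormalized adjacency-set writes."""
    store.setdefault(a, set()).add(b)
    if crash_between_writes:
        raise RuntimeError("simulated crash between the two writes")
    store.setdefault(b, set()).add(a)

def is_symmetric(store):
    """True when every stored edge also appears from the other endpoint."""
    return all(
        a in store.get(b, set())
        for a, neighbours in store.items()
        for b in neighbours
    )

store = {}
add_friendship(store, "alice", "bob")
consistent_before = is_symmetric(store)

try:
    add_friendship(store, "alice", "carol", crash_between_writes=True)
except RuntimeError:
    pass
consistent_after = is_symmetric(store)
```

After the simulated crash, alice lists carol but carol does not list alice, and nothing in the store itself can detect or repair that.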
Representing Graphs...With Stig
Locations, Nodes & Edges
[Diagram: two locations, /user/[email protected] and /user/[email protected]; labels shown: person, mafiaplayer, pets, player, owns]
Deconstructing Commits...Without Stig
• Two States.
• Uncommitted: only me
• Committed: everybody else
• One sandbox per connection
• Variable Isolation.
• High isolation limits concurrency
• Low isolation is hard to cope with
• Two Guarantees.
• Written to disk
• Ephemeral
• Some NoSQL Options.
• No transactional integrity
• Post-hoc reconciliation
Deconstructing Commits...With Stig
• Private.
• Only me, but I get as many as I want; maybe ephemeral
• Shared.
• Restricted scope, rapid communication; maybe ephemeral
• Global.
• A singleton, same as commit
• Guarantees
• Self-consistent
• Replicated in data center
• Written to disks
• Replicated to other data centers
Deconstructing Commits...With Stig
Points of View in Diplomacy
[Diagram: points of view: Global; Diplomacy Game (Shared); Alice/Bob Alliance (Shared); Alice (Private); Bob (Private); Carol (Private)]
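One way to picture private, shared, and global points of view is as nested scopes. The Python sketch below is an assumption-heavy illustration, not Stig's design: the `PointOfView` class, its `promote` method, and the exact nesting of the Diplomacy example (private inside alliance inside game inside global) are guesses based only on the labels on this slide.

```python
class PointOfView:
    """A scope of updates that falls back to its parent for reads (hypothetical)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.updates = {}

    def write(self, key, value):
        self.updates[key] = value

    def read(self, key, default=None):
        # Look in this point of view first, then walk outward toward global.
        pov = self
        while pov is not None:
            if key in pov.updates:
                return pov.updates[key]
            pov = pov.parent
        return default

    def promote(self, key):
        # Move an update one level outward so more participants can see it.
        self.parent.write(key, self.updates.pop(key))

world = PointOfView("Global")
game = PointOfView("Diplomacy Game", parent=world)
alliance = PointOfView("Alice/Bob Alliance", parent=game)
alice = PointOfView("Alice", parent=alliance)
bob = PointOfView("Bob", parent=alliance)
carol = PointOfView("Carol", parent=game)

alice.write("plan", "support Bob into Munich")
bob_sees_before = bob.read("plan")
alice.promote("plan")
bob_sees_after = bob.read("plan")
carol_sees = carol.read("plan")
```

Alice's plan is invisible to everyone while private, visible to Bob once promoted to the alliance, and never visible to Carol, who sits outside that shared point of view.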
Making Time Flow...Without Stig
• Time Flows Naturally.
• System clock is OK
• Execution Time ≈ Query Time.
• A query made after an update will see the results of the update because time flow is linear
• The order of events is definite
• Locks Enforce Consistency.
• Updates block each other
• MVCC in Lieu of Locks.
• Reads are writes
• Collisions are rollbacks
Making Time Flow...With Stig
• Time is Uncertain.
• Distributed machines cannot rely on their system clocks
• Declared Dependencies.
• Each query declares its predecessors, so causality is a graph
• The order of events is unknowable, but any topological sort of the graph is OK
• Assertions Enforce Consistency.
• MVCC with Paxos facilitates time travel
• Query: seek a time in the past at which assertions are true
• Update: seek a time in the future at which assertions are still true
Making Time Flow...With Stig
Checkout Time
[Diagram: dependency graph of checkout steps: Display Shopping Cart, Update Qty. of Item, Request Gift Wrap, Specify Shipping, Enter Credit Card, Confirm Order]
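The claim that any topological sort of the causality graph is acceptable can be tried out with Python's standard `graphlib` module. The step names come from the checkout slide, but the dependency edges below are assumed for illustration, since the transcript does not preserve the arrows; the `respects` helper is also hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it declares as predecessors (assumed edges).
depends_on = {
    "display shopping cart": set(),
    "update qty. of item": {"display shopping cart"},
    "request gift wrap": {"display shopping cart"},
    "specify shipping": {"display shopping cart"},
    "enter credit card": {"display shopping cart"},
    "confirm order": {
        "update qty. of item",
        "request gift wrap",
        "specify shipping",
        "enter credit card",
    },
}

def respects(order, depends_on):
    """True when every step appears after all of its declared predecessors."""
    position = {step: i for i, step in enumerate(order)}
    return all(
        position[pred] < position[step]
        for step, preds in depends_on.items()
        for pred in preds
    )

order = list(TopologicalSorter(depends_on).static_order())
```

The four middle steps may come out in any relative order; each such ordering is equally valid because none of them declares another as a predecessor.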
Finding Meaning...Without Stig
• Tables & Views.
• Tables store the base data
• Views collect data from tables and other views
• Views often present performance bottlenecks
• Analysis Belongs to Data Definition.
• Adding or changing a view or index is a schema change
• Programmers must work with DBAs, limiting individual initiative
• Changes have the potential to degrade the data service as a whole
Finding Meaning...With Stig
• Asserted & Inferred Edges.
• Asserted edges store the base data
• Inferred edges collect data from asserted and inferred edges
• Inference is distributed, on-going, and subject to time-travel
• Analysis Belongs to Program Definition.
• Inference rules aren’t “special”
• Programmers can invent as they like
• Scope of risk is limited
Finding Meaning...With Stig
Inferring Friends & Stalkers
[Diagram: Alice and Bob each have a "has friendship" edge to a shared node x; an inferred "is friend of" edge links Alice and Bob]
<a, ‘is friend of’, b>
    if <a, ‘has friendship’, x>
    and <b, ‘has friendship’, x>
    and a is not b;
<a, ‘is stalking’, b>
    if <a, ‘is friend of’, b>
    and a.age >= 18
    and b.age < 18;
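The two inference rules can be mimicked in Python over a set of triples. This is a sketch of the logic only, computed eagerly in memory; the function names, the tuple representation, and the separate `age` table are assumptions, and Stig's distributed, ongoing inference is not modeled.

```python
def infer_friends(triples):
    """<a, 'is friend of', b> if a and b both have the same friendship x and a is not b."""
    members = {}
    for subject, predicate, obj in triples:
        if predicate == "has friendship":
            members.setdefault(obj, set()).add(subject)
    return {
        (a, "is friend of", b)
        for people in members.values()
        for a in people
        for b in people
        if a != b
    }

def infer_stalking(friend_edges, age):
    """<a, 'is stalking', b> if a is friend of b, a.age >= 18, and b.age < 18."""
    return {
        (a, "is stalking", b)
        for a, _, b in friend_edges
        if age[a] >= 18 and age[b] < 18
    }

# Hypothetical asserted edges and ages.
asserted = {
    ("alice", "has friendship", "x"),
    ("bob", "has friendship", "x"),
    ("carol", "has friendship", "y"),
}
age = {"alice": 30, "bob": 15, "carol": 20}

friends = infer_friends(asserted)
stalkers = infer_stalking(friends, age)
```

Note that the second rule consumes edges produced by the first, which mirrors the slide's point that inferred edges collect data from both asserted and inferred edges.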
Querying...Without Stig
• SQL
• Easy-to-use, commonly known, and mostly harmless
• Suffers from poor composability and is useless as a general-purpose programming language
• Map-Reduce, Erlang, etc.
• Not so easy-to-use, not so commonly known, and capable of shooting you in the foot
• Often require knowledge of the underlying distributed architecture, and are still not front-runners as general-purpose programming languages
Querying...With Stig
• Robust and General-Purpose Language.
• Purely functional, lazily evaluated, and strictly, robustly typed
• Pattern-oriented notation for describing walks across graph
• Composability Rules.
• Comprehensions of sequences form the foundation
• Transformations of sequences (map, reduce, filter, zip, etc.) are the building blocks
• Distributed Evaluation Rocks.
• Queries are broken down and sent to the servers where they need to be
• Evaluation occurs in parallel
Querying...With Stig
• Compiled & Stored.
• Queries compile down to machine code and get stored in the graph itself
• Stored programs are subject to on-going analysis
• Programs can call each other
• Library-Driven.
• Language fundamentals support construction of libraries
• We can emulate other languages, such as LINQ and Python
• Clients.
• Currently Java, Perl, PHP, Python, and C/C++
• We can also serve HTTP directly
Querying...With Stig
Mutual Friends
/* function definition */
mutual_friends x y = solve f: [ <x, ‘is friend of’, f>; <y, ‘is friend of’, f> ];
/* function application */
mutual_friends person@/users/alice person@/users/bob;
/* results */
[ { f = person@/users/carol }, { f = person@/users/dave } ];
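For readers who do not know the Stig language, the `solve` pattern above can be read as "find every f that satisfies both edge patterns". A rough Python equivalent follows; the triple set and the sorted list-of-dicts result shape are assumptions chosen to resemble the slide's output, not a description of the Stig API.

```python
def mutual_friends(triples, x, y):
    """Return bindings for f such that both x and y are 'is friend of' f."""
    def friends_of(who):
        return {
            obj
            for subject, predicate, obj in triples
            if subject == who and predicate == "is friend of"
        }
    return [{"f": f} for f in sorted(friends_of(x) & friends_of(y))]

# Hypothetical edges keyed by the location strings used on the slide.
edges = {
    ("/users/alice", "is friend of", "/users/carol"),
    ("/users/alice", "is friend of", "/users/dave"),
    ("/users/alice", "is friend of", "/users/erin"),
    ("/users/bob", "is friend of", "/users/carol"),
    ("/users/bob", "is friend of", "/users/dave"),
}
result = mutual_friends(edges, "/users/alice", "/users/bob")
```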
Wrapup
Is your project...?
• Graph-shaped?
• Representing graphs as graphs (instead of as tables or key pairs) simplifies your life
• Stig graphs are fat, meaning they're really any number of simultaneous, intersecting graphs, so go nuts
• Transactional?
• Reliably atomic state transitions also simplify your life
• Asynchronous transaction management makes it more tolerable
• Real-time?
• Control the influence of updates with shared points-of-view
• Never be blocked waiting for the database to respond
Is your project...?
• Really huge?
• The store scales very close to linearly, so more data just means more machines
• The size of the cluster generally doesn't affect the performance of individual operations
• Deeply analytic?
• Use inferences to describe relations and conditions you're interested in
• Build up arbitrarily complex libraries of inference to extract meaning from data
Open-sourcing this year!
• About our Code.
• Written in C++0x and Haskell, with Python for tools
• Entirely unit-test driven and designed for easy adoption
• Why Open Source?
• We want to give back
• We benefit first and most
• Competitive advantage would be temporary anyway
• Knowing it’s open keeps us on our toes
• There’s more to do than we can do ourselves
• We attract the kind of people we want to work with
Our doors are open
• About Tagged
• #3 in social networking and growing (100+ Million members)
• Located in downtown SF, one of the Top Ten Places to Work according to the San Francisco Business Journal
• Profitable since 2008. We answer only to ourselves and our users
Our doors are open
• About the Stig Team.
• Five full-time engineers with backgrounds in compilers, databases, distributed systems, and AI
• Interns year-round with opportunities to publish
• And yes, we're hiring!
Got ideas?
• Contact us!
• Sign up for Stig news at: www.stigdb.org
• Follow the Tagged Dev Blog at: blog.tagged.com
• Jason Lucas, Architect of Scalable [email protected]
Part 3: Lunch Workshop
• But wait, there’s more!
• Join us as we get our hands messy with food and take a deep dive into the Stig query language and the Stig API!
• Lunch 'N Learn 01:15 PM - 02:15 PM