building a distributed data-platform - a perspective on current trends in computing


Upload: charles-care

Post on 19-Jan-2015

DESCRIPTION

Data, dev-ops, and cloud services: Building a distributed data-platform. A lecture given to Computer Science students at the University of Warwick, February 2012.

TRANSCRIPT

Page 1: Building a distributed data-platform  - A perspective on current trends in computing

Data, dev-ops, and cloud services

Building a distributed data-platform

Charles Care

Engineering Team, Kasabi / Talis

Page 2: Building a distributed data-platform  - A perspective on current trends in computing

Talk overview

● About me...
● What Kasabi is,
  ● what we are trying to do
  ● how we are working to achieve that
  ● a quick walk-through
● Discussion of the Kasabi platform team
  ● Our technology / architecture
  ● Our engineering culture
  ● Lessons learnt

Page 3: Building a distributed data-platform  - A perspective on current trends in computing

Views are mine...

…and not necessarily those of my (current/past) employers

Page 4: Building a distributed data-platform  - A perspective on current trends in computing

About me...

Page 5: Building a distributed data-platform  - A perspective on current trends in computing

About me...

● 2001-2004 – BSc Computer Science (Warwick)
● 2004-2008 – PhD Computer Science (Warwick)
● 2007-2011 – BT Plc
  ● Technical risk analyst – BT Global MPLS Network
  ● Software Engineer – Infrastructure for Financial Markets
  ● Senior Software Engineer – Central software standards and tools
● 2011-Present – Talis/Kasabi
  ● Software Engineer – Semantic web platform

Page 6: Building a distributed data-platform  - A perspective on current trends in computing

About Kasabi

Page 7: Building a distributed data-platform  - A perspective on current trends in computing

About Kasabi

● Data market place
● Bringing together data...
  ● owners
  ● consumers

● Lowering the barrier for data-driven apps to enter the market

● Enabling new opportunities for aggregating and mixing data

Page 8: Building a distributed data-platform  - A perspective on current trends in computing

Data licensing today

[Diagram: Data Owners and Data Consumers, connected by bespoke, expensive contracts]

Page 9: Building a distributed data-platform  - A perspective on current trends in computing

Kasabi as a data platform

[Diagram: Kasabi as a platform connecting Data Owners, Third-party services, Application Developers, Data enthusiasts, Data engineers, and API developers]

Page 10: Building a distributed data-platform  - A perspective on current trends in computing

About Kasabi

● Publish datasets using standard APIs
● Access data using standard APIs
  ● Query a dataset using SPARQL
  ● Search a dataset using a simple full-text search

● Define, contribute, and share your own APIs
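As a rough illustration of what "query a dataset using SPARQL" looks like from a client, the sketch below builds a GET URL that carries the query text in a `query` parameter, which is the convention defined by the SPARQL protocol. The endpoint URL and the `apikey` parameter name are hypothetical placeholders, not Kasabi's documented API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only.
ENDPOINT = "http://api.example.org/dataset/my-dataset/apis/sparql"

QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_url(endpoint, query, api_key=None):
    """Return a GET URL carrying a SPARQL query in the `query` parameter."""
    params = {"query": query}
    if api_key is not None:
        params["apikey"] = api_key  # assumed parameter name
    return endpoint + "?" + urlencode(params)


def query_of(url):
    """Decode the SPARQL query text back out of a URL built above."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_sparql_url(ENDPOINT, QUERY, api_key="demo-key")
```

The resulting URL could be fetched with any HTTP client; the response format would depend on the service.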

Page 11: Building a distributed data-platform  - A perspective on current trends in computing

Data marketplace

http://www.kasabi.com/

Page 12: Building a distributed data-platform  - A perspective on current trends in computing

A dataset

Page 13: Building a distributed data-platform  - A perspective on current trends in computing

Access data using standard APIs

Page 14: Building a distributed data-platform  - A perspective on current trends in computing

Contribute custom APIs

Page 15: Building a distributed data-platform  - A perspective on current trends in computing

Example – contributed APIs

Page 16: Building a distributed data-platform  - A perspective on current trends in computing

Current organisation

● Product development
● Data engineering
● Customer operations
● Platform development

Page 17: Building a distributed data-platform  - A perspective on current trends in computing

Current organisation

● Product development
● Data engineering
● Customer operations
● Platform development

Page 18: Building a distributed data-platform  - A perspective on current trends in computing

Platform architecture

Page 19: Building a distributed data-platform  - A perspective on current trends in computing

[Diagram: Data Platform. A load balancing and routing layer in front of Update services, Search services, and Query services, which sit over the Datasets]

● Need to store and update datasets
● Access data via various services
● Must scale with load and increasing data
● Must be tolerant to failure
● Extensible
  ● Should be easy to add new services over time

Page 20: Building a distributed data-platform  - A perspective on current trends in computing

To distribute...

...or not to distribute

Page 21: Building a distributed data-platform  - A perspective on current trends in computing

[Diagram: Distributed Platform. A routing layer in front of several Update service, Search service, and SPARQL service instances, with a slot for a new service. The diagram also shows a Dynamic Gossip Network, a Sequence Service, a Storage Service, and Monitoring Services.]

Page 22: Building a distributed data-platform  - A perspective on current trends in computing

[Diagram: Distributed Platform, updates. A routing layer in front of several Update service, Search service, and SPARQL service instances, with a slot for a new service, plus the Dynamic Gossip Network, Sequence Service, Storage Service, and Monitoring Services.]

- Updates are sequenced
- Data stored in distributed storage

Page 23: Building a distributed data-platform  - A perspective on current trends in computing

[Diagram: Distributed Platform, updates. A routing layer in front of several Update service, Search service, and SPARQL service instances, with a slot for a new service, plus the Dynamic Gossip Network, Sequence Service, Storage Service, and Monitoring Services.]

- Updates are gossiped around network
- Here a SPARQL node realises that it should apply the update
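The update flow on these slides (updates are sequenced, then gossiped until each node applies them) can be sketched as a toy simulation. This is not the Kasabi implementation: the `Node` class, the push-to-one-random-peer rounds, and the rule of applying updates strictly in sequence order are all assumptions made for illustration.

```python
import random


class Node:
    """A service node that applies sequenced updates in order."""

    def __init__(self, name):
        self.name = name
        self.seen = {}      # sequence number -> update payload
        self.applied = []   # payloads applied so far, in sequence order
        self.next_seq = 1   # next sequence number this node may apply

    def receive(self, seq, payload):
        """Record an update, then apply any contiguous run now available."""
        self.seen.setdefault(seq, payload)
        while self.next_seq in self.seen:
            self.applied.append(self.seen[self.next_seq])
            self.next_seq += 1


def gossip_round(nodes, rng):
    """Every node pushes all updates it knows about to one random peer."""
    for sender in nodes:
        peer = rng.choice([n for n in nodes if n is not sender])
        for seq, payload in list(sender.seen.items()):
            peer.receive(seq, payload)


def run_until_converged(nodes, rng, max_rounds=1000):
    """Gossip until all nodes hold the same updates; return rounds used."""
    for rounds in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(n.seen == nodes[0].seen for n in nodes):
            return rounds
    raise RuntimeError("nodes did not converge")


rng = random.Random(42)
nodes = [Node(f"sparql-{i}") for i in range(5)]
# Updates arrive at different nodes, and out of order.
nodes[0].receive(2, "add triple B")   # held back until update 1 arrives
nodes[3].receive(1, "add triple A")
rounds = run_until_converged(nodes, rng)
```

Until gossip completes, different nodes can answer from different states; once it has, every node has applied the same updates in the same order.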

Page 24: Building a distributed data-platform  - A perspective on current trends in computing

[Diagram: Distributed Platform, query. A routing layer in front of several Update service, Search service, and SPARQL service instances, with a slot for a new service, plus the Dynamic Gossip Network, Sequence Service, Storage Service, and Monitoring Services.]

SPARQL queries will now reflect the update that was submitted

Page 25: Building a distributed data-platform  - A perspective on current trends in computing

Monolithic vs distributed

● Monolithic
  ● Easy to synchronise events and data
  ● Consistent views and queries
  ● Less inter-process communication / less network overhead
  ● Easier to optimise for high throughput
  ● Single code-base
  ● Fewer processes to monitor
● Distributed
  ● Service-oriented - separate concerns run in isolated processes (and can be scaled independently)
  ● Development is component-based
    – Changes are more focussed / helps avoid scope-creep
  ● Deployment can be localised to avoid downtime
  ● Failure is more likely – so you need to plan for it
  ● Easier to integrate out-of-the-box software – e.g. using standard Apache Solr
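One common way to "plan for" failure between services is to retry a failed call with a growing delay. The sketch below is a generic illustration rather than anything described in the talk; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with exponential backoff.

    `sleep` is injectable so the delays can be observed without waiting.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt < attempts - 1:
                # Delay doubles after each failed attempt.
                sleep(base_delay * 2 ** attempt)
    raise last_error


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("service unavailable")
        return "ok"

    return operation


delays = []
result = call_with_retries(make_flaky(2), sleep=delays.append)
```

Retries only help with transient faults; a caller still needs a fallback for the case where every attempt fails.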

Page 26: Building a distributed data-platform  - A perspective on current trends in computing

Distributed data platform

● Separate services for each API

● Communication via Gossip messages

● Have to manage eventual consistency

● Highly scalable

● Easy to add new services

● Use standard protocols and open-source components
  ● HTTP libraries / REST / ZeroMQ / Apache Thrift
  ● RDF and SPARQL using Apache Jena
  ● Search using Apache Solr
  ● Avoid modification and forks

● Deploy into Amazon EC2 (also using: S3, EMR, and ELB)

Page 27: Building a distributed data-platform  - A perspective on current trends in computing

Benefits of using cloud services

Page 28: Building a distributed data-platform  - A perspective on current trends in computing

Consider a start-up in 2002

● Have an idea...

● Get funding (development, op-ex, cap-ex)

● Acquire servers
  ● Set up your servers
    – mail, web, source code repo, build systems
    – development, staging, live
  ● Some 'cloud' services
    – …, SourceForge, shared servers, etc.
● Build, and go to market
  ● Probably embedding open-source components
● Delivery based on full-stack, monolithic architectures

Page 29: Building a distributed data-platform  - A perspective on current trends in computing

Consider a start-up in 2012

● Have an idea...

● Get funding (development capital, op-ex)
  ● you will probably not get cap-ex
● Use cloud services... rent rather than buy
  ● SaaS – Software as a Service
    – Why would you run your own (chat/email etc.)?
    – Host your code in GitHub/BitBucket etc.
  ● PaaS – Platform as a Service
    – Do you need to control the full stack?
    – Could you leverage platforms like Heroku, Joyent, AppEngine etc.?
    – Amazon RDS
  ● IaaS – Infrastructure as a Service
    – Cloud services to provide 'bare metal'
● Build and go to market quickly
  ● scale elastically over time

Page 30: Building a distributed data-platform  - A perspective on current trends in computing

But what about the enterprise?

● Benefits of cloud services are already transforming the enterprise
  ● Private clouds
  ● Virtual appliances
  ● Cloud bursting
  ● Independent scaling
  ● Separation of concerns
  ● SOA architecture
● And in future...
  ● Appetite for IaaS is growing
  ● PaaS and SaaS will follow.
  ● Perimeter security will be replaced by localised security boundaries

Page 31: Building a distributed data-platform  - A perspective on current trends in computing

So how do we build this stuff...?

Page 32: Building a distributed data-platform  - A perspective on current trends in computing

How it all happens

● Constantly iterating through...
  ● Requirements
  ● Development (Test-driven)
  ● Testing/Review
  ● Deployment
  ● Operation
● We're an Agile, dev-ops team... so all the above is a shared responsibility

Page 33: Building a distributed data-platform  - A perspective on current trends in computing

Being a dev-ops team...

● Removing barriers between development and operations
● Shared responsibilities rather than distrust
● Everyone has root access
● Developers are responsible for operating the systems they build
● Everyone is free to make changes
  ...and responsible for managing the roll-out of those changes
● Ops/Deployment/Monitoring are automated
● Everyone should have full-stack awareness
● Read more...
  ● http://dev2ops.org/blog/2010/2/22/what-is-devops.html
  ● http://www.jedi.be/blog/
  ● http://en.wikipedia.org/wiki/Devops
  ● http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr

Page 34: Building a distributed data-platform  - A perspective on current trends in computing

Life-cycle of a change

Page 35: Building a distributed data-platform  - A perspective on current trends in computing

Requirements and Planning

● Identification of requirement
● Planning
  ● Break down big changes into smaller tasks
    – Can the change be deployed in small steps?
    – Can the change be dark-deployed?
  ● Understand the wider impact
  ● Find middle ground between generic and specific
● Team is self-organising
  ● People pull work from the prioritised, planned stories

Page 36: Building a distributed data-platform  - A perspective on current trends in computing

Branch based development

● One branch per change, squash before merge

Page 37: Building a distributed data-platform  - A perspective on current trends in computing

Writing the code

● Work on a branch
  ● don't know if/when you'll merge
● Test-driven
  ● Unit tests first
  ● Do acceptance tests need to change?
  ● What technology? Which tool-sets?
● Smoke testing
  ● How do you know it works?
  ● What's different in production?
  ● What are the risks of failure?
  ● Feature flags?
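A feature flag lets new code ship switched off and be enabled later, which is one way to reduce the risk of a change in production. The sketch below is a minimal illustration; the `FeatureFlags` class and the `new-ranking` flag name are hypothetical and not taken from the Kasabi code-base.

```python
class FeatureFlags:
    """Holds the set of feature names that are currently switched on."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def is_enabled(self, name):
        return name in self._enabled


def search(query, flags):
    """Route to the new code path only when its flag is on."""
    if flags.is_enabled("new-ranking"):
        return f"new-ranking results for {query!r}"
    return f"results for {query!r}"


# The new path is deployed but dark until the flag is enabled.
dark = search("warwick", FeatureFlags())
live = search("warwick", FeatureFlags(enabled=["new-ranking"]))
```

In practice the enabled set would come from configuration, so that switching a feature on does not need a new release.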

Tests run: 110, Failures: 0, Errors: 0, Skipped: 2

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39 seconds
[INFO] Finished at: Sat Feb 18 15:20:36 GMT 2012
[INFO] Final Memory: 33M/240M
[INFO] ------------------------------------------------------------------------

Page 38: Building a distributed data-platform  - A perspective on current trends in computing

Writing the code

● Avoid unnecessary scope-creep
  ● “I'll just fix this...”
  ● “It would be much cleaner if I re-factored this...”
  ● “It would be neat if I also added this...”
  ● …however, these observations can be written as new stories
  ● …and sometimes it's good to fix things before they cause pain
  ● …if extra changes are really necessary, can they be implemented separately?
  ● …team should be empowered to fix technical debt
  ● ...managing scope-creep is a shared responsibility
● Be prepared to abandon a change if it's taking too long; maybe it needs more planning?
● Should you be pairing?
● Should you demo your work?

Page 39: Building a distributed data-platform  - A perspective on current trends in computing

Code review

● Code review possible with tools for distributed teams (e.g. Gerrit or ReviewBoard)
● If you're not following a strict pairing policy, code-review is vital
● Useful to make others aware of changes
● Gerrit
  ● Build agent automatically builds your change and runs tests – verify +/- 1
  ● Invite others to review your code; they can give it a score between -2 and +2
  ● Can only deploy code once at least one person has given a +2
  ● Work-flow is customisable
  ● Self-organising... anyone can review

$> git commit
$> git review
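The gating rule described above can be written down as a small predicate. This is a sketch of the rule, not Gerrit's own code: `can_submit` is a hypothetical helper. It also treats a -2 as a veto, which matches Gerrit's default Code-Review label behaviour but is an assumption as far as this team's customised work-flow goes.

```python
def can_submit(verified, review_scores):
    """Decide whether a change may be merged and deployed.

    verified:      the build agent's score (-1, 0, or +1)
    review_scores: reviewers' scores, each between -2 and +2
    """
    if any(not -2 <= score <= 2 for score in review_scores):
        raise ValueError("review scores must be between -2 and +2")
    build_passed = verified == 1
    approved = any(score == 2 for score in review_scores)
    vetoed = any(score == -2 for score in review_scores)  # assumed veto rule
    return build_passed and approved and not vetoed


ready = can_submit(verified=1, review_scores=[1, 2])
needs_approval = can_submit(verified=1, review_scores=[1, 1])
```

Note that two +1 scores do not add up to a +2; at least one reviewer has to give the full approval.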

Page 40: Building a distributed data-platform  - A perspective on current trends in computing

Code review (2)

Page 41: Building a distributed data-platform  - A perspective on current trends in computing

Code review (3)

Page 42: Building a distributed data-platform  - A perspective on current trends in computing

Merge / Deployment

● Merge & Deployment
  ● One-click deployment
  ● Developer should press the button
  ● Code is merged into the master/release branch
  ● Build server automatically checks out the code and builds, tags, and uploads the release to an artefact repository
  ● Package is automatically deployed on all servers
    – Extra orchestration for external-facing services to avoid “thundering-herd” problems
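The talk does not say what the "extra orchestration" consists of. One plausible form is a rolling deployment, where servers are upgraded in small waves with a pause between them so that they do not all restart at the same moment. The sketch below illustrates that idea only; `rolling_deploy`, its parameters, and the server names are hypothetical.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rolling_deploy(servers, deploy, batch_size=2, pause_seconds=30.0,
                   sleep=time.sleep):
    """Deploy to `servers` wave by wave, pausing between waves.

    `deploy` is called once per server; `sleep` is injectable for testing.
    Returns the waves in the order they were deployed.
    """
    groups = list(batches(servers, batch_size))
    waves = []
    for index, group in enumerate(groups):
        for server in group:
            deploy(server)
        waves.append(list(group))
        if index < len(groups) - 1:
            sleep(pause_seconds)  # let this wave settle before the next
    return waves


deployed, pauses = [], []
waves = rolling_deploy(["web-1", "web-2", "web-3", "web-4", "web-5"],
                       deployed.append, batch_size=2, pause_seconds=5,
                       sleep=pauses.append)
```

A real roll-out would usually also check the health of each wave before moving on.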

Page 43: Building a distributed data-platform  - A perspective on current trends in computing

Managing infrastructure

● Puppet or Chef
● Build packages (e.g. DEB or RPM)
● Centralise configuration management
● Utilising cloud compute infrastructure
  ● Amazon EC2
  ● Amazon S3
  ● Elastic load balancers
  ● Elastic Map-Reduce
● Application monitoring
  ● Metrics
  ● Log analysis
  ● Internal monitoring
  ● External checks

Page 44: Building a distributed data-platform  - A perspective on current trends in computing

Lessons learnt

(again, my views!)

Page 45: Building a distributed data-platform  - A perspective on current trends in computing

Technical lessons learnt

● Use distributed SOA-based services to reduce tight-coupling

● Monitor everything...
● Leverage cloud offerings
  ● wrap them with well-defined interfaces to avoid lock-in
● Design systems to scale
● Use open and unmodified components where possible
  ● Standard components fronting external APIs
  ● E.g. Jena, Solr, HAProxy, Apache
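Wrapping a cloud offering behind a well-defined interface might look like the sketch below: application code depends only on an abstract store, so a provider-specific implementation can be swapped without touching callers. The `BlobStore` interface and its method names are hypothetical. An S3-backed class would implement the same two methods, but is omitted here to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The only storage interface the application code is allowed to use."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store `data` under `key`, replacing any existing value."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the data stored under `key`; raise KeyError if absent."""


class InMemoryBlobStore(BlobStore):
    """Stand-in implementation, useful for tests and local development."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]


def archive_dataset(store: BlobStore, name: str, payload: bytes) -> str:
    """Application code: knows about BlobStore, not about any provider."""
    key = f"datasets/{name}"
    store.put(key, payload)
    return key


store = InMemoryBlobStore()
key = archive_dataset(store, "example", b"<rdf/>")
```

The same in-memory implementation doubles as a test fixture, which fits the test-driven approach described earlier in the talk.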

Page 46: Building a distributed data-platform  - A perspective on current trends in computing

Practices that have helped us

● Dev-ops culture
● Pragmatic approach to agile development
  ● Task allocation should be 'pull', rather than 'push'
  ● Teams should be self-organising
  ● Pairing when working on new problems
● Test-Driven-Development (TDD)
● Continuous integration
● Peer-review of code
● Continuous deployment

Page 47: Building a distributed data-platform  - A perspective on current trends in computing

…so, in summary...

Page 48: Building a distributed data-platform  - A perspective on current trends in computing

Conclusion

● Isolate your design into components
● Empower your team to release small changes frequently
● Leverage hosted/cloud offerings

Page 49: Building a distributed data-platform  - A perspective on current trends in computing

Thanks for listening!

Page 50: Building a distributed data-platform  - A perspective on current trends in computing

Credits

● Thanks for the invite to speak
● Thanks to Kasabi / Talis Systems Ltd

● Sign up at http://www.kasabi.com

Graphics from http://www.iconarchive.com/, http://www.oxygen-icons.org and http://www.icons-land.com

Page 51: Building a distributed data-platform  - A perspective on current trends in computing

Questions?