Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)

Details: Building Using The NetflixOSS Architecture May 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS http://www.linkedin.com/in/adriancockcroft


DESCRIPTION

A collection of information taken from previous presentations, used as drill-down material to support discussion of specific topics during the tutorial.

TRANSCRIPT

Page 1: Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)

Details: Building Using The NetflixOSS Architecture

May 2013, Adrian Cockcroft

@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft

Page 2:

Architectures for High Availability

Cassandra Storage and Replication

NetflixOSS Components

Page 3:

Component Micro-Services
Test with Chaos Monkey, Latency Monkey

Page 4:

Three Balanced Availability Zones
Test with Chaos Gorilla

[Diagram: load balancers in front of Cassandra and EVCache replicas in Zones A, B, and C; Chaos Gorilla takes out an entire zone]

Page 5:

Triple Replicated Persistence
Cassandra maintenance affects individual replicas

[Diagram: load balancers in front of Cassandra and EVCache replicas in Zones A, B, and C]

Page 6:

Isolated Regions

[Diagram: US-East and EU-West regions, each with its own load balancers and Cassandra replicas in Zones A, B, and C]

Page 7:

Failure Modes and Effects

Failure Mode        | Probability | Current Mitigation Plan
Application Failure | High        | Automatic degraded response
AWS Region Failure  | Low         | Active-Active using Denominator
AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
Datacenter Failure  | Medium      | Migrate more functions to cloud
Data store failure  | Low         | Restore from S3 backups
S3 failure          | Low         | Restore from remote archive

Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Working on Active-Active in 2013.

Page 8:

Application Resilience

Run what you wrote
Rapid detection
Rapid response

Page 9:

Run What You Wrote

• Make developers responsible for failures
  – Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
  – Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
  – Don’t let cascading timeouts stack up
• Dynamic configuration options – Archaius
  – http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html

Page 10:

Resilient Design – Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
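The resilient design idea behind Hystrix can be reduced to a fail-fast call wrapper with a degraded fallback. A minimal plain-Java sketch of that pattern follows; the class and method names are illustrative, not the Hystrix API (Hystrix adds thread pools, metrics, and request collapsing on top of this core idea).

```java
import java.util.function.Supplier;

// Sketch of the circuit-breaker-with-fallback pattern: after too many
// consecutive failures the primary call is skipped entirely (fail fast)
// and the degraded fallback is returned instead.
public class CommandSketch {
    private int consecutiveFailures = 0;
    private final int failureThreshold;

    public CommandSketch(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    public String execute(Supplier<String> primary, Supplier<String> fallback) {
        if (consecutiveFailures >= failureThreshold) {
            return fallback.get();            // breaker open: fail fast
        }
        try {
            String result = primary.get();
            consecutiveFailures = 0;          // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback.get();            // serve the degraded response
        }
    }
}
```

The key property is that a slow or broken dependency never blocks the caller indefinitely: the response is either the real answer or a fast, degraded one.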

Page 11:

Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

• Computers (Datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Kill individual instances without customer impact
• Latency Monkey (coming soon)
  – Inject extra latency and error return codes
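The Latency Monkey bullet above describes fault injection at the call level. A small sketch of that idea, with illustrative names (the real monkey operates on live service traffic rather than a wrapper object): with some probability a call gets extra delay or an error return code instead of the real response.

```java
import java.util.Random;
import java.util.function.Supplier;

// Sketch of latency/error injection: wrap a service call and, per the
// configured rates, add artificial delay or substitute an error code.
public class LatencyInjector {
    private final Random rng;
    private final double errorRate;     // fraction of calls that return an error
    private final long extraLatencyMs;  // delay added to every call

    public LatencyInjector(long seed, double errorRate, long extraLatencyMs) {
        this.rng = new Random(seed);
        this.errorRate = errorRate;
        this.extraLatencyMs = extraLatencyMs;
    }

    public int call(Supplier<Integer> service) throws InterruptedException {
        Thread.sleep(extraLatencyMs);          // inject extra latency
        if (rng.nextDouble() < errorRate) {
            return 503;                        // inject an error return code
        }
        return service.get();
    }
}
```

Running clients against such a wrapper verifies that their timeouts and fallbacks actually fire before a real dependency misbehaves in production.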

Page 12:

Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html

[Diagram: Edda collects AWS state (instances, ASGs, etc.), Eureka service metadata, and AppDynamics request-flow data]

Page 13:

Edda Query Examples

Find any instances that have ever had a specific public IP address:

$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]

Show the most recent change to a security group:

$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
 {
 …
   "ipRanges" : [
     "10.10.1.1/32",
     "10.10.1.2/32",
+    "10.10.1.3/32",
-    "10.10.1.4/32"
 …
 }

Page 14:

Platform Outage Taxonomy

Classify and name the different types of things that can go wrong

Page 15:

YOLO

Page 16:

Zone Failure Modes

• Power Outage
  – Instances lost, ephemeral state lost
  – Clean break and recovery, fail fast, “no route to host”
• Network Outage
  – Instances isolated, state inconsistent
  – More complex symptoms, recovery issues, transients
• Dependent Service Outage
  – Cascading failures, misbehaving instances, human errors
  – Confusing symptoms, recovery issues, byzantine effects

Page 17:

Zone Power Failure

• June 29, 2012 AWS US-East - The Big Storm
  – http://aws.amazon.com/message/67457/
  – http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
• Highlights
  – One of 10+ US-East datacenters failed generator startup
  – UPS depleted -> 10min power outage for 7% of instances
• Result
  – Netflix lost power to most of a zone, evacuated the zone
  – Small/brief user impact due to errors and retries

Page 18:

Zone Failure Modes

[Diagram: US-East and EU-West regions with load balancers and Cassandra replicas in Zones A, B, and C; callouts mark a zone power outage, a zone network outage, and a zone dependent service outage]

Page 19:

Regional Failure Modes

• Network Failure Takes Region Offline
  – DNS configuration errors
  – Bugs and configuration errors in routers
  – Network capacity overload
• Control Plane Overload Affecting Entire Region
  – Consequence of other outages
  – Lose control of remaining zones’ infrastructure
  – Cascading service failure, hard to diagnose

Page 20:

Regional Control Plane Overload

• April 2011 – “The big EBS Outage”
  – http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  – Human error during network upgrade triggered cascading failure
  – Zone level failure, with brief regional control plane overload
• Netflix Infrastructure Impact
  – Instances in one zone hung and could not launch replacements
  – Overload prevented other zones from launching instances
  – Some MySQL slaves offline for a few days
• Netflix Customer Visible Impact
  – Higher latencies for a short time
  – Higher error rates for a short time
  – Outage was at a low traffic level time, so no capacity issues

Page 21:

Regional Failure Modes

[Diagram: US-East and EU-West regions with load balancers and Cassandra replicas in Zones A, B, and C; callouts mark a regional network outage and control plane overload]

Page 22:

Dependent Services Failure

• June 29, 2012 AWS US-East - The Big Storm
  – Power failure recovery overloaded EBS storage service
  – Backlog of instance startups using EBS root volumes
• ELB (Load Balancer) Impacted
  – ELB instances couldn’t scale because EBS was backlogged
  – ELB control plane also became backlogged
• Mitigation Plans Mentioned
  – Multiple control plane request queues to isolate backlog
  – Rapid DNS based traffic shifting between zones

Page 23:

Application Routing Failure
June 29, 2012 AWS US-East - The Big Storm

[Diagram: US-East and EU-West regions with load balancers and Cassandra replicas in Zones A, B, and C; a zone power outage takes out one US-East zone]

Applications not using zone-aware routing kept trying to talk to dead instances and timing out.

Eureka service directory failed to mark down dead instances due to a configuration error

Effect: higher latency and errors
Mitigation: fixed the config, and made zone-aware routing the default
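The mitigation above, zone-aware routing, can be sketched in a few lines: prefer healthy instances in the caller's own availability zone, and only cross zones when no local instance is available. The types below are illustrative, not the actual Ribbon/Eureka API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of zone-aware routing: route within the local zone when possible,
// never route to instances marked down, fall back to remote zones otherwise.
public class ZoneAwareRouter {
    public record Instance(String id, String zone, boolean up) {}

    public static List<Instance> candidates(List<Instance> all, String localZone) {
        List<Instance> local = new ArrayList<>();
        List<Instance> remote = new ArrayList<>();
        for (Instance i : all) {
            if (!i.up()) continue;  // skip dead instances entirely
            (i.zone().equals(localZone) ? local : remote).add(i);
        }
        return local.isEmpty() ? remote : local;  // cross zones only as a fallback
    }
}
```

With this as the default, a zone power outage drops the dead zone's instances out of every caller's candidate list instead of producing cascading timeouts.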

Page 24:

Dec 24th 2012

[Diagram: US-East and EU-West regions with load balancers and Cassandra replicas in Zones A, B, and C]

Partial Regional ELB Outage

• ELB (Load Balancer) Impacted
  – ELB control plane database state accidentally corrupted
  – Hours to detect, hours to restore from backups
• Mitigation Plans Mentioned
  – Tighter process for access to control plane
  – Better zone isolation

Page 25:

Global Failure Modes

• Software Bugs
  – Externally triggered (e.g. leap year/leap second)
  – Memory leaks and other delayed action failures
• Global configuration errors
  – Usually human error
  – Both infrastructure and application level
• Cascading capacity overload
  – Customers migrating away from a failure
  – Lack of cross region service isolation

Page 26:

Global Software Bug Outages

• AWS S3 Global Outage in 2008
  – Gossip protocol propagated errors worldwide
  – No data loss, but service offline for up to 9hrs
  – Extra error detection fixes, no big issues since
• Microsoft Azure Leap Day Outage in 2012
  – Bug failed to generate certificates ending 2/29/13
  – Failure to launch new instances for up to 13hrs
  – One line code fix
• Netflix Configuration Error in 2012
  – Global property updated to broken value
  – Streaming stopped worldwide for ~1hr until we changed back
  – Fix planned to keep history of properties for quick rollback
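The planned fix in the last bullet, keeping a history of property values for quick rollback, is simple to sketch. The store below is illustrative only, not the actual Netflix property service: each property keeps its full value history, so a broken update is undone in one step rather than by guessing the old value.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of a property store that remembers prior values for fast rollback.
public class PropertyHistory {
    private final Map<String, Deque<String>> history = new HashMap<>();

    // Record a new value on top of the property's history stack.
    public void set(String key, String value) {
        history.computeIfAbsent(key, k -> new ArrayDeque<>()).push(value);
    }

    // Current value is the top of the stack.
    public String get(String key) {
        Deque<String> values = history.get(key);
        return values == null ? null : values.peek();
    }

    // Roll back a broken update by discarding the newest value.
    public String rollback(String key) {
        Deque<String> values = history.get(key);
        if (values != null && values.size() > 1) values.pop();
        return get(key);
    }
}
```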

Page 27:

Global Failure Modes

[Diagram: US-East and EU-West regions with load balancers and Cassandra replicas in Zones A, B, and C; software bugs and global configuration errors (“Oops…”) hit both regions, and capacity demand migrating away from a failure causes cascading capacity overload]

Page 28:

Denominator
Portable DNS management

[Diagram: Denominator maps a common model onto varied (and mostly broken) vendor API models via DNS vendor plug-ins – AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc.; use cases include Edda and multi-region failover]

Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)

Page 29:

Highly Available Storage

A highly scalable, available and durable deployment pattern

Page 30:

Micro-Service Pattern

One keyspace, replaces a single table or materialized view

[Diagram: many different single-function REST clients call a stateless data access REST service (Astyanax Cassandra client) fronting a single-function Cassandra cluster managed by Priam, between 6 and 72 nodes; an optional datacenter update flow feeds in; AppDynamics provides service flow visualization]

Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones

Page 31:

Stateless Micro-Service Architecture

[Diagram: instance stack – Linux base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, non-Java apps; monitoring via log rotation to S3, AppDynamics machine agent, and Epic/Atlas; Java (JDK 6 or 7) with AppDynamics app agent monitoring, GC and thread dump logging; Tomcat running the application war file, base servlet, platform, client interface jars, and Astyanax; healthcheck, status servlets, JMX interface, Servo autoscale]

Page 32:

Astyanax
Available at http://github.com/netflix

• Features
  – Complete abstraction of connection pool from RPC protocol
  – Fluent Style API
  – Operation retry with backoff
  – Token aware
• Recipes
  – Distributed row lock (without zookeeper)
  – Multi-DC row lock
  – Uniqueness constraint
  – Multi-row uniqueness constraint
  – Chunked and multi-threaded large file storage

Page 33:

Initializing Astyanax

// Configuration either set in code or nfastyanax.properties
platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
netflix.environment=test
default.astyanax.readConsistency=CL_QUORUM
default.astyanax.writeConsistency=CL_QUORUM
MyCluster.MyKeyspace.astyanax.servers=127.0.0.1

// Must initialize platform for discovery to work
NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);

// Open a keyspace instance
Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");

Page 34:

Astyanax Query Example

Paginate through all columns in a row:

ColumnList<String> columns;
int pageSize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process each column in this page
        }
    }
} catch (ConnectionException e) {
    // handle connection failure
}

Page 35:

Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware

[Diagram: token-aware clients write to Cassandra nodes (each with local disks) spread across Zones A, B, and C]

1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.

Requests can choose to wait for one node, a quorum, or all nodes to ack the write.

SSTable disk writes and compactions occur asynchronously.
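The choice of waiting for one node, a quorum, or all nodes is Cassandra's consistency level. A small sketch of the arithmetic, for the replication factor of 3 used here (one replica per zone); the class is illustrative, not the Astyanax/Cassandra API:

```java
// Sketch of write consistency levels: how many replica acks a write must
// collect before the client can proceed, for replication factor rf.
public class ConsistencyLevel {
    public static int acksRequired(String level, int rf) {
        switch (level) {
            case "ONE":    return 1;           // fastest, weakest
            case "QUORUM": return rf / 2 + 1;  // majority: 2 of 3
            case "ALL":    return rf;          // strongest, least available
            default: throw new IllegalArgumentException(level);
        }
    }

    public static boolean writeSucceeds(String level, int rf, int acksReceived) {
        return acksReceived >= acksRequired(level, rf);
    }
}
```

With CL_QUORUM (as in the nfastyanax.properties example earlier) a write survives one replica being down, which is why losing a single zone out of three does not block writes.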


Page 36:

Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum

[Diagram: US clients write to Cassandra nodes (each with local disks) in US-East Zones A, B, and C]

1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.

[Diagram: EU clients write to Cassandra nodes in EU-West Zones A, B, and C; replication between regions crosses a 100+ms latency link]

Page 37:

Cassandra Instance Architecture

[Diagram: instance stack – Linux base AMI (CentOS or Ubuntu); Tomcat and Priam on the JDK with healthcheck and status; monitoring via AppDynamics machine agent and Epic/Atlas; Java (JDK 7) with AppDynamics app agent monitoring, GC and thread dump logging; Cassandra server with local ephemeral disk space (2TB of SSD or 1.6TB disk) holding the commit log and SSTables]

Page 38:

Priam – Cassandra Automation
Available at http://github.com/netflix

• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”

Page 39:

ETL for Cassandra

• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata

Page 40:

Cloud Architecture Patterns

Where do we start?

Page 41:

Datacenter to Cloud Transition Goals

• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled down time, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
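The "Faster" goal is stated as mean and 99th percentile latency. A minimal sketch of computing both from a batch of samples (illustrative only; production monitoring systems use streaming estimates rather than sorting full sample sets):

```java
import java.util.Arrays;

// Sketch: mean and nearest-rank percentile over a batch of latency samples.
public class LatencyStats {
    public static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    public static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // nearest-rank: smallest value such that at least p% of samples are <= it
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Tracking the 99th percentile alongside the mean matters because a handful of very slow calls can be invisible in the mean while still hurting a noticeable fraction of users.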

Page 42:

Datacenter Anti-Patterns

What do we currently do in the datacenter that prevents us from meeting our goals?

Page 43:

Rewrite from Scratch

Not everything is cloud specific
Pay down technical debt

Robust patterns

Page 44:

Netflix Datacenter vs. Cloud Arch

Datacenter                 | Cloud
Central SQL Database       | Distributed Key/Value NoSQL
Sticky In-Memory Session   | Shared Memcached Session
Chatty Protocols           | Latency Tolerant Protocols
Tangled Service Interfaces | Layered Service Interfaces
Instrumented Code          | Instrumented Service Patterns
Fat Complex Objects        | Lightweight Serializable Objects
Components as Jar Files    | Components as Services

Anti-Architecture

Page 45:

Tangled Service Interfaces

• Datacenter implementation is exposed
  – Oracle SQL queries mixed into business logic
• Tangled code
  – Deep dependencies, false sharing
• Data providers with sideways dependencies
  – Everything depends on everything else

Anti-pattern affects productivity, availability

Page 46:

Untangled Service Interfaces

Two layers:
• SAL - Service Access Library
  – Basic serialization and error handling
  – REST or POJOs defined by data provider
• ESL - Extended Service Library
  – Caching, conveniences, can combine several SALs
  – Exposes faceted type system (described later)
  – Interface defined by data consumer in many cases
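The two-layer split can be sketched as follows; the interfaces and names are illustrative, not the actual Netflix libraries. The SAL is the thin provider-owned access contract, and the ESL wraps it with consumer-facing conveniences such as caching.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the SAL/ESL layering described above.
public class LayeredClient {
    // SAL: basic access plus serialization/error handling, defined by the
    // data provider. Consumers never see the provider's storage details.
    public interface SubscriberSal {
        String fetchProfileJson(String subscriberId);
    }

    // ESL: layered on one or more SALs, adds caching and conveniences;
    // its interface is shaped by what consumers need.
    public static class SubscriberEsl {
        private final SubscriberSal sal;
        private final Map<String, String> cache = new HashMap<>();

        public SubscriberEsl(SubscriberSal sal) { this.sal = sal; }

        public String profile(String id) {
            // Cached read: the SAL is only hit on a miss.
            return cache.computeIfAbsent(id, sal::fetchProfileJson);
        }
    }
}
```

Because consumers depend only on the ESL interface, the provider can change storage or serialization behind the SAL without touching any consumer code, which is exactly the tangle the previous slide warns against.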

Page 47:

Service Interaction Pattern
Sample Swimlane Diagram

Page 48:

NetflixOSS Details

• Platform entities and services

• AWS Accounts and access management

• Upcoming and recent NetflixOSS components

• In-depth on NetflixOSS components

Page 49:

Basic Platform Entities

• AWS Based Entities
  – Instances and Machine Images, Elastic IP Addresses
  – Security Groups, Load Balancers, Autoscale Groups
  – Availability Zones and Geographic Regions
• NetflixOSS Specific Entities
  – Applications (registered services)
  – Clusters (versioned Autoscale Groups for an App)
  – Properties (dynamic hierarchical configuration)

Page 50:

Core Platform Services

• AWS Based Services
  – S3 storage, up to 5TB files, parallel multipart writes
  – SQS – Simple Queue Service, messaging layer
• Netflix Based Services
  – EVCache – memcached based ephemeral cache
  – Cassandra – distributed persistent data store

Page 51:

Security Architecture

• Instance Level Security baked into base AMI
  – Login: ssh only allowed via portal (not between instances)
  – Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
  – Each app has its own security group (firewall ports)
  – Fine grain user roles and resource ACLs
• Key Management
  – AWS Keys dynamically provisioned, easy updates
  – High grade app specific key management support

Page 52:

AWS Accounts

Page 53:

Accounts Isolate Concerns

• paastest – for development and testing
  – Fully functional deployment of all services
  – Developer tagged “stacks” for separation
• paasprod – for production
  – Autoscale groups only, isolated instances are terminated
  – Alert routing, backups enabled by default
• paasaudit – for sensitive services
  – To support SOX, PCI, etc.
  – Extra access controls, auditing
• paasarchive – for disaster recovery
  – Long term archive of backups
  – Different region, perhaps different vendor

Page 54:

Reservations and Billing

• Consolidated Billing
  – Combine all accounts into one bill
  – Pooled capacity for bigger volume discounts
  – http://docs.amazonwebservices.com/AWSConsolidatedBilling/1.0/AWSConsolidatedBillingGuide.html
• Reservations
  – Save up to 71% on your baseline load
  – Priority when you request reserved capacity
  – Unused reservations are shared across accounts

Page 55:

Cloud Access Gateway

• Datacenter or office based
  – A separate VM for each AWS account
  – Two per account for high availability
  – Mount NFS shared home directories for developers
  – Instances trust the gateway via a security group
• Manage how developers login to cloud
  – Access control via ldap group membership
  – Audit logs of every login to the cloud
  – Similar to awsfabrictasks ssh wrapper: http://readthedocs.org/docs/awsfabrictasks/en/latest/

Page 56:

Cloud Access Control

[Diagram: developers log in through the cloud access ssh gateway to www-prod (userid wwwprod), dal-prod (userid dalprod), and cass-prod (userid cassprod) instances; security groups don’t allow ssh between instances]

Page 57:

AWS Usage (coming soon)
For test, carefully omitting any $ numbers…

Page 58:

Dashboards with Pytheas (Explorers)
http://techblog.netflix.com/2013/05/announcing-pytheas.html

• Cassandra Explorer
  – Browse clusters, keyspaces, column families
• Base Server Explorer
  – Browse service endpoints configuration, perf
• Anything else you want to build…

Page 59:

Cassandra Explorer

Page 60:

Cassandra Explorer

Page 61:

Cassandra Clusters

Page 62:

Bubble Chart

Page 63:

Slideshare NetflixOSS Details

• Lightning Talks Feb S1E1
  – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1
  – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Lightning Talks March S1E2
  – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap
• Security Architecture
  – http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
• Cost Aware Cloud Architectures – with Jinesh Varia of AWS
  – http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix

Page 64:

Amazon Cloud Terminology Reference
See http://aws.amazon.com/ – this is not a full list of Amazon Web Service features

• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
  – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
  – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
  – Reserved Instances – pre-paid to reduce cost for long term usage
  – Availability Zone – datacenter with own power and cooling hosting cloud instances
  – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)