Application Architecture Operational Aspects
Disclaimer
This presentation is a journey into the operational aspects of application architecture.
This presentation is quite old, but the concepts remain.
Most of the content was excerpted from external resources whose references I have lost.
Course Given at La Rochelle University in France in 2009
http://eventtoons.com/home
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
Introduction
What is the Operational Aspect of an architecture?
Concerns centered around Run time Environment
Achieve Service Level Requirements
Deployment Units, their connections, locations, nodes
Also called Quality of Service, ilities, etc.
The Operational Aspects are part of development lifecycle
Operational Aspects Concerns
Availability: scheduled service hours, outage costs, speed of service recovery, disaster recovery
Process and data integrity
Standards
Cost
Security: access to system / data, threats, controls
Systems management: event and log management, configuration management, security management, performance management, scheduling, backup and recovery
Timescales
Skills (user and IT)
Data currency
Performance: response time, throughput, capacity
Scalability
There are more disciplines / areas of concern than listed here!
Modeling Operational Aspects
What does the Operational Model contain?
Domain analysis (actors and use cases)
Plans for achieving Functional and non-functional requirements
The systems management strategy and constraints
[Diagram: Requirements and the Current IT Environment feed the IT Architecture Design, documented through the Architecture Overview Diagram, Interaction Diagram, and Detailed Design]
Operational Model Terminology
RAS
Reliability
Availability
Serviceability
RAS terms center entirely around uptime
Originally used by mainframe vendors
RASP
Reliability
Availability
Scalability
Performance
RASP terms describe both uptime and scalable performance
Operational Model Overview
The operational aspect of architecture
Documents the placement of the solution's components
Outlines the systems management aspects of the solution
Focuses on runtime systems design
Developed in concert with the Component Model
Documented via several views, developed in parallel and iteratively through the phases of an engagement
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
Reliability
Defined as the probability of a failure within a given time period (or, how frequently a failure occurs)
Failure rate is often denoted λ (lambda)
Typically measured as
MTBF: Mean Time Between Failures
FIT rate: Failures In Time (failures per 1,000,000,000 device-hours)
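The two measures are directly convertible; a minimal sketch, assuming the standard definition of FIT as failures per 10⁹ device-hours (function names are illustrative):

```python
def fit_to_mtbf_hours(fit_rate: float) -> float:
    """FIT = failures per 10^9 device-hours, so MTBF = 10^9 / FIT."""
    return 1e9 / fit_rate

def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Inverse conversion: MTBF in hours back to a FIT rate."""
    return 1e9 / mtbf_hours

# A component rated at 100 FIT has an MTBF of 10 million hours:
mtbf = fit_to_mtbf_hours(100.0)
```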
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
Availability
Availability means the system is open for business
Business for a retail store means Customers are browsing and buying
Most retail stores have planned downtime for holidays and inventory, or simply close during off-peak hours like late night or early morning
Dimensions of Availability
Functionality: does the system do what it is supposed to do?
Performance: does the system function within the acceptable performance criteria?
Data accuracy: is the data provided by the system accurate and complete?
Availability
Availability can be expressed numerically as the percentage of the time that a service is available for use.
Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time
Influencing Factors
MTBF: Mean Time Between Failure
MTTR: Mean Time To Recovery
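The two influencing factors combine into the steady-state form of the formula above; a small sketch (the example numbers are illustrative):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# e.g. a failure every 1000 hours with a 1-hour recovery:
a = availability(1000.0, 1.0)   # ~0.999, i.e. roughly "three nines"
```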
Availability
Defined as the percentage of time that an application is processing requests
Measured in terms of uptime
typically nines (99.999% is five nines)

Availability (%)   Unavailable per year
99                 3.65 days
99.9               8.75 hours
99.99              52 minutes
99.999             5 minutes
99.9999            30 seconds
Service Level Agreements
Define what you mean by available
The system is available when
The home page displays within 2 seconds when you navigate to the URL
You can add items to the shopping cart in 1 second or less
You can purchase items in your shopping cart using a credit card in 15 seconds or less
Your definition should be testable with automated tools or third party vendors
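As a sketch of such an automated check, assuming a hypothetical URL and the 2-second home-page criterion above (standard library only; a real probe would also validate page content):

```python
import time
import urllib.request

def check_home_page(url: str, threshold_s: float = 2.0) -> bool:
    """Return True if the page fully loads within the SLA threshold."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=threshold_s) as resp:
            resp.read()  # the SLA says "displays", so fetch the whole body
    except OSError:
        return False  # unreachable or timed out: not available per the SLA
    return (time.monotonic() - start) <= threshold_s
```

Running such a probe on a schedule, and recording the results, gives the availability percentage directly in terms of the SLA definition.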
Availability Requires People
People are the biggest cause of downtime
Organization - ensure skills are available or on call when required
Procedures - Operators need correctly documented, tested and maintained procedures
Reliability vs. Availability
Reliability generally has the greater impact on end-user perception
Frequent failures are irritating
Good reliability, bad availability
Infrequent, but potentially major, downtimes
e.g. electrical power grid
Availability generally has the greater impact on operations
Extended downtime can cripple a business
Bad reliability, good availability
Frequent, but minor, failures
e.g. mobile communications
If the application is fault tolerant, and failover is instantaneous, then
Reliability and Availability can generally be treated as a single objective
The application is reliable as long as there is a server to failover to
The application is available as long as there is at least one server up
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
High Availability
HA refers to application or service availability targets of 99.9 percent or higher availability.
In contrast, the Service Availability Forum defines HA applications or services as applications or services with an availability objective of ‘‘five nines,’’ i.e., 99.999 percent.
For an application that can be accessed at any time, the former definition (99.9 percent) implies unscheduled downtime of 8.76 hours (525.6 minutes) and availability of 8751.24 hours per year (given 8760 hours in a non leap year).
This is equivalent to a few unscheduled outages in a year and a restoration of service in minutes.
In order to provide this level of availability, an application or service requires a set of HA technologies, IT processes, and services supporting HA (the focus of this paper), as well as an IT organization that supports HA.
High Availability
The first critical aspect of HA management is the understanding and documenting of customer requirements for availability.
Understanding the business requirements clearly can help minimize overinvestment in areas that do not add needed value
Reaching this understanding can be a joint effort of the availability management, service level management and service financial management, requirements engineering, and architecture teams.
High Availability Four Key Goals (KGI)
Based on experience and industry trends, there are four key goals associated with HA:
1. Maximizing or extending application or service uptime, i.e., mean time between service failures (MTBSF);
2. Eliminating or minimizing the impact of service related incidents by detecting and resolving component incidents before they impact application or service availability
3. Minimizing unplanned or unscheduled downtime of applications or services, i.e., mean time to recover service (MTTRS);
4. Eliminating or minimizing planned or scheduled downtime (i.e., downtime for changes, releases, and maintenance work).
These goals (KGIs, a term introduced in COBIT 3) are consistent with ITIL V3 service design documentation, and are related to continuous operations (CO) and continuous availability (HA + CO).
High Availability Four KGI and IT Processes
High Availability Stages of the services lifecycle
High Availability Service Strategy
Service strategy helps in defining availability requirements and rationalizing expenditures for improving service availability by detailing the relationship between service, IT, functional, and business strategy.
As an element in service strategy, service portfolios can be grouped into service tiers, with each tier having its own set of service-level objectives (SLOs) and service-level requirements (SLRs).
The service targets for each SLO may vary by service tier.
Service tiers can in turn include availability tiers, with key differences in their availability objectives.
Key SLOs associated with service availability can help with gathering and documenting service availability requirements.
High Availability Service Design
Service design involves determining and documenting service requirements and designing services to meet or exceed a set of functional and nonfunctional requirements.
Availability management is an IT process that is part of the service design stage of the service lifecycle.
Service design is directly responsible for using availability architecture patterns, technologies, and standards in both the application design and technology infrastructure design processes.
The ITIL version 3 service design concept is a critical change from ITIL version 2.
In V2, availability management was part of service delivery.
By moving it to service design, ITIL version 3 makes it clear that waiting until service delivery to plan service levels, availability levels, capacity levels, continuity plans, security plans, and financial plans will not result in an efficient design.
High Availability Service Transition
Service transition moves the service package into operational mode and involves the development of the base configuration information and knowledge management related to the service.
This includes documentation of the service architecture, service related operational procedures, and other service specific documentation.
Involves the testing, evaluation, and validation of the service in a pre-production environment including testing, evaluation, and validation of HA technologies and capabilities.
Change, release, and transition planning must also be performed, including operational readiness and final production deployment.
Processes employed in service transition include asset configuration and knowledge management, change management, transition planning and support, release and deployment management, and service testing, validation, and evaluation processes.
High Availability Service Operation
Service operations in production involve operational activities such as:
Event and Incident management, which is critical for MTBSF, MTBCF, and MTTRS.
Post-deployment configuration audits, including HA configuration audits.
Post-deployment operational audits, such as change audits and maintenance audits.
Advanced change management capabilities and change models for HA services and applications.
Advanced release management capabilities and release models for HA services and applications.
Post-deployment operational work also involves day-to-day maintenance activities, both reactive and proactive, operational, infrastructural, and minor code-related changes, and major releases.
High Availability Service Improvement
Service improvement includes the development and implementation of service, application, and infrastructure availability improvement plans.
Availability improvement plans can be based
On thorough availability architecture analysis (i.e., identifying gaps between current availability capabilities and target availability architecture) or
On the ad hoc development and implementation of service, application, infrastructure, and operational architectural improvements as they relate to and impact service availability
High Availability Process and Tools
High Availability Architecture Patterns
High Availability Measuring HA
Question to be answered
What percentage of the time is the application usable?
For HA, it's measured as the number of nines
e.g. 99.999% is five nines
Calculated using simple probability
High Availability Measuring HA Example
Hardware
Cluster of 8 2-CPU servers, 99% Uptime each
Question
What is the availability of this configuration if a total of 8 CPUs are required to service user requests within SLA requirements?
Solution
If a total of 8 CPUs are required, this implies four servers are sufficient to service application requests.
For the application to fail, five of the eight servers must fail simultaneously: 0.01^5 = 1e-10
Predicted application availability is 99.99999999% (annual downtime of ~3 ms)
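The estimate above treats 0.01⁵ as the failure probability; counting the different ways five or more of the eight servers can fail (a binomial sum) gives a slightly worse, but still tiny, figure. A sketch:

```python
from math import comb

def cluster_availability(n: int, k: int, a: float) -> float:
    """Probability that at least k of n independent servers,
    each with availability a, are up at the same time."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

# 8 servers at 99% each, at least 4 needed to meet the SLA:
unavailability = 1 - cluster_availability(8, 4, 0.99)   # ~5.5e-9
```

Accounting for the 56 ways to pick five failing servers raises the unavailability to roughly 5.5 × 10⁻⁹, still an availability of about 99.9999945%, far beyond any single server.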
High Availability Measuring HA – In reality …
Application availability is more complicated than it appears at first glance.
Human error is one of the biggest contributors to application downtime
Servers are rarely truly independent
A server failure may increase load on the remaining servers, triggering a cascade effect
Errors in shared components (network switches, clustering, power systems) can impact multiple servers
High Availability Measuring HA – Sequential Dependency
Components connected in a chain rely on the previous component for availability
The total availability is always lower than the availability of the weakest link
Server 1 → Server 2 → Server 3
Availability (A) = A_S1 × A_S2 × A_S3
High Availability Measuring HA – Sequential Dependency Example
Database Server (98%) → Network (98%) → Web Server (97.5%) → Desktop (96%)
Availability = Database × Network × Web Server × Desktop
Availability = 98% × 98% × 97.5% × 96% = 89.89%
Total Infrastructure Availability = 89.89%
High Availability Measuring HA – Redundant Dependency Example
Two Database Servers (98% each) → Network (98%) → Web Server (97.5%) → Desktop (96%)
Database Availability = 1 – ((1 – 0.98) × (1 – 0.98)) = 0.9996 = 99.96%
Availability = Database × Network × Web Server × Desktop
Availability = 0.9996 × 0.98 × 0.975 × 0.96 = 0.9169
Total Infrastructure Availability = 91.69%
Total availability is higher than the availability of the individual links
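Both composition rules, serial chains and redundant replicas, can be captured in a few lines; a sketch reproducing the examples above:

```python
def serial(*components: float) -> float:
    """Chain: up only if every component is up."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel(*replicas: float) -> float:
    """Redundancy: up unless every replica is down."""
    all_down = 1.0
    for a in replicas:
        all_down *= (1 - a)
    return 1 - all_down

db = parallel(0.98, 0.98)              # 0.9996
total = serial(db, 0.98, 0.975, 0.96)  # ~0.9169
```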
High Availability Measuring HA – Reality
If an application has three tiers with 99% uptime each, what’s the up-time for the application?
Measure probability as being up
Probability = .99 * .99 * .99
97% uptime = not even two nines!
When adding a tier, application availability is not even as good as the weakest link
even if the new tier is more reliable than the other tiers, adding a tier always reduces application availability
High Availability Synthesis
Use redundancy with failover to increase availability by eliminating Single Points Of Failure (SPOFs)
Decouple tiers and as much as possible make each tier self-sufficient ( or at least fail gracefully)
Keep Common Sense
Not all applications need HA: it is expensive!
There is still room for human error and unavoidable downtime (e.g. certain upgrades)
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
Eliminating SPOFs Introduction
SPOF = Single Point Of Failure
Whenever a single server can die and take down the application (or part of an application), that server is a SPOF
Eliminating SPOFs increases application availability
When a working system can take over for a failed system, that is called failover
A system that can fail over is not a SPOF
Component Redundancy
Eliminates single point of failure
Active / Active configuration
Example: Web Farm
Active / Passive
Example: Cluster of SQL Servers
Use High Availability Patterns
Eliminating SPOFs N-Tiers Architecture
Front tier: local or global load balancers are used today; they can be either hardware- or software-based
Next tier: generally software-based; load balancing can keep track of the session
Inner tiers: mainly software solutions, with local and distributed cache management
Eliminating SPOFs HA Load Balancer
Local HA Load Balancers
Typically a master/slave configuration
Both Load Balancers receive all the traffic
The Load Balancers communicate directly over a dedicated cable
When the slave detects failure of the master, it assumes all responsibility for the current connections
May even be able to failover stateful connections, including HTTPS
Global Load Balancers
Used to direct traffic to a particular data center
Use an Authoritative Name Server e.g. to resolve www to particular data center
For disaster recovery, the www resolves to the primary data center unless it is down, in which case it resolves to the backup
For regional load-balancing, the www is resolved to the geographically closest data center
Modern Global Load Balancers do both Global and Local balancing
Eliminating SPOFs HA Load Balancer
Known appliances
BigIP
Alteon
Software
Continuent (ex EMIC) a/cluster
Apache has multiple load balancing and failover plug-ins
BEA natively provides HTTP load balancing, as does IIS
Eliminating SPOFs HA Database
HA Databases do not normally require additional programming in the application tier
Often implemented in the JDBC driver level or below
Failover may cause current pending transactions to roll back, but with a real HA database, no previously committed transactions are lost
The most reliable HA Database configuration is master/slave
The slave server is always ready for the master to die
One-way replication may even work across datacenters
Eliminating SPOFs HA Database
Hardware
SAN/NAS BAY with RAIDx
Software
Continuent m/cluster for MySQL (heterogeneous SQL Server, Sybase and Oracle clusters in Q3 2006)
Shared-nothing architecture: load-balance reads / broadcast writes
Oracle RAC
RAC is a kind of distributed cache on top of Oracle.
Requires the same OS and the same Oracle version
No load balancing, no failover within transactions
Only supports retry on new connections or retry of reads.
Eliminating SPOFs HA Application Tiers
Application tiers can be stateless
Stateless tiers (e.g. web servers) are HA using simple redundancy
The only problem is that statelessness in one tier usually just passes the buck to the next tier, which is almost always more expensive
Application tiers are almost always stateful
Only two things can be lost: state and in-flight requests
To achieve HA, the application tier must either manage its state resiliently (e.g. in a clustered coherent cache) or back it up to a central store
Idempotent actions can be replayed by the web tier when a server fails
Eliminating SPOFs HA Application Tiers
Application cache can be implemented in Java applications
using the JCache standard API
using open source tools (OSCache, Ehcache)
using commercial tools (like Tangosol Coherence)
Application cache can be implemented in .NET applications
ASP.NET application cache is a smart in-memory repository for data
Caching Application Block
Eliminating SPOFs Web Tier to App Tier
Load balancers slow way down if the load balancing is sticky (keeping track of session/server pairs)
The best approach is for the load balancer to round-robin or randomize its load balancing across all available web servers
Why keep a separate web tier? There is a good reason:
Web servers (e.g. Apache, IIS, JES) can handle lots of concurrent connections, serve static content, and route requests to app servers
The web server plug-in for routing to the app server can do the sticky load balancing, guaranteeing that HTTP sessions stick!
All application servers offer clustering.
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transactions
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
Transactions Introduction
A transaction is a sequence of operations that change the state of an object or collection of objects in a well defined way.
Transactions are useful because they satisfy constraints about what the state of an object must be before, after or during a transaction.
For example, a particular type of transaction may satisfy a constraint that an attribute of an object must be greater after the transaction than it was before the transaction.
Sometimes, the constraints are unrelated to the objects that the transactions operate on.
For example, a transaction may be required to take place in less than a certain amount of time.
Transactions Must comply with ACID Properties
Atomicity: All-or-nothing process.
Atomicity guarantees that all operations within a transaction happen within a single unit of work
Consistency: System in consistent state.
Consistency guarantees that all transactional resources within a transaction are left in a consistent state either after the transaction succeeds and is committed or after it fails and all resources are rolled back to their previous state
Isolation: Not affected by other.
Isolation ensures that even though multiple transactions may be running in parallel they appear to be running in a serial manner
Durability: Once committed, effects persist.
Durability ensures that once a transaction has been marked as committed all information relating to the transaction has been committed to durable storage
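Atomicity and rollback are easy to demonstrate with SQLite, which provides ACID transactions (the table, names and simulated crash are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    # "with conn" makes the two updates one unit of work:
    # commit on success, rollback if anything raises.
    with conn:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        raise RuntimeError("simulated crash between debit and credit")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

try:
    transfer(conn, "alice", "bob", 50)
except RuntimeError:
    pass

# The debit was rolled back with the failed transaction: alice still has 100.
balance = conn.execute(
    "SELECT balance FROM account WHERE name = 'alice'").fetchone()[0]
```

Without the transaction boundary, the crash would leave the system in an inconsistent state with 50 debited and never credited.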
Transactions XA Protocol
Open Group's X/Open Distributed Transaction Processing (DTP) model
Defines how an application program uses a transaction manager to coordinate a distributed transaction across multiple resource managers
Any resource manager that adheres to the XA specification can participate in a transaction coordinated by an XA-compliant transaction manager, thereby enabling different vendors' transactional products to work together.
All XA-compliant transactions are distributed transactions
XA supports both single-phase and two-phase commit
The transaction manager is responsible for making the final decision either to commit or rollback any distributed transaction.
For the transaction to commit successfully all of the individual resources must commit successfully; if any of them are unsuccessful, the transaction must roll back in all of the resources.
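This decision rule is the two-phase commit protocol; a toy coordinator sketch (class and method names are invented for illustration, and a real XA transaction manager also handles logging and crash recovery):

```python
class Resource:
    """A toy transactional resource that votes in phase 1."""
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> bool:
        # Phase 1: vote yes (prepared) or no (aborted).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(resources) -> bool:
    """Commit only if every resource votes yes; otherwise roll back all."""
    if all(r.prepare() for r in resources):
        for r in resources:        # Phase 2: commit everywhere
            r.commit()
        return True
    for r in resources:            # Any "no" vote aborts every resource
        r.rollback()
    return False


ok = two_phase_commit([Resource("db"), Resource("queue", can_commit=False)])
# ok is False: one resource voted no, so every resource rolled back
```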
Transactions Tools
JTA (Java Transaction API) is required for XA-style transactions. An XA transaction involves coordination among the various resource managers, which is the responsibility of the transaction manager.
JTA specifies standard Java interfaces between the transaction manager and the application server and the resource managers.
Transactions XA and SOA
Today some technologists are even positioning the enterprise service bus (ESB) as a standard mechanism for integrating systems with heterogeneous data interfaces.
While ESB and Web services can clearly be used to move data between disparate data sources, I would not recommend using a set of Web services to implement a distributed transaction if the transaction requirements could be achieved by using XA, even if the enabling Web services technology supported WS-Transaction (WS-TX).
The one advantage Web services have over XA is that XA is not a remote protocol.
Transactions WS-TX for SOA
WS-Transaction is the name of the OASIS group that is currently working on the transaction management specification. WS-TX is the name of the committee, and they are working on three specs:
WS-Coordination (WS-C) - a basic coordination mechanism on which protocols are layered;
WS-AtomicTransaction (WS-AT) - a classic two-phase commit protocol similar to XA;
WS-BusinessActivity (WS-BA) - a compensation based protocol designed for longer running interactions, such as BPEL scripts.
In practice, WS-AT should be used in conjunction with XA to implement a true distributed transaction.
WS-TX essentially extends transaction coordinators, such as OTS/JTS and Microsoft DTC to handle transactional Web services interoperability requirements.
Transactions Birth of XTP
Extreme Transaction-Processing Platform
Traditional online transaction processing (OLTP) architectures and products are wearing thin when it comes to supporting the growing transactional workloads generated by modern service oriented and event-driven architectures (SOAs and EDAs)
Users are looking for alternatives based on low-cost commodity hardware and modern software.
Transactions XTP Platform
An XTPP will be characterized by the following features:
A cohesive programming model supporting the development paradigms offered by the containers
Event-processing and service containers to enable development and execution of rich applications supporting even the most-complex requirements
Flow management container to enable application development and execution through composition of loosely coupled components (services or event handlers)
A batch container to support batch and high-performance computing (HPC)-style applications
A common distributed transaction manager, leveraged by the application containers for supporting transaction integrity in highly distributed architectures
Transactions XTP Platform
A high-performance computing fabric, a communication and data-sharing infrastructure combining enterprise service bus and distributed caching mechanisms to support fast event propagation, service request dispatching and transactional data sharing between XTP application components and external applications.
Tera-architecture support to manage transparent and dynamic application and system components deployment and execution over large clusters of Linux, Unix or Windows servers but also pervasive computing processors.
Development tool, security, administration and management capabilities
Transactions XTP Platform
Transactions XTP Platform Vendors
IBM WebSphere XD
an add-on product for several WebSphere (and some non-IBM) products, providing distributed caching, stream processing, a batch framework, virtualization and other XTP features; IBM has announced support for OSGi in the WebSphere family and is one of the strongest SCA supporters.
Oracle declared on many public occasions that XTP was an area of strategic investment
in March 2007, it acquired Tangosol
Oracle also announced that it will introduce SCA and OSGi support in the next release of Oracle Fusion Middleware.
Oracle bought BEA and Sun
Microsoft's Windows Workflow Foundation
Layers a flow management, event-driven programming model atop the "classic," client/server-oriented .NET environment.
Transactions XTP Platform Vendors
Tibco ActiveMatrix
Hybrid combination of container technology (POJO, Java EE and .NET), policy management and core enterprise service bus technologies.
Red Hat
acquired Mobicents, an open-source, JSLEE 1.0-compliant event-driven application platform
Mobicents runs on the microkernel foundation of the Java EE-based JBoss Application Server.
E2E Technologies provides E2E Bridge
Hybrid combination of ESB, flow management, service and event containers providing a UML-based programming model
Transactions XTP Platform Vendors
GigaSpaces
Extreme Application Platform (XAP) — a platform middleware product combining Java, Spring, JavaSpaces, OSGi and a Java EE subset (JDBC, JMS and JCA) meant to address analytical and transactional applications.
Several grid-based application platform vendors (Appistry, Majitek and Paremus) have announced support for Spring and are extending their platforms for event-driven programming.
Event-driven application platform vendors (Kabira, jNetX, OpenCloud and WareLite) are beginning to move out of the traditional telecommunications and financial service sectors into other verticals (such as retail and defense).
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
CAP theorem from Amazon
What goals might you want from a shared-data system?
Strong Consistency: all clients see the same view, even in presence of updates
High Availability: all clients can find some replica of the data, even in the presence of failures
Partition-tolerance: the system properties hold even when the system is partitioned
The theorem states that you can have at most two of the three CAP properties at the same time.
CAP theorem from Amazon
The first property, Consistency
Has to do with ACID systems
Big shops like Amazon and Google, as they handle an incredibly huge number of transactions and data, always need some kind of system partitioning.
For Amazon the second and third CAP properties (Availability and Partition-tolerance) are fixed, so they need to sacrifice Consistency.
It means they prefer to compensate or reconcile inconsistencies instead of sacrificing high availability, because their primary need is to scale well to allow for a smooth user experience.
CAP theorem Lessons learned from Amazon
Most legacy application servers and relational database systems are built with consistency as their primary target, while big shops really need high availability.
That's why firms like Google or Amazon have developed their own application infrastructure.
That's also why a two-phase commit protocol is never an appropriate choice when there are big scalability needs.
To scale up, what you really need are asynchronous, stateless services, together with a good reconciliation and compensation mechanism in case of errors.
Second, your data model has a dramatic impact on performance; that's why Amazon implemented a simple put/get API instead of running complex database queries, and why Google's performance is due to the MapReduce algorithm: simplicity rules.
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
Scalability
Defined in terms of the impact on throughput as additional hardware resources are added
Adding CPUs/ RAM to a server : Scaling Up
Adding servers to a cluster : Scaling Out
Measured with the Scaling Factor (SF)
The ratio of new capacity to old capacity as resources are increased
If doubling CPUs results in 1.9x throughput, then the Scaling Factor is 1.9, and the adjusted SF/CPU is 0.95
The ideal SF/CPU is 1.0, a.k.a. linear scalability
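The scaling-factor arithmetic above is simple ratio math; a sketch (throughput numbers are illustrative):

```python
def scaling_factor(old_throughput: float, new_throughput: float) -> float:
    """Ratio of new capacity to old capacity."""
    return new_throughput / old_throughput

def sf_per_cpu(sf: float, cpu_ratio: float) -> float:
    """Adjusted scaling factor per unit of added CPU (1.0 = linear)."""
    return sf / cpu_ratio

sf = scaling_factor(1000.0, 1900.0)  # doubling CPUs yields 1.9x throughput
adj = sf_per_cpu(sf, 2.0)            # 0.95, a bit short of linear scalability
```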
Scaling up vs. Scaling out
IS professionals typically add capacity to computer systems by scaling up.
When response time starts to degrade because of additional workload or higher database capacities, the straightforward answer to the immediate performance problem is adding bigger, faster hardware.
Extrapolating from Moore's Law, which states that hardware performance will double every 18 months, you might conclude that scaling up is an adequate solution to handle growth for the foreseeable future.
However, you'll soon realize that Murphy's Law precludes Moore's Law.
Scaling up vs. Scaling out
Although the current 8-way SMP systems equipped with high-speed Storage Area Network (SAN) storage arrays provide tremendous scalability, they also bring to light several other scalability problems.
First, when a system reaches a certain point, further scaling up becomes prohibitively expensive.
Second, even with Moore's Law in full effect, you can't scale beyond a certain point—at least until vendors release the next generation of hardware.
Even beyond the hardware problems, you'll probably encounter software hurdles when you're trying to scale up.
Software systems such as databases have internal mechanisms that handle locking and other multi-user database issues.
These software structures have limited efficiency, and these limits typically become the real governing impediments to continued upward scalability.
Scaling up vs. Scaling out
Thus, you don't see SMP performance graphs continuing to demonstrate linear upward scalability as you add more processor power.
At some point, the curve always begins to flatten. At the upper reaches of that curve, you'll find that you need very expensive hardware upgrades to get very small performance improvements.
That’s where scaling out comes into play.
75
Scaling up vs. Scaling out
Scaling out can provide an effective answer to the problems of the scale-up scenario by using a shared-nothing architecture.
Essentially, shared-nothing architecture means that each system operates independently.
Each system in the cluster maintains separate CPU, memory, and disk storage that other systems can't directly access.
To address capacity issues by scaling out, you add more hardware, not bigger hardware.
When you scale out, the absolute size and speed of a single system doesn't limit total capacity.
Shared-nothing architecture also skirts the software bottleneck by providing multiple multi-user concurrency mechanisms.
Because the workload is divided among the servers, total software capacity increases.
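One common way to realize this shared-nothing division of work is to route each request deterministically to the node that owns its data, for example by hashing a key. A minimal sketch (the node names are hypothetical):

```python
import hashlib

# Hypothetical shared-nothing cluster: each node owns a disjoint slice of the
# data, and the router hashes the key so every request lands on exactly one node.
NODES = ["db-node-0", "db-node-1", "db-node-2"]

def route(key: str, nodes=NODES) -> str:
    """Deterministically map a key to the node that owns it."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The same key always maps to the same node, so nodes never coordinate
# with each other on reads or writes for that key.
print(route("customer:42"))
```

Note that naive modulo hashing reshuffles most keys when a node is added or removed; consistent hashing is the usual refinement.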
76
Scaling up vs. Scaling out
Although scaling out provides great answers to the inherent limitations in scale-up architecture, this method is no stranger to Murphy's Law, either.
At this point in the technology lifecycle, scaling out requires increased management overhead that is potentially as great as the performance gains it offers.
Even so, scaling out might be a viable solution to database implementations that have reached the limits of SMP scalability.
77
Scalability for Database Tier
Applications that go to the database for each request will likely have scalability problems
The database tier is difficult and expensive to scale; it is hard to scale a database server beyond a single host, and it becomes exponentially more expensive to add CPUs
Database servers scale sub-linearly at best with additional CPUs, and there is a CPU limit
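This sub-linear behavior is often modeled with Gunther's Universal Scalability Law, where a contention term and a coherency term flatten and eventually bend the throughput curve downward. A sketch with assumed (not measured) coefficients:

```python
# Universal Scalability Law: capacity at n processors given contention
# (sigma, serialized work) and coherency (kappa, crosstalk) penalties.
def usl_capacity(n: int, sigma: float, kappa: float) -> float:
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Illustrative coefficients (assumed): even small penalties mean that
# beyond some point, adding CPUs reduces total throughput.
for cpus in (1, 2, 4, 8, 16, 32):
    print(cpus, round(usl_capacity(cpus, sigma=0.05, kappa=0.005), 2))
```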
78
Super-Linear Scalability
It is possible to exceed an SF of 1.0
With two disks, reduced head contention can increase the throughput of sequential I/O by as much as 100x.
Similar effects can occur with CPU caches and context switches.
Large cluster-aggregated data caches can offer super-linear scale by significantly increasing the hit rate, reducing the average data access cost
Can also be explained as a super-linear slowdown as resources are reduced (i.e. the converse)
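The cache effect can be made concrete with a simple average-access-cost model (the latencies and hit rates below are assumed for illustration):

```python
# Average data-access cost as a function of cache hit rate.
CACHE_NS, DISK_NS = 100, 10_000_000   # ~100 ns cache hit vs ~10 ms disk read

def avg_access_ns(hit_rate: float) -> float:
    return hit_rate * CACHE_NS + (1 - hit_rate) * DISK_NS

# Suppose doubling the nodes doubles the aggregate cache and lifts the hit
# rate from 90% to 99% (assumed): average cost drops ~10x, not 2x.
print(avg_access_ns(0.90) / avg_access_ns(0.99))
```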
79
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
80
Performance
Defined as how fast operations complete
Typically measured as time (wall clock) elapsed between request and response
Elapsed time also known as latency
Web apps often measured on the server side as time to last byte (TTLB)
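A minimal sketch of this wall-clock measurement (the handler below is a hypothetical stand-in for a real HTTP call):

```python
import time

def measure_latency(handler, request):
    """Wall-clock time between issuing a request and getting the full response."""
    start = time.perf_counter()
    response = handler(request)
    return response, time.perf_counter() - start

def slow_handler(req):
    time.sleep(0.05)        # simulate ~50 ms of server-side work
    return f"response to {req}"

resp, latency = measure_latency(slow_handler, "GET /home")
print(f"{latency * 1000:.0f} ms")
```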
81
Scalability vs. Performance
Users are affected by poor performance
Poor performance is usually a result of poor scalability
Operating costs and capacity limitations are caused by poor scalability
Designing for scalability often has a negative impact on single-user performance
Building in the ability to scale out has overhead
But single-user performance doesn't often matter!
Once the maximum sustainable request rate is exceeded, performance will degrade
End user apps will degrade in a linear fashion as the request queue backs up
Automated applications will degrade exponentially
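The textbook M/M/1 queueing formula, W = 1 / (mu - lambda), illustrates why response time degrades sharply as the request rate approaches the maximum sustainable rate (the service rate below is assumed):

```python
# M/M/1 mean response time: W = 1 / (mu - lam), valid only while lam < mu.
def mm1_response_time(lam: float, mu: float) -> float:
    if lam >= mu:
        raise ValueError("saturated: the request queue grows without bound")
    return 1.0 / (mu - lam)

MU = 100.0  # assumed capacity: 100 requests/second
for lam in (50, 80, 90, 99):
    print(f"{lam} req/s -> {mm1_response_time(lam, MU) * 1000:.0f} ms")
```

Near capacity, small increases in load produce very large increases in latency: the curve is anything but flat close to saturation.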
82
Scalable Performance
Scalable Performance is NOT focused on making an application faster; rather, it is focused on ensuring that application performance does not degrade beyond defined boundaries as the application gains additional users, on how resources must grow to ensure that, and on how one can be certain that additional resources will solve the problem
Scalable Performance refers to overall response times for an application (SLA) that are within defined tolerances for normal use, remain within those tolerances up to the expected peak user load, and for which a clear understanding exists as to the resources that would be required to support additional load without exceeding those tolerances
83
Engineering For Performance
Build performance and scalability thinking in the development lifecycle
Define your objectives
Measure against your objectives
"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science." - Lord Kelvin (William Thomson)
84
Performance Modeling
A structured and repeatable approach to modeling the performance of your software
Similar to “Threat Modeling” in security
Begins during the early phases of your application design
Continues throughout the application lifecycle
Consists of
A document that captures your performance requirements
A process for incrementally defining and capturing the information that helps the teams working on your solution focus on using, capturing, and sharing the correct information.
85
Performance Modeling Process
Critical Scenarios
Have specific performance expectations or requirements.
Significant Scenarios
Do not have specific performance objectives
May impact other critical scenarios.
Look for scenarios that:
Run in parallel to a performance-critical scenario
Are frequently executed
Account for a high percentage of system use
Consume significant system resources
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
86
Performance Modeling Process
Workload is usually derived from marketing data
Total users
Concurrently active users
Data volumes
Transaction volumes and transaction mix
Identify how this workload applies to an individual scenario
Support 100 concurrent users browsing
Support 10 concurrent users placing orders.
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
87
Performance Modeling Process
Performance and scalability goals should be defined as non-functional or operational requirements
Requirements should be based on previously identified workload
Consider the following:
Service level agreements
Response times
Projected growth
Lifetime of your application
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
88
Define Your Objectives
Performance and scalability goals should be defined as non-functional or operational requirements
Requirements should be based on expected use of the system
Compare to previous versions or similar systems
Metric           Definition                      Measured By                          Impacts
Throughput       How many?                       Requests per second                  Number of servers
Response Time    How fast?                       Client latency                       Customer satisfaction
Resource Util.   How much?                       % of resource                        Hardware / Network
Workload         How many concurrent requests?   Concurrent requests for the system   Scalability, Concurrency
89
Define Your Objectives
Objectives must be SMART
S – Specific
M – Measurable
A – Achievable
R – Results Oriented
T – Time Specific
Vague: "application must run fast"
Vague: "page should load quickly"
SMART: "3 second response time on home page with 100 concurrent users and < 70% CPU"
SMART: "25 journal updates posted per second with 500 concurrent users and < 70% CPU"
"If you cannot measure it, you cannot improve it." - Lord Kelvin
90
Build an objective
Scenario           Response Time              Throughput               Workload               Resource Utilization
Browse Home page   Client latency 3 seconds   50 requests per second   100 concurrent users   < 60% CPU utilization
Search Catalog     Client latency 5 seconds   10 requests per second   100 concurrent users   < 60% CPU utilization
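Objectives like those above can be captured as data so automated tests can check measurements against them; a sketch using the table's numbers (the measurement values are hypothetical):

```python
# Objectives from the table, expressed as checkable data.
OBJECTIVES = {
    "Browse Home page": {"max_latency_s": 3.0, "min_rps": 50, "max_cpu": 0.60},
    "Search Catalog":   {"max_latency_s": 5.0, "min_rps": 10, "max_cpu": 0.60},
}

def meets_objective(scenario, latency_s, rps, cpu):
    o = OBJECTIVES[scenario]
    return latency_s <= o["max_latency_s"] and rps >= o["min_rps"] and cpu <= o["max_cpu"]

# Hypothetical measurements from a test run:
print(meets_objective("Browse Home page", latency_s=2.1, rps=55, cpu=0.48))  # True
print(meets_objective("Search Catalog", latency_s=6.2, rps=12, cpu=0.40))    # False
```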
91
Performance Modeling Process
Identify the steps that must take place to complete a scenario
Use cases, sequence diagrams, flowcharts etc. all provide useful input
Helps you to know where to instrument your code later
Start at a high level; don't go too low
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
92
Performance Modeling Process
Use your performance baseline to measure how much time each processing step is taking
If you are not meeting your target, re-budget the time among the processing steps
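Budget allocation can be sketched as dividing the end-to-end response-time target among the processing steps; the step names and weights below are assumed for illustration:

```python
# Divide a response-time budget among processing steps, weighted by the
# expected cost of each step (weights are assumed, not measured).
def allocate_budget(total_s: float, step_weights: dict) -> dict:
    total_weight = sum(step_weights.values())
    return {step: total_s * w / total_weight for step, w in step_weights.items()}

budget = allocate_budget(3.0, {"validate input": 1, "query database": 5, "render page": 2})
for step, seconds in budget.items():
    print(f"{step}: {seconds:.2f} s")
```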
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
93
Performance Modeling Process
Run automated test scenarios and evaluate the performance against objectives
As much as possible, tests must be repeatable throughout the application lifecycle
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
94
Performance Modeling Process
Check your results against performance objectives
Leave yourself a margin early in the project to avoid premature performance optimization
As you progress toward completion allow less margin
1. Identify Key Scenarios
2. Identify Workloads
3. Identify Performance Objectives
4. Identify Processing Steps
5. Allocate Budget
6. Evaluate
7. Validate
Iterate
95
Content
1. Introduction to Operational Aspects
2. Reliability
3. Availability and SLA
4. High Availability
5. Eliminating SPOF
6. Transaction
7. CAP Theorem
8. Scalability
9. Performance
10. Clustering
96
Clustering
Clustering enables multiple servers or server processes to work together
Clustering can be used to horizontally scale a tier, i.e. scale by adding servers
Clustering usually costs much less than buying a bigger server (vertical scaling)
Clustering also typically provides failover and other reliability benefits
97
Clustering Concepts
The less communication required, the better
Always better to be stateless in a tier if it does not cause a bottleneck in the next tier
Server farms: a stateless clustering model
The less coordination required, the better
Independence: Don’t go to the committee
Concurrency Control: Reduces scalability, so use only as necessary
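The stateless server-farm model above can be sketched as follows: no server holds per-user state in memory, so any server can handle any request. The shared session store here is a plain dict standing in for an external store such as a database or cache (all names hypothetical):

```python
# Shared session store (stand-in for an external database or cache).
SESSION_STORE = {}

class Server:
    """A stateless web server: all per-user state lives in the shared store."""
    def __init__(self, name):
        self.name = name

    def handle(self, session_id, item):
        cart = SESSION_STORE.setdefault(session_id, [])
        cart.append(item)
        return f"{self.name} added {item!r}; cart={cart}"

# Successive requests for the same session may land on different servers;
# because the servers are stateless, either one produces the same result.
a, b = Server("web-1"), Server("web-2")
a.handle("sess-7", "book")
print(b.handle("sess-7", "pen"))
```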
98
Clustering Benefits
If the application has been built correctly, it supports a predictable scaling model
Clustering allows relatively inexpensive CPU and memory resources to be added to a production application in order to handle more concurrent users and/or more data
Provides redundancy
Simple (n+1) model
99
Scalability of Clustering
The Potential for Negative Scale
Single server model allows unrestricted caching
Clustering may require the disabling of caching
Two servers often slower than one!
Data Integrity Challenges in a Cluster
How to maintain the data in sync among servers
How to keep in sync with the data tier
How to fail over and fail back servers without impact
http://www.twitter.com/welkaim
Travel 2.0
http://www.netvibes.com/travl20
http://fr.linkedin.com/in/williamelkaim
Blog
http://www.reimagine.fr/
Contact: [email protected]
100