Table of Contents
Foreword: CMG India's 1st Annual Conference …………….. i

Architecture & Design for Performance
Optimal Design Principles for Better Performance of Next Generation Systems, Maheshgopinath Mariappan et al …………….. 1
Architecture & Design for Performance for a Large European Bank, R Harikumar, Nityan Gulati …………….. 7
Designing for Performance Management in Mission Critical Software Systems, Raghu Ramakrishnan et al …………….. 19

Low Latency Multicore Systems
Incremental Risk Calculation: A Case Study of Performance Optimization on Multi Core, Amit Kalele et al …………….. 31
Performance Benchmarking of Open Source Messaging Products, Yogesh Bhate et al …………….. 41

Advances in Performance Testing and Profiling
Automatically Determining Load Test Duration Using Confidence Intervals, Rajesh Mansharamani et al …………….. 58
Measuring Wait and Service Times in Java Using Byte Code Instrumentation, Amol Khanapurkar, Chetan Phalak …………….. 69
Cloud Performance Testing Key Considerations, Abhijeet Padwal …………….. 78

Reliability
Building Reliability into IT Systems, K. Velivela …………….. 90
Foreword: CMG India's 1st Annual Conference
Rajesh Mansharamani
President, CMG India
When we founded CMG India in Sep 2013, I expected this community of IT system
performance engineers and capacity planners to grow to 200 members over time. A
year and a quarter later, I am happy to see my initial estimates proven wrong. Not only
do we have more than 1500 CMG India members today, we also have more than 200
attending our 1st Annual Conference this December!
CMG Inc is very popular worldwide thanks to its annual conference, which attracts the
best from the industry to present papers in performance engineering and capacity
planning. Having this precedent in front of us, we wanted to set the bar high for CMG
India's 1st Annual Conference. Given that the majority of IT system professionals in
India have never submitted a paper for a conference publication, we were delighted to
see 29 high quality submissions in response to our call for papers. The conference
technical programme committee, drawn from the best across industry and academia,
accepted 10 of these submissions for publication and presentation. We hope to see these
numbers grow over time, thus giving opportunities for more and more professionals
across India to step forth and present their contributions.
Fortunately, the paper submissions were in diverse areas spanning architecture and
design for performance, advances in performance testing and profiling, reliability, and
cutting edge work in low latency systems. When complemented with our keynote
addresses in big data, capacity management, database query processing, and real life
stock exchange evolution, we truly have a great technical programme lined up for our
audience. Thanks to all our keynote speakers (Adam Grummitt, N. Murali, Anand
Deshpande, and Jayant Haritsa) for their readiness to speak at this inaugural event.
Given that the majority of our audience is on billable client projects, we decided to restrict
the conference to a Friday and Saturday, and hence run tutorials and vendor talks in
parallel. Tutorials too went through a call for contributions process and we were
delighted to see fierce competition in this area as well. Finally, we could shortlist only
four tutorials and we added another two invited tutorials from academia and industry
stalwarts. At the same time we lined up one session on startups and five vendor talks
from our hosts and sponsors.
Our 1st conference would not have been possible without the eagerness shown by
Persistent Systems and Infosys, Pune, to host the sessions in their campuses in
Hinjewadi, which is today the heart of the IT sector in Pune. We would also not be able
to make our conference affordable to one and all without contributions from our
sponsors: Tata Consultancy Services, Dynatrace, VMware, Intel and HP. Given that CMG
India exists as a community and not a company, we were extremely glad when
Computer Society of India stepped in as the event supporter to handle all financial
transactions on our behalf. CMG India is extremely thankful to the hosts, sponsors, and
supporter, not just because of their deeds but also because of the terrific attitude they
have demonstrated in making this conference a success.
None of the CMG India board members has hosted or organized a conference of this
nature before. While CMG India has organized 16 regional events since its inception,
there was no need for an organizing committee for those events, given that each
event lasted just two to three hours and was free to participants. As the annual
conference dates started approaching we realized the enormity of the task at hand in
managing a relatively mega event of this nature. For that reason I am extremely grateful
to Abhay Pendse, head of this conference's organising committee, and all the volunteers
who have worked with him in the planning and implementation.
Given that all of the organising committee members are working professionals with little
spare time, it was heartening to see all of them spend late evening hours
ensuring that the conference planning and implementation was as meticulous as possible.
It's been a joy working with such people and I would like to thank them again and again
for stepping forward and carrying forth their responsibilities till the very end. I am
equally impressed with the technical programme committee (TPC) wherein nearly all of
the 25 members reviewed papers and tutorials well ahead of their deadlines. All TPC
members are expert professionals very busy in their own work. Hats off to such
commitment to the field of performance engineering and capacity management.
We have hit a full house in our 1st Annual Conference, and we look forward to tasting
the same success in the years to come. I sincerely hope this community in India
continues to grow and shows the same spirit of contribution as we move forward in time.
Technical Programme Committee
Rajesh Mansharamani, Freelancer (Chair)
Amol Khanapurkar, TCS
Babu Mahadevan, Cisco
Bala Prasad, TCS
Benny Mathew, TCS
Balakrishnan Gopal, Freelancer
Kishor Gujarathi, TCS
Manoj Nambiar, TCS
Mayank Mishra, IIT-B
Milind Hanchinmani, Intel
Mustafa Batterywala, Impetus
Prabu D, NetApp
Prajakta Bhatt, Infosys
Prashant Ramakumar, TCS
Rashmi Singh, Persistent
Rekha Singhal, TCS
Sandeep Joshi, Freelancer
Santosh Kangane, Persistent
Subhasri Duttagupta, TCS
Sundarraj Kaushik, TCS
Trilok Khairnar, Persistent
Umesh Bellur, IIT-B
Varsha Apte, IIT-B
Vijay Jain, Accenture
Vinod Kumar, Microsoft
Organising Committee
Abhay Pendse, Persistent (Chair)
Dundappa Kamate, Persistent
Prajakta Bhatt, Infosys
Rachna Bafna, Persistent
Rajesh Mansharamani, Freelancer
Sandeep Joshi, Freelancer
Santosh Kangane, Persistent
Shekhar Sahasrabuddhe, CSI
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Optimal Design Principles for Better Performance of Next Generation Systems
Maheshgopinath Mariappan, Balachandar Gurusamy, Indranil Dharap,
Energy, Communications and Services,
Infosys Limited,
India.
{Maheshgopinath_M,Balachandar_gurusamy,Indranil_Dharap}@infosys.com
Abstract
Design plays a vital role in software engineering methodology. Proper design ensures that the software will serve its intended functionality. The design of a system should cover both functional and nonfunctional requirements. Designing for nonfunctional requirements is very difficult in the early stages of the SDLC, because the actual requirements lack clarity and primary focus is given to functional requirements. Design-related errors are difficult to address and may cost millions to fix at a later stage. This paper describes various real-life performance issues and the design aspects to be taken care of for better performance.
1. INTRODUCTION
There has been tremendous growth in the field of social
networking and internet-based applications over the last few years. Across the globe there is exponential growth in the number of people using these applications. Companies are deploying many strategies to increase their applications' availability and reliability and make them less error prone, since any drop in these parameters has a significant impact on revenue and user base. But developing a 100% reliable and error-free application is not possible: some types of application errors are easy to fix and recover from, whereas others are not.
Design-related issues are critical, have a huge impact on the functionality of the application, and take a lot of time to redesign and rebuild. So enough attention has to be paid in the early stages to ensure that all aspects of the application are covered during design. Designing next generation systems is even more complex, as it introduces new complications: most of the software used is open source, many stakeholders are involved, and requirements change dynamically.
Sections 3 through 16 of this paper each explain a different design aspect that should be considered for a better quality design.
2. RELATED WORK
Our paper covers efficient logging ideas for achieving better application performance. Different best practices and algorithms are available for logging; some of the latest logging algorithms are explained by Kshemkalyani (1). Our design suggestions are generic in nature and applicable to any of the commonly used programming languages; a list of the common languages used for software development is provided by IEEE (2). Different patterns of IO operations are explained by Microsoft (4). The significance of cache sizing is explained in this paper, and different caching techniques are explained by L. Han and team (5).
3. Logging:
Traditionally, logging was considered a way to store all information related to application requests and responses. This information was used by the operations and development teams while debugging the application. Over time there has been a major shift in this trend: nowadays businesses rely on this data to generate business metrics and reports, and also use log data to identify usage patterns and customer churn. Advances in cloud computing and big data, as well as the availability of efficient tools like Splunk, have made this analysis possible. So there is a push from the different stakeholders of a product, such as sales, business, and care, to log as much information as possible about each user request. The following are common problems caused by poor logging design:
1. Slow response time
2. Application performance degradation
Case study:
A real-time web application was hanging after running in the production environment for 5 hours. After analysis we identified the root cause: the application was logging the entire request and response, with headers and all the metadata, into log files. After filtering out the unnecessary logging, the system ran without any issues. Extra measures need to be taken if the system is asynchronous and single threaded (e.g., Node.js): we should not block the master process on logging. If the main thread gets blocked, all requests into the system pile up until the main thread is released and available.
Aspects that need to be considered during design are:
1. Set a proper logging level
2. Log only the critical details of the session (session id, user id, type of operation performed, etc.) instead of all the details
3. Store logs on a file system or local disk instead of writing across the network
4. Set a proper rollover size and policy for log files
5. Enable auto-archival of logs
6. If possible, make logging an async process
The relationship between logging level and throughput is illustrated in Figure 1.
Figure 1: Log size vs. throughput and response time
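Suggestion 6 above (making logging asynchronous) can be sketched with a bounded queue drained by a single background thread. The class below is a hypothetical illustration, not a production logger: the in-memory StringBuilder sink stands in for a real file appender, and a write to a full queue is dropped rather than ever blocking a request thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical async logger: request threads enqueue a line and return
// immediately; one background thread does the slow write.
class AsyncLogger {
    private final BlockingQueue<String> queue;
    private final Thread writer;
    private final StringBuilder sink = new StringBuilder(); // stands in for a file
    private volatile boolean running = true;

    AsyncLogger(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        writer = new Thread(() -> {
            // Drain until close() is called and the queue is empty.
            while (running || !queue.isEmpty()) {
                try {
                    String line = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (line != null) sink.append(line).append('\n');
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        writer.start();
    }

    // Never block the caller: if the queue is full the line is dropped
    // (a real logger would also count the drops).
    boolean log(String line) {
        return queue.offer(line);
    }

    // Flush and stop the writer; returns what was written, for illustration.
    String close() throws InterruptedException {
        running = false;
        writer.join();
        return sink.toString();
    }
}
```

Dropping a line when the queue is full is the price of never blocking the request path; the alternative of an unbounded queue trades that risk for unbounded memory growth.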
4. Programming language:
In the software industry there are many programming language options available for developing a system. Selecting the proper language for application development makes a major difference in system performance. Some developers tend to write all applications in the same language, but due consideration should be given to selecting the proper programming language.
Case study:
An e-commerce application was designed using the J2EE framework. It was a stateless web application, and a single server was not able to serve enough user requests. The same application was redesigned using Node.js, and the system can now handle more requests on the same application server. Development time was also reduced considerably.
But the same Node.js, when employed for a back-end application, had the following issues. During the trial run everything seemed normal, but the application did not scale in the production region without Node clustering, since Node.js is a single-threaded framework. Another major issue was error handling: any error in the input crashed the single thread and brought the application down completely. So the following has to be considered:
1. If the application is going to be multithreaded, use frameworks like Akka, Play, or J2EE
2. If the application needs quick response times, consider languages like Scala over Java
3. If the application is a web application with no state information stored, consider platforms like Node.js
Figure 2: Response time vs. programming language (I/O-intensive vs. CPU-intensive applications)
3.3 Reducing IO calls:
IO-related calls such as DB and file operations are generally costly and should be limited as much as possible.
Case study:
A mobile back-end application's goal was to collect user preferences and debug logs from mobile clients and store them in the DB. During initial testing everything seemed fine, but at production launch the response time increased drastically. Analysis showed that the client was calling the DB multiple times to store and retrieve the user preferences and logs, so a caching layer was added between the client and the DB to serve the most frequently requested data. The remedies are:
1. Employ a caching mechanism
2. Group the DB write calls
3. Use an async driver if available
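The "group the DB write calls" idea can be sketched as a small buffer that hands records to the database layer one batch at a time. The BatchWriter class and its flush callback are hypothetical illustrations; in a real system the callback would be, say, a JDBC batch insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical write-grouping buffer: individual writes are collected and
// flushed as batches, turning N round trips into roughly N / batchSize.
class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushFn; // e.g. a JDBC batch insert
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    BatchWriter(int batchSize, Consumer<List<T>> flushFn) {
        this.batchSize = batchSize;
        this.flushFn = flushFn;
    }

    void write(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        flushFn.accept(new ArrayList<>(buffer)); // one round trip for the group
        buffer.clear();
        flushes++;
    }

    int flushCount() {
        return flushes;
    }
}
```

Writing 10 records with a batch size of 4 costs three round trips instead of ten; a real implementation would also flush on a timer so buffered records never wait too long.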
5. Selection of DB
Selection of the appropriate DB is also a critical factor. Nowadays there is a trend in the technical community to choose a NoSQL DB for each and every application, but that is not the right approach. The following are common problems when the DB selection is not correct:
1. Queries become very complex if the data is not stored in a proper format
2. Queries take more time to execute and compute results than the expected interval
3. Overall slow response to the end user
Case study:
An application was designed to store various trouble tickets and their status. The development team decided to build the entire application with newer software such as NoSQL databases, but during the POC it was observed that an RDBMS is better suited than NoSQL for a relational, transaction-based system.
The best design practices for the selection of a database are:
1. Use a relational (SQL) DB if there are relations among the stored data and the system needs frequent reads
2. NoSQL DBs are useful for storing voluminous data without relations, with more writes and fewer reads
6. Replication strategies:
Applications are deployed across different data centers and clusters, and sometimes data replication is required across the servers to serve users without any functional glitch. Two strategies are followed for data replication:
1. Synchronous data replication
2. Asynchronous data replication.
Synchronous data replication should be used only when the data needs to be updated with each transaction; it generally comes with the tradeoff of slower response times for end users.
Asynchronous replication provides good response times but is not suitable for frequent updates.
Figure 3: Replication type vs Response time
7. Avoid too many hops:
It is no surprise that the number of hops directly affects the performance of a system, especially one that spans legacy systems.
Case study:
When a provisioning engine was deployed in production, it took more than an hour to provision a single user account. A lot of analysis was done to identify the root cause. The system was calling more than 50 legacy services to fetch information and create the user entry during provisioning, and each system took some milliseconds to process a request. Further analysis showed that all these legacy systems were built on top of the original source of truth, each adding some small extra function not required by the provisioning engine.
After a series of discussions with all our stakeholders, we retired the unwanted systems and rewrote the original source of truth so that it provides all the required data directly. The response time then improved from an hour to less than 10 seconds. So always remember to avoid unnecessary hops in larger systems.
Figure 4 : Number of hops vs Response time
8. Caching:
Caching temporarily stores user data for a certain period of time to reduce I/O-intensive calls such as DB reads and writes. The following rules of thumb should be followed for caching:
1. Store only the static data in Cache
2. Never store dynamic data in the cache
3. Store only less volume of data
4. Put proper Cache eviction policies
5. If possible, use an in-memory cache rather than a secondary cache
The following are common issues faced across applications when caching is not done properly:
1. A lot of DB calls
2. Out-of-memory errors due to unbounded cache growth
Figure 5 : Data type vs Cache effectiveness
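Rules 3 and 4 above (bounded size and a proper eviction policy) can be sketched in a few lines using Java's LinkedHashMap in access order, which yields a least-recently-used cache whose growth is capped, avoiding the out-of-memory problem noted above. The class name is a hypothetical illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded LRU cache: LinkedHashMap in access order evicts the
// least recently used entry once the cap is exceeded, so the cache can
// never grow without bound.
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, i.e. LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least recently used entry
    }
}
```

Note that this sketch, like LinkedHashMap itself, is not thread safe; wrap it with Collections.synchronizedMap or use a proper caching library for concurrent access.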
9. Retry mechanism:
Generally a retry mechanism is employed for database calls and third-party calls to cover failures caused by the target application being unreachable. This also helps enhance the overall user experience.
Case study:
A real-time communication project was deployed with a NoSQL DB in the back end. Over a weekend the NoSQL DB went down due to system issues, and all the clients retried indefinitely, eventually bringing down the entire infrastructure. Thorough analysis identified that the number of retries was not bounded on the client side, so clients continuously retried to reach the DB. Configuring a retry limit in the application allowed it to work without issues, so the optimal number of retries should be decided during the initial stages of system design. Also note that, according to the CAP theorem (6), we cannot achieve consistency, availability, and partition tolerance at the same time; there is always a tradeoff between these properties.
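The bounded client-side retry discussed in this case study can be sketched as follows, assuming a hypothetical Retry helper: a fixed maximum number of attempts with a growing pause between them, instead of retrying forever.

```java
import java.util.concurrent.Callable;

// Hypothetical bounded retry helper: a fixed maximum number of attempts
// with a linearly growing backoff between them.
class Retry {
    static <T> T call(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Back off before the next attempt, a little longer each time.
                if (attempt < maxAttempts) Thread.sleep(backoffMillis * attempt);
            }
        }
        throw last; // give up after maxAttempts; let the caller degrade gracefully
    }
}
```

In a real deployment the backoff would typically be exponential with jitter, so that many clients recovering at once do not hammer the database in lockstep.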
10. Garbage collection:
Case study:
A queuing application had an issue of dropping customer requests. No issues were reported in the system logs; everything seemed normal. During root-cause analysis it was identified that garbage collection was not configured properly: full collections were attempted at very frequent intervals, which caused the system to hang during collection and lose transactions.
So the design should capture the required garbage collection parameters.
Figure 6 : Full GC vs Dropped connections
11. Session Management:
Case study:
An e-commerce site was not able to scale beyond a certain volume of customer transactions. Analysis identified that a large number of customer sessions was being maintained in system memory, each holding a lot of data, so the process consumed a lot of memory and the system could not scale. So care should be taken to manage sessions properly:
1. Remove a session from memory after a certain time limit
2. Remove a session immediately if the client disconnects or becomes unreachable
3. Limit the number of items stored as part of a session
4. Move data from the session to temporary cache or journal storage so that memory can be freed most of the time
12. Async Transactions
During design, determine which transactions can be performed asynchronously. Performing transactions asynchronously helps reuse expensive resources effectively. The following should be done asynchronously:
1. Data base read or write
2. IO intensive operations
3. Calls to third party systems
4. Enable NIO connections in the application servers
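Items 1-3 above can be sketched with CompletableFuture: the DB write and the third-party call run concurrently instead of sequentially, and the request thread is not tied up waiting for either. The class and method names are hypothetical placeholders for real async drivers.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: the DB write and the third-party notification run
// concurrently; the caller only blocks (here, for the demo) when both
// results are needed.
class AsyncCheckout {
    static CompletableFuture<String> writeOrder(String id) {
        // In a real system: an async DB driver call.
        return CompletableFuture.supplyAsync(() -> "saved:" + id);
    }

    static CompletableFuture<String> notifyPartner(String id) {
        // In a real system: a non-blocking HTTP call to a third party.
        return CompletableFuture.supplyAsync(() -> "notified:" + id);
    }

    static String process(String id) {
        // Both futures are started before either is awaited, so the two
        // operations overlap in time.
        return writeOrder(id)
                .thenCombine(notifyPartner(id), (db, partner) -> db + "," + partner)
                .join();
    }
}
```

In a fully asynchronous service the final join() would itself be replaced by a callback or returned future, so no thread ever blocks on the result.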
13. Choice of Data Structures
Data structures are used internally to store and retrieve data in an organized manner. The commonly used abstractions are List, Map, Set, and Queue, and each implementation comes with its own inherent characteristics. Hashtable, for example, is synchronized by design, so using it in multithreaded applications degrades performance; ConcurrentHashMap achieves better results. Similarly, StringBuilder is preferred over StringBuffer because it is not synchronized. If you want to retrieve items by key, the ideal choice is a Map implementation. To retrieve items in insertion order, any List implementation can be used. To store only unique data items, use a Set implementation. Queues can be used to implement any worker-thread model of application. Not selecting the appropriate data structure also results in inefficient use of heap memory. Each implementation has its own default storage capacity: HashMap and StringBuffer default to 16 (Hashtable's is 11), whereas ArrayList defaults to 10. HashMap doubles its capacity when the initial storage is exhausted, while ArrayList grows by about half its current size, so an oversized or wrongly chosen structure can consume considerably more heap memory than needed. Utmost care must therefore be taken in selecting the proper data structure during the design phase.
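The guidance above can be made concrete with a short sketch: ConcurrentHashMap for shared key lookup (instead of the fully synchronized Hashtable), a Set for uniqueness, a Queue for a worker model, and a presized ArrayList so the backing array never has to grow while filling. This is an illustration of the choices, not a benchmark.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Each structure below matches one of the retrieval patterns described above.
class StructureChoices {
    static String demo() {
        Map<String, Integer> hits = new ConcurrentHashMap<>(); // thread-safe key lookup
        hits.merge("home", 1, Integer::sum);
        hits.merge("home", 1, Integer::sum);

        Set<String> uniqueUsers = new LinkedHashSet<>();       // uniqueness, keeps insertion order
        uniqueUsers.add("alice");
        uniqueUsers.add("alice");                              // duplicate is ignored

        Queue<String> work = new ArrayDeque<>();               // worker-thread model
        work.add("job-1");
        work.add("job-2");

        List<Integer> ids = new ArrayList<>(1000);             // presized: no regrowth while filling
        for (int i = 0; i < 1000; i++) ids.add(i);

        return hits.get("home") + " " + uniqueUsers.size() + " "
                + work.poll() + " " + ids.size();
    }
}
```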
14. Client Side Computing vs Server Side computing
There is a long-standing disagreement between back-end developers and web designers about whether the client-side or server-side computation model is better. Most client-side computation is related to the user interface and is implemented using JavaScript, AJAX, or Flash, which use the client system's resources to complete requests. Server-side computing is implemented using technologies such as PHP, ASP.NET, JSP, Python, or Ruby on Rails. There are advantages and disadvantages to each approach. Server-side computing offers quick business-rule computation, efficient caching, and better overall data security, whereas client-side computing enables more interactive web applications that use less network bandwidth and achieve quicker initial load times. So server-side computation is preferred for validating user choices, building structured web applications, and persisting user data, while client-side computation is useful for dynamic loading of content, animations, and storing data in local temporary storage.
15. Parallelism
Most of the modern day computers by default contain multiple cores. During design phase we need to identify what are all the tasks that can be executed in parallel to completely utilize this hardware feature. Parallelism is a concept of executing tasks in parallel to achieve high throughput, performance and easy
scalability. To achieve parallelism in the application we need to identify the set of tasks that can be executed independently without waiting for others. A large work item or transaction should be broken into small work units. Dependencies between these work units, and the communication overhead between them, should be identified during the design phase. The work units can then be handed to a central coordinating unit for execution, and finally the results are combined and sent to the user. One good example of this design is the MapReduce programming technique. According to the design rule hierarchies paper (7), software modules located within the same layer of the hierarchy suggest independent, hence parallelizable, tasks; dependencies between layers or within a module suggest the need for coordination during concurrent work. So use as much parallelism as possible in the design to achieve better performance.
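The split/execute/combine flow described above can be sketched as a tiny MapReduce-style sum: a large array is broken into independent work units, the units run on a thread pool, and the partial results are combined at the end. The class is a hypothetical illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Split a large sum into independent chunks, run them in parallel,
// and combine the partial results.
class ParallelSum {
    static long sum(long[] data, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (data.length + workers - 1) / workers;
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {              // split into work units
                final int from = w * chunk;
                final int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit((Callable<Long>) () -> {
                    long s = 0;
                    for (int i = from; i < to; i++) s += data[i];
                    return s;                                 // independent partial result
                }));
            }
            long total = 0;
            for (Future<Long> p : parts) total += p.get();    // combine
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The chunks share no state, so no coordination is needed until the combine step; that independence is exactly what makes the task parallelizable in the sense of the design rule hierarchies discussed above.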
16. Choice of Design patterns
Design patterns provide solution approaches to commonly recurring problems in a particular context. The concept started with the initial set of patterns described by the Gang of Four in their design patterns book. Currently around 200 design patterns are available to resolve different software problems, so the cumbersome task is identifying the suitable pattern for the application. One good approach is the Design Pattern Intent Ontology proposed by Kampffmeyer and Zschaler in their paper (8).
They have also developed a tool to identify the suitable design pattern for a problem. Once a particular pattern is identified, it should be checked to ensure it is not an anti-pattern.
17. CONCLUSION:
As next-generation systems become more complex and pose challenges of their own, it is imperative to follow the points discussed in this paper while designing them. Based on our experience over the years, a system proves efficient and cost-effective only when adequate weight is given to, and adequate time spent on, its design. In brief, the design topics are:
1. Proper logging configuration
2. Appropriate selection of software language
3. Reduce the number of IO operations
4. Appropriate selection of databases
5. Suitable replication strategy
6. Retire/merge unwanted legacy systems
7. Implement proper caching mechanism
8. Proper retry interval at client side
9. Proper garbage collection configuration
10. Keep less data in session memory
11. Give priority to async transactions
12. Proper choice of data structure.
13. Client Side computing vs Server Side computing
14. Parallelism
15. Choice of Design patterns
Each of the design principles above will help achieve better performance and a good customer experience.
18. REFERENCES:
1. A. Kshemkalyani, "A Symmetric O(n log n) Message Distributed Algorithm for Large-Scale Systems", Proc. IEEE Int'l Cluster Computing Conf., 2009.
2. http://spectrum.ieee.org/at-work/tech-careers/the-top-10-programming-languages
3. N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan, "Scalable I/O Forwarding Framework for High-Performance Computing Systems", in IEEE International Conference on Cluster Computing (Cluster 2009), New Orleans, LA, September 2009.
4. http://msdn.microsoft.com/en-us/library/windows/desktop/aa365683(v=vs.85).aspx
5. L. Han, M. Punceva, B. Nath, S. Muthukrishnan, and L. Iftode, "SocialCDN: Caching Techniques for Distributed Social Networks", in Proceedings of the 12th IEEE International Conference on Peer-to-Peer Computing (P2P), 2012.
6. http://en.wikipedia.org/wiki/CAP_theorem
7. S. Wong, Y. Cai, G. Valetto, G. Simeonov, and K. Sethi, "Design Rule Hierarchies and Parallelism in Software Development Tasks".
8. H. Kampffmeyer and S. Zschaler, "Finding the Pattern You Need: The Design Pattern Intent Ontology", in MoDELS, Springer, 2007, volume 4735, pages 211-225.
ARCHITECTURE AND DESIGN FOR PERFORMANCE OF A LARGE EUROPEAN BANK PAYMENT SYSTEM
Nityan Gulati
Principal Consultant, Tata Consultancy Services Ltd, Gurgaon
[email protected]
R. Hari Kumar
Senior Consultant, Tata Consultancy Services, Bangalore
[email protected]
A large software system is typically characterized by a large volume of transactions to be processed, considerable infrastructure, and a high number of concurrent users. Additionally, it usually involves integration with a large number of upstream and downstream interfacing systems with varying processing requirements and constraints. These parameters on their own may not pose a challenge when they are static in nature, but it gets tricky when the inputs keep changing and continuously evolving. In such conditions, how do we keep system performance and resilience under control? This paper explains the key design aspects that need to be considered across the various architectural layers to ensure smooth post-production performance.
1. INTRODUCTION
In a typical implementation, due attention is often not paid to system performance during the initial stages of design and development. Performance testing happens at a later stage, sometimes just a few weeks before the application goes live. As a result, only very limited performance tuning options are available at this stage: we can do a bit of SQL tuning and some tweaking of the system configuration. Due to the lack of a systematic and timely approach to addressing performance issues, these steps mostly result in little gain.
While a large system involves several design aspects, we shall discuss some key application design areas and guidelines that need to be borne in mind during the design stage for robust and performing implementations.
Specifically, the paper illustrates key aspects of tuning web page response time, tuning straight-through processing (STP) throughput, improving batch throughput, and a number of other tuning parameters.
The document is based on the experiences from tuning the architecture, design and code of several product based implementations of financial applications.
The examples and statistics quoted are derived from the actual experience from managing the design and architecture of payment platform for a large European Bank.
The system has gone live successfully and is in production for around two years now.
The paper is organized as follows:
Section 2 provides the context of the payment system, the SLA requirements, and a brief overview of the system architecture; Section 3 presents the key design considerations and parameters discussed in the paper; Sections 5 to 7 discuss in detail the tuning done on the selected parameters; Section 8
summarizes the key performance benefits realized after the various parameters were tuned; and finally Section 9 enumerates the key lessons learnt from the project, followed by references.
2. SYSTEM CONTEXT AND ARCHITECTURE
The following picture depicts a high level view of the architecture of a corporate banking application. The application supports payment transactions in terms of deposits, transfers, collections, mandates and a host of other typical banking transactions.
Figure 1 Application architecture of the payment processing system
The following are the key metrics on the transaction volume the system is expected to process.
Two million transactions per day, with a peak load of 300k transactions per hour and around 2000 branch users. The end-of-day (EOD) / end-of-month (EOM) process has to complete within 2 hours. The system has to generate around one million statements post EOD over a million accounts. The system has around twenty upstream interfaces sending payments down as files and messages, and over 30 downstream systems, including regulatory, reporting, security, and business interfaces.
The following figure shows the technical architecture of the banking application:
Figure 2 Technical architecture of the payment processing system
This is a typical n-tier architecture with the following layers:
Web Tier: Browser based, using HTTPS as the protocol with extensive use of AJAX to render the content. Presentation Layer: Provides the presentation logic, mainly appearance and localization of data. Controller Layer: Realized using the standard Model View Controller (MVC) design pattern.
Application Layer: This consists of Business Services and Shared services. A business service covers the functional area of a specific application, whereas a shared service provides infrastructure functionality such as exception handling, logging, audit trail, authorization, and access control.
Database Tier: This layer encompasses the data access objects that encapsulate data manipulation logic. Batch Processing: The batch framework supports a multi-threaded scalable framework. It provides restart and recovery features. It also allows capability to schedule the jobs using an internal scheduler or any 3rd party schedulers such as Control-M.
Integrator: This layer provides the integration capabilities for interfacing with external systems and gateways. The Integrator combines a rich set of protocol adapters, including the popular SWIFT adapters, with a powerful transformation and rule-based routing engine to provide the standard features of an Enterprise Application Integration layer.
3. DESIGN CONSIDERATIONS
This section covers the key challenges that we faced in the various architectural layers of the system and the thought process adopted for resolution.
The following are the challenges discussed in this paper:
3.1 Tuning web page parameters
Performance of search screens, considering the variety of search options and parameters available to the user and the huge transaction volumes added to the system on a daily basis. Search capability is fundamental to the business users carrying out their daily tasks. The SLA mandated by the customer is a screen response of 2 seconds or less.
3.2 Tuning the STP (straight-through processing) throughput parameter
Design for maintaining the STP throughput, considering the following parameters:
The number (count) of payment files received from the upstream systems at any given point in time, and the non-uniformity in the size of the files received. A file could contain a single transaction or a bulk set of transactions.
The SLA is to process a load of 300,000 transactions received as messages, files (single and bulk) of varying sizes under 60 minutes.
3.3 Tuning batch parameters
The performance of batch programs largely depends on effective management of database contention and optimal commit sizes. Considering that we could receive more than 40% of the transactions against a single account, hot spots could result. The batch process can be quite intensive, pumping through a large number of transactions; we had to tune the system for commit rates in excess of 10,000 per second.
The SLA is to complete the end-of-day (EOD) and end-of-month batch profiles in less than 2 hours. The peak volume for a business day was about two million transactions. Additionally, the following parameter tunings are considered:
Database parameters tuning.
Oracle specific considerations.
4. TUNING WEB PAGE PARAMETERS
Keeping the response time SLA for the search page under 2 seconds posed a challenge, considering the variety of search options and combinations available to the user and the huge number of transactions added to the system on a daily basis. The flexibility provided to the user allowed a large number of combinations of search parameters, and the generic SQL, with null-value functions and "OR" conditions, led to full table scans.
Web-based search is one of the key operations that business users exercise frequently, so it is imperative that the operation is designed as efficiently as possible. The following techniques are recommended for efficient search operations.
Identifying the popular search criteria – Considering the huge number of permutations of search conditions possible, it is challenging to have indexes available on every combination. Moreover, there can be fields such as names and places where indexes may not be useful, since several thousand records may qualify for a given name or place. Hence it is essential to understand the most frequently used search parameters through detailed discussions with the business or operational users of the current systems. We included dedicated queries for these popular combinations, serving most day-to-day operational requirements. For the remaining combinations, we included a generic query with a short date range as default. For example, in a payment system, Order Date will have a default date range of 30 days; this is the period most searches fit into.
If a legacy system is being upgraded, logging via software probes can be used to capture the parameter values passed, for systematic analysis. A query on the Oracle internal view DBA_HIST_SQLBIND can also be used to capture the parameter (host variable) usage of the SQLs involved by the end users.
Gracefully stopping long-running SQLs – A long-running online query in the database is not under the control of the application server; it can block an application server thread for a considerable amount of time, impacting concurrency. In Oracle RDBMS such a query can be interrupted gracefully by creating a dedicated Oracle user ID for use in the application server data source. This user ID is assigned an Oracle profile with a limit set for CPU_PER_CALL (in units of 1/100 of a CPU second). Whenever an SQL running under that user ID exceeds the specified limit, Oracle error ORA-02393 is returned and the query is terminated. The application code traps the error and sends a meaningful message to the end user.
Tuning the blank search – This is a special case of open-ended search where the end user passes no parameters at all. All data is then part of the result set, and as the data is invariably sorted, this means sorting millions of rows to present the first few. The only way to address this is to guide the query plan (at times via hints) so that an index whose column order matches the sort order is picked, avoiding the costly sort operation. We also decided to suppress this open search feature wherever it is not required.
Additionally, we added a data filter to restrict the size of the result set. This makes sense, as a blank search would otherwise bring up multiple pages of data that are not really useful to the user.
4.1 Quick Tuning Steps that lead to further benefits
Most of the time the static content does not change frequently. In such cases we can use long-term caching in the browser. However, the browser still generates 304 requests [PERF001] to verify the validity of the static content, causing response delays; this problem is pronounced in high-latency locations. The following entry in the httpd configuration file of the web server prevents the 304 requests.
Table 1: Configuration required in HTTPD to prevent 304 calls
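The exact directives used in the project are not reproduced in this extract. A typical Apache httpd configuration achieving this effect (assuming mod_expires and mod_headers are enabled; the file-extension list is illustrative) looks like:

```apache
# Serve static content with a far-future expiry so the browser
# caches it long-term and does not revalidate (no 304 round trips).
<LocationMatch "\.(js|css|png|gif|jpg)$">
    ExpiresActive On
    ExpiresDefault "access plus 1 year"
    Header set Cache-Control "max-age=31536000, public"
    # Removing validators prevents conditional (If-Modified-Since /
    # If-None-Match) requests that would trigger 304 responses.
    Header unset ETag
    FileETag None
</LocationMatch>
```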
However, when a software upgrade contains modified static content, the browser would still continue to use the old content. This was managed by appending a release number to the context parameter of the application, whereby the browser pulls the content once again on first access. To avoid the URL changing from the user's perspective, the base URL was made to forward the user request to the upgraded URL.
This is by far the best quick win for deploying the application in a high-latency WAN environment. Significant gains in response times were realized due to the reduction in the number of network calls the page rendering process performed to retrieve static content. A gain of 15-20% was realized on high-latency networks where latencies were between 200 and 300 ms.
Browser upgrade – It has to be noted that IE8 and IE9 give superior performance compared to IE7 on account of more efficient rendering APIs in these versions. The performance of the web pages improved 10 to 15% without requiring any code changes.
5. STP THROUGHPUT PARAMETER TUNING
Design for maintaining the STP throughput considers the following parameters:
The number of files received at any point in time, and the non-uniformity in their sizes. Random combinations of small, medium and large files are received from the upstream interfaces. A small file varies from ~1 to 100 KB based on the record length of the transaction; a medium file from 100 to 1000 KB; a file is considered large when it is in excess of 1 MB. We were expected to receive files of up to 60 MB.
The following is the design adopted for achieving scalability and load balancing over the number of files and their sizes:
Figure 3 Design view of the file processing component for scalability
The files are received by the File interface through a push or pull mechanism based on the upstream source.
Based on the number of files received, file processor threads are spawned dynamically, scaling up the processing capability.
The file content is parsed and grouped based on the type of transaction, e.g. single payments and bulk payments.
Each batch is handled by a batch adapter, which divides the load amongst a pool of threads that can be sized based on the load.
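As a rough sketch of this design (the class and method names below are illustrative, not the actual implementation), a cached thread pool spawns worker threads on demand as files arrive, mimicking the dynamic scaling described above:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: each incoming file is submitted to a pool that
// grows with load; each task parses one file's records.
class FileProcessor {
    // A cached pool creates threads on demand and reclaims idle ones,
    // so processing capability scales with the number of files received.
    private final ExecutorService pool = Executors.newCachedThreadPool();

    Future<Integer> submitFile(List<String> records) {
        // The task returns the record count, a stand-in for parsing and
        // grouping the content into single/bulk payment batches.
        return pool.submit(() -> records.size());
    }

    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        FileProcessor fp = new FileProcessor();
        Future<Integer> f1 = fp.submitFile(List.of("txn1", "txn2"));
        Future<Integer> f2 = fp.submitFile(List.of("txn3"));
        System.out.println(f1.get() + f2.get()); // total records processed
        fp.shutdown();
    }
}
```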
6. TUNING BATCH PARAMETERS
Batch jobs are a standard way of processing the transactions at the end of the day for doing interest calculations, account management and other risk management jobs. The volume of the transactions at the end of the day can be quite high. In our case, the peak volume of transactions expected was about two million transactions.
The application software should be able to scale up and fully exploit the available resources (CPU, memory, etc.) with minimal contention amongst parallel paths of execution.
Java provides multithreading support, and batch frameworks should exploit it. However, various hotspots causing concurrency issues can degrade the gains from multi-threading.
The guidelines below were used to mitigate the contention:
Sequence caching – Batch processing needs a large number of transaction IDs to be generated. Oracle-generated sequences enable them to be easily generated and assigned to transactions, and caching them upfront reduces the overhead of ID generation, improving the performance of batch processing. Oracle sequences, typically used in primary keys, should be cached as much as possible (the default is 20) if there is no business constraint, and the NOORDER clause should be preferred. This technique improved the SQL performance of our application, thereby improving the batch throughput considerably.
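As an illustration (the sequence name and cache size here are hypothetical, not taken from the project), the caching is enabled with a DDL of the form:

```sql
-- Pre-allocate 1000 sequence numbers in memory per fetch; NOORDER
-- avoids ordering overhead and is acceptable when IDs need only be
-- unique, not strictly sequential.
ALTER SEQUENCE payment_txn_seq CACHE 1000 NOORDER;
```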
Allocation of transactions to threads – Careful allocation of transactions to threads avoids situations where transactions processed in parallel across multiple threads enter into contention (Oracle row lock contention waits). An example is the case where transactions of the same account are processed in parallel across threads, each doing a 'SELECT FOR UPDATE' on the account row. The contention can be reduced if the threads pick up transactions based on the modulus of the account ID, i.e. the remainder when the ID is divided by the thread count. This technique helped us route transactions evenly across the threads, thereby reducing hot spots.
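A minimal sketch of this routing rule (the thread count and account IDs are illustrative):

```java
// Route each transaction to a worker thread by account-ID modulus, so
// all transactions for one account land on the same thread and never
// contend on the same account row across threads.
class ModulusRouter {
    private final int threadCount;

    ModulusRouter(int threadCount) { this.threadCount = threadCount; }

    int threadFor(long accountId) {
        // Remainder of accountId / threadCount picks the worker index.
        return (int) (accountId % threadCount);
    }

    public static void main(String[] args) {
        ModulusRouter r = new ModulusRouter(4);
        // The same account always maps to the same thread...
        System.out.println(r.threadFor(1001) == r.threadFor(1001)); // true
        // ...and different accounts spread across workers 0..3.
        System.out.println(r.threadFor(1001)); // 1001 % 4 = 1
        System.out.println(r.threadFor(1002)); // 1002 % 4 = 2
    }
}
```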
Usage of the right data structures – Better container collection classes promote better concurrency. We adopted ConcurrentHashMap, which reduced the cases where transactions were entering into deadlock situations.
Parallelism in the batch profile – This can bring much-needed time reduction in the overall EOD cycle. We engaged the functional SMEs to carefully redesign the EOD profile so that non-conflicting batch programs could run in parallel.
Deadlock prevention – Deadlocks need to be avoided through planning and proper design. Updates should be performed at the end, just before the commit, as far as possible, and tables should be updated in a consistent order in all programs running in parallel, e.g. Table A, followed by Table B, followed by Table C. Details of deadlocks, including the SQLs and rows involved, can be seen in the alert log and trace files generated by Oracle. This technique helped us reduce deadlocks when files were received with several transactions on a small set of accounts.
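The consistent-ordering rule applies to any resources locked by parallel workers, not only tables. A small sketch (with illustrative names, using in-process locks as a stand-in for row locks): both threads acquire the two account locks in ascending ID order, so a cycle of waiters can never form even when the transfers run in opposite directions.

```java
import java.util.concurrent.locks.ReentrantLock;

// Deadlock prevention by consistent lock ordering: every thread locks
// the lower account ID first, so no circular wait can occur.
class OrderedLocking {
    static void transfer(long idA, ReentrantLock lockA,
                         long idB, ReentrantLock lockB, Runnable work) {
        ReentrantLock first  = idA < idB ? lockA : lockB;
        ReentrantLock second = idA < idB ? lockB : lockA;
        first.lock();
        try {
            second.lock();
            try { work.run(); } finally { second.unlock(); }
        } finally { first.unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock l1 = new ReentrantLock(), l2 = new ReentrantLock();
        // Two opposite-direction transfers would deadlock with naive
        // acquisition order; with ordered locking both complete.
        Thread t1 = new Thread(() -> transfer(1, l1, 2, l2, () -> {}));
        Thread t2 = new Thread(() -> transfer(2, l2, 1, l1, () -> {}));
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("both transfers completed");
    }
}
```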
7. TUNING DATABASE PARAMETERS
Managing the redo logging – Oracle redo logs can become a huge bottleneck when several threads write large amounts of redo data in parallel; the more threads, the greater the redo generation load. The redo volumes were reduced by avoiding repeated updates in the same commit cycle, and contention was eased by placing the redo logs on faster storage and separating the log members/groups onto different disks [PERF003].
Using prepared statements with bind variables – This promotes statement caching, reduces the memory footprint in the shared pool and reduces the parsing overhead.
Bulking of inserts/updates – Inserts and updates were clubbed using the addBatch and executeBatch JDBC methods. This is highly useful in an IO-bound application: it saves network round trips, and is especially useful where the latency between the application and database servers is high.
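The pattern can be sketched without a live database as follows (the flush callback stands in for the executeBatch() round trip; class and method names are illustrative): rows are buffered and shipped in groups, so the number of round trips drops from one per row to one per batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the bulking pattern behind addBatch()/executeBatch():
// buffer rows and flush them in fixed-size groups.
class BatchingWriter {
    private final int batchSize;
    private final Consumer<List<String>> flush; // stand-in for executeBatch()
    private final List<String> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<String>> flush) {
        this.batchSize = batchSize;
        this.flush = flush;
    }

    void add(String row) {
        buffer.add(row);                  // stmt.addBatch() equivalent
        if (buffer.size() == batchSize) { // one round trip per full batch
            flush.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    void close() {                        // ship the final partial batch
        if (!buffer.isEmpty()) {
            flush.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        int[] roundTrips = {0};
        BatchingWriter w = new BatchingWriter(100, batch -> roundTrips[0]++);
        for (int i = 0; i < 250; i++) w.add("txn-" + i);
        w.close();
        // 250 rows in batches of 100 -> 3 round trips instead of 250.
        System.out.println(roundTrips[0]);
    }
}
```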
Disabling auto-commit mode – By default, a JDBC connection commits at every insert/update/delete statement. This not only leads to too many commits, resulting in the infamous 'log file sync' wait [PERF002], but can also lead to integrity problems as it breaks the atomic unit-of-work principle.
Ensuring closure of prepared statements and connections – This conserves JDBC resources and prevents database connection leakage. It is done in the 'finally' block in Java so that the resources are closed before exiting even when exceptions occur.
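In modern Java the same guarantee is available via try-with-resources, which closes the statement and connection even when the body throws. A self-contained sketch (a stand-in AutoCloseable is used here, since a live JDBC connection is not assumed):

```java
// Demonstrates guaranteed closure: close() runs even when the body
// throws, mirroring closing a PreparedStatement/Connection in finally.
class ClosureDemo {
    static class TrackedResource implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) {
        TrackedResource res = new TrackedResource();
        try (TrackedResource r = res) {
            throw new RuntimeException("simulated query failure");
        } catch (RuntimeException e) {
            // exception propagated, but close() has already run
        }
        System.out.println(res.closed); // true: nothing leaked
    }
}
```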
Connection pool libraries – Pooling saves on the JDBC connection count and promotes better management and control. Industry-standard application servers use this as standard practice, and the same may now be used in batch frameworks. The right configuration of the pool is very important, especially the minimum/maximum pool sizes, otherwise the application threads will wait for connections. "BoneCP" is a third-party connection pool management library that was used in our application; it does a good job of managing the pool effectively.
Controlling the JDBC fetch size – Result set rows are retrieved from the database server in lots, as per the JDBC fetch size, and the JDBC driver allocates heap memory structures accordingly to accommodate the arriving data. The default value is 10, which means Oracle returns a 100-row result set in 10 row-sets. Too small a value means more network traffic between the client and the database; too large a value impacts the heap memory available on the client and can lead to out-of-memory errors (for example, code that sets the fetch size to the full number of rows to be shown on screen can hit out-of-memory issues). The fetch size selected should balance network traffic against the heap available on the client side. We selected a fetch size of 100, based on the maximum number of records a user would normally want to view for a given search criterion.
Database clustering – Implementing Real Application Clusters (RAC) can help scale the database layer significantly; the database is one of the most important layers, and an issue here normally cascades across all the other layers. RAC is a special case: unless the application is designed for RAC, performance can degrade by 30% or more on account of RAC-related waits for the transfer of Oracle data blocks over the high-speed interconnect between the SGAs of the different RAC nodes. While index hash partitioning, sequence caching, etc. can give some relief, real improvement comes from the right application workload partitioning in sync with database table partitioning.
For example, in a two-entity scenario, JVMs processing Entity 1 connect to RAC node A as the preferred instance (using Oracle Services) and JVMs processing Entity 2 connect to RAC node B as the preferred instance. The required tables are partitioned on entity ID and the partitions mapped onto separate tablespaces to isolate blocking scenarios. All this helps mitigate the RAC-related penalties and promotes improved performance, scalability and availability. It implies that enough thought has been given to RAC enablement at the design stage.
Implementation of RAC was postponed, as there was no infrastructure support available in the client environment.
Oracle-specific considerations [PERF003]
Physical design: The physical design of database storage has an important impact on the overall performance of the database.
The following are the factors that were considered for optimizing the database design:
The physical arrangement of tablespaces on storage volumes (hard disks), along with the mapping of database objects (tables, indexes, etc.) onto the tablespaces.
The number of redo log groups, their member count, and the log file sizes and placement (e.g. log groups and their members to be placed on separate and very fast disks).
The storage options of database objects within a tablespace (e.g. in Oracle, options in the storage clause such as PCTFREE and extents should be exercised).
Definition of indexes: which table columns should be indexed, and the type of index (e.g. B-tree, or bitmap indexes, recommended for status-type columns).
Other performance tuning options, depending on RDBMS features (e.g. partitioned tables and materialized views should be exploited).
Table partitioning – If volumes are going to be high, table partitioning has to be thought through in advance rather than later, when performance issues have already appeared. For databases above 1 TB, partitioning must be an essential consideration. It reduces hot spots such as the index last block, which arises with sequentially increasing index keys. We considered both table and index partitioning to get the best gains.
Partitioning helped us greatly in the large-scale removal of data during the archival and purging process. For example, removal of Jan 2012 data from a database table can be done very fast by dropping the monthly partition, as compared to a conventional delete of millions of rows. Additionally, it helped us improve the maintenance of the database: partitions were backed up independently, and partitions were marked 'read only' to save backup time.
8. KEY BENEFITS REALIZED
8.1 Web performance
The design criteria adopted for web page performance helped us meet the SLA expectations of the users. While the logged-in users were expected to number around 2000, concurrent usage was 200 users.
Figure 4 Average Response Time of the top 10 frequent use cases
The search page performance was manageable even when a portion of the users fired blank searches. Queries were adjusted to include a shortened date range to manage the data volume.
The introduction of the web server configuration to stop 304 calls originating from the browser boosted the performance of the application in regions where the network latency was upwards of 200 ms.
The graph above shows the response times of the key pages that the user community uses most frequently.
8.2 STP performance
The design adopted for processing files of varying sizes increased the throughput significantly. The dynamic spawning of file processors and the splitting of huge files into manageable batches improved the scalability of the system dramatically.
The following table provides the split-up of the various files and messages, totaling 300,000 transactions, that the system was able to process in 60 minutes:
Table 2: STP volumes to be supported on each of the key interfaces
8.3 Batch performance
This was one of the key components in the payment processing system. Parallel processing of functionally independent batch components helped us reduce the batch time window significantly. The following figure depicts the various batches grouped together and executed in parallel.
Figure 5 Groups of batches executed in parallel in the EOD profile
The right-sizing of the Oracle database's redo log volumes helped improve the IO of the system and thereby the processing performance. Appropriate commit size configuration, coupled with a very efficient third-party connection pool library, helped us increase the batch throughput significantly.
With these improvements, the batch component was able to complete the EOD profile on a volume of two million transactions, which was the peak-volume test mandated by the customer.
9. KEY LESSONS LEARNT FROM THE PROJECT
The following are the key lessons learnt from this project:
Performance is a key component to be planned and worked upon from the requirements stage through the performance testing stage; it cannot be an exercise undertaken only when issues are found. The performance of large systems hinges upon the ability to minimize IO by maximizing efficiency. Exploit caching features thoroughly, as they contribute immensely to the scalability of the system.
Quick results on web page performance can be realized using static content caching and by reducing 304 calls, especially in a WAN environment. It should be noted that IE8 and IE9 give better results than IE7 on account of more efficient rendering APIs in these versions; page performance can be expected to improve by 10 to 15% without requiring any code changes.
Keeping the number of SQLs fired per unit of output to the minimum is one of the key design objectives to be borne in mind. Once this is done, the rest of the tuning can be achieved easily with appropriate indexing.
Keeping the primary database size to the minimum through an archival policy is essential to contain the ever-growing transaction tables; without a good archival policy, the performance of the system is bound to deteriorate with the passage of time. Implementation of RAC requires careful thought in the application design, without which significant performance degradation may be experienced.
REFERENCES
[PERF001] Yahoo Developer Network. Best Practices for Speeding Up Your Web Site.
[PERF002] Donald Burleson. Oracle Tuning: The Definitive Reference.
[PERF003] Oracle Database Performance Tuning Guide.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Designing for Performance Management of Mission Critical Software Systems in Production
Raghu Ramakrishnan TCS
A61-A, Sector 63 Noida, Uttar Pradesh, India
201301 91-9810607820
[email protected] [email protected]
Arvinder Kaur USICT, GGSIPU
Sector 16C Dwarka, Delhi, India
110078 91-9810434395
Gopal Sharma TCS
A61-A, Sector 63 Noida, Uttar Pradesh, India
201301 91-9958444833
[email protected] [email protected]
Traditionally, the performance management of software systems in production has been a reactive exercise, often carried out after a performance bottleneck has surfaced or a severe disruption of service has occurred. In many such scenarios, the reason behind the behavior is never correctly identified, primarily due to the absence of accurate information. The absence of historical records of system performance also limits the development of models and baselines for proactive identification of trends that indicate degradation. This paper seeks to change the way performance management is carried out. It identifies five best practices, framed as requirements to be included in software system design and construction, that improve the overall quality of performance management of these systems in production. These practices were successfully implemented at design time in a mission critical software system, resulting in effective and efficient performance management of the system from the time it was operationalized.
1 Introduction The business and technology landscape of today is characterized by the increasing presence of mission critical web applications. These web applications have progressed from simple static-content applications to applications supporting all kinds of business transactions. The responsiveness of websites under the concurrent load of a large number of users is an important performance indicator for the end users and the underlying business. The growing focus on high performance and resilience has necessitated making comprehensive performance management an integral part of software systems. However, this is an area which has received limited focus and relies on a fix-it-later approach from a project execution perspective. The key to the successful performance management of critical systems in production is the timely availability and accuracy of data, which may then be analyzed for proactive identification of performance incidents. The inclusion of performance management requirements is essential in the design and construction phase of a software system. Our experience in handling performance-related incidents in critical web applications over the last few years has shown that the focus on inclusion of such techniques starts only when performance incidents get reported after the software system is in production. This may be too late, since there may be little room left for any significant design change at that point, or a major rework of the application may be required. This approach
is risk-prone and expensive, as making changes in the implementation phase of the software development lifecycle is difficult, incurs rework effort and up to a 100-fold increase in cost [YOGE2009]. A number of studies have shown that responsive websites impact the productivity, profits and brand image of an organization in a positive manner, while slow websites result in loss of brand value due to negative publicity and decreased productivity. A survey by the Aberdeen Group of 160 organizations reported an impact of up to 16% on customer satisfaction and 7% on conversions (i.e. the loss of a potential customer) due to a one-second delay in response time. The survey also reported that best-in-class organizations improved their average application response time by 273 percent [SIMI2008]. Satoshi Iwata et al. demonstrate the use of statistical process control charts based on the median statistic for detecting performance anomalies in processing time in RUBiS, a web-based prototype of an auction site [SATO2010]. This anomaly detection technique requires the timely availability of measured values obtained through appropriate instrumentation (e.g. response time from the web application). This paper tries to bring about a paradigm shift from the prevalent reactive and silo-based approach in the domain of performance monitoring of mission critical software systems to an analytics-based engineering approach, by including certain proven requirements as part of the design and development process. The objective is to know about a performance issue before a complaint is received from the end users. The silo-based approach analyzes dimensions such as web applications, web servers, application servers, database servers, storage, servers and network components in isolation. The reactive approach involves adding logs in a makeshift manner only when a performance incident occurs.
This in turn results in the required information not being available at the right time for effective detection, root cause analysis and resolution of the problem. This paper suggests including practices such as instrumentation, a system performance archive, controlled monitoring, simulation and an integrated operational console as part of the design and development process of software systems. These requirements are not aimed at improving or optimizing the performance of the software system, but at enhancing the effectiveness and efficiency of performance monitoring in production. They were successfully included as part of the design and development of a mission critical e-government web application built using J2EE technology. This has helped the production support team recognize early warning signs that may lead to a possible performance incident and take corrective action quickly. The rest of this paper is organized as follows. Section 2 describes the application in which the proposed best practices were implemented. Section 3 describes the best practices in detail. Section 4 presents the results and findings of our work. Section 5 provides the summary, conclusions, limitations of our work and suggestions for future work.
2 Background These requirements were successfully included as part of the design and development of a mission critical e-government web application built using J2EE technology and servicing more than 40,000 customers every day. The web application in this e-governance program is used by both external and department users for carrying out business transactions. The external users access the application over the Internet and the department users over the Intranet. The technical architecture has five tiers for external users and three tiers for department users. The presentation components are JSPs and the business logic components are POJOs developed using the Spring framework. External users: The information flow from external users passes through five tiers. Tiers 1, 2A and 3 host the web application server; tiers 2B and 4 host the database server. Tier 1 provides the presentation services. Tiers 2A and 2B carry out request routing from tier 1 to tier 3 after performing the necessary authentication and authorization checks. The business logic is deployed in tier 3. Tier 4 holds all transactional and master data of the application. Department users: The information flow from department users passes through three tiers. Tiers 5 and 6 host the web application server. Tier 5 provides the presentation tier. The business logic is deployed in tier 6. Tier 4 holds all transactional and master data of the application.
Figure 2 shows the logical architecture of the e-government domain web application.
Figure 2: Logical architecture of the e-governance web application
3 Building Performance Management into Software Systems The existing approach to performance management of software systems in production is reactive and silo based. The silo-based approach involves measurements at the IT infrastructure component level, i.e. server, storage, web servers, application servers, database servers, network components and application server garbage collection health. The reactive approach involves adding log entries whenever a performance incident occurs. This section describes in detail five mandatory requirements that a software system needs to incorporate as part of the design and development process to ensure effective, proactive and holistic performance management once the system is in production. These requirements were successfully implemented in the design and development of a mission critical e-government web application, with excellent results. This web application provides a number of critical business transactions to end users and must meet stringent performance and availability service level agreement requirements.
3.1 Instrumentation
The instrumentation principle of Software Performance Engineering states the need to instrument software systems when they are designed and constructed [CONN2002]. Instrumentation is the inclusion of log entries in components for generating data to be used for analysis. These log entries do not change the application behavior or state. Correct and sufficient instrumentation helps in the quick isolation of performance hotspots and the determination of the components contributing to these hotspots.

The logs from various tiers and sources form an important input to the performance management of software systems in production. These logs can be from the application or the infrastructure tier. The logs from the application tier include web server logs and custom logs of the web application. The logs from the infrastructure tier include processor utilization, disk utilization, application server garbage collector logs etc. Techniques for implementing instrumentation include the use of filters, interceptors and base classes. Figure 3 shows the usage of a base class for implementing instrumentation.

The software system requirements in the area of performance and scalability traditionally do not mention an instrumentation requirement. The experience of the authors in managing large scale software systems showed instrumentation being introduced as a reactive practice towards the end of the software development lifecycle, for identification of performance incidents reported from end users or performance tests. This reactive approach results in rework and schedule slippage due to the code changes needed for instrumentation and the regression testing required following these changes. This paper recommends inclusion of this practice as a key requirement in the software requirements specification rather than it being limited to a best practice.
PRACTICE: Include sufficient instrumentation in all tiers for quick isolation of performance problems and identification of the component(s) contributing to performance problems.
public class TestBaseAction extends ActionSupport
        implements PrincipalAware, ServletRequestAware, SessionAware {

    public final String execute() throws Exception {
        Date begin = new Date();
        String res = execute2();   // execute2() holds the action's actual logic
        Date end = new Date();
        logger.info(".... " + (end.getTime() - begin.getTime()) + " ....");
        return res;
    }
}

Figure 3: A base class is a place to add instrumentation log entries.
Figure 4 shows entries from a standard web server log. Each entry includes a timestamp, the request information, execution time, response size and status. The software requirements specification can explicitly state that the web server log needs to be enabled for recording specific attributes. These entries can be aggregated over a time interval (e.g. two minutes) to arrive at statistics like count and mean response time, or used for steady state analysis of the software system. In a stable system, the rate of arrival of requests is equal to the rate at which requests leave the system.
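As an illustration of this aggregation, the following sketch groups combined-log-style entries (like those in Figure 4) into two-minute intervals. The field positions and the `aggregate` function name are assumptions for illustration, not part of the application described in this paper.

```python
from collections import defaultdict
from datetime import datetime

def aggregate(log_lines, interval_seconds=120):
    """Group web server log entries into fixed time intervals and report
    (request count, mean execution time) per interval.  Assumes each line
    carries an Apache-style timestamp in [...] brackets and the execution
    time as its last field (an assumption about the log layout)."""
    buckets = defaultdict(list)
    for line in log_lines:
        # e.g. 14/Aug/2014:22:06:13 (offset dropped for simplicity)
        stamp = line.split("[", 1)[1].split("]", 1)[0].split()[0]
        ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        exec_time = float(line.rsplit(None, 1)[1])  # last field
        key = (ts.date(),
               (ts.hour * 3600 + ts.minute * 60 + ts.second) // interval_seconds)
        buckets[key].append(exec_time)
    return {k: (len(v), sum(v) / len(v)) for k, v in buckets.items()}
```

Each bucket then maps a (date, interval) pair to the request count and mean execution time for that interval, which is exactly the pair of statistics plotted on the console graphs later in the paper.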
Figure 4: Using Web Server Logs as an Instrumentation Tool
XXX.00.777.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/news/ticker.jsp HTTP/1.1" 200 1078 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:13 +0530] "POST /OnlineApp/secure/AddressAction HTTP/1.1" 200 13495 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/images/bt_red.gif HTTP/1.1" 200 157 0
XXX.00.777.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/images/bullet_gray.gif HTTP/1.1" 200 45 0
XXX.00.777.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/status/tracking HTTP/1.1" 200 10055 0
ZZZ.77.99.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/css/doctextsizer.css HTTP/1.1" 200 73 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/news/ticker.jsp HTTP/1.1" 200 1078 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "GET /OnlineApp/CaptchaRxs?x=1483d7f9s71990 HTTP/1.1" 200 4508 0
ZZZ.77.99.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/secure/ServiceNeeded HTTP/1.1" 200 7346 0
ZZZ.77.99.1XX - - [14/Aug/2014:22:06:14 +0530] "POST /OnlineApp/user/loginValidate HTTP/1.1" 302 - 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "GET /OnlineApp/user/uservalidation HTTP/1.1" 200 8717 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "GET /OnlineApp/user/loginAction?request_locale=en&[email protected] HTTP/1.1" 302 - 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "POST /OnlineApp/secure/logmeAction HTTP/1.1" 200 7657 0

Figure 5 shows entries from a custom web application log. Each entry includes an entry and exit timestamp, the web container thread identifier, the request information, execution time, status and a correlation identifier. The software requirements specification can explicitly state that the custom application log needs to record specific attributes. The custom logs provide application specific information which may not be correctly reflected in the web server log (e.g. a web server log may report an HTTP status 200 even though the business transaction encountered a logical error). These entries can be aggregated over a time interval (e.g. two minutes) to arrive at statistics like count and mean response time, or used for steady state analysis of the software system.
Figure 5: Using Custom Logs as an Instrumentation Tool
3.2 System Performance Archive
This practice involves keeping a record of the history of the software system by storing the values of various metrics related to the performance of the system. These metrics provide a strong mechanism for reviewing past performance and identifying emerging trends. Brewster Kahle founded the Internet Archive to keep a record of the history of the Internet [BREW1996]. The HTTP Archive is a similar permanent record of information related to web page performance, like page size, total requests per page etc. [HTPA2011].

The system performance archive must capture a minimum of three important attributes for each measurement, namely the metric, the applicable domain and the measured value (e.g. the metric is the response time, the applicable domain is the application home page and the measured value is 4.2 seconds). The measured metric can be explicit or implicit. An example of an explicit metric can be derived from the requirement that "the software system shall be designed to process 99% of the online home page requests within 5 seconds". This archive is used as input to in-house and third party analytical tools to carry out statistical analysis (e.g. mean, median, standard deviation, percentile) and modeling (e.g. capacity planning).

This paper recommends inclusion of this practice as a key requirement in the software requirements specification. Critical software systems need to provision the required infrastructure for creating this archive, in terms of compute and storage, and the in-house or third party analytical tools for this critical functionality, at the time of design, capacity planning and construction. This compute and storage provisioning can easily be done in a cloud environment.
PRACTICE: Design and construct a system performance archive for critical software systems to keep a record of performance related information.
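As a minimal sketch of how such archive records might be queried, assuming each record is the (metric, domain, value) triple described above; the record layout, the sample values and the `summarize` helper are illustrative assumptions, not the paper's implementation.

```python
import statistics

# Each archive record carries the three mandatory attributes:
# the metric, the applicable domain, and the measured value.
archive = [
    ("response_time_s", "home_page", 4.2),
    ("response_time_s", "home_page", 3.8),
    ("response_time_s", "home_page", 4.6),
    ("response_time_s", "login", 1.1),
]

def summarize(records, metric, domain):
    """Descriptive statistics over all archived values of one metric
    within one applicable domain."""
    values = sorted(v for m, d, v in records if m == metric and d == domain)
    n = len(values)
    return {
        "count": n,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p99": values[min(n - 1, int(0.99 * n))],  # simple nearest-rank percentile
    }
```

In a real archive the records would of course live in durable storage rather than a list, but the query shape (filter by metric and domain, then compute descriptive statistics) stays the same.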
3.3 Controlled Monitoring
This practice involves executing synthetic read only business transactions using a real browser, connection speed and latency. These synthetic business transactions can be executed from one or more regions. There are a number of incidents in which an end user reports experiencing slowness but the server health appears normal. The practice of controlled monitoring helps quickly determine whether the incident is specific to the user reporting the problem. The software system requirements in the area of performance and scalability traditionally do not mention executing synthetic read only business transactions using a real browser at a real connection speed and latency. This requirement can be implemented using frameworks like a private instance of WebPageTest (https://sites.google.com/a/webpagetest.org/docs/private-instances) [WPGT]. Critical software systems need to provision the required infrastructure for carrying out this monitoring and upload the results to the System Performance Archive.
The custom application log entries shown in Figure 5:

2014-08-14 21:01:13,836 | WebContainer : 760 | -|DCBANKONL|class .secure.action.uploadform|-|Mon Aug 14 21:01:00 GMT+05:30 2014|13176|-|-|20140818210100000001ABBBA26d1ds668|
2014-08-14 21:01:41,507 | WebContainer : 755 | -|[email protected]|class online.secure.action.viewFormAction|-|Mon Aug 14 21:01:34 GMT+05:30 2014|7030|-|-|201408182101340s0ad0af00164515616|
2014-08-14 21:01:52,798 | WebContainer : 730 | -|[email protected]|class online.secure.action.payment.PaymentVerificationAction|-|Mon Aug 14 21:01:46 GMT+05:30 2014|5805|-|-|20140818210146000001AAAA65590404|
2014-08-14 21:02:34,466 | WebContainer : 699 | -|CCC0990|class online.secure.action.CreditCardPaymentAction|-|Mon Aug 14 21:02:26 GMT+05:30 2014|7733|-|-|20140818210226000002AAAA23518695|
2014-08-14 21:02:34,498 | WebContainer : 655 | -|[email protected]|class online.secure.action.ApplicationSubmitAction|-|Mon Aug 14 21:02:19 GMT+05:30 2014|15050|-|-|2014081820aa000d02AAaA5a118s7|
PRACTICE: Use synthetic read only business transactions using a real browser, connection speed and latency to measure performance.
3.4 Simulation Environment
This practice is based on the premise that most events happening in a system should be reproducible under similar conditions. The causal analysis of certain incidents may remain inconclusive during initial analysis. Recreating the symptoms leading to such an incident under similar conditions may lead to deeper insight and help in finding the actual root cause. Since such simulation is not feasible in the actual production environment in the majority of cases, a similar simulation test environment needs to be used. The prevalent practice in the industry appears to be to treat performance testing as a single, one-time activity prior to implementation, resulting in a simulation environment being provisioned only for a limited duration. As a result, reproducing a complex problem that occurred in production becomes extremely difficult.
PRACTICE: Provide a simulation environment to reproduce performance incidents in production like conditions to ensure completeness and correctness of the causal analysis of those incidents.
3.5 Integrated Operations Console
In order to manage the performance of a production system effectively, it is essential that production support teams have the ability to visualize anomalies and resolve exceptions without delay. The prevalent silo based approach involves measurements at the IT infrastructure component level, i.e. server, storage, web servers, application servers, database servers, network components and application server garbage collection health. The concept of an Integrated Operations Console can be very effective in such scenarios. This console not only monitors the system performance, but also records exception conditions and provides the ability to take actions to resolve these conditions. The typical actions may range from killing a process or query to restarting a service. The console also needs to provide a component level checklist which can be executed automatically prior to the start of operations every day. Table 1 shows an extract of a database checklist.
This console empowers the teams to take quicker action once an exception is observed. Besides allowing actions, the console may automatically gather relevant data, such as heap dumps and database snapshots, to aid further investigation.
Host accessible?
Instance available?
Able to connect to the database?
....

Table 1: Extract of a database checklist
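A checklist of this kind lends itself to automation. The sketch below is a hypothetical illustration: the check names follow Table 1, but the `run_checklist` helper and the lambda probes are not from the paper.

```python
def run_checklist(checks):
    """Run each named check before start of operations and return the
    names of the checks that failed, so the operations team can act."""
    failures = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a probe that raises counts as a failure
        if not ok:
            failures.append(name)
    return failures

# Hypothetical probes; real ones would ping the host, query the
# instance status, open a connection, etc.
checks = [
    ("Host accessible?", lambda: True),
    ("Instance available?", lambda: True),
    ("Able to connect to the database?", lambda: False),  # simulated failure
]
```

An empty result means the component passed its daily checklist; any returned names would surface on the console as exceptions to resolve.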
PRACTICE: Provide an Integrated Operations Console for monitoring the system performance parameters, with a mechanism to resolve anomalies for which resolution processes are known.
4 Results & Findings

The above five practices were successfully implemented as part of the design and construction of a mission critical e-government domain web application.
4.1 Implementation 1
The instrumentation was implemented as an integral part of the design and construction activity of the e-governance application. Figure 6 shows the instrumentation implemented in that application. Tiers 1, 2A, 3, 5 and 6 implemented instrumentation in the form of web server logs and custom logs. Tiers 2B and 4 implemented instrumentation in the form of database snapshots. The relevant logs from all tiers are collected at a shared location. The information from these logs is processed and used as input to the System Performance Archive and the Integrated Operations Console.
Figure 6: Instrumentation implemented in web server, application server and database
The first example shows how simple instrumentation helped find the cause of a performance incident where end users experienced high response times. Figures 7 and 8 show the request and response time graphs of tiers 1 and 3 respectively, calculated from the web server logs and depicted on the Integrated Operations Console. The request count is the number of requests serviced in a given time interval and the response time is the mean execution time of all requests serviced in that interval. Visual inspection of Figure 7 clearly shows a high mean response time in tier 1. The spike in mean response time is not visible in tier 3 for the same duration, but there is a drop in the number of requests serviced by this tier. This helps us conclude that the origin of the performance incident may be tier 1 or tier 2A.
Figure 7: The request and response time graph of tier 1 from the Integrated Operations Console. The request count is the number of requests serviced in a given time interval and the response time is the mean execution time of all requests serviced in that interval.
Figure 8: The request and response time graph of tier 3 from the Integrated Operations Console. The request count is the number of requests serviced in a given time interval and the response time is the mean execution time of all requests serviced in that interval.
The second example shows how the same instrumentation helps in determining whether the system is in a steady state. Figures 9 and 10 show that the system is in a stable state, or equilibrium, as the number of arrivals is equal to the number of exits. The arrivals and exits graphs are depicted on the Integrated Operations Console.
Figure 9: The count of arrivals in a given time interval from the Integrated Operations Console
Figure 10: The count of exits in a given time interval from the Integrated Operations Console
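The steady state test illustrated by Figures 9 and 10 can be sketched as a simple comparison of arrival and exit counts. The function name, the sample per-interval counts and the 5% tolerance are assumptions for illustration only.

```python
def is_steady(arrivals, exits, tolerance=0.05):
    """A system is in steady state over a window when the number of
    requests arriving matches the number leaving, to within a small
    relative tolerance (assumed here to be 5%)."""
    total_in, total_out = sum(arrivals), sum(exits)
    if total_in == 0:
        return total_out == 0
    return abs(total_in - total_out) / total_in <= tolerance

arrivals = [120, 118, 125, 122]  # per-interval request counts entering
exits    = [119, 120, 124, 121]  # per-interval request counts leaving
```

When exits persistently lag arrivals, requests are queuing up somewhere in the tiers, which is exactly the condition the console graphs are meant to make visible.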
4.2 Implementation 2
The provisioning of a simulation environment even after implementation helped resolve serious, long stop-the-world garbage collection pauses in the application servers on tiers 1 and 3. Simulating the problem and confirming its resolution required multiple cycles of test execution. Identifying the cause of these pauses as a class unload problem required adding debug logs in a Java runtime class. Executing multiple test cycles to reproduce the problem is not feasible in the development or production environment. Figure 11 shows the garbage collection log with more than 91,000 classes getting unloaded.
Figure 11: The garbage collection log from the Simulation Environment
4.3 Implementation 3
The System Performance Archive provides insight into historical as well as emerging performance related trends of the software system. These trends are crucial to assess the capability of the software system to render services while meeting the required performance objectives. This information is also used in capacity planning exercises. Figure 12 shows the implementation of the System Performance Archive in the e-government domain application.
Figure 12: Implementation of the System Performance Archive
Certain trends may be cyclic and may appear only after a particular period. Other trends may be more permanent in nature and tend to grow or decline. The implementation used descriptive statistics like count, mean, median, minimum, maximum, percentile and standard deviation. Figure 13 shows the mean response time trend of a business transaction (BT1) for a period of a month. The response time ranges between 800 and 900 milliseconds and, as can be clearly seen, remained constant for the complete month. Figure 14 shows the mean response time trend of a second business transaction (BT2) for the same month. There is a change in the trend from the 7th (1919 milliseconds to 2004 milliseconds) and again on the 29th (2176 milliseconds to 2813 milliseconds). These changes in the trends were investigated further to find their causes. The change on the 7th was due to additional business logic added to the business transaction as part of the deployment on the 6th. The increase in the mean response time on the 29th was due to a network issue resulting in transactions executing slowly.
Figure 13: Daily mean response time trend of a business transaction (BT1) for a month
Figure 14: Daily mean response time trend of a business transaction (BT2) for a month
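A crude sketch of the trend inspection behind Figures 13 and 14: flag any day whose mean response time moved by more than a relative threshold versus the previous day. The series and the 4% threshold are synthetic, chosen only to mimic the BT2 step changes described above, and the `shift_days` helper is not the paper's implementation.

```python
def shift_days(daily_means, threshold=0.04):
    """Return the (1-indexed) days whose mean response time changed by
    more than `threshold` relative to the previous day."""
    flagged = []
    for day in range(1, len(daily_means)):
        prev, cur = daily_means[day - 1], daily_means[day]
        if prev and abs(cur - prev) / prev > threshold:
            flagged.append(day + 1)  # convert 0-based index to day number
    return flagged

# Synthetic 30-day month shaped like BT2: a small step on day 7
# and a larger one on day 29 (values in milliseconds).
bt2 = [1919] * 6 + [2004] * 22 + [2813] * 2
```

A production implementation would more likely compare each day against a rolling baseline with a statistical test, but even this naive day-over-day check surfaces the two shifts discussed above.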
5 Threats to Validity

The practices described in this paper are based on the authors' experience in working on and managing mission critical web applications. These practices may need to be augmented with additional practices which the software performance engineering community may share.
6 Conclusions

In this paper we introduced a few mandatory requirements to be included as part of the design and construction of a mission critical software system. These requirements address a gap in the prevalent practice, in which production support teams are often not equipped to quickly detect a performance incident, gather enough information during an incident, or reproduce the incident for a more accurate closure. When included as part of design, these practices provided significant benefits in the production support of a mission critical e-government domain system: timely detection of performance incidents allowing corrective action, visualization of emerging trends, and more correct closure of incidents.
7 Future Work The future work in this area includes creating baselines and statistical models for metrics like response time and throughput. There is also a need to devise proactive anomaly detection models using techniques like steady state analysis.
References
[YOGE2009] K. K. Aggarwal and Yogesh Singh, “Software Engineering”, New Age International Publishers, p470, 2009.
[SIMI2008] B. Simic, "The Performance of Web Applications: Customers Are Won or Lost in One Second", Technical Report - Aberdeen Group, Accessed on 31 Jan 2014 at http://www.aberdeen.com/aberdeen-library/5136/RA-performance-web-application.aspx
[SATO2010] S. Iwata and K. Kono, Narrowing Down Possible Causes of Performance Anomaly in Web Applications, European Dependable Computing Conference, p185-190, 2010.
[CONN2002] Connie U. Smith, Performance Solutions – A practical guide to creating responsive, scalable software, Addison Wesley, p243, 2002.
[BREW1996] Brewster Kahle, Internet Archive, Accessed on 17 Aug 2014 at http://en.wikipedia.org/wiki/Brewster_Kahle
[HTPA2011] HTTP Archive, Accessed on 31 Jan 2014 at http://httparchive.org/
[WPGT] WebPageTest, Accessed on 31 Oct 2014 at https://sites.google.com/a/webpagetest.org/docs/private-instances
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Incremental Risk Charge Calculation: A case study of performance optimization on many/multi core platforms

Amit Kalele, Manoj Nambiar and Mahesh Barve
Center of Excellence for Optimization and Parallelization
Tata Consultancy Services Limited, Pune, India
Incremental Risk Charge calculation is a crucial part of credit risk estimation. This data intensive calculation requires huge compute resources, and a large grid of workstations was deployed at a large European bank to carry out these computations. In this paper we show that with the availability of many core coprocessors like GPU and MIC and parallel computing paradigms, a speedup of an order of magnitude can be achieved for the same workload with just a single server. This proof of concept demonstrates that with the help of performance analysis and tuning, coprocessors can be made to deliver high performance with low energy consumption, making them a "must-have" for financial institutions.
1. Introduction

Incremental Risk Charge (IRC) is a regulatory charge for default and migration risk for trading book positions. Inclusion of IRC is made mandatory under the new Basel III reforms in banking regulations for minimum trading book capital. The calculation of IRC is a compute intensive task, especially for methods involving Monte-Carlo simulations. A large European bank approached us to analyze and resolve performance bottlenecks in IRC calculations. The timing reported on a grid of 50 workstations at their datacenter was approximately 45 min. Risk estimation and Monte Carlo techniques are well studied topics; details can be found in [1], [2], [3] and [4]. In this paper we focus on the performance optimization of the IRC calculations on modern day many/multi core platforms.
Modern day CPUs and GPUs (Graphics Processing Units) are extremely powerful machines. Equipped with many compute cores, they are capable of performing multiple tasks in parallel. Exploiting their parallel processing capabilities, along with several other optimization techniques, can result in a manyfold improvement in performance. In this paper we present our approach to performance optimization of IRC calculations. We show that multifold gains, in terms of reduction in compute time, hardware footprint and energy required, can be achieved: speedups of 13.5x and 5.2x were achieved on Nvidia's K40 GPU and the Intel KNC coprocessor respectively.
In this paper we present performance optimization of IRC calculations on Nvidia's K40 [11] and Intel's Xeon Phi (KNC) coprocessors. We also present benchmarks on Intel's latest available platforms, namely the Sandy Bridge and Ivy Bridge processors. The paper is organized as follows. In the next section (2), we briefly describe incremental risk charge in relation to credit risk. In sections (3) and (4), we introduce a method for the IRC calculation along with the experimental setup and procedure. The performance optimization of IRC calculations on Nvidia's K40 and Intel's KNC coprocessors is presented in sections (6) and (7). We present our final experimental results and achievements in section (8).
2. Credit Risk & Incremental Risk Charge

Basel-III, a comprehensive set of reforms in banking prudential regulation, provides clear guidelines on strengthening the capital requirements through:

o Re-definition of capital ratios and the capital tiers
o Inclusion of additional parameters into the Credit and Market Risk framework, like IRC, CVA (Credit Valuation Adjustment) etc.
o Stress testing, wrong way risk and liquidity risk
The regulatory reforms, the ongoing change in the derivatives market landscape and the changing behavior of clients are moving the risk function from a traditional back office role to a real time function. This redefinition of capital adequacy and the requirements for efficient internal risk management have increased the amount of model calculation required within the Credit and Market risk world, and thus there is a need for large scale computing. Incremental Risk Charge is one such problem, and the focus of this paper.

IRC calculation is crucial for any financial institution in estimating credit risk. The IRC calculation involves various attributes like Loss Given Default (LGD), credit rating, ultimate issuer, product type etc. Standard as well as proprietary algorithms are used to calculate IRC, and methods involving Monte Carlo simulations are extremely compute intensive. In the next section we present one such algorithm and discuss its computational bottlenecks.
3. Fast Fourier Transforms in IRC

IRC is a regulatory charge for default and migration risk for trading book positions. One approach to IRC calculation, based on Monte Carlo simulations, is described in Figure 15.

Figure 15: IRC Calculation Flow

The data involved in the default loss distribution is huge. In our case, the FFT computation for this data was offloaded to a grid of 50 workstations. A typical IRC calculation for a single scenario involves computation of FFTs for 160,000 arrays, each consisting of 32768 random numbers arising out of random credit movement paths. This translates to approximately 37GB of data to be processed for FFT computation. In all we have to process 133 such scenarios, which makes it a huge data and compute intensive problem. To summarize the overall complexity of the problem:

o 1 scenario of IRC calculation: 37GB of data
o Total scenarios: 133
o Total data to be processed: (133 * 37) GB = 4.9 TB
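The stated totals can be checked with simple arithmetic; the 37 GB per scenario figure is taken directly from the paper, while the variable names below are only for illustration.

```python
# Problem size as stated in the paper.
arrays_per_scenario = 160_000
elements_per_array = 32_768
gb_per_scenario = 37      # approximate processed volume per scenario, per the paper
scenarios = 133

# Random values generated per scenario, and the total data volume.
values_per_scenario = arrays_per_scenario * elements_per_array
total_gb = scenarios * gb_per_scenario
total_tb = total_gb / 1000
```

This gives roughly 5.24 billion random values per scenario and a total of 4921 GB, i.e. about 4.9 TB, matching the summary above.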
To simulate the above computations, we carried out the following procedure. For each IRC scenario:

o Create 160,000 arrays, each of 32768 elements
o Fill each array with random numbers between 0 and 1
o Transfer the data in batches from the host (server) to the coprocessor (Nvidia's K40 GPU or Intel Xeon Phi/MIC) over the PCIe bus
o Compute the FFTs and copy back the results
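A drastically scaled-down sketch of this procedure, using a textbook radix-2 FFT in place of cuFFT/MKL and tiny array counts and lengths so it runs anywhere; the function names and sizes are illustrative assumptions, not the paper's code.

```python
import cmath
import random

def fft(x):
    """Minimal recursive radix-2 FFT (for illustration only; the actual
    study uses cuFFT on the GPU and MKL on the CPU).  Length must be a
    power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

def simulate_scenario(num_arrays=4, array_len=8, seed=0):
    """Scaled-down version of the procedure above: fill each array with
    random numbers in (0, 1) and transform it.  The real workload uses
    160,000 arrays of 32768 elements per scenario."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_arrays):
        data = [rng.random() for _ in range(array_len)]
        results.append(fft(data))
    return results
```

In the real experiment the "transform each batch" step is preceded by a host-to-coprocessor transfer over PCIe, which, as the later sections show, turns out to dominate the runtime.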
4. Experiment Environment

The following hardware setup and software libraries were used to carry out the above procedure. Performance analysis was done using Nvidia's "nvvp" visual profiler tool.

The GPU benchmarks reported in this paper were carried out on the following system, enabled by the Boston-Supermicro HPC labs, UK:

o Host: Intel Xeon E5 2670V2, 2 sockets (10 cores x 2), 64GB of RAM
o GPU: K40 x 4 (in x16 slots), 12GB RAM
o The freely available cuFFT library from Nvidia is used for FFT calculations [7], [8], [12]

Access to the Intel Xeon systems was enabled by Intel India. All the experiments were carried out on the following setup:

o Intel X5647 (Westmere), 4 cores, 2.93 GHz, 24GB RAM
o Host: Intel Xeon E5 2670, 2 sockets (8 cores x 2), 2.7 GHz, 64GB of RAM
o Coprocessor: KNC, 1.238 GHz, 16GB of RAM, 61 cores
o The Intel MKL library is used for FFT calculations on the host as well as the coprocessor

In the following sections we discuss performance tuning of IRC calculations on these platforms.
5. IRC Calculations on Intel Westmere

We implemented the procedure explained in the previous section on the Intel Westmere platform with Intel's MKL math library. The MKL is a collection of several libraries spanning linear algebra to FFT computations, and provides various APIs for creating plans and performing different FFTs. The following APIs were used to perform the 1D FFT in our exercise:

DftiCreateDescriptor(); DftiSetValue();
DftiCommitDescriptor();
DftiComputeForward(); DftiFreeDescriptor();

Since 4 cores were available for computation, a multi-process application was developed using the Message Passing Interface (MPI) [5], [6]. The overall computation was divided equally among all 4 cores. A code snippet of the main compute loop is given in Figure 2.
int num_arrays, nprocs, myrank;
int mystart, myend, range;

range = num_arrays / nprocs;
mystart = myrank * range;
myend = mystart + range;

for (i = mystart; i < myend; i++)
{
    load_data(buffer);
    DftiCreateDescriptor();
    DftiSetValue();
    DftiCommitDescriptor();
    DftiComputeForward(buffer);
    DftiFreeDescriptor();
}

Figure 2: The main compute loop, partitioned across MPI ranks

Since the FFT computation for each array is independent, no communication was required among the MPI ranks. It took 194 minutes to complete the 133 IRC scenarios. It would require ~40 Westmere servers to complete these calculations under the 5 minute mark, which adds too much cost in terms of hardware and power requirements. We hoped to achieve better performance with coprocessors and to reduce the hardware and power requirements. We consider Nvidia's K40 and Intel's KNC coprocessor in the following sections.

6. IRC on Nvidia K40 GPU

The K40 GPU is Nvidia's latest Tesla series coprocessor. It has 2880 lightweight GPU compute cores and is rated at around 1 TF of peak performance. Such platforms are extremely suitable for data parallel workloads. The cuFFT library was used for the FFT computations. In this section, we describe the performance optimization in a step-by-step manner starting with a baseline implementation. Each step includes the measures taken in the earlier steps.

o Baseline Implementation

Using the above mentioned procedure, a baseline implementation of the FFT calculations was carried out. This involved creating the appropriate arrays, calling the cuFFT functions cufftPlan1d (for creating a 1D plan) and cufftExecR2C (for computing the transform), and finally copying the data back to the host using cudaMemcpy.

It took ~67 min to compute the 133 scenarios. We observed that the majority of the time (~61 min) was spent in data transfer between the host and the device, which happens over the PCIe bus. Profiling the application using nvvp revealed that data transfer over the PCIe bus was happening at only 2 - 3 GBps. Figure 3 shows a snapshot of the nvvp output.

Data is always transferred between pinned memory on the host and device memory. Since a normal allocation (using malloc()) always yields pageable memory, an extra step happens internally: allocating pinned memory and copying the data between pinned memory and pageable memory.

Figure 3: Data throughput with pageable memory
o Performance Optimization

The major performance issue observed was the data transfer speed. We carried out a couple of optimizations to resolve this issue, discussed below.

Usage of Pinned Memory: The data for 1 IRC scenario is approximately 37GB. The data transfer rate achieved was poor since pageable memory was used. CUDA provides separate APIs to allocate pinned memory (cudaHostAlloc and cudaMallocHost). With pinned memory usage, we achieved a throughput of 5 - 6 GBps, a speedup of ~2.5x. The data transfer time for the 133 IRC scenarios reduced to around 25 min and the overall time was ~31 min.

Figure 4: Data transfers with pinned memory

Multi Stream Computation: In the current scheme of things, the data transfers and computations were happening sequentially in a single stream, as shown in Figure 5. By enabling multi stream computation, we could achieve a two way overlap:

o Computations with data transfer: GPUs have different engines for computations (i.e. launching kernels) and data transfer (i.e. cudaMemcpy). The computations were arranged in such a way that the computations for one set and the data transfer for the next set happened simultaneously; see Figure 6.

o Data transfer overlap: GPUs are capable of transferring data from host to device (H2D) and from device to host (D2H) simultaneously. With 4 streams, we could achieve complete overlap between the H2D and D2H transfers.

Figure 5: Computation in 1 stream
Figure 6: Computation in 4 streams with overlaps

With these overlaps, a further speedup of approximately 2.67x was achieved, and the time for the 133 IRC scenarios reduced to ~11 min.

A single server can host multiple coprocessor cards, so within a box we could still enhance the performance by using multiple GPUs. This however has a limitation of data transfer bandwidth. Our experimental setup had 2 GPUs in x16 PCIe slots. The above optimized implementation was extended to use two GPUs. The final execution time obtained was 5.6 min. Figure 7 highlights the step-by-step performance improvement. A marginal dip in the performance scaling is observed, which is attributed to the sharing of bandwidth for data transfer between the host and multiple devices. The overall scale up achieved was close to 2x with 2 devices.
7. IRC on Intel KNC

As with Nvidia's K40 GPU, we also carried out the above exercise on the Intel KNC coprocessor. The KNC was Intel's first coprocessor with 61 cores, and it also supports 512 bit registers for vector processing. These two features together provide tremendous computing possibilities, similar to Nvidia GPUs. Intel also offers a highly optimized math library (MKL), a collection of several libraries spanning linear algebra to FFT computations. However, unlike cuFFT, MKL is not freely distributed. The MKL provides various APIs for creating plans and performing different FFTs; the following APIs were used to perform the 1D FFT in our exercise:

DftiCreateDescriptor(); DftiSetValue();
DftiCommitDescriptor();
DftiComputeForward(); DftiFreeDescriptor();

Unlike GPUs, which work only in offload mode, the KNC coprocessor can be used for computation in native mode, symmetric mode and offload mode. In offload mode, the main application runs on the host and only the compute intensive sections of the application are offloaded to the coprocessor. In native mode, the full application runs on the coprocessor, and in symmetric mode both the host and the coprocessor run parts of the application. In this exercise all the readings reported for the KNC were taken in native mode; only the final reading of the optimized code was taken in symmetric mode.
Figure 7 Step-by-step performance improvement on K40 (execution time in minutes for the IRC problem: Baseline 67, Pinned Memory 31, Pinned Memory + Multi Streams 11, 2 GPUs 5.6)
The biggest advantage of using the KNC coprocessor in native mode is that no code-level changes
were required: the implementation done for the Westmere platform was simply recompiled for the KNC
platform. The overall computation was divided equally among all the 60 cores.
Each rank, or core, computed FFTs for the arrays in its range. This baseline code took 120 min for 133
IRC scenarios. Though the compute time was reduced compared to the Westmere platform (from
194 min to 120 min), the gain was not as large as expected. We discuss below the changes made to enhance
the performance.
o Performance Optimizations
Since we were operating in native mode, no data transfer between host and coprocessor was
involved, and the MKL library used for FFT computation is a highly efficient one. To identify
performance issues, we referred to Intel's guides to best practices [9], [10] on KNC and MKL.
We exploited some of these techniques, which resulted in improved performance. We present
them below:
Thread binding: Many-core coprocessors achieve the best performance when threads
do not migrate from core to core during execution. This can be achieved by setting an
affinity mask to bind the threads to the coprocessor cores. We observed around 5–7%
improvement by setting the proper affinity. The affinity can be set via the KMP_AFFINITY
environment variable with the command:
export KMP_AFFINITY=scatter,granularity=fine
Memory alignment for input/output data: To improve the performance of data access,
Intel recommends that the memory addresses for input and output data be aligned to 64 bytes.
This can be done by using the MKL function mkl_malloc() to allocate the input/output
memory buffers. This provided a further boost of 7–9% in performance.
Re-using DFTI structures: Intel recommends reusing the MKL descriptors if the
FFT configuration remains constant, as this reduces the overhead of initializing the various DFTI
structures. The MKL functions DftiCreateDescriptor and
DftiCommitDescriptor allocate the necessary internal memory and perform the
initialization that facilitates the FFT computation. This may also involve exploring
different factorizations of the input length and searching for the most efficient
computation method. For the problem under consideration, the array sizes, the type of data, and the type
of FFT remain unchanged throughout the application. Hence these descriptors can be
initialized only once and then reused for all the data. Initializing the descriptors
once, outside the main compute loop, gave a ~3.6x performance gain.
With all the above changes in place, we observed a significant improvement in the performance of
the IRC calculations: the time for 133 IRC scenarios was reduced from 120 min to approximately
32 min.
As with GPUs, a single server can host multiple KNC coprocessors. Since such a setup was not
available, we can only estimate that the IRC calculations would take around 16–17 min on 2 KNCs.
8. Final Results
In the earlier sections we discussed the performance optimization of the IRC calculations on Nvidia's
K40 and Intel's KNC coprocessors. Both platforms are capable compute resources with
their own pros and cons. In this section we summarize the overall achievements and other benefits
enabled by this optimization exercise.
o Execution time with hybrid computing
Several-fold performance improvements were achieved on both coprocessors, with all the workload
taken by the coprocessors. However, the host machine could also be utilized to share part of the
workload. In the case of the Intel KNC, only a recompilation of the code was required to facilitate this;
for the K40, we had to rework the code to accommodate these changes, which was achieved
by combining MPI and CUDA C.
A further 35–40% speedup was achieved on both the KNC and the K40 by enabling the
workload sharing.
Of the 160,000 arrays per IRC scenario, 60,000 were processed on each of the K40s and
40,000 on the host. In the case of the KNC, the split was 30,000 on each KNC and 100,000
arrays on the host.
Figure (8) summarizes the best results achieved with hybrid computing on the coprocessors,
along with other Intel platforms.
Figure 8 IRC Performance comparison across all platforms
Clearly the K40 GPUs perform better than the KNC coprocessor. However, the KNC offers ease of
programming: any x86 application requires only a recompilation to work on the KNC. On the
other hand, porting an application to the K40 requires substantial programming effort in CUDA C.
o Energy Consumption
The energy required to carry out the computations directly affects their cost. In
our experiments, the K40 performed best. Taking this as the benchmark, we rationalize the
hardware requirements of the other platforms to achieve the same performance and in turn calculate
[Figure 8 data: execution time in minutes for the IRC problem across Original (50 workstations), 2 Westmere, 2 Sandy Bridge, 2 Ivy Bridge, 2 K40 Hybrid, and 2 KNC Hybrid; plotted values 45, 97, 8.95, 8.15, 3.38, and 8.645]
the energy required to carry out the computation. The energy consumption is computed
considering the rated wattage of each Intel and Nvidia platform.
Figure (9) shows that a drastic reduction in compute time and computation cost is achieved
by optimizing the IRC calculations on both platforms. The gains are not limited to these
factors: the exercise also enabled a huge reduction in hardware footprint and data center floor space,
and a more compact, easier-to-maintain system.
Figure 9 Energy requirement for best performance for all platforms
9. Conclusion
This paper highlights the importance of optimizing an application for a given platform. The
baseline results suggest that simply using new hardware with libraries almost always
results in suboptimal performance. Modern many-core GPUs and coprocessors have
tremendous computing capabilities, but new and legacy applications can achieve large
gains only through optimization driven by detailed analysis and measurement with proper profiling
tools. In this paper we illustrated this with the example of IRC calculations. Though
the chosen application is from financial risk computation, compute-intensive applications from
various domains can benefit from performance optimization with many-core parallel computing.
The highlights of the work are as follows:
With the optimizations, we achieved approximately 13.5x and 5.2x speedups on the K40 and
KNC respectively for the IRC calculations, and a ~150x reduction in energy consumption.
Hybrid computing utilizes both the host and the coprocessor and in turn gives the best
performance.
[Figure 9 data: energy required in kWh across Original (50 workstations), Westmere, Sandy Bridge, Ivy Bridge, KNCs with Hybrid, and K40s with Hybrid; plotted values 4.74, 5.79, 0.053, 0.044, 0.124, and 0.032]
These high-performance setups (coprocessor hardware + optimized applications) would allow
banks and financial institutions to simulate many more risk scenarios in real time and enable
better investment decisions.
We conclude this paper on the note that, with proper performance optimization, many/multi-core
parallel computing with coprocessors enables multi-dimensional gains in terms of reductions
in compute time, cost of computation, and hardware footprint.
Acknowledgement
The authors would like to thank Jack Watts from Boston Limited (www.boston.co.uk) and
Vineet Tyagi from Supermicro (www.supermicro.com) for enabling access to their HPC labs for
K40 benchmarks. We are also thankful to Mukesh Gangadhar from Intel India for enabling
access to Intel KNC coprocessor.
References
[1] T. Wood, “Applications of GPUs in Computational Finance”, M.Sc Thesis, Faculty of
Science, Universiteit van Amsterdam, 2010.
[2] P. Jorion, “The new benchmark for managing financial risk”, 3rd ed., New York, McGraw-
Hill, 2007.
[3] J. C. Hull, “Risk Management and Financial Institutions Prentice Hall”, Upper Saddle River,
NJ, 2006.
[4] P. Glasserman, “Monte Carlo Methods in Financial Engineering”, Appl. of Math. 53,
Springer, 2003.
[5] Web Tutorial on Message Passing Interface, www.computing.llnl.gov/tutorials/mpi
[6] Peter Pacheco, Parallel Programming with MPI, www.cs.usfca.edu/peter/ppmpi/
[7] Kenneth Moreland, Edward Angel, "The FFT on a GPU",
http://www.sandia.gov/~kmorel/documents/fftgpu/fftgpu.pdf
[8] Xiang Cui, Yifeng Chen, Hong Mei, "Improving Performance of Matrix Multiplication and
FFT on GPU", 15th International Conference on Parallel and Distributed Systems, 2009.
[9] Intel guide for Xeon Phi: https://software.intel.com/sites/default/files/article/335818/intel-
xeon-phi-coprocessor-quick-start-developers-guide.pdf
[10] Tuning FFT on Xeon Phi: https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-
functions-performance-on-intel-xeon-phi-coprocessors
[11] Nvidia K40 GPU: http://www.nvidia.com/object/tesla-servers.html
[12] Nvidia cuFFT library: https://developer.nvidia.com/cuFFT
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Performance Benchmarking of Open Source Messaging Products
Yogesh Bhate, Abhay Pendse, Deepti Nagarkar
Performance Engg. Group, Persistent Systems
[email protected] [email protected] [email protected]
Abstract- This paper shares the experiences and findings collected during a 6-month
performance benchmarking activity carried out on multiple open source messaging products. The primary
aim of the activity was to identify the best-performing messaging product from around 5 shortlisted
products available in the open source community. Specific requirements were provided
against which the benchmarking activity was carried out. This paper covers the objective, the plan, and
the execution methodology followed, and shares the detailed numbers that were
captured during the tests.
1. Introduction
A large-scale telescope system is being built by a consortium of 4-5 countries. The project
consists of the manufacturing, installation, and operation of a 30-meter telescope and its related
software subsystems. All software subsystems that control the telescope or use the output provided by
the telescope need to communicate with each other through a backbone set of services providing multiple
common functionalities such as logging, security, monitoring, and messaging. Messaging, or the Event Service as
it is called, is one of the primary services in the backbone infrastructure of the telescope software
system. The software subcomponents talk to one another using a set of events, and those events need to
be propagated to the correct target in real time.
The event service backbone had stringent performance requirements, which are listed in subsequent
sections. The event service was planned to be a thin API layer over a well-known open source messaging
product. This allowed the software planners to keep open the option of changing the middleware during the
lifecycle of the event service. The software lifecycle was required to be a minimum of 30 years
from the date of commissioning of the telescope system. Benchmarking open source messaging
platforms for use in the Event Service development was the primary goal of this project.
2. Benchmarking Details
2.1. Functional Requirements for Benchmarking
The customer provided some very specific requirements which were to be considered during the
benchmarking activity. Below is a summary of those requirements:
No Persistence: The messages or events sent via the event service are not expected to be persisted, nor are they expected to be durable.
Unreliable Delivery: Message delivery need not be reliable; some message loss in the messaging system is acceptable.
No Batching: No batching should be used to send messages or events. As soon as an event is generated it has to be sent on the wire to the listeners/subscribers.
Distributed Setup: The products should work in a distributed fashion, i.e. the publisher, subscriber, and broker should all be on different machines.
Java API: A Java API should be designed and developed for the benchmarking tests.
2.2. Benchmarking Plan
To ensure that all stakeholders understood the exact process and expectations of the project, a benchmarking plan was created before the work started. Its purpose was to explain the process of benchmarking in detail. Some important areas the benchmarking plan covered were:
The environment that was planned to be used for testing
The methodology that was to be used
The software tools and libraries that would be used
The workload models that would be simulated
The benchmarking plan was circulated to and reviewed by everyone on the customer technical team, and it was used as the basis for all the activities of this benchmarking project. Over multiple rounds of review the benchmarking plan underwent numerous changes to ensure that we looked only at what the customer needed. This paper does not go into the details of the benchmarking plan, but below (ref. Table 1) is a summary of the workload models that were mutually agreed upon and considered important.
Table 1
2.3. Environment Setup
The benchmarking was carried out on physical high-end servers. The configuration of the servers and other details were part of the benchmarking plan:
Hardware
Three physical servers
Each server with 2 Intel Xeon processor chips. Each chip with 6 cores.
32GB of RAM on each server.
1G and 10G connectivity between these servers connected via a NetGear switch.
Each server with one 1G NIC and one 10G NIC.
Software
64-bit Java 1.6
64-bit CentOS
MySQL for storing counters
The following two topologies (ref. Figure 1) were used for the tests.
Figure 1
2.4. Benchmarking Suite
A custom benchmarking suite was used for this activity. It allowed us to execute multiple iterations of tests with different workload configurations, capture counters, and generate appropriate charts for the tests. The following diagram (ref. Figure 2) gives a quick design view of the benchmarking suite.
Figure 2
Other tools
Apart from the custom benchmarking suite, some open source utilities were also used:
Standard Linux utilities – pidstat, vmstat, top, etc. to capture CPU, memory, and disk activity
nicstat – a 3rd-party utility to monitor network usage on the NIC card
jstat – a standard JDK utility to capture Java heap usage
JFreeChart – used to plot graphs from the data collected by the tests, as part of the reporting module in the benchmark suite
MySQL – used to store the captured metrics; the reporting component generates reports based on the data stored in the MySQL db
Ant – for building the source code.
2.5. Tests
Since multiple tests needed to be done, we categorized the tests into high-level types to clearly understand the purpose of each test. The following categories were defined, and every test was marked under one of them:
Throughput Tests: Tests in this category captured the throughput of the messaging platform. They were executed in different combinations to observe how the throughput changes.
Latency Tests: Tests in this category captured the latency of the messaging platform. They determine how latency is affected by different parameters and load, and also determine the variance in latency (jitter).
Message Loss Tests: Tests in this category captured the message loss, if any, for a messaging platform. During execution these tests could be combined with the throughput tests.
Reliability Tests: Tests in this category discover whether the messaging platform degrades when it is up and running for a long duration. Such tests make the messaging platform send and receive messages for a long period of time (e.g. overnight) and identify any adverse impact on latency, throughput, or the overall functioning of the platform.
Table 2
2.6. Products to be benchmarked
This project was preceded by an earlier phase in which almost all available open source
messaging products were subjected to multiple levels of filter criteria. This phase, called the
Trade Study phase, selected 5 messaging products considered suitable for the
requirements of the customer. In the present project these 5 products were benchmarked. The
products are listed in Table 3.
Table 3
2.7. Reporting
It was decided that the following important quantitative parameters would be reported after
the benchmarking tests. Each parameter would be compared across all 5 products, and
the product with the best values for the majority of the parameters would be chosen.
Publisher Throughput – maximum number of messages sent per second
Subscriber Throughput – maximum number of messages received per second
Latency – the time taken by a message to travel from point A to point B
Jitter – variation in latency
Message Loss – loss of messages
Important Note: All the tests were to be done on both 1G and 10G networks. It was decided that for
comparison purposes the numbers observed on the 10G network would be used, since the
production network bandwidth was planned to be 10G.
3. Observations
3.1. Aggregate Publisher Throughput
This parameter gives the maximum number of messages that can be published by the publishers per second, both in isolation and as an aggregate group. These throughput numbers were captured on the 10G network with only a single subscriber listening on each topic. In the majority of cases the system was scaled using multiple publishers and multiple topics.
Figure 3
3.2. Isolated Publisher Throughput
The picture below shows the throughput achievable when a single publisher publishes messages as fast as possible, as a function of message size, without system failure. HornetQ was able to publish 111,566 messages of 600 bytes each per second. The throughput in msgs/sec is expected to decrease with message size; in a perfect system the decrease would be linear. As shown, this is mostly true but begins to break down for larger message sizes.
Figure 4
During the throughput tests we observed that HornetQ showed the best
throughput and was able to utilize the whole bandwidth of the network. All the other products
hit a plateau on the publisher processing side and could not use the network to the full extent.
3.3. Subscriber Throughput
This provides a view of the number of messages the subscribers were able to consume per
second as a group.
Figure 5
The above charts show the aggregate subscriber throughput. In this case one subscriber
listens on one topic, and we increase the number of subscribers and topics. This
shows the scalability of the platform from a consumer angle. HornetQ's subscriber throughput
is more than twice that of the other products. Comparing the publisher and subscriber
throughput graphs, we should have seen almost the same number of messages consumed by
the subscribers, but due to latency and other factors the subscribers always lag by some
amount. As we can see, however, the lag is minimal in the case of HornetQ.
3.4. Impact of multiple subscribers on throughput
Some tests were carried out to judge the impact of multiple subscribers listening on the same
topic. In the customer defined scenarios they did not expect their system to have anything
more than 5-10 subscribers listening on an individual topic. Hence these tests were carried
out for a limited number of subscribers.
Figure 6
The throughput drops whenever more subscribers join in to listen on a topic. The primary
reason for this drop has to do with the acknowledgements that the platform has to
manage for every message loop. In this case too, HornetQ shows the best possible
results for multiple subscriber scenarios.
3.5. Publisher Throughput v/s Subscriber Throughput Ratio
Our observations of the publisher and subscriber throughput for both 1G and 10G show how
well the platform allows the subscriber to “keep up” with the publisher. The chart below
shows the ratio of this comparison.
Figure 7
The best products will show a flat curve; the closer the ratio is to 1, the better the product. Again, HornetQ is clearly the best product, but surprisingly, Redis is the second best, with 80% of its messages arriving within the measurement period. The worst product is Redhat MRG, with only 40%-60% arriving within the measurement window.
3.6. Scalability Range
A significant number of tests were designed to find the upper limit of the platform. This gave good insight into the way the platform was designed and developed. However, the customer was also interested in one more non-traditional parameter, termed the Scalability Range. In this test each publisher publishes messages at a predefined rate (throttled publishers) of 1000 Hz, i.e. 1000 messages/sec, and with this configuration we had to determine the maximum number of publishers the platform can support. In the customer's production scenario the telescope instruments had an upper transmit rate of 1 kHz, but the number of instruments was not fixed, so this test was deemed important.
Figure 8
HornetQ and RedHat MRG showed the best scalability in this test; we were able to stretch the system to almost 350 publishers, each publishing at 1000 Hz, without any message loss (all messages were received by the subscriber).
3.7. Latency and Jitter
The time a message takes to travel from publisher to subscriber is an important measure of product performance. During the latency tests a clock synchronization problem was encountered, and our attempts to use NTP or PTP daemons did not yield the results we expected. Hence we used the approach of measuring the Round Trip Time (RTT) and halving it to arrive at the one-way latency. This method does introduce some uncertainty into the measurements, but it was considered the best approach at the time, since the customer was mainly interested in ensuring that the latency numbers hover around the microsecond range rather than the millisecond range. We reported latency as averages, percentiles, and standard deviations. However, it was considered best to compare latency in percentile terms rather than average values, since averages can be skewed by outliers.
Figure 9
The average latency numbers show that OpenSplice DDS and Redis have the lowest average latency, while HornetQ and Red Hat MRG have the highest. The percentile charts show a different view of the numbers. We have used percentile charts in the previous reports, and we believe they give a very good perspective on how latency varies across messages. Jitter can be accurately visualized by looking at the percentile charts or values. Average latencies can be misleading because even a few high values can skew the whole data set. A percentile shows what percentage of the total messages fall below a particular latency.
Percentiles thus show the distribution of latencies across the whole message group. In this report we have picked the 50th, 80th, and 90th percentile values for all products. HornetQ has the lowest latency at the 50th, 80th, and 90th percentiles, and this is the primary reason its subscriber throughput is almost equal to its publisher throughput. The other products have high latencies at these percentiles even when their average latencies are lower than HornetQ's, and hence they have lower subscriber throughput.
In a perfect world both the average and the percentiles would be minimal, and such a product could truly be classified as the best from a latency standpoint. In the real world we rarely see such cases; every platform makes some trade-off between throughput, processing speed, and latency. HornetQ provides good percentile numbers, and combined with its other parameters it remains the top product in the benchmarking tests.
3.8. Resource Utilization
During all the benchmarking tests we continuously captured how the platform utilized the system resources. We report the system utilization at peak throughput to give a glimpse of how well the system resources are used.
Figure 10
Redis is a single-threaded server and hence never utilizes more than one CPU on the system. HornetQ and RedHat MRG use the server's CPUs heavily, and that is how they are able to scale to very high throughput numbers.
Figure 11
4. Conclusion
Close to 10-12 man-months were spent on this benchmarking activity, and this paper attempts to provide a glimpse of what it looked like after all the work was done and the data was compared. The customer was provided with very detailed charts and reports after every product was benchmarked, and this helped us tune the overall process through early feedback. We ran thousands of iterations to ensure we had utilized the features as per the documentation before finally capturing the data. Not all the effort and work done can be documented in this short paper.
HornetQ came out to be the best performing product based on the customer requirements.
Redis showed real promise from a performance standpoint.
The RTI and OpenSplice results were extremely discouraging. We acknowledged to the customer that RTI and OpenSplice are functionality-rich products with thousands of tuning possibilities that could not be attempted in the time frame provided to us; we used the most commonly documented settings for testing.
RedHat MRG was the next best product after HornetQ in throughput terms.
While recommending HornetQ for the event service implementation, we also provided the customer with a detailed tabular comparison of each product (see below).
References
[RTI DDS] http://www.rti.com/products/dds/
[HornetQ] hornetq.jboss.org/
[Redhat MRG] https://www.redhat.com/promo/mrg
[Open Splice DDS] www.prismtech.com/opensplice
[Redis] www.redis.io
AUTOMATICALLY DETERMINING LOAD TEST DURATION USING CONFIDENCE INTERVALS
Rajesh Mansharamani, Freelance Consultant
Subhasri Duttagupta, Innovation Labs Perf. Engg., Tata Consultancy Services
Anuja Nehete, Performance Engg. Group, Persistent Systems
[email protected] [email protected] [email protected]
Load testing has become the de facto standard to evaluate performance of applications in the IT industry, thanks to the growing popularity of automated load testing tools. These tools report performance metrics such as average response time and throughput, which are sensitive to the test duration specified by the tester. Too short a duration can lead to inaccurate estimates of performance and too long a duration leads to reduced number of cycles of load testing. Currently, no scientific methodology is followed by load testers to specify run duration. In this paper, we present a simple methodology, using confidence intervals, such that a load test can automatically determine when to converge. We demonstrate the methodology using five lab applications and three real world applications.
1. Introduction
Performance testing (PT) has grown in popularity in the IT industry thanks to a number of commercial and free load testing tools available in the market. These tools let the load tester script application transactions to create virtual users, which mimic the behaviour of real users. At load test execution time, the tester can specify the number of virtual users, the think time (time spent at terminal), and the test duration.
Test duration is specified in these tools either as an absolute time interval or in terms of number of user iterations that need to be tested. In the absence of statistical knowledge, the common practice in the IT industry is to specify an ad hoc duration which may range from a few seconds to a few hours. The ad hoc duration is usually arrived at in consultation with one's test manager or blindly adopted from 'best practices' followed by the PT team.
Regardless of the duration specified, at the end of the load test the tester gets numeric estimates of performance metrics such as average response time and throughput. The numeric value is accepted as true because it has come from a well-known tool. Unfortunately, the sensitivity of the test output to the test duration is not considered in a regular load test. By a regular load test we mean a test used to determine the application response time under a given load, as opposed to the stability of the application under load for a long duration (such as testing for memory leaks).
If the test duration is too small the estimate of performance may be erroneous. If the test duration is too long it will lead to fewer cycles for load testing. This paper proposes a simple methodology based on confidence intervals for automatically determining load test duration while a load test is in progress.
Confidence intervals are widely used to determine convergence of discrete event simulations [PAWL1990] [ROBI2007] [EICK2007]. Using confidence intervals one can specify with what probability (say 99%) the estimate of the average response time lies in an interval around the true average. The wider the interval, the less confidence one has in the estimate. As the run duration increases, one expects the interval to tighten and converge to a specified limit (for example, 10% of the true mean). We have not come across any study that
specifies how to use confidence intervals to determine load test duration. There is a mention that confidence intervals should be used in load tests in [MANS2010] but no methodology has been given there.
The rest of this paper is organised as follows. Section 2 provides the state of the art in specifying load test durations. Section 3 provides an introduction to confidence intervals for the reader who is not well versed with this topic. Section 4 provides a simple methodology to determine run duration and its application to laboratory (lab) and real world applications. We show in Section 5 that this methodology also works for page level response time averages as opposed to overall average response time for an application. Section 6 extends this methodology to deal with outliers in response time data. Finally, Section 7 provides a summary of the work and ideas for future work.
2. State of the Art in Determining Load Test Duration
We have seen four types of methodologies used to determine load test duration in the IT industry; some are given in [MANS2010], and some discussion of the warm up phase (see Section 2.2) is provided in [WARMUP]. These are not formal methodologies in the published literature, but over the years the majority of IT performance testing teams have adopted them. We now elaborate on each methodology.
2.1 Ad hoc
The most popular methodology is to simply use a test duration without questioning why. More often than not the current load testing team simply uses what was adopted by the previous load testing team, and that becomes a standard within an IT organisation. We have commonly seen test durations ranging from 30 seconds to 20 minutes.
2.2 Ad hoc Duration for Steady State
In this methodology the transient state data is manually discarded. The initial part of any load test will have the system (under test) in a transient state, due to several reasons such as ramp up of user load and warm up of application and database server caches. Figure 1 shows the average response time and throughput as a function of time for a lab application that was load tested with 300 virtual users. As can be seen in the figure, if the test duration happens to end in the transient state, the estimates of average response time and throughput will be highly inaccurate compared to the converged estimates seen in the later part of the graphs.
Figure 1: Average Response Time & Throughput vs. Test Duration
Experienced load testers usually run a pilot test for a long duration and then visually examine the output to determine the duration of transient state. They then discard transient state data and use only the steady state data to compute performance metrics. The duration of time used in the steady state is ad hoc, and often ranges from 5 minutes to 20 minutes. While this methodology results in more accurate results than the one in Section 2.1, it is not clear how long to run a test in steady state. Moreover, the transient state duration will vary with change in application and in workload, requiring a pilot to be run for every change.
2.3 Ad hoc Transient Duration, Ad hoc Steady State Duration
As discussed above, it is laborious to run a pilot and visually determine the start of steady state for every type of load test. As a result, some performance testing teams adopt an ad hoc approach to both transient and steady state durations. We have seen instances wherein the PT team simply assumes that the first 20 minutes of run duration should be discarded and the next 20 minutes of data retained for analysis.
2.4 Long Duration
The last approach that we have seen in several organisations is to keep the regular load test duration in hours (to obtain an accurate estimate of performance, not to test for availability, which is a separate type of test). This way the effects of the transient state make no major contribution to the overall results, since the transient state is assumed to last only a few minutes. We have seen several instances of 2 to 3 hour test durations in multiple organisations. While there is no doubt about the accuracy of the output, this approach severely limits the number of performance test cycles.
3. Quick Introduction to Confidence Intervals
We have added this section to give a quick introduction to confidence intervals to the non-statistical load tester. To understand confidence intervals it helps to first understand the Central Limit Theorem.
The Central Limit Theorem [WALP2002] in statistics states that:
Given N independent random variables X1, …, XN, each with mean μ and standard deviation σ, the average of these variables X̄ = (X1 + X2 + … + XN)/N approaches a normal distribution with mean μ and standard deviation σ/sqrt(N).
Successive response time samples may not necessarily be independent and hence it is common to see the method of batch means widely employed in discrete event simulation [FISH2001]. Instead of using successive response time samples, we use batches of samples and take the average value per batch as the random variable of interest.
Thus, if we consider response time batch averages in steady state then we can assume that the average response time (across batch samples) will converge to a Normal distribution. For a Normal distribution with mean μ and standard deviation σ it is well known that 99% of the values lie in the interval (μ ± 2.576σ) [WALP2002]. Therefore, if we have n batch average samples in steady state during a load test then we can say with 99% confidence that our estimate of the average response time of n samples is within 2.576σ/sqrt(n) of the true mean, where σ is the standard deviation of response time. As the number of samples n increases, the interval gets tighter and we can specify a convergence criterion, as will be shown in Section 4.
An important point to note is that we do not know the true mean and standard deviation of response times to start with, and hence we need to use the estimated mean and standard deviation computed from the n samples of response time. To account for this correction, statisticians use a Student's t-distribution [WALP2002]. The critical value is a function of the number of samples n (more specifically, of the degrees of freedom n-1) and the level of confidence required (say 99% or 95%). Tables are widely available for this purpose, such as the one provided in [TDIST]. For a large number of samples (say n = 200), the confidence intervals estimated from a Student's t-distribution converge to those of a normal distribution [WALP2002].
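As a concrete numerical illustration (the figures here are ours, not taken from the paper's experiments), suppose we have n = 50 batch means with an estimated average of 100 ms and an estimated standard deviation of 20 ms. Using the critical value t99,49 = 2.68, the 99% confidence half-width is 2.68 × 20 / sqrt(50) ≈ 7.58 ms, i.e. about 7.6% of the mean:

```java
// Hypothetical worked example of the 99% confidence half-width;
// the numbers and names are illustrative, not from the paper.
class CiExample {
    // halfWidth = t * S / sqrt(n), the term added around the sample mean
    static double ciHalfWidth(int n, double stdDev, double tCritical) {
        return tCritical * stdDev / Math.sqrt(n);
    }
}
```

So the interval 100 ± 7.58 ms is comfortably inside a 15% convergence target such as the one used later in this paper.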
4. Proposed Methodology for Automatically Determining Load Test Duration
4.1 Proposed Algorithm We propose a simple methodology where we analyse response time samples in steady state until we are confident that the average response time converges. Upon convergence we stop the load test and output all the metrics required from the test. While there is no technical definition of when exactly steady state starts, we know that initially throughput will vary a lot and then gradually converge (see Figure 1). Let Xk denote the throughput at k minutes since start of the test (equal to total number of samples divided by k minutes). We
assume that steady state has started after k minutes if Xk is within 10% of Xk-1, where k > 1.
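This steady-state test can be sketched in a few lines; the class and method names below are ours, not part of any load testing tool:

```java
// Sketch of the steady-state check: throughput after minute k must lie
// within 10% of the previous minute's throughput (names are illustrative).
class SteadyStateDetector {
    private double prevThroughput = -1.0;  // Xk-1; negative means "no data yet"

    // Call once per elapsed minute with Xk = total samples so far / k minutes.
    boolean update(double throughput) {
        boolean steady = prevThroughput > 0
                && throughput <= 1.1 * prevThroughput
                && throughput >= 0.9 * prevThroughput;
        prevThroughput = throughput;
        return steady;
    }
}
```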
Once we are in steady state we start collecting samples until we reach our desired level of confidence. We
propose using a 99% confidence interval that is within 15% of the estimated average response time¹. In other
words if after n batch samples in steady state the estimated average response time is An, and the estimated standard deviation across batch samples is Sn then we assume that the average response time estimate converges if the following relationship holds true:
¹ There is nothing sacrosanct about 15%; it is just that we empirically found the convergence to be reasonably good with this interval size.
An + t99,n-1 · Sn/sqrt(n) ≤ 1.15 · An
where t99,n-1 is the critical value of the t-distribution for α = 0.01 (two-tailed) and n-1 degrees of freedom. For example, for n = 50, t99,n-1 = 2.68.
If an application's average response time takes a very long time to converge, we need to specify a maximum duration of test to account for this case. We also need to specify a minimum duration in steady state to account for (minor) perturbations due to daemon processes running in the background, activities such as garbage collection, or known events that occur at fixed intervals (such as period specific operations/queries).
Taking the above into account, we propose the following Algorithm 1 for automatically determining load test duration while a load test is in execution.
Table 1: Algorithm 1 to Determine Load Test Duration
1. Start test for Maximum Duration.

2. From the first sample onwards, compute performance metrics of interest as well as throughput (number of jobs completed / total time). Let Xk denote throughput after k minutes of the run, where k > 1.
   If (Xk ≤ 1.1 Xk-1) and (Xk ≥ 0.9 Xk-1) then
      Steady state is reached. Reset computation of all performance metrics.
   Else if Maximum Duration of test is reached
      Output all performance metrics computed.

3. From steady state, restart all computations of performance metrics. Assume a batch size of 100 and compute average response time per batch as one sample. Compute the running average and standard deviation across batches as follows:
   Set n = 0, Rbsum = 0, and Rbsumsq = 0 at start of steady state
   For completion of every 100 samples (batch size) after steady state do
      Let Rb = average response time of batch
      n = n + 1
      Rbsum = Rbsum + Rb
      Rbsumsq = Rbsumsq + Rb*Rb
      AvgRb = Rbsum / n
      StdRb = sqrt(Rbsumsq/n - AvgRb*AvgRb)
      If (t99,n-1 * StdRb/sqrt(n) ≤ 0.15 AvgRb) and (MinDuration is over in steady state) then
         Stop test and output performance metrics
      Else if Max Duration is reached then
         Output performance metrics
      End if
   End for
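The batch-means convergence check of step 3 can be sketched in Java as follows. This is our own illustrative class, not code from the paper or from any tool, and the t-table here is deliberately coarse:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Algorithm 1's convergence check over batch means.
class ConvergenceChecker {
    private static final int BATCH_SIZE = 100;
    private final List<Double> batch = new ArrayList<>();
    private int n = 0;            // number of completed batches
    private double rbSum = 0.0;   // running sum of batch means
    private double rbSumSq = 0.0; // running sum of squared batch means

    // Feed one response-time sample; returns true once the 99% CI
    // half-width falls within 15% of the running mean of batch means.
    boolean addSample(double responseTimeMs) {
        batch.add(responseTimeMs);
        if (batch.size() < BATCH_SIZE) return false;
        double rb = batch.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        batch.clear();
        n++;
        rbSum += rb;
        rbSumSq += rb * rb;
        if (n < 2) return false;  // need at least two batches for a spread
        double avgRb = rbSum / n;
        double stdRb = Math.sqrt(Math.max(0.0, rbSumSq / n - avgRb * avgRb));
        double halfWidth = tCritical99(n - 1) * stdRb / Math.sqrt(n);
        return halfWidth <= 0.15 * avgRb;
    }

    // Coarse two-tailed 99% t critical values; converges to the Normal 2.576.
    static double tCritical99(int df) {
        if (df <= 5) return 4.03;
        if (df <= 10) return 3.17;
        if (df <= 30) return 2.75;
        if (df <= 100) return 2.63;
        return 2.576;
    }
}
```

In a real harness this check would only be started once steady state is detected, and would be combined with the MinDuration and MaxDuration guards of Algorithm 1.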
We have assumed a batch size of 100. This was chosen empirically after asserting that the autocorrelation of
batch means [AUTOC] was less than 0.1 for the first few values of lag. Typically correlation drops with increase in lag.
Note that we compute the running variance by taking the difference between the average of the squared response times and the square of the average response time. This is an O(1) update, as opposed to the traditional two-pass method, which is O(n) at every evaluation.
We need to validate whether requiring 99% confidence intervals within 15% of the estimated average response time is indeed practical for convergence of load tests. And if load tests do converge, we need to assess the error percentage versus the true mean, where the true mean is taken to be the value we get if we let the test run for 'long enough'.
We also need to specify a value for MinDuration of test after steady state. Technically one might want to specify both a minimum number of samples as well as a minimum duration, whichever is higher. In reality, it is easier for the average load tester to simply specify duration in minutes, given that most of the load tests produce throughputs which are in tens of pages per second or higher thus yielding sufficient samples.
Section 4.2 validates Algorithm 1 on a set of five lab applications, and Section 4.3 does the same on three real life applications.
4.2 Validation against Lab Applications
Five lab applications were used for validating Algorithm 1. All five were web applications, which were load tested using an open source tool with 300 concurrent users with 2 second think time. All tests were run for a total duration of 22 minutes. We asked the team running the tests to send us response time logs in the format <elapsed time of run, page identifier, response time> where the log contains one entry for each application
web page that has completed.
The five lab applications were:
a. Dell DVD Store (DellDVD) [DELLDVD] which is an open source e-commerce benchmark application. 7 pages were tested in our load test.
b. JPetStore [JPET] which is an open source e-commerce J2EE benchmark. 11 pages were tested in our load test.
c. RUBiS [RUBIS] which is an auction site benchmark. 7 pages were tested in our load test.
d. eQuiz which is a proprietary online quizzing application. 40 pages were tested in our load test.
e. NextGenTelco (NxGT) which is a proprietary reporting application. 13 pages were tested in our load test.
We present in Table 2 the application of Algorithm 1 to determine convergence in the load tests for these five lab applications, all of which had a maximum test duration of 22 minutes. We graphically verified that in all cases the throughput and average response times had converged well before 22 minutes. In all cases we used a minimum duration of 5 minutes after steady state. We can see from Table 2 that all the applications reached steady state within 2 to 3 minutes, and after 5 minutes of steady state the 99% confidence intervals are well within 15% of the mean. When we compare the estimated average response time with the true mean (assumed to be the average response time in steady state at the end of 22 minutes) we see a very small deviation between the two, in most cases less than 1% and in one case just 3.4%.
If we did not specify a minimum duration of 5 minutes after steady state, and just waited for the first instant where the 99% confidence interval size was within 15% of the estimated average response time, we observed that 'convergence' happened in a matter of a few seconds for three of the applications and within 1 to 2 minutes for two others, as shown in Table 3. As seen from Table 3 the deviation from the true mean can go up to 20%, which may be acceptable only during the initial stages of load testing.
Table 2: Application of Algorithm 1 to Lab Applications (Min Duration=5 min)
Application | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
DellDVD | 3 min | 5 min | 8.1% | 23.86 ms | 23.79 ms | 0.3%
JPetStore | 2 min | 5 min | 1.9% | 33.80 ms | 34.07 ms | 0.8%
RUBiS | 2 min | 5 min | 5.5% | 16.75 ms | 16.20 ms | 3.4%
eQuiz | 2 min | 5 min | 2.9% | 62.48 ms | 63.02 ms | 0.9%
NxGT | 2 min | 5 min | 0.9% | 31.59 ms | 31.52 ms | 0.2%
Table 3: Application of Algorithm 1 to Lab Applications (Min Duration=0 min)
Application | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
DellDVD | 3 min | 52 sec | 14.9% | 24.54 ms | 23.79 ms | 3.1%
JPetStore | 2 min | 3 sec | 14.6% | 33.67 ms | 34.07 ms | 1.1%
RUBiS | 2 min | 14 sec | 14.6% | 18.49 ms | 16.20 ms | 14.1%
eQuiz | 2 min | 2 sec | 11.6% | 76.19 ms | 63.02 ms | 20.9%
NxGT | 2 min | 2 sec | 5.4% | 31.40 ms | 31.52 ms | 0.3%
4.3 Validating Algorithm 1 against Real World Applications
The following three real world IT applications were chosen for validation of Algorithm 1:
i. MORT: A mortgage and loan application implemented using web services and a web portal. 26 pages of MORT were load tested with an open source tool, for a total of 20 minutes with 80 concurrent users. MORT has a mix of pages some of which complete in a few milliseconds and some which take up to 30 seconds.
ii. VMS: A vendor management system that deals with invoice and purchase order processing. 11 pages were load tested using a commercial tool, for a total duration of 20 minutes, with 25 concurrent users and 5 second think times.
iii. HelpDesk: A service manager application for the help desk management lifecycle. 31 pages were load tested with an open source tool, for a total of 15 minutes with 150 concurrent users, and think times between 0 to 15 seconds.
We see in Table 4 that for all three real world applications Algorithm 1 converged to the average response time quite fast with less than 5% deviation from the true mean. (In fact for VMS and HelpDesk if we remove the requirement of 5 minutes steady state duration the convergence occurs in 1.5 minutes with less than 6% deviation.)
Table 4: Application of Algorithm 1 to Real World Apps (Min Duration=5 min)
Application | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
MORT | 2 min | 5 min | 11.1% | 908.64 ms | 867.28 ms | 4.8%
VMS | 2 min | 11.6 min | 14.9% | 579.69 ms | 579.36 ms | 0.1%
HelpDesk | 3 min | 5 min | 5.1% | 121.36 ms | 125.26 ms | 3.2%
4.4 Distribution of Average Response Time
We were curious to see if the distribution of average response time converged to a normal distribution. We used batches of samples to compute average response times in the logs provided, took their cumulative distribution function (CDF) [WALP2002], and compared it with that of the Normal distribution with the same overall mean and standard deviation as the response time log. We can see from Figure 2 that the distribution was indeed close to the Normal distribution for MORT and HelpDesk. In the case of VMS the error was a bit larger since there were fewer samples in the log file.
Figure 2: Distribution of Average Response Time
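A comparison in the style of Figure 2 can be reproduced with a few lines of code. The following sketch is our own (not the authors' code) and uses the Abramowitz-Stegun rational approximation to the error function to evaluate the Normal CDF:

```java
import java.util.Arrays;

// Sketch of a Figure 2 style comparison: empirical CDF of batch means
// versus a Normal CDF with matching mean and standard deviation.
class CdfComparison {
    // Empirical CDF: fraction of batch means <= r.
    static double empiricalCdf(double[] batchMeans, double r) {
        long count = Arrays.stream(batchMeans).filter(v -> v <= r).count();
        return (double) count / batchMeans.length;
    }

    // Normal CDF with the given mean and standard deviation.
    static double normalCdf(double r, double mean, double std) {
        double z = (r - mean) / (std * Math.sqrt(2.0));
        return 0.5 * (1.0 + erf(z));
    }

    // Abramowitz-Stegun rational approximation of erf (max error ~1.5e-7).
    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }
}
```

Plotting empiricalCdf against normalCdf over a range of r values gives curves comparable to those in Figure 2.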
5. Test Duration for Page Level Response Time Convergence
Section 4 showed how Algorithm 1 works towards convergence of overall average response time, across all pages of an application. We are now interested in knowing what happens if we want individual page level response times to converge. Note that we have fewer samples per page compared to total number of
samples. The result of applying Algorithm 1 to the 7 pages of DellDVD is shown in Table 5. We see that Algorithm 1 correctly predicts convergence of the test and the deviation is within 5% from the true mean per page. We found the same pattern for the other four lab applications.
Table 5: Algorithm 1 Applied to Pages of DellDVD
Page Number of DellDVD | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
Page 1 | 2 min | 11.2 min | 14.9% | 4.98 ms | 4.88 ms | 2.1%
Page 2 | 2 min | 5.0 min | 11.8% | 13.94 ms | 13.48 ms | 3.4%
Page 3 | 2 min | 10.5 min | 15.0% | 4.47 ms | 4.70 ms | 4.9%
Page 4 | 3 min | 5.0 min | 3.7% | 49.99 ms | 49.73 ms | 0.5%
Page 5 | 3 min | 5.0 min | 12.9% | 12.28 ms | 11.74 ms | 4.6%
Page 6 | 3 min | 5.0 min | 11.6% | 12.48 ms | 11.95 ms | 4.4%
Page 7 | 3 min | 5.0 min | 3.9% | 71.32 ms | 72.24 ms | 1.3%
In the case of the real world application MORT there were 26 pages in all, but the frequency of page access was too small for 21 of the pages and there were not enough samples for the confidence intervals to converge. For the 5 pages that had enough samples, we present the results of Algorithm 1 in Table 6. Likewise for HelpDesk there were 10 pages with enough samples, and all converged between 6 and 9 minutes of total run time with errors less than 5% of the true mean, as shown in Table 7.
Table 6: Algorithm 1 Applied to Pages of Real World Application MORT
Page Number of MORT | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
Page 1 | 2 min | 8.9 min | 14.6% | 32.74 sec | 32.67 sec | 0.2%
Page 2 | 2 min | 5.0 min | 7.3% | 47.51 ms | 45.51 ms | 4.4%
Page 3 | 3 min | 8.9 min | 14.9% | 33.40 sec | 33.46 sec | 0.2%
Page 4 | 3 min | 10.1 min | 14.4% | 34.44 sec | 34.31 sec | 0.1%
Page 6 | 2 min | 9.8 min | 13.7% | 35.59 sec | 35.65 sec | 0.1%
Table 7: Algorithm 1 Applied to Pages of Real World Application Helpdesk
Page Number of Helpdesk | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
Page 14 | 2 min | 6.3 min | 4.1% | 24.75 ms | 24.29 ms | 1.9%
Page 15 | 2 min | 6.3 min | 6.7% | 14.28 ms | 14.08 ms | 1.4%
Page 16 | 2 min | 7.4 min | 13.4% | 11.24 ms | 11.02 ms | 2.0%
Page 17 | 2 min | 5.2 min | 14.4% | 363.24 ms | 366.92 ms | 1.0%
Page 22 | 2 min | 5.9 min | 2.6% | 38.89 ms | 38.87 ms | 0.1%
Both in the case of MORT and in the case of HelpDesk there were 21 pages that did not converge for lack of samples and if we had to wait for all pages to converge then we would have reached the max duration without convergence. This calls for a modification to our Algorithm. We should allow pages to be tagged and check for convergence of only tagged pages. We assume that the load test team would have knowledge of the application workload and criticality to decide which pages need to be tagged for accurate estimation of performance metrics.
When we applied Algorithm 1 per page of VMS, the number of samples per page was too small, since our batch size was 100. (In fact we had just 5 batches per page.) So we reduced the batch size to 10 for the purpose of analysis. (This is not recommended in general, but our purpose is to draw attention to the handling of outliers through this example.) 8 of the pages converged with a deviation of less than 8% from the true mean, but 3 pages did not converge at all, even though there were enough samples. For these three pages, Page 0, Page 2, and Page 10, Table 8 shows the 99% confidence interval size at the end of the run. Confidence intervals did not converge for these three pages due to the presence of outliers, since outliers can drastically increase variance.
Table 8: Algorithm 1 Applied to Pages of VMS for batch size=10
Page Number of VMS | Time to Steady State | Time to Converge after Steady State | 99% CI Size: Percent of Estimated Mean | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | Percent Deviation in Avg Response Time
Page 0 | 2 min | NA | 29.2% | | |
Page 1 | 2 min | 5.9 min | 15.0% | 2465.31 ms | 2603.48 ms | 5.3%
Page 2 | 2 min | NA | 18.6% | | |
Page 3 | 2 min | 5.0 min | 14.4% | 402.31 ms | 418.48 ms | 3.9%
Page 4 | 2 min | 5.3 min | 14.9% | 369.14 ms | 391.47 ms | 5.7%
Page 5 | 2 min | 5.2 min | 11.7% | 364.29 ms | 383.78 ms | 5.1%
Page 6 | 2 min | 5.6 min | 13.9% | 379.98 ms | 381.86 ms | 0.4%
Page 7 | 2 min | 5.0 min | 9.2% | 736.03 ms | 793.71 ms | 7.2%
Page 8 | 2 min | 5.0 min | 10.8% | 372.81 ms | 346.28 ms | 7.7%
Page 9 | 2 min | 12.6 min | 15.0% | 456.22 ms | 438.85 ms | 3.9%
Page 10 | 4 min | NA | 101.9% | | |
6. Handling of Outliers in Real World Applications
A closer look at the scatter plot of response times for the three 'non convergent' pages of VMS revealed the presence of outliers, as shown in Figure 3.
So our next question was how to remove outliers. The easiest way is to maintain a running histogram of response time samples. But if our methodology is to be incorporated into any load test tool then it has to be very efficient. Therefore we adopted the heuristic that any response time sample more than 2 times the current average response time goes into an outlier bucket (assuming at least 10 samples before this rule can kick in). We do not discard it, because if the number of such samples increases drastically they need to be reclassified as 'inliers'. Note that while the figure shows actual response time samples, our algorithm applies to samples of batch means, which is why a factor of 2 is appropriate.
Figure 3: Outliers in Pages 0, 10 and 2 respectively, in VMS Response Times
We adapted Algorithm 1 to compute the running sum of response times and squared response times for both regular samples and outlier samples. If the number of outliers exceeds 10% of the total samples, we include them back into the regular samples at the time of determining convergence, by simply adding the sums of response times and squared response times and adding the sample counts. This is very efficient, with O(1) complexity. The only challenge is that if outliers happen to occur very early in the run after steady state, they are likely to be included and never discarded. For now we have not improved upon this algorithm, but we plan to do so in the near future.
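The outlier bucket can be sketched as follows. The class and field names are ours; only the 2x threshold, the 10-sample warm-up, and the 10% re-inclusion rule come from the paper:

```java
// Sketch of the outlier-bucket heuristic used alongside Algorithm 1.
class OutlierAwareStats {
    private long n = 0, outliers = 0;
    private double sum = 0.0, sumSq = 0.0;       // regular samples
    private double outSum = 0.0, outSumSq = 0.0; // outlier bucket

    void addSample(double rt) {
        double avg = n > 0 ? sum / n : 0.0;
        // Heuristic: after at least 10 samples, anything above twice the
        // current average goes into the outlier bucket instead of the stats.
        if (n >= 10 && rt > 2.0 * avg) {
            outliers++;
            outSum += rt;
            outSumSq += rt * rt;
        } else {
            n++;
            sum += rt;
            sumSq += rt * rt;
        }
    }

    // Mean used at convergence time: if outliers exceed 10% of all samples,
    // fold them back in -- an O(1) merge of the running sums.
    double mean() {
        if (outliers > 0.10 * (n + outliers)) {
            return (sum + outSum) / (n + outliers);
        }
        return n > 0 ? sum / n : 0.0;
    }
}
```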
When we applied this modified algorithm to the two VMS pages with outliers, average response time for Page 0 converged within 14.9 minutes after steady state with just 1.4% deviation from the final result, and that for Page 2 converged within 17.1 minutes after steady state with 0.2% error. But Page 10 did not converge despite implementing the algorithm for outliers. If we manually remove the outliers shown in Figure 3 and re-plot the data, we get the revised scatter plot in Figure 4.
We now see a new set of outliers. But there are so many of them that we can no longer call them outliers, and the algorithm rightly classified them as inliers. Because of the high variance in the response times, the confidence intervals for this page did not converge. After removing the original outliers, the 99% confidence interval at the end of the run had a spread of 24% around the mean. Had this been a tagged page, it would have required a much longer test duration for convergence than the 20 minutes used by the load testing team.
Figure 4: Scatter Plot for 'Outlier' Page in VMS
7. Summary of Algorithm, Applicability, and Future Work
We have presented an algorithm in this paper to automatically determine test duration during load testing. The algorithm has two parts. First, it checks whether steady state is reached in the kth minute of test execution by determining if the throughput at the kth minute is within 10% of the throughput at the (k-1)st minute, for k > 1. Second, it checks whether the 99% confidence interval of the average response time of batch means is within 15% of the estimated average (once runtime exceeds a specified minimum duration after steady state) or the maximum duration is reached.
We have shown that this algorithm works accurately with total average response times having less than 5% error from the true mean, for five lab applications and three real world applications. In the case of page level response times, we have proposed an enhancement to take care of outliers. Note that in the case of overall average response times (across all pages) we do not recommend outlier removal. This is because there may be infrequent pages that have response times much higher than other frequently accessed pages and these readings should not be misconstrued as outliers. We have also shown the need to tag pages when applying this algorithm at the page level so that we check for convergence only for pages that matter.
To speed up load tests, we can drop the minimum duration condition for the first few rounds of load tests, where we need quick results and a higher percentage of error is tolerable. As we have seen, convergence after steady state is then often a matter of seconds, with errors less than 20% in all applications tested. For the first rounds of load tests we also need not worry about page level convergence and can plan on just overall convergence.
While the algorithm has been presented around average response times, the question arises whether it can be applied for percentiles of response time, which are commonly reported in load tests. Note that we used average response times because of the applicability of the central limit theorem. We cannot do the same with percentiles of response times. In general, we should use the proposed algorithm for determining when to stop a test, and during the run time maintain statistics for all performance metrics of interest. Whenever the test stops we can output the estimates of the performance metrics of interest. Note that outliers should not be removed when computing percentiles.
One of the items for future work is the fine tuning of outlier handling when outliers occur at the start of steady state. We also need to assess the applicability of this algorithm when there is a variable number of users in the load test.
Acknowledgements
We would like to thank the anonymous referees whose suggestions have drastically improved the quality of this paper. We would like to thank Rajendra Pandya and Yogesh Athavale for providing performance test logs of VMS and Helpdesk applications, respectively. We would also like to thank Rupinder Virk for running performance tests of the lab applications.
References
[AUTOC] Autocorrelation calculator, http://easycalculation.com/statistics/autocorrelation.php
[DELLDVD] Dell DVD Store, http://linux.dell.com/dvdstore/
[EICK2007] M. Eickhoff, D. McNickle, K. Pawlikowski, "Detecting the duration of initial transient in steady state simulation of arbitrary performance measures", ValueTools (2007).
[FISH2001] G. Fishman, Discrete Event Simulation: Modelling, Programming, and Analysis, Springer (2001).
[JPET] iBatis JPetStore, http://sourceforge.net/projects/ibatisjpetstore/
[MANS2010] R. Mansharamani, A. Khanapurkar, B. Mathew, R. Subramanyan, "Performance Testing: Far from Steady State", IEEE COMPSAC, 341-346 (2010).
[PAWL1990] K. Pawlikowski, "Steady-state simulation of queuing processes: a survey of problems and solutions", ACM Computing Surveys, 22:123-170 (1990).
[ROBI2007] S. Robinson, "A statistical process control approach to selecting a warm-up period for a discrete-event simulation", European Journal of Operational Research, 176(1):332-346 (2007).
[RUBIS] Rice University Bidding System, http://rubis.ow2.org/
[TDIST] t-distribution critical value table, http://easycalculation.com/statistics/t-distribution-critical-value-table.php
[WALP2002] R. Walpole, Probability & Statistics for Engineers & Scientists, 7th Edition, Pearson (2002).
[WARMUP] http://rwwescott.wordpress.com/2014/07/29/when-does-the-warmup-end/
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Measuring Wait and Service Times in Java using Bytecode Instrumentation
Amol Khanapurkar, Chetan Phalak Tata Consultancy Services, Mumbai.
{amol.khanapurkar, chetan1.phalak}@tcs.com
Performance measurement is key to many performance engineering activities. Today's programs are invariably concurrent programs that try to optimize usage of resources such as multiple cores and power. Concurrent programs are typically implemented using some sort of queuing mechanism. Two key metrics in a queuing architecture are Wait Time and Service Time. The precision of these two metrics determines how accurately and reliably IT systems can be modeled. Queues are amply studied and a rich literature is available; however, there is a paucity of tools that provide a breakup of Response Time into its Wait Time and Service Time components. In this paper, we demonstrate a technique for measuring the actual time spent servicing a synchronized block as well as the time spent waiting to enter the synchronized block. A critical-section is implemented in Java using synchronized blocks.
1. INTRODUCTION
The Java programming language is one of the most widely adopted programming languages in the world today. It is present in all kinds of applications: large enterprises, small and medium businesses, as well as mobile apps. One of the features that has made Java so popular is the built-in support for multi-threading to write concurrent and parallel programs. The vast majority of today's enterprise applications written in Java are concurrent programs. By concurrent programs, we mean programs that exchange information through primitives provided by the native programming language. In Java, such a primitive is provided by the keyword 'synchronized'. The Java infrastructure for providing concurrency support revolves around this keyword.
Concurrent programs in Java are written using the JDK APIs that support multi-threading. When two or more threads in Java try to enter a critical-section, Java enforces queuing so that only one thread can get a lock on the critical-section. Upon completing its work, the thread relinquishes the lock and leaves the critical-section. The remaining threads competing for the lock wait to acquire it. The Java runtime, through its synchronization primitives, manages the assignment of the lock to the next eligible thread. Java allows the queuing policy to be fair or random. Fair assignment is performance intensive and is rarely used in real life applications. Random assignment of the lock has no performance overhead and is hence preferred in most applications. Queuing is a vastly studied topic, and queuing theory [QUAN1984] provides the base for analytical modeling. Hence it is highly desirable to be able to apply queuing theory fundamentals to actual code. Amongst other things, queuing theory requires Service Time and Arrival Rate as input parameters to predict Wait Times and Response Times for jobs to complete. In practice, though, it is easy to measure response time, but exact service times, and hence wait times, remain elusive. There simply aren't enough tools available that provide queue depth or a breakup of response time into its service and wait time components.
In this paper we try to address that void. We present techniques that improve these measurements and can provide inputs for performance modeling. The problem statement we address is to break response time into its service and wait time components without support for this in the Java API itself. More specifically, we provide a technique to capture service and wait times for concurrent threads that access a shared resource. We express the problem statement in the form of code; consider the following program.
Fig. 1. Sample Concurrent Java Program
public class Test {
    static Object _lockA1;
    static int sharedVal, NUMTHREADS = 10, SLEEPDURATION = 500;
    WorkerThread wt[] = new WorkerThread[NUMTHREADS];
    int max = SLEEPDURATION + SLEEPDURATION / 2, min = SLEEPDURATION / 2;

    public Test() {
        _lockA1 = new Object();
    }

    public void doTest() throws InterruptedException {
        for (int i = 0; i < NUMTHREADS; i++) {
            wt[i] = new WorkerThread();
            wt[i].start();
        }
        for (int j = 0; j < NUMTHREADS; j++) {
            try {
                wt[j].join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    class WorkerThread extends Thread {
        public void run() {
            for (int iter = 0; iter < 1; iter++) {
                inc();
            }
        }

        public void inc() {
            synchronized (_lockA1) {
                try {
                    sharedVal++;
                    sleep(new java.util.Random().nextInt(max - min + 1) + min);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
The function inc() is a critical section that controls access to the variable sharedVal. Different threads access the function concurrently and try to modify the value of sharedVal. Since sharedVal is incremented inside the critical section, access to it is serialized: while one thread is executing the critical section, the other threads must wait their turn to enter it. Code of this kind is present in millions of lines of Java across industry verticals; typical examples include updating an account balance or booking a ticket. The longer a thread has to wait on the critical section, the longer its response time will be, with the lower bound established by the service time. The rest of the paper focuses on how to break response time into its service and wait components.

2. JAVA CONCURRENCY INFRASTRUCTURE

Java provides the following infrastructure for writing multi-threaded, concurrent programs.
1) Synchronized Blocks
2) Synchronized Objects
3) Synchronized Methods
Java synchronized blocks are the most prevalent and preferred form of implementing concurrency control; because the critical section is localized, multi-threaded programs using synchronized blocks are comparatively easy to debug. Method 2) implements concurrency control by making the object itself thread-safe: if two threads try to access the same object simultaneously, one thread gets access while the other blocks. Method 3) generates an implicit monitor; how a synchronized method is treated is largely compiler dependent and relies on the ACC_SYNCHRONIZED flag. In this paper we focus only on synchronized blocks. Obtaining wait and service times for synchronized objects and methods requires a different state machine than the one required for synchronized blocks, so methods 2) and 3) are out of scope.

Before we get into the specifics of the state machine and Java bytecode, we present an alternate method of obtaining the same information: logging all accesses before entry into, during, and after exit from the critical section. The logged records need to be of the form <tid, time, locationID> where
tid - Thread Identifier
time - Timestamp
locationID - a combination of class, method and exact location (say line number or variable on which synchronization happened).
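Before turning to the logging alternative, the synchronization forms enumerated in Section 2 can be made concrete with a minimal sketch (class and field names are illustrative; note that the two forms shown lock different monitors):

```java
public class SyncForms {
    private final Object lock = new Object();
    private int counter = 0;

    // Form 1) Synchronized block: the critical section is localized to a few
    // lines and is compiled to explicit monitorenter/monitorexit opcodes.
    public void incWithBlock() {
        synchronized (lock) {
            counter++;
        }
    }

    // Form 3) Synchronized method: the whole body is the critical section.
    // No monitorenter/monitorexit is emitted; the method is flagged with
    // ACC_SYNCHRONIZED and implicitly locks 'this'.
    public synchronized void incWithMethod() {
        counter++;
    }

    // Form 2) corresponds to using an object that is itself thread-safe,
    // e.g. java.util.Vector, whose methods synchronize internally.

    public int get() {
        synchronized (lock) {
            return counter;
        }
    }
}
```

Only form 1), with its explicit opcode pair, is handled by the state machine described later in the paper.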
Fig. 2. Alternate Method: Logging
public void inc() {
    long t1 = System.currentTimeMillis();          // T1: arrival at synchronized block
    synchronized (_lockA1) {
        long t2 = System.currentTimeMillis();      // T2: entry into synchronized block
        try {
            sharedVal++;
            sleep(new java.util.Random().nextInt(max - min + 1) + min);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        long t3 = System.currentTimeMillis();      // T3: about to exit synchronized block
    }
    long t4 = System.currentTimeMillis();          // T4: exited synchronized block
}
For each such access, four timestamps need to be captured:

T1 – time the thread arrived at the synchronized block.
T2 – time the thread entered the synchronized block.
T3 – time the thread is about to exit the synchronized block.
T4 – time the thread exited the synchronized block.

In this case,

(T4 - T1) is the Response Time,
(T3 - T2) is the Service Time, and
(T2 - T1) is the Wait Time.

This method has the following disadvantages:
1) Logging has its own overheads.
2) After the logs have been written, they must be crunched programmatically to extract the desired information.
3) Even for simple programs, the crunching program can get complex because it has to tag the appropriate timestamps to the appropriate threads.
4) For complex programs involving nested synchronized blocks (e.g. code that implements 2-phase commits), the crunching program can quickly become more complex and may require significant development and testing time.
5) This method fails when source code is not available.

To overcome these disadvantages, we chose to implement bytecode instrumentation to capture the information we require.

3. JAVA BYTECODE

Wikipedia [ BYTECODE ] defines Java bytecode as the instruction set of an abstract machine that is ultimately executed by the Java virtual machine. Java bytecode is generated by language compilers targeting the Java platform, most notably the Java programming language. Synchronized blocks are supported in the language through the bytecode instructions monitorenter and monitorexit: monitorenter grabs the lock on the synchronized() section and monitorexit releases it.

4. CENTRAL IDEA AND IMPLEMENTATION OF STATE MACHINE

4.1 Central Idea

Our objective in building a state machine is to get the following details about a synchronized block:
- Location of the block, i.e. which class and which method contains the synchronized() section
- Name of the variable on which the block is synchronized
- Breakup of the synchronized() block's response time into wait and service time components

The ingredients for implementing a critical section using synchronized blocks are:

1) the synchronized() construct, and
2) the variable on which synchronization happens, either static or non-static.
The monitorenter and monitorexit opcodes provide events for entering and exiting the critical section. To get the variable name, we track the opcodes getstatic and getfield, for static and non-static variables respectively. Ideally, tracking these four opcodes should suffice; this is the central idea behind the state machine. However, we decided to also track the astore opcode to make the state machine more robust, for two reasons:

1) The javac compiler typically generates a handful of opcodes between the get* and monitorenter opcodes, so the exact sequence is not known.
2) Based on our empirical study, the astore instruction always precedes the monitorenter opcode.

Consider the output of the javap [ JAVAP ] utility for the inc() function to get a better understanding of these reasons.
Fig. 3. Javap Output

public void inc();
  Code:
     0: getstatic     #4   // Field Test._lockA1:Ljava/lang/Object;
     3: dup
     4: astore_1
     5: monitorenter
     6: getstatic     #5   // Field Test.sharedVal:I
     9: iconst_1
    10: iadd
    11: putstatic     #5   // Field Test.sharedVal:I
    14: getstatic     #6   // Field Test.SLEEPDURATION:I
    17: getstatic     #6   // Field Test.SLEEPDURATION:I
    20: iconst_2
    21: idiv
    22: iadd
    23: istore_2
    24: getstatic     #6   // Field Test.SLEEPDURATION:I
    27: iconst_2
    28: idiv
    29: istore_3
    30: new           #7   // class java/util/Random
    33: dup
    34: invokespecial #8   // Method java/util/Random."<init>":()V
    37: iload_2
    38: iload_3
    39: isub
    40: iconst_1
    41: iadd
    42: invokevirtual #9   // Method java/util/Random.nextInt:(I)I
    45: iload_3
    46: iadd
    47: i2l
    48: invokestatic  #10  // Method sleep:(J)V
    51: goto          59
    54: astore_2
    55: aload_2
    56: invokevirtual #12  // Method java/lang/InterruptedException.printStackTrace:()V
    59: aload_1
    60: monitorexit
    61: goto          71
    64: astore        4
    66: aload_1
    67: monitorexit
    68: aload         4
    70: athrow
    71: return
  Exception table:
     from    to  target type
         6    51      54   Class java/lang/InterruptedException
         6    61      64   any
        64    68      64   any
Notice the presence of the dup and astore instructions (ignore everything starting from the '_' character) between the getstatic and monitorenter opcodes. For various test programs written in different ways and compiled with and without the -O option, we found that this set of intervening instructions was not always the same. Had these instructions always been the same, we would be in a position to guarantee the sequence of events leading up to entering the critical section. However, we found that the astore opcode always precedes the monitorenter event, so we made it part of the pipeline of instructions we track in order to detect a thread that is about to enter a critical section. The monitorexit opcode is straightforward: upon encountering it, we simply flush the data structures that track the pipeline.

In our study of the Java literature, we have not come across strong guarantees regarding the sequence of bytecode generation. Our implementation is therefore empirical, based on our understanding of how the Java concurrency infrastructure works. Since the implementation is based on empirical data, we carried out exhaustive testing, described later in the paper; we did not find any test case for which our state machine breaks.

4.2 Implementation

We used the ASM [ ASM ] bytecode manipulation library to do the instrumentation. ASM is based on the Visitor pattern: the library generates events which are captured and processed by our own Java code. For the ASM library to generate events, we needed to register hooks for the events of interest to our state machine. Registering hooks can be done statically at compile time, or at runtime (i.e. at class load time) using the Instrumentation API available since JDK 1.5. Since we anticipated this utility to be small (< 5K LOC), we preferred the static approach, in which instrumentation is done manually.
Conversion to runtime instrumentation is trivial, being just a matter of using the right Java APIs. The ASM API provides the following hooks for the events of interest:
visitInsn() :- For monitorenter and monitorexit
visitVarInsn() :- For astore
visitFieldInsn() :- For getstatic and getfield
visitMethodInsn() :- For class and method names
During instrumentation, ASM parses the Java bytecode of classes and generates an event for each construct that has a registered hook. Once an event is generated, it is the responsibility of the calling code to consume it. For performance reasons we use the streaming API of ASM, in which an event is lost if it is not consumed: when an event is generated, control returns to our calling code, which consumes the event and takes the appropriate action based on its type {visitInsn, visitVarInsn, visitFieldInsn, visitMethodInsn}. The event handling is encapsulated in two Visitors that work in lock-step so as to distinguish between the moment a thread arrives at a monitor and the moment it enters the monitor. These Visitors are named ResponseTimeMethodVisitor and ServiceTimeMethodVisitor.

The algorithm that ResponseTimeMethodVisitor implements is as follows:
a) Maintain a list of opcodes in the order in which they are encountered.
b) We expect getfield / getstatic as the first element in the list; maintain it at the head, keeping track of the latest getfield / getstatic and overwriting previous occurrences, if any.
c) Ignore all other events (e.g. dup) until an astore is received. Add astore as the second element of the list.
d) Ignore all other events until another astore or getfield / getstatic is received.
e) If an astore is received, overwrite it at the second position in the list.
f) If a getfield / getstatic is received, empty the list and add the getfield / getstatic at the head.
g) Continue until the first two elements are getfield / getstatic and astore, respectively, and the third element is monitorenter.
h) Once the list comprises
   1. getfield / getstatic
   2. astore
   3. monitorenter
   in that order, set the flag for the current thread to true.
i) Once the flag is true, update the book-keeping data structures: in one of them, record the arrival time (T1) of the current thread against the synchronized block on the variable identified by getfield / getstatic.
j) Upon encountering a monitorexit event, update the book-keeping data structures with the time (T4) at which the thread exited the synchronized block.
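The opcode pipeline of steps a)–h) can be sketched as a small stand-alone state machine. Here opcodes arrive as strings for readability; in the actual implementation they arrive as ASM visitor callbacks, and the class name is our own invention for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the getfield/getstatic -> astore -> monitorenter pipeline.
public class MonitorEnterDetector {
    private final List<String> pipeline = new ArrayList<>();
    private boolean entered = false;         // the "flag" set in step h)

    // Feed one opcode event; returns true while the thread is inside a monitor.
    public boolean onOpcode(String opcode) {
        switch (opcode) {
            case "getfield":
            case "getstatic":
                pipeline.clear();            // step f): restart the pipeline
                pipeline.add(opcode);        // step b): keep latest get* at head
                break;
            case "astore":
                if (pipeline.size() == 1) {
                    pipeline.add(opcode);    // step c): second element
                } else if (pipeline.size() > 1) {
                    pipeline.set(1, opcode); // step e): overwrite second element
                }
                break;
            case "monitorenter":
                if (pipeline.size() == 2) {  // step g): get*, astore in place
                    entered = true;          // step h): set the per-thread flag
                }
                pipeline.clear();
                break;
            case "monitorexit":
                entered = false;             // step j): flush the book-keeping
                pipeline.clear();
                break;
            default:
                // steps c)/d): all other opcodes (dup, iload, ...) are ignored
                break;
        }
        return entered;
    }
}
```

Timestamping (T1, T4) and per-thread book-keeping are omitted; the sketch shows only the opcode-sequence recognition.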
ServiceTimeMethodVisitor simply piggybacks on the work done by ResponseTimeMethodVisitor. It does only the following:

a) It looks for the flag that ResponseTimeMethodVisitor sets to true for the current thread. Once it finds the flag set, it updates the book-keeping data structure with the synchronized-block entry time (T2) for the current thread. Updating the data structure for a thread whose arrival time has not already been set by ResponseTimeMethodVisitor is an illegal state.
b) After updating the data structure, it resets the flag to false so that nested synchronized blocks can be processed.
c) Since the book-keeping data structures are shared by both Visitors, it simply ignores the monitorexit event and treats the time set by ResponseTimeMethodVisitor as the timestamp (T3) at which servicing completed.
Thus, in our state machine implementation:

- T3 and T4 are the same,
- (T3 – T2) gives the Service Time, and
- (T4 – T1) gives the Response Time.

The above description covers the simplest case. For nested synchronized blocks the state machine gets a little more complex; technically, the same algorithm is followed, since the methods in our Java code that maintain the state machine are reentrant. Only the book-keeping, and hence the printing of results, gets a little trickier to handle.

4.3 Output of State Machine

For the code snippet depicted in Fig. 1, assume an appropriate main() is called. Our state machine then outputs results in the following format:
ThreadName ArrivalTime EnterTime ExitTime serviceTime waitTime LockName LockLocation
Thread-0 1408309576179 1408309576179 1408309576196 17 0 Test._lockA1 Test$WorkerThread.inc
Thread-1 1408309576205 1408309576205 1408309576221 16 0 Test._lockA1 Test$WorkerThread.inc
Thread-2 1408309576221 1408309576221 1408309576238 17 0 Test._lockA1 Test$WorkerThread.inc
Thread-3 1408309576238 1408309576239 1408309576264 25 1 Test._lockA1 Test$WorkerThread.inc
Thread-6 1408309576243 1408309576334 1408309576352 18 91 Test._lockA1 Test$WorkerThread.inc
Thread-4 1408309576243 1408309576352 1408309576369 17 109 Test._lockA1 Test$WorkerThread.inc
Thread-8 1408309576244 1408309576319 1408309576334 15 75 Test._lockA1 Test$WorkerThread.inc
Thread-5 1408309576245 1408309576302 1408309576319 17 57 Test._lockA1 Test$WorkerThread.inc
Thread-9 1408309576245 1408309576264 1408309576281 17 19 Test._lockA1 Test$WorkerThread.inc
Thread-7 1408309576245 1408309576281 1408309576301 20 36 Test._lockA1 Test$WorkerThread.inc
Fig. 4 Output from State-machine
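The serviceTime and waitTime columns in Fig. 4 are simple differences of the timestamp columns. Checking the Thread-6 row by hand:

```java
public class OutputCheck {
    public static void main(String[] args) {
        // Timestamps from the Thread-6 row of Fig. 4
        long arrival = 1408309576243L;   // T1: ArrivalTime
        long enter   = 1408309576334L;   // T2: EnterTime
        long exit    = 1408309576352L;   // T4: ExitTime (== T3 here)

        long wait     = enter - arrival; // T2 - T1 = 91 ms
        long service  = exit - enter;    // T3 - T2 = 18 ms
        long response = exit - arrival;  // T4 - T1 = 109 ms

        System.out.println(wait + " " + service + " " + response);
    }
}
```

The computed wait (91 ms) and service (18 ms) match the waitTime and serviceTime columns of the row.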
5. TESTING STATE MACHINE IMPLEMENTATION

The testing was divided into two types:

1) Theory-based programs which demonstrate computer science principles or concepts:
1) Theory-based programs which demonstrate computer science principles or concepts
Producer-Consumer problem [ PRODCONS ]
Dining Philosopher problem [ DINIPHIL ]
Cigarette Smokers problem [ CIGARETTE ]
M/M/1 Queues [ MM1 ]

The producer–consumer, dining philosophers and cigarette smokers problems are classic examples of multi-process synchronization problems. We implemented each of them in Java, designing every solution around a common, fixed-size buffer used as a queue and shared among all participants, with synchronized blocks granting each participant access to the shared queue. We then ran our state machine implementation against these problems and observed 100% accurate results for the service time and lock time of every running thread.

2) Custom programs comparable to the code written in the IT industry today:

Database connection pooling code
Update of account balances for a money-transfer transaction (2-phase commit)

These programs check the qualitative and quantitative correctness of the code that passes through the state machine. All programs other than the M/M/1 queues verify the functional correctness of the state machine, while M/M/1 (actually a set of programs) validates its quantitative correctness.

6. APPLICATION OF STATE MACHINE TECHNIQUE

Once response time is accurately broken down into its wait and service time components, performance modeling becomes accurate. Other researchers have built utilities on top of our API to predict performance in the presence of software and hardware resource bottlenecks [ SUBH 2014 ].
Once tool-ified, this technique can help detect performance bottlenecks, and during the development phases of the Software Development Life Cycle (SDLC) such a tool can provide immense value in troubleshooting performance issues. We have not yet quantified the performance overhead of our technique, but we believe it to be very low; the basis for this assumption is our past work in developing Jensor [ AMOL 2011 ], a Java profiler built using bytecode instrumentation techniques.

7. CONCLUSION

It is possible to derive the wait and service time components of the response times of concurrent Java programs using bytecode instrumentation techniques. Our state-machine-based approach captures these metrics, which can then be used for other performance engineering activities such as performance modeling, capacity planning and performance testing.
REFERENCES
[ AMOL 2011 ] Amol Khanapurkar, Suresh Malan, "Performance Engineering of a Java Profiler", NCISE, Feb 2011.
[ ASM ] http://asm.ow2.org/
[ BYTECODE ]http://en.wikipedia.org/wiki/Java_bytecode
[ CIGARETTE ] http://en.wikipedia.org/wiki/Cigarette_smokers_problem
[ DINIPHIL ] http://en.wikipedia.org/wiki/Dining_philosophers_problem
[ JAVAP ] http://docs.oracle.com/javase/7/docs/technotes/tools/windows/javap.html
[ MM1 ] http://en.wikipedia.org/wiki/M/M/1_queue
[ PRODCONS ] http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
[ QUAN 1984 ] Lazowska et al., Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984.
[ SUBH 2014 ] Subhasri Duttagupta, Rupinder Virk and Manoj Nambiar, "Predicting Performance in the Presence of Software and Hardware Resource Bottlenecks", SPECTS, 2014.
CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE
ANALYSIS USING RETAIL APPLICATION TEST DATA)
Abhijeet Padwal
Performance engineering group
Persistent Systems, Pune
email: [email protected]
Due to its lower cost and greater flexibility, the cloud has become the preferred deployment option for applications and products of any size. Through its Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offerings, the cloud has attracted and benefitted application testing services, especially load and performance testing. Yet although the cloud provides superior flexibility and scalability at lower cost than traditional on-premises deployments, it has its own limitations and challenges, and if these are not evaluated carefully they can severely impact projects and their budgets. It is therefore recommended to take a holistic view, weighing the pros and cons of the cloud in detail, before deciding to use it for any purpose.

This paper describes the cloud in brief and then presents a detailed case study of load testing a retail application in the cloud: how the cloud's pros and cons worked for and against us during the load test, and what actions were needed to overcome the problems.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
1. Introduction
In recent years, revolutionary technology innovations have changed the world we live in and the way we interact and do business. These innovations have produced a technology transformation that is happening at rapid speed, and the transformation has resulted in better and faster service to businesses and end users. One of the most talked-about developments, one which has become a reality and established a new type of service delivery arena, is cloud computing. The services offered by the cloud help businesses move into an arena of reduced cost, high availability, and faster, more reliable, higher-margin services and products, which is why businesses are aggressively adopting cloud-based services. Increasingly, businesses are moving the traditional on-premises deployments of their applications and products to scalable cloud environments, which offer low cost and high availability at low maintenance.

Along with production deployments, the cloud has also benefitted application testing, especially load and performance testing, through its Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offerings. The cloud has proven useful for hosting load testing environments because of its ability to provision high-end servers, applications and large numbers of load injectors with great flexibility at low cost. However, like any other service, the cloud has its own limitations and challenges compared with conventional on-premises deployments. For example, the cloud does not provide access to the low-level hardware configuration parameters that are important during activities such as tuning; consequently, tuning and optimization activities cannot be performed effectively on the cloud. These limitations can be categorized according to the use case and the type of cloud service involved. Anyone who wants to use the cloud for load and performance testing at its best must take a holistic view, considering the pros and cons of the cloud environment, and define an effective strategy for using it.
2. Cloud Computing
Gartner's definition of cloud computing:

"A style of computing in which scalable and elastic IT-enabled capabilities are delivered as a service using internet technologies." [Gartner 2014]

This definition describes cloud computing in very simple words: a style of computing which is

o Scalable and elastic – resources can be provisioned dynamically, on demand
o Accessible over the internet – available to end users over the internet on a wide range of devices: PCs, laptops, mobiles, etc.
o Service-oriented – a service which is a value-add to the end user, for whom it is a black box
2.1 Types of Cloud Services

Based on these characteristics, cloud services are classified into three main categories.

Infrastructure as a Service (IaaS)

This is the most basic cloud-service model: physical or virtual machines and other resources are offered by the provider, and cloud users install operating-system images and their application software on the cloud infrastructure.

Platform as a Service (PaaS)

A computing platform, typically including an operating system, programming language execution environment, database, and web server. Application developers and testers can develop, run and test their software solutions on a cloud platform without the cost and complexity of buying and managing the underlying hardware and software layers.

Software as a Service (SaaS)
In the SaaS model, cloud providers install and operate application software in the cloud and cloud users
access the software from cloud clients. Cloud users do not manage the cloud infrastructure and platform
where the application runs. This eliminates the need to install and run the application on the cloud user's own
computers, which simplifies maintenance and support.
2.2 Cloud Service Providers

Amazon, Google, Microsoft Azure, OpenStack and many other vendors provide different kinds of service offerings in the cloud arena.
2.3 Market Current Status and Outlook

Because of the inherent characteristics of the cloud that benefit business, and the attractive pricing models offered by service providers, cloud-based services are in enormous demand. Recent surveys by well-known agencies show that demand for cloud-based services keeps getting stronger.

Gartner – Global spending on public cloud services is expected to grow 18.6% in 2012 to $110.3B, achieving a CAGR of 17.7% from 2011 through 2016. The total market is expected to grow from $76.9B in 2010 to $210B in 2016. The following is an analysis of the public cloud services market size and annual growth rates. [Cloud Market2013]

Picture 1 – Annual growth for cloud market
3. Case Study
3.1 About the Customer

The customer is a leading software company delivering retail solutions to market leaders across the globe. These solutions include POS, CRM, SCM and ERP.

3.2 About the Application

The application is an enterprise-class retail solution that manages the front-end and back-end operations within a retail store and controls the stores from the head office through a single application.
Figure1 – Application architecture
The App Server (AS) is the core application located at the head office; it is responsible for managing all the stores and for real-time processing and analysis of the data generated by the stores. The AS is also responsible for transferring software updates to the stores through its 'Update' functionality.

Operations is the core application at every store. It is responsible for store management, maintaining the store-level master and transactional data, and exchanging that data between the billing counters and the AS server. Operations takes care of store operations from maintaining stock inventory, pricing, promotions and store-level reports to online data transfer to the AS server through the 'Replication client' component, as well as receiving patches from the EAS server and transferring them to the counters.

The billing counter handles item information and billing. All the billing data generated by a counter is stored in the store DB, which is finally replicated to the AS server by the 'Replication client' component of Operations.

All the applications were developed in ASP.NET, and the database was SQL Server.
3.3 Performance Testing Requirement

This retail application had been deployed at various customers and was working fine; however, until recently the maximum number of stores at any customer was 200. Recently the customer received a requirement in which the retail solution would be deployed across 3,000 stores. The customer had never deployed at such a scale and therefore did not know whether the application would sustain 3,000 stores and, if not, what would need to be tuned and what kind of hardware would be required. As a first step, the customer decided to put the application under a load equivalent to 3,000 stores for various business workflows and observe how it behaved. For the load testing activity the customer came up with five real-life business scenarios that are used most frequently and generate a high volume of transactions.
The customer identified the following five scenarios across the AS, Operations and Billing counter:
Scenario 1 – Replication: replication of billing data from store to AS for 3,000 stores.

Scenario 2 – Billing counter: multiple users (minimum 25 parallel counters) performing billing transactions, which include Bill, Sales Return, Bill Cancellation and Lost Sales (in order of execution priority), with no more than 200 line items and no fewer than 20, paid by cash or credit card.

Scenario 3 – AS: access reports while data from stores (minimum 20+ stores) is being updated to the AS.

Scenario 4 – Operations: access stock management functions with 1,000+ line items, with 5–10 users.

Scenario 5 – Updates: download of a patch for more than 100 stores simultaneously, with various patch sizes to be tested, namely 50 MB, 80 MB and 100 MB.
4. Approach
Scenario 1, Replication, had the highest priority, as it is the most frequent operation between the stores and the central server and handles the huge amount of data generated by the stores. The remainder of this paper illustrates the approach taken for load testing this scenario.
4.1 Scenario
Replication of data from store to server for 3,000 stores. Each store would have 100 billing counters, each counter generating bills with 200 line items.
4.2 Scenario Architecture
Figure2 – Replication scenario architecture
This replication scenario has three sub-activities:

1. Collate the billing data from all the counters and generate the XML message files.
2. Transfer the XML message files from the store to the server (replication client -> replication server).
3. Extract the XML files and store the extracted billing data in the head-office database.
We decided to take a pragmatic approach to simulating the entire scenario: first simulate each step above in isolation, and then go for the end-to-end mixed execution. The first candidate was the transfer of the XML files from the replication clients located in the 3,000 stores to the replication server at the head office. The rationale for prioritizing this particular step was that step 1 is a 'within a store' process with at most 100 counters per store, so its maximum load at any given point would be no more than 100; step 2 is where the actual load of 3,000 stores comes into the picture, so it was decided to start with that step.
4.3 Test Harness Setup

To simulate this scenario a test harness was created, which had five parts:

1. XML message folders on the injector machines
2. A VB-based replication client (.exe) on the injector machines
3. An IIS- and SQL Server-based replication server
4. An XML message folder on the head-office server
5. A Perfmon setup for monitoring resource consumption on the AS as well as the load injectors
The folder structure on the store and head office was as below.

Picture 2 – Message folder structure on replication client and server

XML messages to be transferred are placed in the 'OutBox' folder of the replication client on the store side, and messages that have been received are placed in the 'Inbox' folder of the replication server at the head office. Each store has 100 XML messages of 2 MB each in the OutBox folder, containing billing data of 100 line items each.
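Populating a store's OutBox with dummy messages of the right size is straightforward; a sketch (the paper does not describe how the test data was generated, so the file names and paths here are assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutboxSetup {
    // Fill one store's OutBox with N dummy XML messages of a given size,
    // mirroring the 100 x 2 MB layout used in the harness.
    public static void populate(Path outbox, int files, int sizeBytes) throws IOException {
        Files.createDirectories(outbox);
        byte[] payload = new byte[sizeBytes];   // stand-in for real billing XML
        for (int i = 0; i < files; i++) {
            Files.write(outbox.resolve("bill-" + i + ".xml"), payload);
        }
    }

    public static void main(String[] args) throws IOException {
        // 100 files x 2 MB per store, as in the test harness
        populate(Paths.get("OutBox"), 100, 2 * 1024 * 1024);
    }
}
```

Repeating this per simulated store gives each replication-client copy its own outgoing message set.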
The replication client was a VB-based .exe file, executed through the command line / a .bat file by passing the server IP and the XML message folder name at the client (store) end as arguments.

Command:

start prjReplicationUpload20092013-1.exe C:\
\ReplicationUpload\ReplicationUpload:10.0.0.35:S000701:100:S000701:20130812-235959(1)

prjReplicationUpload20092013-1.exe: application file name for the 1st store
10.0.0.35: server IP
S000701: store folder at the server end
20130812-235959(1): XML message folder at the client end
It was not feasible to set up and manage 3,000 actual store machines to inject the load, so multiple stores had to be simulated from a single load-injector box. This was achieved using the Windows batch utility: multiple copies of the EXE file were created under different names, one per store considered for data replication.

Picture 3 – Multiple copies of replication utility

A batch file was created to execute all the exes one after another in sequence.
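The batch-file approach, and the end-to-end timing discussed next, can equally be sketched in Java with ProcessBuilder. The elapsed time is measured from the first process launch to the last process exit; the executable names and arguments in main() mirror the illustrative replication-client command and are assumptions:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StoreLoadLauncher {
    // Start every command, then wait for all of them: the elapsed time spans
    // the first process launch to the last process exit, which is how the
    // end-to-end transfer time was measured in the harness.
    public static long runAll(List<List<String>> commands)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        List<Process> procs = new ArrayList<>();
        for (List<String> cmd : commands) {
            procs.add(new ProcessBuilder(cmd).inheritIO().start());
        }
        for (Process p : procs) {
            p.waitFor();                 // block until this upload finishes
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        // One command per simulated store (names/arguments are illustrative).
        List<List<String>> stores = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            String storeId = String.format("S%06d", 701 + i);   // e.g. S000701
            stores.add(List.of(
                "prjReplicationUpload-" + i + ".exe",            // per-store copy
                "10.0.0.35", storeId, "20130812-235959(" + (i + 1) + ")"));
        }
        System.out.println("end-to-end ms: " + runAll(stores));
    }
}
```

Separating the launcher from the command list makes the timing logic reusable for any per-store executable.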
The next question was how to calculate the time taken for the entire message file upload operation when multiple copies of the replication client fire simultaneously, each uploading XML messages to the replication server. The best way to measure end-to-end data transfer time was from the moment the first replication exe was triggered to the moment the last XML message file was uploaded to the replication server.
5. Test Setup
For the server configuration it was decided to go ahead with the same configuration used for existing customers and, based on the results of these tests, to perform the server sizing and capacity planning activity.
AS Configuration
Operating System: Windows Server 2012 DataCenter
Web Server: IIS 8
Number of Cores: 4
RAM: 28 GB
Network Card Bandwidth: 10 Gbps
Table1 – AS server configuration
Database Server Configuration
Operating System: Windows Server 2012 DataCenter
Web Server: IIS 8
Number of Cores: 4
RAM: 7 GB
Network Card Bandwidth: 10 Gbps
Table2 – DB server configuration
This hardware configuration was not available in-house and needed to be either procured or rented for this activity. Considering the short span of the test execution phase, it was decided to rent the hardware from the local market.
5.1 Load Injectors
Finding the size and required number of load injectors was tricky. As mentioned above, it was not feasible to set up and manage 3000 actual store machines to inject the load, so the load of multiple stores had to be initiated from a single load injector box. With this approach it was a must to ensure that the load injector itself was not overloaded, and that the number of injectors remained small enough for injector management to stay feasible.
To arrive at the required number of injectors, sample tests were conducted by simulating multiple copies of the replication client from a single injector using a Windows batch file. The number of replication clients was gradually ramped up until the injector CPU reached 70%. A single injector with an Intel P4 processor and 2 GB RAM supported 100 instances of the replication client, which means 30 load injectors are required to initiate the load of 3000 stores. That many machines were not available for load testing in the local environment, so the option of reducing the number of injectors by increasing hardware capacity was evaluated. However, arranging such high-end machines was not commercially or logistically viable. Considering this, it was decided to go ahead with the machine configuration used for the sample test: being a standard configuration, its availability and cost were affordable.
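The sizing arithmetic from the sample tests is simple: one P4 injector with 2 GB RAM sustained about 100 replication-client instances below the 70% CPU ceiling, so:

```python
# Injector sizing: round up so the last partially-filled injector is counted.
import math

def injectors_needed(total_stores, stores_per_injector):
    return math.ceil(total_stores / stores_per_injector)

print(injectors_needed(3000, 100))  # 30
```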
5.2 Rented vs Cloud-Based Load Injectors
Two options were at our disposal: rent the load injectors and servers in the local market, or see whether the test could be performed in a virtual cloud environment. Costing was obtained for the rental option from the local market, and for the cloud-based virtual environment multiple vendors, such as Amazon cloud and Microsoft Azure, were evaluated.
A total effort of 15 days was originally planned for executing this particular scenario. For local renting, the minimum rental duration was 1 month, at the following rates:
Client - $50 per month per machine
App Server - $150 per month per server
Database - $50 per month per server
In the cloud case, a flexible on-demand costing option was available. For the on-demand cost calculation, a detailed usage pattern was defined for the load injectors and servers over those 15 days.
Machine | Instances | Days Required | Usage | Activity
Setup machines | 2 | 15 | 12 hrs per day | Environment setup and sample runs
Load Injectors | 30 | 5 | 12 hrs per day | Execution of 3000 stores
Application Server | 1 | 15 | 12 hrs per day | Sample and actual runs
Database Server | 1 | 15 | 12 hrs per day | Sample and actual runs
Table3 – Usage pattern for machines during design and execution of scenario1
Based on the above usage pattern, the costs of the Amazon and Microsoft Azure setups were calculated and compared with the local renting option, as below:
Virtual Machines | Instances | Microsoft Azure ($) | Amazon ($) | Local Renting ($)
Load injectors | 30 | 648 | 858 | 1500
Setup machines | 2 | 86.4 | 547 | 100
AS Server | 1 | 183.6 | 270 | 150
DB Server | 1 | 442.8 | 98 | 50
Total | | 1360 | 2055 | 1800*
Table4 – Cost comparison between Azure, Amazon and local renting
*Cost includes hardware only; OS licenses for clients and servers and SQL Server licenses are charged separately.
Among the clouds, Microsoft Azure was cheaper than Amazon and also had the added benefit of 5 GB of free data upload and download, compared with just 1 GB in the case of Amazon. Microsoft Azure also won the cost comparison against the local renting option; apart from the hardware cost, local renting carried the additional cost of OS and SQL Server licenses.
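As a sanity check, the itemised figures can be totalled in a few lines (the Azure line items are as printed in Table 4; the local figures follow the monthly rental rates and instance counts listed earlier):

```python
# Totalling the per-option costs from the tables above.
azure = {"load_injectors": 648.0, "setup_machines": 86.4,
         "as_server": 183.6, "db_server": 442.8}
local = {"load_injectors": 30 * 50,   # 30 injectors x $50/month
         "setup_machines": 2 * 50,    # 2 setup machines x $50/month
         "as_server": 150,
         "db_server": 50}

print(sum(azure.values()))  # 1360.8, reported as 1360 in Table 4
print(sum(local.values()))  # 1800, matching Table 4 (hardware only)
```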
5.3 Microsoft Azure Load Test Environment
Figure3 – Load test setup at Azure and local environment
An isolated environment was set up in the Azure cloud, comprising the replication server on the AS, a database server, 30 load injectors and 2 setup machines. Considering the high volume of transaction traffic, a 10 Gb LAN was set up for the load testing environment in Azure. This environment was accessed over RDP connections through controlling clients set up in the local environment.
To control and manage the 30 load injectors in the Azure environment, 6 controlling clients had to be set up locally. From each controlling client, 5 load injectors were accessed to set up and execute tests and capture the result data.
6. Test Execution and Results Analysis
6.1 Initial Test Results
After setting up the test environment, test execution started with a small number of stores. Based on the results of each run, the store load was gradually ramped up. The first test, for 100 stores, was successful. The number of stores was then increased across tests from 100 to 200, 500, 700 and 800. Up to 700 stores, all XML files from the stores were transferred to the replication server; during the 800-store test, however, some stores started failing. A few more tests, with 1000 and 1600 stores, were conducted to analyse the failures. The results are summarised in the table below.
Stores # | Successful Stores # | Failed Stores # | Start Time (HH:MM) | End Time (HH:MM) | Total Time (mm:ss) | Status
100 | 100 | 0 | 6:41 | 6:43 | 0:02:00 | Pass
200 | 200 | 0 | 11:27 | 11:30 | 0:03:00 | Pass
500 | 500 | 0 | 13:28 | 13:35 | 0:07:00 | Pass
700 : Round 1 | 700 | 0 | 8:37 | 8:46 | 0:09:00 | Pass
700 : Round 2 | 700 | 0 | 14:03 | 14:15 | 0:12:00 | Pass
800 : Round 1 | 700 | 100 | 6:38 | 6:48 | 0:10:00 | Fail
800 : Round 2 | 702 | 98 | 9:46:28 | 10:02:00 | 0:15:32 | Fail
1000 : Round 1 | 954 | 46 | 12:57 | 13:12 | 0:15:00 | Fail
1000 : Round 2 | 906 | 94 | 12:28 | 12:45 | 0:17:00 | Fail
1600 | 1300 | 300 | 7:49 | 8:05 | 0:16:00 | Fail
Table5 – Test results summary of scenario1 on Azure
It was observed that beyond 700 stores the replication scenario's behaviour was inconsistent. To ascertain the reason for the failures, resource consumption data on the replication server was analysed further. For this detailed analysis, parameters for each hardware resource were identified: % CPU utilization, available memory, % disk queue length, % processor queue length and network bandwidth.
Table 6 – Resource Utilization Analysis
This analysis highlighted that when all the stores start the replication activity, the server disk becomes saturated; the processor and disk queue lengths then build up beyond threshold values, resulting in inconsistent behaviour and failures.
Based on this analysis it was decided to upgrade both hardware resources if possible, or at least the disk speed, which was the main culprit for the failures. The current configuration of these two resources was 4 cores and a 10k RPM disk. For a stepwise scaling of these resources, it was decided to upgrade to 6 CPU cores and a 15k RPM disk.
These new hardware requirements were checked with Microsoft Azure to see whether more cores and higher-speed disks could be made available. It turned out that the number of cores could be upgraded to 8, but not the disk speed: all instances in the disk array had the same speed, and it was impossible for Microsoft to arrange higher-speed disks for our testing. This was the show-stopper for further testing and an important revelation of a limitation of the cloud environment. A performance tuning activity that requires many configuration changes at the underlying hardware layer cannot be performed efficiently in a cloud environment where resources are shared and cannot be changed.
6.2 Moving to a Physical Server in the Local Environment
After this revelation of Microsoft Azure's limitations, it was decided to evaluate other options, even though they were costlier than Azure: the Amazon cloud and rented physical servers in the local environment. The Amazon cloud offered higher-speed disks as well as more CPUs. However, based on the experience with Microsoft Azure, it was decided to rule out the cloud option: even if Amazon provided higher-speed disks, there might be further limitations on other resources and their tuning.
Considering the entire situation and the limitations of the cloud environment, it was decided to rent a server from the local market with the configuration below.
SERVER
Operating System: Windows Server 2012 DataCenter
Operating System Type: 64-bit
Processor: Intel® Xeon® CPU E5-2630 @ 2.30 GHz
Web Server: IIS 8
Number of Cores: 6
RAM: 28 GB
Table7 – AS server configuration for local environment
Fortunately this hardware configuration was available with a local vendor. The next challenge was to arrange 30 load injectors, which were not available in the load test environment and could not be provided by the vendor at short notice. Given the limited time in hand, it was decided to use machines from other teams outside office hours to carry out the further tests.
With all these challenges overcome, tests were carried out in the local environment and, as anticipated, the 3000-store replication worked without any hitches!
Stores # | Successful Stores # | Failed Stores # | Total Time (minutes) | Status
100 | 100 | 0 | 1 | Pass
200 | 200 | 0 | 2 | Pass
500 | 500 | 0 | 3 | Pass
700 | 700 | 0 | 4 | Pass
1000 | 1000 | 0 | 8 | Pass
1500 | 1500 | 0 | 15 | Pass
2000 | 2000 | 0 | 16 | Pass
2500 | 2500 | 0 | 21 | Pass
3000 | 3000 | 0 | 29 | Pass
Table8 – Test results summary for scenario1 on local environment
7. Challenges/Issues Faced in the Cloud During Execution
Apart from the limitation on configuration changes in the cloud environment, a few other challenges were faced during execution. Most of these were due to the large number of load injectors to be managed and the mode of access, i.e. RDP, being slow over the internet. A few of them are described below.
7.1 Switching between injectors to initiate the test
Due to the nature and design of the replication client, there was no central utility or application available to initiate the load from all 30 injectors automatically, the way most load testing tools do. Test initiation had to be done manually by logging in to the individual boxes. To accurately simulate the real-time behaviour of the replication scenario, high concurrency had to be maintained during execution, which meant initiating the load from all 30 injectors at the same time, or at least with a very short delay. To facilitate this, switching between injectors through a controlling machine had to be very fast. Switching between load injectors over RDP was tedious; it would have been much easier with 30 controller machines, each managing a single injector, but that was not the case in this scenario.
7.2 Test data setup
Test setup included various tasks, and one of the more difficult ones was creating the folder setup for the 100 stores per injector. Each execution cycle required unique bill numbers in the billing data, so the message folders had to be updated before each test cycle with a unique bill number in each store folder. Setting up folder structures on 30 client machines to simulate 3000 stores was tedious and time consuming, and performing it over RDP added further complexity and time.
7.3 Monitoring
It was also necessary to monitor the health of each load injector during execution to make sure none were overloaded. Keeping an eye on the resource consumption of 5 load injectors from a single controlling machine was challenging.
7.4 Data transfer
These tests generated a huge amount of result data, including resource utilization data. In the absence of applications such as Microsoft Excel, and given the slow RDP connectivity, it was difficult to analyse this data on the cloud machines themselves, so the data had to be downloaded for every test run. Downloading large amounts of data over the internet connection was time consuming, and costs applied beyond the data-transfer limit stipulated by Microsoft.
8. Conclusion and outlook
For load and performance testing, the cloud provides an edge over conventional on-premises test setups. One can take advantage of the cloud to build test environments within a very short time and with cost and logistical flexibility. With all these advantages, however, come a few disadvantages and challenges that can severely hamper the purpose. Limitations such as no access to underlying hardware configuration parameters do not suit activities such as bottleneck identification, tuning and optimization. Managing a cloud setup remotely, and transferring data over the internet, add further complexity and delay to the overall schedule.
One can consider the cloud for load and performance testing, but it is recommended that all these pros and cons be studied in detail, in the context of the specific load testing requirement: determine what can and cannot be performed in the cloud, and define the load testing strategy accordingly.
References
[Gartner 2014] http://www.gartner.com/it-glossary/cloud-computing/
[Cloud market 2013] Gartner survey, http://www.forbes.com/sites/louiscolumbus/2013/02/19/gartner-predicts-infrastructure-services-will-accelerate-cloud-computing-growth/
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Building Reliability in to IT Systems
Kutumba Velivela / Ramapantula Uday Shankar Tata Consultancy Services
[email protected]/ [email protected]
Information Technology (IT) system reliability is in critical focus for government and business because of its huge cost and reputation impacts. A well-designed application will not only be failure-free but will also allow failures to be predicted so that preventive maintenance can take place. It will also have adequate resilience, capacity, security and data integrity. IT reliability covers all parts of the system: hardware, software, interfaces, support setup, operations and procedures. Due to the complexity in each of these areas, organisations are giving priority to developing end-to-end reliability-specific capabilities. These capabilities can be delivered under the headings of assessment, engineering, design, modelling, assurance and monitoring. In this paper, we propose formal methods for developing a reliability centre of excellence, with a customised maturity model, that will guarantee 5-9s availability to critical business functions. Positive effects of this approach, beyond giving peace of mind to senior managers, include a reduction in frequent re-design of applications, positive culture change within the organisation and an increase in market share.
Keywords: IT Availability Management, Reliability, Centre of Excellence, Assessment, Engineering, Design, Modelling, Assurance, Monitoring, Metrics, Error Prevention, Fault Detection, Fault Removal, Service Level Agreement (SLA), Maintainability.
1. Introduction
“Availability Management is responsible for optimising and monitoring IT services so that they function reliably and without interruption, so as to comply with the SLAs, and all at a reasonable cost.” [ITIL OSIATIS] Technology service failures have been making news headlines for the last few years for causing extreme impacts on well-established businesses and government departments. Payment/ATM failures, travel disruptions, cancelled medical operations, huge trading losses, reduced defence security, smart-mobile blackouts and unpaid wages headed the top technology disasters of the last few years. Affected organisations include the US Government, NHS, Walmart, Bank of England, M&S, Natwest, LBG, stock exchanges, airlines, utilities and car manufacturers [Colin 2013] [Phil 2011] [Phil 2012]. A research summary of the reasons for IT systems unavailability is included in Appendix A.
Figure 1 – Costs and other impacts of service disruptions
[Chart: downtime cost per hour is £109,116 across all respondents, £5,721 for small companies, £143,759 for medium companies and £457,500 for large companies (Source: Aberdeen Group, May 2013). Recovery from downtime takes 1.13 to 27 hours; maximum tolerable downtime is 52.63 minutes; the average number of major disaster events per year (not including medium and minor) is 3.5. 1 in 4 small companies close down due to major IT systems failure, and 70% of small firms go out of business within a year of a major data loss (Source: HP and SCORE Report).]
Even companies that have had no major failures are hit by ever-increasing hardware/software maintenance costs and delayed software deliveries. Operations teams are unable to cope when there are unexpected increases in faults. Essential services are shut down without prior notice due to communication and/or process failures. Backups and switchovers to redundant systems often do not work when needed. Technology and IT systems reliability was the primary concern of installation designers and maintenance teams for more than 50 years. But, due to millions of dissatisfied customers, loss of data, fraud write-offs, regulatory fines and criminal/civil penalties, technology reliability has become a major concern for business/IT account managers, business analysts, IT strategists/architects, designers and testers. This is all the more true for safety-critical systems, 24x7 web sites, systems software, embedded systems and other "high-availability must" applications. This paper presents reliability-specific offerings organisations can adopt for preventing errors, detecting and removing faults, maximising reliability and reducing the drastic impacts of failures. By using this paper as a roadmap, businesses can build IT reliability skills that provide additional peace of mind to senior management.
2. Background
Reliability is an important, but hard to achieve, attribute of IT systems quality. These attributes are normally covered under non-functional requirements in the early stages of projects. Reliability analysis methods help identify critical components and quantify their impact on overall system reliability. Employing this sort of analysis early in the lifecycle saves a large percentage of the budget for maintenance and production support. Hardware-specific reliability and related methods originated in the aerospace industry nearly 50 years ago and subsequently became 'must-use' in automotive, oil & gas and various other manufacturing industries. Arising from this appreciation of the importance of reliability and maintainability, a series of US defence standards (MIL-STDs) was introduced and implemented around the 1960s. Subsequently the UK Ministry of Defence introduced similar standards. Reliability methods have successfully allowed hardware products to be built to satisfy high reliability requirements, and the final product reliability to be evaluated with acceptable accuracy. In recent years, many of these products have come to depend on software for their correct functioning, so the reliability of combined hardware-plus-software components has become critically important. Even pure IT applications depend on the hosting data centre, servers and other components being reliable. Hence, software reliability has become an important area of study for software engineers. Although still maturing, reliability methods have been adopted either as a standard or as a best practice by a few large organizations. In some of these it is a regulatory requirement that IT systems be certified to meet previously specified availability and reliability requirements. Many other organisations are yet to take up this standard and reap the rich rewards of focusing on the availability criteria with which all critical IT development processes must comply.
3. Reliability Engineering = 5-9s or 99.999% Availability
In order to meet rising customer expectations of quality software running 24x7, often specified as 5-9s in requirements, a fundamental shift is needed in the way IT applications are developed and maintained. Detailed hardware and software reliability requirements need to be documented, and special focus given to meeting them from the design stage through to implementation. The reliability skills proposed in this paper offer a comprehensive approach to addressing all IT reliability-related issues, including capacity, redundancy, data integrity, security and maintainability. Critical applications developed without a proper reliability approach lead to frequent partial re-designs or full re-development, because they become cumbersome to maintain. A well-designed application will either be failure-free or will allow failures to be predicted so that preventive maintenance can take place. If a failure has safety or environmental impact, the system must be preventively maintainable, preferably before the failure starts disrupting production. New reliability-specific capabilities will help businesses shift substantially from reacting to failures when they happen to managing them pro-actively through approaches like Reliability Centred Design and Analysis (RCDA), covered in more detail in the next section. Setting up a separate reliability Centre of Excellence (COE) will not only directly enhance business image and customer satisfaction, but also indirectly contribute to increased market share and cost savings. Developers who have applied these methods have described them as "unique, powerful, thorough, methodical, and focused." The skills developed are highly correlated with attaining best-in-class levels 4 and 5 of the Capability Maturity Model. Based on experience across multiple projects, when done properly, Software Reliability Engineering adds a maximum of approximately 2-3% to project cost.
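The 5-9s target translates into a concrete downtime budget. A quick sketch of the arithmetic (using a 365.25-day year):

```python
# Downtime budgets implied by availability targets: five-nines (99.999%)
# allows only about 5.26 minutes of downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # = 525,960

for label, availability in [("3-9s", 0.999), ("4-9s", 0.9999), ("5-9s", 0.99999)]:
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: {downtime_min:.2f} minutes of downtime per year")
```

Three-nines allows roughly 8.8 hours per year; each extra nine cuts the budget by a factor of ten.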
3.1 Reliability-Specific Capabilities
A reliability COE focuses on the customer's reliability-related business issues and helps meet their expectations efficiently. A combination of offerings can be provided under the major headings of:
- Reliability Engineering
- Reliability Assessment
- Reliability Modelling
- Reliability Centred Design Analysis (RCDA)
- Software Reliability Acceptance Testing
- Reliability Analysis and Monitoring using specialist tools
The methods used under these headings are fundamentally similar, but reliability offerings often have to be customised to the stage of development. For example, a reliability assessment offering applies mainly to existing applications and needs some modelling, use of tools and some testing. Similarly, a reliability engineering offering applies to new or redesigned applications and needs some assessment, modelling, RCDA, testing and tool use.
3.1.1 Reliability Engineering
Reliability Engineering involves defining reliability objectives and adapting the required fault prevention, fault removal and failure forecasting modelling techniques to meet the defined objectives throughout the development lifecycle. The emphasis is on quantifying availability by planning and guiding software development, test and build processes to meet the target service levels. A collaborative culture change is needed in solution architecture, application development, service delivery, operational and maintenance teams to implement this approach. Fault prevention during build requires better development and test methods that reduce error occurrences. Smart error handling and debugging techniques are to be adopted during design and test reviews so that faults are removed at the earliest possible time. By modelling the occurrence of failures and using statistical methods to predict and estimate the reliability of IT systems, more focus can be given to high-risk components and Single Points of Failure (SPOFs). Refer to Figure 2 for a representation of the engineering components.
Reliability engineering is a continuous process, as the analysis may have to be repeated as further IT system releases are delivered. Ongoing improvements in fault-tolerant and defensive programming techniques will be required to meet the reliability targets the business expects.
Figure 2 – Reliability Engineering Components
3.1.1.1 Reliability Engineering Techniques
Popular hardware techniques include redundancy, load-sharing, synchronisation, mirroring and reconciliation at different architecture tiers. Software techniques include modularity for fault containment, programming for failures, defensive programming, N-version programming, auditors, and transactions to clean up state after failure.
3.1.2 Reliability Assessment
Reliability assessment can be conducted at the level of multi-location systems, single data centres, services, servers and/or components. The diagram below shows three popular assessment methods and how they can be implemented together in a continuous improvement scenario. Each approach can also be implemented on its own as a one-off exercise, depending on the lifecycle stage the IT system is in.
[Figure 2 diagram: Fault Prevention during build; Fault Removal through Inspection and Testing; Failure Forecasting & Modelling; Fault Tolerance & Defensive Programming techniques; connected by feedback loops.]
Figure 3 – Reliability Assessment Methods
Architecture-based reliability analysis focuses on understanding the relationships among system components and their influence on system reliability. It is based on identifying critical components/interfaces and concentrating on the potential problem areas and SPOFs. It assumes that the reliability and availability of an IT system is proportionate to the corresponding measurements of its reusable hardware/software components; Figure 4 gives an example.
Figure 4 – Measuring reliability by components
Metric-based reliability analysis is based on static analysis of hardware/software complexity and the maturity of the design and development process and conditions. This approach is particularly useful when no failure data is available, for example while a new IT system is still in its design stages. IEEE developed the standard IEEE Std. 982.2 (1988), and a few other product metrics are available to support reliability assessors in achieving optimum reliability levels in software products. Similar vendor-supplied reliability data is available for hardware components and third-party components.
The black-box approach ignores information about the internal structure of the application and the relationships among system components. It is based on collecting failure data during testing and/or operation and using that data to predict/estimate when the next failure will occur. Black-box reliability analysis evaluates how reliability improves during testing and varies after delivery. As pointed out in Appendix A, failure to adopt best practices in long-term monitoring of the relevant components is one of the major reasons for IT unavailability.
A combination of these methods will be required for IT systems that need high levels of reliability.
3.1.3 Reliability Modelling
Over 200 models have been developed to help IT project managers deliver reliable software on time and within budget. A good practical modelling exercise can be used to initiate enhancements that improve reliability from the early development phases. Based on predictive-analytics concepts, different models are used depending on the type of analysis needed:
- Predict reliability at some future time based on past historical data, even during design stages,
- Estimate reliability at some present or future time based on data collected from current tests,
- Estimate the number of errors remaining in partially tested software and guide the test manager on when to stop testing.
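One widely cited black-box model of this kind is the Goel-Okumoto NHPP model [Goel 1985], in which the expected number of failures observed by test time t is mu(t) = a(1 - e^(-bt)), where a is the total expected fault count and b the per-fault detection rate. A minimal sketch, with purely hypothetical parameter values rather than fitted ones:

```python
# Goel-Okumoto mean value function: expected failures observed by time t.
import math

def expected_failures(a, b, t):
    """a = total expected faults, b = detection rate, t = test time."""
    return a * (1 - math.exp(-b * t))

a, b = 100.0, 0.05   # hypothetical: 100 latent faults, rate 0.05 per day
t = 30               # after 30 days of testing
found = expected_failures(a, b, t)
print(f"found ~{found:.1f}, remaining ~{a - found:.1f}")  # found ~77.7, remaining ~22.3
```

In practice a and b are estimated from the observed failure log (e.g. by maximum likelihood), and the estimate of remaining errors informs the stop-testing decision mentioned above.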
Like performance models, no single reliability model can be used in every situation, because each is based on a number of assumptions, parameters, mathematical calculations and probabilities. The modelling field is fast maturing, and carefully chosen models can be applied in practical situations to give meaningful results.
3.1.4 Reliability Centred Design and Analysis (RCDA)
Reliability should be designed in at the IT strategy level, and a formalised RCDA methodology is needed to reduce the probability and consequence of failure. Various published statistics show that a large percentage of failures can be prevented by making the needed changes at the design stage. Successfully implemented, RCDA can result in improved productivity and reduced maintenance costs. The focus of RCDA throughout the lifecycle is to ensure services are available whenever business users need them. For that to happen, IT capacity has to be aligned to business needs, sufficient redundancy must be built in so that critical services still run during significant failures, and data integrity/confidentiality must be maintained at all times. Below is a high-level flow diagram showing a sequence of basic steps to be followed as part of RCDA:
[Figure 3 diagram: Architecture-based Reliability Analysis (evaluation of IT component reliabilities and system architecture), Metric-based Reliability Analysis (evaluation based on function points, complexity, development process and testing methods) and Black-box Reliability Analysis (estimation of reliability based on failure observations from testing or operation), connected by feedback loops.]
[Figure 4 diagram: an intranet, web server, two load-balanced app servers and a DB server, each with an individual availability in the 97.7% to 99.9% range, combining to an end-to-end availability of 97.7%.]
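The component arithmetic behind end-to-end figures such as those in Figure 4 can be sketched as follows: serial components multiply their availabilities, while redundant (parallel) components fail only if every replica fails. The availabilities used here are illustrative, not the figure's actual values:

```python
# End-to-end availability from component availabilities.
from functools import reduce

def serial(*avail):
    """Chain in series: all components must be up."""
    return reduce(lambda x, y: x * y, avail)

def parallel(*avail):
    """Redundant replicas: down only if all replicas are down."""
    return 1 - serial(*[1 - a for a in avail])

web, app1, app2, db = 0.98, 0.97, 0.97, 0.99
end_to_end = serial(web, parallel(app1, app2), db)
print(f"{end_to_end:.4f}")  # 0.9693
```

Note how the redundant app-server pair (0.9991 combined) is far more available than either replica alone, yet the serial web and DB tiers still drag the end-to-end figure below any single component, which is why SPOF removal dominates this kind of analysis.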
Figure 5 – Basic Steps in RCDA for an IT system 3.1.4.1 Load-balancing and Failover Reliable IT systems should be housed in a highly secure and resilient data centres and the solutions should be built around a redundant architecture able to ensure hardware, network, databases, and power availability as needed. Latest active/active failover, recovery and continuity mechanisms to be considered to help meet the high business availability requirements. However, IT architects need to be careful while employing complex redundant solutions as they can often be the sources for major failures. Some of the latest major business IT failures are due to incorrect setup or inadequate testing of complex redundancy and backup solutions. 3.1.4.2 Other Design Factors Business will not accept IT systems just because they are available 24x7. Reliable IT systems must meet various business specified requirements including performance, capacity to match business growth, security, data integrity and on-going maintainability. [Evan 2003] identified Top-20 Key High Availability Design Principles that range from removing Single Point of Failures to keeping things simple. This kind of analysis will guide reliability designers and architects in developing customised best practices. 3.1.5 Reliability Acceptance Testing Like all other non-functional requirements, reliability and availability for IT systems need good
validation and verification phases. However, traditional software development and testing often focus on the success scenarios whereas reliability-specific testing focuses on things that can go wrong. New testing methods focus on failure modes related to timing, sequence, faulty data, memory management, algorithms, I/O, DB issues, schedule, execution and tools.
Figure 6 – Example Assurance Team Structure
Some of the methods that guide these tests are Reliability Block Diagrams (RBDs), Failure Mode Effect Analysis (FMEA), Fault Tree Analysis, defect classification, operational profiles and error handling/reporting functions. These methods help testers develop reliability-specific test cases during the integration, user acceptance, non-functional, regression and deployment test phases. Some sectors need their IT systems certified along with hardware components, and they need reliability-based acceptance criteria to be defined and met before any change is released into production. Given a component of an IT system advertised as having a particular failure rate, the assurance team can analyse whether it meets that rate to a specified level of confidence.
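Checking an advertised failure rate to a chosen level of confidence is classically done with a sequential probability ratio test, whose accept/continue/reject regions correspond to an assurance chart of failure number against normalized failure time. The sketch below is ours, not the paper's method; it assumes a constant failure rate and illustrative rate and risk parameters:

```python
import math

def sprt_decision(failures, total_time, lam_ok, lam_bad, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a constant failure rate.

    lam_ok:  acceptable failure rate (H0), e.g. failures per hour
    lam_bad: unacceptable failure rate (H1), with lam_bad > lam_ok
    alpha:   risk of rejecting a good component; beta: risk of accepting a bad one
    Returns 'accept', 'reject' or 'continue' (gather more operating time).
    """
    # Log-likelihood ratio of H1 vs H0 after `failures` events in `total_time`.
    llr = failures * math.log(lam_bad / lam_ok) - (lam_bad - lam_ok) * total_time
    lower = math.log(beta / (1 - alpha))   # at or below: accept H0 (meets rate)
    upper = math.log((1 - beta) / alpha)   # at or above: reject H0
    if llr <= lower:
        return "accept"
    if llr >= upper:
        return "reject"
    return "continue"

# A long failure-free run drives the statistic into the accept region.
print(sprt_decision(0, 1000, lam_ok=0.001, lam_bad=0.005))
```

The appeal of the sequential form is that clearly good or clearly bad components are certified or rejected early, while borderline cases simply accrue more test time.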
[Figure 5 content: Reliability Requirements and Specification → Key Component Analysis using RBDs → Failure Mode Effect Analysis for Key Components → update the capacity and performance, security and data integrity, resiliency/failsafe/backup, and production support/maintenance plans → review error handling/reporting/diagnostic techniques, operational profiles, fault tree/event tree diagrams and validation/verification reports → Acceptance Criteria Met? → Certified, or Re-design.]
[Figure 6 content: the assurance team brings together an Assurance Facilitator, a Reliability Engineer, Service Delivery, Production Support, a Solution Architect and Technology Suppliers.]
Figure 7 – Assurance Criteria Example
3.1.6 Reliability Monitoring and Analysis using Specialist Tools
Reliability is measured by counting the number of operational failures and assessing their effect on IT systems at the time of failure and afterwards. A long-term measurement programme is required to assess the reliability of critical systems. Well-known software reliability metrics that can be used include Probability of Failure on Demand (POFOD), Rate of Fault Occurrence (ROCOF), Mean Time to Failure (MTTF), Mean Time Between Failures (MTBF) and Mean Time to Repair (MTTR). Most of the analysis mentioned above can be performed with office tools by an experienced analyst. However, a few specialised tools and workbenches are available that help with different types of analysis, including reliability modelling and estimation/prediction; a partial list of these tools is available in the references [Kishor 2013, Goel 1985]. Prediction and estimation with these tools need a good understanding of analytics methods and basic probability theory, and the reliability specialist team has to master the tool-related skills before recommending any of them to a customer area. Often these tool-related skills become a continuous source of budget and revenue for the CoE over prolonged periods.
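The metrics named above fall out directly from an incident log of uptimes and repair durations. The log below is hypothetical, invented purely to show the arithmetic:

```python
# Hypothetical incident log: (hours of uptime before failure, hours to repair).
incidents = [(720.0, 2.0), (1100.0, 0.5), (450.0, 4.0), (980.0, 1.5)]
demands_served, demands_failed = 50_000, 12   # illustrative counts for POFOD

uptimes = [u for u, _ in incidents]
repairs = [r for _, r in incidents]

mttf = sum(uptimes) / len(uptimes)            # Mean Time to Failure
mttr = sum(repairs) / len(repairs)            # Mean Time to Repair
mtbf = mttf + mttr                            # Mean Time Between Failures
observed_period = sum(uptimes) + sum(repairs)
rocof = len(incidents) / observed_period      # Rate of Fault Occurrence (per hour)
pofod = demands_failed / demands_served       # Probability of Failure on Demand
availability = mttf / mtbf                    # steady-state availability

print(f"MTTF={mttf:.1f}h MTTR={mttr:.1f}h MTBF={mtbf:.1f}h "
      f"ROCOF={rocof:.5f}/h POFOD={pofod:.5f} A={availability:.4%}")
```

A long-term measurement programme amounts to accumulating such a log per critical system and tracking whether these derived figures trend toward the business targets.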
4. How to Set Up a Reliability CoE?
There is no single fixed method for setting up a Reliability CoE and, whichever way is chosen, it is not going to be a simple journey. Building any niche team requires commitment, hard work and support from all stakeholders. The sample model below shows some of the factors that will bring maturity to the CoE organisation.
Figure 8 – Example CoE Maturity Model
A maturity model similar to the above can be used as the basis for a CoE development plan of action and as a means of tracking progress against targets. The model above shows nine sample central headings and four interface headings; in a real model, these headings should be chosen in consultation with senior management and other stakeholders.
4.1 Strategy
Most organisations prefer to start with small steps when it comes to new CoEs and customise the approach as the concept catches on with more partners and customers. Here is a list of generic steps that can be followed:
- Consult with industry sponsors and outside partners,
- Appoint talented leadership with a high level of business knowledge,
- Establish a vision for the reliability practices,
- Identify software reliability champions internally and in customer areas,
- Define the organisation structure and secure funding,
- Start building a knowledge repository and sharing mechanisms,
- Develop an action plan for each of the areas mentioned in the maturity model,
- Develop strict metrics for each area mentioned in the maturity model,
- Evaluate, select, and mandate vendor products and standards,
- Collaborate with other IT consultancy areas to create reusable assets,
- Set up review and approval mechanisms for deliverables,
- Seek feedback and use it for continuous improvement,
- Encourage innovation and allow challenging of the status quo,
[Figure 7 content: a sequential acceptance chart plotting Failure Number against Normalized Failure Time, divided into Accept, Continue and Reject regions.]
[Figure 8 content: central headings People, Quality, Process, Tools, Thought Leadership, Governance, Efficiency, Innovation and Collaboration; interface headings Customers, Vendors, Partners and Regulation.]
- Customise to fit different customer cultures.
4.2 Processes
IT processes, often constrained by resources, a backlog of projects, governance processes and controls, and a lack of focus on security and maintainability, frequently fail to deliver their set objectives. Beyond generic processes such as project management, software engineering and marketing, a Reliability CoE needs the following to deliver its availability objectives quickly:
- an agile assessment, modelling, testing and measurement process for reliability,
- techniques that focus on error prevention, fault detection and removal,
- processes adapted for real-time, online/web and batch applications,
- an early defect/SPOF detection framework supported by a comprehensive error handling process,
- a knowledge repository and reliability governance programme,
- adaptation programmes to find better ways of working with partners, vendors and government departments,
- processes to identify areas that need less effort but are likely to have a bigger outcome,
- a review process with the aim of continuous improvement.
4.3 Technologies
Reliability technologies are evolving fast, but currently there are no uniformly recognised, mature ones. Most companies have their own selection of products and methods that fall within their comfort zone. This means a thorough assessment with customer engagement and a proof of concept (POC) is needed before adopting these technologies in customer areas. The diagram below shows where customer engagement and a POC fit in an IT technology lifecycle.
Figure 9 – Reliability Technology Selection Process
4.3.1 Tools
A few suites of tools and workbenches are available to support reliability analysts in documenting Reliability Block Diagrams (RBDs), Fault Tree Analysis, Markov Modelling, Failure Mode and Effect Analysis (FMEA), Root Cause Analysis, Weibull Analysis, Availability Simulation, Reliability Centred Maintenance and Life-Cycle Cost Analysis. The focus of these tools has mostly been hardware reliability, but recently they have been adapted for IT infrastructure, software and process components. A few software-specific tools are also available that help with software reliability modelling, statistical modelling and estimation, and software reliability prediction [Kishor 2013], [Allen 1999].
4.4 People
The supply of people with proven, practical reliability analysis experience is very limited. Companies therefore need to find people with some of the required skills and train them in the remaining areas. The figure below shows a good proportion of the skills needed in a reliability CoE.
Figure 10 – Proportion of Skills in a Reliability CoE
[Figure 9 content: Assess stakeholder requirements → Engage business and IT in technology selection → Build a POC with vendor support → Confirm with the business case and standards → Architect the solution → Build the solution. Choose technologies adaptable to customer scenarios, and build solutions that scale for growth.]
Beyond generic roles such as project manager, business analyst, architect and operational analyst, a few companies recruit specialist Reliability Managers and Reliability Analysts; sample position descriptions for these roles are provided in Appendix B. In general, staff with 6-10 years of experience in 3-4 areas from the list below can be trained into the specialised reliability roles.
- Capacity Management,
- Service Level Management,
- Configuration Management,
- Change Management,
- Test/Release Management,
- Incident Management,
- Production Support and Operations,
- Maintenance Management,
- Product Life Cycle Management,
- Vendor Management,
- Resilience and Disaster Recovery,
- Supply Chain Management,
- Asset Management.
When the focus is on a particular IT application, participation from SMEs in the areas of business functions, hardware, network, process, security, software, tools, data, operations, and maintenance would be needed.
5. Conclusion
IT organisations must focus on what is going on in business areas and customise their approach to help the business meet its requirements for systems availability and reliability efficiently. A good set of reliability practices can halve the reactive fixes needed for IT systems, and the earlier these practices are adopted in the lifecycle, the greater the savings for businesses. Based on experience, productivity gains of up to 30%, with roughly the same percentage reduction in maintenance costs, are predicted to be achievable. Reliability is one characteristic of IT systems and, with a systematic approach, it is possible to meet business requirements at lower cost and with minimal disruption. Implementation of any chosen reliability method will succeed through seamless integration with current SDLC, Agile and transformation methodologies. Marketed properly, reliability capabilities have good potential for generating regular income and ongoing project work for commercial organisations. Setting up a separate reliability excellence team in specialist IT departments requires broader effort and participation from the strategy, architecture, assurance, tools and industry vertical solution teams. The key is developing a system for proper data capture and interpretation, taking action that is reflected in KPIs such as reliability and availability, and identifying critical failure areas. Setting up the reliability CoE will not only give reliability the priority it needs but also enhance the organisation's image and improve customer satisfaction, greatly reducing the risk of angry customers. In the long term, the best reliability practices will result in positive culture change within the team as well as increased market share.
6. References
[ITIL OSIATIS] http://itil.osiatis.es/ITIL_course/it_service_management/availability_management/overview_availability_management/overview_availability_management.php
[HP 2007] Impact on U.S. Small Business of Natural & Man-Made Disasters, HP and SCORE report, 2007.
[Colin 2013] The Top Ten Technology Disasters of 2013, Colin Armitage, Chief Executive, Original Software. http://www.telegraph.co.uk/technology/news/10520015/The-top-ten-technology-disasters-of-2013.html
[Phil 2011] Top 10 Software Failures of 2011, Phil Codd, Managing Director, SQS. http://www.businesscomputingworld.co.uk/top-10-software-failures-of-2011
[Phil 2012] Top 10 Software Failures of 2012, Phil Codd, Managing Director, SQS. http://www.businesscomputingworld.co.uk/top-10-software-failures-of-2012
[Quoram 2013] Quorum Disaster Recovery Report, QuorumLabs, Inc., 2013. http://www.quorum.net/
[JBS 2013] http://www.jbs.cam.ac.uk/media/2013/research-by-cambridge-mbas-for-tech-firm-undo-finds-software-bugs-cost-the-industry-316-billion-a-year/
[NIST 2002] The Economic Impacts of Inadequate Infrastructure for Software Testing, NIST Planning Report 02-3, June 2002.
[Ponemon 2013] 2013 Study on Data Centre Outages, Ponemon Institute LLC, September 2013.
[Aberdeen 2013] Downtime and Data Loss – How Much Can You Afford?, Analyst Insight, Aberdeen Group, August 2013.
[Kishor 2013] Software Reliability and Availability, Kishor Trivedi, Dept. of Electrical & Computer Engineering, Duke University, Durham, NC 27708; TCS Ahmadabad, January 2013.
[Musa 1987] Software Reliability: Measurement, Prediction, and Application. Musa, J. D., Iannino, A., & Okumoto, K. (1987). New York: McGraw-Hill.
[Bonthu 2012] A Survey on Software Reliability Assessment by Using Different Machine Learning Techniques, Bonthu Kotaiah, R. A. Khan, International Journal of Scientific & Engineering Research, Volume 3, Issue 6, June 2012, ISSN 2229-5518.
[Pandey 2013] Early Software Reliability Prediction – A Fuzzy Logic Approach, Pandey A. K., Goyal N. K., Springer, 2013.
[Pham 2006] System Software Reliability, Reliability Engineering Series. Pham, H. (2006). London: Springer.
[Lyu 1996] Handbook of Software Reliability Engineering. Lyu, M. R. (1996). NY: McGraw-Hill/IEEE Computer Society Press.
[Goel 1985] Software Reliability Models: Assumptions, Limitations, and Applicability. Goel, A. L. (1985). IEEE Transactions on Software Engineering, SE-11(12), 1411-1423.
[Allen 1999] Software Reliability and Risk Management: Techniques and Tools. Allen Nikora and Michael Lyu, tutorial presented at the 1999 International Symposium on Software Reliability Engineering.
[Ulrik 2010] Availability of Enterprise IT Systems – An Expert-Based Bayesian Model. Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg. Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid.
[Evan 2003] Blueprint for High Availability. Evan Marcus, Hal Stern. Wiley, 2003.
7. Acknowledgement
The authors are grateful to Girish Chaudhari, Peter Andrew, Carl Borthwick and Jonathan Wright, who reviewed the material when it was first prepared for an internal team discussion. We would also like to thank Prajakta Vijay Bhatt and the anonymous CMG referees for their comments, which have helped make this paper better.
Appendix A: Surveyed Reasons for Unavailability
A survey among a few academic availability experts in 2010 ranked the reasons for unavailability of enterprise IT systems [Ulrik 2010]. They identified a lack of best practice in the following areas as the causes:
• Monitoring of the relevant components
• Requirements and procurement
• Operations
• Avoidance of network failures, internal application failures, and external services that fail
• Network redundancy
• Technical solution of backup, and process solution of backup
• Physical location
• Infrastructure redundancy
• Storage architecture redundancy
• Change control
[Evan 2003] identified that investment in the following areas will help improve the availability of IT systems:
• Good systems and admin procedures
• Reliable backups
• Disk and volume management
• Networking
• Local environment
• Client management
• Services and applications
• Failovers
• Replication
Even though these studies do not apply in all cases, they provide useful guidelines for architects and designers of IT systems. This paper proposes a more structured approach to availability management that applies to most business organisations.
Appendix B – Sample Reliability Engineer Position Descriptions
Senior Reliability Engineer – Technical IT Infrastructure
This position is located within the xxx Team in the Reliability, Maintainability and Testability Support Discipline. The xxx team has the role of increasing the availability, and reducing the through-life cost of ownership, of IT systems for customers.
Main responsibilities:
- Own end-to-end availability and performance of customer-critical services from an infrastructure point of view,
- Ensure a five-9s reliability experience for IT system users located in the UK and abroad,
- Liaise with customer teams and other partners to obtain reliability data,
- Analyse, model and interpret arising data to forecast the reliability of customer IT systems,
- Utilise reliability data to produce analysis and system performance reports for customers,
- Perform technical deep-dives into code, networking, operating systems and storage problem areas,
- Respond to and resolve emergent service problems to prevent problem recurrence,
- Liaise with Design, Support, Maintenance, Procurement and Commercial functions to identify suitable recommendations for improvements,
- Understand and interpret IT maintenance and support information to identify root causes of IT failure,
- Attend customer high-level service reviews and support root cause analysis,
- Perform detailed IT systems analysis to support releases in different production environments,
- Represent the xxx team in internal and external customer meetings,
- Participate in service capacity planning, demand forecasting, software performance analysis and system tuning activities.
Minimum qualifications
- BS degree in Computer Science or a related field, or equivalent practical experience,
- Proven experience in a similar role in a commercial organisation, using formal reliability tools and procedures,
- Good understanding of reliability, maintainability and testability practices.
Preferred qualifications
- MS degree in Computer Science or a related field,
- Experience with administration and logistics for different mainframe (M/F), server and desktop systems,
- Expertise in data structures, algorithms and basic statistical probability theory,
- Expertise in analysing and troubleshooting large-scale distributed systems,
- Knowledge of network analysis, performance and application issues using standard tools such as BMC Patrol, TeamQuest or similar,
- Experience in a high-volume or critical production service environment,
- Sound understanding of IT life-cycle management and maturity gates,
- Strong leadership, communication, report writing and presentation skills.
Senior Reliability Engineer – Software Engineering
This position is located within the xxx Team in the Reliability, Maintainability and Testability Support Discipline. The xxx team has the role of increasing the availability, and reducing the through-life cost of ownership, of IT systems for customers.
Main responsibilities:
- Own end-to-end availability and performance of customer-critical services from a software design point of view,
- Manage availability, latency, scalability and efficiency of customer services by engineering reliability into software and systems,
- Review and influence ongoing design, architecture, standards and methods for operating services and systems,
- Work in conjunction with software engineers, systems administrators, network engineers and hardware teams to derive detailed reliability requirements,
- Identify metrics and drive initiatives to improve the quality of design processes,
- Understand fault prevention, fault removal, fault tolerance and defensive programming design techniques,
- Liaise with customer teams and other partners to build five-9s reliability into software delivery procedures,
- Perform technical deep-dives into code, networking, operating systems and storage design problem areas,
- Attend customer high-level IT design reviews,
- Represent the xxx team in internal and external customer meetings,
- Participate in capacity planning, demand forecasting, software performance analysis and system tuning activities.
Minimum qualifications
- BS degree in Computer Science or a related field, or equivalent practical experience,
- Proven experience in a similar role in a commercial organisation, using formal reliability tools and procedures,
- Good understanding of reliability, maintainability and testability practices.
Preferred qualifications
- MS degree in Computer Science or a related field,
- Expertise in complexity analysis and basic statistical probability theory,
- Expertise in designing end-to-end large-scale distributed systems with full resilience,
- Experience in end-to-end infrastructure, data, applications, security and service design,
- Experience in a high-volume or critical production service environment,
- Sound understanding of IT life-cycle management and maturity gates,
- Strong leadership, communication, report writing and presentation skills.