
Performance Analysis of Leading Application Lifecycle Management Systems for Large Customer Data Environments

Paul Nelson

Director, Enterprise Systems Management, AppliedTrust, Inc. [email protected]

Dr. Evi Nemeth

Associate Professor Attendant Rank Emeritus, University of Colorado at Boulder; Distinguished Engineer, AppliedTrust, Inc.

[email protected]

Tyler Bell

Engineer, AppliedTrust, Inc.

[email protected]

AppliedTrust, Inc. 1033 Walnut St, Boulder, CO 80302

(303) 245-4545

Abstract

The performance of three leading application lifecycle management (ALM) systems (Rally by Rally Software, VersionOne by VersionOne, and JIRA+GreenHopper by Atlassian) was assessed to draw comparative performance observations when customer data exceeds a 500,000-artifact threshold. The focus of this performance testing was how each system handles a simulated “large” customer (i.e., a customer with half a million artifacts). A near-identical representative data set of 512,000 objects was constructed and populated in each system in order to simulate identical use cases as closely as possible. Timed browser testing was performed to gauge the performance of common usage scenarios, and comparisons were then made. Nine tests were performed based on measurable, single-operation events. Rally emerged as the strongest performer based on the test results, leading outright in six of the nine tests compared. In one of these six tests, Rally tied with VersionOne under the scoring system developed for comparisons, though it led on raw measured speed. In one test not included in the six, Rally tied with JIRA+GreenHopper both numerically and within the bounds of the scoring model. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests, but as noted, tied with Rally for one. JIRA+GreenHopper was almost an order of magnitude slower than its peers in any test that involved its agile software development plug-in. All applications were able to complete the tests being performed (i.e., no tests failed outright). Based on the results, Rally and VersionOne, but not JIRA+GreenHopper, appear to be viable solutions for clients with a large number of artifacts.

1. Introduction

As the adoption of agile project management has accelerated over the last decade, so too has the use of tools supporting this methodology. This growth has resulted in the accumulation of artifacts (user stories, defects, tasks, and test cases) by customers in their ALM system of choice. The trend is for data stored in these systems to be retained indefinitely, as there is no compelling reason to remove it, and often, product generations are developed and improved over significant periods of time. In other cases, the size of specific customers and ongoing projects may result in very rapid accumulation of artifacts in relatively short periods of time. Anecdotal reports suggest that a performance threshold exists around the 500,000-artifact point, and this paper seeks to test that observation.

This artifact scaling presents a challenge for ALM solution providers, as customers expect performance consistency in their ALM platform regardless of the volume of the underlying data. While it is certainly possible to architect ALM systems to address such challenges, there are anecdotal reports that some major platforms do not currently handle large projects in a sufficient manner from a performance perspective.

This paper presents the results of testing performed in August and September 2012, recording the performance of Rally Software, VersionOne, and JIRA+GreenHopper, and then drawing comparative conclusions between the three products. Atlassian’s ALM offering utilizes its JIRA product and extends it to support agile project management using the GreenHopper functionality extension (referred to in this paper as JIRA+GreenHopper). The versions tested were Rally Build 7396, VersionOne 12.2.2.3601, and JIRA 5.1 with GreenHopper 6.

The tests measure the performance of single-user, single-operation events when an underlying customer data set made up of 500,000 objects is present. These tests are not intended to be used to draw conclusions regarding other possible scenarios of interest, such as load, concurrent users, or other tests not explicitly described.

The fundamental objective of the testing is to provide some level of quantitative comparison for user-based interaction with the three products, as opposed to system- or service-based interaction.

2. Data Set Construction

The use of ALM software, and the variety of artifacts, custom fields, etc., will vary significantly between customers. As a result, there is not necessarily a “right way” to structure data for test purposes. More important is that fields contain content that is similarly structured to real data (e.g., text in freeform text fields, dates in date fields), and that each platform is populated with the same data. In some cases, product variations prevented this. Rally, for example, does not use the concept of an epic but rather a hierarchical user story relationship, whereas VersionOne supports epics.

Creating data with unique content for all artifacts would be infeasible for testing purposes. To model real data, a structure was chosen for a customer instance based on 10 unique projects. Within each project, 40 epics or parent user stories were populated, and 80 user stories were created within each of those. Associated with each user story were 16 artifacts: 10 tasks, four defects, and two test cases. In terms of core artifact types, the product of these counts is 16*80*40*10, or 512,000.

All platforms suffered from difficulties related to data population. This manifested in a variety of ways, including imports “freezing,” data being truncated, or data being mismapped to incorrect fields. Every effort was made to ensure as much data consistency between data uploads as possible, but there were slight deviations from the expected norm. These were estimated at no more than 5%, and where data was missing, supplementary uploads were performed to move the total artifact count closer to the 512,000 target. In addition, tests were only performed on objects that passed consistency checks (i.e., had the same field data).

These symmetrical project data structures are not likely to be seen in real customer environments. The numbers of parent objects and child objects will also vary considerably. That being said, a standard form is required to allow population in three products and to enable attempts at some level of data consistency. Given that the structure is mirrored as closely as possible across each product, the performance variance should be indicative of observed behaviors in other customer environments regardless of the exact artifact distributions.

Custom fields are offered by all products, and so a number of fields were added and populated to simulate their use. Five custom fields were added to each story, task, defect, and test case; one was Boolean true/false, two were numerical values, and two were short text fields.

The data populated followed the schema specified by each vendor’s documentation. We populated fields for ID, name, description, priority, and estimated cost and time to complete. The data consisted of dates and times, values from fixed lists (e.g., the priority field with each possible value used in turn), references to other objects (parent ID), and text generated by a lorem ipsum generator.

This generator produces text containing real sentence and paragraph structures, but random strings as words. A number of paragraph size and content blocks were created, and their use was repeated in multiple objects. The description field of a story contained one or two paragraphs of this generated text. Tasks, defects, and tests used one or two sentences. If one story got two paragraphs, then the next story would get one paragraph, and so on in rotation. This data model was used for each system.
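As a rough illustration of the data model just described, the Python sketch below (hypothetical code, not the population scripts actually used) builds the nested project/epic/story hierarchy with 10 tasks, four defects, and two test cases per story, rotates one or two blocks of pre-generated lorem ipsum text, and confirms the 512,000-artifact total. Identifiers and field names are placeholders.

    import itertools

    # Pre-generated lorem ipsum blocks (placeholders here); the real content was produced
    # by a lorem ipsum generator and reused across multiple objects.
    PARAGRAPHS = ["Lorem ipsum paragraph A ...", "Lorem ipsum paragraph B ..."]
    SENTENCES = ["Lorem ipsum sentence A.", "Lorem ipsum sentence B."]
    paragraph_cycle = itertools.cycle([1, 2])   # alternate one or two paragraphs per story

    # Five custom fields (one Boolean, two numeric, two short text) were also populated
    # on every story, task, defect, and test case; omitted here for brevity.
    artifacts = []
    for project in range(10):                   # 10 unique projects
        for epic in range(40):                  # 40 epics / parent stories per project
            for story in range(80):             # 80 user stories per epic
                story_id = f"P{project:02d}-E{epic:02d}-S{story:02d}"
                n_paragraphs = next(paragraph_cycle)
                description = " ".join(PARAGRAPHS[:n_paragraphs])
                # 16 artifacts associated with each story: 10 tasks, 4 defects, 2 test cases
                for kind, count in (("Task", 10), ("Defect", 4), ("TestCase", 2)):
                    for i in range(count):
                        artifacts.append({
                            "ID": f"{story_id}-{kind}-{i}",
                            "Parent": story_id,
                            "Type": kind,
                            "Description": SENTENCES[i % 2],
                        })

    print(len(artifacts))   # 10 * 40 * 80 * 16 = 512,000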

It is possible that one or more of the products may be able to optimize content retrieval with an effective indexing strategy, but this advantage is implementable in each product. Only JIRA+GreenHopper prompted the user to initiate indexing operations, and based on prompted instruction, indexing was performed after data uploads were complete.

3. Data Population

Data was populated primarily by using the CSV import functionality offered by each system. This process varied in the operation sequence and chunking mechanism for uploads, but fundamentally was based on tailoring input files to match the input specifications and uploading a sequence of files. Out of necessity, files were uploaded in various-sized pieces related to the input limits of each system. API calls and scripts were used to establish relationships between artifacts when the CSV input method did not support or retain these relationships. We encountered issues with each vendor’s product in importing such a large data set, which suggests that customers considering switching from one product to another should look carefully at the feasibility of loading their existing data. Some of our difficulty in loading data involved the fact that we wanted to measure comparable operations, and the underlying data structures made this sometimes easy, sometimes nearly impossible.
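Because each system imposed its own limits on import size, the generated files had to be split into smaller pieces before upload. A minimal sketch of that chunking step is shown below; the 1,000-line chunk size reflects Rally's CSV limit noted later, and the file names are placeholders rather than the actual files used.

    import csv

    def split_csv(source_path, chunk_size=1000, prefix="chunk"):
        """Split a large CSV export into smaller files that fit a system's import limit."""
        with open(source_path, newline="") as src:
            reader = csv.reader(src)
            header = next(reader)
            chunk, index = [], 0
            for row in reader:
                chunk.append(row)
                if len(chunk) == chunk_size:
                    _write_chunk(prefix, index, header, chunk)
                    chunk, index = [], index + 1
            if chunk:
                _write_chunk(prefix, index, header, chunk)

    def _write_chunk(prefix, index, header, rows):
        # Each chunk repeats the header row so it can be imported on its own.
        with open(f"{prefix}_{index:04d}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)

    # Example: split the generated user story file into 1000-line pieces for upload.
    # split_csv("user_stories.csv", chunk_size=1000, prefix="user_stories_part")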


4. JIRA+GreenHopper Data Population Issues

We had to create a ‘Test Case’ issue type in the JIRA+GreenHopper product and use what is known in the JIRA+GreenHopper community as a bug to keep track of the parent-child hierarchy of data objects. Once this was done, the data loaded quite smoothly using CSV files and its import facility until we reached the halfway point, when the import process slowed down considerably. Ultimately, the data import took two to three full days to complete.

5. Rally Data Population Issues

Rally limits the size of CSV files to 1000 lines and 2.097 MB. It also destroys the UserStory/SubStory hierarchy on import (though it presents it on export). These limitations led to a lengthy and tedious data population operation. Tasks could not be imported using the CSV technique; instead, scripting was used to import tasks via Rally’s REST API. The script was written with Pyral, a library released by Rally for quick, easy access to its API from the Python scripting language. The total data import process took about a week to complete.
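The task import followed this general pattern. The snippet below is only an approximation of such a pyral-based script: the connection details, field names, input mapping, and the exact pyral call are illustrative assumptions rather than the script used for this paper.

    from pyral import Rally   # Rally's Python toolkit for its REST API

    # Credentials and workspace/project names below are placeholders.
    rally = Rally("rally1.rallydev.com", "user@example.com", "password",
                  workspace="Test Workspace", project="Project 01")

    def import_tasks(task_rows, story_ref_by_id):
        """Create Task artifacts and attach each to its parent user story."""
        for row in task_rows:                       # task_rows: dicts parsed from the CSV files
            task_data = {
                "Name": row["name"],
                "Description": row["description"],
                # WorkProduct links the task to its parent story (field name assumed here).
                "WorkProduct": story_ref_by_id[row["parent_id"]],
            }
            rally.create("Task", task_data)         # older pyral releases expose this as put()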

6. VersionOne Data Population Issues

VersionOne did not limit the CSV file size, but warned that importing more than 500 objects at a time could cause performance issues. This warning was absolutely true. During import, our VersionOne test system was totally unresponsive to user operations. CSV files of 5000 lines would lock it up for hours, making data population take over a week of 24-hour days.

7. Testing Methodology

A single test system was used to collect test data in order to limit bias introduced by different computers and browser instances. The test platform was a Dell Studio XPS 8100 running Microsoft Windows 7 Professional SP1 64-bit, and the browser used to perform testing was Mozilla Firefox v15.0.1. The Firebug add-on, v1.10.3, was used to collect test metrics. Timing data was recorded in a data collection spreadsheet constructed for this project. While results are expected to vary with other software and version combinations, using a standardized collection model ensured a consistent, unbiased approach to gathering test data and allows legitimate comparisons to be made. It is expected that while the actual timing averages may differ, the comparisons will not.

At the time measurements were being taken, the measurement machine was the only user of our instance of each software product. All tests were performed using the same network and Internet connection, with no software updates or changes between tests. To ensure there were no large disparities in network response times, an http-ping utility was used to measure round-trip response times to the service URLs provided by each system. Response times averaged over 10 http-ping samples were all under 350 milliseconds and within 150 milliseconds of each other, suggesting connectivity and response are comparable for all systems. JIRA+GreenHopper had an average response time of 194 milliseconds, Rally 266, and VersionOne 343. All tests were performed during US MDT business hours (8 a.m. – 5:30 p.m.).
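The round-trip comparison was performed with an http-ping utility; a comparable check can be scripted as below. This is a sketch only, using full HTTP GET requests as a stand-in for http-ping and placeholder URLs rather than the actual service URLs provisioned for the trial instances.

    import time
    import requests

    # Placeholder service URLs; the tests used the URLs provisioned for each trial instance.
    SERVICE_URLS = {
        "JIRA+GreenHopper": "https://example.atlassian.net/",
        "Rally": "https://rally1.rallydev.com/",
        "VersionOne": "https://versionone.example.com/",
    }

    def average_rtt_ms(url, samples=10):
        """Average round-trip time over a number of HTTP GET requests, in milliseconds."""
        elapsed = []
        for _ in range(samples):
            start = time.perf_counter()
            requests.get(url, timeout=30)
            elapsed.append(time.perf_counter() - start)
        return 1000 * sum(elapsed) / len(elapsed)

    for name, url in SERVICE_URLS.items():
        print(f"{name}: {average_rtt_ms(url):.0f} ms average over 10 samples")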

It is noted that running tests in a linear manner does introduce the possibility of variation due to changes in connectivity performance between endpoints, though these variations would be expected under any end-user usage scenario and are difficult, if not impossible, to predict and measure.

Tests and data constructs were implemented in a manner to allow apples-to-apples comparison with as little bias and potential benefit to any product as possible. However, it should be noted that these are three different platforms, each with unique features. In the case where a feature exists on only one or two of the platforms, that element was not tested. The focus was on the collection of core tests described in the test definition table in the next section.

The time elapsed from the start of the first request until the end of the last request/response was used as the core time metric associated with a requested page load when possible. This data was captured with Firebug; an example for a VersionOne test is illustrated in the figure below.

[Figure: Example of timing data collection for a VersionOne test.]

We encountered challenges timing pages that perform operations using asynchronous techniques to update or render data. Since we are interested in when the result of an operation is visible to the user, timing only the asynchronous call that initiates the request provides little value from a testing perspective. In cases where no single time event could be used, timing was performed manually. This increased the error associated with the measurement, estimated at roughly one second or less. Where manual measurements were made, this is indicated in the result analysis. All manually timed tests used a stopwatch with 0.1-second granularity and two people: one running the test and calling out start and stop, the other timing from those verbal cues.

It is acknowledged that regardless of the constraints imposed here to standardize data and tests for comparison purposes, there may be deviations from performance norms due to the use of simulated data, either efficiencies or inefficiencies. Bias may also be introduced in one or more products based on the testing methodology employed. While every effort was made to make tests fair and representative of legitimate use cases, it is recognized that results might vary if a different data set were used. Further, the testing has no control over localized performance issues affecting the hosted environments from which the services are provided. If testing results in minor variance between products, then arguably some of this variance could be due to factors outside of the actual application.

The enterprise trial versions were used to test each system. We have no data regarding how each service handles trial instances; it is possible that the trial instances differ from paid subscription instances, but based on our review and the trial process, there was no indication the trial version was in any way different. We assume that providers would not intentionally offer a performance-restricted instance for trial customers, given that their end goal would be to convert those trial customers to paying subscribers.

Based on a per-instance calibration routine, the decision was made to repeat each test 10 times per platform. A comparison between a 10-test and a 50-test sample was performed for one test case (user story edit) per platform to confirm that the standard deviations were similar enough to warrant the use of a 10-test sample. In no case was the calibration standard deviation greater than one second. If the performance differences between applications had turned out to be of a similar order of magnitude to this variation (i.e., around a second), the use of a 10-test sample per application could fairly be questioned. However, if the overriding observation in such cases is simply that each application performs within the same small performance range as the others, the nuances of sample size calculation are rendered insignificant.

A more in-depth sample sizing exercise could also be performed, and could realistically be performed per test. However, it is already recognized that there are numerous factors beyond the control of the tests, to the extent that further increasing sample size would offer little value given the relatively consistent performance observed during calibration.

To help reduce as many bandwidth and geographic distance factors as possible, the client browser cache was not cleared between tests. This also better reflects real user interaction with the systems. A single pretest run for every test was performed to allow client-side object caching; in effect, each test was executed 11 times, but only results 2-11 were analyzed. Based on the belief that the total artifact count is the root cause of scalability issues, allowing caching should eliminate some of the variation due to factors that cannot be controlled by the test.
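For each test, the recorded timings from the 11 runs (one cache warm-up plus 10 measured runs) were reduced to a mean and standard deviation. The sketch below illustrates that aggregation step, assuming the recorded Firebug or stopwatch timings are available as a list; the numbers shown are hypothetical, not measured values.

    from statistics import mean, stdev

    def summarize(timings_seconds):
        """Drop the warm-up run, then report the mean and standard deviation of runs 2-11."""
        measured = timings_seconds[1:]          # result 1 is the cache warm-up pretest
        return mean(measured), stdev(measured)

    # Example with hypothetical backlog-refresh timings (seconds) for one system:
    runs = [6.2, 5.6, 5.3, 5.5, 5.8, 5.4, 5.2, 5.6, 5.9, 5.3, 5.5]
    avg, sd = summarize(runs)
    print(f"mean {avg:.2f} s, standard deviation {sd:.2f} s")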

The use of attachments was not tested. This was identified as more of a bandwidth and load concern than a test of system performance in a scalability scenario.

8. Test Descriptions

Tests were constructed based on common uses of ALM systems. Timing data was separated into discrete operations when sequences of events were tested. These timings were compared individually, as opposed to in aggregate, in order to account for interface and workflow differences between products.

There may be tests and scenarios that could be of interest but were not captured, either because they were not reproducible in all products or were not identified as common operations. Also, it would be desirable in future tests to review the performance of logical relationships (complex links between iterations/sprints and other artifacts, for example). The core objective when selecting these tests was to enable comparison for similar operations between systems.

1. Refresh the backlog for a single project. The backlog page is important to both developers and managers; it is the heart of the systems. Based on variance in accessing the backlog, the most reliable mechanism to test was identified as a refresh of the backlog page. Views were configured to display 50 entries per page.

2. Switch backlog views between two projects. A developer working on two or more projects might frequently swap projects. Views were configured to display 50 entries per page.

3. Paging through backlog lists. With our large data sets, navigation of large tables can become a performance issue. Views were configured to display 50 entries per page.

4. Select and view a story from the backlog. Basic access to a story.

5. Select and view a task. Basic access to a task.

6. Select and view a defect/bug. Basic access to a defect or bug. (Note: JIRA+GreenHopper uses the term bug, while Rally and VersionOne use defect.)

7. Select and view a test. Basic access to a test case.

8. Create an iteration/sprint. Common management chore. (Note: This had to be manually timed for JIRA+GreenHopper, as measured time was about 0.3 seconds while elapsed time was 17 seconds.)

9. Move a story to an iteration/sprint. Common developer or manager chore. (Note: JIRA+GreenHopper and VersionOne use the term sprint, while Rally uses iteration.)

10. Convert a story to a defect/bug. Common developer chore. (Note: This operation is not applicable to Rally because of the inherent hierarchy between a story and its defects.)

9. Test Results

Each test was performed 1+10 times in sequence for each software system, and the mean and standard deviation were computed. The point estimates were then compared to find the fastest performing application. A +n (seconds) indicator was used to show the relative performance lag of the other applications behind the fastest performing application for that test.

The test result summary table illustrates the relative performance for each test to allow observable comparisons per product and per test. In order to provide a measurement-based comparison, a scale was created to allow numerical comparison between products. There were no cases where the leader in a test performed badly (subjectively). As such, the leader in a test is given the “Very Good” rating, which corresponds to five points. The leading time is then used as the base for comparative scoring of competitors for that test, with each competitor’s score based on the multiple of the fastest time it recorded. The point legend table is shown below.

Time Multiple            Points
1.0x ≤ time < 1.5x       5
1.5x ≤ time < 2.5x       4
2.5x ≤ time < 3.5x       3
3.5x ≤ time < 4.5x       2
4.5x ≤ time              1
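Expressed as code, the point legend maps each system's time multiple of the fastest mean to a score. This is a sketch of the rule as stated above, not of the authors' own tooling.

    def score(mean_time, fastest_time):
        """Score a system for one test based on its multiple of the fastest mean time."""
        multiple = mean_time / fastest_time
        if multiple < 1.5:
            return 5   # Very Good (includes the fastest system itself at 1.0x)
        if multiple < 2.5:
            return 4   # Good
        if multiple < 3.5:
            return 3   # Acceptable
        if multiple < 4.5:
            return 2   # Poor
        return 1       # Very Poor

    # Example, Test 1 means: VersionOne 3.14 s (fastest), Rally 5.53 s, JIRA+GreenHopper 15.27 s
    print(score(3.14, 3.14), score(5.53, 3.14), score(15.27, 3.14))   # 5 4 1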


Test Result Summary Table (Relative Performance Analysis)

Legend: Very Good (5), Good (4), Acceptable (3), Poor (2), Very Poor (1)

Scores were assigned for each of the nine scored tests (1 Backlog Refresh, 2 Switch Backlog, 3 Backlog Paging, 4 View Story, 5 View Task, 6 View Defect, 7 View Test, 8 Create Sprint, 9 Story → Sprint) and summed into an overall rating.

System              Overall Rating (out of 45)
Rally               43
VersionOne          32
JIRA+GreenHopper    18

It must be noted that the resulting means are point-estimate averages. For several reasons, we don’t suggest or use confidence intervals or test for significance. Based on the challenges associated with structuring common tests across different interfaces and different data structures, and with no guarantee of connection quality, it is extraordinarily difficult to do so. In addition, because each test may have a different weight or relevance to each customer depending on their ALM process, the relevance of a test leader should be weighted according to the preference of the reader. That being said, these tests are intended to reflect the user experience. To address some of the concerns associated with point estimates, analysis of high and low bounds based on one and two standard deviations was performed. If the high bound for the fastest application overlaps with the low bound for either of the slower performing applications, the significance of the performance gain between those comparisons is questionable. The overlap suggests there will be cases where the slower (overlapping) application may perform faster than the application with the fastest mean response time.

Statistical theory and the three-sigma rule suggest that when data is normally distributed, roughly 68% of observations should lie within one standard deviation of the mean (symmetrically distributed), and 95% should lie within two standard deviations. We graphically tested for normality using our calibration data and observed our data to be normally distributed. When there is no overlap between timings at two standard deviations, it will be fairly rare for one of the typically slower performing applications to exceed the performance of the faster application (for that particular test). If there is no overlap at one or two standard deviations between the lower and upper bounds, the result is marked as “Significant.” If there is overlap in one or both cases, that result is flagged as “Insignificant.” Significance is assessed between the fastest performing application for the test and each of the other two applications; therefore, the significance analysis is only populated for the application with the fastest point estimate. The advantage is classed as insignificant if the closest performing peer implies the result is insignificant. All data values are in seconds.
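The overlap check described above can be sketched as follows: build the one and two standard deviation ranges around each mean and mark the fastest system's advantage as significant only when its upper bound stays below the slower systems' lower bounds. This is an illustrative sketch, not the spreadsheet used for this paper.

    def sd_range(mean, sd, k):
        """Return the (low, high) bounds at k standard deviations around the mean."""
        return mean - k * sd, mean + k * sd

    def advantage_is_significant(fastest, others, k):
        """fastest and others are (mean, sd) pairs; k is 1 or 2 standard deviations."""
        _, fastest_high = sd_range(fastest[0], fastest[1], k)
        return all(fastest_high < sd_range(m, s, k)[0] for m, s in others)

    # Test 1 example: VersionOne (fastest) vs. Rally and JIRA+GreenHopper.
    versionone, rally, jira = (3.14, 0.25), (5.53, 0.29), (15.27, 1.38)
    print(advantage_is_significant(versionone, [rally, jira], k=1))   # True -> Significant
    print(advantage_is_significant(versionone, [rally, jira], k=2))   # True -> Significant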

Results from each test are analyzed separately below. The results of each test are shown both in table form with values and in bar graph form, and are also interpreted in the text below the corresponding table. Note that long bars in the comparison graphs are long response times, and therefore bad.



Test 1: Refresh Backlog Page for a Single Project

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    15.27      1.38     +12.13           13.89 – 16.64     -               12.52 – 18.02     -
Rally               5.53       0.29     +2.39            5.24 – 5.81       -               4.95 – 6.10       -
VersionOne          3.14       0.25     Fastest          2.88 – 3.39       Significant     2.63 – 3.64       Significant

Interpretation: The data indicates that for this particular task, even when accounting for variance in performance, VersionOne performs fastest. Note that the advantage is relatively small when compared to Rally, though the Rally point estimate does lag by almost 2.4 seconds. Both VersionOne and Rally perform significantly better than JIRA+GreenHopper when executing this operation. Best Performer: VersionOne


Test 2: Switch Backlog Views Between Two Projects

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    13.84      0.83     +11.39           13.01 – 14.66     -               12.19 – 15.49     -
Rally               2.45       0.16     Fastest          2.29 – 2.60       Significant     2.13 – 2.76       Significant
VersionOne          2.94       0.07     +0.49            2.87 – 3.01       -               2.79 – 3.08       -

*To perform this operation in JIRA+GreenHopper, the user must navigate between two scrumboards and then load the data. The timing numbers for JIRA+GreenHopper are therefore the sum of two measurements. This introduces request overhead not present for the other two systems, yet the disparity suggests more than simple transaction overhead is the cause of the delay. Furthermore, the resulting page rendered frozen and was not usable for an additional 10 – 15 seconds. Users would likely fold that additional delay into their impression of the user experience, but it was not included in the timings here.

Interpretation: The data indicates that Rally and VersionOne are significantly faster than JIRA+GreenHopper, even when considering the sum of two operations. Rally is faster than VersionOne, though marginally so. In terms of user interaction, the experience would be similar for the two products. Best Performer: Rally


Test 3: Paging Through Backlog List

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    1.53       0.66     Fastest          0.87 – 2.19       Insignificant   0.21 – 2.85       Insignificant
Rally               1.93       0.11     +0.40            1.81 – 2.04       -               1.70 – 2.15       -
VersionOne          3.45       0.29     +1.92            3.16 – 3.74       -               2.87 – 4.04       -

Interpretation: JIRA+GreenHopper had the fastest point-estimate mean, but the analysis suggests there is minimal (not significant) improvement over Rally, which was the second fastest. The standard deviations suggest a wider performance variance for JIRA+GreenHopper, and so while the point estimate is better, the overall performance is likely to be comparable. The data indicates that VersionOne is significantly slower than the other two systems, and for very large data sets like the one used in these tests, this makes scrolling through the data quite tedious. Best Performer: JIRA+GreenHopper and Rally


Test 4: Selecting and Viewing a User Story From the Backlog

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    3.49       0.99     +2.95            2.49 – 4.48       -               1.50 – 5.47       -
Rally               0.53       0.07     Fastest          0.46 – 0.60       Significant     0.40 – 0.67       Significant
VersionOne          1.90       0.30     +1.36            1.59 – 2.20       -               1.29 – 2.50       -

Interpretation: The data indicates that Rally is significantly faster than either JIRA+GreenHopper or VersionOne. While the result is significant, the roughly 1.4-second difference between Rally and VersionOne is unlikely to have a noticeable impact on the user experience. Rally’s performance is also more consistent than the other two products (i.e., it has a much lower response standard deviation). Best Performer: Rally


Test 5: Selecting and Viewing a Task

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    1.36       0.17     +0.92            1.20 – 1.53       -               1.03 – 1.69       -
Rally               0.44       0.03     Fastest          0.42 – 0.47       Significant     0.39 – 0.50       Significant
VersionOne          1.46       0.16     +1.01            1.29 – 1.62       -               1.13 – 1.78       -

Interpretation: The data indicates that Rally is significantly (in the probabilistic sense) faster than either JIRA+GreenHopper or VersionOne by about one second, and also has a more consistent response time (with the lowest standard deviation). JIRA+GreenHopper and VersionOne showed similar performance. Overall, the result for all applications was qualitatively good. Best Performer: Rally


Test 6: Selecting and Viewing a Test Case

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    1.91       0.86     +1.37            1.05 – 2.77       -               0.19 – 3.64       -
Rally               0.54       0.13     Fastest          0.41 – 0.67       Significant     0.28 – 0.80       Insignificant
VersionOne          1.45       0.18     +0.91            1.27 – 1.62       -               1.09 – 1.80       -

Interpretation: The data indicates that, again, Rally is fastest in this task. The speed differences are significant at the one standard deviation level, where there is no overlap in the respective timing ranges, but not at two standard deviations. Rally performed with the lowest point estimate and the lowest variance, suggesting a consistently better experience. VersionOne was second in terms of performance, followed by JIRA+GreenHopper. Best Performer: Rally


Test 7: Selecting and Viewing a Defect/Bug

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    1.70       0.81     +1.02            0.88 – 2.51       -               0.07 – 3.32       -
Rally               0.68       0.05     Fastest          0.63 – 0.72       Significant     0.58 – 0.77       Insignificant
VersionOne          1.74       0.17     +1.06            1.56 – 1.91       -               1.39 – 2.08       -

Interpretation: The data indicates that Rally is faster by roughly one second based on the point-estimate mean when compared to the other two products, with the difference being significant at the one standard deviation level but not at two standard deviations. Variance in the results of the other products suggests they will perform similarly to Rally on some occasions, but not all. Rally’s performance was relatively consistent, as indicated by its very low standard deviation. Though the JIRA+GreenHopper and VersionOne point estimates are very close, VersionOne’s performance is preferable given its much lower standard deviation. That being said, given that the point estimates are all below two seconds, there would be little to no perceptible difference between VersionOne and JIRA+GreenHopper from a user perspective. Best Performer: Rally


Test 8: Add an Iteration/Sprint

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    17.76      0.60     +17.72           17.16 – 18.36     -               16.56 – 18.96     -
Rally               0.04       0.00     Fastest          0.04 – 0.05       Significant     0.03 – 0.05       Significant
VersionOne          1.36       0.10     +1.32            1.25 – 1.46       -               1.15 – 1.57       -

*Due to the disparity between Rally and JIRA+GreenHopper here, the graph appears to show no data for Rally. The graph resolution is simply insufficient to render the data clearly, given the large value generated by the JIRA+GreenHopper tests.

**The JIRA+GreenHopper data was manually measured due to inconsistencies between timing and content rendering. Based on the requests, asynchronous page timings appeared to complete when requests were submitted, while the eventual content updates and rendering were disconnected from the original request being tracked. While this increases the measurement error, it certainly would not account for a roughly 17-second disparity.

Interpretation: Rally is the fastest performer in this test, with the results being significant at both the one and two standard deviation levels. JIRA+GreenHopper is many times slower than both Rally and VersionOne. Best Performer: Rally


Test 9: Move a Story to an Iteration/Sprint

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    9.80       6.88     +8.42            2.91 – 16.68      -               0.00* – 23.56     -
Rally               3.37       0.22     +1.99            3.15 – 3.59       -               2.94 – 3.80       -
VersionOne          1.38       0.36     Fastest          1.02 – 1.74       Significant     0.66 – 2.09       Insignificant

*The standard deviation range suggested a negative value, which is, of course, impossible. Therefore, 0.00 is provided.

Interpretation: The data indicates that VersionOne is fastest for this operation. The insignificant result of the two standard deviation overlap test is due to the enormous standard deviation of the JIRA+GreenHopper measurements. Best Performer: VersionOne


Test 10: Convert a Story to a Defect/Bug

System              Mean (s)   SD (s)   Comparison (s)   1 SD Range (s)    1 SD Overlap    2 SD Range (s)    2 SD Overlap
JIRA+GreenHopper    26.56      2.94     +24.87           23.62 – 29.50     -               20.68 – 32.44     -
Rally               1.69       0.25     Fastest          1.44 – 1.94       Significant     1.19 – 2.19       Significant
VersionOne          6.06       0.28     +4.36            5.77 – 6.34       -               5.49 – 6.62       -

*JIRA+GreenHopper required manual timing. See the interpretation below for explanation.

Interpretation: This operation is an example of one in which the procedure in each system is completely different and perhaps not comparable in any reasonable way. In JIRA+GreenHopper, three operations are involved (accessing the story, invoking the editor, and, after changing the issue type, saving the changes and updating the database), and these had to be manually timed. In addition, the JIRA+GreenHopper page froze for about 10 seconds after the update while it changed the icon to the left of the new defect from a green story icon to a red defect icon. This extra 10 seconds was not included in the timing results, although perhaps it should have been. In Rally, defects sit hierarchically below stories as one of a story’s attributes, so a story cannot be converted to a defect, though a defect can be promoted to a story; that promotion is what we measured for Rally. Finally, VersionOne has a menu option to perform this task. The results, reported here just for interest and not defensible statistically, indicate that Rally is fastest at this class of operation, followed by VersionOne at +4 seconds and JIRA+GreenHopper at +24 seconds. Best Performer: N/A – Informational observations only.

10. Conclusions

Our testing was by no means exhaustive, but thorough enough to build a reasonably sized result set to enable comparison between applications. It fundamentally aimed to assess the performance of testable elements that are consistent between applications. We tried to choose simple, small tests that mapped well between the three systems and could be measured programmatically as opposed to manually (and succeeded in most cases, though some manual timing was required).

Rally was the strongest performer based on the test results, leading outright in six of the nine tests compared. In one of these six tests, Rally tied with VersionOne under the scoring system developed for comparisons, though it led on raw measured speed. In one test not included in the six, Rally tied with JIRA+GreenHopper both numerically and within the bounds of the scoring model. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests, but as noted, tied with Rally for one.

With the exception of backlog paging, JIRA+GreenHopper trailed in tests that leveraged agile development tools such as the scrumboard, which JIRA+GreenHopper implements with the GreenHopper plug-in. The GreenHopper overlay/add-on seemed unable to handle the large data sets effectively. When we tried to include a test of viewing the backlog for all projects, we were able to do so for Rally and VersionOne, but the JIRA+GreenHopper instance queried for over 12 hours without rendering the scrumboard and merged project backlog. Some object view operations resulted in second-best performance for JIRA+GreenHopper, but with the exception of viewing tasks, the variance associated with those requests was extraordinarily high compared to Rally and VersionOne. That large variance will manifest to users as an inconsistent experience (in terms of response time) when performing the same operation.

Anecdotally, the performance of VersionOne (unlike Rally) degraded significantly while import activity was taking place, to the extent that VersionOne became effectively unusable during import operations. Further testing could be performed to identify whether this is limited to CSV imports or whether it extends to programmatic API access as well. Given how many platforms utilize API access regularly, it would be interesting to explore this result further.

Both Rally and VersionOne appear to provide a reasonable user experience that should satisfy customers in most cases when the applications are utilizing large data sets with over 500,000 artifacts. JIRA+GreenHopper is significantly disadvantaged from a performance perspective, and seems less suitable for customers with large artifact counts or with aggressive growth expectations. Factors such as user concurrency, variations in sprint structure, and numerous others have the potential to skew results in either direction, and it is difficult to predict how specific use cases may affect performance. These tests do, however, provide a reasonable comparative baseline, suggesting Rally has a slight performance advantage in general, followed closely by VersionOne.

References

A variety of references were used to help build and execute a performance testing methodology that would allow a reasonable, statistically supported comparison of the performance of the three ALM systems. In addition to documentation available at the websites for each product, the following resources were used:

“Agile software development.” Wikipedia. Accessed Sept. 28, 2012, from http://en.wikipedia.org/wiki/Agile_software_development.

Beedle, Mike, et al. “Manifesto for Agile Software Development.” Accessed Sept. 28, 2012, from http://agilemanifesto.org.

Hewitt, Joe, et al. Firebug: Add-ons for Firefox. Mozilla. Accessed Sept. 28, 2012, from http://addons.mozilla.org/en-us/firefox/addon/firebug.

Honza. “Firebug Net Panel Timings.” Software is Hard. Accessed Sept. 28, 2012, from http://www.softwareishard.com/blog/firebug/firebug-net-panel-timings.

Peter. “Top Agile and Scrum Tools – Which One Is Best?” Agile Scout. Accessed Sept. 28, 2012, from http://agilescout.com/best-agile-scrum-tools.