
Project Acronym: LeanBigData
Project Title: Ultra-Scalable and Ultra-Efficient Integrated and Visual Big Data Analytics
Project Number: 619606
Instrument: STREP
Call Identifier: ICT-2013-11

D2.5 LeanBigData Evaluation

Work Package: WP2 – LeanBigData Integrated Platform
Due Date: 31/01/2017
Submission Date: 31/01/2017
Start Date of Project: 01/02/2014
Duration of Project: 36 Months
Organisation Responsible for Deliverable: UPM
Version: 1.0
Status: Final
Author(s): Marta Patiño (UPM), Ricardo Jiménez Peris (LeanXcale), Alexandre Carvalho (INESC), Tomás Pariente (ATOS), Vrettos Moulos (ICCS), George Margetis (FORTH)
Reviewer(s): Valerio Vianello (UPM)
Nature: R – Report / P – Prototype / D – Demonstrator / O – Other
Dissemination level: PU – Public / CO – Confidential, only for members of the consortium (including the Commission) / RE – Restricted to a group specified by the consortium (including the Commission Services)

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)


Revision history

Version  Date        Modified by            Comments
0.1      10-10-2016  Marta Patiño           Table of contents
0.2      12-10-2016  Ricardo Jiménez Peris  Contents
0.3      12-01-2017  Alexandre Carvalho
0.4      13-01-2017  Vrettos Moulos
0.5      17-01-2017  ATOS
0.6      30-01-2017  Marta Patiño
0.7      10-02-2017  Marta Patiño


Copyright © 2014 LeanBigData Consortium

The LeanBigData Consortium (http://leanbigdata.eu/) grants third parties the right to use and distribute all or parts of this document, provided that the LeanBigData project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Table of Contents

1. Introduction .......... 9
1.1. Relation with other deliverables .......... 10
2. Scalability of Transaction Management .......... 11
2.1. TPC-C Benchmark .......... 11
2.1.1 Experiment setup .......... 12
2.1.2 Results .......... 14
2.2. Scalability of the Transaction Manager .......... 16
2.2.1 Experiment setup .......... 16
2.2.2 Evaluation .......... 16
2.3. Snapshot Server and Commit Sequencer Scalability .......... 17
3. Parallel and Distributed SQL Engine Evaluation .......... 19
3.1. Large Scale OLAP Evaluation .......... 19
3.2. Comparison with MapReduce .......... 22
4. Complex Event Processing Evaluation .......... 25
5. 2D Visualization System usability Evaluation .......... 35
5.1. Introduction .......... 35
5.2. Scope .......... 35
5.3. Description .......... 35
5.4. Consent participation .......... 36
5.5. User tasks .......... 37
5.6. Metrics .......... 37
5.7. SUS questionnaire .......... 38
5.8. Experiment execution .......... 39
5.9. User task list .......... 40
5.10. Results .......... 41
5.10.1 User population .......... 41
5.10.2 Successful task completion .......... 41
5.11. User suggestions .......... 44
6. Non-Functional Evaluation of Sentiment Analysis Tool .......... 46
6.1. Introduction .......... 46
6.2. Phases .......... 46
6.2.1 1st Phase (End of Year 2) .......... 46
6.2.2 2nd Phase (Early Year 3) .......... 48
6.2.3 3rd Phase (End of Year 3) .......... 49
6.3. Results Visualization .......... 50
6.4. Questionnaires .......... 53
6.4.1 Software Evaluation Questionnaire .......... 53
6.4.2 Training Evaluation Questionnaire .......... 55
6.5. Overall Results .......... 55
6.5.1 Aggregated Comparative Results .......... 62
6.6. Conclusion .......... 63
7. Usability Evaluation of the Data Centre 3D Visualization application .......... 64
7.1. Usability testing through observation .......... 66
7.1.1 Usability evaluation session .......... 68
7.1.2 Post-test .......... 79
7.2. User Experience assessment .......... 82
7.3. Conclusions .......... 84
8. References .......... 87
9. Appendix – UEQ Questionnaire .......... 88


Index of Figures

Figure 1: LeanBigData components .......... 10
Figure 2: TPC-C schema design .......... 11
Figure 3: Throughput with a single computing node .......... 13
Figure 4: Average transaction latency with a single computing node .......... 13
Figure 5: Scalability of LeanXcale – Throughput .......... 15
Figure 6: Scalability of LeanXcale – Latency of transactions .......... 15
Figure 7: Scalability of the Transaction Manager (transactions/second) .......... 17
Figure 8: Snapshot Server Throughput .......... 18
Figure 9: Snapshot Server – Latency with increasing number of Transaction Managers .......... 18
Figure 10: Queries processing times .......... 20
Figure 11: Queries speedups .......... 21
Figure 12: CPU usage over time, running Q1 with an increasing number of workers per DQE instance .......... 21
Figure 13: Disk usage over time, running Q1 with an increasing number of workers per DQE instance .......... 22
Figure 14: Network usage over time, running Q1 with an increasing number of workers per DQE instance .......... 22
Figure 15: Queries processing time with DQE and MR .......... 23
Figure 16: Data centre monitoring CEP query .......... 26
Figure 17: Sub-queries configuration .......... 27
Figure 18: CEP Throughput evolution (2 nodes) .......... 28
Figure 19: Bytes sent by each node in the CEP Cluster (2 nodes) .......... 28
Figure 20: CPU idle in each node in the CEP Cluster (2 nodes) .......... 29
Figure 21: CEP Throughput evolution (4 nodes) .......... 30
Figure 22: Bytes sent by each node in the CEP Cluster (4 nodes) .......... 30
Figure 23: CPU idle in each node in the CEP Cluster (4 nodes) .......... 31
Figure 24: Storm Throughput (4 nodes) .......... 31
Figure 25: CEP Throughput evolution (8 nodes) .......... 32
Figure 26: Bytes sent by each node in the CEP Cluster (8 nodes) .......... 33
Figure 27: CPU idle in each node in the CEP Cluster (8 nodes) .......... 33
Figure 28: CEP scalability .......... 34
Figure 29: Evolution of the UI of the Data Centre room view across the three versions .......... 65
Figure 30: Evolution of the UI of the selected rack close-up view across the three versions .......... 65
Figure 31: Usability test participants' gender distribution .......... 67
Figure 32: Usability test participants' age distribution .......... 67
Figure 33: Usability test participants' experience with the motion sensor .......... 67
Figure 34: Task success diagram for the user observation experiment .......... 71
Figure 35: Success Rate: total success rate and success rate per gender, sensor experience and age for the user observation experiment .......... 71
Figure 36: Mean time-on-task for the user observation experiment. Error bars represent the standard deviation, while red dots represent the median value of time for each task .......... 74
Figure 37: Fatigue level per participant for the user observation experiment .......... 77
Figure 38: Numbered targets to be reached consequently by navigating in the virtual environment with gestures during the post-test evaluation .......... 80
Figure 39: Mean time-on-task for the post-test experiment. Error bars represent the standard deviation, while red dots represent the median value of time for each task .......... 81
Figure 40: Means per scale diagram .......... 84


Index of Tables

Table 1: TPC-C benchmark database size .......... 14
Table 2: Analytical queries used for analytical processing evaluation .......... 19
Table 3: Hardware resources usage (master node) .......... 24
Table 4: Hardware resources usage (slave nodes) .......... 24
Table 5: Schema of the events in the Data Centre Scenario .......... 25
Table 6: Alert Conditions .......... 25
Table 7: Task completion summary .......... 42
Table 8: Number of errors while performing the task (summary) .......... 42
Table 9: Time to complete the task (summary) .......... 43
Table 10: SUS questionnaire summary .......... 44
Table 11: Evolution of the Data Centre 3D Visualization application in terms of supported functionality and interaction methods. For each version only the changes or newly introduced features are listed .......... 64
Table 12: Metrics employed per task .......... 69
Table 13: Success Rate and Confidence Intervals per task for the user observation experiment .......... 72
Table 14: Success Rate and Confidence Intervals per group of tasks for the user observation experiment .......... 73
Table 15: Time-on-task data for the user observation experiment .......... 73
Table 16: Number of tries data for the user observation experiment .......... 75
Table 17: Success rate, total time, total number of tries, fatigue level presented per participant, for the user observation experiment .......... 77
Table 18: Identified usability problems .......... 78
Table 19: Success Rate and Confidence Intervals per task for the post-test experiment .......... 80
Table 20: Time-on-task data for the post-test experiment .......... 81
Table 21: Means per scale .......... 83
Table 22: Scores per item .......... 84


1. Introduction

LeanBigData is an ultra-scalable and ultra-efficient big data platform integrating in one product the three main big data technologies: a novel transactional NoSQL key-value data store, a distributed complex event processing (CEP) system, and a distributed SQL database. The platform is designed to achieve scalability in a very efficient way, avoiding the inefficiencies and delays introduced by current Extract-Transform-Load (ETL) based approaches.

Currently, one of the main issues in data management at enterprises and other organizations is that databases are either operational (OLTP, OnLine Transaction Processing) or analytical (OLAP, OnLine Analytical Processing). This leads to a separation between the management of operational data, performed at operational databases, and the management of analytical queries, performed at analytical databases or data warehouses. This separation results in having to copy the data periodically from the operational database into the data warehouse. This copy process is termed Extract-Transform-Load (ETL). ETLs are estimated to consume 75-80% of the budget for business analytics. LeanBigData solves this issue by providing a database, LeanXcale, with both capabilities, operational and analytical.

Another aspect in which LeanBigData innovates is the efficiency of the transactional processing and of the storage engine. The transactional processing has been re-architected and re-implemented to be an order of magnitude more efficient than the initial version at the beginning of the project. A new storage engine, KiVi, has been architected and implemented from scratch. It is based on a new data structure that is efficient both for range queries and for updates.

Another main innovation brought by LeanBigData is in the area of data streaming. Here, the goal has been to produce an efficient, scalable, distributed complex event processing engine.

The LeanBigData platform is equipped with a visualization subsystem able to report incremental visualization of results of long analytical queries and with an advanced anomaly detection and root cause analysis module. The visualization subsystem also supports efficient manipulation of visualizations and query results through hand gestures. Four use cases have been integrated with the developed infrastructure to demonstrate the value of the LeanBigData platform and validate it.

The main components of the LeanBigData platform are shown in Figure 1. This deliverable evaluates the Complex Event Processing component, the parallel SQL engine, the transaction management component (LeanXcale) and the visualization system.


Figure 1: LeanBigData components.

The project is divided into nine work packages. This deliverable belongs to work package 2 and is the result of task 2.5. It reports the results of the empirical evaluation of the integrated platform and of the individual subsystems, providing a quantification of their scalability and performance under different workloads; the platform was evaluated on the UPM cluster. The deliverable also reports on the usability evaluation of the visualizations and the human-computer interface of the LeanBigData platform, carried out with a population of at least 100 people, including professionals from the different partners.

1.1. Relation with other deliverables

Previous deliverables of work package 2, LeanBigData Integrated Platform, dealt with the architecture and design of the LeanBigData integrated platform.


2. Scalability of Transaction Management

In this section we evaluate the scalability of the transaction management component of LeanBigData, which is integrated in the LeanXcale database. We study the scalability of the platform in terms of maximum sustainable throughput under a response time constraint. To evaluate the database, we use the industrial benchmark TPC-C™ (TPC-C)1, the reference benchmark for OLTP databases. The TPC-C benchmark stresses key hardware components of database systems such as I/O, CPU and memory and, for distributed databases, also the network. We have implemented the benchmark application according to the TPC-C specification; the implementation is available on GitHub2.

We used a shared-nothing cluster with 464 cores. The cluster is composed of two different types of nodes. Type A nodes are equipped with a 4-core Intel(R) Xeon(R) CPU X3220 @ 2.40GHz, 8 GB of RAM, 1 Gbit Ethernet and a directly attached 150 GB SSD hard drive. Type B nodes are equipped with a 12-core Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 128 GB of RAM, 1 Gbit Ethernet and a directly attached 0.5 TB SSD hard disk. There are in total 52 machines, 20 type A nodes and 32 type B nodes. All the nodes run Ubuntu 12.04 LTS.

2.1. TPC-C Benchmark

The TPC-C database model is defined in the following figure.

Figure 2 TPC-C schema design

The database size and load scale unit is the warehouse. For every warehouse, there are 10 clients concurrently accessing data in the same warehouse. A one-warehouse database in plain CSV format is about 63.15 MB and spans 599,011 rows. Database size scales linearly with the number of warehouses. TPC-C defines five types of transactions:

1 http://www.tpc.org/tpcc/ 2 https://github.com/rmpvilaca/EscadaTPC-C

D2.5 LeanBigData Evaluation

© LeanBigData consortium Page 12 of 89

• NewOrder: Inserts a new order with a variable number of items. The transaction performs 2 row selections with data retrieval, 1 row selection with data retrieval and update, and 2 row insertions. Then, for a variable number of items (10 on average), it performs (1 * number of items) row selections with data retrieval, (1 * number of items) row selections with data retrieval and update, and (1 * number of items) row insertions.

• Payment: Updates the customer's balance and reflects the payment on the district and warehouse sales statistics. This transaction presents two cases. In the first case, the customer is retrieved based on the customer id; the transaction then performs 3 row selections with data retrieval and update, and 1 row insertion. In the second case, the customer is retrieved based on the last name; the transaction performs 2 row selections (on average) with data retrieval, 3 row selections with data retrieval and update, and 1 row insertion.

• OrderStatus: Checks the status of a given order. The transaction first picks a customer and her last order. To do so, the transaction defines two cases. In the first case, the customer is retrieved based on the customer id and the transaction performs 2 row selections with data retrieval. In the second case, the customer is retrieved based on the last name and the transaction performs 4 row selections (on average) with data retrieval. Finally, in both cases, the transaction checks the status (delivery date) of each item of the order (on average there are 10 items per order). This operation performs (1 * number of items) row selections with data retrieval.

• Delivery: Processes a batch of 10 new (not yet delivered) orders. The transaction performs 1 row selection with data retrieval, (1 + number of items per order) row selections with data retrieval and update, 1 row selection with data update, and 1 row deletion.

• StockLevel: Determines the number of recently sold items that have a stock level below a specified threshold. The transaction performs 1 row selection with data retrieval, (20 * number of items per order) row selections with data retrieval, and at most a further (20 * number of items per order) row selections with data retrieval.

The TPC-C workload specifies that, for every 100 submitted transactions, 45 are NewOrder, 43 are Payment, 4 are StockLevel, 4 are OrderStatus and 4 are Delivery transactions.
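As an illustration of the scaling and mix rules above, the following sketch (ours, not part of the EscadaTPC-C implementation) derives the client count and approximate CSV size for a given number of warehouses and draws transaction types according to the 45/43/4/4/4 mix:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative sketch of the TPC-C scaling and mix rules described above; not project code. */
public class TpccScaleSketch {

    // Per-warehouse figures quoted in the text.
    static final double CSV_MB_PER_WAREHOUSE = 63.15;
    static final int CLIENTS_PER_WAREHOUSE = 10;

    enum TxType { NEW_ORDER, PAYMENT, STOCK_LEVEL, ORDER_STATUS, DELIVERY }

    /** Draws a transaction type following the 45/43/4/4/4 mix over 100 submitted transactions. */
    static TxType nextTransaction() {
        int p = ThreadLocalRandom.current().nextInt(100);
        if (p < 45) return TxType.NEW_ORDER;
        if (p < 88) return TxType.PAYMENT;       // 45 + 43
        if (p < 92) return TxType.STOCK_LEVEL;   // + 4
        if (p < 96) return TxType.ORDER_STATUS;  // + 4
        return TxType.DELIVERY;                  // + 4
    }

    public static void main(String[] args) {
        int warehouses = 300;   // single computing node configuration (see Table 1)
        System.out.printf("concurrent clients: %d%n", warehouses * CLIENTS_PER_WAREHOUSE);            // 3000
        System.out.printf("approx. CSV size:   %.1f GB%n", warehouses * CSV_MB_PER_WAREHOUSE / 1024); // ~18.5 GB
        System.out.println("sample transaction: " + nextTransaction());
    }
}
```

The 300-warehouse case reproduces the 3000-client, roughly 19 GB configuration of Table 1.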

2.1.1 Experiment setup

We have configured the platform as follows: Zookeeper, the HBase Master, the HDFS Namenode, the Snapshot Server and the Commit Sequencer Manager each run on a type A node. This number of machines is fixed for all configurations. Each type B node (i.e., computing node) runs 1 Query Engine instance, 2 Conflict Manager instances, 1 Logger instance, 4 HBase RegionServer instances and 1 HDFS DataNode instance. TPC-C clients run on type A nodes. We measured the capacity of one computing node in terms of database size and number of concurrent clients. We stress the computing node until we reach the maximum throughput while the benchmark SLAs3 are still met. Then, the number of computing nodes is increased to show the scalability of the LeanXcale database. In order to measure the capacity of the computing node, we populated the database with 50, 100, 200, 300, 400 and 500 warehouses and injected load from 500 to 5000 clients. Figure 3 shows that a single computing node is able to handle 3000 clients with a database populated with 300 warehouses.

3 http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf


For larger databases, the latency grows exponentially because the system is almost saturated (Figure 4). The CPU becomes the bottleneck, reaching around 80% utilization with 400 warehouses.

Figure 3 Throughput with a single computing node.

Figure 4 Average transaction latency with a single computing node.

The next step in the evaluation consists of scaling the number of nodes along with the database size and the number of concurrent clients. We scaled the system up to 32 computing nodes.


We used up to 16 client nodes to inject the TPC-C load. Each node runs up to 2 TPC-C client instances. The goal of this evaluation is to show that LeanXcale scales linearly; that is, when increasing the number of computing nodes, the database size and the load on the system, the system is able to produce n times the throughput of a single computing node while maintaining the latency of transactions. Table 1 shows the different configurations adopted for the selected database sizes and loads.

Table 1 TPC-C benchmark database size

Warehouses   Clients   DBSize (#rows)   DBSize (GB)   Computing nodes
300          3000      149803300        19            1
600          6000      299606600        38            2
1200         12000     598913200        76            4
2400         24000     1197726400       152           8
4800         48000     2245649500       304           16
9600         96000     4491299000       608           32

2.1.2 Results

Figure 5 and Figure 6 show the evolution of the throughput and latency when running TPC-C with 3000 up to 96000 clients in settings with 1 to 32 computing nodes. The figures show that LeanXcale scales linearly: throughput grows linearly and the latency observed by the clients does not increase as more load is injected. In this evaluation, the Snapshot Server and Commit Sequencer nodes (type A) presented less than 10% CPU utilization in the largest configuration; that is, these nodes can cope with ten times more load before they become a bottleneck. Running these two services on a more powerful machine would further increase the load they can sustain.


Figure 5 Scalability of LeanXcale – Throughput.

Figure 6 Scalability of LeanXcale - Latency of transactions.


2.2. Scalability of the Transaction Manager

The main bottleneck of an OLTP database is transaction processing. This key component was designed in LeanXcale as a set of independent components. The goal of this experiment is to evaluate the scalability of the transaction manager. For that purpose, we used the Yahoo! Cloud Serving Benchmark (YCSB4), a widely used benchmark for cloud data stores. We chose YCSB because its operations are simpler than in TPC-C and there are no waiting times and no restrictions on the number of clients. The YCSB client implements basic row-based operations such as read, insert, update, scan, and read-modify-write. YCSB operations access a single table, usertable, which is populated with synthetic data generated by the benchmark. The benchmark allows customizing the usertable by defining the key and column lengths and the number of columns. We used the standard configuration of the usertable, with a key of 100 bytes and 10 columns of 100 bytes each. In order to stress the system, the benchmark only executes update transactions. Read-only transactions are much lighter since they do not conflict with concurrent transactions and nothing is logged. For each transaction we invoke the methods to start the transaction, check for conflicts and commit in the Transaction Manager API.
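The update-only workload driver described above can be sketched as follows. The TxnManager interface and its method names are placeholders we introduce for illustration; they are not the actual LeanXcale Transaction Manager API.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of an update-only, YCSB-style loop as described above.
 *  TxnManager and its methods are hypothetical placeholders, not the real API. */
public class UpdateOnlyWorkloadSketch {

    interface TxnManager {                            // assumed interface, for illustration only
        long startTransaction();
        boolean checkConflicts(long txId, String key);
        void commit(long txId);
        void abort(long txId);
    }

    static void run(TxnManager tm, long recordCount, long operations) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (long i = 0; i < operations; i++) {
            String key = "user" + rnd.nextLong(recordCount);   // usertable key (100 bytes in the real setup)
            byte[][] row = new byte[10][100];                  // 10 columns of 100 bytes each
            for (byte[] column : row) rnd.nextBytes(column);   // synthetic update payload
            long tx = tm.startTransaction();
            if (tm.checkConflicts(tx, key)) {                  // conflict check before commit
                tm.commit(tx);                                 // updates are logged (read-only transactions skip logging)
            } else {
                tm.abort(tx);
            }
        }
    }
}
```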

2.2.1 Experiment setup

The goal of this experiment is to prove the scalability of the TMs. The Snapshot Server and Commit Sequencer are standalone servers and do not scale out. The evaluation is run on a cluster of nodes equipped with a 12-core Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 128 GB of RAM, 1 Gbit Ethernet and a directly attached 0.5 TB SSD hard disk; there are in total 16 such nodes, all running Ubuntu 12.04 LTS. Additional nodes with a 4-core Intel(R) Xeon(R) CPU X3220 @ 2.40GHz, 8 GB of RAM, 1 Gbit Ethernet and a directly attached 150 GB SSD hard drive are also used. We deploy 1 Transaction Manager instance with 1 co-located Logger instance and up to 12 Conflict Manager instances per 12-core machine. The Conflict Manager is a single-threaded process, so we deploy as many instances as the number of cores. The Local Transaction Manager/Logger pair (from now on, LTM) and the Conflict Managers are deployed on different machines.

2.2.2 Evaluation

In order to study the scalability of the Local Transaction Managers and Conflict Managers, we provision a Snapshot Server and Commit Sequencer node. For every two Local Transaction Managers we provision one Conflict Manager node. Figure 7 presents the throughput for an increasing number of LTMs and CMs. A single unit (of two LTMs and 1 CM) is able to handle approximately 46000 update transactions per second. In our experiment, we deployed up to 10 TMs and 5 Conflict Managers. The throughput in the largest configuration is more than 2,300,000 update transactions per second, which is 5 times the throughput of a single unit. Therefore, the system scales linearly. This number of transactions per second satisfies the needs of most current applications.

4 https://github.com/brianfrankcooper/YCSB


Figure 7: Scalability of the Transaction Manager (transactions/second)

2.3. Snapshot Server and Commit Sequencer Scalability

The throughput of this service depends on the number of Transaction Managers being served. This is because the work done per Transaction Manager is constant and does not depend on the actual load of that Transaction Manager; that is, for the Snapshot Server it represents the same amount of work whether a Transaction Manager is serving one or one million transactions per second. The number of TMs needed to saturate this service is too high to deploy directly (as we show later in this section), so we have conducted a simulation. The simulation consists of deploying this service together with as many client threads as needed (the Snapshot Server spawns a thread for each Transaction Manager, plus the server thread in charge of processing/computing the batches). For the Snapshot Server evaluation, we used a single node with a 4-core Intel(R) Xeon(R) CPU X3220. Each client thread is orchestrated by the benchmark to generate exactly the amount of work that a single Transaction Manager would generate in the system. We measure the throughput of the system in terms of computed batches per second and the latency of computing the batches. The Snapshot Server produces a new snapshot every 10 milliseconds when it is not saturated; we consider the Snapshot Server to be saturated when it is no longer able to produce a new snapshot every 10 milliseconds. Figure 8 shows the throughput of the Snapshot Server with an increasing number of Transaction Managers. The Snapshot Server scales linearly up to 800 Transaction Managers: it produces 25,000 batches per second with 250 TMs, and the number of batches doubles (50,000 batches/s) with 500 TMs. The latency is almost constant (below 10 milliseconds) up to 800 clients (Figure 9).


Since the Snapshot Server can handle up to 800 clients with a latency below 10 milliseconds, the Holistic Transaction Manager, when properly scaled, could serve up to 800 times the throughput of a single Transaction Manager. That is, if a Transaction Manager handles up to 140,000 update transactions per second, the LeanXcale Transaction Manager could potentially handle up to 112 million transactions per second.
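These figures are consistent with one batch per Transaction Manager per 10 ms snapshot period; the small check below (our interpretation of the reported numbers) reproduces them.

```java
/** Sanity check of the Snapshot Server figures above, assuming one batch per TM per 10 ms snapshot. */
public class SnapshotRateCheck {
    public static void main(String[] args) {
        int batchesPerTmPerSecond = 1000 / 10;            // one batch every 10 ms
        System.out.println(250 * batchesPerTmPerSecond);  // 25,000 batches/s with 250 TMs
        System.out.println(500 * batchesPerTmPerSecond);  // 50,000 batches/s with 500 TMs
        System.out.println(800L * 140_000L);              // 112,000,000 update tx/s upper bound with 800 TMs
    }
}
```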

Figure 8 Snapshot Server Throughput

Figure 9 Snapshot Server - Latency with increasing number of Transaction Managers


3. Parallel and Distributed SQL Engine Evaluation

In this section we validate that the OLAP parallel and distributed SQL engine implementation developed within the project satisfies the success indicators proposed in the DoW regarding analytical processing, that is: (i) it is able to process 1 TB of data in 10 seconds using 15 nodes, and (ii) it uses 5 times fewer resources than MapReduce (MR).

3.1. Large Scale OLAP Evaluation

For this evaluation, we set up an experiment using a 1 TB synthetic dataset, generated from real data, containing voice call and message logs from a telco company. The dataset contains multiple details related to voice calls and messages, such as date, duration, geographical information, error information, device (e.g., manufacturer, model, operating system), or user account. Considering this dataset, the analytical queries presented in Table 2 were used to measure the LeanXcale Distributed Query Engine (DQE) processing times.

Table 2: Analytical queries used for analytical processing evaluation

Q1 – Average call duration:
SELECT SUM({fn timestampdiff(SQL_TSI_SECOND, CONNECT_TIME, DISCONNECT_TIME)})/COUNT(*) AS AVG_TIME
FROM LOGS3G
WHERE CONNECT_TIME IS NOT NULL AND DISCONNECT_TIME IS NOT NULL;

Q2 – Total call duration for Q4 of 2010:
SELECT SUM({fn timestampdiff(SQL_TSI_SECOND, CONNECT_TIME, DISCONNECT_TIME)}) AS TOTAL_TIME_2010Q4
FROM LOGS3G
WHERE CONNECT_TIME IS NOT NULL AND DISCONNECT_TIME IS NOT NULL
  AND DATE_END >= timestamp('2010-10-01 00:00:00')
  AND DATE_END < timestamp('2011-01-01 00:00:00');

Q3 – Call duration and call count, per year and month:
SELECT "YEAR", "MONTH",
       SUM({fn timestampdiff(SQL_TSI_SECOND, CONNECT_TIME, DISCONNECT_TIME)}) AS TOTAL_TIME,
       COUNT(*) AS TOTAL
FROM LOGS3G
WHERE CONNECT_TIME IS NOT NULL AND DISCONNECT_TIME IS NOT NULL
GROUP BY "YEAR", "MONTH";

Q4 – Number of failed call and message events per terminal manufacturer:
SELECT TRM_BRAND, COUNT(*) AS FAILED
FROM LOGS3G
WHERE INSUCCESS = 1
GROUP BY TRM_BRAND;

Queries Q1 and Q2 both perform a scalar aggregation, with Q1 performing a full table scan, whereas Q2 performs an indexed scan over a quarter-of-a-year range. Queries Q3 and Q4 perform grouped aggregations, with Q4 performing a count based on a boolean attribute value. This dataset was loaded on the DQE, deployed over 15 machines/nodes at the UPM cluster. Each node has two Intel Xeon CPUs (6 cores, 12 threads each), 128 GB of RAM, a directly attached Intel SSD DC S3500 (460 GB) drive, and Gigabit Ethernet. The deployment was organized as follows:


• 1 node running the snapshot server, the commit sequencer, the configuration manager, the zookeeper, the HBase master, the HDFS name node, and the monitor server;

• 14 nodes each one running 1 DQE instance, 1 HDFS data node, 4 HBase region servers, 1 conflict manager, and 1 monitor agent.

An extra machine was used to run the client application submitting the queries. We ran the experiment with different levels of parallelism, starting with 1 worker thread per DQE instance (see deliverable D5.3 for additional details about parallelism in the DQE) and increasing the number of workers up to 24 (a total of 336 workers), in increments of 4. Figure 10 shows the query processing times for the different levels of parallelism. For each configuration, the query was run 5 times and the average of the last 4 executions was reported (to account for cache warm-up). Considering the proposed KPI of 10 s processing time, we can observe that all queries were able to meet the indicator with no more than 12 workers per DQE instance. Moreover, all queries were able to improve performance beyond 12 workers per DQE instance. Taking the best average processing time for each query, all queries executed in less than 8.4 s (i.e., 16% less than the promised KPI). The fastest query executed in less than 5.1 s (i.e., 49% less than the promised KPI), and the fastest full-scan queries (Q3 and Q4) executed in 8.1 s (i.e., 19% less than the promised KPI).
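A sketch of this measurement procedure (5 runs per configuration, averaging the last 4 to discard cache warm-up), assuming a JDBC-style client; the connection URL below is a placeholder, not the actual DQE endpoint.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Runs Q1 from Table 2 five times and reports the average of the last four runs.
 *  The JDBC URL is a placeholder for the actual DQE client configuration. */
public class QueryTimerSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:leanxcale://dqe-host:port/db";   // placeholder URL
        String q1 = "SELECT SUM({fn timestampdiff(SQL_TSI_SECOND, CONNECT_TIME, DISCONNECT_TIME)})/COUNT(*) "
                  + "AS AVG_TIME FROM LOGS3G "
                  + "WHERE CONNECT_TIME IS NOT NULL AND DISCONNECT_TIME IS NOT NULL";
        try (Connection c = DriverManager.getConnection(url); Statement s = c.createStatement()) {
            double sumSeconds = 0;
            for (int run = 0; run < 5; run++) {
                long t0 = System.nanoTime();
                s.executeQuery(q1).close();                 // execute the query and discard the (small) result
                double secs = (System.nanoTime() - t0) / 1e9;
                if (run > 0) sumSeconds += secs;            // discard the first (warm-up) run
            }
            System.out.printf("Q1 average over last 4 runs: %.1f s%n", sumSeconds / 4);
        }
    }
}
```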

Figure 10: Queries processing times

Figure 11 shows the speedups obtained with the aforementioned queries. As expected, considering that each of the nodes used has 12 physical cores, the query speedup increases steadily up to 12 workers per DQE instance, where it reaches a value between 7x and 8x (depending on the query). Small improvements can be obtained by adding more workers, with query Q4 reaching a maximum speedup of 8.3x with 20 workers per DQE instance.

The processing times reported in Figure 10 (in seconds, per query and per number of workers per DQE instance) are:

Workers per DQE instance:   1      4      8      12     16     20     24
Q1                          63.6   19.8   11.3   8.6    8.8    8.4    8.5
Q2                          38.6   11.7   7.4    5.4    5.6    6.7    5.1
Q3                          63.8   20.2   12.1   8.5    8.1    8.4    8.6
Q4                          67.8   20.7   12.3   8.7    8.2    8.1    8.6


Figure 11: Queries speedups

Analysing hardware resource usage, we can observe that the CPU is the component limiting further performance improvements. Figure 12, Figure 13 and Figure 14 show the evolution of hardware resource usage (CPU, network and disk I/O) for one of the nodes used to deploy a DQE instance, running Q1 with an increasing number of workers per DQE instance (up to 24). With 24 workers per DQE instance (i.e., 1 worker per CPU thread of the machine) the CPU usage reaches 100%, whereas network and disk I/O usage remain far from their limits (both components can achieve throughputs above 100 MB/s). Note that although the CPUs support 24 threads of execution, they only have twelve physical cores (per node), which explains why the performance improvements beyond 12 workers per DQE instance are small, even though the CPU only reaches full usage with 24 workers per DQE instance.

Figure 12: CPU usage over time, running Q1 with an increasing number of workers per DQE instance


Figure 13: Disk usage over time, running Q1 with an increasing number of workers per DQE instance

Figure 14: Network usage over time, running Q1 with an increasing number of workers per DQE instance

3.2. Comparison with MapReduce

The queries from Table 2 were also used to compare the hardware resource usage of the DQE and MR (Hadoop MapReduce v2.7.2 was the implementation used in this comparison). For MR, handmade implementations of the queries were produced, which read the data from CSV files stored on HDFS. For this experiment, we used 5 machines/nodes. Each node has an Intel Core CPU (2 cores, 4 threads), 8 GB of RAM, a directly attached Seagate HDD ST500DM002 (500 GB) hard drive, and Gigabit Ethernet. The DQE deployment was organized as follows:


• 1 node running the snapshot server, the commit sequencer, the configuration manager, the zookeeper, the HBase master, the HDFS name node, 1 conflict manager, and the monitor server;

• 4 nodes, each running 1 DQE instance, 1 HDFS data node, 1 HBase region server, and 1 monitor agent.

The MR deployment was organized as follows:

• 1 node running the zookeeper, the HDFS name node, and the resource manager;

• 4 nodes, each running 1 HDFS data node and 1 node manager.

In both cases, an extra machine was used to run the client application, which submits the queries/jobs. The dataset size was adjusted to the size of the cluster: since the total number of cores available was reduced by a factor of 20, the dataset was also reduced by a factor of 20, i.e., we used a dataset of 50 GB. Figure 15 reports the processing times for each query, using the DQE (with 8 workers per instance) and MR. Table 3 and Table 4 summarize the results of this experiment regarding hardware resource usage, for the master node and the slave nodes (aggregated), respectively. In both cases, we present average CPU usage, total CPU time, total disk I/O and total network usage. These results show that the DQE reduces query processing times by more than an order of magnitude (between 12.4x and 13.5x), which is directly reflected in CPU time. The reduction in network usage is also of about one order of magnitude, whereas the reduction in disk I/O is of about three orders of magnitude. This impressive reduction is achieved by keeping most of the data cached in memory instead of reading it from disk. These results show that the reduction in resource usage enabled by the DQE OLAP SQL engine is well above the 5x value proposed in the DoW.
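For reference, a handmade MapReduce implementation of a query such as Q4 (failed events per terminal brand) over the CSV files could look roughly like the sketch below; the CSV column positions are assumptions for illustration, not the actual dataset layout used in the project.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch of a handmade MR job equivalent to Q4: COUNT(*) of failed events GROUP BY TRM_BRAND.
 *  The CSV column indexes below are illustrative assumptions. */
public class FailedByBrand {
    static final int BRAND_COL = 10, INSUCCESS_COL = 12;   // assumed positions of TRM_BRAND and INSUCCESS

    public static class FailMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length > INSUCCESS_COL && "1".equals(fields[INSUCCESS_COL].trim())) {
                ctx.write(new Text(fields[BRAND_COL].trim()), ONE);   // emit (brand, 1) for failed events
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override protected void reduce(Text brand, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(brand, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "failed-by-brand");
        job.setJarByClass(FailedByBrand.class);
        job.setMapperClass(FailMapper.class);
        job.setCombinerClass(SumReducer.class);            // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // CSV files on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```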

Figure 15: Queries processing time with DQE and MR

The processing times reported in Figure 15 (in seconds) are:

          Q1     Q2     Q3     Q4
DQE       26.4   25.1   25.7   25.8
MR        342    339    338    320


Table 3: Hardware resources usage (master node)

                 Q1            Q2            Q3            Q4
                 DQE    MR     DQE    MR     DQE    MR     DQE    MR
Avg. CPU (%)     1.29   0.40   1.33   0.41   1.33   0.37   1.25   0.39
CPU (s)          1.36   5.44   1.34   5.63   1.37   5.05   1.29   4.94
Disk (MB)        0.75   4.18   0.73   4.07   0.61   3.80   0.76   4.07
Network (MB)     3.58   11.40  3.70   11.25  3.46   11.31  3.84   11.15

Table 4: Hardware resources usage (slave nodes)

                 Q1               Q2               Q3               Q4
                 DQE     MR       DQE     MR       DQE     MR       DQE     MR
Avg. CPU (%)     90.78   96.83    96.21   94.26    94.77   97.56    94.59   95.91
CPU (s)          95.87   1324.62  96.60   1278.19  97.42   1319.07  97.62   1227.67
Disk (MB)        4.39    58221.66 3.76    44170.55 2.97    58933.55 2.66    58880.67
Network (MB)     12.76   185.35   12.58   76.04    12.52   92.72    12.76   85.49


4. Complex Event Processing Evaluation

In this section we demonstrate that the LeanBigData complex event processing (CEP) engine is a scalable system able to satisfy the success indicator stated in part B of the DoW, that is: to be able to process 1 million events per second. To this end we use a continuous CEP query in the context of the Data Centre Monitoring use case (a subset of this query was used for the preliminary evaluation of the CEP presented in deliverable D4.6 "CEP Engine Distributed"). Each machine installed in the data centre reports events with the schema described in Table 5.

Table 5 – Schema of the events in the Data Centre Scenario

Field Name       Field Description
UnixTime         Timestamp when this metric was read
MachineName      Identifier of the machine, composed as {roomID}_{rackID}_{serverID}
CPULoad          Percentage of CPU used at a certain time
NetworkByteIn    Bytes received through the network at a given timestamp
NetworkByteOut   Bytes sent through the network at a given timestamp
DiskIO           Disk usage
CPUTemp          Temperature of the server
Power            Power consumption value

For this evaluation we used synthetic data simulating events from a data centre made up of 100 rooms, each with 1,000 racks of 50 blades each. The LeanBigData CEP is used to process the metrics reported by the data centre sensors and to raise alerts when specific conditions are matched. The alert conditions are described in Table 6. Maximum and average values are evaluated over the data received in intervals of 10 minutes.

Table 6 – Alert Conditions

Alert Type      Condition
Server Alerts   Maximum CPU load > 95%
                Maximum CPU temperature > 85
                Maximum Power consumption > 350
                Average CPU load > 80%
                Average CPU temperature > 70
                Average Power consumption > 300
Rack Alerts     Average CPU load of all servers in the rack > 80%
                Average CPU temperature of all servers in the rack > 70
                Average Power consumption of all servers in the rack > 300
Room Alerts     Average CPU load of all servers in the room > 60%
                Average CPU temperature of all servers in the room > 60
                Average Power consumption of all servers in the room > 200
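For clarity, the event schema of Table 5 and the server-level alert conditions of Table 6 can be summarized in code as follows (a sketch with names of our choosing; the thresholds are those listed above, evaluated over 10-minute windows):

```java
import java.util.List;

/** Sketch of the Table 5 event schema and the server-level alert checks from Table 6. */
public class ServerAlertSketch {

    record Event(long unixTime, String machineName,      // machineName = {roomID}_{rackID}_{serverID}
                 double cpuLoad, long networkByteIn, long networkByteOut,
                 double diskIO, double cpuTemp, double power) {}

    /** Splits the machine identifier into [room, rack, server]. */
    static String[] splitMachineName(String machineName) {
        return machineName.split("_", 3);
    }

    /** Evaluates the server alert conditions over one 10-minute window of events for a single server. */
    static boolean serverAlert(List<Event> window) {
        if (window.isEmpty()) return false;
        double maxLoad = 0, maxTemp = 0, maxPower = 0, sumLoad = 0, sumTemp = 0, sumPower = 0;
        for (Event e : window) {
            maxLoad = Math.max(maxLoad, e.cpuLoad());
            maxTemp = Math.max(maxTemp, e.cpuTemp());
            maxPower = Math.max(maxPower, e.power());
            sumLoad += e.cpuLoad();
            sumTemp += e.cpuTemp();
            sumPower += e.power();
        }
        int n = window.size();
        return maxLoad > 95 || maxTemp > 85 || maxPower > 350               // maximum thresholds (Table 6)
            || sumLoad / n > 80 || sumTemp / n > 70 || sumPower / n > 300;  // average thresholds (Table 6)
    }
}
```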


Figure 16 – Data centre monitoring CEP query

The CEP query depicted in Figure 16 is used to process the data and raise the alerts. The query starts with a MAP operator that transforms the field MachineName into three new fields containing, respectively, the room name, the rack number and the server id. The de-multiplexer (DMUX) operator is used to replicate the input stream into three different output streams that are used to compute the statistics per server, per rack and per room. Finally, three blocks, each composed of an aggregate and a filter operator, are used to compute the statistics per server/rack/room and to check the alert conditions. The evaluation was carried out at the UPM cluster. We used a set of blades, each equipped with AMD Opteron CPUs, 128 GB of RAM, 1 Gbit Ethernet and a directly attached 460 GB SSD disk. In detail, the setup is composed of:

• 3 nodes for load generators.

• 1 node to run LeanBigData CEP Orchestrator and the resource monitor server.

• 2 to 8 nodes to run LeanBigData Instance Managers.

In the experiments we used up to 75 clients distributed over 3 blades. After a warm-up stage, each client sends events at a constant rate of 20,000 events per second. The query was divided into four sub-queries by the JCEPC driver auto query factory. Figure 17 shows how the query is divided into sub-queries. The sub-query SQ1 contains the MAP and DMUX operators, because the leading stateless operators are all placed in the same sub-query. Then, there are three more sub-queries, SQ2, SQ3 and SQ4, each with a pair of AGGREGATE and FILTER operators. These sub-queries were created because AGGREGATE is a stateful operator and the JCEPC driver allocates each stateful operator, together with the following stateless operator, into a new sub-query.
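The partitioning rule described above (leading stateless operators grouped together; each stateful operator opens a new sub-query that also absorbs the stateless operators following it) can be sketched over a linearized view of the operators. The real query is a graph, but the rule yields the same four sub-queries shown in Figure 17. This is our illustration, not the JCEPC auto query factory code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the sub-query partitioning rule described above. */
public class SubQueryPartitioner {

    record Operator(String name, boolean stateful) {}

    static List<List<Operator>> partition(List<Operator> operators) {
        List<List<Operator>> subQueries = new ArrayList<>();
        List<Operator> current = new ArrayList<>();
        for (Operator op : operators) {
            if (op.stateful() && !current.isEmpty()) {   // a stateful operator starts a new sub-query
                subQueries.add(current);
                current = new ArrayList<>();
            }
            current.add(op);                             // stateless operators stay in the current sub-query
        }
        if (!current.isEmpty()) subQueries.add(current);
        return subQueries;
    }

    public static void main(String[] args) {
        // MAP and DMUX are stateless; each AGGREGATE is stateful and is followed by a stateless FILTER.
        List<Operator> query = List.of(
            new Operator("MAP", false), new Operator("DMUX", false),
            new Operator("AGGREGATE-server", true), new Operator("FILTER-server", false),
            new Operator("AGGREGATE-rack", true), new Operator("FILTER-rack", false),
            new Operator("AGGREGATE-room", true), new Operator("FILTER-room", false));
        partition(query).forEach(sq ->
            System.out.println(sq.stream().map(Operator::name).toList()));   // prints the SQ1..SQ4 contents
    }
}
```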


Figure 17 – Sub-queries configuration

To demonstrate the scalability of the engine, we ran the query in three different deployments with 2, 4 and 8 nodes for the CEP. First we ran the query on 2 nodes with the following deployment:

• Sub-query SQ1 with 45 instances.

• Sub-query SQ2 with 40 instances.

• Sub-query SQ3 with 15 instances.

• Sub-query SQ4 with 12 instances.

SQ1 and SQ2 were deployed with a larger number of instances because they are heavier than SQ3 and SQ4. In fact, SQ3 and SQ4 compute statistics per rack and per room respectively, which is much less information than the per-blade statistics computed by SQ2. Figure 18 shows the evolution of the CEP throughput during an experiment that lasted for about 15 minutes. The maximum throughput reached at the end of the experiment was around 400,000 tuples per second. In this experiment we gradually started 20 clients, each with a constant target throughput of 20,000 tuples per second.
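Each load client targets a constant rate of 20,000 events per second; a minimal rate-controlled sender loop could look like the sketch below, where the Consumer stands in for the actual client publish call (an assumption on our part):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Minimal constant-rate sender: emits ratePerSecond events/s in 10 ms batches.
 *  The Consumer stands in for the real CEP client publish call. */
public class ConstantRateSender {

    static void run(Consumer<String> send, int ratePerSecond, long durationSeconds)
            throws InterruptedException {
        int batchSize = ratePerSecond / 100;                         // events per 10 ms batch
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(10);
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(durationSeconds);
        long next = System.nanoTime();
        long seq = 0;
        while (System.nanoTime() < deadline) {
            for (int i = 0; i < batchSize; i++) {
                send.accept("room1_rack1_server1," + (seq++));       // synthetic event payload
            }
            next += intervalNanos;
            long sleepNanos = next - System.nanoTime();
            if (sleepNanos > 0) TimeUnit.NANOSECONDS.sleep(sleepNanos);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        run(event -> { }, 20_000, 5);   // 20,000 events/s for 5 seconds, events discarded
    }
}
```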


Figure 18 – CEP Throughput evolution (2 nodes).

Figure 19 and Figure 20 show the evolution of the network traffic (bytes sent per second) and of the CPU idle percentage (the percentage of available CPU) of the blades used in this experiment, that is, blade105 for the clients and blade102 and blade103 for the Instance Managers. Looking at Figure 20 we see that the CPUs on the CEP nodes were almost fully utilized (CPU idle going to 0).

Figure 19 - Bytes sent by each node in the CEP Cluster (2 nodes).


Figure 20 - CPU idle in each node in the CEP Cluster (2 nodes).

In the 4-node deployment we used the following distribution for the query:

• Sub-query SQ1 with 90 instances.

• Sub-query SQ2 with 80 instances.

• Sub-query SQ3 with 30 instances.

• Sub-query SQ4 with 24 instances.

Figure 21 shows the evolution of the CEP throughput during an experiment that lasted for about 35 minutes. The maximum throughput reached at the end of the experiment was around 750,000 tuples per second. Figure 22 and Figure 23 show the evolution of the network traffic and of the CPU idle of the blades used in this experiment, that is, blade105 and blade109 for the clients and blade102, blade103, blade104 and blade106 for the CEP Instance Managers. Again the CPU is almost saturated on the CEP nodes, demonstrating that we reached the maximum throughput also in this configuration. In this experiment we gradually started 40 clients (distributed over two nodes), each with a constant target throughput of 20,000 tuples per second.


Figure 21 – CEP Throughput evolution (4 nodes).

Figure 22 - Bytes sent by each node in the CEP Cluster (4 nodes).


Figure 23 - CPU idle in each node in the CEP Cluster (4 nodes).

Figure 24 shows the throughput of the same query deployed in Apache Storm in a 4-worker deployment. The LeanBigData CEP is able to process many more tuples per second (750,000) than Apache Storm (550,000) using the same amount of resources.

Figure 24 – Storm Throughput (4 nodes)


Finally, in the experiments with 8 nodes the CEP cluster was composed of 448 Instance Managers. The query was deployed in the cluster with the following configuration:

• Sub-query SQ1 with 180 instances.

• Sub-query SQ2 with 160 instances.

• Sub-query SQ3 with 60 instances.

• Sub-query SQ4 with 48 instances.

Figure 25 shows the evolution of the CEP throughput during an experiment that lasted for 1 hour. The maximum throughput reached at the end of the experiment was around 1,400,000 tuples per second, which is greater than the KPI for the CEP promised at the beginning of the project (1 million events per second). We first gradually started 25 clients to send a load of 500,000 tuples per second (at 15:10); then we gradually started 25 more clients to reach the KPI of 1,000,000 tuples per second (at 15:30). Finally, we started more clients to show that the CEP was able to go above the promised KPI.

Figure 25 – CEP Throughput evolution (8 nodes).

The limit in this experiment was given by the network bandwidth. Figure 26 and Figure 27 show, respectively, the bytes sent per second and the CPU idle in each of the nodes used for running the CEP and the clients (blade101, blade105 and blade109). In Figure 26 we can observe that after starting the third block of clients (around 15:30 on blade101), the amount of data sent by the other nodes of the cluster saturates the network (around 90 MegaBytes per second at 15:50). At the same time, in Figure 27 we observe that the CPU idle never goes below 40% on all nodes.


Initially, the percentage of idle CPU is higher because the load (events received) is much lower than at the end of the experiment.

Figure 26 – Bytes sent by each node in the CEP Cluster (8 nodes).

Figure 27 – CPU idle in each node in the CEP Cluster (8 nodes).


Figure 28 shows the scalability of the LeanBigData CEP. The blue line (Linear) depicts ideal linear scalability, obtained by multiplying by 2 and by 4 the throughput obtained with 2 CEP nodes. The red line reports the maximum throughput reached by the LeanBigData CEP in the three deployments. The red line is very close to the blue one, demonstrating the scalability of the CEP. The difference at the 8-node point is around 200,000 tuples/second out of a total throughput of around 1.5 million tuples/second. This difference is explained by the network saturation shown in Figure 26 and discussed above.

Figure 28 – CEP scalability


5. 2D Visualization System usability Evaluation

5.1. Introduction

This chapter constitutes the first and last report on the usability evaluation of the 2D visualization system. It addresses WP2 Task 2.5 – LeanBigData Evaluation, more specifically the sentence "This task will also perform a usability evaluation of the visualizations and human-computer interface of the LeanBigData platform with a population of at least 100 people counting with professionals from the different partners." First, this chapter presents the goals and the defined method, as well as the important details concerning the performed evaluation tests. Next, it presents how the tests were conducted and the achieved results. Finally, conclusions are drawn from the results.

5.2. Scope

The system evaluation was performed through an experiment where users performed certain tasks in the 2D Visualization System. It is important to state that the users participating in the experiment did not participate in the specification or development of the system, and that the usability experiment was the first time they came into contact with the system. The 106 users participating in the evaluation of the 2D Visualization System came from three institutions: 6 users from INESCTEC (Portugal), 24 users from ATOS (Spain) and 76 users from NTUA (Greece). The goal of the experiment was to evaluate how users were able to perform the following specific tasks, which characterise the most important user operations.

1. To create a query through the set of visual operators
2. To inspect and run a query
3. To create a chart based on an existing workflow
4. To edit an existing chart and change visual properties
5. To add a chart into an existing dashboard
6. To add a chart into a newly created dashboard

The experiment aimed to reveal if users:

a) can easily navigate between workflows, charts and dashboards
b) can easily access a specific workflow and a specific chart
c) can efficiently use the visual operators for workflow creation
d) can efficiently configure a chart
e) can efficiently configure a dashboard

5.3. Description

For testing purposes the 2D Visualization System was made available to users through the Google Chrome web browser at http://rocket.inescporto.pt. Although the testing application contained several other datasets, the SYNCLAB dataset was used for testing purposes. The tests were performed on laptops with the following general characteristics: Full HD (1920x1080) screen resolution; Core i7 processor or equivalent; at least 4 GB RAM (8 GB RAM recommended); three-button mouse input; Google Chrome up to date.

The tests were scheduled for December 2016 at the INESCTEC and ATOS facilities and for January 2017 at the NTUA facilities. The selected user audience profile was intended to be similar to the user profile expected to use the 2D Visualization System: users should have at least minimal knowledge of the representation of information in an SQL database and of the purpose of visual representation through charts, and minimal information regarding the purpose of each chart type. Finally, users should also have minimal knowledge of the purpose of dashboards. Generally speaking, users should be proficient in information technology.

5.4. Consent participation

Each user participating in the experiment was asked to carefully read the consent form, pose any questions they liked and, if they agreed, sign the consent form.

The consent form provided the following information:

1. The purpose and objectives of the experiment are to evaluate the usability of specific features of the developed LBD visualization system in what concerns: to create a query through the set of visual operators; to inspect and run a query; to create a chart based on an existing workflow; to edit an existing chart and change visual properties; to add a chart into an existing dashboard; to add a chart into a newly created dashboard.

2. The experiment process is the following:
• The evaluator shortly describes the system.
• The participant is asked to perform a group of tasks. For each task:
  o The evaluator asks the participant to perform one task. The task is written on a sheet of paper and the participant can read it at any time (for details, for instance).
  o The participant performs the task.
  o At the end, a System Usability Scale (SUS) questionnaire is completed by the participant.

3. This experiment poses no risks for the participant.

4. During the performance of each task, the evaluator collects the number of errors made by the user. At the end of the task the evaluator collects the total amount of time it took the participant to perform the task. Furthermore, the answers from a SUS questionnaire are also retrieved. Brief general user remarks are also collected.

5. The methods of collecting the data are reduced to quantitative data retrieved by the evaluator (gender and age of the participant, time to perform, task completion, task difficulty) and qualitative data (participant remarks).

6. The results are anonymized, in order to protect the privacy and identity of the user. Any acquired personal information (name and contact info) will not be shared with anyone outside the team. The anonymized results are statistically processed and the output of this process can be shared with third persons outside the scope of this project.

7. The following contacts should be used in case the participant has more questions about the experiment or wants to withdraw from the experiment and any data acquired at a later time: Marta Patiño (LeanBigData Project Coordinator) or Ricardo Jimenez-Peris (LeanBigData Technical Coordinator) at http://leanbigdata.eu.

8. The consent form should be read by each participant before the session and signed by both the participant and the evaluator. The participant should sign two copies of the consent form and keep one for himself/herself.

By signing this document I hereby consent to voluntarily participate in this experiment, which can be stopped by myself at any moment for any reason with no questions asked. I also declare to be aware that there is no monetary compensation for conducting the experiments.

5.5. User tasks

The following tasks were performed for each experiment session:

• UT1: to create a query through the set of visual operators
• UT2: to inspect and run a query
• UT3: to create a chart based on an existing workflow
• UT4: to edit an existing chart and change visual properties
• UT5: to add a chart into an existing dashboard
• UT6: to add a chart into a newly created dashboard

5.6. Metrics

For each of these tasks metrics were collected, mostly quantitative data: time to perform the task and a numerical classification of task difficulty. The following methods were used to retrieve metric data:

• Quantitative data: an Excel worksheet⁵ was provided to the LBD evaluator conducting the experiment, and for each experiment session the following data was recorded:
  o Number of errors per task: an erroneous interaction was defined for each task ahead of time, and that definition was used for the calculation of the number of errors for the entire evaluation.
  o Time on task: the amount of time it takes for each user to complete a task.
  o Successful task completion:
    - Success: user completed the task on his/her own without any help from the evaluator.
    - Partial success: user completed the task, but only after some help from the evaluator.
    - Failure: user took more than the average time to complete the task and only after substantial help from the evaluator, or user gave up on completing the task, or user expressed great frustration.

• Qualitative data: the evaluator recorded the user's general comments, suggestions and recommendations. Example: "The information I was looking for was hidden. I would have liked it better if there was a link from the main page that would take me directly there."

⁵ Excel document named LBD FP7 2D Visualization experiment data.xlsx, attached.

5.7. SUS questionnaire

At the end of each experiment, the following SUS questionnaire was presented to and answered by each user.

Place a cross over the number that best suits your answer (1 = Strongly disagree, 5 = Strongly agree)

Q1. I would like to use this system frequently (if I had the need): 1 2 3 4 5
Q2. I thought the system was easy to use: 1 2 3 4 5
Q3. I think that I would need the support of a technical person to be able to use this system: 1 2 3 4 5
Q4. I found the various functions in this system were well integrated: 1 2 3 4 5
Q5. I thought there was too much inconsistency in this system: 1 2 3 4 5
Q6. I would imagine that most people would learn to use this system very quickly: 1 2 3 4 5
Q7. I felt very confident using the system: 1 2 3 4 5
Q8. I needed to learn a lot of things before I could get going with this system: 1 2 3 4 5
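The deliverable reports the raw answer distribution for each statement rather than a single aggregated score. For reference, the standard System Usability Scale combines ten alternating positive and negative statements into a 0-100 score; the sketch below illustrates that standard scoring, not the 8-item variant listed above, and the class name and example answers are hypothetical.

```java
public class SusScore {

    /**
     * Standard 10-item SUS scoring (Brooke, 1996):
     * odd-numbered (positive) items contribute (answer - 1),
     * even-numbered (negative) items contribute (5 - answer),
     * and the sum is multiplied by 2.5 to map onto a 0-100 scale.
     * Note: the questionnaire above lists 8 items, so this sketch
     * illustrates the standard scale rather than this exact form.
     */
    static double score(int[] answers) {
        if (answers.length != 10) {
            throw new IllegalArgumentException("Standard SUS expects 10 answers in [1,5]");
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += (i % 2 == 0) ? answers[i] - 1   // items 1,3,5,7,9
                                : 5 - answers[i];  // items 2,4,6,8,10
        }
        return sum * 2.5;
    }

    public static void main(String[] args) {
        // Hypothetical respondent, answers on a 1-5 Likert scale.
        int[] answers = {4, 2, 4, 2, 5, 1, 4, 2, 4, 2};
        System.out.println("SUS score: " + score(answers)); // 80.0
    }
}
```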


5.8. Experiment execution

For each experiment session the following steps were performed by the evaluator:

1. Start of the session (before the participant enters):
   a. Have the printed sheet with the tasks available.
   b. Have two copies of the consent form⁶ available.
   c. Have a copy of the SUS questionnaire⁷ available.
   d. Ensure the system is available for tests and without any data from previous tests.
2. Welcome the user and introduce the evaluator or facilitator.
3. Explain the purpose of the evaluation and describe the process that is going to be followed. Inform the user that it is the system that is going to be evaluated and not the user's performance, and that if they wish to stop the evaluation for any reason and at any time, they should just let the facilitator know about it.
4. Hand out the two copies of the consent form to sign. Allow the participant to read the form. Answer all the questions the participant may pose. The participant must sign the two copies.
   • The evaluator should assign a new ID (number: 1, 2, ... etc.) to the pair of copies. This ID is used in other documents and in the Excel resume sheet to anonymously identify the participant. One of the copies should be returned to the participant.
5. Give a brief overview of the system that is going to be evaluated. Explain that the system is divided into 3 parts: workflows, charts and dashboards, and the purpose of each. Explain that the structure of the workflow panel is divided into: connection, operators, workflow edit window, options for the selected operator, SQL generated and results.
6. Begin the evaluation.
7. For each task: A) give the task to the participant to read and allow them time to complete it; B) record the number of errors, time on task (seconds) and user success rate; C) take notes on user comments, questions and observations about the interaction that is taking place.
8. After each task, ask the user if they have anything to note about what they had to do or how they went about completing the task.
9. Repeat the same process for the rest of the tasks.
10. Give the SUS questionnaire to the participant to fill out at the end of the session. The evaluator should mark the questionnaire with the ID from step 4.
11. Thank the participant for their time and effort.
12. Closing the session (after the participant leaves):
    a. The evaluator should transcribe the SUS questionnaire answers, along with other recorded data from this session, to the Excel resume sheet.
    b. The evaluator should clean the workflows, charts and dashboards created by the participant in the test session. The evaluator should bear in mind that a single dashboard needs to exist to perform the task.


5.9. User task list

On the user side, an experiment consisted of the following set of user tasks:

UT0: Log into the system (evaluator provides credentials).

UT1: Create a new workflow. (You are going to create a query that selects name and belief from a table interest where the belief number is higher than 30. The results will be sorted by that value.)
1. In the Workflows tab click the button New
2. Use the SYNCLAB connection
3. Add a datasource. Select the schema SYNCLAB_SDDSENTINEL and table interest
4. Add Filter and select column belief to be greater than 30
5. Add Sort by column belief, ascending
6. Add Select columns and ensure that only columns belief and name are selected
7. Run the workflow
8. Save the workflow

UT2: Rename the workflow
1. Open the previously created workflow
2. Change the name to test
3. Save the workflow

UT3: Create a new chart (line chart, using the previously created workflow)
1. In the Charts tab click the button New
2. Select the previously created workflow
3. Drag column name to the dimensions
4. Drag column belief to measures
5. Drag Line chart to the chart preview
6. Save the chart

UT4: Edit the existing chart and change visual properties
1. Edit the previously created chart
2. In the chart options:
   i. Change the X axis format to value (representing the dimension name)
   ii. Change the color value to red
   iii. Change the update interval to 5000 milliseconds
3. Rename the chart to Test chart
4. Save the chart

UT5: Add the chart into a newly created dashboard
1. In the Dashboards tab click the button + (a new dashboard, named Dashboard, appears)
2. In the chart selection dropdown, select the Test chart (the chart is added to the dashboard)

UT6: Edit the dashboard
1. In the new dashboard, change the size of the chart and save the options of the dashboard


5.10. Results

5.10.1 User population

As previously said, the number of users experimenting with the 2D Visualization System reached 106: 76 men and 30 women.

The participants' gender distribution is shown in the chart "Participants by gender", and the age distribution in the chart "Ages of participants", where the Y axis shows the number of users and the X axis shows the age bins (10-20 up to 70-80). No participant was under 18 years old.

5.10.2 Successful task completion

The following table summarises task completion success. There were no failures in task completion; however, UT1 shows a considerably higher percentage of partial success compared to the other UTs. One possible explanation is the fact that it was the first time users actually used the system and that the query they were requested to create has some complexity.



Table 7: Task completion summary.

UT1: To create a query through the set of visual operators
UT2: To inspect and run a query
UT3: To create a chart based on an existing workflow
UT4: To edit an existing chart and change visual properties
UT5: To add a chart into an existing dashboard
UT6: To add a chart into a newly created dashboard

                  UT1    UT2    UT3    UT4    UT5    UT6
Success           43%    78%    86%    88%    89%    83%
Partial success   57%    22%    14%    12%    11%    17%
Fail               0%     0%     0%     0%     0%     0%

The next table shows the number of users grouped by the number of errors they made while completing a particular user task. The number was recorded by the evaluator every time he needed to aid, correct or answer a question towards UT completion. The numbers are consistent with what has been presented in the completion table. UT1 shows a reasonably high number of users making 1 and 2 errors while performing the task. The other UTs present a very low number of errors, which is also consistent with the degree of success.

Table 8: Number of errors while performing the task (summary).

Nº of errors   UT1   UT2   UT3   UT4   UT5   UT6
0                0     5     5     5     4     2
1               18    82   101   101   100    95
2               32    18     0     0     1     9
3               24     1     0     0     1     0
4               16     0     0     0     0     0
5                9     0     0     0     0     0
6                4     0     0     0     0     0
7                2     0     0     0     0     0
8                1     0     0     0     0     0
9                0     0     0     0     0     0
10               0     0     0     0     0     0
11               0     0     0     0     0     0
12               0     0     0     0     0     0
Total          106   101   101   101   102   104

The following table addresses the time it took to complete the UTs. Most users completed tasks UT2-UT6 in a fair amount of time, given that no aid was provided and that it was the first time they were in contact with the system. The behaviour regarding UT1 is different but consistent with the previous results: most users took several minutes to create the visual query. This may be due to the fact that the visual query creation encompassed the time the user required to understand the interface (operators, operator linkage, and parameter entry). We believe the learning process is very short and that further visual query creations would result in much lower times.


Table 9: Time to complete the task (summary).

Less than (seconds)   UT1   UT2   UT3   UT4   UT5   UT6
0                       0     0     0     0     0     0
30                      0     9     1     0    13    18
60                      0     8    13    10    14    23
90                      0    20    28    45    45    36
120                     0    40    55    45    34    29
150                     1    20     7     6     0     0
180                    20     9     2     0     0     0
210                    22     0     0     0     0     0
240                    10     0     0     0     0     0
270                     0     0     0     0     0     0
300                     2     0     0     0     0     0
330                     7     0     0     0     0     0
360                     8     0     0     0     0     0
390                     9     0     0     0     0     0
420                     4     0     0     0     0     0
450                     7     0     0     0     0     0
480                     1     0     0     0     0     0
510                     3     0     0     0     0     0
540                     3     0     0     0     0     0
570                     1     0     0     0     0     0
600                     2     0     0     0     0     0
More                    6     0     0     0     0     0

Regarding the SUS questionnaire, the following table shows the result summary: for each question, it reports the number of users who gave each possible answer (users answer with an integer from 1 to 5, denoting the corresponding semantics from totally disagree to totally agree).


Table 10: SUS questionnaire summary (number of users per answer: totally disagrees / mostly disagrees / neither agrees nor disagrees / mostly agrees / totally agrees)

• I would like to use this system frequently (if I had the need): 0 / 6 / 35 / 32 / 32
• I thought the system was easy to use: 3 / 2 / 31 / 37 / 33
• I think that I would need the support of a technical person to be able to use this system: 5 / 14 / 32 / 28 / 27
• I found the various functions in this system were well integrated: 0 / 8 / 35 / 40 / 23
• I thought there was too much inconsistency in this system: 10 / 14 / 30 / 22 / 30
• I would imagine that most people would learn to use this system very quickly: 0 / 10 / 25 / 43 / 28
• I found the system very difficult to use: 3 / 9 / 28 / 42 / 24
• I needed to learn a lot of things before I could get going with this system: 34 / 39 / 28 / 3 / 2

From the SUS questionnaire summary we can emphasise that users would mostly accept the system if they needed it; users consider the system easy to use, but their perspectives differ regarding the need for technical support. Most users consider that there is good integration between the various functions the system contains (visual query, chart building, dashboard support). However, there are some inconsistencies in the SUS answers that should be addressed in future work: while users consider that they do not need to learn much before they start using the system and that other people could learn to use the system very quickly, the majority finds the system difficult to use.

5.11. User suggestions

User suggestions were:

• In the login page, the login action could be performed with the keyboard and not just with the mouse.

• In the visual query window:
  o Visibility problems: currently the user has to scroll down to see the selected visual operator's properties and this may not be entirely evident. On the other hand, when the workflow becomes too big it is difficult to see the entire workflow on the screen, therefore some automatic resizing of the workflow could be useful.
  o Save function: the name box and the save button should be visually emphasised. When pushing the "save" button, there should be the possibility to change the default name. Place the name close to the save button.
  o The names of the columns in the "Select columns" operator are too long. Maybe having just the local name, and not including the connection and schema, could be easier.
  o The connection mechanism between two operators is not clear. "She tried to connect them as if they were puzzle pieces." "When she was trying to link the 'DataSource' operator with the 'Filter' operator, she was looking at the 'Operator' menu and experimenting with the 'Join Operator' operator." "After some instructions from the evaluator, she realized it is just by throwing a line between them." "The first experimentation with operators and their connection is difficult, but once you know the modus operandi it is quite easy."


  o The exclamation point does not offer any clue about which options are missing.
  o Change Run SQL to Run Workflow.
  o Having two "Run" buttons is misleading.

• In the dashboard tab:
  o When selecting the dashboard tab, a default dashboard is shown, therefore people start to work on it instead of clicking "+" to open a new one. Using the same button "+ New" as in the other tabs could be more intuitive.
  o The "Chart selection" dropdown in the dashboard tab needs to be visually emphasised. Almost all users complained about this.
  o The user expects to find a "save" button as in the previous tabs. She could not identify the diskette icon.
  o Sometimes it is difficult to set the specific size desired by the user: when the user drops the border, the application automatically re-adjusts the size.
  o Colours might be a dropdown.
  o Highlight the resize icon (bottom right) when the user positions the pointer over the chart.
  o The "Play" button is not clear. Maybe some tooltips would be welcome.

In general, users focused on improvements towards: a) the visual query: component layout, save query usability, operator linkage usability; and b) dashboards: creation of new dashboards, chart selection, saving dashboard options.


6. Non-Functional Evaluation of Sentiment Analysis Tool

6.1. Introduction

This section describes the evaluation process regarding the sentiment analysis tool, which took place in three phases and whose goal was to collect feedback from the users in order to improve the tool in terms of accuracy, usability, comprehensiveness of the API, etc. The section is structured as follows:

• section 6.2 describes the three phases of the evaluation process

• section 6.3 presents the visualization of the results

• section 6.4 presents the questionnaires that were distributed to the participants

• section 6.5 presents the results of the evaluation

• section 6.6 concludes the section

6.2. Phases

6.2.1 1st Phase (End of Year 2)

The first phase of the evaluation process was carried out in the context of the "Web Programming" undergraduate course of the School of Electrical and Computer Engineering (ECE) of the National Technical University of Athens (NTUA), where 86 students participated in total.


The flow of the process was the following:

• Build the sentiment analysis Eclipse Java project (Installation)

• Run the project in order to load the positive graph, the negative graph and the classifier (Initialization)

• Run the sentiment analysis tool (Execution)

Input: the users were able to insert a string through the keyboard.
Output: the output of the tool was positive or negative.


6.2.1.1 User Feedback

Some of the common comments that the participants made are the following:

• It is rather hard to build the sentiment analysis project, as it contains lots of dependencies which we had to add manually

• The initialization phase should be more configurable in terms of the paths pointing to the necessary files and other parameters (e.g., all the paths are hard-coded)

• The accuracy of the sentiment analysis algorithm is modest

• The output should also be neutral

6.2.2 2nd Phase (Early Year 3)

The second phase of the evaluation process was carried out in the context of the Hackathon that took place on 23 May 2016 at NTUA premises (http://iot-cosmos.eu/node/1998), where 80 students and 26 individuals participated in total.

The flow of the process was the following:

• Build the sentiment analysis Eclipse Java Maven project (Installation)

• Familiarize with the process of training; produce the positive graph, the negative graph and the classifier based on a small subset of the training datasets (Training)

• Run the project to load the positive graph, the negative graph and the classifier (Initialization)

• Run the sentiment analysis tool (Execution)


Input: the users were able to insert either a string (through the keyboard) or historical tweets, related to the USA presidential elections 2016, that are stored in our MongoDB.
Output: the output of the tool was positive, neutral, or negative.

6.2.2.1 User Feedback

Some of the common comments that the participants made are the following:

• The training phase should be more configurable; provide a more convenient way to include new training datasets

• It would be useful to be able to provide as input real-time tweets

• The accuracy of the sentiment analysis algorithm is good

6.2.3 3rd Phase (End of Year 3)

Finally, the sentiment analysis tool was evaluated by 133 researchers participating in EU projects, during various events which took place in December 2016 and January 2017, such as a workshop at the Center for Security Studies of the Hellenic Republic.

The flow of the process was the following:

• Build the sentiment analysis Eclipse Java Maven project (Installation)


• Run the project to load the positive graph, the negative graph and the classifier (Initialization)

• Run the sentiment analysis tool (Execution)

Input: the users were able to insert either a string (through the keyboard) or historical tweets, related to the USA presidential elections 2016, that are stored in our MongoDB.
Output: the output of the tool was positive, neutral, or negative.

6.2.3.1 User Feedback

Some of the common comments that the participants made are the following:

• It would be useful to be able to provide as input real-time tweets

• The accuracy of the sentiment analysis algorithm is very good

6.3. Results Visualization

As mentioned above, the users that participated in the second and third evaluation phases could provide as input to the tool either a single string or a set of stored tweets. In the second case, the tool produces a table characterizing each tweet as positive, neutral or negative. Since they had to deal with millions of tweets, this way of presenting the results would be rather hard for the users to understand. For this reason, we created a GUI running on our server, where the participants were able to watch aggregated results when running the sentiment analysis against historical tweets regarding the USA presidential election 2016. Therefore, during the second and the third evaluation phases, the evaluation of the GUI was part of the whole process.


6.4. Questionnaires

After testing the tool, the following questionnaires were answered by each participant:

6.4.1 Software Evaluation Questionnaire


6.4.2 Training Evaluation Questionnaire

6.5. Overall Results

The results of the evaluation process are depicted in the following graphs:


6.5.1 Aggregated Comparative Results

The following two figures present the aggregated comparative results regarding questions #10 and #12, which are considered the most important:


6.6. Conclusion

The figures in section 6.5 (and especially the two graphs in subsection 6.5.1) show that the sentiment analysis tool was significantly improved throughout the project lifecycle. Based on the feedback provided by the evaluators (subsections 6.2.1.1, 6.2.2.1 and 6.2.3.1), we made the following modifications:

• Implemented the component as a Java Maven project so that the installation process becomes easier and faster

• Connected the tool with the MongoDB where we store the tweets, in order to enable the users to test the algorithm at a very large scale (millions of tweets instead of a single string)

• Improved the training and initialization process by introducing a "config.properties" file through which all the parameters can be easily configured (an illustrative sketch is given after this list)

• Added the neutral class

• Kept improving the accuracy of the algorithm
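The exact keys of the "config.properties" file are not listed in this deliverable; the following sketch only illustrates the general approach of externalising paths and parameters into a properties file loaded at initialization time. The class name and all key names are hypothetical.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class SentimentConfig {

    public static void main(String[] args) throws IOException {
        // Load all tunable parameters from an external properties file
        // instead of hard-coding paths in the source, as requested by
        // the first-phase evaluators. Key names below are illustrative only.
        Properties config = new Properties();
        try (FileInputStream in = new FileInputStream("config.properties")) {
            config.load(in);
        }

        String positiveGraphPath = config.getProperty("positive.graph.path");
        String negativeGraphPath = config.getProperty("negative.graph.path");
        String classifierPath    = config.getProperty("classifier.path");
        String mongoUri          = config.getProperty("mongo.uri");

        System.out.println("Loading graphs from: "
                + positiveGraphPath + ", " + negativeGraphPath);
        System.out.println("Loading classifier from: " + classifierPath);
        System.out.println("Reading tweets from: " + mongoUri);
    }
}
```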


7. Usability Evaluation of the Data Centre 3D Visualization application

The development of the Data Centre 3D Visualization application has followed an iterative approach, continuously evolving the supported functionality, the User Interface and the interaction methods. Three major application versions can be identified during the evolution of the application, which has been documented in deliverables D6.6 and D6.7, and is summarised in Table 11 and illustrated in Figure 29 and Figure 30. More specifically, the first version of the application aimed at providing a 3D visualization of a data centre room, while the second version incorporated gesture-based interaction and an updated user interface which better exploited the screen real estate and provided a more contemporary look-and-feel. The second version of the application was assessed through heuristic evaluation (Nielsen and Molich, 1990) by three User Experience (UX) experts. Based on the results of the evaluation, the third version of the application was implemented, featuring UI improvements and new functionality to support the user's spatial orientation in the visualised room.

Table 11 Evolution of the Data Centre 3D Visualization application in terms of supported functionality and interaction methods. For each version only the changes or newly introduced features are listed.

Functionality

Version 1:
• 3D visualization of the racks, following the metaphor of a room
• Orbit camera
• Filter displayed results by server attribute, criticality level, timestamp
• Change view by sparseness
• Navigation controls for zoom in/out and move forward/backward/right/left
• Anomalies detection and notification
• Close-up view of a selected rack
• Detailed information through charts for a specific unit of a rack

Version 2:
• Navigation controls for translation and rotation of the camera
• Play/pause of live data retrieval
• Lock/unlock interaction with gestures
• Data centre room selection

Version 3:
• Mini map
• Reset room view
• Units indexing in selected rack close-up view
• Markers, value labels, and value for the currently visualized time on the chart in the full screen mode

Interaction

Version 1:
• Mouse movement: raised pointer finger of one hand
• Mouse click: raised pointer finger of the one hand, closed fist which opens for the other hand

Version 2:
• Mouse click: raised pointer finger of the one hand, open fist which closes for the other hand (click in the air)
• Gestures to move and rotate the camera forward/backward/left/right/up/down

Version 3:
-


Figure 29 Evolution of the UI of the Data Centre room view across the three versions

Figure 30 Evolution of the UI of the selected rack close-up view across the three versions

Although the system had already been evaluated by experts and updated accordingly, an additional evaluation by users was required: even though heuristic evaluation finds many usability problems that are not found by usability testing, it is also possible to miss some problems which can be discovered only through usability testing (Nielsen, 1994). Heuristic evaluation and user testing are best employed in an iterative approach, carrying out the heuristic evaluation first and trying to eliminate as many problems as possible before involving users. The main reason for this is that a system is better evaluated by users when all known errors have been corrected, given that user-based testing requires many more resources (in terms of time, effort, and money) than heuristic evaluation. To this end, a usability evaluation with end-users has been performed on the improved final version of the system (version 3), in two stages:

• A typical observation experiment has been carried out, involving 20 users, aiming to assess the usability of the system and the gestures' learnability (Section 7.1)

• A UX assessment questionnaire has been delivered to 80 users, with the aim of evaluating their overall impression and subjective view regarding their experience with the system (Section 7.2)

The two evaluation approaches are complementary and each is aimed at assessing different attributes of the system. More specifically, the observation of users while using the system aimed at revealing potential usability problems, evaluating the effectiveness and efficiency of the system, assessing the gestures vocabulary and its learnability, as well as users' fatigue, and retrieving explicit user feedback about likes, dislikes and potential suggestions through a semi-structured interview. Another important parameter that should be studied for the system is that of the overall UX and users' feelings from their interaction with the system. These approaches, although complementary, were carried out separately, involving different users, since a user-based experiment with 100 users would require many more resources, and these additional resources would be spent without any actual benefit regarding the goals of the evaluation. Literature has indicated that involving 20 users in a usability testing evaluation will find 95% of the problems of the system (Faulkner, 2003).


7.1. Usability testing through observation

Usability testing involved twenty participants who used the system in a laboratory setup resembling, however, the envisioned context of use. In more detail, the system was deployed on a 60'' TV screen placed in a large room. Users were asked to sit on a chair, which had been placed at a convenient position so that the motion sensor would effectively detect the user's hand gestures. Minor amendments were made to the placement of the chair, if needed, to better serve each individual user, in case the system was not responsive enough (e.g., shorter users and users with smaller palms were required to sit a little closer to the sensor than taller users or users with larger palms). Three main hypotheses were formed and tested during the usability testing:

H1. The system can be used effectively in the envisioned context of use to serve the needs of monitoring a large data centre room
H2. The system does not impose physical strain on users
H3. Interaction with gestures is easy to learn

Taking into account that the system addresses everyday users, and that therefore users will be quite familiar with its features and functionality, the experiment comprised three phases, accomplished on three consecutive days for each user:

A. Acquaintance with the system. Each participant was individually introduced to the system, was shown all the functionality and interaction methods, and was allowed to explore it through free interaction. Each acquaintance session lasted for about twenty minutes.

B. Usability evaluation. Each user was asked to interact with the system following specific tasks, which were handed out one by one. Additional details about the tasks, the process that was followed during the test and the results obtained are provided in section 7.1.1 below. Each usability evaluation session lasted approximately thirty minutes, with minor differences between users.

C. Post-test. Users were asked to navigate, using gestures only, to specific targets marked on their screen, in order to assess the learnability of the gestures in terms of vocabulary and interaction efficiency. Additional details regarding the process and the results of the post-test session are provided in section 7.1.2 below. Each post-test session lasted approximately ten minutes, with minor divergences among users.

Participants were recruited with the aim of involving an as equal as possible number of users with small, medium, and high expertise in interacting with software systems through a motion sensor. Given that the 3D Data Centre Visualisation system does not address first-time users only, but is intended to be used on a regular basis by its target audience, it was important to involve experienced users as well. Users' demographic information with reference to their gender, age, and expertise in using motion sensors is described in Figure 31, Figure 32, and Figure 33.


Figure 31 Usability test participants’ gender distribution

Figure 32 Usability test participants’ age distribution

Figure 33 Usability test participants’ experience with the motion sensor


7.1.1 Usability evaluation session

During the usability evaluation session, the participants were asked to execute specific tasks while they were observed and notes were kept regarding specific problems, comments, the time on task, and the successful execution of the task. Once they completed the test scenario, they were asked about their impressions of the system following a semi-structured interview approach (Fylan, 2005). Finally, they were requested to rate the physical strain as they perceived it on a scale from 1 to 10. This section describes in more detail the tasks that were assigned to the users, the metrics that were recorded, the main contents of the semi-structured interview, the usability evaluation test process as it was followed, and the results and findings obtained.

Tasks

The usability evaluation session involved the execution of a series of tasks, in the order given by the evaluator. The tasks revolved around a specific user scenario, according to which the user is a system administrator working for a multinational corporation that houses a large data centre with more than 5,000 servers. Part of the user's daily work routine is to assess the servers' state and locate any functional problems, a task which is accomplished through the 3D Data Centre Visualisation application. The tasks that constituted the entire evaluation scenario were the following:

• Task 1: Looking at the overview of Data Centre Room 1, you notice that a specific rack, namely N37, seems to have an indication of an anomaly. Select rack N37 in order to find out more details about the problem that occurred.
• Task 2a (in N37 close-up view): Locate the server for which the anomaly alert was issued.
• Task 2b: What is the exact time that the problem was initiated?
• Task 2c: Seeing that the problem has already been resolved, you completed your inspection of the rack. Return to the room view, with all the racks of Data Centre Room 1.
• Task 3a: You would like to see the status of the servers regarding their temperature. Change the view of the displayed racks, so as to view racks by temperature.
• Task 3b: Filter the displayed racks, so as to view racks with servers in critical state.
• Task 4a: In a faraway location of the room (right back area) you notice that there are several racks with servers in critical state. Use gestures to navigate to the specific rack that will be indicated to you by the evaluator.
• Task 4b: Perform a 180° rotation and return back to the place you started from, until you can see no further racks.
• Task 4c: Reset the view of the room.
• Task 5a: In another faraway corner of the room (left back corner) you notice that there are several racks with anomaly indication. Use the navigation controls to navigate to the rack that will be indicated to you by the evaluator.
• Task 5b: Using again the navigation controls, perform a 180° rotation and return back to the place you started from, until you can see no further racks.

It should be noted that all tasks were designed so as to have a clear ending condition (e.g., find specific information, bring the room to a specific state, or perform a specific interaction) and that all participants were asked to navigate to the same racks for tasks 4a and 5a.

Metrics

The following metrics were recorded during the usability evaluation session, using appropriate recording sheets:

- Success in accomplishing the task, marked as (S) for success, (PS) for partial success, and (F) for failure.

- Time on task (in seconds) for all the tasks besides 2a and 2b, where this recording would not be meaningful. In more detail, in task 2a users were asked to locate the specific unit for which an alert was issued. Since the unit was not visible at once and users were required to scroll, and since each user might employ a different strategy for locating the server (some users first moved to the bottom and then to the top, while other users followed the inverse route), time would not be an illustrative performance metric. Likewise, the information requested for task 2b could be retrieved either by hovering over the alert or by clicking on it, therefore the time would depend on the interaction strategy employed by each user.

- Number of tries, for all the tasks that involved executing the mouse click gesture, i.e., keeping the pointer finger of the one hand raised while the closed fist of the other hand opens.

Table 12 indicates which of the aforementioned metrics were employed for each task. It should be noted that the dropdown controls require a composite interaction as follows: (i) click on the dropdown menu and (ii) select an option from the displayed menu. Therefore success and number of tries were recorded as composite metrics for tasks involving handling a dropdown menu, namely tasks 3a and 3b.

Table 12 Metrics employed per task (columns: Time, Success, Number of tries; rows: Tasks 1, 2a, 2b, 2c, 3a, 3b, 4a, 4b, 4c, 5a, 5b)

Additionally, handwritten notes were kept for users' comments and any suggestions provided by the users during their interaction with the system, as well as regarding the interaction flow, system errors, or other remarks that might lead to useful results.

Interview

Following the scenario-based execution of tasks, users were interviewed to elicit their comments regarding the system, following a semi-structured interview approach. Semi-structured interviews are conversations in which the topics of the discussion are known beforehand (therefore a preliminary set of questions can be prepared and used to guide the interview) but the conversation is free, and therefore it can vary between participants (Fylan, 2005). The questions they were asked were:

1. What is your general impression about the system that you tested?
2. What did you like the most about the system?
3. What did you like the least about the system?
4. On a scale from 1 (least strenuous) to 10 (most strenuous), please rate the physical strain that you think was imposed on you by using the system, where 1 is the minimum and 10 is the maximum strain.

All the provided answers were recorded by the evaluator through handwritten notes.

Process

As soon as the user arrived at the usability evaluation room, they were welcomed and introduced to the evaluator and facilitator. The evaluator was responsible for steering the session, handing out the tasks one by one to the user, providing all the explanations (if and when appropriate), keeping notes, and asking questions to the user. The facilitator was a technical person, responsible for setting up the system and assisting in adjusting the user's distance from the sensor if needed, as well as for recording the time required to complete a task, where appropriate. After welcoming the user, and since the user had already been introduced to the system during the acquaintance session held the previous day, the evaluator proceeded with explaining the purpose of the evaluation and the process that would be followed, as well as the recordings (e.g., time on task, number of tries) that would be made during each task. It was clearly explained to users that it was the system that was being tested and not the users themselves or their performance, and that they could stop the evaluation for any reason and at any time they wished during the experiment. Then, participants were handed two copies of the consent form to sign. Both copies were also signed by the evaluator, and one of them was returned to the participant. In summary, the informed consent form described the system, the purpose of the evaluation and the evaluation process. It also informed participants that their participation was voluntary and could be withdrawn at any time they wished, that they would not receive any reimbursement and that they would not have a direct benefit from the evaluation other than assisting in improving the system. In addition, it described how the participants' personal data are protected and provided contact details of the evaluator and the scientific advisor of the evaluation. Once the consent form had been signed by the participant, tasks were handed out one by one. The evaluator avoided answering questions and prompting the user as to how to proceed with carrying out the task. The only clarifications that were provided, if needed by any of the participants, were related to the description of the tasks themselves, in case they were not clear to the user. As the user proceeded with the execution of the tasks, the evaluator took notes on user comments, questions, and observations about the interaction that was taking place. Furthermore, the evaluator and the facilitator proceeded with the necessary metrics' recordings. The interview and debriefing session followed the execution of tasks, allowing users to describe their overall experience with the system. Finally, participants were thanked for their time and effort and reminded of their next-day post-test appointment.

Findings and results

Users turned out to be very successful at accomplishing the tasks they were assigned, based on the user success rate metric. Success rate is defined as "the percentage of tasks that users complete correctly" (Nielsen, 2001). For each task, the user's success is marked as (S) for success, (PS) for partial success, and (F) for failure. The total success rate is calculated by the formula:

Success Rate = (TS + (PS * 0.5)) / Number of attempts

where TS is the total number of successful attempts and PS is the total number of partially successful attempts. More specifically, for the evaluation of the 3D Data Centre Visualization application, the formula produces:

Success Rate = (247 + (5 * 0.5)) / 259 = 0.96

It should be noted that one participant was not given the last task to carry out, since the evaluator decided to stop the session because too much effort was required from this user to use the system and the user had expressed that her shoulder was in pain before starting the evaluation. Therefore, the total number of attempts was 259 and not 260 as expected, given that 13 subtasks in total were expected to be carried out by 20 users. The detailed analysis of users' success per task is presented in Figure 34, displaying user success per subtask. Successful task accomplishment is marked in green, partially successful accomplishment in orange and failure in red. It can be easily deduced that failures were noticed for two types of tasks: tasks requiring selecting an option from a drop-down menu (tasks 3aii and 3bii) and tasks requiring navigation in the virtual environment through gestures (tasks 4a and 4b).


Figure 34 Task success diagram for the user observation experiment

Task success was also examined as a factor that might be dependent on the user's gender, expertise with the motion sensor, or age. Initially, the success rate was calculated for every value of the independent variables of the experiment and is illustrated in Figure 35. In more detail, Figure 35 illustrates the total success rate, as well as the success rate per user gender (male vs. female users), experience in using the motion sensor (high, medium, small), and age (20-30, 30-40, 40-50, 50-60).

Figure 35 Success Rate: total success rate and success rate per gender, sensor experience and age for the user observation experiment


In order to proceed with statistical analysis and comparisons, success was also studied as a binary factor. To this end, all partially successful attempts and failures have been counted as failures. Table 13 illustrates for each task the actual success rate (%), as well as the Confidence Interval Lower Limit (CI [LL]) and Confidence Interval Upper Limit (CI [UL]), as these have been calculated with the adjusted Wald method, which is preferable for the given user sample (Sauro & Lewis, 2005). The Confidence Interval helps to better interpret our results in order to account for the fact that we have tested our system with a small user sample. For example, for task 1, given that 20 out of 20 users successfully completed the task, we can be 95% confident that the success rate for the general population is between 91% and 97%.

Table 13 Success Rate and Confidence Intervals per task for the user observation experiment

Task              1     2a    2b    2c    3ai   3aii  3bi   3bii  4a    4b    4c    5a    5b
Success Rate (%)  100   95    100   100   100   95    100   90    90    80    100   100   85
CI [LL]           0.91  0.74  0.91  0.91  0.91  0.74  0.91  0.68  0.68  0.57  0.91  0.91  0.63
CI [UL]           0.97  0.99  0.97  0.97  0.97  0.99  0.97  0.98  0.98  0.92  0.97  0.97  0.95

Furthermore, task success as a binary variable was analysed regarding its independence, using chi-square tests. More specifically, it was studied:

a. Whether task success and gender (male, female) are independent of one another. The relation between these variables was not significant (χ2 (1) = 0.071, p>.05)

b. Whether task success and user experience in using a motion sensor (high, medium, small) are independent of one another. Analysis of the results suggested that there is a statistically significant difference in task success depending on the user's experience with the motion sensor (χ2 (2) = 0.002, p<.05). An additional chi-square test including only the results of users with high and medium experience with the sensor indicated that there is no statistically significant difference in task success between these two user categories (χ2 (1) = 0.024, p>.05). Hence, it can be concluded that users with no or low experience with the employed interaction technology may achieve lower task completion rates.

c. Whether task success and user age group (20-30, 30-40, 40-50, 50-60) are independent of one another. Analysis of the success rates indicated that there is a statistically significant difference between the different user groups (χ2 (3) = 0.000015, p<.05). Two additional chi-square tests were carried out, one examining the success rates of users aged up to 50 years old and one studying the success rates of users aged up to 40 years old. The results indicated that a statistically significant difference in success rates is noticed when users of the three age groups (20-30, 30-40, 40-50) are studied (χ2 (2) = 0.0725, p<.05), while no statistically significant difference is observed when studying the success rates of users belonging to the 20-30 and 30-40 age groups (χ2 (1) = 0.0227, p>.05). A possible cause for finding statistically significant differences when users older than 40 years old are included in the examined user sample may be the fact that - according to the demographic information acquired for the participants - these users had no or low previous experience in interacting with systems employing a motion sensor.

Finally, to test whether there is a specific category of tasks for which users mostly fail, an analysis using the adjusted Wald method to determine confidence intervals was carried out, having the tasks grouped into the following clusters: (i) tasks requiring simple selections, (ii) tasks requiring handling drop-down menus, (iii) navigating in the virtual environment through gestures, and (iv) navigating in the virtual environment through the navigation buttons. Table 14 displays the success rate, lower limit and upper limit of the confidence interval for the aforementioned groups of tasks. It can be seen that tasks requiring a selection of a UI element (through the mouse-click gesture) have an extremely high success rate of 99.28%, and based on the statistical analysis we can be 95% confident that the success rate for the general population is between 95% and 99%. Selecting an option from a drop-down menu and navigating in the virtual world through the navigation buttons exhibited a success rate of 92.5%. Furthermore, for such tasks we can be 95% confident that the success rate for the general population will be between 79% and 98%. Navigation through gestures scored lower, with a success rate of 85% and CI [0.70, 0.93]. Therefore, even for navigating with gestures, which is the task group with the worst success rates, we can be 95% sure that at least 70% of users will be successful.

Table 14 Success Rate and Confidence Intervals per group of tasks for the user observation experiment

Tasks              Simple selection   Drop-down menu handling   Navigating with gestures   Navigating through the navigation buttons
Success Rate (%)   99.28              92.5                      85                         92.5
CI [LL]            0.95               0.79                      0.70                       0.79
CI [UL]            0.99               0.98                      0.93                       0.98

In summary, studying success rates has led to the following conclusions:

Conclusion 1: High success rates (at least 70%) can be achieved, even for the most difficult tasks, for any user population.

Conclusion 2: Having prior experience with the motion sensor affects user performance.

Taking into account that the actual system users will use the system regularly and will therefore soon become experienced in the interaction technology, it can be stated that H1 was successfully verified, i.e., the system can be used effectively in the envisioned context of use.
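As a reference for how the confidence limits in Table 13 and Table 14 are obtained, the sketch below computes a 95% adjusted Wald interval for an observed success count. Depending on the exact variant and rounding used by the authors, the resulting limits may differ slightly from the tabulated values; the class name and the example figures are illustrative.

```java
public class AdjustedWaldInterval {

    /**
     * 95% adjusted Wald confidence interval for a binomial proportion,
     * as recommended by Sauro & Lewis (2005) for small usability samples:
     * add z^2/2 successes and z^2 trials before applying the Wald formula.
     * The exact limits reported in Tables 13 and 14 may differ slightly,
     * depending on the variant and rounding used by the authors.
     */
    static double[] interval(int successes, int trials) {
        double z = 1.96;                               // 95% confidence
        double adjN = trials + z * z;                  // adjusted number of trials
        double adjP = (successes + z * z / 2) / adjN;  // adjusted proportion
        double halfWidth = z * Math.sqrt(adjP * (1 - adjP) / adjN);
        double lower = Math.max(0.0, adjP - halfWidth);
        double upper = Math.min(1.0, adjP + halfWidth);
        return new double[] { lower, upper };
    }

    public static void main(String[] args) {
        // Example: 17 of 20 users succeed in a task (85% observed success).
        double[] ci = interval(17, 20);
        System.out.printf("95%% CI: [%.2f, %.2f]%n", ci[0], ci[1]);
        // Roughly reproduces the interval reported for task 5b in Table 13.
    }
}
```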

Users' efficiency in using the evaluated system was measured through time-on-task and the number of tries required in order to complete the task. Table 15 includes data regarding time-on-task for those tasks for which time was meaningful to measure, as explained earlier. A graphical representation of users' time-on-task is provided in Figure 36, where bars represent the average time on task, error bars the standard deviation and red dots the median value of time required for each task.

Table 15 Time-on-task data for the user observation experiment

Tasks                      1      2c     3a     3b     4a     4b     4c     5a     5b
Mean (sec)                 12.95  8.35   25.52  17.83  38.22  51.81  6.92   38.9   49.88
Std. Error                 1.46   1.46   2.77   2.68   4.80   5.63   1.18   4.15   4.73
Median (sec)               12     7      22     17     35     45.5   5      32.5   45
Std. Deviation (sec)       6.54   6.53   12.10  11.37  20.36  22.53  5.31   18.57  19.50
Minimum (sec)              5      2      10     6      17     23     2.5    20     30
Maximum (sec)              30     28     55     60     100    105    23     90     110
Confidence Level (95.0%)   3.063  3.056  5.833  5.657  10.127 12.005 2.488  8.693  10.027
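The "Confidence Level (95.0%)" row in Table 15 is the half-width of a 95% confidence interval around the mean. A minimal sketch of that computation is shown below, assuming n = 20 observations per task (consistent with the reported standard errors) and the corresponding critical t-value; the class name is hypothetical.

```java
public class TimeOnTaskMargin {

    // Half-width of a 95% confidence interval for the mean:
    // margin = t(0.975, n-1) * sd / sqrt(n)
    static double margin95(double sd, int n, double tCritical) {
        return tCritical * sd / Math.sqrt(n);
    }

    public static void main(String[] args) {
        double tCritical19 = 2.093;  // two-sided 95% critical value for df = 19
        // Task 1 from Table 15: standard deviation of 6.54 seconds, n = 20.
        double margin = margin95(6.54, 20, tCritical19);
        System.out.printf("95%% margin for task 1: %.3f s%n", margin);
        // Prints about 3.06, matching the 3.063 reported in the table.
    }
}
```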


Figure 36 Mean time-on-task for the user observation experiment. Error bars represent the standard deviation, while red dots represent the median value of time for each task.

It is evident that the most time-demanding tasks are those which involve navigation in the virtual environment, either through gestures (tasks 4a, 4b) or through the navigation buttons (tasks 5a, 5b). The most efficient interactions appear to be the ones requiring a simple selection of a UI element (tasks 1, 2c, 4c), which, taking into account the data reported in Table 15, we can be 95% confident will not require more than 16 seconds for the general population. Selecting an option from a drop-down menu (tasks 3a, 3b) was a more composite task, requiring users to first select the drop-down menu and then choose one of the displayed options. With reference to the confidence interval presented in Table 15, it is 95% certain that any user will manage to accomplish such a task in 31.5 seconds at most. Focusing on tasks that require navigation in the virtual world, it can be concluded that we are 95% confident that any user in the general population will require at most:

- 48.5 seconds to traverse the entire room (tasks 4a, 5a), which is the time to navigate in the room using gestures (the corresponding time using the navigation buttons is 47.6 seconds)

- 63.8 seconds to perform a manoeuvre such as the 360° turn required in tasks 4b and 5b and traverse the entire room, which is the time for performing the task through gestures, while the corresponding task carried out with the navigation buttons will require at most 60 seconds

Although the goal of the evaluation was by no means to compare navigation through gestures with navigation through the corresponding UI buttons, the acquired data suggested that the two are almost equivalent in terms of users' efficiency. To further explore this indication, a two-sample t-test was conducted to compare the time-on-task (a computational sketch follows the list below):

a. for traversing the entire room using gestures and using the UI buttons, which showed no significant difference between navigation through gestures (M=38.22, SD=20.36) and through UI buttons (M=38.9, SD=18.57); t(36)=-0.1, p=0.91

b. for making a 360° turn and also traversing the entire room using gestures and using the UI buttons, which also showed no significant difference between navigation through gestures (M=51.81, SD=22.53) and through UI buttons (M=49.88, SD=19.50); t(31)=0.26, p=0.79
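As a cross-check, the statistic reported for comparison (a) can be reproduced directly from the summary statistics with a standard two-sample t-test; the group sizes (18 gesture completions and 20 button completions) are assumptions chosen to be consistent with the reported 36 degrees of freedom.

from scipy.stats import ttest_ind_from_stats

# Student's two-sample t-test computed from means, SDs and (assumed) sample sizes.
t_stat, p_value = ttest_ind_from_stats(mean1=38.22, std1=20.36, nobs1=18,
                                        mean2=38.90, std2=18.57, nobs2=20,
                                        equal_var=True)
print(round(t_stat, 2), round(p_value, 2))   # approximately -0.11 and 0.91, in line with t(36)=-0.1, p=0.91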


With the aim of exploring the impact that user gender, expertise in using the motion sensor and user age may have on time-on-task, additional tests were carried out. More specifically, a paired two-sample t-test was conducted to compare the time-on-task between male and female users. No significant difference was found between male (M=26.4, SD=15.59) and female users (M=29.63, SD=20.69); t(8)=-1.1, p=0.3. In order to assess the effect of the level of experience with the motion sensor on time-on-task, a one-way ANOVA was conducted (a computational sketch is provided after the next paragraph); it showed that the effect of prior experience with the motion sensor on time-on-task was not statistically significant, F(2,24)=0.16, p=.851. In addition, a one-way ANOVA was conducted to explore the effect of age on time-on-task, indicating that user age does not have a statistically significant impact on time-on-task, F(3,32)=0.27, p=.845. In summary, therefore, user gender, expertise in using the motion sensor and age were not found to have a statistically significant impact on time-on-task. This may seem contradictory to previous results indicating that user age and prior usage experience with the motion sensor have a statistically significant impact on task completion rate. Nevertheless, it does not actually contradict previous findings, since time-on-task was measured only for participants who successfully completed a task.

Another metric that was recorded is the number of tries required to carry out a selection in tasks involving the mouse click action. Table 16 presents the data acquired for these tasks. It is observed that tasks can be achieved with a small number of tries, while we can be 95% confident that the general population would require at most three tries to successfully make a selection.
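As referenced above, the one-way ANOVA on time-on-task by prior experience with the motion sensor can be sketched as follows. Since the raw per-user times are not reproduced in this deliverable, the three experience groups are filled with simulated placeholder values; group sizes of nine are an assumption consistent with the reported F(2, 24), and only the structure of the test reflects the analysis described above.

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Placeholder per-user mean times (seconds) for the three experience groups;
# these are simulated values, not project data.
high_exp   = rng.normal(loc=27, scale=15, size=9)
medium_exp = rng.normal(loc=28, scale=16, size=9)
low_exp    = rng.normal(loc=29, scale=17, size=9)

f_stat, p_value = f_oneway(high_exp, medium_exp, low_exp)
print(f_stat, p_value)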

Table 16 Number of tries data for the user observation experiment

Task                       1      2b     2c     3ai    3aii   3bi    3bii   4c
Mean                       2      1.47   1.95   1.63   1.93   1.29   1.70   1.57
Std. Error                 0.24   0.28   0.38   0.21   0.17   0.16   0.18   0.23
Median                     2      1      1      1      2      1      2      1
Std. Deviation             1.05   1.26   1.70   0.95   0.68   0.68   0.77   1.01
Minimum                    1      1      1      1      1      1      1      1
Maximum                    4      5      8      4      3      3      3      4
Confidence Level (95.0%)   0.508  0.608  0.795  0.460  0.362  0.352  0.396  0.490

In summary, studying time-on-task and the number of tries required to achieve specific tasks, the following inferences can be derived:

Conclusion 3: Tasks requiring the selection of a UI element through the mouse click gesture can be carried out efficiently, as we are 95% confident that it will take users of the general population at most 16 seconds and 3 tries to achieve a selection.

Conclusion 4: Tasks requiring selecting an option from a drop-down menu are carried out less efficiently, as we are 95% confident that users of the general population will require at most 31.5 seconds to accomplish them.

Conclusion 5: Navigating from one end of the visualised room to the other, with or without making any manoeuvres, is the most time-consuming task.

Specific observations and usability problems related to the above conclusions are reported below. During the interview, users were first asked to explain what they liked most as well as what they liked least about the system. The reported dislikes usually stemmed from problems that they faced during their interaction with the system, from issues that did not prevent them from achieving the task but imposed some difficulty nonetheless, or from simple aesthetic preferences. Users' reported likes and dislikes were initially filtered to create one unified set of likes and dislikes and eliminate duplication. They were then combined with the evaluators' observations and organized in a table with cross-references to the users involved, in order to locate any outliers. Finally, the reported dislikes were assessed as to their impact, examining:

a. the number of users who reported the issue
b. the type of the issue (e.g., personal aesthetics, functionality problem, usability problem, etc.)
c. the frequency with which the issue occurs
d. the impact of the issue if it occurs, assessing whether it will be easy or difficult for the users to overcome
e. the persistence of the problem, assessing whether users can overcome it once they know about it or will repeatedly be bothered by it

In summary, the most appreciated features of the system were reported to be:

1. The 3D representation, which was engaging and offers benefits that cannot be provided by 2D visualizations.
2. Navigation in the virtual world with gestures, which was mentioned as a like by 45% of the participants.
3. The overall user interface, which was perceived as nice and intuitive.
4. The mini map, which facilitated the user's orientation in the virtual world.
5. The criticality filters, allowing users to easily locate units in critical situations.

Dislikes were reported in a more sporadic manner, with each user usually reporting as dislikes the features that made it difficult for them to complete the tasks. These dislikes have been combined with the evaluator's observations and are reported as problems in Table 18 below. Other dislikes that were reported:

1. Using the system might require time, which will be limited in critical situations. Although this was mentioned as a dislike by only one user, more users provided suggestions along this direction, i.e., as to how to improve locating servers in critical state and interacting with them; these are reported next.
2. The gesture for making a turn was rather slow.
3. A lag noticed in gestures imposed difficulties in interaction and was reported by a few users as a system attribute that they disliked.
4. The surrounding space is vast; instead, the boundaries of the virtual world should be confined. This dislike was also reported by a single user. Although users weren't allowed to move outside the visualised area, this user found the absence of walls frightening. Adding walls to the visualised data centre room was also provided as a suggestion by one more user, although not reported as a dislike. Future designs should explore whether adding walls would improve the overall user experience.

In addition, during the interviews users were asked to rate their perceived fatigue on a scale from 1 to 10, where 1 stands for no fatigue at all and 10 represents major physical strain and user fatigue. Figure 36 illustrates the fatigue level per participant. The mean fatigue level over all users was 3.31 (SD=1.6, 95% CI [2.54, 4.08]). In addition, the fatigue level was examined in comparison with the other effectiveness and efficiency metrics recorded for each participant, as shown in Table 17. Furthermore, we analysed whether there is a relationship between the fatigue level and the users' success rate, between the fatigue level and the total time required for each user to complete the task scenario, as well as between the fatigue level and the total number of tries for selecting UI elements via the mouse click gesture. Based on the results of this analysis, the fatigue level reported by the participants is related to the total time required to complete the task scenario (r=.444, p<.05).
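This correlation can be recomputed from the per-participant figures in Table 17; the sketch below assumes Pearson's r as the correlation measure and reproduces the reported value.

from scipy.stats import pearsonr

# Total scenario time (s) and reported fatigue level per participant, from Table 17.
total_time = [261, 249, 286, 253, 234, 279, 232, 260, 198, 99,
              161, 315, 273, 350, 207, 248, 165, 188.5, 198, 305]
fatigue    = [5, 2, 3, 2, 7, 3, 2, 4, 3, 1,
              2, 6, 2, 3, 3, 5, 3, 3, 3, 6]

r, p_value = pearsonr(total_time, fatigue)
print(round(r, 3), round(p_value, 3))   # approximately 0.44, p < .05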


Figure 37 Fatigue level per participant for the user observation experiment

Table 17 Success rate, total time, total number of tries, and fatigue level per participant for the user observation experiment

          Success Rate (%)   Total time (sec)   Total number of tries   Fatigue level
User 1    100                261                22                      5
User 2    100                249                7                       2
User 3    100                286                8                       3
User 4    100                253                9                       2
User 5    76.92              234                5                       7
User 6    100                279                15                      3
User 7    100                232                19                      2
User 8    100                260                13                      4
User 9    92.30              198                12                      3
User 10   61.53              99                 9                       1
User 11   100                161                11                      2
User 12   100                315                8                       6
User 13   100                273                13                      2
User 14   100                350                11                      3
User 15   100                207                10                      3
User 16   92.3               248                13                      5
User 17   76.92              165                28                      3
User 18   100                188.5              9                       3
User 19   100                198                14                      3
User 20   100                305                12                      6

Users mostly reported that any fatigue they felt was due to the mouse click gesture, which required them to keep their fingers spread apart. Based on the analysis of the fatigue data, the following can be inferred, which confirms our hypothesis (H2) that the system does not impose physical strain on users:

Conclusion 6: The usage of the system does not impose major physical strain on the users. Analysis of the data acquired indicates that we can be 95% confident that the general population will report a fatigue level of at most 4.08 on a scale from 1 to 10.


Conclusion 7: A positive relation was found between the amount of time a user spends using the system and the fatigue level they report, especially when users are required to perform the mouse click gesture often.

Based on the evaluator's observations as well as on the issues reported as dislikes by the users themselves, a number of UX problems were identified; they are listed in Table 18 below, organized in categories and rated according to their severity. Severity ratings are as follows:

• Minor usability problem, which can be easily overcome: [1]
• Major usability problem, which can be overcome once users know about it: [2]
• Critical usability problem, a showstopper, or a problem which persistently causes major dissatisfaction to users: [3]

Table 18 Identified usability problems

1. Jittering of the visualised mouse pointer. (Category: Hand tracking – Severity: 1)

2. The system might occasionally stop recognizing user gestures due to users' hand posture. In more detail, it was observed that users' anatomic details would sometimes result in erroneous gesture recognition. Two specific problems were observed:
   a. When trying to carry out the mouse click gesture, some users closed their fist, but not tightly enough, resulting in a protruding finger. As a result, the system recognized this hand posture as a raised index finger, which corresponds to the gesture for moving the mouse cursor.
   b. When carrying out gestures to navigate in the virtual world, certain users sometimes did not keep their fingers apart, resulting in false recognition. In certain cases the problem had to do with the user's hand anatomy or with each person's ability to stretch the fingers to the optimal distance. In addition, some users would automatically and unconsciously relax their fingers after prolonged use of the system.
   In both cases, once users were informed as to how to correct the false recognition, they tried to adapt their gestures. (Category: Hand tracking – Severity: 2)

3. Selecting an option from a drop-down menu was difficult, since the user had to keep their pointer finger quite steady and perform the mouse select gesture without losing the pointed target; it therefore required coordination between the user's hands and eyes. Moving the pointer finger would in some cases move the focus out of the drop-down menu, forcing the user to carry out the task all over again. This could be resolved by enlarging the clickable area surrounding the drop-down menu. (Category: UI – drop-down menus – Severity: 2)

4. Controlling a vertical slider also imposed difficulty on users, since they had to hold the pointer finger steady for a prolonged time. As a solution to this problem, sliders could be handled either via drag and drop, or by directly clicking the point the user wishes to move the slider to. (Category: UI – sliders – Severity: 1)

5. The criticality slider confused some users, since it allows users to filter out some options and view racks with units whose criticality level falls in a specific range, of which only the lower limit is defined by the user. As a result, it was mentioned that it would not be possible to view only servers in good state. (Category: UI – sliders – Severity: 1)

6. The button to activate/deactivate gesture recognition looked more like an indicator of the current state than an actionable button. (Category: UI – Severity: 1)


Other observations related to how users interacted with the system were the following:

- In certain tasks, one hand would cross over the other, when for instance a right-handed user had to select something on the left side of the screen. This would require the user to first point at the item, extending their right hand to the left, while moving the left hand under the right one in order to perform the open palm select gesture in front of the motion sensor camera. Given that the system does not require a specific hand to be used as the pointer (the left hand can be used for pointing and the right hand for selecting whenever needed), during the training session users were reminded that they could interchange their hands accordingly and were advised to preferably use their left hand as the pointer for items located in the left half of the screen. The majority of users did not exhibit any difficulties during the test; however, it was noticed that five users did not remember to make this hand change by themselves and therefore had difficulty accurately selecting targets placed in the left part of the screen. Once users were reminded of this feature, they found it more comfortable to carry out the requested task. It is expected that long-term users of the system will be more familiar with such features and will exploit them to increase their efficiency.

- The user's height affected how the sensor recognized their gestures. Taller users had to increase the distance of the chair from the sensor, while shorter users had to decrease it. This is not expected to affect user interaction in the envisioned context of use, as users can easily find the appropriate distance without any impact on their performance when using the system.

- All the users remembered the gestures that they had to carry out in order to navigate in the virtual world.

Finally, during the interview users provided several suggestions as to how to improve specific features of the system, or regarding features that were missing and would be desirable in such a system. Regarding the 3D representation of the virtual world, some users noted that the representation of the aisles of racks is not realistic, since racks are usually grouped with no space between them. Furthermore, additional room attributes could also be represented in the virtual world, such as the door of the room or walls, to assist orientation. One more suggestion to assist users' orientation in the virtual world was to include a grid on the floor with letters and numbers; on hovering over a rack, a highlight on the grid row and column would help users map the rack letter and number to the corresponding row and column. Finally, features that the users would like to see included in such a system were:

• direct access to a list with all the problems, so that the user can first focus on them and resolve them
• a functionality for directly contacting other people in order to assign them problems to look into
• a "teleportation" functionality for directly navigating to a specific area (e.g., an area with racks including many units in critical state)

1.1.2 Post-test

The goal of the post-test session was to assess navigation in the virtual world through gestures, in order to evaluate the learnability of the gestures in terms of vocabulary and interaction efficiency. To this end, five targets were marked in the virtual world with numbers from 1 to 5 (see Figure 37) and users were asked to navigate from one target to the next, flying through the number itself. The task was considered successful as soon as a user passed through the number. In addition to task success, the required time was recorded. Once the users completed the task of navigating in the virtual world, they were asked if they had any additional comments or suggestions to provide and were thanked for their participation. All the comments and suggestions retrieved from the post-test session have been included in the results reported in section 1.1.1.


Figure 38 Numbered targets to be reached consecutively by navigating in the virtual environment with gestures during the post-test evaluation

It should be noted that all the users remembered, without any assistance or requests for help, the gestures that they had to employ in order to navigate in the virtual world, both during the observation session and the post-test session. Thus, it can be concluded that the employed gestures are easy to learn.

Conclusion 8: The gestures' vocabulary is easy to learn.

During the test, one of the users, although navigating correctly with the appropriate gestures, failed to understand how to "pass through" the marks. Although she went by the marks, she did so at a lower height, requiring therefore less time than other users to move from one target to the other. Also, one user had tendonitis and the evaluator stopped her in order not to aggravate the injury. Therefore, the results from these two users have been removed as invalid from the result set analysed in this section. The metrics recorded for the tasks of the post-test experiment include whether the user was successful, the time required to complete the task, as well as the fatigue level as reported by the users themselves. Table 19 presents for each task the actual success rate (%), as well as the Confidence Interval Lower Limit (CI [LL]) and Confidence Interval Upper Limit (CI [UL]), as calculated with the adjusted Wald method.

Table 19 Success Rate and Confidence Intervals per task for the post-test experiment

Task               1      2      3      4      5
Success Rate (%)   100    100    94.44  94.44  94.44
CI [LL]            0.845  0.845  0.723  0.723  0.723
CI [UL]            1      1      0.999  0.999  0.999

Furthermore, task success was analysed regarding its independence, using chi-square tests (a sketch of such a test is given after the list below). More specifically, it was studied:

a. Whether task success and gender (male, female) are independent of one another. The relation between these variables was statistically significant (χ2(1) = 0.027, p<.05).

b. Whether task success and user experience in using a motion sensor (high, medium, small) are independent of one another. Analysis of the results suggested that there is a statistically significant difference between task success and the user's experience with the motion sensor (χ2(2) = 0.04, p<.05). Studying the actual results indicated that failures occurred only for users belonging to the group with small experience in the interaction method.

c. Whether task success and user age group (20-30, 30-40, 40-50, 50-60) are independent of one another. Analysis of the success rates indicated that there is a statistically significant difference between the different user groups (χ2(3) = 0.000, p<.05). Examination of the data revealed that failures were marked only for users belonging to the oldest group (50-60), who also had no previous experience with the motion sensor.
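As referenced above, the following sketch illustrates a chi-square test of independence of the kind applied here. The 2x2 contingency table is entirely hypothetical, since the deliverable reports only the resulting statistics, not the underlying counts.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical success/failure counts per gender (rows: male, female) -- placeholder values.
contingency = np.array([[45, 0],
                        [40, 5]])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(chi2, p_value, dof)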

Although the statistical analysis suggested that task success can be affected by the user's gender, experience with the motion sensor, and age, these results cannot be adopted without further examination, since, having excluded two users from our sample, failures were recorded only for one out of the remaining eighteen participants. This participant was a female user, with low experience with the Kinect sensor, in the age group 50-60. As a result, based on the aforementioned data and statistical analysis, the following conclusion can be derived.

Conclusion 9: High success rates (at least 72%) can be achieved for navigating in the virtual world through gestures for any user population, with 95% confidence.

Data regarding time-on-task are presented in Table 20, while Figure 38 illustrates the average time-on-task for each one of the tasks, the median value of time for each task, as well as error bars depicting the standard deviation.

Table 20 Time-on-task data for the post-test experiment

Task                       1       2      3       4      5
Mean (sec)                 42.25   27.36  31      31.64  24.97
Std. Error                 9.08    4.46   4.82    4.36   4.58
Median (sec)               26.5    23.25  22      26.5   18
Std. Deviation (sec)       38.53   18.95  19.89   17.99  18.89
Minimum (sec)              16      7      9       8      6
Maximum (sec)              183     75     79      70     70
Confidence Level (95.0%)   19.163  9.423  10.226  9.250  9.714

Figure 39 Mean time-on-task for the post-test experiment. Error bars represent the standard deviation, while red dots represent the median value of time for each task.


Although the markers were placed at equal distances, and reaching the targets required making manoeuvres for all targets except the first one, it was noticed that reaching the first target required on average more time than the others and exhibits the largest standard deviation. In fact, the statistical analysis of the data suggests that we can be 95% confident that, for the general population:

- Reaching the first target will require at most 61.41 seconds
- Moving from the first target to the second will require at most 36.78 seconds
- Moving from the second target to the third will require at most 41.226 seconds
- Moving from the third target to the fourth will require at most 40.89 seconds
- Moving from the fourth target to the fifth will require at most 34.68 seconds

Therefore, it is evident that users' efficiency improves over time.

Conclusion 10: Users' efficiency in using the gestures to navigate in the virtual world improves over time.

It should be noted that a comparison of users' performance in the post-test tasks with tasks 4a and 4b of the observation experiment, which also required gesture-based navigation in the virtual world, would not be meaningful: in the observation experiment users were not asked to "fly" at a specific height in the world, so the conditions of the two experiments were different and cannot be compared.

1.2 User Experience assessment

The User Experience Questionnaire (UEQ) (Laugwitz, Held, & Schrepp, 2008), also available in the Appendix, was used to capture user impressions. The system was installed and extensively demonstrated to the participants of a full-day workshop organized by the project partners. This questionnaire allows a quick assessment of the user experience of interactive systems, as it supports users in immediately expressing feelings, impressions, and attitudes that arise after their interaction with a system. It contains the following 6 scales with 26 items:

• Attractiveness: Overall impression of the system. Do users like or dislike it?
• Perspicuity: Is it easy to get familiar with the product? Is it easy to learn how to use the product?
• Efficiency: Can users solve their tasks without unnecessary effort?
• Dependability: Does the user feel in control of the interaction?
• Stimulation: Is it exciting and motivating to use the product?
• Novelty: Is the product innovative and creative? Does it catch the interest of users?

Each item in the questionnaire has the form of a semantic differential, i.e., each item is represented by two terms with opposite meanings, for example: Attractive – Unattractive. The order of the terms is randomized per item, i.e., half of the items of a scale start with the positive term and the other half start with the negative term, and a seven-stage scale is used to reduce the well-known central tendency bias for such types of items. The items are thus scaled from -3 to +3, with -3 representing the most negative answer, 0 representing a neutral position, and +3 representing the most positive answer. Example:

attractive unattractive

This response would mean that you rate the application as more attractive than unattractive.


Of the 6 scales, Attractiveness is a pure valence dimension, while Perspicuity, Efficiency, and Dependability are pragmatic quality aspects (task and goal oriented), and Stimulation and Novelty are hedonic quality aspects (not task or goal oriented).

Process

During the workshop, the Data Centre 3D Visualization application was demonstrated and presented in detail to the participants. Participants were then allowed to have a short interaction with it and were handed the UEQ questionnaire to fill in.

Results

A total of 80 people answered the questionnaire. However, during the analysis, questionnaires whose responses showed inconsistency in more than 2 scales were factored out. This was done by checking how much the best and worst evaluation of an item in a scale differ. A big difference (>3) is seen as an indicator of a problematic data pattern, which can be the result of random response errors or of a misunderstanding of an item. Regardless of the reason, these questionnaires were left out of the analysis, so the final number of questionnaires used for the analysis was 60. The results from the valid questionnaires are depicted in Table 21, Table 22 and Figure 39 below.

Table 21 and Figure 39 show the mean score per scale over all valid questionnaires. Values between -0.8 and 0.8 represent a neutral evaluation of the corresponding scale, values above 0.8 represent a positive evaluation, and values below -0.8 represent a negative evaluation. As shown, the system scored higher on the attractiveness and hedonic quality scales than on the pragmatic ones. More specifically, the Attractiveness, Stimulation, and Novelty scales all received positive scores from 1.8 to 2.0. These scores are an indication that the participants found the showcased system to be innovative, engaging, and pleasant. On the other hand, the pragmatic scales, Perspicuity, Efficiency, and Dependability, received lower scores than the hedonic ones. Efficiency and Dependability received scores of 1.3 and 1.2 respectively, which are positive values above the 0.8 neutral boundary, while Perspicuity (learnability) received a score of 0.77, which is the lowest score of all the scales and just below the 0.8 neutral boundary.

Even though it is impossible to know exactly why such a difference between the hedonic and the pragmatic scales was observed, it can be partially explained by the fact that the Data Centre 3D Visualization system uses innovative technologies (3D modelling) and user interaction modes (hand gestures) that are foreign to the average user in the context of interacting with a regular software application. As a result, it is understandable to receive mixed feelings and impressions from the users. Uncertainty and insecurity with respect to how easily such a system can be learned and used in a dependable way is natural when dealing with new technological paradigms. However, the user-based evaluations, during which the participants had much longer exposure to the system and were given some training prior to using it, showed that users improved rather quickly in using the hand gestures to navigate the 3D user interface of the system and in interacting with the menus.
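The scoring and consistency filtering described above can be summarised in the following sketch. The item-to-scale assignment follows Table 22; the set of reverse-coded items is inferred from the left/right terms listed there (items whose positive pole is printed on the left), and the answer matrix itself is a placeholder rather than workshop data.

import numpy as np

SCALES = {
    "Attractiveness": [1, 12, 14, 16, 24, 25],
    "Perspicuity":    [2, 4, 13, 21],
    "Efficiency":     [9, 20, 22, 23],
    "Dependability":  [8, 11, 17, 19],
    "Stimulation":    [5, 6, 7, 18],
    "Novelty":        [3, 10, 15, 26],
}
# Items whose positive term appears on the left (e.g. 3 creative-dull, 12 good-bad),
# as inferred from Table 22; these are flipped so that +3 is always the positive pole.
REVERSED = {3, 4, 5, 9, 10, 12, 17, 18, 19, 21, 23, 24, 25}

def item_scores(answers):
    # Map raw ticks 1..7 (sequence of 26 values) to scores in -3..+3.
    scores = np.asarray(answers, dtype=float) - 4.0
    for item in REVERSED:
        scores[item - 1] *= -1
    return scores

def is_consistent(scores, max_suspicious_scales=2):
    # Keep a questionnaire unless the best-worst item gap exceeds 3 on more than two scales.
    gaps = [np.ptp(scores[[i - 1 for i in items]]) for items in SCALES.values()]
    return sum(g > 3 for g in gaps) <= max_suspicious_scales

def scale_means(raw_answers):
    # raw_answers: array of shape (n_respondents, 26); returns the mean score per UEQ scale.
    scored = np.array([item_scores(r) for r in raw_answers])
    kept = scored[np.array([is_consistent(s) for s in scored])]
    return {name: float(kept[:, [i - 1 for i in items]].mean())
            for name, items in SCALES.items()}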

Table 21 Means per scale

UEQ Scale        Mean
Attractiveness   1.808
Perspicuity      0.767
Efficiency       1.321
Dependability    1.217
Stimulation      1.988
Novelty          2.013


Figure 40: Means per scale diagram

Table 22 Scores per item

Item  Mean  Variance  Std. Dev.  No.  Left                Right                       Scale
1     1.1   3.5       1.9        60   annoying            enjoyable                   Attractiveness
2     1.8   1.6       1.3        60   not understandable  understandable              Perspicuity
3     1.5   1.6       1.3        60   creative            dull                        Novelty
4     -1.0  2.6       1.6        60   easy to learn       difficult to learn          Perspicuity
5     2.1   1.0       1.0        60   valuable            inferior                    Stimulation
6     1.8   1.0       1.0        60   boring              exciting                    Stimulation
7     1.9   1.5       1.2        60   not interesting     interesting                 Stimulation
8     0.6   1.3       1.1        60   unpredictable       predictable                 Dependability
9     1.3   0.9       1.0        60   fast                slow                        Efficiency
10    2.1   1.1       1.0        60   inventive           conventional                Novelty
11    1.6   1.2       1.1        60   obstructive         supportive                  Dependability
12    2.0   1.1       1.1        60   good                bad                         Attractiveness
13    1.1   2.3       1.5        60   complicated         easy                        Perspicuity
14    1.9   1.3       1.1        60   unlikable           pleasing                    Attractiveness
15    2.2   0.9       0.9        60   usual               leading edge                Novelty
16    1.9   0.9       0.9        60   unpleasant          pleasant                    Attractiveness
17    1.1   1.2       1.1        60   secure              not secure                  Dependability
18    2.2   1.0       1.0        60   motivating          demotivating                Stimulation
19    1.7   1.3       1.1        60   meets expectations  does not meet expectations  Dependability
20    1.8   0.8       0.9        60   inefficient         efficient                   Efficiency
21    1.2   2.5       1.6        60   clear               confusing                   Perspicuity
22    0.9   2.6       1.6        60   impractical         practical                   Efficiency
23    1.4   1.7       1.3        60   organized           cluttered                   Efficiency
24    2.1   0.8       0.9        59   attractive          unattractive                Attractiveness
25    1.9   1.0       1.0        60   friendly            unfriendly                  Attractiveness
26    2.3   0.9       0.9        60   conservative        innovative                  Novelty

1.3 Conclusions

The usability evaluation of the Data Centre 3D Visualization application followed a two-step approach, with the aim of assessing different attributes of the system:

1. A usability evaluation experiment, involving 20 users, in order to evaluate the effectiveness and efficiency of the users while using the system, as well as the learnability of the gestures’ vocabulary and interaction, and find out whether the system imposes any physical strain on the user.


2. A User Experience evaluation, carried out by demonstrating the system in a workshop and handing a questionnaire to the workshop participants in order to retrieve their feedback classified in six main categories, namely attractiveness, perspicuity, efficiency, dependability, stimulation, and novelty. Questionnaires were retrieved from 80 workshop participants and, after filtering out potentially biased answers, 60 valid questionnaires were included in the final UX analysis.

Analysis of the results of the usability evaluation experiment led to the following conclusions:

1. High success rates (at least 70%) can be achieved even for the most difficult tasks for any user population.
2. Having prior experience with the motion sensor affects user performance.
3. Tasks requiring the selection of a UI element through the mouse click gesture can be carried out efficiently, as we are 95% confident that it will take users of the general population at most 16 seconds and 3 tries to achieve a selection.
4. Tasks requiring selecting an option from a drop-down menu are carried out less efficiently, as we are 95% confident that users of the general population will require at most 31.5 seconds to accomplish them.
5. Navigating from one end of the visualised room to the other, with or without making any manoeuvres, is the most time-consuming task.
6. The usage of the system does not impose major physical strain on the users. Analysis of the data acquired indicates that we can be 95% confident that the general population will report a fatigue level of at most 4.08 on a scale from 1 to 10.
7. A positive relation was found between the amount of time a user spends using the system and the fatigue level they report, especially when users are required to perform the mouse click gesture often.
8. The gestures' vocabulary is easy to learn.
9. High success rates (at least 72%) can be achieved for navigating in the virtual world through gestures for any user population.
10. Users' efficiency in using the gestures to navigate in the virtual world improves over time.

Therefore, it can be concluded that the following hypotheses are confirmed:

H1. The system can be used effectively in the envisioned context of use to serve the needs of monitoring a large data centre room.
H3. Interaction with gestures is easy to learn.

Regarding the second hypothesis, H2, that the system does not impose physical strain on users, it is 95% certain that users would rate their fatigue as average at most.

Furthermore, the usability evaluation revealed six usability problems, four of which were classified as minor and two as major. The two major problems identified were that the system might occasionally stop recognizing user gestures, due to users' hand posture and hand anatomic details, and that the handling of drop-down menus could be improved in terms of efficiency. In addition, the most appreciated features of the system were identified, including the 3D representation, navigation in the virtual world with gestures, the overall user interface, the mini map and the criticality filters. Furthermore, the users' major dislikes were reported, mostly originating from the problems they faced during their interaction with the system. Finally, a number of suggestions regarding features that users would like to see added to the system were acquired, namely direct access to a list with all the problems, a functionality for directly contacting other people to assign them problems to look into, as well as a "teleportation" functionality for directly navigating to a specific area.

Finally, the UX assessment indicated that participants found the showcased system to be innovative, engaging, and pleasant. Efficiency and dependability also scored positive results, indicating that participants felt in general that they could solve their tasks with the product without unnecessary effort and that they would feel in control of the interaction. The lowest score was received for learnability, indicating that participants had concerns as to how easy it would be for a user to get familiar with the product. However, during the usability evaluation experiment the system scored high in learnability. This difference in results is well justified, since the workshop participants had no previous experience with the system, while the system itself addresses regular users, a condition which was anticipated and controlled in the usability evaluation experiment.


8. References

Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 249-256). ACM.

Nielsen, J. (1994). Heuristic evaluation. In Usability inspection methods (pp. 25-62). Elsevier.

Faulkner, L. (2003). Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments, & Computers, 35(3), 379-383.

Fylan, F. (2005). Semi-structured interviewing. In A handbook of research methods for clinical and health psychology (pp. 65-78).

Nielsen, J. (2001). Success rate: The simplest usability metric. Jakob Nielsen's Alertbox, 18. Retrieved on 11 January 2017 from: https://www.nngroup.com/articles/success-rate-the-simplest-usability-metric/

Sauro, J., & Lewis, J. R. (2005, September). Estimating completion rates from small samples using binomial confidence intervals: Comparisons and recommendations. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 49, No. 24, pp. 2100-2103). SAGE Publications.

Laugwitz, B., Held, T., & Schrepp, M. (2008, November). Construction and evaluation of a user experience questionnaire. In Symposium of the Austrian HCI and Usability Engineering Group (pp. 63-76). Springer Berlin Heidelberg.


9. Appendix – UEQ Questionnaire

Please make your evaluation now. For the assessment of the product, please fill out the following questionnaire. The questionnaire consists of pairs of contrasting attributes that may apply to the product. The circles between the attributes represent gradations between the opposites. You can express your agreement with the attributes by ticking the circle that most closely reflects your impression.

Example:

attractive            unattractive

This response would mean that you rate the application as more attractive than unattractive. Please decide spontaneously. Don't think too long about your decision to make sure that you convey your original impression. Sometimes you may not be completely sure about your agreement with a particular attribute, or you may find that the attribute does not apply completely to the particular product. Nevertheless, please tick a circle in every line. It is your personal opinion that counts. Please remember: there is no wrong or right answer!


Please assess the product now by ticking one circle per line.

 1. annoying              1 2 3 4 5 6 7   enjoyable
 2. not understandable    1 2 3 4 5 6 7   understandable
 3. creative              1 2 3 4 5 6 7   dull
 4. easy to learn         1 2 3 4 5 6 7   difficult to learn
 5. valuable              1 2 3 4 5 6 7   inferior
 6. boring                1 2 3 4 5 6 7   exciting
 7. not interesting       1 2 3 4 5 6 7   interesting
 8. unpredictable         1 2 3 4 5 6 7   predictable
 9. fast                  1 2 3 4 5 6 7   slow
10. inventive             1 2 3 4 5 6 7   conventional
11. obstructive           1 2 3 4 5 6 7   supportive
12. good                  1 2 3 4 5 6 7   bad
13. complicated           1 2 3 4 5 6 7   easy
14. unlikable             1 2 3 4 5 6 7   pleasing
15. usual                 1 2 3 4 5 6 7   leading edge
16. unpleasant            1 2 3 4 5 6 7   pleasant
17. secure                1 2 3 4 5 6 7   not secure
18. motivating            1 2 3 4 5 6 7   demotivating
19. meets expectations    1 2 3 4 5 6 7   does not meet expectations
20. inefficient           1 2 3 4 5 6 7   efficient
21. clear                 1 2 3 4 5 6 7   confusing
22. impractical           1 2 3 4 5 6 7   practical
23. organized             1 2 3 4 5 6 7   cluttered
24. attractive            1 2 3 4 5 6 7   unattractive
25. friendly              1 2 3 4 5 6 7   unfriendly
26. conservative          1 2 3 4 5 6 7   innovative