
Toward automatic comparison of visualization techniques: application to graph visualization

L. Giovannangeli, R. Bourqui, R. Giot and D. Auber
{loann.giovannangeli, romain.bourqui, romain.giot, david.auber}@labri.fr

March 2020

Abstract

Many end-user evaluations of data visualization techniques have been run during the last decades. Their results are cornerstones to build efficient visualization systems. However, designing such an evaluation is always complex and time-consuming and may end in a lack of statistical evidence and reproducibility. We believe that modern and efficient computer vision techniques, such as deep convolutional neural networks (CNNs), may help visualization researchers to build and/or adjust their evaluation hypotheses. The basis of our idea is to train machine learning models on several visualization techniques to solve a specific task. Our assumption is that it is possible to compare the efficiency of visualization techniques based on the performance of their corresponding models. As current machine learning models are not able to strictly reflect human capabilities, including their imperfections, such results should be interpreted with caution. However, we think that using machine learning-based pre-evaluation, as a pre-process of standard user evaluations, should help researchers to perform a more exhaustive study of their design space. Thus, it should improve their final user evaluation by providing it with better test cases. In this paper, we present the results of two experiments we have conducted to assess how correlated the performance of users and computer vision techniques can be. That study compares two mainstream graph visualization techniques: node-link (NL) and adjacency-matrix (AM) diagrams. Using two well-known deep convolutional neural networks, we partially reproduced user evaluations from Ghoniem et al. and from Okoe et al. These experiments showed that some user evaluation results can be reproduced automatically.

arXiv:1910.09477v2 [cs.HC] 17 Apr 2020

1 Introduction

Information visualization has now been established as a fruitful strategy in various application domains for the exploration of large and/or complex data. When designing a new visualization technique, it is necessary to assess its efficiency compared to existing ones. While computational performance can easily be evaluated, it is more complicated to assess user performance. This is usually achieved by carrying out a user evaluation that compares several techniques on one or several tasks (e.g. [12, 13, 26, 27]). During such a study, a population of users is requested to solve the same task(s) several times using various input data presented with different visualization techniques or different variations of a technique. Experimental results (e.g., accuracy or completion time) are then statistically validated to confirm or refute some preliminary hypothesis. Completing a user evaluation can become time-consuming as the number of techniques, the considered dataset, and the number of tasks increase. As that may bias the experimental results due to loss of attention or tiredness, it is necessary to keep the completion time of the entire evaluation reasonable. To that end, one usually tries to limit the number of experimental trials [28] by reducing the number of evaluated techniques, the number of tasks and/or the dataset. This is one of the strongest bottlenecks of user evaluations, as visualization designers have to reduce the explored design space to a few visualization techniques without any strong evidence.

We believe that using automatic comparison of visualization techniques, as a pre-evaluation step, should help to address this issue. As no user is involved in such an evaluation, visualization designers could assess the relative efficiency of several visualization techniques, or variations of one. A formal user evaluation would then be carried out based on the results of that pre-evaluation.

Deep neural network systems [20] are used in more and more scenarios (e.g., natural language processing, instance segmentation, image captioning) and have outperformed well-established state-of-the-art methods since the publication of AlexNet in 2012 [19] and its great performance in the ImageNet challenge [31]. [4] mention, in their survey on quality metrics in visualization, that using supervised machine learning approaches, such as deep learning, is a promising direction for evaluating the quality of a visualization. [15] made use of such systems in an information visualization context and showed that these techniques can be widely beneficial to our community.

This paper presents the results of a study conducted to assess how correlated the performance of users and computer vision techniques are. The basis of that study consists in: first, generating a set of still images for each compared visualization technique from a common dataset; then, training models for each image set and task. Consider two visualization techniques A and B, a model architecture M and the best-trained models M_A and M_B, respectively on A and B, for a given task. The study aims at answering the following question: if M_A performs more accurately than M_B, can we observe the same performance difference with a population of users? Computer vision algorithms and user performances have already been compared [35, 10]; sometimes, machine learning algorithms provide better results, especially when the model architecture focuses on a specific task [16], while users outperform them when bias (e.g. distortion, noise) is added to the images [9]. However, none of these studies compared machine learning models with users in a context of information visualization. Our study reproduces four tasks from two previous user evaluations [12, 13, 26, 27]. It aims at comparing node-link (NL) and adjacency-matrix (AM) diagrams for three low-level tasks (counting) and one task of higher level (minimum distance between two nodes). That evaluation involves about 230000 experimental objects.

The results of this study are encouraging indicators for this approach; however, they must be considered within the scope of their limitations, which are raised throughout the experimental process.

The remainder of this paper is organized as follows. Section 2 presents related works with a focus on machine learning for counting, machine learning for information visualization and user evaluation of graph visualization techniques. Section 3 describes the general approach proposed to automatically compare visualizations and its different steps. Section 4 describes the experimental evaluations made to assess the pertinence of the method and their results, while Section 5 discusses them. Finally, Section 6 draws conclusions and presents directions for future works.

2 Related works

The proposed study mainly relates to the use of machine learning for information visualization. As mentioned in Section 1, we identified four tasks of counting and connectivity. While machine learning has already been used for counting, there is, to the best of our knowledge, no prior work related to the shortest path between two nodes. We also present the results of user evaluations of graph visualizations, emphasizing some inconsistencies, to justify the choice of tasks we used.


2.1 Machine learning for information visualization

Graphical Perception [14] reproduced the settings of [8] with four different neural networks. Their goal was to show how these networks compare to humans. They show that CNNs perform as well as or better than humans for almost every task, although CNNs cannot estimate edge length ratios properly. In that work, Multi-Layer Perceptron (MLP), LeNet [21], VGG [34] and Xception [7] networks have been tested. In each case, results were better when the networks were completely retrained for a specific task. We use a similar setup for our experiment by completely retraining our CNNs.

Readability of Force Directed Graph Layouts The work of [15] is the most similar in spirit to ours. They proposed to use a handcrafted CNN (inspired by VGG) in order to find properties of force-directed layouts in node-link diagrams. While the focus of that research is to design an optimized CNN architecture for that particular task, our study relies on standard model architectures. We assume these models, designed by the machine learning community, are efficient [14], and we use them to compare various visualization techniques.

2.2 Machine learning for counting objects

In this paper, we aim to compare several counting tasks and a connectivity one on two graph visualization techniques (see Section 4) using machine learning. Recent advances in that domain have proven that such tasks may be automated, and they also provide some cues to set up our experimentation.

Complete image versus patches [22] propose a semi-supervised learning framework for visual object counting tasks. They recover the density function of objects in an image thanks to a linear transformation from the feature space that represents each pixel. A tailored loss function helps to optimize the weights of this linear transformation, while the integration of the density function provides the number of objects. The method has been validated on cell counting in fluorescent microscopy images and people counting in surveillance videos. [37] improve the former work by 20% with an approach based on CNNs in a boosting framework (CNN_{i+1} learns to estimate the residual error of CNN_i) and image patches. Of course, this method is not exact, and the quality of the results may vary according to the input images. Measuring these differences is the basis of our comparison approach. While that method improves the performances, we use a full-image method in our experiment since we do not make any assumption on the size of the objects to be counted.


Objects identification Recent research also proved that deep networks may automatically learn the kinds of graphical objects they need to solve a task. [33] estimate the count (up to 5) of even digits in an image. They show that the network can extract features that encode the objects to count, whereas these features have never been explicitly provided to the network. Another result is the one of [29]; they used a deep learning framework to count fruits in images. The originality of this work resides in the use of artificial images during training while still achieving more than 90% prediction accuracy on 100 real images of tomato plants. Therefore, for graph counting tasks, it may be possible to use the same learning technique on different visualizations without specific changes according to the way entities and relations are visually encoded. This is the basis of the approach used in our study: to apply the same learning technique (model architecture, metrics, etc.) on different visualizations so that the visualization itself is the only parameter which should impact the performances of the training and trained models.

2.3 User evaluation of graph visualization

In this paper, we focus on the evaluation of two of the most widely accepted graph visualization techniques: node-link (NL) and adjacency matrix (AM) diagrams. They are state-of-the-art graph visualization techniques many researchers try to improve, for example by optimizing an aesthetic criterion or an objective function, or by identifying some criteria of the visualization that can facilitate task solving (e.g. [24, 18]).

Concurrently, user evaluations have also been performed to compare the efficiency of these two techniques [12, 13, 17, 1, 2, 25, 26, 27]. On the one hand, [12, 13] evaluated these techniques on seven counting and connectivity tasks and concluded that NL performs better than AM for connectivity tasks when graphs are small and sparse, while AM outperforms NL in all other conditions. These results were later reproduced by the user evaluation of [17] for similar tasks, even with small graphs. In their evaluation, [2] compared two variations of NL and AM for dynamic weighted graphs and concluded that AM performs better than NL for weight variation and connectivity tasks. On the other hand, [1] did not find any difference in terms of user performance when considering higher-level tasks of the software engineering domain. [32] compared four visual edge encodings in AM, among which one corresponds to a NL where nodes have been duplicated, for the path-finding task. This evaluation showed very few significant differences between AM and their variation of NL. Such contradictory results were obtained by other researchers as well [25, 26, 27]. [25] partially reproduced two connectivity tasks out of the seven tasks of [12, 13] with some modifications of the experimental setup. These modifications consisted in using a real-world dataset, ordering the matrix to emphasize the cluster structures and allowing node dragging in the NL. Another important difference is the size of the involved population, as running their experiment with an online system allowed them to recruit a larger number of participants. This evaluation's results showed that NL outperforms AM on each tested task. A larger-scale study, presented by [26, 27], explored a larger range of tasks as it included fourteen tasks classified into topology/connectivity, group and memorability tasks. That study showed that NL outperforms AM for topology/connectivity and memorability tasks, while the contrary can be observed for group tasks. Finally, a recent study by [30] on 600 subjects showed that NL performs better than AM for adjacency, accessibility and connectivity tasks. The downside of this study is that it has been performed with only 2 graphs of 20 and 50 vertices, and relatively low densities (≤ 0.1); therefore its results must be compared carefully to previous studies.

Figure 1: Illustration of the experimental protocol proposed in this work. The first steps consist of generating data and their representations with the visualization techniques to compare. The next steps consist of training models, selecting the best ones and evaluating them on unseen data. Finally, the efficiencies of the visualization techniques are compared using the usual statistical methods, as for any quantitative user evaluation. Blue icons indicate where visualization experts are needed.

We believe that the divergence in these results is mainly due to three aspects. First of all, even though the evaluation protocols are similar, their slight differences (available interaction tools, wording of the questions) can drastically induce variations in the results. The second aspect relates to the populations involved in these studies. As their sizes (e.g., from a few dozen [12, 13] to a few hundred [26, 27]) and demographics varied (e.g., post-graduate students and researchers in computer science [12, 13] versus Amazon Mechanical Turk users [26, 27]), the results can vary as well. The last aspect relates to the evaluation datasets, as these evaluations are performed in distinct contexts and therefore use different datasets having various topological properties. Using an automated evaluation as a pre-process of the user evaluation may help to solve such reproducibility issues as long as the datasets, the model architectures and the visualization protocols are provided. Considering these previous studies, we decided to run our experiment on four tasks having no contradictory results. In order to be able to compare our results to those of existing user evaluations, we partially reproduced those of [12, 13] and [27]. Three of our tasks are low-level ones and relate to counting, while the last task, of higher level, consists in determining the length of the shortest path between two highlighted nodes.

3 Automatic evaluation with machine learning: general approach

The key idea of the proposed method is to use machine learning models to solve a specific task for each visualization technique to be compared. Then, the efficiency of these techniques can be evaluated by comparing the performance of their related models. While bio-inspired models [36] aim to mimic human behavior, many modern machine learning models do not perform like humans but rather try to perform better than them [16]. If such a model cannot solve a task on a visualization, it could mean either that: (i) the model is too simple or not designed for the task; (ii) the task needs a high level of abstraction (e.g. cultural knowledge) to be solved; or (iii) the visualization does not contain (enough) elements for the model to solve the task.

Several important criteria for the Human-Computer Interaction field may be estimated using such machine learning models.

Learning convergence: by comparing how fast, in terms of number of epochs, a visualization can be understood by the model, one could estimate the learning curve to solve a task on this visualization.

Complexity of the task: models have various structural complexities (e.g. number of layers, type of operations in the layers, etc.). By checking the performance of several well-known models, going from less to more complex, one could automatically evaluate the difficulty of solving a task on a visualization. Simple tasks are expected to work well with simple models, while more complex tasks should require more complex models.

Accuracy of a visualization: by measuring the number of correct results for a task, one can compare a visualization to another. Errors are usually present when there is some ambiguity in the visualization due to, for instance, cluttering issues. Modern machine learning-based models perform quite well here, especially on counting tasks (e.g. [37, 29]). Thus, one can assume that if the model cannot resolve the ambiguity, the task will also be hard, or at least time-consuming, to solve in a standard evaluation.

These criteria rely on the main assumption our study aims at validating: a correlation can be made between human and machine performances. We are aware this assumption is not self-evident and may depend on the task type, its dataset, the model's architecture and other parameters. If this assumption is to be refuted, this approach can still enable us to explore (dis)similarities between human and machine performances.

Figure 1 summarizes our experimental protocol, while the remainder of the section presents each block of our evaluation process: the data generation; the images, tasks and ground truths; the model selection; and the model evaluation and performance comparison. It also presents the role that remains for visualization experts.

3.1 Data generation

Our evaluation protocol should be applied in a context for which we know which tasks have to be handled. Consequently, we should have a set of raw input data (to be visualized with the different techniques) associated with the expected answer for each of these tasks. Such a dataset can either be generated automatically using a data generation algorithm or collected from an existing data source. Our experiment has shown that the data generation process requires significant attention. Indeed, as we detail in the next sections, the data distribution is of utmost importance: the more uniform the distribution is, the easier the setup of evaluation and training will be. It is not always easy or feasible to generate datasets respecting this constraint. As reproducibility is one of our main goals for this approach, generated data can be stored and reused for future experimentations. This enables the community to create test beds to compare new visualization techniques with previous ones.


3.2 Images, tasks and ground truths

The second step consists of generating visual representations of each input data with each visualization technique. They will be used during the experiment. In our current approach, we only deal with static visualizations and do not support interactive ones. Thus, that step consists of generating static labeled images on which a model or a user should answer a question (i.e., solve a task). We note v_i = (img_i, grd_i) a data sample, where img_i is the image (corresponding to a visualization technique) and grd_i is the ground truth to be predicted among the sequence of all possible answers to a specific task, noted G. As previously stated, generating a set V = {v_1, ..., v_n} of samples for a specific task is challenging, as it has to follow a valid distribution. For instance, if the same value has to be predicted in 80% of the cases and a model has an accuracy of 80% for that task, it could mean that the model merely succeeded in detecting the bias in our experiment. Ideally, the distribution should verify the following property:

$$\forall a, b \in G, \quad \sum_{(img,\, grd) \in V} \mathbb{1}\{a = grd\} \;=\; \sum_{(img,\, grd) \in V} \mathbb{1}\{b = grd\} \tag{1}$$

where $\mathbb{1}\{a\} = \begin{cases} 1 & \text{if } a \text{ is true} \\ 0 & \text{otherwise} \end{cases}$.

In other words, each ground truth of G should appear the same number of times in V so that its distribution is uniform. Such an experiment is usually not feasible with users since the number of experimental objects is too large, making the completion time unrealistic.
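The uniformity property (1) can be checked mechanically before training. Below is a minimal sketch, assuming samples are stored as (image, ground truth) pairs; the names are illustrative, not taken from the authors' code.

```python
from collections import Counter

def is_uniform(samples):
    """True if every possible answer appears equally often in the sample set V."""
    counts = Counter(grd for _img, grd in samples)
    return len(set(counts.values())) == 1

# Toy set V: the answers 3 and 4 each appear twice, so property (1) holds.
V = [("img_0.png", 3), ("img_1.png", 4), ("img_2.png", 3), ("img_3.png", 4)]
assert is_uniform(V)
```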

The first step of a formal user evaluation is to let the user learn the visualization and the tasks with a training set. Our approach works similarly: V is split into three datasets called train (to learn the model), validation (to select the best model) and test (to fairly evaluate the model); they are respectively noted V_train, V_validation and V_test. Each of them should verify property (1), and thus the initial set V must be large enough.

3.3 Model selection

The model selection step consists of choosing the model that performs the best on V_validation for a given metric. It means that it has learned well from V_train and is able to generalize to the dataset V_validation (i.e., it does not suffer from overfitting). We apply the standard machine learning methodology for that purpose. With deep neural network-based models, we train models for a fixed number of epochs (100 by default) and save their state as many times. Then we use the set V_validation to compute the set of predictions for each epoch and save these predictions. For each epoch e and each sample v_i ∈ V_validation, a prediction is the answer provided by the model and is noted p_e(v_i) ∈ G. We can then evaluate, for each epoch, its performance on a given metric regarding its predictions and the ground truths. In this experiment, we used four metrics:

• MSE (mean squared error): $mse(y, \hat{y}) = \frac{\sum_{i=1}^{Card(y)} |y_i - \hat{y}_i|^2}{Card(y)}$

  That is a common regression metric in the machine learning community that considers the labels' proximity (lower is better).

• Accuracy: $accuracy(y, \hat{y}) = \frac{\sum_{i=1}^{Card(y)} \mathbb{1}\{y_i = \hat{y}_i\}}{Card(y)}$

• percent10: $percent10(y, \hat{y}) = \frac{\sum_{i=1}^{Card(y)} \mathbb{1}\{0.9\, y_i \le \hat{y}_i \le 1.1\, y_i\}}{Card(y)}$

  That metric is a variation of the accuracy that allows ±10% of error (higher is better).

• delta10: $delta10(y, \hat{y}) = \frac{\sum_{i=1}^{Card(y)} \mathbb{1}\{y_i - 10 \le \hat{y}_i \le y_i + 10\}}{Card(y)}$

  That metric is a variation of the accuracy that allows ±10 units of error (higher is better).

where $y = [grd \mid (img, grd) \in V_{validation}]$ is the list of ground truths and $\hat{y} = [p_e(v) \mid v \in V_{validation}]$ the list of associated predictions.

While MSE measures the distance of the predictions to their associated ground truths, percent10 and delta10 consider predictions that are close to their ground truths as correct. Although percent10 and delta10 seem similar, they have opposite limitations. While percent10 is more permissive for predictions associated with high-value ground truths than with low-value ones, delta10 has the opposite behavior. Thus, both are sensitive to the ground truths' average value and distribution.
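For concreteness, the four selection metrics can be written in a few lines of NumPy; this is a sketch of our own, not the authors' implementation, assuming y (ground truths) and y_hat (predictions) are one-dimensional arrays.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean(np.abs(y - y_hat) ** 2)   # lower is better

def accuracy(y, y_hat):
    return np.mean(y == y_hat)               # exact matches only

def percent10(y, y_hat):
    # correct if within +/-10% of the ground truth (higher is better)
    return np.mean((0.9 * y <= y_hat) & (y_hat <= 1.1 * y))

def delta10(y, y_hat):
    # correct if within +/-10 units of the ground truth (higher is better)
    return np.mean((y - 10 <= y_hat) & (y_hat <= y + 10))
```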

3.4 Model evaluation and performance comparison

V_test serves as input data to evaluate the selected model's performance but also to ensure that it generalizes well. As the elements of V_test have not been previously processed during the learning and model selection stages, we can measure the capacity of the model to solve a task on an unseen image.

For each v ∈ V_test, we compute p_e(v) using the best epoch e found during the model selection stage. We note p_e(V_test) the sequence of all predictions for the elements of V_test. When comparing k visualization techniques (or variations), the k corresponding models provide k sequences of predictions to be compared. It is noteworthy that the best epochs may differ for each combination of model architecture, visualization technique, task and metric.

Then, we make sure the selected model has effectively learned how to solve the considered task. This is usually achieved by comparing the predictions of a model to random choices or by computing evaluation metrics. In this work, we propose to use the coefficient of determination (R²), designed for regression problems:

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{Card(y)} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{Card(y)} (y_i - \bar{y})^2}$$

where $\bar{y} = \sum_{i=1}^{Card(y)} y_i / Card(y)$. A score of 1 means that $\hat{y}$ explains $y$ perfectly (very good system), while 0 means the opposite. For our experiments, we consider that R² scores above 0.3 mean that the model learned to solve a task.

In case of a classification problem, we recommend the use of the Matthews correlation coefficient, defined as:

$$MCC(y, \hat{y}) = \frac{\sum_{ijk} C_{ii} C_{jk} - C_{ij} C_{ki}}{\sqrt{\sum_i \left(\sum_j C_{ij}\right) \left(\sum_{i',j' \mid i' \neq i} C_{i'j'}\right)} \; \sqrt{\sum_i \left(\sum_j C_{ji}\right) \left(\sum_{i',j' \mid i' \neq i} C_{j'i'}\right)}}$$

considering the confusion matrix C (built from $y$ and $\hat{y}$) of a K-class classification problem. This metric's advantage is to be insensitive to unbalanced datasets. A value of 1 represents a perfect prediction, 0 a random one and −1 a total disagreement between the predictions and their ground truths.
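Both coefficients are available off the shelf; a hedged sketch with scikit-learn, whose r2_score and matthews_corrcoef should match the definitions above (all arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score, matthews_corrcoef

# Regression view: R^2 of best-epoch predictions against ground truths.
y     = np.array([20, 35, 50, 65, 80])
y_hat = np.array([22, 33, 52, 61, 79])
print(r2_score(y, y_hat))                   # ~0.99 here; > 0.3 counts as "learned"

# Classification view: MCC on class labels.
y_cls     = [0, 1, 2, 2, 1, 0]
y_cls_hat = [0, 2, 2, 2, 1, 0]
print(matthews_corrcoef(y_cls, y_cls_hat))  # 1 perfect, 0 random, -1 opposite
```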

Finally, the performance comparison consists of comparing the sequences of predictions of each model (i.e. each visualization technique) with the ground truths. For that purpose, we use the previously presented metrics (see Section 3.3) to evaluate each prediction on V_test, and we use standard statistical tests (e.g. Wilcoxon signed rank, Kruskal-Wallis) to determine if the predictions of a model are significantly better than the others'.
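The comparison step then reduces to a paired test on per-image errors; a minimal sketch with SciPy, assuming two models were evaluated on the same V_test (the arrays are illustrative):

```python
import numpy as np
from scipy.stats import wilcoxon

y       = np.array([30, 45, 60, 75, 90, 40, 55, 70])  # ground truths
pred_am = np.array([30, 44, 60, 74, 90, 40, 55, 69])  # AM model predictions
pred_nl = np.array([28, 49, 55, 70, 96, 44, 50, 75])  # NL model predictions

err_am, err_nl = np.abs(y - pred_am), np.abs(y - pred_nl)
stat, p = wilcoxon(err_am, err_nl)
if p < 0.05:  # distributions differ: favor the smaller mean absolute error
    winner = "AM" if err_am.mean() < err_nl.mean() else "NL"
```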

3.5 Role of visualization experts

The very design of our evaluation protocol permits running experiments without involving any human. However, visualization experts must remain in the process, even in its early stages, and play the same role as in formal user evaluations. In particular, the expert must set all the experimental parameters such as the visualization techniques to be compared, the considered tasks and hypotheses, the data generation model (or the collection of data) as well as the evaluation metrics (see Figure 1).

It is noteworthy to remind that our approach does not intend to replace human evaluations. It rather provides a tool to help experts (i) choose the visualizations and their parameters to use during a user evaluation, and (ii) justify these choices with reproducible computed metrics.

4 Experimental comparison of NL and AM

This section presents two evaluations carried out to validate the proposed approach and its premise. We have chosen to reproduce user evaluation studies that serve as references to draw a first conclusion about its effectiveness. Thus, we reproduced 3 (low-level) tasks out of the 7 tasks of [12, 13] and 1 (higher-level) task of [27]. These two experiments aim at comparing NL and AM representations. It is important to remember that, as in any other similar study in the literature, these experiments compare opinionated implementations of NL and AM views. For the sake of readability, "NL" and "AM" systematically refer to our implementations.

The following sections describe these two evaluations, including their experimental protocols, results and a discussion of their main limitations.

4.1 Counting tasks of Ghoniem et al.

4.1.1 Evaluation protocol

Task, visualization techniques and hypothesis In this evaluation, we focus on the first three tasks of [12, 13], which mainly relate to counting: (i) Task 0: exact count of the number of nodes; (ii) Task 1: exact count of the number of edges; (iii) Task 2: maximum degree of the graph. Task 0 and Task 1 are slight variations of Ghoniem et al.'s first two tasks, which consisted of providing estimates of the number of nodes and edges. Giving an exact count being a harder problem than giving an approximation, we know our results will differ from the original study, in that the accuracies will most likely be lower. However, during both validation and evaluation, estimates of the ground truths can be considered (see Section 3.3). Task 2 is a variation of Ghoniem et al.'s third task, as they considered the identification of the node of maximum degree instead of the maximum degree itself. We decided to implement this task variation as we wanted to focus on counting tasks (see Section 2.3).

Several studies [12, 13, 26, 27, 30] have provided contradictory results about the efficiency of NL and AM on several tasks. However, to the best of our knowledge, none of these contradictions relate to the 3 tasks we focus on. This was, to a certain extent, mandatory in this study as we expected our results to corroborate some commonly admitted ones to estimate our approach's feasibility. Therefore, our hypothesis is:

H1: AM should outperform NL for each of the three tasks.

Graph dataset We generated our dataset with the same random graph model as in the original study, but we could not use the same implementation of that model as their generator is no longer available. Moreover, the number of graphs considered in our experiment is much larger. On one hand, supervised learning methods require a large training dataset; on the other hand, we have made the hypothesis that using many graphs would help to train models able to better generalize. When generating the graphs of this study, we considered two parameters:

• the number of nodes, which varies from 20 to 100 (we refer to this parameter as the graph size);

• the edge density as defined in [12, 13], which was set to 0.2, 0.4 and 0.6 and corresponds to $\sqrt{\frac{2|E|}{|V|^2}}$.

There are 243 combinations of parameters (3 edge densities × 81 graph sizes). 100 instances are generated for each combination, which led to 24300 graphs for training and validation. 2700 additional graphs are generated for testing while trying to follow the same distribution as the training and validation datasets. This leads to 27000 graphs in total.
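Since the original generator is unavailable, the generation step can be approximated as follows; this is a sketch under the assumption that a uniform random graph with the prescribed number of edges (networkx's G(n, m) model) is an acceptable stand-in, with $|E| = d^2 |V|^2 / 2$ derived from the density definition above.

```python
import networkx as nx

graphs = []
for n in range(20, 101):                  # graph sizes: 81 values
    for d in (0.2, 0.4, 0.6):             # edge densities
        m = round(d * d * n * n / 2)      # invert d = sqrt(2|E| / |V|^2)
        for _ in range(100):              # 100 instances per combination
            graphs.append(nx.gnm_random_graph(n, m))

print(len(graphs))                        # 24300 training/validation graphs
```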

Images datasets To feed the models, we computed a 256×256 grayscale image for each combination of graph, task and visualization technique (NL or AM).

In the NL diagrams, nodes were represented by a circular shape, and each graph has been laid out with the GEM force-directed algorithm [11]. To ease the identification of nodes and edges, the node size has been set to half the minimum distance between any two nodes in the resulting drawing. Concerning the AM diagrams, we computed the ordering of nodes using the Louvain clustering algorithm of [5]. This emphasizes the community structures of the graph, even though assessing such topological information is not necessary (given our tasks and the random data generation model). In this representation, nodes were represented by squares. For readability purposes, we drew the grid of the matrix with lines of width 1 pixel. Finally, the row and column sizes (resp. height and width) have been set as large as possible to maximize space usage.

Nodes appear larger in NL than in AM (see Figure 2), as we maximized the space usage in the computed images and due to the different properties of the two visualization techniques.
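The AM rendering described above can be sketched in a few lines; this is our own illustrative reconstruction (node ordering, e.g. by Louvain communities, is assumed to have been applied to adj beforehand), not the authors' rendering engine.

```python
import numpy as np
from PIL import Image

def draw_matrix(adj, size=256):
    """Render a (reordered) adjacency matrix as a grayscale image:
    black squares for edges, a 1-pixel gray grid, white background."""
    n = adj.shape[0]
    cell = (size - (n + 1)) // n            # square side between grid lines
    extent = n * (cell + 1) + 1             # drawn area, at most `size`
    img = np.full((size, size), 255, np.uint8)
    for i in range(n + 1):                  # 1-pixel grid lines
        p = i * (cell + 1)
        img[p, :extent] = 128
        img[:extent, p] = 128
    for i, j in zip(*np.nonzero(adj)):      # one filled cell per edge
        y, x = i * (cell + 1) + 1, j * (cell + 1) + 1
        img[y:y + cell, x:x + cell] = 0
    return Image.fromarray(img)
```

Note that this uses only three grayscale values (background, grid, nodes), which matches the small, unambiguous palette discussed in the limitations below.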

Deep convolutional network architectures For this experiment, we have selected two common convolutional network architectures. On one hand, we selected a simple yet efficient architecture for solving simple problems. That architecture, called LeNet [21], contains 8 layers and was initially designed to recognize hand-written digits in small 28×28 images. On the other hand, we selected a much more complex one, called VGG16 [34], that has been shown to perform well on complex image classification problems. [14] mentioned that the VGG architecture provides the best results out of their four tested architectures. It therefore seems to be a good candidate to run comparisons between visualization techniques. That architecture contains 21 layers and was initially designed for large-scale image classification problems.

We have chosen a classification approach over a regression one (i.e., models predict a class instead of a value) to generalize them to other tasks in the future. For the same reason, the output layer has also been oversized (i.e. some classes do not appear in the data) compared to the number of classes to predict, to be more dataset independent.

We trained a model for each combination of architecture, task and visualization technique. Each network has been completely trained (as opposed to pre-trained networks) and no weight was shared between the networks of a same architecture for the different tasks (as could be done by [15]). Each model has been trained for 100 epochs using the negative log likelihood loss and the Adadelta optimizer from [38], which has the advantage of requiring no manual tuning of the learning rate.
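In PyTorch terms, this training setup boils down to the sketch below; the paper does not state which framework was used, the tiny LeNet-style network is a stand-in, and train_loader is an assumed DataLoader of labeled 256×256 grayscale images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal LeNet-style stand-in ending in log_softmax, so F.nll_loss
# implements the negative log likelihood objective.
model = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(4),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(4),
    nn.Flatten(), nn.Linear(16 * 14 * 14, 128), nn.ReLU(),
    nn.Linear(128, 120), nn.LogSoftmax(dim=1),  # oversized output layer
)
optimizer = torch.optim.Adadelta(model.parameters())  # no manual learning rate

for epoch in range(100):                  # fixed budget of 100 epochs
    for images, labels in train_loader:   # assumed DataLoader
        optimizer.zero_grad()
        loss = F.nll_loss(model(images), labels)
        loss.backward()
        optimizer.step()
    # one snapshot per epoch; the best one is picked later on V_validation
    torch.save(model.state_dict(), f"epoch_{epoch}.pt")
```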


Table 1: R² scores of the best models on the accuracy metric (see Section 3.4) for each combination of task, network architecture, and visualization technique (1.0 values are due to rounding).

Task     Network architecture    AM            NL
Task 0   LeNet                   1.0           0.73
Task 0   VGG16                   1.0           0.97
Task 1   LeNet                   1.0           0.99
Task 1   VGG16                   1.0           1.0
Task 2   LeNet                   0.98          0.97
Task 2   VGG16                   0.98          0.97
Task 3   LeNet                   −1.48         0.46
Task 3   VGG16                   −1.18×10⁻⁶    −1.18×10⁻⁶

4.1.2 Results

We are interested in verifying whether the models could learn, how they perform and how they compare between the two graph representations, to validate our hypothesis. Another interesting point is the learning convergence speed. This section presents these different aspects.

Learning feasibility As mentioned in Section 3, it is mandatory to determine whether the models could learn how to solve the tasks. Even though we use classification models in this experiment, the models solve a regression problem, as the goal is to predict a value (e.g. the number of nodes). We therefore used the R² metric to assess how well the models learned. Table 1 shows the R² scores of the best models for each combination of the considered parameters in our experiment. These R² scores range from 0.73 (for Task 0 on NL with the LeNet architecture) to 1.0. This shows that both LeNet and VGG16 have learned how to solve the three tasks with each of the visualization techniques.


Models performances and comparison Table 2 summarizes the classification results using the different metrics computed on V_test by the models performing the best on V_validation.

Our hypothesis was that the automatic evaluation should reproduce the results of [12, 13], i.e. AM should outperform NL for each of the three tasks. To validate our hypothesis on each task, we selected for each metric (MSE, accuracy, percent10, delta10) the best models (one for each architecture) for each visualization technique. For instance, Figure 3 shows the confusion matrices of the best models selected using MSE for each visualization technique. Then we evaluated how well the best models approximate the ground truths by computing, for each graph, the absolute difference between the ground truth and their predictions (see Table 3). For each selection criterion (i.e. metric), we compared these absolute differences for AM and NL. We used the Wilcoxon test, which verifies whether these two sets of data share the same distribution. If the p-value is lower than 0.05, we can consider that they are different in favor of the one having the smallest mean absolute difference; otherwise, no statement can be made. This process is repeated four times using the best models obtained with the evaluation metrics (MSE, accuracy, percent10, delta10). Table 3 shows that the p-value is systematically << 0.05 (so the distributions are different) in favor of the AM representation; the Kruskal-Wallis test provides the same results. These results validate hypothesis H1: AM outperforms NL for the three considered tasks.

Even though statistical tests validated our hypothesis, we further analyzed the results to identify whether one of the dataset generation parameters has an influence on the performances of the models. Figure 4 groups the recognition performance per range of nodes and per density to locally identify the difficult cases, for the models that maximize the accuracy metric. AM accuracy does not seem affected by either the number of nodes or the edge density of the graph for Task 0 and Task 1. NL accuracy abruptly decreases when the graph size exceeds 30 nodes and then remains stable up to 100 nodes. Moreover, NL accuracy also seems to increase with increasing edge density, which is surprising. However, NL accuracies are so low that it may not be relevant to compare them. Both NL and AM accuracies decrease when the graph size or edge density increases for Task 2. Such an impact on the accuracy was expected and confirms the results of [12, 13].

Models convergence Another interesting aspect of such a study is to determine which visualization technique is prone to fast convergence of the model. We believe it is correlated to the difficulty of the problem: a faster convergence means an easier problem, and a slower convergence a more difficult one. Table 2 shows that models trained on the AM representation usually train faster (24 epochs to obtain the best model on average) than models trained on the NL representation (39 epochs on average). The difference is statistically significant, as a Wilcoxon signed rank test computed on these 18 pairs obtained a p-value of 0.016 << 0.05. Therefore, models learn how to solve the tasks faster with AM representations than with NL ones. Interpreting this learning speed is thorny, as it could be related only to the time a network needs to learn the shape of the elements it is requested to count. This would not directly reflect the task difficulty itself. But even so, we believe that this kind of relation is induced by the visualization technique and can be interpreted. Thus, it could be easier for users to learn how to solve the tasks on the AM representation than on NL, as models trained with AM do converge faster. It would be interesting to investigate whether this assumption is relevant or not in a further study.

4.1.3 Limitations

As for any experimental evaluation, the results of our use case can only be interpreted within the limits set by its evaluation protocol (datasets, algorithms used to generate layouts, resolution and visual choices to draw images, etc.).

Our database has been tailored for Task 0; Figure 5 presents the ground-truth distributions for the 3 tasks on the test database. As expected, the distribution of the number of nodes is nearly uniform (the variation is due to a sampling of the entire dataset). However, the distributions of the number of edges and of the maximum degree clearly do not follow a uniform distribution (but rather a power-law distribution), which was also expected. Indeed, small graphs with high edge densities may have the same number of edges as larger ones with low edge densities. Even if it might have been better to generate a custom dataset for each task, the purpose of our study was to test whether the results of [12, 13] could be reproduced using computer vision techniques or not. As such unbalanced distributions are due to the dataset generation parameters from the original study, we could not avoid that bias.

To draw a graph with the NL representation, it is first necessary to lay it out with a specific algorithm and its specific configuration parameters (Section 4.1.1). The models' predictions, and the drawn conclusions, could be different depending on whether a force-directed or a non-force-directed algorithm is used. We can assume CNN models might be more efficient if the graphs were drawn in an orthogonal manner (e.g. [6]). Similar concerns arise in the AM setting, where it is known that several encodings can be used to draw the edges [32] and several node orderings could be used to ease reading [3]. However, such limitations are inherent to any evaluation (automatic or human-based) and are not specifically related to our experiment.

The last main limitation we have identified lies in the image resolution and the visual choices for its rendering. Concerning the AM, our rendering engine renders grid lines with a constant thickness (1 pixel); increasing the image resolution would increase row height and column width, which could also ease the completion of the considered counting tasks. One can have similar concerns about the rendering of NL, for which node size and edge thickness had to be arbitrarily set. Moreover, to reduce the computational time and memory consumption during model training, we used grayscale images as mentioned in Section 4.1.1. This drastically reduces the memory usage and the computational cost, as the input image is coded with one channel instead of three in RGB. While color is usually considered an important visual channel, we consider it should not penalize the models because the number of grayscale values used is quite small (3) and they are not ambiguous.

More generally, our NL and AM representations (i.e. layout algorithm, image drawing, absence of interactions) were not the exact same as those of [12, 13], as their study does not provide enough details to reproduce them entirely.

4.2 Shortest path length task of Okoe et al.

4.2.1 Evaluation protocol

Tasks, visualization techniques and hypothesis To study how the models perform on a task of higher level, we reproduced Task 10 from the evaluation of [27]. That task consists in determining the length of the shortest path between two highlighted nodes. Their evaluation showed that NL performs better than AM, which was not contradictory with other studies evaluating that specific task (in particular with the results of [12, 13]).

In this paper, we refer to this task as Task 3. As for the first experiment, we expected our result to confirm the result of [27]:

H2: NL should outperform AM for this task.

Graphs dataset Unlike in our first evaluation, the graphs we use for this experiment are those of the original study [27]; they correspond to two real-world networks shared in a GitHub repository¹, along with their study setup and results. In the original experiment setup, the answers to Task 3 followed two rules: (i) there is always a path between the two highlighted nodes and (ii) the shortest path between these nodes is no longer than 4. To generate as many task instances as possible (as required by deep learning training processes) while ensuring a uniform distribution of the ground truths, we computed the shortest paths of length 2 to 4 for each graph. Considering the two graphs of the original setup, we could extract 5593 pairs of nodes for each graph and shortest path length (2, 3 and 4). This led to 33558 task instances (5593 pairs of nodes × 3 path lengths × 2 graphs). This dataset is then randomly split into 3 parts: 80% for the training set, 10% for the validation set and 10% for the test set.
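The extraction of balanced task instances can be sketched as follows; this is our own reconstruction with networkx, assuming integer node labels and a sufficient number of pairs per length.

```python
import random
import networkx as nx

def extract_pairs(G, per_length=5593):
    """Collect node pairs at shortest-path distance 2, 3 and 4, then sample
    the same number of pairs per length to keep the ground truths uniform."""
    pairs = {2: [], 3: [], 4: []}
    for u, dists in nx.all_pairs_shortest_path_length(G, cutoff=4):
        for v, d in dists.items():
            if u < v and d in pairs:   # count each unordered pair once
                pairs[d].append((u, v))
    return {d: random.sample(p, per_length) for d, p in pairs.items()}
```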

Images datasets In this experiment, we did not need to compute the graph layouts for the NL representation nor the node orderings for the AM one, as [27] provide them in their GitHub repository.

With the AM representation used in the first experiment (see Section 4.1.1), the generated image size would be about twice the number of nodes in a graph, as every node must be represented by at least 1 pixel plus a 1-pixel grid line. For this experiment, such a representation would lead to images of size 665×665 pixels, which, coupled with the architecture of our models, is too large to feed them on our runtime infrastructure (see limitations in Section 4.2.3). To remedy this, we removed the grid lines from the AM representation and set the image size for both the NL and AM representations to 400×400 pixels. We generated RGB images where nodes were drawn in red (255,0,0) while nodes of interest were highlighted in green (0,255,0) (see Figure 6). These colors have been chosen so that (i) it is easy for a user to find the highlighted nodes, and (ii) the difference between these two colors remains significant once converted to grayscale, as expected by the models. The second point is induced by our conversion formula, L = 0.299R + 0.587G + 0.114B, where red and green are the two most important channels: pure red maps to 0.299 × 255 ≈ 76 while pure green maps to 0.587 × 255 ≈ 150.

Deep convolutional networks As the approach we propose is based on the use of generic model architectures, the models we use, the number of epochs and the optimizer in this experiment are the same as in the first one (see Section 4.1.1).

¹ github.com/mershack/Nodelink-vs-Adjacency-Matrices-Study, last consulted in February 2020


Supposedly, the only changes to the architecture we would need to apply are in the input layer, to adapt the model input to our image size, and in the output layer, to fit the number of classes the models need to predict in the current task. Therefore, the input of the models has been changed to a tensor of 400×400. As the maximum shortest path lengths (i.e. diameters) of these graphs are 6 and 7, the number of classes to predict has been set to 8. However, even with these modifications, the models did not fit in the memory of our runtime infrastructure. Thus, we reduced the sizes of the two hidden fully connected layers of the models from 4096 and 4096 in the first experiment to 512 and 256. Finally, we needed to add to VGG16 both a convolution and a max-pooling layer between the last convolution and the first fully connected layer in order to reduce memory usage when running our experiment.
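A hedged sketch of these changes, using torchvision's VGG16 as a stand-in (the paper's exact implementation is not specified); with 400×400 inputs, the extra conv + max-pooling stage reduces the final feature map to 512×6×6.

```python
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)               # trained from scratch, as above
model.features = nn.Sequential(           # extra conv + max-pooling stage
    *model.features,
    nn.Conv2d(512, 512, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
)
model.avgpool = nn.Identity()             # keep the raw 512x6x6 feature map
model.classifier = nn.Sequential(         # 4096/4096 shrunk to 512/256
    nn.Linear(512 * 6 * 6, 512), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(256, 8),                    # shortest-path-length classes 0..7
)
```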

4.2.2 Results

Learning feasibility Table 1 shows the R² scores of the best models for each network architecture and visualization. One can see that, except for LeNet with NL, none of the models were able to learn how to solve the task. Indeed, VGG16 with both NL and AM, and LeNet with AM, all have R² scores lower than zero. By inspecting the predictions of these models, we noticed that at each epoch these models were only able to predict one value regardless of the input image. In fact, unable to learn how to solve the task, the model learned the distribution of the dataset it had been trained on and always predicts the predominant value of the ground truths so that it minimizes its error. On the contrary, LeNet with NL has an R² score of 0.46, which indicates that the model was able to learn how to solve Task 3.

Models performances and comparison One can see in Table 2 that we only considered the accuracy and MSE metrics for Task 3. Indeed, percent10 and delta10 would not have much meaning with the current models since they only predict 8 values between 0 and 7. According to the accuracy and MSE scores, LeNet obtained better results with NL than with AM. On the contrary, the best VGG16 models offer the same performance whatever the chosen metric and visualization. In fact, as these models always predict the same value and as the ground truths only contain 3 values, evenly distributed, their accuracies are about 33%. We followed the same procedure as in the first experiment to assess whether a model outperforms another or not. First, we selected the best model for each metric; then we computed the mean absolute difference between the ground truths and the models' predictions; and finally, we compared these absolute differences using a statistical (Wilcoxon) test. Table 3 shows that the mean absolute difference between AM and NL is significant with the LeNet architecture, as the p-value is << 0.05. It shows that, considering these two model architectures, Task 3 is solved more accurately with the NL representation than with the AM representation.

Models convergence With the LeNet architecture, the fact that the model trained on the AM representation was not able to learn in this experiment has major impacts on the interpretation of our results. We may consider that, because the model did not learn, it is impossible to compare its convergence speed to the others'. Nevertheless, if the convergence speed were to be analyzed, we consider that if a model was not able to learn how to solve the task, it may be due to the bound on the number of epochs. For instance, LeNet on AM might have been able to learn to solve the task with a hundred more training epochs. In that case, LeNet with NL outperforms LeNet with AM, as its best results are reached at epoch 76. Still, we rather believe that if a model cannot learn a task on a representation, it either means that the representation prevents solving the task, or that the representation is too complex for such a task.

As the VGG16 models were unable to learn with both AM and NL, there is no convergence to take into account. We will discuss some possible reasons for these failures in Section 5.

We can conclude that our hypothesis H2 has been verified, as there is a significant performance difference in favor of the NL representation between the selected models (see Table 3). Yet, this result is to be considered within the scope of that experiment.

4.2.3 Limitations

The first limitation concerns the memory issues faced by our runtime infrastructure when feeding large images to the networks. We had two runtime environments at our disposal to train them: (i) a GPU computer with an NVIDIA Titan X (Pascal) 12GB card, and (ii) a CPU cluster of 50 nodes (i.e. computers) with 64GB of RAM each, among which up to 45GB could be reserved for a training execution. We managed to train LeNet on the 12GB GPU computer with some modifications (see Section 4.2.1), but this memory was not enough to run VGG16. For the latter, we had to run on the CPU cluster, which took about 6 days to train over 100 epochs. We needed to scale down our architectures to run this experiment, although it goes against our will to make use of generic implementations of model architectures. We expect such a drawback to be present in any other study treating high-level tasks or large images.

As in the first experiment, the remaining limitations are related to the graph drawings. First, we still ignored any interaction; this is even more limiting as the graphs are a few hundred nodes big and we could not scale up the image size as would have been necessary to keep the node drawings wide enough. Thus, the nodes in our representations are significantly small (1 pixel per node in AM images), and interactions such as zoom-in or selection, as in the original study, bring a lot to a visualization technique in these cases. Nevertheless, we want to compare the visualization techniques themselves, not the interactive tools they come with.

Finally, our way of highlighting a node (see Section 4.2.1) differs from the original study. We defined a fair way to do it so that it is possible for both a human and a learning model to identify the highlighted nodes. The main issue is that, since nodes are represented with only 1 pixel in the AM representation, it could be tough for a network to understand which nodes are highlighted (mainly because of max-pooling layers). This difficulty comes from our infrastructure's technical limitations. But even so, such a limitation is primarily induced by the AM representation itself, which will always occupy a space quadratic in the number of nodes.

5 Discussion

While it seems that the performance of users and of computer vision techniques can be correlated, our approach also suffers from several limitations. This section discusses the main limitations we identified.

In most formal user evaluations, the user is asked to solve a given task in an interactive environment. For example, in the studies of [12, 13], the user can zoom in/out, pan, or even highlight the neighborhood of a given node to ease task completion. On the one hand, providing interaction tools may make the evaluation more realistic, as no visualization software would provide only still images. On the other hand, it may restrict the scope of the evaluation: it would not only compare several visualization techniques, but rather combinations of visualizations and their associated interaction tools. This is particularly true when comparing multiform visualization techniques, where the provided interaction tools have to be adapted to each technique.

Although our experiments covered different task types and difficulties, further work is needed to strengthen this study. Indeed, these experiments are not sufficient to conclude that the performance of users and of computer vision techniques are correlated in general. Comparing more diverse tasks on more visualization techniques would be a first step to generalize our study and explore further how correlated the results of users and models can be.

Another drawback of our approach is its computational cost. Performing the proposed evaluation was more difficult than expected: VGG16, and even LeNet, can be heavy to run on some infrastructures when the input shape becomes too large. For the first experiment, with the distributed infrastructure described in Section 4.2.3, we could train all the models with a batch size of 400 images in one week. For VGG16, training a model (100 epochs) for one task and one visualization took one day, against 4 hours for LeNet. For the second experiment, we had to downsize our models and batch sizes during training so they could fit in memory. With the previously described settings, each VGG16 took about 6 days to train over 100 epochs for each visualization technique, against 9 hours for LeNet. Such computation times are a limitation of our approach. We assume that progress in machine-learning-dedicated hardware will soon make it possible to run such experiments in less than a day. Nevertheless, today's infrastructures are sufficient to deepen this exploration.

The second experiment raised another concern about the way we designed our study. The way we normalize the models' training parameters caused problems during the training of some of them. It is expected that some model architectures will be better suited to some tasks than others. But, as previously stated, we want our models to be trained in such a way that the visualization technique (i.e., the features they are provided) is the only difference that can impact their performance. However, the second experiment showed that this might be more complicated than expected, as the different model architectures we use and their inherent complexity make it difficult to set up a generic environment in which they all learn properly. In that experiment, we noticed that the accuracies of the VGG16 models oscillated across the epochs. Such a behavior could be due to a learning rate so high that the gradient descent steps are too large to optimize the cost function, or to a lack of normalization. Other explanations could be hypothesized: for instance, VGG16 might have been unable to learn due to a lack of training samples, a too short training period, or even a too large batch size.
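If a too high learning rate is indeed the cause, an inexpensive counter-measure would be to shrink it automatically whenever the validation loss plateaus. A minimal sketch follows, assuming a Keras-style training loop; the initial learning rate, patience, and input shape are illustrative values, not those of our experiments.

```python
# Minimal sketch: damp oscillating validation accuracy by reducing the
# learning rate on plateau (assumption: Keras API; hyper-parameter
# values below are illustrative, not those of our experiments).
from tensorflow import keras

model = keras.applications.VGG16(weights=None, classes=10,
                                 input_shape=(128, 128, 3))
model.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation loss
    factor=0.5,          # halve the learning rate on plateau
    patience=5,          # after 5 epochs without improvement
    min_lr=1e-5)

# Training data is omitted; the callback plugs into a standard fit call:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])
```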

Finally, it is noteworthy that the model which best learned to solve Task 3 converged in about 80 epochs, while 100 epochs were not enough for the others, whereas the mean convergence speed for the 3 counting tasks is 31 epochs. It could thus be interesting to study whether this result reveals a correlation between a task's difficulty and the model's convergence speed, or whether it is only due to some other parameter.

6 Conclusion

This paper has presented experiments that explore correlations between humans and deep-learning-based computer vision techniques in order to evaluate graph visualization techniques. We trained models to solve tasks on several representations (AM and NL) and compared their performance to two state-of-the-art user evaluations. Our results tend to show that humans and computer vision techniques could be correlated on the considered task types; however, additional work is needed to verify how this generalizes.

A first improvement to deepen the experiments we ran is to enlarge their scope with more tasks of various types, and to use other model architectures to verify the robustness of our results. A second avenue for improvement is to compare a representation with itself while only changing some of its parameters, to make sure that our models are indeed learning to interpret the visualization. For example, one could compare two otherwise identical AM representations, one using a state-of-the-art ordering algorithm and the other a random ordering, and verify that our approach gives the expected result [24]: random ordering performs worse; see the sketch below.
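As an illustration, the two AM variants could be produced as follows. This is a minimal sketch assuming networkx (whose built-in Louvain implementation stands in for a state-of-the-art ordering, as in Figure 2) and matplotlib; the graph model and sizes are illustrative.

```python
# Minimal sketch: render two adjacency-matrix (AM) images of the same
# graph, one ordered by Louvain communities (stand-in for a
# state-of-the-art ordering) and one by a random permutation.
import random
import networkx as nx
import matplotlib.pyplot as plt

G = nx.gnm_random_graph(100, 800, seed=0)  # V=100, E=800 (illustrative)

# Community-based ordering: group nodes by their Louvain community.
communities = nx.community.louvain_communities(G, seed=0)
ordered = [n for community in communities for n in sorted(community)]

# Random ordering: shuffle the same node set.
shuffled = list(G.nodes())
random.Random(0).shuffle(shuffled)

for name, order in (("louvain", ordered), ("random", shuffled)):
    A = nx.to_numpy_array(G, nodelist=order)
    plt.imsave(f"am_{name}.png", A, cmap="gray_r")
```

With such pairs of images, the training protocol could be re-run unchanged, with the node ordering as the only varying factor.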

The purpose of this study was to test the feasibility of our approach. As its results are encouraging, we hope it can serve as a first step for further studies on this topic.

A direction for future research is to include interactive visualization in the evaluation process. Reinforcement learning [23] seems to be a good option, as it would let an agent interact with the visualization to learn how to solve a task. A second direction could explore further the (dis)similarities between human and machine computations. One could study the correlation between machine performance metrics (e.g., convergence speed) and human performance, or how classes of problems affect human and machine performance and whether such effects are correlated.

Finally, if future studies were to verify our assumption, we hope that this new approach could lead toward a methodology that uses computer vision techniques as a pre-process of user evaluations. We believe that such a methodology could bring a lot to the visualization community by providing a tool to make experiments reproducible, create test beds, and improve the definition of end-user evaluation protocols.


References

[1] Ala Abuthawabeh, Fabian Beck, Dirk Zeckzer, and Stephan Diehl. Finding structures in multi-type code couplings with node-link and matrix visualizations. In Software Visualization (VISSOFT), 2013 First IEEE Working Conference on, pages 1–10, 2013.

[2] Basak Alper, Benjamin Bach, Nathalie Henry Riche, Tobias Isenberg, and Jean-Daniel Fekete. Weighted graph comparison techniques for brain connectivity analysis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 483–492, 2013.

[3] Michael Behrisch, Benjamin Bach, Nathalie Henry Riche, Tobias Schreck, and Jean-Daniel Fekete. Matrix reordering methods for table and network visualization. Computer Graphics Forum, 35(3):693–716, 2016.

[4] Michael Behrisch, Michael Blumenschein, Nam Wook Kim, Lin Shao, Mennatallah El-Assady, Johannes Fuchs, Daniel Seebacher, Alexandra Diehl, Ulrik Brandes, Hanspeter Pfister, et al. Quality metrics for information visualization. Computer Graphics Forum, 37(3):625–662, 2018.

[5] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[6] Markus Chimani. Bend-minimal orthogonal drawing of non-planar graphs, 2004.

[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.

[8] William S. Cleveland and Robert McGill. Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387):531–554, 1984.

[9] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In 26th International Conference on Computer Communications and Networks (ICCCN), United States, September 2017.

[10] Samuel Dodge and Lina Karam. Can the early human visual system compete with deep neural networks? In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 2798–2804. Institute of Electrical and Electronics Engineers Inc., January 2018.

[11] Arne Frick, Andreas Ludwig, and Heiko Mehldau. A fast adaptive layout algorithm for undirected graphs. In International Symposium on Graph Drawing, 1994.

[12] Mohammad Ghoniem, J-D Fekete, and Philippe Castagliola. A comparison of the readability of graphs using node-link and matrix-based representations. In IEEE Symposium on Information Visualization (INFOVIS), pages 17–24, 2004.

[13] Mohammad Ghoniem, Jean-Daniel Fekete, and Philippe Castagliola. On the readability of graphs using node-link and matrix-based representations: a controlled experiment and statistical analysis. Information Visualization, 4(2):114–135, 2005.

[14] Daniel Haehn, James Tompkin, and Hanspeter Pfister. Evaluating graphical perception with CNNs. IEEE Transactions on Visualization and Computer Graphics, 2018.

[15] Hammad Haleem, Yong Wang, Abishek Puri, Sahil Wadhwa, and Huamin Qu. Evaluating the readability of force directed graph layouts: A deep learning approach. arXiv preprint arXiv:1808.00703, 2018.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, December 2015.

[17] René Keller, Claudia M Eckert, and P John Clarkson. Matrices or node-link diagrams: which visual representation is better for visualising connectivity models? Information Visualization, 5(1):62–76, 2006.

[18] Stephen G Kobourov, Sergey Pupyrev, and Bahador Saket. Are crossings important for drawing large graphs? In International Symposium on Graph Drawing, pages 234–245, 2014.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, pages 1324–1332, 2010.

[23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[24] Christopher Mueller, Benjamin Martin, and Andrew Lumsdaine. A comparison of vertex ordering algorithms for large graph visualization. In 2007 6th International Asia-Pacific Symposium on Visualization (APVIS'07), pages 141–148, 2007.

[25] Mershack Okoe and Radu Jianu. Ecological validity in quantitative user studies – a case study in graph evaluation. In IEEE VIS, 2015.

[26] Mershack Okoe, Radu Jianu, and Stephen Kobourov. Revisited network representations. In International Symposium on Graph Drawing (GD), 2017.

[27] Mershack Okoe, Radu Jianu, and Stephen G Kobourov. Node-link or adjacency matrices: Old question, new insights. IEEE Transactions on Visualization and Computer Graphics, 2018.

[28] Helen C Purchase. Experimental Human-Computer Interaction: A Practical Guide with Visual Examples. 2012.

[29] Maryam Rahnemoonfar and Clay Sheppard. Deep count: fruit counting based on deep simulated learning. Sensors, 17(4):905, 2017.

[30] Donghao Ren, Laura R. Marusich, John O'Donovan, Jonathan Z. Bakdash, James A. Schaffer, Daniel N. Cassenti, Sue E. Kase, Heather E. Roy, Wan-yi (Sabrina) Lin, Tobias Höllerer, et al. Understanding node-link and matrix visualizations of networks: A large-scale online experiment. Network Science, 7(2):242–264, 2019.

[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[32] J. Sansen, R. Bourqui, B. Pinaud, and H. Purchase. Edge visual encodings in matrix-based diagrams. In 2015 19th International Conference on Information Visualisation, pages 62–67, July 2015.

[33] Santi Seguí, Oriol Pujol, and Jordi Vitrià. Learning to count with deep object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 90–96, 2015.

[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[35] Sebastian Stabinger, Antonio Rodríguez-Sánchez, and Justus Piater. 25 years of CNNs: Can we compare to human abstraction capabilities? In Alessandro E.P. Villa, Paolo Masulli, and Antonio Javier Pons Rivero, editors, Artificial Neural Networks and Machine Learning – ICANN 2016, pages 380–387, Cham, 2016. Springer International Publishing.

[36] Andy Thomas and Christian Kaltschmidt. Bio-inspired Neural Networks, pages 151–172. Cham, 2014.

[37] Elad Walach and Lior Wolf. Learning to count with CNN boosting. In European Conference on Computer Vision, pages 660–676, 2016.

[38] Matthew D Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[Figure 2 image panels: graph drawings for all combinations of V in {20, 50, 100} and d in {0.2, 0.4, 0.6}, shown once per representation (panels a–i and j–r).]

Figure 2: Examples of uncorrelated graph images for the NL and AM representations. V is the number of nodes, E is the number of edges, and d is the density. The NL layout algorithm is GEM and the AM ordering is computed using the Louvain clustering algorithm.

Table 2: Summary of the results obtained for each combination of task, network, and evaluation metric. For each triplet, the best epoch is obtained from the validation database while the performance is obtained from the test database. The best configuration over the rows is presented in bold.

                                  Performance        Best epoch
Task    Evaluation  Network       AM        NL       AM     NL
Task 0  accuracy    LeNet         1.0       0.1      11     33
                    VGG16         1.0       0.15     18     25
        delta10     LeNet         1.0       0.86      5     47
                    VGG16         1.0       0.97     13     24
        MSE         LeNet         0.0     143.6      11     47
                    VGG16         0.0      19.23     18     25
        percent10   LeNet         1.0       0.71      5     47
                    VGG16         1.0       0.82     14     24
Task 1  accuracy    LeNet         1.0       0.14     17     50
                    VGG16         0.95      0.19     69     49
        delta10     LeNet         1.0       0.4      17     76
                    VGG16         0.96      0.51     86     49
        MSE         LeNet         0.04   1327.56     17     38
                    VGG16       174.04    606.15     59     55
        percent10   LeNet         1.0       0.61     17     78
                    VGG16         0.95      0.74     86     49
Task 2  accuracy    LeNet         0.29      0.25     16     15
                    VGG16         0.28      0.26     19     36
        delta10     LeNet         1.0       1.0       7      5
                    VGG16         1.0       1.0      19      6
        MSE         LeNet         2.64      4.92     10     64
                    VGG16         2.64      3.37     19     31
        percent10   LeNet         0.66      0.56     10     24
                    VGG16         0.64      0.63     19     31
Task 3  accuracy    LeNet         0.34      0.79      1     76
                    VGG16         0.32      0.32      1      3
        MSE         LeNet         1.68      0.27      1     84
                    VGG16         0.68      0.68      1      3

Table 3: Statistical validation of the hypotheses. For each task x representation, the mean of the absolute difference between the ground truth and the prediction is shown for AM and NL over all graph x representation possibilities; the p-value of the Wilcoxon test is also provided (0 values are due to rounding issues).

                                    Mean of abs. diff.
Task    Model selection criterion   AM       NL        p-value
Task 0  MSE                         0.00      4.64     0.00
        accuracy                    0.00      4.86     0.00
        percent10                   0.05      4.66     0.00
        delta10                     0.05      4.66     0.00
Task 1  MSE                         0.98     20.15     0.00
        accuracy                    1.00     20.38     0.00
        percent10                   1.01     19.90     0.00
        delta10                     1.01     20.15     0.00
Task 2  MSE                         1.16      1.46     3.03x10^-49
        accuracy                    1.17      1.53     9.85x10^-60
        percent10                   1.16      1.44     8.76x10^-39
        delta10                     1.24      1.66     8.99x10^-78
Task 3  MSE                         0.84      0.46     6.45x10^-239
        accuracy                    0.84      0.47     9.07x10^-203

[Figure 3 image panels: confusion matrices (Prediction vs. Groundtruth) for Tasks 0–3, for LeNet and VGG16 on both AM and NL representations; cell color encodes the proportion of predictions, from 0.0 (white) to 1.0 (dark blue).]

Figure 3: Confusion matrices of the best model according to the MSE metric for each task and representation, computed on the test database. LeNet and VGG16 models have been trained to solve the considered graph visualization tasks on both AM and NL diagrams. In a perfectly performing system, the diagonal is dark blue while the other cells are white. VGG16 seems to perform better than LeNet on the NL representation, while the opposite can be observed for the AM representation.

[Figure 4 image panels: accuracy curves for the pairs (AM, LeNet), (AM, VGG16), (NL, LeNet), and (NL, VGG16); panels (a, c, e) show Tasks 0–2 grouped by edge density, panels (b, d, f) show Tasks 0–2 grouped by number of nodes.]

Figure 4: Accuracies on the test dataset for all pairs of network and representation, grouped by range of edge densities (left column) and of node counts (right column). The selected models maximize accuracy on the validation dataset.

[Figure 5 image panels: histograms of ground-truth values (x-axis) vs. number of graphs (y-axis) for (a) Task 0, (b) Task 1, (c) Task 2, and (d) Task 3.]

Figure 5: Ground-truth distributions for the 4 tasks in the test database.

[Figure 6 image panels: (a) Receipe (NL), (b) Receipe (AM), (c) UsAir (NL), (d) UsAir (AM).]

Figure 6: Our AM and NL representations of the graphs of [27]. Examples of images of the Receipe and UsAir graphs with NL (a and c) and AM (b and d), where two nodes have been highlighted.