interactive visualization of large similarity graphs and ... · support large graphs,...

4
Interactive Visualization of Large Similarity Graphs and Entity Resolution Clusters Demonstration Paper M. Ali Rostami, Alieh Saeedi, Eric Peukert, and Erhard Rahm Database group and ScaDS Dresden/Leipzig, University of Leipzig {rostami,saeedi,peukert,rahm}@informatik.uni-leipzig.de ABSTRACT Entity Resolution (ER) identifies semantically equivalent enti- ties, e.g. describing the same product or customer. It is a crucial and challenging step when integrating heterogeneous (big) data sources. ER approaches typically compute a similarity graph where vertices represent entities and edges (links) connect sim- ilar entities. Different clustering algorithms can be applied on such similarity graphs to finally determine groups of matching entities. In this demonstration paper, we introduce a new inter- active tool to visualize and thus help to analyze large similarity graphs and large sets of ER clusters. Users can intuitively investi- gate the link and cluster structure to identify potential problems such as overly large clusters, cluster overlaps or singletons that might indicate the need for repair activities on the ER result. To support large graphs, computation-intensive tasks like layout- ing and sampling are executed on the server side as parallel or serial processes. The demo walks through different matching and clustering tasks and allows users to interactively explore the results. 1 INTRODUCTION In the era of big data, one of the challenging data integration tasks is to identify semantically equivalent entities (e.g., describ- ing the same product or customer) in large and heterogeneous data sources. This task is referred to as Entity Resolution (ER) or record linkage [6]. ER typically computes a similarity graph where vertices represent entities and edges (links) connect similar entities with a pair-wise similarity above a predefined threshold. Matching entities can directly be derived from such a similar- ity graph or from groups of matching entities determined with subsequent clustering algorithms [9]. Our recently introduced framework FAMER (Framework for Multi-source Entity Resolu- tion) supports a parallel computation of such similarity graphs and ER clusters for multiple (2) data sources [12]. During the continuous development of FAMER, it is difficult to investigate the correctness and efficiency of the certain algo- rithms and to understand the problems. Such an investigation may lead to introduce better match and cluster algorithms or even an extra postprocessing step of repair (for more details see [13]). However, to identify (or debug) such issues with lim- ited effort and time we see the need for a comprehensive and powerful approach to visually analyze similarity graphs and ER clusterings. Unfortunately, general purpose graph visualization tools like Gephi 1 or Graphviz 2 have only limited capabilities to 1 https://gephi.org 2 http://www.graphviz.org © 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018, ISBN 978-3-89318-078-3 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. Figure 1: The software architecture analyze ER clusterings and also have problems to visualize large (big data) similarity graphs with vertex and edge properties. In this demo paper we introduce SIMG-VIZ, a new visual- ization system for entity resolution and clustering that allows us to investigate different match and clustering techniques for multi-source entity resolution. SIMG-VIZ offers the following key features: SIMG-VIZ allows a user to analyze precomputed similarity graphs and clusterings from existing ER tools and also supports executing and analyzing ER match tasks directly with FAMER. Different graph and ER cluster visualization techniques and layouts can be applied to choose the best visualiza- tions. To increase performance, some layouts can be precom- puted on the server with either parallel or serial computa- tion. This provides a significant optimization potential in particular for force-directed layouts [3]. To support visualization of large graphs, preprocessing techniques such as sampling (also executed in parallel on the server) can be selected to obtain a fast overview of large similarity graphs and their clustering results. Clusters and their overlaps as well as edges annotated with their type and similarity are visualized by using a simple but useful cake-like visual metaphor. Users can interact with clusters and select individual clusters for investigation. 2 SIMG-VIZ OVERVIEW The SIMG-VIZ system consists of three modules: (1) the FAMER server, (2) a visualization server in JAVA and (3) a web-based UI-client written in JavaScript (see Fig. 1). The FAMER server is used to link several sources and executes defined matching tasks. However, SIMG-VIZ also allows a user to load similarity graphs and clustering results that were com- puted by other frameworks and tools. The visualization server offers several preprocessing (e.g. for sampling) and layouting algorithms. The preprocessing algorithms are implemented in distributed fashion based on Gradoop and Apache Flink whereas Demonstration Series ISSN: 2367-2005 690 10.5441/002/edbt.2018.86

Upload: others

Post on 06-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Interactive Visualization of Large Similarity Graphs and ... · support large graphs, computation-intensive tasks like layout-ing and sampling are executed on the server side as parallel

Interactive Visualization of Large Similarity Graphs and EntityResolution Clusters

Demonstration Paper

M. Ali Rostami, Alieh Saeedi, Eric Peukert, and Erhard RahmDatabase group and ScaDS Dresden/Leipzig, University of Leipzig

{rostami,saeedi,peukert,rahm}@informatik.uni-leipzig.de

ABSTRACTEntity Resolution (ER) identifies semantically equivalent enti-ties, e.g. describing the same product or customer. It is a crucialand challenging step when integrating heterogeneous (big) datasources. ER approaches typically compute a similarity graphwhere vertices represent entities and edges (links) connect sim-ilar entities. Different clustering algorithms can be applied onsuch similarity graphs to finally determine groups of matchingentities. In this demonstration paper, we introduce a new inter-active tool to visualize and thus help to analyze large similaritygraphs and large sets of ER clusters. Users can intuitively investi-gate the link and cluster structure to identify potential problemssuch as overly large clusters, cluster overlaps or singletons thatmight indicate the need for repair activities on the ER result. Tosupport large graphs, computation-intensive tasks like layout-ing and sampling are executed on the server side as parallel orserial processes. The demo walks through different matchingand clustering tasks and allows users to interactively explore theresults.

1 INTRODUCTIONIn the era of big data, one of the challenging data integrationtasks is to identify semantically equivalent entities (e.g., describ-ing the same product or customer) in large and heterogeneousdata sources. This task is referred to as Entity Resolution (ER)or record linkage [6]. ER typically computes a similarity graphwhere vertices represent entities and edges (links) connect similarentities with a pair-wise similarity above a predefined threshold.Matching entities can directly be derived from such a similar-ity graph or from groups of matching entities determined withsubsequent clustering algorithms [9]. Our recently introducedframework FAMER (Framework for Multi-source Entity Resolu-tion) supports a parallel computation of such similarity graphsand ER clusters for multiple (≥ 2) data sources [12].

During the continuous development of FAMER, it is difficultto investigate the correctness and efficiency of the certain algo-rithms and to understand the problems. Such an investigationmay lead to introduce better match and cluster algorithms oreven an extra postprocessing step of repair (for more detailssee [13]). However, to identify (or debug) such issues with lim-ited effort and time we see the need for a comprehensive andpowerful approach to visually analyze similarity graphs and ERclusterings. Unfortunately, general purpose graph visualizationtools like Gephi1 or Graphviz2 have only limited capabilities to1https://gephi.org2http://www.graphviz.org

© 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21stInternational Conference on Extending Database Technology (EDBT), March 26-29,2018, ISBN 978-3-89318-078-3 on OpenProceedings.org.Distribution of this paper is permitted under the terms of the Creative Commonslicense CC-by-nc-nd 4.0.

Figure 1: The software architecture

analyze ER clusterings and also have problems to visualize large(big data) similarity graphs with vertex and edge properties.

In this demo paper we introduce SIMG-VIZ, a new visual-ization system for entity resolution and clustering that allowsus to investigate different match and clustering techniques formulti-source entity resolution. SIMG-VIZ offers the followingkey features:

• SIMG-VIZ allows a user to analyze precomputed similaritygraphs and clusterings from existing ER tools and alsosupports executing and analyzing ER match tasks directlywith FAMER.

• Different graph and ER cluster visualization techniquesand layouts can be applied to choose the best visualiza-tions.

• To increase performance, some layouts can be precom-puted on the server with either parallel or serial computa-tion. This provides a significant optimization potential inparticular for force-directed layouts [3].

• To support visualization of large graphs, preprocessingtechniques such as sampling (also executed in parallel onthe server) can be selected to obtain a fast overview oflarge similarity graphs and their clustering results.

• Clusters and their overlaps as well as edges annotatedwith their type and similarity are visualized by using asimple but useful cake-like visual metaphor. Users caninteract with clusters and select individual clusters forinvestigation.

2 SIMG-VIZ OVERVIEWThe SIMG-VIZ system consists of three modules: (1) the FAMERserver, (2) a visualization server in JAVA and (3) a web-basedUI-client written in JavaScript (see Fig. 1).

The FAMER server is used to link several sources and executesdefined matching tasks. However, SIMG-VIZ also allows a userto load similarity graphs and clustering results that were com-puted by other frameworks and tools. The visualization serveroffers several preprocessing (e.g. for sampling) and layoutingalgorithms. The preprocessing algorithms are implemented indistributed fashion based on Gradoop and Apache Flink whereas

Demonstration

Series ISSN: 2367-2005 690 10.5441/002/edbt.2018.86

Page 2: Interactive Visualization of Large Similarity Graphs and ... · support large graphs, computation-intensive tasks like layout-ing and sampling are executed on the server side as parallel

graph layoutings are currently only implemented as non-distributedalgorithms. The web-based client, provides an interactive visual-ization of similarity graphs and ER-clusterings. Node and edgeproperties can be investigated and server-side components likematching, clustering, preprocessing and layouting can be trig-gered. Through the client a user selects options and triggersserver based REST-interfaces. The visualization server and theFAMER server both respond with JSON results. In the followingparagraphs each of the components is described in detail.

2.1 FAMER ToolFAMER offers parallel entity resolution for multiple data sources.It supports different match and clustering schemes that are ex-ecuted on top of a distributed data processing engine (ApacheFlink [4]) and the graph analytics framework Gradoop [10]. Itsexecution has two main phases: (1) computation of a similar-ity graph based on pairwise matching and (2) clustering. Thefirst phase consists of several steps, namely blocking, pairwisecomparisons, and match classification. Blocking reduces the num-ber of necessary comparisons which otherwise would require tocompare each entity of a data source to all entities of any othersource. Different blocking techniques can be applied in FAMERsuch as Standard Blocking (SB), Sorted Neighborhood as well assingle- and multi-pass blocking. In the matching step all entitiesof the computed blocks are compared by computing and combin-ing similarities between their attributes, e.g. based on similaritymeasures such as Edit-Distance or Jaro Winkler. For matching,match rules can specify the required minimal similarity for theconsidered attribute-comparisons. The output of this step is theset of matching entity pairs (links) together with a combinedsimilarity value per link. In the following match classificationstep entity pairs are classified as match or no-match based onthe computed similarity values. This output is stored as a simi-larity graph where entities are represented as vertices and matchlinks as edges. The clustering phase of FAMER aims at groupingtogether all matching vertices of the similarity graph based onthe link structure and similarity values. Clustering algorithmstypically try to maximize the similarity between entities withina cluster while the similarity between entities of different clus-ters should be minimized. FAMER supports several clusteringalgorithms for ER such as Connected Components, CorrelationClustering (CCPivot) [2, 5], Center [9], Merge Center [1], andStar [1].

With SIMG-VIZ, all components of FAMER can be configuredand parameterized which allows users to run and compare dif-ferent ER match tasks.

2.2 Client (Web-based HTML/JS Frontend)Fig. 2 shows an overview of the web-based client of SIMG-VIZ. Atthe top part of the client, several options can be selected. The usercan choose between different clustering and match configura-tions which are stored after executing FAMER. Before visualizingthe similarity graph some available preprocessing algorithmssuch as sampling can be applied. Moreover, the user can selectlayouting options, i.e. which layout to use and where the compu-tation should take place. In particular for large graphs, computingthe layout on the server gives significant improvements. Finally,SIMG-VIZ offers a list of actions (see Table 1) that support differ-ent drawing tasks of the graph and some statistics computationscan be triggered. On the right side of the client a number ofparameters for visualization can be set. To improve interactivity,

styling tasks are performed on the client, for example changingvertex or edge sizes.

Table 1: Actions in SIMG-VIZ

Action Description

Draw graph (Cytoscape) Draws a sim-graph in Cytoscape3.Draw graph (WebGL) Draws a sim-graph in WebGL based on Vi-

vaGraph.Compute only Executes preprocessing and sampling with-

out visualization.Compute labels/keys Computes all labels and property-keys of

the vertices and edges for filtering in theleft part of the UI.

Save as image Exports an image of the drawn graph.Remove selected node Removes a selected node.Degree Distribution Computes the degree distribution of the

graph.Graph Statistics Computes additional basic statistics of a

graph.

2.3 Visualization ServerThe visualization server offers preprocessing and layouting ser-vices. The preprocessing algorithms are implemented in a dis-tributed fashion based on Gradoop and Apache Flink. Table 2 listscurrently implemented preprocessing components. All layouting

Table 2: Preprocessing algorithms in SIMG-VIZ

Preprocessing Description

Graph sampling Computes a statistical sampling of agraph. Currently SIMG-VIZ implementsvertex, edge and page rank sampling.

Graph summary Computes a graph summary by groupingdense subgraphs of a graph to generate acompact overview of a large graph [11].In SIMG-VIZ the Flink implementation isused.

Cluster neighbor filtering In particular for ER-Clustering a filteringto neighbors of clusters is needed. Theuser specifies a cluster ID (at the rightpart of UI) and that cluster together withits neighbor clusters is visualized.

Cluster sizes filtering Only the clusters with specific sizes arevisualized.

Cluster Aggregation It visualizes a graph in which the clustervertices are grouped together.

algorithms are available both for the client and the server as non-distributed algorithms. We observed that executing layoutingalgorithms that need iterations like the force directed layout [8]should not be executed on the client within a browser. Executingit on a server brings significant run-time improvements, evenwithout distributing the computation to multiple nodes. After alayout computation finishes on the server, the positions of nodesare send to the client together with the graph. As future workwe plan to implement parallel versions of layouting to be run ontop of Flink or Gradoop.

691

Page 3: Interactive Visualization of Large Similarity Graphs and ... · support large graphs, computation-intensive tasks like layout-ing and sampling are executed on the server side as parallel

Figure 2: An overview of SIMG-VIZ

2.4 Cluster VisualizationIn this section, we describe some specific features of SIMG-VIZwhich are designed particularly for the visualization of ER clus-ters.

These features are explained along a real world ER-example ofintegrating four duplicate-free data sources namely Freebase (Fb),New York Times (nyt), DBpedia (db), and Geonames (geo). Weinitially compute a similarity graph and apply different clusteringtechniques with FAMER. A result cluster includes only verticesfrom these four sources which are most probably the same realworld entities. An edge connects two vertices which have a highvalue of computed similarity measure. In Fig. 3 we initially visu-alize the complete clustering result. Clusters are given differentcolors to indicates cluster membership. Users can identify clus-ters that may warrant a closer inspection, e.g., clusters with morethan 4 vertices or singleton clusters. A user can zoom in and in-spect the properties of each vertex. Since generic graph layoutingalgorithms often have problems in visualizing large similaritygraphs (e.g., problem of edge cluttering) we applied a compoundlayout for cluster visualization. Such compound layouts of graphslike CoSE-Bilkent [7] visualize vertices in a cluster (referred toas compound in the paper) close to each other while the wholegraph is visualized by using a modified force-directed algorithm.Fig. 4(middle) shows a visualization of such a compound layoutwhich we also compute on the server side. Vertices of a clustershare the same color and are closely grouped together to form acompound.

To get a cleaner picture, a user can interactively select a specificcluster or enter a cluster ID to only visualize a specific clusterfor closer inspection. Often we also need to visualize a clustertogether with its neighboring clusters (see Fig. 4 (top)). The vertexlabels here refer to the corresponding data source for that entity.The scenario illustrates a problem case since there are more thanfour cluster members with some data sources having two entitiesin the same cluster which should not be possible for duplicate-free sources. There is also a singleton cluster that might have

Figure 3: A visualization of all clusters.

to be merged with another cluster. Based on such observationswe are now able to re-assess the used cluster algorithms andinvestigate new approaches for cluster repair.

We also provide support for visualizing clustering result ofspecific clustering algorithms like Star[9]. The Star clusteringcomputes cluster representatives and all neighbors of those repre-sentatives are assigned to the corresponding cluster. In SIMG-VIZthose cluster representatives are highlighted with a black outline(see Figure 4). It happens that a vertex is a neighbor of severalcluster representatives so that such vertex will belong to multipleclusters. These multi-assignments are represented as pie-chartson nodes. Each piece of a pie chart which has a specific color spec-ifies a cluster assignment. For example, Fig. 4 (Middle) contains a

692

Page 4: Interactive Visualization of Large Similarity Graphs and ... · support large graphs, computation-intensive tasks like layout-ing and sampling are executed on the server side as parallel

Figure 4: visualizations of (top) clusters with different col-ors, (Middle) vertices can belong to more than one cluster- drawn in compound layout, (Bottom) different styles foredges indicating how strong connections between entitesare.

pie chart with three pieces which means that node (entitiy) is as-signed to three clusters. Obviously some cluster post-processingis needed to select the best cluster for each entity that have beenassigned to several clusters.

SIMG-VIZ provides special visualization support for evalu-ating clusters when the perfect cluster result is available forcomparison. As shown in Fig. 4 (Bottom) there are different edgecolors: green for edges within correct clusters and red for (wrong)edges between such clusters. Hence, clusters containing red edgesshould be investigated more closely. Finally we implemented amap-visualization of geo-referenced data so that entities can beplotted onto a map. With the help of this map-visualization weare able to identify false matches within a given data set.

2.5 Web-based Visualization LibrariesDrawing large graphs within a browser is problematic. We inves-tigated several different Javascript-based visualization librariesand observed that there are significant performance differences.Three groups of libraries can be found: (1) SVG-based librariescompute SVG-nodes and tags. They are often feature rich but donot scale well due to many generated SVG-Elements. (2) The sec-ond group relies on HTML-Canvas. These libraries are typically

faster but interactivity is harder to realize. Still they mostly donot scale well for larger graphs. (3) WebGL-based libraries offerthe best scalability but are still not as feature rich as existinglibraries in the other two categories. We finally decided to usethe Canvas-based Cytoscape4 library for small graphs up to 2000vertices since it gives more flexibility regarding the style of edgesand vertices. For example, the feature of drawing a pie charton vertices is already available in Cytoscape. For large graphs,we use VivaGraph5, which is a WebGL-based library with lesssupport for vertex and edge attributes, styling and coloring. How-ever for larger graphs the user won’t be able to see those detailsanyway.

3 DEMONSTRATIONIn the demonstration, we walk through a complete workflow ofentity resolution using matching and clustering for small andlarge ER match problems. A user could select select data sourcesand the corresponding properties, similarity measures and matchclassifiers that are used by FAMER. The resulting clusters fromFAMER are loaded into the visualization server. We then allowa user to compare different preprocessing algorithms as well asdifferent layoutings. We consider small data sources as well aslarge ones.

ACKNOWLEDGEMENTThis work was partly funded by the German Federal Ministry ofEducation and Research within the projects Competence Centerfor Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig(BMBF 01IS14014B) and BIGGR (BMBF 01IS16030B).

REFERENCES[1] J.A. Aslam, E. Pelekhov, and D. Rus. 2006. The Star Clustering Algorithm for

Information Organization. Springer.[2] Ni. Bansal, A. Blum, and S. Chawla. 2004. Correlation Clustering. Machine

Learning 56, 1 (2004), 89–113.[3] G.D. Battista, P. Eades, R. Tamassia, and I.G. Tollis. 1998. Graph Drawing:

Algorithms for the Visualization of Graphs. Prentice Hall.[4] P. Carbone, . Katsifodimos, S. Ewen, V. Markl, S.Haridi, and Ko. Tzoumas. 2015.

Apache Flink: Stream and Batch Processing in a Single Engine. IEEE DataEngineering Bulletin Issues 38, 4 (2015).

[5] F. Chierichetti, N. Dalvi, and R. Kumar. 2014. Correlation Clustering in MapRe-duce. In Proc. 20th ACM SIGKDD. 641–650.

[6] P. Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage,Entity Resolution, and Duplicate Detection. Springer.

[7] U. Dogrusoz, E. Giral, A. Cetintas, A. Civril, and E. Demir. 2009. A layoutalgorithm for undirected compound graphs. Information Sciences 179, 7 (2009),980–994.

[8] Thomas M. J. Fruchterman and Edward M. Reingold. 1991. Graph Drawing byForce-directed Placement. Softw. Pract. Exper. 21, 11 (Nov. 1991), 1129–1164.https://doi.org/10.1002/spe.4380211102

[9] O. Hassanzadeh, F. Chiang, R.J. Miller, and H.C. Lee. 2009. Framework forEvaluating Clustering Algorithms in Duplicate Detection. PVLDB 2, 1 (2009),1282–1293.

[10] M. Junghanns, A. Petermann, N. Teichmann, K. Gómez, and E. Rahm. 2016.Analyzing Extended Property Graphs with Apache Flink. In Proc. SIGMODWorkshop on Network Data Analytics.

[11] M. Junghanns, A. Petermann, N. Teichmann, and E. Rahm. 2017. TheBig Picture: Understanding large-scale graphs using Graph Grouping withGRADOOP. In Proc. BTW.

[12] A. Saeedi, E. Peukert, and E. Rahm. 2017. Comparative Evaluation of DistributedClustering Schemes for Multi-source Entity Resolution. Springer LNCS, 278–293.

[13] A. Saeedi, E. Peukert, and E. Rahm. 2018. Using Link Features for Entity Clus-tering in Knowledge Graphs. In Extended Semantic Web Conference. submittedfor publication.

4http://js.cytoscape.org5https://github.com/anvaka/VivaGraphJS

693