visualising complex linked data · 2015-11-01 · visualising complex linked data ... focus on...
AUSTRALIAN NATIONAL UNIVERSITY
Canberra, ACT
Visualising complex linked data
A thesis submitted in partial fulfilment
of the requirement for the course of
Computing Project (COMP 8715)
by
Quanwei Han
under the guidance of
Assoc. Prof. Peter Christen and Mr. Jeffrey Fisher
Table of contents
Abstract
1. Introduction
1.1 Glossary
2. Data collection
2.1 Background
2.2 Data description
3. Data visualisation
3.1 Single life segment visualisation
3.2 Multiple life segment visualisation
3.3 Data inconsistencies visualisation
3.4 Supplementary functions
3.5 Flexibility and reusability issues
4. Evaluation
5. Conclusion
Acknowledgement
References
Abstract
Data linkage and data visualisation are both widely used in data analysis and
presentation. However, very little effort has been devoted to visualising complex
linked data. This thesis takes a data set of linked demographic data from the Isle of
Skye in Scotland and describes a visualisation technique for communicating linked
data to users in a way that shows the characteristics of linked data and potential wrong
or inconsistent links together in one figure. The technique is implemented as a
Python-based program, which is flexible and configurable and thus can be easily
applied to other demographic datasets. The resulting prototype has been evaluated by
data linkage researchers and was considered an effective tool to illustrate linked
demographic data.
1. Introduction
As a mature technique, data linkage has been increasingly relied on by researchers,
government agencies and businesses to integrate and analyse their data. It helps
increase the integration and quality of available data, and allows data mining to be
applied to multiple databases (Christen & Goiser, 2007). Data visualisation is also
widely used in many industries for communicating information to users efficiently
and intuitively (Ward, Grinstein, & Keim, 2010), as well as for finding hidden patterns
and trends (Keim, 2002). However, little research has been done on applying visualisation
to complex linked databases. Visualisation of linked data is important because it
illustrates the resulting data in a straightforward way. In addition, it accelerates and
facilitates the process of identifying potentially wrong or inconsistent links (Appendix
A).
This project aims to develop innovative network-based techniques for visualising
complex linked data. The data used for visualisation is a demographic dataset which
links historical birth, death, marriage and census records from the Isle of Skye in
Scotland. These visualisation techniques are implemented as a Python prototype
program that generates different visualisations of the Isle of Skye historical data. This
program is flexible and configurable so that it can be applied to other linked
demographic datasets (see Appendix A for details).
During the last couple of decades, many visualisation techniques have been applied to
reflect the quality of data linkage strategies, and they have played an important role in
evaluating data linkage techniques (Abecasis & Cookson, 2000; Wigginton &
Abecasis, 2005; Christen & Goiser, 2007). Whereas almost all of these visualisations
focus on visualising the quality measures used for data linkage, like accuracy and
precision, hardly any of them try to visualise the linked data themselves. In other
words, these visualisation techniques can be used to assess data linkage results but
they do not work for identifying potentially wrong links. This project researches novel
visualisation methods for linked data, which will enable researchers to identify
potentially wrong links, and look into them by checking relevant characteristics of the
linked data.
The main challenge in this project is how to visualise the characteristics of linked data
and data inconsistencies together in one figure. They need to be shown concurrently
because they are complementary to each other when identifying potentially wrong
links. Potentially wrong links must be validated by checking the characteristics of the
relevant linked records, and the volume of linked data is often so large that it is
impossible to traverse it without hints about potentially wrong links. These two
kinds of elements should be clearly distinguishable so that researchers can easily
focus on one of them from different perspectives. In this project, a solution has been
found for this problem, which is discussed in Section 3.3.
Much research has been conducted on visualising demographic data (Wang,
Ibarra, Adnan, Longley, & Maciejewski, 2014) and the results have been
considerably rich (Andreev, 2000). One possible solution for this project is to make
use of these techniques and incorporate the visualisation of data inconsistencies into
them. The problem here is that such demographic visualisations are designed only to
reflect the characteristics of demographic data. They tend to encode all the necessary
and useful information, so little space is left for extra information. In this project,
two visualisations for demographic data are developed, and they are succinct and
compatible with the visualisation of data inconsistencies.
The major contribution of this project is a technique for visualising complex linked
data. We demonstrate this technique using a demographic linked dataset as an
example, and evaluate how effective and intuitive it is. Though it is not a standardised
visualisation technique that works with all kinds of datasets, it can still serve as a
reference for future projects. Another contribution is a set of concise and
clear visualisations of demographic data, implemented as a Python-based
program. This program is flexible and configurable so that it can easily be
applied to other similar datasets.
1.1 Glossary
This thesis involves many domains. To avoid any ambiguity, we provide a glossary
here. Terms are emphasised in bold. These terms are to be used throughout this thesis.
Data linkage: a technique used to create links between records which represent the
same entity. If two records which represent two different entities are linked together,
that is a wrong link. The information in two wrongly linked records usually conflicts;
such a conflict is a data inconsistency, which can be used to detect potentially wrong
links.
Data visualisation: a technique used to communicate information in a graphical
format. The term visualisation can also refer to a model which visualises data in a
specific way. In this project, data are presented in the format of a figure.
Life segment: An important part of this work is the visualisation of individuals' life
histories. However, an individual's life history may be divided into multiple segments
due to missed links or the individual's absence from the observed area. Thus, strictly
speaking, it is life segments that are visualised. For the sake of comprehensibility, when
describing life segment visualisation, we still refer to a life segment as an individual.
We write Lid followed by the identifier to denote the life segment with that id,
e.g. Lid1559 represents the life segment whose id is 1559.
Event: an event that happened in an individual's life. Each event corresponds to a civil
registration record or a census record. It is represented by an event object in the data
interface, and by a circle or a wedge (when multiple events share a circle)
in the visualisation. These circles are connected by a lifeline in chronological
order.
Related person: a person who has a relationship to the individual we are focusing on.
It is represented by a related person object in the data interface, and represented by a
square in the visualisation.
2. Data collection
2.1 Background
The data used in this program come from a project conducted by Alice Reid, Ros
Davies and Eilidh Garrett, which linked 19th-century census data and civil registration
data from the Isle of Skye in Scotland to form a dynamic model showing how the
family structures changed and how the population migrated (Reid, Davies, & Garrett,
2002). Before the data linkage process, the census records and civil registration records
are discrete data points. They are only useful for gathering population statistics. In
contrast, the data linkage process concatenates these data points into data lines and
reveals the connections such as causation between data points. Moreover, these data
lines together with relationships between individuals, which are embedded in the raw
data, form a data network where more hidden patterns and trends can be tapped.
When applying data linkage to demographic data, population mobility is an important
factor which may affect the quality of the data linkage result. If a location has a
transient population, it is unlikely that the data linkage result will be fruitful. Only
part of these transient individuals' demographic data will be recorded in this place,
which will result in many incomplete data lines in the data linkage result. Thus, an
island like the Isle of Skye, which has a natural boundary that impedes migration, makes a
logical choice for a small linkage project (Reid, Davies, & Garrett, 2002).
Alice Reid, Ros Davies and Eilidh Garrett (2002) chose a 'sets of related individuals'
approach to perform data linkage. With this approach, the census records and civil
registration records belonging to one individual are linked in the context of the family.
Specifically, it identifies as many families from different datasets as possible and then
matches the remaining individuals. According to the authors' analysis, this approach
leads to more robust results than linkage at a purely individual level because the chance
of two families sharing the same names and structure is much lower than that of two
individuals sharing a name, especially when the name pool is quite small.
2.2 Data description
The linked data from the work described in Section 2.1 (Reid, Davies, & Garrett, 2002)
is the dataset used in this project. It includes nine tables: longitudinal, birth, marriage,
death, and five census tables, each corresponding to one of the censuses in 1861, 1871,
1881, 1891 and 1901. Among them, only the longitudinal table was created during the
data linkage process. It acts as a "hub" which links all the other tables together. Each
record in it represents a life segment that consists of a set of identifiers corresponding
to census and vital event records.
Here are some important characteristics of the longitudinal table:
• Identifiers for all birth, marriage, death and census records are included in the
  table.
• All finished links are reflected in the table.
• Each birth ID only appears once, as an individual can only be born once.
• Each death ID only appears once, as an individual can only die once.
• Each census ID for each year only appears once, as an individual can only be
  recorded in each census once.
• Each marriage ID must appear twice, once for the bride and once for the groom,
  but each individual can marry multiple times. In this dataset, the maximum
  number of marriages for one individual is four.
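The uniqueness characteristics listed above lend themselves to a mechanical check. The following is a minimal sketch, not the thesis program: it assumes each longitudinal row has been loaded as a Python dict keyed by the column names of the longitudinal table, with null cells simply absent. The function name `check_id_counts` and the `ROWS` sample (values taken from Figure 2.2) are illustrative assumptions. Census identifiers could be checked in the same way.

```python
from collections import Counter

# Hypothetical in-memory form of a few longitudinal rows (values from Figure 2.2).
# Null cells are simply left out of each dict.
ROWS = [
    {"lifesegmentID": 88, "BirthID": 140, "marriageID1": 86},
    {"lifesegmentID": 1559, "BirthID": 1673, "DeathID": 684, "marriageID1": 86},
    {"lifesegmentID": 17802, "marriageID1": 16},
    {"lifesegmentID": 51617, "DeathID": 1390, "marriageID1": 16},
]

MARRIAGE_COLUMNS = ("marriageID1", "marriageID2", "marriageID3", "marriageID4")


def check_id_counts(rows):
    """Return (column, record_id, count) tuples that break the expected counts."""
    violations = []
    # A birth or death ID should appear exactly once across all life segments.
    for column in ("BirthID", "DeathID"):
        counts = Counter(r[column] for r in rows if r.get(column) is not None)
        violations += [(column, i, n) for i, n in counts.items() if n != 1]
    # A marriage ID should appear exactly twice (bride and groom), in any marriage column.
    marriages = Counter(
        r[c] for r in rows for c in MARRIAGE_COLUMNS if r.get(c) is not None
    )
    violations += [("marriageID", i, n) for i, n in marriages.items() if n != 2]
    return violations
```

An empty result means the sample satisfies the stated characteristics; adding a second row with BirthID 140 would be reported as a violation.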
Figure 2.1 shows all columns in the longitudinal table together with comments and
statistics about the columns, and Figure 2.2 gives a sample set of data from the
longitudinal table.
Column name | Comment | Number of records with value | Number of records with value missing | Number of unique values
------------|---------|------------------------------|--------------------------------------|------------------------
lifesegmentID | Unique ID for life segment | 54,537 | 0 | 54,537
sex | Sex of the life segment | 54,261 | 276 | 2
BirthID | ID from birth table, refers to the individual's own birth record | 17,614 | 36,923 | 17,614
sibsetID | Indicates the sibset group in which individual is a sibling | 16,713 | 37,824 | 4,326
parentmarriageID | ID from marriage table, refers to the individual's parents' marriage record | 10,571 | 43,966 | 2,134
DeathID | ID from death table, refers to the individual's own death record | 12,285 | 42,252 | 12,285
marriageID1 | ID from marriage table, refers to the individual's first marriage record | 5,237 | 49,300 | 2,666
marriageID2 | ID from marriage table, refers to the individual's second marriage record | 96 | 54,441 | 94
marriageID3 | ID from marriage table, refers to the individual's third marriage record | 2 | 54,535 | 2
marriageID4 | ID from marriage table, refers to the individual's fourth marriage record | 1 | 54,536 | 1
61pid | Person ID and Scheme (household) ID from 1861 census table, together refer to the individual's person record in 1861 census | 19,604 | 34,933 | 19,604
61sch | (see 61pid) | 19,604 | 34,933 | 4,078
71pid | Person ID and Scheme (household) ID from 1871 census table, together refer to the individual's person record in 1871 census | 18,101 | 36,436 | 18,101
71sch | (see 71pid) | 18,101 | 36,436 | 3,773
81pid | Person ID and Scheme (household) ID from 1881 census table, together refer to the individual's person record in 1881 census | 17,684 | 36,853 | 4,305
81sch | (see 81pid) | 17,684 | 36,853 | 3,796
91pid | Person ID and Scheme (household) ID from 1891 census table, together refer to the individual's person record in 1891 census | 16,476 | 38,061 | 3,933
91sch | (see 91pid) | 16,476 | 38,061 | 3,664
01pid | Person ID and Scheme (household) ID from 1901 census table, together refer to the individual's person record in 1901 census | 14,609 | 39,928 | 14,609
01sch | (see 01pid) | 14,609 | 39,928 | 3,385
Figure 2.1: Columns in the longitudinal table
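lifesegmentID | sex | BirthID | sibsetID | parentmarriageID | DeathID | marriageID1 | 61pid | 61sch | 71pid | 71sch | 81pid | 81sch | 91pid | 91sch | 01pid | 01sch
--------------|-----|---------|----------|------------------|---------|-------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
88 | m | 140 | 1474 | 23 | | 86 | | | 74 | 7101015 | 250 | 8103005 | 98 | 9102007 | |
617 | m | 673 | 1473 | 86 | | | | | | | | | 100 | 9102007 | 222 | 102029
641 | f | 697 | 1473 | 86 | | | | | | | | | 101 | 9102007 | 223 | 102029
667 | m | 723 | 1473 | 86 | 648 | | | | | | | | 102 | 9102007 | |
700 | f | 756 | 1473 | 86 | | | | | | | | | | | 224 | 102029
741 | m | 798 | 1473 | 86 | | | | | | | | | | | 225 | 102029
1559 | f | 1673 | 3251 | 16 | 684 | 86 | | | 3330 | 7207021 | 154 | 8102012 | 99 | 9102007 | |
17802 | f | | | | | 16 | 189 | 6102011 | 3328 | 7207021 | 152 | 8102012 | 103 | 9102007 | 221 | 102029
51617 | m | | | | 1390 | 16 | | | | | | | | | |
Figure 2.2: Sample rows from the longitudinal table. Note: blank cells represent null values. The columns marriageID2, marriageID3 and
marriageID4 are hidden because all their values for the given rows are null.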
Birth records
ID | Year
140 | 1864
673 | 1888
697 | 1889
723 | 1891
756 | 1893
798 | 1895
1673 | 1867

Death records
ID | Year
648 | 1894
684 | 1896
1390 | 1869

Marriage records
ID | Year
16 | 1862
23 | 1863
86 | 1887

Figure 2.3: Sample partial records of vital events. Note: all identifiers here come from Figure 2.2
All other tables are from raw data, which means that they comprise the original
columns as well as those added during the data linkage process. Due to confidentiality
requirements, these tables are inaccessible for this project, except identifiers and
occurrence years for vital event records (census years are already reflected in the
longitudinal table). However, this information is enough to support this project.
Figure 2.4 contains statistics about vital event records, and Figure 2.3 on the previous
page gives some sample records of vital events. Note that all identifiers appearing in
Figure 2.3 are for life segments shown in Figure 2.2. The records in Figures 2.2 and
2.3 will be used to assist in the demonstration of data visualisation.
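Event type | Number of records | Number of records with year missing | Number of unique years | Minimum year | Maximum year
-----------|-------------------|-------------------------------------|------------------------|--------------|-------------
birth | 17,614 | 0 | 41 | 1861 | 1901
marriage | 2,668 | 0 | 41 | 1861 | 1901
death | 12,279 | 0 | 41 | 1861 | 1901
Figure 2.4: Statistics about vital event records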
3. Data visualisation
As mentioned in the introduction, the aim of this project is to develop a novel approach
to visualise the characteristics of linked demographic data as well as data
inconsistencies to help identify the potentially wrong links generated during the data
linkage process (Appendix A). Because the latter feature requires some
domain-specific knowledge and is more subject to adjustment, we decided to design it
first and then incorporate the former feature into it.
When showing the characteristics of demographic data, two aspects are important for
describing an individual: the individual's life experience and family relationships.
By convention, life experiences are usually drawn as a lifeline which
connects meaningful events that happened in an individual's life, and
relationships are often displayed as a line between two objects. In this project, we
do not intend to break either of these conventions, so different orientations are used to
distinguish them: lifelines are drawn horizontally, and relationship lines are placed
vertically. Thus, a conceptual graph is formed which is shown in Figure 3.1.
Figure 3.1: Conceptual graph which illustrates demographic data
Based on the conceptual graph, two models are created to visualise linked
demographic data: single life segment visualisation and multiple life segment
visualisation. They are described at length in the following two sections.
The feature of showing the characteristics of linked data and the feature of
highlighting data inconsistencies are complementary to each other for the purpose of
identifying and correcting such data inconsistencies. However, users should not be
confused when both are shown together in one figure. Thus, both features should be clearly
distinguishable so that users can easily focus on one of them from different perspectives.
The detailed designs are described in Section 3.3.
Some additional functions were added to equip the program with greater interactivity,
in order to facilitate ease of use as well as provide more information. These functions
are presented in Section 3.4.
The visualisation techniques used in this project were developed based on the linked data
sets from the Isle of Skye in Scotland, as described in Section 2.1. However, the
program should be flexible and configurable so that it can be reused for other data sets.
Data interfaces and configuration files were defined to satisfy this demand, and Section
3.5 describes them in detail.
3.1 Single life segment visualisation
As the name suggests, the single life segment model is used to reflect the life and family
relationships of an individual. The first conceptual graph in Figure 3.1 at the beginning
of this chapter shows that an individual is represented by a horizontal lifeline. Related
persons are connected to this lifeline by vertical lines. Both events and related persons
are represented by nodes. For the purpose of differentiation, we use circles to represent
events and squares for related persons.
In order to provide extra information in an intuitive way, the year of an event is plotted
on the x-axis and the year of birth of a person on the y-axis. Thus, when a user reads an
individual's lifeline from left to right, the sequence of events and the time intervals
between them are shown visually. At the same time, the distance on the y-axis between
different individuals describes the age distribution of a family, e.g. age gaps between
different generations.
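The axis convention can be made concrete with a short sketch. It assumes events are given as (event type, year) pairs; the function names `event_position` and `lifeline_points` are hypothetical and are not taken from the thesis program.

```python
def event_position(event_year, person_birth_year):
    """Position of an event circle: x is the event year, y is the person's birth year."""
    return (event_year, person_birth_year)


def lifeline_points(birth_year, events):
    """Return (x, y, event_type) triples for one lifeline, in chronological order."""
    return [
        event_position(year, birth_year) + (event_type,)
        for event_type, year in sorted(events, key=lambda e: e[1])
    ]


# Lid1559 was born in 1867; the vital event years come from Figure 2.3.
points = lifeline_points(1867, [("death", 1896), ("birth", 1867), ("marriage", 1887)])
```

All points of one lifeline share the same y value, which is why the lifeline is horizontal, and sorting by year gives the left-to-right reading order described above.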
Based on the demographic data, the important events are birth, marriage, census and
death. Some other events can also be derived; for example, the birth of a baby is also
a 'birth of child' event for the parents. In the visualisation, different colours are
used to indicate different types of events (green: birth, gray: census, blue: marriage,
yellow: birth of child, red: death).
Besides describing an individual's lifeline, these events are important for
linking people with their related persons. A wedding (technically, a marriage registration)
creates a spousal relationship between the bride and the groom, and the birth of a baby
creates a parent-child relationship between the parents and the baby. In this project,
each square has the same colour as the circle of the event that creates the relationship.
The different types of circles and squares are illustrated in Figure 3.2.
Figure 3.2: Different types of circles and squares in this project
One problem here is that there might be more than one event in the same year; for
instance, a newly married couple may welcome their first child in the same year they
married. Under these circumstances, the circles that represent these events share the
same x-coordinate; in other words, these circles cover each other.
This problem can be solved by stacking the circles vertically rather than stringing them
strictly along the lifeline (shown on the left of Figure 3.3). However, in practice, this
method causes another issue: the stacked circles usually overlap the squares which represent
spouses (generally the age gap between bride and groom is not large, and the
y-coordinate of a lifeline or a square is the birth year of the corresponding person). The
final solution is to let multiple events share one circle if they happen in the same year, as
shown on the right of Figure 3.3. In practice, this approach performs well.
Figure 3.3: Different approaches to show three events (marriage, birth of child, and
census) which happened in the same year
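The shared-circle solution can be sketched as follows. The sketch groups events by year and splits one circle into one wedge per event. Equal-sized wedges and the function names are assumptions made for illustration, since the thesis does not state how the wedges are sized. The year 1891 in the usage line is likewise only an example value.

```python
from collections import defaultdict


def group_events_by_year(events):
    """Map each year to the list of event types that share one circle."""
    by_year = defaultdict(list)
    for event_type, year in events:
        by_year[year].append(event_type)
    return dict(by_year)


def wedge_angles(event_types):
    """Split a full circle into equal wedges: (event_type, start_deg, end_deg)."""
    step = 360.0 / len(event_types)
    return [(t, i * step, (i + 1) * step) for i, t in enumerate(event_types)]


# Three events in one (example) year share a single circle with three wedges.
shared = group_events_by_year(
    [("marriage", 1891), ("birth of child", 1891), ("census", 1891)]
)
wedges = wedge_angles(shared[1891])
```

A year with a single event yields one wedge spanning the full 360 degrees, i.e. an ordinary circle.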
In certain cases, squares can overlap each other, e.g. when two squares represent parents of
the same age (shown on the left of Figure 3.4). In such a situation, the squares are
moved upwards or downwards slightly to make them distinguishable. Specifically,
denote by [symbol missing] the original y-coordinate of the squares and by [symbols missing]
the new y-coordinates after the movement, then [equation missing] and [equation missing].
Asterisks are also added to inform users that these squares have been moved
marginally, as shown on the right of Figure 3.4.
Figure 3.4: A comparison of the effects before and after the movement
Figure 3.5 shows the visualisation of an example individual. In the
graph, it can be observed that Lid1559, who was born in 1867,
experienced three censuses during her life and married Lid88 when she was 20; her
husband was three years older than her. In the next 8 years, she gave birth to five
children: Lid617, Lid641, Lid667, Lid700 and Lid741. Unfortunately, she died very
young, at the age of 29. It can also be noticed that the age gap between her parents is
much larger than that between her and her husband.
Figure 3.5: Visualisation of an example individual's life and her related persons. Note:
the records used in this figure are presented in Figure 2.2 and Figure 2.3
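The figures quoted for Lid1559 can be re-derived from the sample records. The sketch below copies the relevant identifiers from Figure 2.2 and the years from Figure 2.3 into plain dicts; the variable names are illustrative, and because only years are available the ages are approximate to within a year.

```python
# Years copied from Figure 2.3, keyed by record ID.
BIRTH_YEAR = {140: 1864, 1673: 1867, 673: 1888, 798: 1895}
DEATH_YEAR = {684: 1896}
MARRIAGE_YEAR = {86: 1887}

# Identifiers copied from Figure 2.2 for Lid1559 and her husband Lid88.
wife = {"BirthID": 1673, "DeathID": 684, "marriageID1": 86}
husband = {"BirthID": 140, "marriageID1": 86}

wife_born = BIRTH_YEAR[wife["BirthID"]]
married = MARRIAGE_YEAR[wife["marriageID1"]]

age_at_marriage = married - wife_born
age_at_death = DEATH_YEAR[wife["DeathID"]] - wife_born
spouse_age_gap = wife_born - BIRTH_YEAR[husband["BirthID"]]
# Birth record 798 belongs to Lid741, the last of the five children.
years_from_marriage_to_last_child = BIRTH_YEAR[798] - married
```

These differences reproduce the values read off the figure: married at 20, died at 29, a husband three years older, and the last child born 8 years after the marriage.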
3.2 Multiple life segment visualisation
In most single life segment visualisations, there are large blank areas at the
lower left and upper right of the figure. The birth event, which creates the
relationship between an individual and his/her parents, is always the starting
(leftmost) point of that individual's lifeline, and parents must be older than their
children, so the squares representing parents typically cluster in the upper left corner.
Similarly, the squares representing children tend to cluster in the lower right corner. This
characteristic can be observed in Figure 3.5.
These empty areas can be used to provide more information in one figure. A multiple
life segment visualisation is obtained by expanding all squares denoting related persons
into lifelines. In this model, the given individual and related persons are all represented
by horizontal lifelines so that multiple individuals' life experiences can be viewed
simultaneously. In order to highlight the individual currently being focused on,
smaller circles and a thinner lifeline are used for all related persons.
Figure 3.6: Multiple life segment visualisation of an example individual. Note: the
records used in this figure are presented in Figure 2.2 and Figure 2.3
Figure 3.6 shows an example of a multiple life segment visualisation; the data used here
are the same as those of Figure 3.5. In this figure, the lifelines of the given individual
and her related persons are displayed together. It can be observed that her father (Lid51617)
died young, her mother (Lid17802) had a relatively long life, and one of her
children (Lid667) died in infancy.
In some cases, two individuals are of the same age, so their lifelines share the same
y-coordinate. The solution in this model is as follows. If one of them is the highlighted
individual, the smaller circles that denote events of related
persons are displayed on top of the larger circles (representing the central individual) so
that they are not hidden, as shown at the top of Figure 3.7. If both of them are
related persons, their lifelines are moved slightly in the vertical direction to make
sure they are distinguishable. Shifted lifelines are marked by an asterisk to indicate the
displacement, as shown at the bottom of Figure 3.7.
Figure 3.7: The solutions to the problems of two individuals sharing the same age
3.3 Data inconsistencies visualisation
The second part of this project was the visualisation of potential inconsistencies.
However, this depends on some information which is not directly included in the
provided linked datasets. Before data inconsistencies can be visualised, functionality to
detect them is needed.
In this project, data inconsistency descriptions are expressed in the form of rules. Here
are some examples:
a) An individual cannot have an event prior to their birth;
b) An individual cannot have his/her first child before the age of 8;
c) There should not be more than 3 marriages in one individual's life.
It is impossible to enumerate all data inconsistencies here, because the definition of
data inconsistencies can vary between projects with different research directions,
demographic datasets (output by different systems) or backgrounds. Also, as a data linkage
and analysis project proceeds, more data inconsistencies might be found and need to
be added to the set of current rules. Given these demands, the detection function should be
easily maintainable and extensible.
Here, the concept of constraints from the database field can serve as a reference. The
domain-specific rules are reduced to a few inbuilt domain-independent rules. For
instance, all examples mentioned above can be supported by the following two
domain-independent rules:
a) For one individual, the time lag between two types of events cannot be larger or
smaller than a given threshold;
b) For one individual, the frequency of a particular type of event cannot be higher or
lower than a given threshold.
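As an illustration of how the two domain-independent rules could cover the three examples, here is a minimal sketch. It assumes an individual's events are given as (event type, year) pairs; the function names, parameters and event type strings are hypothetical and do not reproduce the rule grammar of the thesis program.

```python
def time_lag_violations(events, first_type, second_type, min_lag=None, max_lag=None):
    """Return (first_year, second_year, lag) for every pair whose lag breaks a threshold."""
    firsts = [year for t, year in events if t == first_type]
    seconds = [year for t, year in events if t == second_type]
    violations = []
    for a in firsts:
        for b in seconds:
            lag = b - a  # lag in years from the first event to the second
            too_small = min_lag is not None and lag < min_lag
            too_large = max_lag is not None and lag > max_lag
            if too_small or too_large:
                violations.append((a, b, lag))
    return violations


def frequency_violation(events, event_type, min_count=None, max_count=None):
    """Return True if the number of events of this type breaks a threshold."""
    count = sum(1 for t, _ in events if t == event_type)
    too_few = min_count is not None and count < min_count
    too_many = max_count is not None and count > max_count
    return too_few or too_many


# Made-up events for one individual, used only to exercise the rules.
events = [("birth", 1867), ("birth of child", 1870), ("marriage", 1887)]
rule_b = time_lag_violations(events, "birth", "birth of child", min_lag=8)
rule_c = frequency_violation(events, "marriage", max_count=3)
```

Example a) is the time-lag rule with `min_lag=0` from birth to any other event type, example b) uses `min_lag=8`, and example c) is the frequency rule with `max_count=3`.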
This approach greatly increases the flexibility and reusability of the program. If a
domain-specific rule is supported by current domain-independent rules, it can be
added in a configuration file rather than having to modify the source code. A more
detailed description of domain-independent rules will be presented in Section 3.5.
Note that not all data inconsistencies happen inside one lifeline (intra-inconsistencies).
A case in point, expressed in rule form, is that the age gap between bride and groom should
not be larger than 20 years. If a data inconsistency happens between two individuals, it is
referred to as an inter-inconsistency. The detection of an intra-inconsistency indicates
that there may be errors in the linking of the individual's relevant records across
different datasets, while an inter-inconsistency suggests that the errors can be related to
either or both of the individuals involved. The visualisation of data inconsistencies should
clearly reflect this difference.
Another issue is that, just like exceptions in software, data
inconsistencies can be divided into multiple levels. In this project, two levels are
defined: error and warning. An error means there must be a mistake in the linked data,
either introduced during the data linkage process or present in the raw data, while a warning
indicates that something is unusual but not physically impossible, so it is uncertain
whether the data are wrong.
For example, if the three rules mentioned at the beginning of this section are applied,
data inconsistencies detected by a) and b) are errors, because these situations should
never happen, and those detected by c) are warnings as they are possible, although
very rare. Thus, these levels should be included when defining rules and when
visualising data inconsistencies.
When visualising the characteristics of demographic data, circles and squares are used to
represent events and related persons, and different colours indicate different event
or relationship types. These circles and squares are linked by horizontal lifelines and
vertical relationship lines. These lines can be used to reflect the data inconsistencies:
lifelines for intra-inconsistencies and relationship lines for inter-inconsistencies.
Normally, lines are black. However, if one or more data inconsistencies exist,
the corresponding line is coloured differently: red for an error and orange for a
warning. If both types of data inconsistencies are reflected on one line, the line is
coloured red.
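The colouring rule amounts to letting the most severe level win. A minimal sketch, assuming the inconsistency levels attached to a line are available as a list of strings; the function name and the string values are illustrative:

```python
def line_colour(levels):
    """Pick the colour of a lifeline or relationship line from its inconsistency levels."""
    if "error" in levels:
        return "red"  # an error overrides any warnings on the same line
    if "warning" in levels:
        return "orange"
    return "black"  # no inconsistencies detected on this line
```

For example, `line_colour(["warning", "error"])` gives `"red"`, matching the case where both types occur on one line.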
Figure 3.8 is an example of data inconsistency visualisation. In the figure, the lifeline
of Lid38052 is coloured orange, indicating that one or more warnings exist. Right clicking
on the coloured line will pop up a window which shows the detailed information
about the corresponding data inconsistencies, i.e. the rule which is violated. In this
case, the warning is: there are three marriages in Lid38052's life. The right-click
function will be described in Section 3.4.
Figure 3.8: An example of data inconsistencies visualisation
3.4 Supplementary functions
Some additional functions were added to make the figures more interactive and thus
facilitate users' research. They aim to provide extra information (directly or indirectly)
when a certain element has been chosen in the current figure. As Figure 3.8 shows,
there are dozens of elements in a figure and it is infeasible to choose one of them
based on keyboard input alone, so these additional functions mainly respond to
mouse actions, such as button presses.
Figure 3.9: The window shown when a circle is right clicked. Note: the records used in
this figure are presented in Figure 2.2 and Figure 2.3
Right clicking a circle (or a wedge, if multiple events share a circle) or a coloured line
pops up a window which shows some additional information about the circle (wedge) or
line. For a circle (wedge), the information is a detailed description of the
corresponding event, while for a line it is a list of the data inconsistencies which explain
the colour of the line. Figure 3.9 illustrates the effect of right clicking a circle.
Left clicking a circle in the lifeline which corresponds to a related person jumps to
another figure which focuses on the related person's lifeline and shows all his/her
related persons. This enables users to look into an individual's related persons and
conduct continuous research. In addition, users' browsing histories are recorded so that
users can use the arrow buttons at the top right corner to go back to previous views or
forward to following views. If the circle which represents the 1871 census in
Lid17802's life in Figure 3.9 is left clicked instead of right clicked, the program jumps to another
figure which selects Lid17802 as the first person, as shown in Figure 3.10.
Also, note that in the top right corner the "go back" button is no longer greyed out,
which means that the button can be clicked to go back to the previous view, which is
shown in Figure 3.9.
Figure 3.10: Resulting figure when left clicking Lid17802's lifeline in Figure 3.9
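The back and forward navigation described in this section can be sketched with two stacks. This is only an illustrative sketch and not the code of the program described in this thesis: the class name ViewHistory, its method names and the placeholder identifier "first_person" are assumptions made for the example.

```python
class ViewHistory:
    """Track visited views so "go back" and "go forward" buttons can work.

    Hypothetical sketch: a view is identified by the id of the
    individual currently in focus.
    """

    def __init__(self, first_view):
        self.current = first_view
        self._back = []      # views reachable with the "go back" button
        self._forward = []   # views reachable with the "go forward" button

    def visit(self, view):
        """Jump to a new view, e.g. after left clicking a related person."""
        self._back.append(self.current)
        self.current = view
        self._forward.clear()  # a new jump discards the forward history

    def can_go_back(self):
        """True if the "go back" button should be enabled."""
        return bool(self._back)

    def can_go_forward(self):
        """True if the "go forward" button should be enabled."""
        return bool(self._forward)

    def go_back(self):
        """Return to the previous view, if there is one."""
        if self._back:
            self._forward.append(self.current)
            self.current = self._back.pop()
        return self.current

    def go_forward(self):
        """Move to the following view, if there is one."""
        if self._forward:
            self._back.append(self.current)
            self.current = self._forward.pop()
        return self.current


# "first_person" is a placeholder for the individual shown first.
history = ViewHistory("first_person")
history.visit("Lid17802")              # left click on a related person's circle
back_enabled = history.can_go_back()   # the "go back" button is now enabled
previous_view = history.go_back()      # returns to the first view
```

Clearing the forward stack on every new jump mirrors the usual behaviour of browser-style navigation; whether the program does the same is not stated in the text.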
3.5 Flexibility and reusability issues
Efforts have been made to make sure the program is configurable and can be applied to
other demographic datasets. As Section 3.3 discussed, specific data inconsistencies are
reduced to domain-independent data inconsistencies so they are configurable. The
visualisation function and data inconsistency detection function share the same
data interface, so once demographic data are loaded into this structure, they can be
visualised without changing the source code.
If a domain-specific rule is supported by the current domain-independent rules, it can be
added to the program by configuring a profile without modifying the source code.
Below is the structure for each line in the configuration file:
in-built_rule: formula "shown_information" inconsistency_level
Here, in-built_rule is the abbreviated name of an in-built domain-independent rule.
Formula defines how to detect a certain type of data inconsistency with the chosen
domain-independent rule. It is similar to an arithmetic expression, and its grammar
varies according to the chosen domain-independent rule. Shown_information is the
information which appears in the popup window when right clicking the corresponding
coloured line in a visualised figure, and inconsistency_level is an integer which
represents the level of the detected data inconsistencies: 1 stands for error and 2 stands
for warning.
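As an illustration of this line structure, the following sketch splits one profile line into its four parts. It is an assumption about how such a line could be parsed, not the parser used in the program; the names LINE_PATTERN, LEVEL_NAMES and parse_rule_line are hypothetical.

```python
import re

# Hypothetical pattern for one profile line of the form:
#   in-built_rule: formula "shown_information" inconsistency_level
LINE_PATTERN = re.compile(
    r'^\s*(?P<rule>[^:]+?)\s*:\s*'   # rule name, up to the first colon
    r'(?P<formula>[^"]+?)\s*'        # formula, up to the opening quote
    r'"(?P<message>[^"]*)"\s*'       # quoted shown_information
    r'(?P<level>[12])\s*$'           # 1 = error, 2 = warning
)

LEVEL_NAMES = {1: "error", 2: "warning"}


def parse_rule_line(line):
    """Split one profile line into rule, formula, message and level."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError("malformed rule line: %r" % line)
    level = int(match.group("level"))
    return {
        "rule": match.group("rule"),
        "formula": match.group("formula"),
        "message": match.group("message"),
        "level": level,
        "level_name": LEVEL_NAMES[level],
    }


rule = parse_rule_line(
    'timelag(intra): birth - death > 0 "birth year after death year" 1'
)
# rule["rule"] is "timelag(intra)" and rule["level_name"] is "error"
```

The formula is kept as an unparsed string here because its grammar depends on the chosen rule.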
Currently, there are three in-built domain-independent rules for detecting data
inconsistencies. Below are their definitions, formula grammars and some examples:
a) Intra time lag inconsistency: for one individual, the largest (smallest) time lag
between two types of events is larger (smaller) than a given threshold. Here is the
grammar for this rule:
event_type1 – event_type2 comparison_operator threshold
Event_type can be birth, marriage, census, birthofchild, or death. – stands for
minus. Comparison_operator can be >, >=, < or <=. Threshold is the given
threshold. Here is an example which adds four lines in the profile to detect if there
is any event of an individual that happens before his/her birth:
timelag(intra): birth - marriage > 0 "birth year after marriage year" 1
timelag(intra): birth - census > 0 "birth year after census year" 1
timelag(intra): birth - birthofchild > 0 "birth year after his/her 1st child's birth year" 1
timelag(intra): birth - death > 0 "birth year after death year" 1
b) Inter time lag inconsistency: similar to intra time lag inconsistency, but it applies
to two individuals on each side of a particular type of relationship (parent-child or
marriage). Here is the grammar for this rule:
role1.event_type1 – role2.event_type2 comparison_operator threshold
One of role1 and role2 must be main, which represents the individual currently
being focused on, and the other can be a type of related person: parent, spouse or
child. The domains of the other elements are the same as those of intra time lag
inconsistency. Here is an example which adds two lines in the profile to detect if
the age gap between an individual and his/her spouse(s) is larger than 20 years:
timelag(inter): main.birth - spouse.birth >= 20 "the age gap between bride and groom is larger than 20" 2
timelag(inter): spouse.birth - main.birth >= 20 "the age gap between bride and groom is larger than 20" 2
c) Intra frequency inconsistency: for one individual, the highest (lowest) frequency
of a certain type of event is higher (lower) than a given threshold. Here is the
grammar for this rule:
event_type comparison_operator number in interval year(s)
The domains of event_type and comparison_operator are the same as those of intra
time lag inconsistency. However, the (frequency) threshold is defined by
number/interval. In addition, the "in interval year(s)" at the end of the grammar can be
omitted if the interval is the individual's whole life. Here are two examples which
respectively detect if an individual has more than 3 children born in one year and if
he/she married more than 3 times:
frequency(intra): birthofchild >= 3 in 1 year "give birth to more than 3 children in 1 year" 2
frequency(intra): marriage >= 3 "more than 3 marriages in his/her life" 2
Note that in the second example (line) the "in interval year(s)" has been omitted.
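To make the semantics of the two intra rules concrete, the sketch below evaluates them on a list of events. It is a simplified illustration under stated assumptions and not the program's implementation: the function names are hypothetical, the event years are made up, the frequency check only handles > and >=, and a window of n years is assumed to cover the calendar years y to y + n - 1 starting at each event year y. The inter time lag rule is not covered.

```python
import operator

OPERATORS = {">": operator.gt, ">=": operator.ge,
             "<": operator.lt, "<=": operator.le}


def years_of(events, event_type):
    """Return the years of all events of one type for an individual."""
    return [e["event_year"] for e in events if e["type"] == event_type]


def intra_time_lag(events, type1, type2, op, threshold):
    """True if some (type1, type2) pair satisfies year1 - year2 op threshold.

    Checking every pair is equivalent to testing the largest lag for
    > and >=, and the smallest lag for < and <=.
    """
    compare = OPERATORS[op]
    return any(compare(y1 - y2, threshold)
               for y1 in years_of(events, type1)
               for y2 in years_of(events, type2))


def intra_frequency(events, event_type, op, number, interval=None):
    """True if the highest event count in any window satisfies op number.

    Only > and >= are handled in this sketch. interval=None means the
    individual's whole life.
    """
    if op not in (">", ">="):
        raise ValueError("this sketch only supports > and >=")
    years = years_of(events, event_type)
    if interval is None:
        highest = len(years)
    else:
        highest = max((sum(1 for y in years if start <= y < start + interval)
                       for start in years), default=0)
    return OPERATORS[op](highest, number)


# Hypothetical individual with made-up years, for illustration only.
events = [
    {"type": "birth", "event_year": 1850},
    {"type": "marriage", "event_year": 1848},
    {"type": "marriage", "event_year": 1875},
    {"type": "marriage", "event_year": 1880},
    {"type": "death", "event_year": 1901},
]

# timelag(intra): birth - marriage > 0   (1850 - 1848 = 2)
born_after_marriage = intra_time_lag(events, "birth", "marriage", ">", 0)
# timelag(intra): birth - death > 0      (1850 - 1901 = -51)
born_after_death = intra_time_lag(events, "birth", "death", ">", 0)
# frequency(intra): marriage >= 3        (three marriages in total)
three_marriages = intra_frequency(events, "marriage", ">=", 3)
# frequency(intra): marriage >= 2 in 10 years   (1875 and 1880)
two_in_ten_years = intra_frequency(events, "marriage", ">=", 2, interval=10)
```

With these made-up events, the first, third and fourth checks report an inconsistency and the second does not.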
Besides the configurability of data inconsistency detection rules, a data interface is
also defined to make sure the data inconsistency detection function and the
visualisation function can be reused. The general idea is that an individual can be
depicted by a list of events, which can be used to construct his/her lifeline, and a list of
related persons, which describes relationships. Each event and related person has several
attributes.
Based on the analysis above, the data interface for an individual (denoted by A)
includes a list of event objects sorted chronologically and a list of related person
objects. Here, list refers to a data structure. Every event object or related person object
is represented by an associative array, where each key-value pair describes the
name and value of an attribute. Figure 3.11 and Figure 3.12 show the attributes,
together with some comments, of an event object and a related person object
respectively.
Attribute name   Comment
id               the identity of the event, used to distinguish one event from another
type             the type of the event, i.e. birth, marriage, census, birthofchild or death
event_year       the occurrence year of the event
birth_year       the birth year of A
description      a detailed description of the event, which is shown when the corresponding circle (wedge) is right clicked
Figure 3.11: Attributes of an event object
Attribute name   Comment
id               the identity of the related person, used to distinguish one related person from another
type             the type of the related person, i.e. parent, spouse or child
event_year       the occurrence year of the event which creates the relationship between the related person and A
birth_year       the birth year of the related person
description      a detailed description of the related person, which is shown when the square representing the related person is right clicked
Figure 3.12: Attributes of a related person object
When another linked demographic dataset is given, if all its individuals' data are
provided in a form conforming to the data interface, data inconsistency detection and
visualisation can be applied to the dataset in the program without changes to the
source code.
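The data interface can be illustrated with plain Python dictionaries and lists. The following is a minimal sketch, assuming that an individual is held in a dictionary with the keys "events" and "related_persons"; this container, the helper names and all sample values are hypothetical, and only the attribute names come from Figures 3.11 and 3.12.

```python
EVENT_TYPES = {"birth", "marriage", "census", "birthofchild", "death"}
RELATION_TYPES = {"parent", "spouse", "child"}
REQUIRED_KEYS = ("id", "type", "event_year", "birth_year", "description")


def check_record(record, allowed_types):
    """Check that a record has all required attributes and a valid type."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError("missing attributes: %s" % ", ".join(missing))
    if record["type"] not in allowed_types:
        raise ValueError("unknown type: %r" % record["type"])
    return record


def build_individual(events, related_persons):
    """Return the data for one individual, with events sorted by year."""
    checked_events = [check_record(e, EVENT_TYPES) for e in events]
    checked_related = [check_record(r, RELATION_TYPES)
                       for r in related_persons]
    return {
        "events": sorted(checked_events, key=lambda e: e["event_year"]),
        "related_persons": checked_related,
    }


# Made-up sample data for a hypothetical individual A born in 1850.
sample_events = [
    {"id": "e2", "type": "marriage", "event_year": 1875,
     "birth_year": 1850, "description": "marriage record (sample)"},
    {"id": "e1", "type": "birth", "event_year": 1850,
     "birth_year": 1850, "description": "birth record (sample)"},
    {"id": "e3", "type": "death", "event_year": 1901,
     "birth_year": 1850, "description": "death record (sample)"},
]
sample_related = [
    {"id": "p1", "type": "spouse", "event_year": 1875,
     "birth_year": 1852, "description": "spouse of A (sample)"},
]

individual = build_individual(sample_events, sample_related)
event_order = [e["type"] for e in individual["events"]]
```

Validating each record when it is loaded is a design choice of this sketch: it lets a dataset in a different source format fail early rather than during detection or drawing.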
4. Evaluation
In general, the proportion of data visualisation papers having evaluation is much
lower than that of papers from other domains. Most papers published at four main
visualisation conferences (EuroVis, InfoVis, IVS and VAST) do not have any
evaluation (Elmqvist & Yi, 2015). In addition, unlike in other domains,
evaluation techniques for data visualisation tend to be qualitative rather than
quantitative (Redpath & Srinivasan, 2003), which means that evaluation techniques
that work for other domains are not applicable to this project.
In this project, qualitative evaluation is applied. When choosing evaluation methods,
Komlodi, Sears and Stanziola's classification of information visualisation evaluation
practices (Plaisant, 2004) is used for reference. It defines four main patterns for evaluating
data visualisation techniques:
a) Controlled experiments comparing design elements: compare the effect of certain
elements in the same visualisation technique.
b) Usability evaluation of a technique: get feedback from users.
c) Controlled experiments comparing two or more techniques: compare two or more
visualisation techniques in the same scenario.
d) Case studies of techniques in realistic settings: evaluate a technique in a natural
environment doing real tasks.
Among them, c) is impractical in this project as there is hardly any other research about
visualising linked data, and d) is too time consuming to be adopted. As a result, both a)
and b) are applied as each of them has its own advantage.
During the development of the visualisations, the spiral model (Boehm, 1988) was used. At
the end of each cycle, the models were assessed by the project supervisors. The criteria
included the accuracy and integrity of the reflected information, as well as the
accessibility and unambiguity of the figures. If a certain design element performed
poorly, an alternative element was designed and compared with the original one,
and the original was replaced if the alternative led to a better quality figure. One case in
point is that multiple events share a circle rather than being stacked vertically,
as illustrated in Figure 3.3. Some other improvements made after
self-evaluations are listed below:
- The Y axis was inverted to make the figures more intuitive.
- An individual's information is always shown on the left of the lifeline.
- A dashed vertical line is used to make the lifelines and relationship lines more
distinguishable.
- A window containing detailed information pops up when right clicking rather than
hovering over a circle or a coloured line.
- In multiple life segment visualisation, small circles (representing related persons'
events) are always in front of large circles (representing the events of the
individual in the spotlight) to make sure they are not hidden, as described in
Section 3.2.
- If two related persons share the same age, their squares or lifelines are moved
upwards or downwards slightly to make sure they are distinguishable, as described
in Sections 3.1 and 3.2.
After the visualisations were completed, two videos demonstrating the models were
made by a supervisor and sent to data linkage researchers in Cambridge and Scotland
for evaluation. Feedback on the videos (see Appendix B; the feedback is anonymous
for privacy reasons) shows that most of the reviewers consider these visualisations a
"clean and tidy" way to illustrate "messy" linked records, and that showing data
inconsistencies via coloured lines is novel and useful. Some of them mentioned that
the usefulness of the models cannot be confirmed without practical application. In
addition, some researchers suggested it would be useful to visualise the data at a higher
level and check more individuals simultaneously. This is the goal of a future project.
Other positive opinions in the feedback include:
- It can be applied to other domains.
- The right click function makes it possible to track a wrong link back to its origin.
- It can be "a good base" for further extensions.
Some critical opinions in the feedback are listed below:
- In multiple life segment visualisation, it is not clear if a vertical relationship line
just passes by a lifeline or connects with the lifeline.
- It is not clear if having small circles always in front of large circles will help make
figures more intuitive.
In addition, many other suggestions are provided in the feedback. These suggestions
were all recorded and could be used as references for future work:
- Indicate gender with colour coding.
- Evaluate the visualisations by comparing them with other diagram styles.
- Visualise a family together with the causes of all deaths in this family.
- Enable users to select and organise required data in a more arbitrary way, for
example, with a query system.
- Enable users to choose how many generations are shown simultaneously in one
figure.
5. Conclusion
This thesis has described a novel approach for the visualisation of complex linked
databases, showing the characteristics of linked data as well as data inconsistencies in a
concise and clear way. Though this technique is not universal enough to be applied to
all kinds of linked datasets, it still has reference value for other projects. In addition,
this method provides original and standardised models for the visualisation of
demographic data, so other similar datasets can be easily visualised using these models
if they are loaded into the pre-defined data interface.
In order to keep visualisations concise and clear as well as make sure all important
information is included, a multi-layer display is used. The structure and major attributes
of the given demographic data are shown in a graphical interface, and detailed
information is provided via supplementary functions when users interact with the
visualisations.
Current results have shown that the characteristics of data and data inconsistencies can
be visualised together in one figure to help identify potentially wrong links, without
confusing a user. Feedback from the data linkage researchers in Cambridge and
Scotland also indicates that they consider these visualisations "clean and tidy" when
illustrating intricate linked records, and that the visualisation of data inconsistencies is an
effective way to highlight areas that researchers should pay more attention to.
However, with only low-level visualisations, this approach is not ideal for identifying
potentially wrong or inconsistent links, as it does not support the roll-up and drill-down
operations which are common in visual data analysis. In the future, work will
concentrate on visualising the given data at higher levels, such as family tree
visualisation and communal visualisation, as well as connecting the visualisations at
different levels with each other.
Acknowledgement
This project was conducted under the guidance of Assoc. Prof. Peter Christen and Mr.
Jeffrey Fisher. I would like to express my sincere gratitude to them for their
continuous support and infinite patience. They spent a lot of time helping me and
provided plenty of suggestions, ranging from the design of the visualisations and the
implementation of the Python program to the structure and contents of this thesis and
the preparation of the presentations.
Thanks also to Assoc. Prof. Weifa Liang for his help while I was choosing the project
and writing this thesis.
References
1. Abecasis, G. R., & Cookson, W. (2000). GOLD—graphical overview of linkage
disequilibrium. Bioinformatics, pp. 182-183.
2. Andreev, K. (2000). Sex differentials in survival in the Canadian population,
1921-1997: a descriptive analysis with focus on age-specific structure.
Demographic Res, 3: article 12.
3. Boehm, B. (1988). A spiral model of software development and enhancement.
Computer, pp. 61-72.
4. Christen, P., & Goiser, K. (2007). Quality and complexity measures for data
linkage and deduplication. In Quality Measures in Data Mining (pp. 127-151).
Springer.
5. Elmqvist, N., & Yi, J. S. (2015, 7). Patterns for visualization evaluation.
Information Visualization, pp. 250-269.
6. Keim, D. (2002). Information visualization and visual data mining. Visualization
and Computer Graphics, IEEE Transactions on, pp. 1-8.
7. Plaisant, C. (2004). The Challenge of Information Visualization Evaluation. In
Proceedings of the working conference on Advanced visual interfaces (pp.
109-116). ACM.
8. Redpath, R., & Srinivasan, B. (2003). Criteria for a Comparative Study of
Visualization. In Intelligent Systems Design and Applications (pp. 609-620).
Springer Berlin Heidelberg.
9. Reid, A., Davies, R., & Garrett, E. (2002). Nineteenth-Century Scottish
Demography From Linked Censuses and Civil Registers: A 'Sets of Related
Individuals' Approach. History and Computing, pp. 61-86.
10. Wang, F., Ibarra, J., Adnan, M., Longley, P., & Maciejewski, R. (2014). What's in a
Name? Data Linkage, Demography and Visual Analytics. In EUROGRAPHICS
(pp. 7-11).
11. Ward, M. O., Grinstein, G., & Keim, D. (2010). Interactive data visualization:
foundations, techniques, and applications. CRC Press.
12. Wigginton, J. E., & Abecasis, G. R. (2005). PEDSTATS: descriptive statistics,
graphics and quality assessment for gene mapping data. Bioinformatics, pp.
3445-3447.
Appendix A
INDEPENDENT STUDY CONTRACT Note: Enrolment is subject to approval by the projects co-ordinator
SECTION A (Students and Supervisors)
UniID: u5455264
SURNAME: Han FIRST NAMES: Quanwei
PROJECT SUPERVISOR (may be external): Assoc Prof Peter Christen, Mr Jeffrey Fisher
COURSE SUPERVISOR (a RSCS academic): ________________________________________________
COURSE CODE, TITLE AND UNIT: COMP8715 Computing Project 12units
SEMESTER S1 S2 YEAR: 2015
PROJECT TITLE:
Visualising complex linked data
LEARNING OBJECTIVES:
Become familiar with complex linked data sets; learn about graph visualisation techniques; and become
proficient in the Python programming language to develop prototypes for processing and analysing large data
sets.
PROJECT DESCRIPTION:
Social science researchers, government agencies and businesses increasingly rely on the linking of larger and
complex databases. Such linking allows enrichment of data, helps improve data quality, and enables data
mining not feasible on a single database. One crucial aspect so far neglected by many data linkage projects is
the visualisation of complex linked databases. Visualisation is important when identifying potentially wrong or
inconsistent links, and helps users better understand the characteristics of linked data.
The first outcome of this project is a set of Python prototype programs that generate different visualisations of
the Isle of Skye historical data, as well as other data sets in a similar format. The programs need to be flexible
and configurable in order to work with other data sets. The second outcome is a report describing the
visualisation techniques implemented in the Python prototype programs, how to use these programs, and how
they have been tested and evaluated.
The aims of this project are to research and develop novel network- and graph-based visualisation techniques
for complex linked data. Specifically, using a data set of linked historical birth, death, marriage and census data
from the Isle of Skye in Scotland, the objective is to develop and evaluate a set of visualisation techniques
implemented as Python-based prototypes.
ASSESSMENT (as per course’s project rules web page, with the differences noted below):
Assessed project components                                                                            % of mark   Due date   Evaluated by
Report: name style: _____________________________ (e.g. research report, software description...)     60%
Artefact: name kind: ____________________________ (e.g. software, user interface, robot...)           30%                    Peter Christen and Jeff Fisher
Presentation:                                                                                          10%
MEETING DATES (IF KNOWN): Weekly time and day to be arranged.
STUDENT DECLARATION: I agree to fulfil the above defined contract:
______________________________ ______________________
Signature Date
SECTION B (Supervisor):
I am willing to supervise and support this project. I have checked the student's academic record and
believe this student can complete the project.
______________________________ ______________________
Signature Date
REQUIRED DEPARTMENT RESOURCES:
SECTION C (Course coordinator approval)
______________________________ ______________________
Signature Date
SECTION D (Projects coordinator approval)
______________________________ ______________________
Signature Date
Appendix B
Feedback on videos demonstrating the visualisations of life segments in this project.
It's hard to be confident about usefulness without having actually used it for a specific
purpose - but I would be optimistic - it looks very nice.
The only specific thing that occurs to me is to wonder whether gender could also be
indicated with colour coding?
Is it significant that the dotted vertical lines on the right side of the screen in video 1
don't connect with the male?
I assume it was a deliberate design decision to limit the quantity of data being
displayed to 3 generations. For evaluation, it might be interesting to contrast this with
other common genealogy diagram styles e.g. tree, fan etc.
It would be interesting, if perhaps less useful, to be able to zoom out to see a large
number of people simultaneously, perhaps with the ability to highlight sets of people
e.g. ancestors of a given person.
Related to that, the orange coding for errors looks very useful when inspecting
particular parts of the graph. It might also be useful to have a way of indicating errors
in a zoomed-out view, so that you could find them without having to browse around.
In the future obviously it would be worth considering whether this sort of tool could
be extended to indicate linkage, uncertainties etc. As you may remember we have an
HCI group here who might well be interested in helping with the visualization side of
things.
I have managed to download videolan and watch the two videos, many thanks. It
certainly got me wanting to be able to click on the circles to see what happened.
Well done to the student for managing to make what can be quite 'messy' data look so
'clean and tidy'. I thought the way that it linked marriage partners, their parents and
their children together was excellent.
The colour coded dots on the time line were easy to follow, and I thought that the
colour coded 'warning' lines were a very good idea, and would highlight for
researchers where extra 'digging' might be needed.
I am off to find the gentleman with 19 children in the Skye records....
One thing I wondered for an 'extra' would be the ability to show a family with the
causes of all deaths displayed shown at once - might give some interesting research
leads.
I've taken a look at the videos and its looking nice and clean as a visualisation tool.
One thing I wasn't able to work out from the videos was in relation to the colouring of
lines to visualise warnings and errors. Are these only shown when the person in
question is 'in focus'? If not I feel these things may be easy to miss if viewing the
population at a larger scale (i.e. more zoomed out) where there maybe many more
people on a single screen.
I'm not sure if having peoples time lines laid atop of one another as in video 2 makes
thing easier to intuitively understand - in the complex example that is given it doesn't
seem possible to work out whose death event is whose without clicking into the
individual (I guess to whom the death belongs could be inferred by genders of married
individuals - but this may be an assumption that isn't possible to make in more
modern populations).
I can see it being useful across a number of domains. From a generating synthetic
populations viewpoint it would be very useful to be able to graphically review
subsections of a population to check it for sanity (here having higher zoom levels
would be useful too).
From a linkage viewpoint again it could be useful to be able to identify places where
automated linkages have gone awry - here it being possible to display, on the right
click, the underlying provenance of linkage and the values associated with the linkage
information may be useful.
It's looking like a nice tool, that potentially could be a good base for further modules
to be added onto.
I'd be interested to see how useful the tool would be for reviewing the synthetic
populations I'm currently generating, if you'd be able to send over a copy of the code
it would be appreciated.
So I think the visualisation looks good - ie clean and clear, so the question is therefore
how it can help with various tasks/ uses
As an exploratory tool to look across records it is clearly very helpful ie rather than
looking across multiple tables etc. and then perhaps running queries to calculate birth
intervals etc.
useful extension to this use would be:
- ways of organising, grouping and searching the life segments ie 'show all segments
for a part of Skye', 'show a particular pedigree starting at person X', 'show all families
with >9 births', show all segments with an average birth interval <x
- How many records could be shown on average on a page? Could there be a
change in form with progression 'up' to more records ie - could there be a visual
summary (typology) at 'higher levels'. This would then allow a user to drill down into
a particular set of records that might be of interest to them (eg short lived).
This would then help a user find the segments they are interested in