pathline: a tool for comparative functional genomics. miriah meyer

10
Eurographics/ IEEE-VGTC Symposium on Visualization 2010 G. Melançon, T. Munzner, and D. Weiskopf (Guest Editors) Volume 29 (2010), Number 3 Pathline: A Tool For Comparative Functional Genomics M. Meyer 1,2 , B. Wong 2 , M. Styczynski 3 , T. Munzner 4 , and H. Pfister 1 1 Harvard University, USA 2 Broad Institute, USA 3 Georgia Institute of Technology, USA 4 University of British Columbia, Canada Abstract Biologists pioneering the new field of comparative functional genomics attempt to infer the mechanisms of gene regulation by looking for similarities and differences of gene activity over time across multiple species. They use three kinds of data: functional data such as gene activity measurements, pathway data that represent a series of reactions within a cellular process, and phylogenetic relationship data that describe the relatedness of species. No existing visualization tool can visually encode the biologically interesting relationships between multiple path- ways, multiple genes, and multiple species. We tackle the challenge of visualizing all aspects of this comparative functional genomics dataset with a new interactive tool called Pathline. In addition to the overall characterization of the problem and design of Pathline, our contributions include two new visual encoding techniques. One is a new method for linearizing metabolic pathways that provides appropriate topological information and supports the comparison of quantitative data along the pathway. The second is the curvemap view, a depiction of time series data for comparison of gene activity and metabolite levels across multiple species. Pathline was developed in close collaboration with a team of genomic scientists. We validate our approach with case studies of the biologists’ use of Pathline and report on how they use the tool to confirm existing findings and to discover new scientific insights. Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Picture/Image Generation—Line and curve generation 1. Introduction Biologists conduct comparative functional genomics stud- ies in order to infer the evolution of gene regulation, and to understand how this evolution is linked to changes in gene activity. These studies focus on the functional outputs of genomes, looking at similarities and differences in gene activity over time and across species. Subsets of the genes work together in pathways, which represent specific chain reactions that occur in the cell. For example, many function- ally related genes are involved in a cell signaling pathway that carries stimulus from the outside of the cell to the in- side. Researchers use these pathways to focus the scope of their scientific inquiry and to gain a greater understanding of biology. We work with a team of researchers who study a collec- tion of pathways that make up metabolism in yeast. These pathways consist of specialized gene products that process chemical compounds called metabolites. The researchers need to analyze the levels of gene activity and metabo- lites belonging to multiple pathways over time and across multiple species. Their visualization needs were not met by currently available tools. Many tools focus on show- ing the topological structure of pathways, and can show only one experimental data point for each gene and metabo- lite [SMO * 03, AHP * 05, LYKB08]. Recently developed tools can show an entire set of values, for example a full time series, for each gene and metabolite [BMGK08, KPR02, JKS06, MSH * 05, BW09]. Our collaborators, however, need to look at multiple sets of values simultaneously — across time, species, genes and metabolites — in order to compare trends between species. In contrast, many visualization tools allow the compari- son of multiple sets of functional experimental data points but do not attempt to show pathways. The heatmap vi- sual encoding is popular with biologists, showing a matrix where a single gene activity value is encoded by color in c 2010 The Author(s) Journal compilation c 2010 The Eurographics Association and Blackwell Publishing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

Upload: others

Post on 11-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Eurographics/ IEEE-VGTC Symposium on Visualization 2010G. Melançon, T. Munzner, and D. Weiskopf(Guest Editors)

Volume 29 (2010), Number 3

Pathline: A Tool For Comparative Functional Genomics

M. Meyer1,2, B. Wong2, M. Styczynski3, T. Munzner4, and H. Pfister1

1Harvard University, USA2Broad Institute, USA

3Georgia Institute of Technology, USA4University of British Columbia, Canada

AbstractBiologists pioneering the new field of comparative functional genomics attempt to infer the mechanisms of generegulation by looking for similarities and differences of gene activity over time across multiple species. They usethree kinds of data: functional data such as gene activity measurements, pathway data that represent a series ofreactions within a cellular process, and phylogenetic relationship data that describe the relatedness of species.No existing visualization tool can visually encode the biologically interesting relationships between multiple path-ways, multiple genes, and multiple species. We tackle the challenge of visualizing all aspects of this comparativefunctional genomics dataset with a new interactive tool called Pathline. In addition to the overall characterizationof the problem and design of Pathline, our contributions include two new visual encoding techniques. One is anew method for linearizing metabolic pathways that provides appropriate topological information and supportsthe comparison of quantitative data along the pathway. The second is the curvemap view, a depiction of time seriesdata for comparison of gene activity and metabolite levels across multiple species. Pathline was developed in closecollaboration with a team of genomic scientists. We validate our approach with case studies of the biologists’ useof Pathline and report on how they use the tool to confirm existing findings and to discover new scientific insights.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Picture/ImageGeneration—Line and curve generation

1. Introduction

Biologists conduct comparative functional genomics stud-ies in order to infer the evolution of gene regulation, andto understand how this evolution is linked to changes ingene activity. These studies focus on the functional outputsof genomes, looking at similarities and differences in geneactivity over time and across species. Subsets of the geneswork together in pathways, which represent specific chainreactions that occur in the cell. For example, many function-ally related genes are involved in a cell signaling pathwaythat carries stimulus from the outside of the cell to the in-side. Researchers use these pathways to focus the scope oftheir scientific inquiry and to gain a greater understanding ofbiology.

We work with a team of researchers who study a collec-tion of pathways that make up metabolism in yeast. Thesepathways consist of specialized gene products that processchemical compounds called metabolites. The researchers

need to analyze the levels of gene activity and metabo-lites belonging to multiple pathways over time and acrossmultiple species. Their visualization needs were not metby currently available tools. Many tools focus on show-ing the topological structure of pathways, and can showonly one experimental data point for each gene and metabo-lite [SMO∗03,AHP∗05,LYKB08]. Recently developed toolscan show an entire set of values, for example a full timeseries, for each gene and metabolite [BMGK08, KPR02,JKS06, MSH∗05, BW09]. Our collaborators, however, needto look at multiple sets of values simultaneously — acrosstime, species, genes and metabolites — in order to comparetrends between species.

In contrast, many visualization tools allow the compari-son of multiple sets of functional experimental data pointsbut do not attempt to show pathways. The heatmap vi-sual encoding is popular with biologists, showing a matrixwhere a single gene activity value is encoded by color in

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and350 Main Street, Malden, MA 02148, USA.

M. Meyer et al. / Pathline

each cell, usually with rows representing different genes andcolumns representing species, time points, or treatment con-ditions [ESBB98, Sal04, SS02]. The ordering of rows andcolumns is often determined by, and shown with, a hierar-chical structure such as a phylogenetic tree or a hierarchicalclustering of the data. These tools, however, show only a sin-gle value for each cell rather than an entire time series. More-over, the lack of a pathway representation limits the abilityof researchers to extract meaningful insight about systemicbiological questions from gene activity data [SND05].

To fill this gap, we present Pathline, an interactive visual-ization tool that shows time series data for both gene activityand metabolite levels over multiple pathways and multiplespecies. We present a characterization of the data our biologycollaborators are studying, as well as a translation of their bi-ology questions into data-centric tasks. In supporting thesetasks, we make two novel visual encoding contributions. Thefirst is a method for linearizing metabolic pathways for a vi-sually concise overview that provides appropriate topolog-ical information and supports the comparison of quantita-tive data along pathways. The second is the curvemap detailview, depicting time series data with small multiples of filledline charts and overlaid curves that support comparison ofgene activity and metabolite levels across multiple species.We validate our approach with case studies from our biol-ogy collaborators who used Pathline to both confirm existingknowledge and discover new scientific insights.

2. Biological Background and DataResearchers at the Broad Institute study 14 species of yeastthat span over 300 million years of evolution. They studygenes involved with metabolism. Metabolism is a complexnetwork of chemical reactions essential in all living organ-isms that allows the organism to grow and reproduce, main-tain cellular structures, and respond to environmental condi-tions. The metabolic network is remarkably similar acrossspecies in terms of the reactions that it comprises. How-ever each organism uses, and controls the use of, the reac-tions slightly differently. For example, the same gene may beturned on earlier in one species, versus later in another. Or,some species may have developed a different control mecha-nism for a particular gene, or have evolved a novel gene alto-gether to accomplish a certain metabolic task. These changesare hallmarks of the evolutionary process.

In metabolism, chemical compounds called metabolitesare catalyzed from one form to another by the actions of en-zymes, which are a type of gene product. Enzymes involvedin metabolism can work in one of three directions: forward,moving a metabolite ahead a step; reverse, moving a metabo-lite back a step; or bidirectional, capable of catalyzing bothforward and reverse reactions. The metabolic network is sub-divided into pathways, which are a small set of related reac-tions that may contain cycles and branches. The products ofone pathway may be the starting material of another.

The specific comparative functional genomics study of

our biology collaborators depends on the following fourmain categories of data.Gene activity and metabolite levels are measured for ap-proximately 6,000 genes and 140 metabolites for each of the14 species of yeast. Each gene and metabolite is measuredat six physiologically relevant time points. While the num-ber of species to study may grow in the future, the num-ber of genes, metabolites, and time points is fixed due to thenature of the biological processes being studied. Each mea-surement of gene activity or metabolite level has three asso-ciated attributes: the time point, the name of the species, andthe name of the gene or metabolite.Metabolic pathway information in the form of a di-rected graph is taken from the publicly available BioCycdatabase [CFF∗08]. Graph nodes are metabolites, and theedges represent small sets of genes, the products of whichcatalyze the reactions. The number of reactions in anygiven pathway studied by our collaborators is small, usuallyaround a dozen. Researchers typically choose only a hand-ful of pathways to look at simultaneously, filtering the fullgene and metabolite dataset down from several thousand toseveral dozen.Similarity scores, which are high-level aggregate scores oftime series similarity, are computed for each gene or metabo-lite across a set of multiple time series. The set can eitherbe all species, or any subset of species that interests the re-searcher. Currently our collaborators use the standard Pear-son and Spearman correlation functions as direct similaritymetrics and compute an aggregate metric of average pair-wise similarity within a set.Phylogenetic relationships showing the ancestral relation-ships between yeast species are also important data used inthe analysis process. These relationships are represented bya tree where the leaf nodes are the living species and the in-ternal nodes indicate speciation events, meaning a commonancestor that eventually gave rise to two distinct species.

3. TasksThe level of gene activity, analogous to enzyme levels, andthe level of the metabolites, change over time. Finding dif-ferences between species in the patterns of the these changesis an important part of comparative functional genomics. Atthe high level, understanding these differences will allow bi-ologists to extrapolate the functioning of extant species (i.e.,those species that exist today) to that of their ancestors. Theycan then infer the evolution of specific cellular processes andof regulatory mechanisms in the genome.

More specifically, our collaborators would like to identifywhen different yeast species regulate metabolic processesthe same way, either across all 14 species or by finding sub-sets of species that behave similarly. They look for trendsand try to determine similarity between time series across aset of species at specific genes or metabolites along one orseveral pathways. The biologists need to carry out tasks atthree levels:

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

• Detailed comparison of a limited number of time series.• Aggregate comparison of the similarity score of genes and

metabolites across a pathway.• Multiple similarity score comparison.

At the detailed task level, the analysis involves inspect-ing the full time series of a small subset of carefully chosengenes and metabolites. The specific tasks at this level are:• Look for trends in a set of time series for a gene or

metabolite across species.• Look for trends in a set of time series within a species for

several genes and metabolites.• Compare time series to find:

– valleys that exist in some but not others.– time series that are a time shift of another, such as early

peaks versus late peaks.– the most similarly shaped time series to one of interest.– which time series in a group are the same.– how many classes of time series exist in a set.

For the aggregate task level, researchers look at a singlenumber for each gene and metabolite that represents theirsimilarity across all of the species, or a subset of the species.This similarity score is an aggregation of the underlying setof time series for each gene and metabolite.

Researchers compare multiple possible aggregations atthe third task level, looking for the differences between bio-logically interesting subsets of species and the combinationof all of them. For example, they often look at three num-bers for each gene and metabolite: an aggregation across allspecies, across one subset, and across another subset. Theyare interested in discovering when the members of the twosubsets are similar themselves, but the entire set is not. Theyare also interested in discovering when only members ofone subset are similar. At this level they also want to com-pare the results of different similarity metrics. They suspectthat purely statistical methods such as the Spearman or Pear-son correlations currently used to compute similarity scores,which take a shape-to-shape matching approach, may not ex-pose meaningful biological relationships between time se-ries. They would like to compare them with other more bio-logically inspired metrics in the near future.

Validating and understanding the results at the aggregateand similarity score levels requires frequent cross-checkingwith a detailed visual comparison at selected points.

4. PathlinePathline, shown in Figure 1, is an interactive prototype toolthat supports the visual analysis of all four kinds of dataand all three kinds of task levels discussed above. The toolwas designed in a user-centered process with iterative refine-ment. Our design decisions were motivated by the specificneeds of our genomics collaborators.

Using Pathline begins with loading a set of pathways andtheir associated gene and metabolite data. This choice im-plicitly filters the full gene and metabolite data set of 84,000

possibilities (from 6000 genes across 14 species) down to amuch smaller set of a few hundred, as is the case with allviewers where a set of pathways is chosen by the user. Theinterface has two linked components:Curvemap: This detail view shows a small-multiple matrixof filled line charts of time series data. Overlay columns onthe right and the bottom show an overlay of all the curvesfor each row and column, respectively. The rows show the14 species of yeast, ordered according to the phylogenetictree shown to the left of the curvemap. The columns showthe genes and metabolites chosen by the user in the orderthat they were selected.Linearized pathways: This overview is a vertical stripshowing the chosen pathways as grey segments, placed endto end. Each pathway segment contains the aggregate sim-ilarity scores for the genes and metabolites in the pathway,encoded with horizontal spatial position. The metabolites areencoded as lines, and the genes are encoded as points thatare colored according to directionality. Up to three similar-ity scores can be viewed simultaneously. The pathways havebeen linearized to create an ordered list of genes and metabo-lites. Selecting a gene or metabolite adds a column showingall of its underlying data in the curvemap detail view.

We now describe each of these novel visual encodings inmore detail, including justifications for our design decisions.

5. Curvemap Detail ViewEach column in the curvemap detail view correspond toa single aggregate similarity score shown in the pathwaysoverview. More specifically, each column corresponds to aselected gene or metabolite and shows the measured valuesas time series curves for all six time points. Each row showsthe data for each of the species. The number of genes andmetabolites that can be shown is limited by the screen reso-lution, and is typically between 4 and 15 columns.

The name curvemap alludes to its semblance to heatmap.Both have a matrix representing species and genes usingrows and columns. A heatmap encodes a single value withcolor in each matrix cell. A curvemap shows a full timeseries curve in each matrix cell, encoding multiple valueswith spatial position. The phylogenetic tree to the left ofthe curvemap is analogous to the widely-used combinationof trees with heatmaps: it shows the hierarchical structureof ancestral relationships that leads to the ordering of thespecies rows.

There are two obvious ways to extend a matrix view fromshowing one value per cell to showing multiple data points ineach cell. In the language of Keim and Kriegel [KK94], whocast the problem as high-dimensional data visualization, thechoice is to use multiple simple matrices side by side, witheach matrix showing all the information about a single di-mension; or use a single matrix with a more complex glyphin each cell, showing the values for the additional dimen-sions contiguously at each point. They call this latter ap-proach grouping. In the cartographically-inspired language

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

!444

!888

!111666!111777!111888!111999!222000!222111!222222!222333

mmm111

mmm111888

PPP111

mmm111888PPP222

mmm222

mmm333333

PPP333

mmm111333

mmm333000

PPP444

ccc

ddd

mmm777

mmm111000

aaa

bbb

mmm222

mmm111000

BBB111

!333000

mmm222888

000...000000 111...000000

000...000000 111...000000

!EEYY !eeennneeesssfffooorrrwwwaaarrrddd rrreeevvveeerrrssseee bbbiiidddiiirrreeeccctttiiiooonnnaaalll

MMMeeetttaaabbbooollliiittteeesss MMMeeetttrrriiicccsssPPPeeeaaarrrsssooonnnAAALLLLLL

gg44

---555...000

222...888

gg88

---666...444

333...999

gg1166

---444...666

333...333

gg1177

---444...888

777...555

gg1188

---555...444

555...000

gg1199

---555...222

555...000

gg2200

---666...999

444...333

gg2211

---666...333

333...888

gg2222

---555...000

888...111

gg2233

---666...222

888...111

---666...999

777...000

---444...999

777...555

---111...777

222...444

---222...333

333...999

---444...111

444...666

---000...333

555...000

---666...444

111...999

---555...888

222...555

---222...888

333...000

---444...666

333...444

---555...555

555...000

---333...777

111...777

---333...111

111...222

---111...999

888...111

ss11

ss22

ss33

ss44

ss55

ss66

ss77

ss88

ss99

ss1100

ss1111

ss1122

ss1133

ss1144

PPAATTHHLLIINNEE AA    TTOOOOLL    FFOORR    CCOOMMPPAARRAATTIIVVEE    FFUUNNCCTTIIOONNAALL    GGEENNOOMMIICCSSPPAATTHHWWAAYY MMEETTRRIICC    OOVVEERRVVIIEEWW SSPPEECCIIEESS CCUURRVVEEMMAAPP OOVVEERRLLAAYYSS

Figure 1: Pathline, an interactive tool for the visualization of comparative functional genomics data. The left side shows thelinearized pathways, whereas the right side shows the curvemap. Here, four different pathways are shown. This image, as wellas all other images in this paper, can be found online at http://www.pathline.org/.

of Slingsby et al. [SDW09], the question is which attributeto condition on at the deepest level of the multivariate at-tribute hierarchy: time point, species, or gene/metabolite.

The curvemap design is an example of grouping: timepoint conditions the deepest level of the hierarchy becausethe basic biological unit for comparison used by our collab-orators is the time series of gene activity and metabolite lev-els. Before using Pathline, the biologists tried a heatmap-oriented version of grouping using TreeView [ESBB98,Sal04], which supports a nested mini-heatmap within eachcell of the larger enclosing heatmap. Many of their tasks,however, were difficult to carry out using this representation.

The nature of the tasks carried out in the detailed anal-ysis by the comparative functional genomics researchersdrove the design decision to use curves rather than col-ored blocks. As discussed above, people can make more ac-curate absolute perceptual judgements for spatial positionthan color [CM84]. Moreover, people can make judgementsabout curve shapes that are far more subtle than those aboutcolor changes [LMK07]. The very language used by our col-laborators to describe their detailed analysis tasks in terms ofpeaks and valleys reflects this sort of spatial thinking.

The curvemap uses filled line charts, also known as areaplots, for the time series curves in the main matrix. We shadethe area under the curve in dark grey against a background

of lighter grey in order to make the shape of the curve moreperceptually salient than with a line alone. (We note that thearea under the curve is not directly meaningful with respectto the data.) We also surround the plots with a bounding boxto create the clear perception of positive and negative space.Each time series curve is individually normalized such thatthe minimum value meets the bottom of the plot and themaximum value reaches the top. This normalization supportsthe comparison of the shape of the curves and the trends inthe time series.

In addition to the main matrix, the curvemap view in-cludes overlay multiples where all the curves for each col-umn and row are superimposed in a single shared frame.These plots support the detection of trends in each columnand row of curves. We normalize them on a per-column andper-row basis to provide an absolute scale that is important tounderstanding the differences across genes, metabolites, orspecies. The gene and metabolite measurements are unitlessfold-change values that show whether the levels at one timepoint went up or down as compared to a baseline measure-ment. The biologists chose the second time point for theirbaseline because the yeast gene activity data are most ro-bust to experimental error at this point. All of the curves thuscross at the second time point in these overlay multiples.

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

6. Linearized Pathways OverviewThe linearized pathways overview is a high information den-sity display showing multiple quantitative similarity scoresfor each gene and metabolite over multiple pathways. Thedesign of this view focuses on supporting comparison ofquantitative values along the pathways, a notable differencefrom previous systems that focus on the task of understand-ing pathway topology.

6.1. LinearizationThrough discussions with our biology collaborators, wecame to understand that their analysis calls for a schematicview that emphasizes a linear ordering of pathway elements,with topological information available at a secondary, ratherthan a primary, level. As discussed in Section 3, at the ag-gregate analysis levels the researchers want to understandwhen gene activity along specific pathways are similar be-tween species. This type of inquiry requires only a coarseunderstanding of the pathway topology.

Many previous systems, however, include a detailedvisual representation of a pathway’s topological struc-ture as a node-link graph, requiring a significant amountof screen real estate. Thus many systems, such asGeneShelf [KLK∗09], show only a single pathway at a time,making comparison between multiple pathways and multi-ple metrics difficult. This restriction requires the user to nav-igate between multiple screens, each showing a single path-way, and to rely on memory to perform comparative tasks. Incontrast, we designed the pathway overview to be a visuallyconcise display supporting inline comparisons, which hasbeen shown to be more perceptually effective than relyingon the user’s memory of what has been seen before [PW06].

As such, a fundamental design decision for the pathwaysoverview is the transformation of each pathway from a di-rected graph to an ordered list of genes and metabolites. Thislinearization allows for a shared axis along which quantita-tive information is visually encoded using spatial position,a more perceptually accurate encoding than color [Mac86,CM84, LMK07].

Figure 2 shows the the logic behind the linearization pro-cess for a pathway that contains both a branch and a cycle.As we discuss in Section 2, the nodes are metabolites andthe edges represent small sets of genes. Figure 2(a) shows anode-link representation of the pathway. In Figure 2(b) thecycle is unrolled and the branch is disconnected. The branchis then reinserted just above its reconnection point in Figure2(c). This process is carried out recursively as branches canbe nested.

Pathline requires a manually created ordered list of genesand metabolites as depicted in 2(c) as input, and producesthe visual representation shown in 2(d). Our collaboratorscreated linear pathways based on graphs from the BioCycdatabase using the set of rules described above. It is up tothe users how to handle situations such as multiple metabo-

Branch

Cycle

Gene(s)

Metabolite

Branchpoint

(a) (b) (c) (d)

1

2

1

2

Figure 2: Linearizing a pathway. (a) The node-link repre-sentation of the directed graph includes both a branch andcycle. (b) Loops are unrolled and branches are disconnected.(c) Branches are reinserted just above their reconnectionpoints. (d) The pathway is represented as a grey segment,with genes encoded spatially with points and metabolites aslines. Short breaks in the pathway segment indicate branchpoints, along with stylized marks to the left of the blocks. Cy-cle start points are also shown to the left with another mark.

lites at the starting point, or overlapping cycles, so that theordered list best reflects their mental models of the data.

In Figure 2(d), the entire pathway is rendered as a greysegment, with the individual genes and metabolites shownusing different types of marks: points for genes and linesfor metabolites. The directionality of the enzyme encodedby each gene is shown using color: blue for reverse, greenfor forward, and orange for bidirectional. The short breaksin the grey segment indicate branch points with additionalmarks to the left of the grey blocks indicating the topologicalstructure of the branch. At a branch point, the two possiblepaths are shown with stylized branch marks and letter labels,along with vertical bars marking the extent of each branchalternative. A circular icon is shown to the left of the geneor metabolite at a cycle start point and a similar vertical barshows the extent of the cycle.

In the linearized pathway overview in Figure 1, the extentof an entire pathway is indicated with a vertical line on thefar left labelled with the pathway name. Some pathway seg-ments, such as major branches, also have names, which areshown in the same way. The names of the genes and metabo-lites at the start of segments, at branch points, and at the startand stop points of a cycle are shown on the left. These namescreate visual landmarks that reflect the way the biologiststypically refer to segments that do not have canonical names.Pathways that are not connected by a direct reaction are vi-sually separated with longer white breaks between them.

The example in Figure 1 has four pathways labelled ver-tically on the left: P1, P2, P3, and P4. The P1 pathwaybranches at metabolite m2 into the segment marked a, andan alternative branch marked b that is also labelled with its

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

name, B1. This branch contains another nested branch start-ing from metabolite m7, where the alternatives are labelledc and d. The P1 pathway runs directly into the P2 path-way, which includes a cycle. The cycle in P2 starts at g30and ends at m28. The P3 and P4 pathways have no directlinkages to their preceding pathways.

6.2. Similarity Score DisplayThe horizontal extents of the grey pathway segments encodethe quantitative data of the aggregate similarity score foreach gene and metabolite, as shown in Figures 1 and 2(d).Up to three metrics can be displayed at once, visually en-coded as an ordered bar-chart for metabolites, and as differ-ent shapes for genes. The three shapes used for the genes area hollow circle, a plus sign, and a smaller filled-in rectangle.We hand-tuned the colors and shapes to have roughly equalvisual salience.

Figure 5(a) shows the encoding of three different met-rics. The PearsonALL aggregates over the entire set ofspecies, whereas PearsonSubgroup1 and Pearson-Subgroup2 are metrics computed for two biologically in-teresting subsets of species. When more than one metricis shown, they are linked with a low-saturation bar thatstretches between the two points. This was scientifically mo-tivated as the researchers want to directly compare two sub-sets of species and use a third metric as a reference. For thisreason we explicitly show the difference between two met-rics.

We designed the visual encoding with mark type, markshape, and color to create visual layers, allowing for selec-tive attention when just one attribute type is interesting, orperceived together as a whole when the full context is re-quired. Our collaborators need to see both the genes andmetabolites along the pathways, but may need to focus onone or the other when looking for certain trends. A simi-lar situation holds for all genes versus those of a particulardirectionality, or when comparing different metrics for ag-gregating similarity scores.

7. Interaction and ImplementationPathline is built using multiple views linked through explicitclicks and lightweight mouseover interaction, following thegeneral tradition of many previous information visualizationsystems [MMP09, SS02].

In the linearized pathways overview, mousing over thepathway elements shows the metadata of their name and thenumerical value for the metric. Clicking on the point repre-senting the aggregate similarity score for an element selectsthe underlying data for further investigation as a full col-umn of time series curves in the curvemap detail view. Inthe overview where the click occurred, that element is high-lighted with a small red bar drawn to the right of the segment,along with the name of the element. In Figure 1, 10 elementsare selected across multiple pathways.

In the curvemap detail view, mousing over the name of a

species in the phylogenetic tree highlights the curve associ-ated with it in the overlay plots at the bottom of the mainmatrix, and mousing over a subtree highlights all of its asso-ciated curves. Similarly, mousing over a label at the top of acurvemap column highlights the associated gene or metabo-lite curve in the overlay plots to the right of the main matrix.This latter interaction also highlights the selected gene ormetabolite name in the pathways overview.

The number of viewable genes and metabolites that canbe added to the curvemap view and the size of the plots isdetermined by the window size. A vertical scrollbar appearsin the pathway overview if the window is too short to renderall of the pathways at once.

Pathline was implemented in the Processing lan-guage [RFM07]. Executables, source code, and exampledata are available at http://www.pathline.org.

8. Previous WorkWe divide the most related previous work into general timeseries, networks and pathways, heatmaps, and genomics-specific time series visualization.

8.1. General Time Series VisualizationSeveral previous information visualization systems tacklethe problem of visualizing time series data. Time-Searcher [HS01] offers good support for exploring a fewlong series, and for finding patterns within them. In con-trast, the problem we address is to inspect a large collec-tion of very short series. LiveRAC [MMKN08] is designedfor a similar dataset, but the semantic zooming and guar-anteed visibility techniques that support large-scale systemmanagement tasks are not appropriate for the comparativefunctional genomics analysis tasks presented in Section 3.

8.2. Network and Pathway VisualizationCytoscape [SMO∗03] is a leading example of a system de-signed to show the detailed topological structure of largebiological networks made up of many pathways intercon-nected in complex ways. In contrast, Pathline addresses anal-ysis tasks where only a small number of pathways are underconsideration at one time, and their topological structure isa secondary, rather than primary, concern.

Many pathway visualization tools use color encoding toshow a single experimental data value at each node, includ-ing Cytoscape [SMO∗03], MicrobesOnline [AHP∗05], andiPath [LYKB08]. Others show an entire set of experimentaldata values at each node. Cerebral [BMGK08] uses linkedsmall multiples, with a separate node-link graph shown foreach time point whose nodes are colored by the value at thattime. Pathway Tools [KPR02] encodes a single data valuewith color at each pathway node, and uses animation to showthe entire set of values. Several tools use a glyph to encodea set of values at each node, including VANTED [JKS06],PathwayExplorer [MSH∗05], and GENeVis II [BW09].

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

All of these tools focus on graph layout, overlaying ex-perimental data on a node-link graph representation wherenodes are distributed in space to emphasize the connectiv-ity relationships of nodes and edges. This representation isless useful for the tasks of our collaborators, where the goalis to accurately compare values of, and look for trends in,the experimental data. Thus, in Pathline we instead treat theexperimental data as primary structure that drives spatial po-sitioning, and relegate the topological structure to secondarystatus. Moreover, these tools are also limited to showing onlya single set of data points at each pathway node, while ourcollaborators want to analyze multiple sets of data points.

8.3. HeatmapsA different set of visualization tools focus on the taskof comparing multiple sets of experimental data points;for biological data the most common visual encoding is aheatmap [WF09]. Heatmap visualizations are often coupledwith clustering algorithms, as in TreeView [ESBB98,Sal04],the Hierarchical Clustering Explorer [SS02], and Genom-ica [LS10]. As discussed in Section 5, heatmaps showonly a single value at each cell, and extending them to en-code multiple values is nontrivial for perceptual reasons.Also, they do not explicitly show pathway information. Pastwork [SND05] has shown that the ability of scientists toextract biologically meaningful insights from gene activitydata is severely hampered by the lack of this kind of contex-tual information.

8.4. Genomics Pathway and Time Series VisualizationGeneShelf [KLK∗09], designed to be a lightweight web toolfor exploring large public gene expression databases, han-dles data most similar to that supported by Pathline. The userselects a single pathway, which the system shows side byside with the time series data for the genes contained withinthe pathway. The time series data are shown using a smallmultiples matrix where each axis is an experimental condi-tion, and each grid cell has a parallel coordinates view of thegene set over the time points. Any time point in a parallelcoordinates view can be expanded to show a bar chart of theexpression level of all the genes. Extending this approach tothe time series curves that we show would disrupt the shapeperception required for many of the detailed analysis tasks ofour collaborators. As we discuss in Section 6.1, seeing onlya single pathway at once is a limitation of GeneShelf that weexplicitly address in Pathline, as is the problem of encodingaggregate quantitative values directly along the pathway.

9. Case StudiesWe collaborated with a team of seven biologists who havebeen collecting and analyzing comparative functional ge-nomics data for several years — one of the authors on thispaper is a member of the team. We conducted weekly meet-ings over the course of three months with members of theteam to learn about their scientific questions, analysis needs,and available visualization tools. The team was previouslyusing conventional heatmaps generated using Java TreeView

[Sal04] to analyze gene activity and metabolite levels. Theybegan using Pathline through a series of interactive proto-types that were gradually refined to the current design overthe course of two months. The current version of Pathlineis now their main visualization tool for analyzing this data.The biologists verified that Pathline can show known infor-mation more clearly than could be seen with their previoustools, and they directly attribute new insights into their datato the use of Pathline. We present their experiences with thetool as preliminary evidence towards the validity of our coredesign choices.

We have anonymized the species, gene, metabolite, andpathway names in these case studies by request from our col-laborators, because they represent as-yet unpublished dataand findings. Their insights derived from using Pathline havespurred them to undertake further analyses, and they antici-pate publishable results.

9.1. Missing DataOne member of the research team noticed that the occur-rence of missing data in many of the time series coincidedwith high values in the curve, as shown in Figure 3(a). Sub-sequent analysis of their data processing pipeline revealeda previously unknown bias for discarding high values. Theythen modified their pipeline; the reprocessed data shown inFigure 3(b) contain fewer missing data points for the sameset of genes. Although the team was aware of the existenceof missing values from the heatmap visualizations, they hadnot noticed that they occurred most often near high values.

!"#

!$%!$"!$$!$&

'"

'"(

)"

'"()$

'$

'&&

)&

'"&

'&%

)*

+

,

'-

'"%

.

/

'$

'"%

0"

!&%

'$(

%1%% "1%%

%1%% "1%%

!"# !"#"$2345.4, 4676486 /9,946+:93;.<

%"&'()*+&"$ %"&,+-$)6.483;=>>

$%&

?*1#

@&1&

$'(

?#1A

@*1&

$'%

?#1&

@&1(

$''

?B1%

@(1"

$')

?#1$

@(1"

?#1A

@B1"

?*1A

@&1(

?"1#

@"1(

?$1&

@$1#

?$1A

@*1#

?%1&

@$1"

?*1#

@"1#

?B1$

@$1B

?$1(

@&1%

?*1#

@&1*

?B1B

@*1*

?&1-

@"1-

?&1"

@%1(

?"1&

@(1"

*%

*'

*)

*+

*,

*&

*-

*.

*/

*%(

*%%

*%'

*%)

*%+

0123456" 172884798:7;8<01:125="79>6;2586147?"68<5;@0123A1# <"2:5;78=":=5"A @0";5"@ ;>:="<10 8=":41#@

!"#

!$%!$"!$$!$&

'"

'"(

)"

'"()$

'$

'&&

)&

'"&

'&%

)*

+

,

'-

'"%

.

/

'$

'"%

0"

!&%

'$(

%1%% "1%%

%1%% "1%%

!"# !"#"$2345.4, 4676486 /9,946+:93;.<

%"&'()*+&"$ %"&,+-$)6.483;=>>

$%&

?*1#

@&1&

$'(

?#1A

@*1&

$'%

?#1&

@&1(

$''

?B1%

@(1"

$')

?#1$

@(1"

?#1A

@B1"

?*1A

@&1(

?"1#

@"1(

?$1&

@$1#

?$1A

@*1#

?%1&

@$1"

?*1#

@"1#

?B1$

@$1B

?$1(

@&1%

?*1#

@&1*

?B1B

@*1*

?&1-

@"1-

?&1"

@%1(

?"1&

@(1"

*%

*'

*)

*+

*,

*&

*-

*.

*/

*%(

*%%

*%'

*%)

*%+

0123456" 172884798:7;8<01:125="79>6;2586147?"68<5;@0123A1# <"2:5;78=":=5"A @0";5"@ ;>:="<10 8=":41#@

!"#$ "%&$ %"'$ ()*$ ()*+,"(-#(, -./0('1"

234+

,567

,389

!"#

!$%!$"!$$!$&

'"

'"(

)"

'"()$

'$

'&&

)&

'"&

'&%

)*

+

,

'-

'"%

.

/

'$

'"%

0"

!&%

'$(

%1%% "1%%

%1%% "1%%

!"# !"#"$2345.4, 4676486 /9,946+:93;.<

%"&'()*+&"$ %"&,+-$)6.483;=>>

$%&

?*1#

@&1&

$'(

?#1A

@*1&

$'%

?#1&

@&1(

$''

?B1%

@(1"

$')

?#1$

@(1"

?#1A

@B1"

?*1A

@&1(

?"1#

@"1(

?$1&

@$1#

?$1A

@*1#

?%1&

@$1"

?*1#

@"1#

?B1$

@$1B

?$1(

@&1%

?*1#

@&1*

?B1B

@*1*

?&1-

@"1-

?&1"

@%1(

?"1&

@(1"

*%

*'

*)

*+

*,

*&

*-

*.

*/

*%(

*%%

*%'

*%)

*%+

0123456" 172884798:7;8<01:125="79>6;2586147?"68<5;@0123A1# <"2:5;78=":=5"A @0";5"@ ;>:="<10 8=":41#@

!"#

!$%!$"!$$!$&

'"

'"(

)"

'"()$

'$

'&&

)&

'"&

'&%

)*

+

,

'-

'"%

.

/

'$

'"%

0"

!&%

'$(

%1%% "1%%

%1%% "1%%

!"# !"#"$2345.4, 4676486 /9,946+:93;.<

%"&'()*+&"$ %"&,+-$)6.483;=>>

$%&

?*1#

@&1&

$'(

?#1A

@*1&

$'%

?#1&

@&1(

$''

?B1%

@(1"

$')

?#1$

@(1"

?#1A

@B1"

?*1A

@&1(

?"1#

@"1(

?$1&

@$1#

?$1A

@*1#

?%1&

@$1"

?*1#

@"1#

?B1$

@$1B

?$1(

@&1%

?*1#

@&1*

?B1B

@*1*

?&1-

@"1-

?&1"

@%1(

?"1&

@(1"

*%

*'

*)

*+

*,

*&

*-

*.

*/

*%(

*%%

*%'

*%)

*%+

0123456" 172884798:7;8<01:125="79>6;2586147?"68<5;@0123A1# <"2:5;78=":=5"A @0";5"@ ;>:="<10 8=":41#@

!"#$ "%&$ %"'$ ()*$ ()*+,"(-#(, -./0('1"

234+

,567

,389

(a) (b)

Figure 3: (a) Our collaborators noticed a correlation be-tween high values and missing data in the curvemap de-tail view that had not been obvious when inspecting theheatmaps, and after investigation found a problem in thedata processing pipeline. (b) After fixing the problem manymissing data points were recovered.

9.2. Whole Genome DuplicationA whole genome duplication event occurred in yeast some150 million years ago. An ancestral yeast species gained anextra copy of all its genes, and thus living descendants ofthat common ancestor often have multiple copies of genesas a result of that duplication. Two such genes are g1 andg2, which are shown in Figure 4. Scientists know that thegenetic controls responsible for the activity levels of thesegenes have evolved to behave differently. A telltale sign ofthis phenomenon is the shift in the peaks of the time seriescurves for the g2 gene activity patterns compared to those

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

for g1 in the post-duplication species s1 to s5, shown inthe first five rows of Figure 4(a). In the eighth row is one ofthe pre-duplication species, s8, where only one of the genesexists and the data have been duplicated for consistency, re-sulting in identical curves in both columns.

!111!222

mmm111

mmm111888

PPP111

mmm111888PPP222

mmm222

mmm333333

PPP333

mmm111333

mmm333000

PPP444

ccc

ddd

mmm777

mmm111000

aaa

bbb

mmm222

mmm111000

BBB111

!333000

mmm222888

000...000000 111...000000

000...000000 111...000000

!EEYY !eeennneeesssfffooorrrwwwaaarrrddd rrreeevvveeerrrssseee bbbiiidddiiirrreeeccctttiiiooonnnaaalll

MMMeeetttaaabbbooollliiittteeesss MMMeeetttrrriiicccsssPPPeeeaaarrrsssooonnnAAALLLLLL

!222 /// !eeennneee222 000...333888

gg11

---555...222

888...333

gg22

---777...111

444...777

---777...111

777...777

---555...000

888...333

---333...888

555...444

---555...222

333...000

---444...444

444...666

000...000

444...222

---333...444

222...333

---111...555

333...999

---222...111

444...777

---444...666

333...777

---333...666

333...999

---222...666

000...555

---111...111

000...000

---111...000

111...333

ss11

ss22

ss33

ss44

ss55

ss66

ss77

ss88

ss99

ss1100

ss1111

ss1122

ss1133

ss1144

PPAATTHHLLIINNEE AA    TTOOOOLL    FFOORR    CCOOMMPPAARRAATTIIVVEE    FFUUNNCCTTIIOONNAALL    GGEENNOOMMIICCSSPPAATTHHWWAAYY MMEETTRRIICC    OOVVEERRVVIIEEWW SSPPEECCIIEESS CCUURRVVEEMMAAPP OOVVEERRLLAAYYSS

!"#$ !"#%

!"!#

$"

$"%

&"

$"%&#

$#

$''

&'

$"'

$'(

&)

*

+

$,

$"(

-

.

$#

$"(

/"

!'(

$#%

(0(( "0((

(0(( "0((

!"# !"#"$1234-3+ 3565375 .8+835*982:-;

%"&'()*+&"$ %"&,+-$&5-372:<==

!#>?>!5:5#>>(0'%

$%

@A0#

>%0'

$&

@,0"

>)0,

@,0"

>,0,

@A0(

>%0'

@'0%

>A0)

@A0#

>'0(

@)0)

>)0B

>(0(

>)0#

@'0)

>#0'

@"0A

>'0C

@#0"

>)0,

@)0B

>'0,

@'0B

>'0C

@#0B

>(0A

@"0"

>(0(

@"0(

>"0'

'%

'&

'(

')

'*

'+

',

'-

'.

'%/

'%%

'%&

'%(

'%)

0123456" 172884798:7;8<01:125="79>6;2586147?"68<5;@0123A1# <"2:5;78=":=5"A @0";5"@ ;>:="<10 8=":41#@

(a) (b)

Figure 4: Whole genome duplication event. (a) The knownpost-duplication shift in activity patterns in the first five rowsbetween the g1 and g2 genes is immediately obvious inPathline, where the curves clearly have mirror symmetry. (b)The mirror symmetry is much less apparent in a conventionalheatmap view showing the same data.

The evolution of g2 in the post-duplication species wasknown by our collaborators, and was one of the first thingsthey looked for using an early version of Pathline. Accordingto the biologists, the expected shift is much more apparentin the curvemap encoding than in the conventional heatmapview, which is shown in Figure 4(b). The group remarkedthat these types of inquires generally take about 30 minutesto uncover in a heatmap, and require on the order of 5 min-utes or less to see in Pathline — in this example, the trendwas immediately obvious to them.

The biologists attribute this efficiency gain to several as-pects of Pathline. One central aspect is the ability to iden-tify similarities and differences in shapes in the curvemapviews while still being able to understand the absolute mag-nitudes of the changes in the curve overlay plots. A heatmaptypically relies on a fixed saturation value of the data corre-sponding to the brightest colors in the image; changing thesaturation value significantly affects the viewer’s perceptionof the trends in the data. Detecting the equivalent of similarcurve shapes requires a detailed analysis of the heatmap atmultiple saturation values — the identification of similari-ties across these multiple versions can be difficult. Combin-ing the curvemap views with the curve overlays in the samewindow streamlines this process. Another central aspect ofPathline is the pathway-centered approach for customizing

the curvemap view, which allows for the direct comparisonof two genes or metabolites that would likely be displayedfar from each other in a standard clustered heatmap, obscur-ing their key similarities and differences.

After using Pathline to see this previously known find-ing, the team continued their analysis and found somethingnew. They noticed that although the s6 species in the sixthrow is a post-duplication species, it exhibits behavior closerto the pre-duplication species in the bottom rows than itspost-duplication relatives in the first five rows. Rather thandiverging to display early versus late activation, these twodifferent genes still display the same behavior, providing apossible hint as to where in the phylogeny the regulation ofg2 changed to the dominant post-duplication behavior.

9.3. Pathway-Level AnalysisOur collaborators were also able to quickly identify someknown pathway trends in their dataset. Shown in Figure 5(a),the metabolites in the P1 pathway and the P2 cycle show ageneral decline in similarity for elements later in the path-way with one notable outlier, the m19 metabolite. Identify-ing this trend and its outlier previously required a significantamount of effort; our collaborators stated that finding thissame information using Pathline is straightforward and ob-vious using the linearized pathway view.

Our collaborators then probed deeper, and again analysiswith Pathline led them to new insights. The metabolite m18is acted upon by several enzymes to form m19. Comparingthe m18 similarity scores for two important subgroups ofspecies it was clear that m18 is not only poorly conservedacross all of the species, but also within the two subgroups.This result is in contrast to the highly conserved nature ofm19 across virtually the entire set of the species. By creat-ing a curvemap only with m18 and m19, our collaboratorsimmediately recognized some previously unknown behav-iors. These behaviors have provided our collaborators withan interesting set of problems and hypotheses made possi-ble by the link between the pathway view and the curvemapview in Pathline.

9.4. Gene-Level RelationshipsAnother interesting set of new insights involves the g5, g6,and g7 genes. g5 and g6 are both forward enzymes thatwork together to catalyze a single reaction in the P1 path-way, while g7 is a reverse enzyme that catalyzes the samereaction in the opposite direction — these three genes are se-lected in Figure 5(a). In the curvemap view, shown in Figure5(b), it is evident that in almost every species g5 and g6 dis-play nearly the same behavior. This fact would also be easyto detect in a heatmap because the genes would cluster to-gether, making the rows adjacent — the proximity of the twogenes to each other in the heatmap would be an indicationthat in each species the time series are nearly identical. Thenew insight, however, is that the temporal behavior of thesetwo genes changes over the course of evolution, i.e., across

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

!555!666!777

mmm111888

mmm111999

mmm111

mmm111888

PPP111

mmm111888PPP222

mmm222

mmm333333

PPP333

mmm111333

mmm333000

PPP444

ccc

ddd

mmm777

mmm111000

aaa

bbb

mmm222

mmm111000

BBB111

!333000

mmm222888

---000...555000 111...000000

---000...555000 111...000000

!EEYY !eeennneeesssfffooorrrwwwaaarrrddd rrreeevvveeerrrssseee bbbiiidddiiirrreeeccctttiiiooonnnaaalll

MMMeeetttaaabbbooollliiittteeesss MMMeeetttrrriiicccsssPPPeeeaaarrrsssooonnnSSSuuubbb!rrrooouuuppp111PPPeeeaaarrrsssooonnnSSSuuubbb!rrrooouuuppp222PPPeeeaaarrrsssooonnnAAALLLLLL

gg55

---333...222

222...666

gg66

---444...333

444...333

gg77

---222...333

999...555

mm1188

---222...777

222...333

mm1199

---333...000

222...666

---222...555

777...555

---222...777

999...333

---333...222

222...222

---111...999

999...555

---222...444

333...000

---222...111

222...444

---222...888

777...888

---333...222

666...000

---222...000

888...555

---333...111

666...666

---444...333

555...555

---222...888

111...888

---111...999

000...222

---111...777

999...111

ss11

ss22

ss33

ss44

ss55

ss66

ss77

ss88

ss99

ss1100

ss1111

ss1122

ss1133

ss1144

PPAATTHHLLIINNEE AA    TTOOOOLL    FFOORR    CCOOMMPPAARRAATTIIVVEE    FFUUNNCCTTIIOONNAALL    GGEENNOOMMIICCSSPPAATTHHWWAAYY MMEETTRRIICC    OOVVEERRVVIIEEWW SSPPEECCIIEESS CCUURRVVEEMMAAPP OOVVEERRLLAAYYSS

!"!#!$

%&

%&'

(&

%&'()

%)

%**

(*

%&*

%*+

(,

-

.

%$

%&+

/

0

%)

%&+

1&

!*+

%)'

2+3"+ &3++

2+3"+ &3++

!"# !"#"$4567/6. 68986:8 0;.;68-<;5=/>

%"&'()*+&"$ %"&,+-$(8/6:5=?@0!65@A&(8/6:5=?@0!65@A)(8/6:5=BCC

!$DED!8=8$DD+3')

$%

2*3)

D)3#

$&

2,3*

D,3*

$'

2)3*

DF3"

2)3"

D$3"

2&3*

DF3*

2*3)

D)3)

2&3*

DF3"

2)3,

D*3+

2+3,

D)3,

2)3'

D$3'

2*3)

D#3+

2&3,

D'3"

2*3&

D#3#

2,3*

D"3"

2)3'

D&3'

2&3*

D+3+

2&3&

DF3&

()

(*

(+

(,

(%

(&

('

(-

(.

()/

())

()*

()+

(),

0123456" 172884798:7;8<01:125="79>6;2586147?"68<5;@0123A1# <"2:5;78=":=5"A @0";5"@ ;>:="<10 8=":41#@

(a) (b)

Figure 5: (a) The metabolites along the P1 and P2 path-ways show a general decline except for the outlier m19. (b)Using this curvemap view of the genes g5, g6, and g7 thebiologists confirmed known trends and discovered unknowngene duplication in the s7 species.

the species. Some species display an early peak, some a latepeak, and others no strong peak at all. In contrast, the re-verse gene g7 is fairly similar in shape across all species, al-though the magnitude of changes varies, which is evident inthe curve overlay plot at the bottom of the column. Our col-laborators state that these latter two detailed analyses wouldhave been difficult to perform using standard heatmaps: forinstance, the different magnitudes of g7 data would requireinspection at many saturation levels to identify similaritiesacross species. Additionally, they would have been unlikelyto even probe for these specific behaviors due to the effortneeded to produce targeted, customized heatmaps for smallsets of genes.

In this same analysis, they quickly identified the ex-tremely different behaviors in the s7 species for g5 and g6as the shape-based curvemap view made the trend immedi-ately salient. From this observation they have since identi-fied the cause of the behavior as a gene duplication event,

and they now plan to investigate the potential function of theduplicate gene.

9.5. Bidirectional EnzymesAnother new insight sparked by the use of Pathline is theobservation that the bidirectional enzymes all seem to havesimilar patterns, with just a few exceptions — these enzymesare selected in Figure 1. This trend had gone unnoticed us-ing the team’s previous visualization tools; visually encod-ing directionality of enzymes on the pathway view and sup-porting the direct comparison of this subset of enzymes in acurvemap view helped the biologists to notice this trend.

10. Conclusions and Future WorkWe have presented the design of Pathline, an interactive pro-totype tool for visualizing comparative functional genomicsdata across multiple pathways, multiple genes and metabo-lites, and multiple species. Its curvemap detail view is analternative to the color-based visual encoding of traditionalheatmaps that supports detailed analysis of the shapes oftime series curves across species and genes. The linearizedpathways view provides an overview of multiple aggregatesimilarity scores for each gene across multiple pathways. Wetook a user-centered design approach in developing Pathline,working closely with our biology collaborators to refine thetool’s design. These biologists used Pathline in their analysisprocess to confirm known findings and to generate new in-sights, and are using the tool to communicate these findings.

We believe that Pathline will be an effective visualizationtool for many other biological problems. We have alreadyidentified two other research groups as potential users ofPathline — one group studies how stem cells give rise tothe various types of blood cells in the body, while the othergroup is interested in abnormalities along cellular pathwaysthat are involved with cancer. Furthermore, the curvemapand linearized pathway visual encodings are applicable tothe much larger bioinformatics community that currently useheatmap and pathway visualization tools to explore a widerange of scientific topics. We hope the open-source releaseof Pathline will encourage a broader user base.

We would like to extend Pathline by allowing it to importmetabolic and cellular pathway information directly fromdatabases such as KEGG [KAG∗08] and linearize it auto-matically. A very exciting direction for future work wouldbe to add support for showing DNA sequence information,which could help researchers answer the deep question ofhow related genes have different gene activity patterns.

Acknowledgments

We thank our biology collaborators in the Regev lab for all oftheir time, enthusiasm, and support: Michelle Chan, Court-ney French, Jay Konieczka, Jenna Pfiffner, Aviv Regev, andDawn Thompson. This work was supported by the Data Vi-sualization Initiative at the Broad Institute, and by the Na-tional Science Foundation under Grant 0937060 to the Com-puting Research Association for the CIFellows Project.

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

M. Meyer et al. / Pathline

References

[AHP∗05] ALM E., HUANG K., PRICE M., KOCHE R., KELLERK., DUBCHAK I., ARKIN A.: The MicrobesOnline web site forcomparative genomics. Genome Res 15, 7 (July 2005), 1015–1022. 1, 6

[BMGK08] BARSKY A., MUNZNER T., GARDY J., KINCAIDR.: Cerebral: Visualizing multiple experimental conditions ona graph with biological context. IEEE Trans. Visualization andComputer Graphics (Proc. InfoVis 2008) 14, 6 (2008), 1253–1260. 1, 6

[BW09] BOURQUI R., WESTENBERG M. A.: Visualizing tem-poral dynamics at the genomic and metabolic level. In IV ’09:Proceedings of the 2009 13th International Conference Informa-tion Visualisation (Washington, DC, USA, 2009), IEEE Com-puter Society, pp. 317–322. 1, 6

[CFF∗08] CASPI R., FOERSTER H., FULCHER C., KAIPA P.,KRUMMENACKER M., LATENDRESSE M., PALEY S., RHEE S.,SHEARER A., TISSIER C., WALK T., ZHANG P., KARP P.: TheMetaCyc database of metabolic pathways and enzymes and theBioCyc collection of pathway/genome databases. Nucleic AcidsResearch 36, Database-Issue (2008), 623–631. 2

[CM84] CLEVELAND W. S., MCGILL R.: Graphical perception:Theory, experimentation, and application to the development ofgraphical methods. Journal of the American Statistical Associa-tion 79, 387 (1984), 531–554. 4, 5

[ESBB98] EISEN M. B., SPELLMAN P. T., BROWN P. O., BOT-STEIN D.: Cluster analysis and display of genome-wide expres-sion patterns. Proc. National Academy of Sciences 95, 25 (1998),14863–14868. 2, 4, 7

[HS01] HOCHHEISER H., SHNEIDERMAN B.: Interactive explo-ration of time series data. In Proc. Intl. Conf. on Discovery Sci-ence (DS) (2001), Springer-Verlag, pp. 441–446. 6

[JKS06] JUNKER B. H., KLUKAS C., SCHREIBER F.: VANTED:A system for advanced data analysis and visualization in the con-text of biological networks. BMC Bioinformatics 7 (2006), 109.1, 6

[KAG∗08] KANEHISA M., ARAKI M., GOTO S., HATTORI M.,HIRAKAWA M., ITOH M., KATAYAMA T., KAWASHIMA S.,OKUDA S., TOKIMATSU T., YAMANISHI Y.: KEGG for linkinggenomes to life and the environment. Nucleic Acids Research 36,Database-Issue (2008), 480–484. 9

[KK94] KEIM D. A., KRIEGEL H. P.: VisDB: Database explo-ration using multidimensional visualization. Computer Graphicsand Applications, IEEE 14, 5 (1994), 40–49. 3

[KLK∗09] KIM B., LEE B., KNOBLACH S., HOFFMAN E., SEOJ.: GeneShelf: A web-based visual interface for large gene ex-pression time-series data repositories. IEEE Trans. Visualizationand Computer Graphics (Proc. InfoVis 2009) 15, 6 (2009), 905–912. 5, 7

[KPR02] KARP P., PALEY S., ROMERO P.: The pathway toolssoftware. Bioinformatics 18 (2002), S225–S232. 1, 6

[LMK07] LAM H., MUNZNER T., KINCAID R.: Overview usein multiple visual information resolution interfaces. IEEE Trans.Visualization and Computer Graphics (Proc. InfoVis 2007) 13, 6(2007), 1278–1285. 4, 5

[LS10] LUBLING Y., SEGAL E.: Genomica, http://genomica.weizmann.ac.il/, last accessed 7 Apr 2010. 7

[LYKB08] LETUNIC I., YAMADA T., KANEHISA M., BORK P.:iPath: interactive exploration of biochemical pathways and net-works. Trends in biochemical sciences 33, 3 (March 2008), 101–103. 1, 6

[Mac86] MACKINLAY J. D.: Automating the Design of GraphicalPresentations of Relational Information. ACM Trans. on Graph-ics (TOG) 5, 2 (1986), 111–141. 5

[MMKN08] MCLACHLAN P., MUNZNER T., KOUTSOFIOS E.,NORTH S.: LiveRAC: interactive visual exploration of systemmanagement time-series data. In Proc. ACM Conf. on HumanFactors in Computing Systems (CHI) (2008), pp. 1483–1492. 6

[MMP09] MEYER M., MUNZNER T., PFISTER H.: MizBee: Amultiscale synteny browser. IEEE Trans. Visualization and Com-puter Graphics 15, 6 (2009), 897–904. 6

[MSH∗05] MLECNIK B., SCHEIDELER M., HACKL H.,HARTLER J., SANCHEZ-CABO F., TRAJANOSKI Z.: Path-wayExplorer: web service for visualizing high-throughputexpression data on biological pathways. Nucleic Acids Research33, Web-Server-Issue (2005), 633–637. 1, 6

[PW06] PLUMLEE M., WARE C.: Zooming versus multiple win-dow interfaces: Cognitive costs of visual comparisons. ACMTrans. on Computer-Human Interaction (ToCHI) 13, 2 (2006),179–209. 5

[RFM07] REAS C., FRY B., MAEDA J.: Processing: A Program-ming Handbook for Visual Designers and Artists. MIT Press,2007. 6

[Sal04] SALDANHA A. J.: Java Treeview – extensible visualiza-tion of microarray data. Bioinformatics 20, 17 (2004), 3246–3248. 2, 4, 7

[SDW09] SLINGSBY A., DYKES J., WOOD J.: Configuring hi-erarchical layouts to address research questions. IEEE Trans.Visualization and Computer Graphics (Proc. InfoVis 2009) 15, 6(2009), 977–984. 4

[SMO∗03] SHANNON P., MARKIEL A., OZIER O., BALIGA N.,WANG J., RAMAGE D., AMIN N., SCHWIKOWSKI B., IDEKERT.: Cytoscape: a software environment for integrated modelsof biomolecular interaction networks. Genome research 13, 11(November 2003), 2498–2504. 1, 6

[SND05] SARAIYA P., NORTH C., DUCA K.: An insight-basedmethodology for evaluating bioinformatics visualizations. IEEETrans. Visualization and Computer Graphics 11, 4 (2005), 443–456. 2, 7

[SS02] SEO J., SHNEIDERMAN B.: Interactively exploring hier-archical clustering results. Computer 35, 7 (2002), 80–86. 2, 6,7

[WF09] WILKINSON L., FRIENDLY M.: The history of the clus-ter heat map. The American Statistician 63, 2 (2009), 179–184.7

c© 2010 The Author(s)Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.